Tune in to the Data Games on April 16th!
Welcome to Week 3 of Boot Camp! This week is all about data in the world. What questions do you have about the world around you? How can you collect different "features" of data to help you answer those questions?
You all did an amazing job creating and submitting your word clouds last week! Here are a few of the submissions from other Young Data Scientists. Click an image to make it bigger and see what your fellow Young Data Scientists created!
Hey facilitators! Check out this guide before you walk through the lesson with your students. There is no setup required this week for the activity.
YDSQ Bootcamp Week 3: Facilitator Guide
If you are a student going through the lesson on your own, you can follow along with the steps below!
Part 1: Introducing Data Features
Imagine you are a movie director and you want to choose a cast that represents you and your classroom...
What would you look for when choosing actors?
Take a moment to discuss as a group.
You may have said words like gender, race, age, height, or many more. In a dataset, we call these features.
By considering these features, you are already forming an ideal dataset in your mind to answer your question.
Think back to the activity from Week 1's lesson when you built a histogram as a group...
...in this activity, skin tone was the feature you were exploring!
Part 2: Exploring Multiple Features
What if you wanted to know about more than just one feature in a dataset?
We have a dataset available all about celebrities and the different features associated with them: career, net worth, gender, age, height, and skin tone.
What if we wanted to explore the skin tone and age of celebrities? How might we visualize it?
One thing you could do is make a histogram for each feature, like we did in Week 1. Each histogram tells us the number of celebrities on the list that fall into each skin tone or age bucket.
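If you are curious how the counting behind a histogram works, here is a minimal Python sketch. The rows, the names like "Celebrity A", the skin tone labels, and the 10-year age buckets are all made up for illustration; they are not the real celebrity dataset.

```python
from collections import Counter

# Made-up example rows (NOT the real celebrity dataset).
celebrities = [
    {"name": "Celebrity A", "age": 24, "skin_tone": "light"},
    {"name": "Celebrity B", "age": 37, "skin_tone": "medium"},
    {"name": "Celebrity C", "age": 41, "skin_tone": "dark"},
    {"name": "Celebrity D", "age": 29, "skin_tone": "medium"},
    {"name": "Celebrity E", "age": 52, "skin_tone": "light"},
    {"name": "Celebrity F", "age": 33, "skin_tone": "medium"},
]


def age_bucket(age, width=10):
    """Return a label such as '20-29' for the bucket an age falls into."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


# One histogram per feature: count how many celebrities land in each bucket.
skin_tone_counts = Counter(row["skin_tone"] for row in celebrities)
age_counts = Counter(age_bucket(row["age"]) for row in celebrities)

# Print simple text histograms, one '#' per celebrity.
for title, counts in [("Skin tone", skin_tone_counts), ("Age", age_counts)]:
    print(title)
    for bucket in sorted(counts):
        print(f"  {bucket:<8} {'#' * counts[bucket]}")
```

Notice that each histogram only looks at one feature at a time, so it cannot tell us how the two features relate to each other.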
As a group, what are some insights you can identify from the two histograms above?
What if you wanted to find insights about how the skin tone of these celebrities is related to their age?
We'll need a different kind of visual. Let's try making a histogram combo chart!
Fig 4. Histogram combo chart showing Average Celebrity Age by Skin Tone
On the y-axis, we have the average age instead of just the frequencies.
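Here is a small Python sketch of how the bar heights in a chart like this could be calculated. It assumes made-up (skin tone, age) pairs, not the real celebrity dataset, and the group labels are hypothetical.

```python
from collections import defaultdict

# Made-up (skin tone, age) pairs, NOT the real celebrity dataset.
rows = [
    ("light", 24), ("medium", 37), ("dark", 41),
    ("medium", 29), ("light", 52), ("medium", 33),
]

# Collect every age under its skin tone group.
ages_by_tone = defaultdict(list)
for tone, age in rows:
    ages_by_tone[tone].append(age)

# The bar height for each group is the average age, not the count.
average_age = {tone: sum(ages) / len(ages) for tone, ages in ages_by_tone.items()}

for tone, avg in average_age.items():
    print(f"{tone}: average age {avg:.1f}")
```

Each bar now combines two features: the group comes from Skin Tone and the height comes from Age.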
Are there any new insights you can make now that Skin Tone and Age are both included on the same visual?
Part 3: What is Multivariate Data?
The example in Part 2 uses what is known as multivariate data!
Data is multivariate when we look at the relationships between two or more features. Click through the slides below to learn more!
Part 4: Visualizing Multivariate Data
In Part 2, we visualized multivariate data by making a histogram combo chart (Figure 4), but there are many different types of visuals you could use.
To decide which type of visual to use, consider the features you want to explore. There are two main types of features...
Numerical Features
Numerical Features are measured as numbers, so their values can differ slightly from one unit of data to the next.
An Example of a Numerical Feature would be...
RGB Values (e.g. Skin Tones)
RGB Values have slightly different sets of numbers for each data point in the dataset.
Categorical Features
Categorical Features assign each unit of data to one of a set of distinct labelled groups.
An Example of a Categorical Feature would be...
States or Provinces
Each state is defined and drawn with distinct geographical boundaries.
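To see the difference in code, here is a rough Python sketch that guesses a feature's type from its values. The function name `feature_type` and the sample values are made up for illustration, and the rule it uses (all numbers means numerical) is a simplifying assumption.

```python
def feature_type(values):
    """Guess whether a feature is numerical or categorical from its values."""

    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    return "numerical" if all(is_number(v) for v in values) else "categorical"


# Made-up example values for two features.
heights_cm = [170.2, 165.0, 181.5]
states = ["Texas", "Ohio", "Texas"]

print(feature_type(heights_cm))  # numerical
print(feature_type(states))      # categorical
```

This is only a first guess. Some features are stored as numbers but really act as labels, so it is worth checking what each feature actually means.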
A Scatter Plot is a great option for visualizing numerical features.
A Heatmap is a great option for visualizing categorical features.
Data Story: In the past nineteen years, there have been 228 issues of Vogue, with a total of 262 female cover models. This graph shows where these women fall on the skin tone spectrum.
Part 5: It's Your Turn to Explore!
Now it's your turn to find your own insights about this Celebrity dataset with the embedded report below! Just scroll down and start following the hints to interact with the report.
Don't worry yet about figuring out how to build your own reports and visuals. There is a project next week that will show you how!
If you would like a quick tutorial on how to start interacting with Data Studio on your own, check out this Data Studio Guide.
Part 6: Correlation vs. Causation
We found a dataset that shows the number of shark attacks increases as the number of ice cream sales increases...
...does that mean that shark attacks are caused by ice cream sales?
What do you think?
Nope! The relationship between these two features, shark attacks and ice cream sales, is an example of correlation.
Just because two features have a relationship, like ice cream sales and shark attacks do, does not mean that one directly caused the other.
What's a third data feature that might be causing both shark attacks and ice cream sales to increase?
Discuss what you think with your class...
The weather!
When the temperature gets hotter, more people go swimming and eat cold treats to cool off.
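The Python sketch below shows how a correlation can be measured. It uses the Pearson correlation, a common score where values near +1 mean two features rise together. The monthly numbers are made up for illustration only; they are not the dataset mentioned above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: near +1 means the two lists rise together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up monthly numbers for illustration only (not real data).
temperature_c = [12, 15, 19, 24, 28, 31]
ice_cream_sales = [110, 135, 180, 240, 290, 330]
shark_attacks = [1, 1, 2, 4, 5, 6]

print("ice cream vs sharks:     ", round(pearson(ice_cream_sales, shark_attacks), 2))
print("temperature vs ice cream:", round(pearson(temperature_c, ice_cream_sales), 2))
print("temperature vs sharks:   ", round(pearson(temperature_c, shark_attacks), 2))
```

All three scores come out close to +1. The score alone cannot tell us which feature is the cause. We only spot the weather as the likely cause by thinking about what is really happening in the world.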
Part 7: Key Takeaways
Collecting the right features in your dataset is just as important as asking the right question.
Visualizing one feature is interesting, but finding the relationships between multiple features can lead to more impactful insights!
Correlation does not equal causation. Just because there is a relationship between two features doesn't mean one feature caused the other.
Work in groups or individually to complete the lab below!
https://colab.research.google.com/drive/1VLzyUN4jviuO5zWCqqolZtJEsvPlcnzD?usp=sharing
As a group, think about the topics, data sources, and storytelling opportunities you have been brainstorming together over the last few weeks. This week, use your newfound knowledge of data features to:
Spend some time brainstorming a few features that would help your team get to know your topic(s) better. Make a list to send to your mentors.
Compare this list to the data sources you collected in week 2 and the stories you think you're interested in telling:
Do the datasets you've found already have all the features you want?
If not, are there other data sets available that do have those features?
Use these questions to start narrowing your topics to the one that you not only feel passionate about, but that also has the existing datasets and features to allow you to tell your story!
Connect with your mentor(s) on what you find! They may be able to help identify datasets with the features you need or find relationships between multiple datasets.