Welcome. This video is designed to help

you choose a graph to visualize data. We’ll focus on some of the more common graph

types listed on this slide: barchart, mosaic plot, histogram, density plot, boxplot, scatterplot,

and line graph. Our goal with this video is to help you conceptually

choose from among these graphs. We will not go into the “anatomy” of each

graph. But if you need a quick refresher, there are many existing resources online.

One resource is The Data Visualization Catalogue, whose URL appears on this slide. And we will not go into details of how to

construct these graphs, which depends on the software you’re using. If you’re curious,

we used the software R and the package ggplot2 to create the graphs in these examples.

Choosing a graph type can be boiled down to asking two questions about your data. How many variables do I have? 1 or 2? And what type(s) of variables are they? Categorical, or quantitative (also known as continuous or numeric)?

Now be careful. These questions might sound simple, even trivial, but they can be subtle when your research question is not well-stated, or when you’re not careful to think about your data structure.

Here is an overview table of how to choose a graph type based on your answers to those two questions. If you have one variable – in which case you’re interested in its distribution – and the variable is categorical, then you could use a barchart to visualize the distribution. And if it’s quantitative, then you could use a histogram, a density plot, or a boxplot. If you have two variables – in which case

you’re interested in their relationship – and both are categorical, then you could use a segmented barchart or a mosaic plot. If one is quantitative and the other is categorical,

then use side-by-side boxplots. And if both are quantitative, then use a scatterplot. In the rest of this video, we’ll see examples

of each of these and also preview what you can do when you want to visualize more than 2 variables. For our examples we’ll use data from Gapminder, which you saw in the disciplinary context video with Hans Rosling. It contains data on various economic and social indicators from countries around the world, and has some fascinating visualization tools ready for you to use. You can also download data,

which we did. We’ll focus on a sample of 137 countries.

And we’ll consider 4 variables for each country: Major religion, defined as the religion adhered to by more than 50% of the country’s population, excluding atheists and agnostics (the possible religions in this sample are Christianity, Eastern religions, and Islam); Number of cell phones owned per 100 people (so that a value of 100 suggests that, on average, each person in that country owns 1 cell phone); Economic status, either OECD/developed or G77/developing; and Average life expectancy in years. Let’s take a moment to think about the structure of our data: We’re describing characteristics of countries in the world. So each row or observation in our dataset is a country, and each column corresponds to one of the 4 variables of interest. Note that major religion and economic status

are both categorical; whereas number of cell phones per 100 people and average life expectancy are both quantitative. We’ll now see examples of choosing graphs

for different questions about our data. Suppose we wanted to know the distribution of major religions across countries in the world. Here we have one variable and it’s categorical,

so we use a barchart. From this barchart, we see that Christianity

is the most popular major religion in these 137 countries (in fact, it’s the major religion in 88 countries), and Eastern religions are the least popular (being the major religion in only 11 countries). Now suppose we wish to know the distribution

of number of cell phones per 100 people. Here we have one variable and it’s quantitative,

so we can use a histogram. The distribution of number of cell phones

per 100 people looks fairly symmetric and is centered around 100. That is, in many countries there is, on average, about one cell phone per person, but there are a few countries with, on average, more than 1 cell phone per person, and also a few countries with fewer cell phones than people.

Instead of a histogram, we could also use

a density plot to visualize the distribution of number of cell phones per 100 people. This might be a new graph type for some of you, but essentially, as you can see, it’s a “smoothed” version of the histogram. Finally, there’s a third graph type that

we can use to visualize the distribution of a quantitative variable – a boxplot. From this boxplot, we see the median is at

100 cell phones per 100 people. This suggests that, in at least half of the countries, people on average own at least one cell phone. This slide just shows all three graph types

for a single quantitative variable – the histogram, density plot, and boxplot. All three are appropriate choices for showing the distribution of a single quantitative variable. The histogram and density plot show the shape of the distribution very clearly, whereas the boxplot helps us

focus on the key features of the distribution. Earlier we explored the distribution of major

religions across the world. But what if we wanted to look at this distribution by economic status? That is, how do G77 countries compare with OECD countries? Now we have two variables (major religion

and economic status) and both are categorical, so we can use a segmented barchart. We see Christianity is more popular in developed (OECD) countries than in developing (G77) countries, whereas Islam is much more popular in developing (G77) countries than in developed (OECD) countries. Eastern religions, shown in the green bar, are the least popular major religion in both groups of countries. Instead of a segmented barchart, we could also use a mosaic plot. In a mosaic plot, the width of each bar is proportional to the size of its group. The G77 bar is wider than the OECD

bar, meaning there are more G77 countries than OECD countries in this dataset. Earlier we explored the distribution of number of cell phones per 100 people. But what if again we wanted to explore this distribution by economic status? Again, how does the distribution compare between G77 and OECD countries? Now we still have two variables, but one is quantitative and the other is categorical, so we can use side-by-side boxplots, one boxplot of number of cell phones per 100 people, for each group of countries. We see that there is much more variability in number of cell phones per 100 people in G77 countries than in OECD countries. In fact, there are some G77 countries with as

few as 2.5 cell phones per 100 people and, perhaps surprisingly, some with as many as 204 cell phones per 100 people! Also, OECD countries tend to have more cell phones per 100 people than G77 countries. In addition to side-by-side boxplots, we could also make two histograms, one for G77 countries, and one for OECD countries. This works fine when you only have 2 groups, as we see here. But what if we had 4 groups? Or even 10 groups? Then it becomes harder to show all of the histograms on the same plot and still be able to see each individual histogram. And further, comparing multiple boxplots is much easier than comparing multiple histograms. So generally, side-by-side boxplots are preferred for showing multiple distributions. What if we wanted to show how the number of cell phones per 100 people and average life expectancy are related? Here we have two variables and both are quantitative,

so we can use a scatterplot. From this scatterplot of number of cell phones

per 100 people vs. average life expectancy, we see a generally positive, linear relationship.

That is, as average life expectancy increases, so does the number of cell phones per 100

people. This is probably not too surprising; just be careful not to conclude causation

here! Finally, what if we wanted to see how average life expectancy changes over time in the United States? Here again we have two quantitative variables,

but one represents time. Then we can use a line graph. This line graph shows life expectancy in the

United States in each year from 1800 to 2015. We notice a period of constant life expectancy

around 40 years in the early 1800s, followed by a steady increase starting around 1875, when life expectancy was 40 years, and continuing until now, when life expectancy is around 80 years. There are also two interesting dips in average life expectancy, in the 1860s and the 1920s. This concludes our overview of choosing a graph type when you have 1 or 2 variables. Before we end the video, let’s briefly explore

some ways to visualize more than two variables. One simple way is through the use of color. So for example, we just explored the relationship between number of cell phones per 100 people and average life expectancy. But what if we wanted to look at this relationship for different regions in the world, a third variable? In our scatterplot before, each dot was black. But we could introduce different colors for each of the different regions of the world, as this scatterplot shows. Or instead of different colors, we could use

different plotting symbols or shapes for each region. Or there’s even something called “faceting,” which basically makes a separate graph for

each value of the “third” variable. In this case, we have 6 scatterplots, one for

each of the 6 regions of the world. We see that sub-Saharan countries and some

South Asian countries tend to have lower life expectancies and fewer cell phones per 100 people than countries in other regions in the world. It might be interesting to add economic status to this graph. We’ll let you think about how you might do this. This concludes our brief review of choosing

a graph type. In summary, we covered some of the more common graphs you might encounter: The barchart, mosaic plot, histogram,

density plot, boxplot, scatterplot, and line graph. To choose from among these graphs, we ask

two questions about our data: How many variables do I have?

And what types of variables are they? We also previewed what you might do if you want to visualize more than two variables. With this basic toolkit, you can begin to visually explore your data and gain insights and possibly surprising findings. But this is only the tip of the iceberg for the world of data visualization. We encourage you to explore these topics on your own. The resources listed here are great places to become inspired.
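
The two questions from this video can also be written down as a small lookup. Here is a minimal sketch, in Python rather than the R and ggplot2 we used for the graphs, that returns the graph types suggested in the overview table. The function name `choose_graph`, its parameters, and the column names in the usage example are hypothetical, chosen only for illustration.

```python
from typing import List, Optional

CATEGORICAL = "categorical"
QUANTITATIVE = "quantitative"
_VALID = (CATEGORICAL, QUANTITATIVE)


def choose_graph(first: str, second: Optional[str] = None,
                 one_is_time: bool = False) -> List[str]:
    """Suggest graph types for one or two variables (hypothetical helper)."""
    if first not in _VALID or (second is not None and second not in _VALID):
        raise ValueError("variable types must be 'categorical' or 'quantitative'")

    # One variable: we are interested in its distribution.
    if second is None:
        if first == CATEGORICAL:
            return ["barchart"]
        return ["histogram", "density plot", "boxplot"]

    # Two variables: we are interested in their relationship.
    kinds = {first, second}
    if kinds == {CATEGORICAL}:
        return ["segmented barchart", "mosaic plot"]
    if kinds == {CATEGORICAL, QUANTITATIVE}:
        return ["side-by-side boxplots"]
    # Both quantitative: a line graph if one of them represents time.
    return ["line graph"] if one_is_time else ["scatterplot"]


# Usage with the four variables from our examples (column names are assumed).
variables = {
    "major_religion": CATEGORICAL,
    "cell_phones_per_100": QUANTITATIVE,
    "economic_status": CATEGORICAL,
    "life_expectancy": QUANTITATIVE,
}
print(choose_graph(variables["major_religion"]))
# ['barchart']
print(choose_graph(variables["cell_phones_per_100"], variables["economic_status"]))
# ['side-by-side boxplots']
print(choose_graph(variables["cell_phones_per_100"], variables["life_expectancy"]))
# ['scatterplot']
```

The sketch only covers the 1 or 2 variable cases from the overview table. A third variable, shown through color, plotting symbols, or faceting, is layered on top of whichever graph the lookup suggests.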