Welcome to track five. In this tutorial, we’re going to explore

ideas around distributions and sampling. We’re gonna keep working

with continuous data. We’re going to use the data set of the

Academy Award winners for Best Actor and Best Actress. We’re going to be looking

at their age at award. We’re gonna be exploring

those distributions. We’re going to introduce a new tool today. So we’re going to be looking at

a visualization tool that was created right here at UC Davis. And it’s called SeeIt. For this part of the tutorial, we’re just gonna be focusing on one

variable, just that age at award. We have a list of all of the people who have

won the best actor in a leading role and the best actress in a leading role Oscars. So that’s an entire population. That’s every member of that population. We’re going to be exploring some

ideas around sampling today. So our driving question today is really

about how closely does a sample represent that population and what influences whether that

sample represents that population. Let’s go get the data. The data’s stored in Google Docs. Here’s the data set. Let’s use Cmd+A to select all,

Cmd+C to copy and then we’re going to open up a new Excel

workbook and paste the data there. So we want a new blank workbook. And Cmd+V to paste. So there’s our data. I’m gonna practice my good habit

of preserving the raw data, so I’m gonna immediately

just make a copy of it. I’m gonna come down to insert a sheet, and Cmd+V again to paste it again. I’m gonna name sheet one raw data. I’ll name sheet two visualization. Now we’re gonna open the tool SeeIt. It’s available on the web,

it’s free, it’s open. There’s a link in the tutorial. There’s a link on the website. So let’s follow that link and open SeeIt. SeeIt’s a great tool for

exploring the data. For really looking at how it looks, and

to practice our visual reasoning skills. As you can see there are some data

sets already loaded into SeeIt. But we’re gonna load our own

because we want those ages at which the actors and actresses won

the best actor and best actress awards. So we’re gonna come down to the bottom left

and click the plus sign to add a data set. So let’s go get the data. I’m gonna pop back to Excel and

I’m gonna click on the entire column H to highlight it, so

I clicked the gray H at the top, Cmd+C to copy it. You can see the marching ants are active. That means it’s ready to be copied. I’m gonna switch back to SeeIt. On this page I see that I need to enter

a title and that I need to Enter Data and then that Enter Data Set text is already

highlighted, so that’s where I’m active at the moment. So if I just press Cmd+V,

it’s going to go ahead and paste right there. So it’s pasted that column of data

right into that dialogue box. Let’s give it a title. Now do we need to say that the first

column is a label? No, we’ve only got one column, so

the first column isn’t a label. Those are the only options that we’ve

got on this page so let’s come down and click the load worksheet from form. Now it’s giving us a warning that

it’s only going to be accessible for this one browser session, that we’re not

really adding it to the SeeIt databases. We’re just uploading it for right now. That’s great. And at the bottom of the data sets you can

now see that we’ve got Academy Awards. Let’s look at our data. I’m gonna click on that triangle next to

the title we gave this, Academy Awards. I see my one variable, age in

years is listed. There’s a pencil, a trash can. If I click on the pencil icon,

it pops up the dataset. Let me just check that it’s there. So let’s see it. We’ve got this work space over here on

the right that’s at the moment called empty graph. Well, let’s fill it. I’m going to click and drag on the title of my variable and

drop it into that graph. And what’s it done? What have we got there? There’s a little blob for

every member of that population, every member of that dataset. It’s allowing us to

visualize that distribution. Age in years is a continuous variable, it’s measured in numbers

along a continuum. Let’s put our x axis down there. We’ve got our,

all of the ages of each individual ranging from the youngest through

the majority up to the oldest. And you can see each one of those is

representing a particular member of our data set. To explore this data, to look at

some of the descriptive statistics, we’re going to come up to the top

left corner, and click on the wrench. And it’s asking us,

do you want to group your data, and let’s start by saying

there’s four equal groups, quartiles. So our N, our population size, is 176,

you can see that’s written down there, so it’s divided that by four, and

it’s decided that it needs 44 people, 44 observations, in each quartile. And so it’s showing us how much of

the scale is needed in order to contain a quarter of the population,

a quarter of the observations. What you should be seeing here is,

you should be asking yourself, do these look symmetrical? Do they look even? Does it take the same amount of space

regardless of where you are on the scale? Or are some of them

highly concentrated, and so they get a quarter of the observations with a

very small amount of the scale while some of them are really dilute, some of them

are really spread out and you need a whole big piece of the scale in order to

get a quarter of the observations. Let’s look at some more of

our descriptive statistics. We’re gonna go up to the wrench. We can add some of these

measures of central tendency. Let’s start with the mean and the median. So we’ve got a red line

that’s telling us the mean. If the word Mean and the number didn’t

pop up above that, go ahead and click on that square and that turns

it on and off, same for the Median. So these are two of the measures

of central tendency. They’re telling us the number around

which all of the other numbers cluster. You can see that the median is right there

on that central line because the median is the number that is halfway

through the observations. So half of the observations, 88, are below the median, and half of the observations,

88, are above the median. The mean is a little bit

higher on the scale, as it gets dragged towards these outliers. SeeIt’s gonna let us create frequency

distributions and histograms as well. Let’s take a look at that. I’m gonna come up here to where it says

advanced mode and it’s kind of gray, so it doesn’t encourage you to click on it,

but you can go ahead and click on it. And now it’s red and

you’re in advanced mode. Let’s go back under that wrench and see what other options we

have in advanced mode. Let’s explore this fixed interval width. So, what this is telling

us is it’s going to create roughly a histogram where it’s going

to have a fixed interval, a regular amount of the scale, and then it’s

going to count how many observations are in each of those fixed intervals. Let’s reduce the interval width,

let’s pop it down to five. I’m going to go ahead and

set the minimum to 20 so that those are at nice even intervals. And I want to see,

I want to see these numbers, not just as numbers, but as height. So I’m going to choose

to show the histogram. Here we have a histogram

of this population where we have set it to fixed intervals

of five, starting at 20 and moving up, and SeeIt has calculated how many

observations fall into each of those groups and then has drawn us this

histogram to display that data. What happens if we set

the interval even lower? How about if we set it to one? So if we set it to one. We’ve got that fixed inter of one so

it’s 20, 21, 21 to 22, 22 to 23, 23 to 24. And so it’s a lot less summarized,

it’s a lot more detailed. This is partly up to

you as to what level of summarization expresses

the detail appropriately. The larger the interval

width that you choose, the fewer the bins,

the more clumped together the data. We have our measures of central tendency. We’re seeing our distribution,

our histogram. Do you remember the other

descriptive statistic that we need? We need a measure of variability, right? Let’s go back and

set our groups back to four equal. So if you recall, that means these numbers will

be equal, there will be an equal number of observations in each group, but

the groups won’t have the same intervals. So we’re now looking at

the difference in the intervals. And back under that wrench I’m looking for

a measure of variability. We’re calculating standard deviation, so I’m seeing this abbreviation,

SD, here, and I bet you that’s what we’re looking for. Let’s click on the word box,

or the option for box plot. Did you see what it did? Let me go back to four equal. So four equal and box plot. Jump back and

forth a couple times between those and see if you can

explain what it’s doing. Where is the box in

the middle being drawn? What are the edges of that box? They are exactly around the middle half of the data. So if we divide it into four equal groups,

we’ve got a quarter down here, and then a half in the middle here,

and then a quarter up here. When we draw a box plot, we draw a box

around the middle half of the data, and then we extend these whiskers down to

the minimum and up to the maximum. Now with a box plot you’re looking

at whether or not it is symmetrical, at whether this whisker is

as long as this whisker. And you can see in this case

it’s clearly not symmetrical. One thing that’s a little bit

counterintuitive is that a long whisker doesn’t mean more observations, it means sparser ones. So a quarter of

the observations are up here, but they’re spread out all the way

along that arm, along that whisker. A quarter of the observations are down

here and they’re tightly concentrated in that whisker. I go back up to the wrench,

I’ve chosen my box plot but I want to add standard deviation. And so it’s now calculating for

us the standard deviation and it’s got it there at 11.1.
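
If you’d like to check these kinds of numbers outside of SeeIt, here is a minimal Python sketch of the same descriptive statistics: the mean, the median, the quartile cut points behind the box plot, the standard deviation, and the counts for a fixed interval width. The ages in it are made up for illustration and are not the Academy Award data, and the function names describe and fixed_width_counts are hypothetical. I’m assuming a population standard deviation, because we have every member of the population. SeeIt may use a different quartile or standard deviation formula, so its numbers could differ slightly.

```python
import statistics

# Hypothetical ages, made up for illustration. NOT the real Academy Award data.
ages = [21, 25, 26, 29, 30, 31, 33, 34, 35, 37,
        38, 41, 42, 45, 49, 60, 62, 74, 76, 80]


def describe(values):
    """Return the descriptive statistics discussed in the tutorial."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points between four equal groups.
    # Assumption: Python's default "exclusive" method; other tools may differ.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],    # end of the lower whisker
        "q1": q1,             # lower edge of the box
        "q3": q3,             # upper edge of the box
        "max": ordered[-1],   # end of the upper whisker
        # Population SD (assumption): divides by n, not n - 1.
        "sd": statistics.pstdev(ordered),
    }


def fixed_width_counts(values, start, width):
    """Count observations in bins [start, start + width), and so on up."""
    counts = {}
    for v in values:
        # Lower edge of the bin that contains v.
        lower = start + ((v - start) // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


summary = describe(ages)
counts = fixed_width_counts(ages, start=20, width=5)

for name, value in summary.items():
    print(name, value)
for lower, count in counts.items():
    print(f"{lower} to {lower + 5}: {count}")
```

With the real column of ages pasted in place of the sample list, fixed_width_counts(ages, start=20, width=5) would mirror the histogram settings we chose earlier, and changing width to 1 gives the more detailed version.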