Hi! I went to see that movie The Great Gatsby
last night. I thought it was really good. I hope you don't mind if I finish off my martini.
Anyway, one of the constantly recurring themes in this course is the necessity of getting close
to your data and looking at it in every possible way. In this last lesson of the first class,
we’re going to look at visualizing your data. This is what we’re going to do. We’re going
to use the Visualize panel. I’m going to open the iris dataset. You came across the iris
dataset in one of the activities, I think. I’m using it because it has numeric attributes,
four numeric attributes: sepallength, sepalwidth, petallength, petalwidth. The class is one of
three kinds of iris flower: Iris-setosa, Iris-versicolor, and Iris-virginica.
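To fix ideas, here is the shape of that dataset as a minimal sketch in plain Python (this is not Weka; the two sample rows are just illustrative values):

```python
# A minimal sketch of the iris dataset's shape: 150 instances,
# four numeric attributes, and a class taking three values.
attributes = ["sepallength", "sepalwidth", "petallength", "petalwidth"]
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Two illustrative instances (attribute values, then the class).
instances = [
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (6.0, 3.4, 4.5, 1.6, "Iris-versicolor"),
]

# Group instances by class -- the same grouping the colored plot shows.
by_class = {}
for *values, label in instances:
    by_class.setdefault(label, []).append(values)
```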
Let’s go to the Visualize panel and visualize this data. There is a matrix of two dimensional
plots, a five-by-five matrix of plots. If I select one of these plots, I'm going
to be looking at a plot of sepalwidth on
the x-axis and petalwidth on the y-axis. That’s a plot of the data. The colors correspond
to the three classes. I can actually change the colors. If I don’t like those, I could
select another color, but I’m going to leave them the way they are. I can look at individual
data points by clicking on them. This is talking about instance number 86 with a sepallength
of 6, sepalwidth of 3.4, and so on. That’s a versicolor, which is why this spot is colored
red. We can look at individual instances. We can change the x- and y-axes using
the menus here. Better still, if we click on this little set of bars here, these represent
the attributes. I’m going to click on this and the x-axis will change to sepallength.
Here the x-axis is sepalwidth. Here the x-axis is petallength, and so on. If I right click,
then it will change the y-axis to sepallength. So, I can quickly browse around these different
plots. There is a Jitter slider. Sometimes, points sit right on top of each other, and
jitter just adds a little bit of randomness to the x- and the y-axis. With a little bit
of jitter on here, the darker spots represent multiple instances. If I click on one of those,
I can see that that point represents three separate instances, all of class Iris-setosa,
and they all have the same values of petallength and sepalwidth, both of which are being plotted on this graph: the sepalwidth and petallength are 3.0 and 1.4 for each of the three instances. If I click another one here,
this one represents two instances with very similar sepalwidths and petallengths, both of class Iris-versicolor. The jitter slider helps you distinguish between
points that are in fact very close together. Another thing we can do is select bits of
this dataset. I'm going to choose the Rectangle selection here. If I draw a rectangle now, I can select
these points. If I were to submit this rectangle, then all other points would be excluded and
just these points would appear on the graph, with the axes re-scaled appropriately. Here
we go. I’ve submitted that rectangle, and you can see that there’s just the red points
and green points there. I could save that if I wanted as a different dataset, or I could
reset it and maybe try another kind of selection like this, where I’m going to have some blue
points, some red and some green points and see what that looks like. This might be a
way of cleaning up outliers in your data: selecting rectangles and saving the result as a new dataset. That's visualizing the dataset itself. What
about visualizing the result of a classifier? Let's get rid of this Visualize panel and go
back to the Preprocess panel. I'm going to use a classifier. I'm going to use, guess
what, J48. Let's find it under trees. I'm going to run it. Then if I right click on
this entry here in the result list, I can visualize the classifier errors. Here we've got the class
plotted against the predicted class. The square boxes represent errors. I can, of course,
change the axes if I want, change the x-axis and the y-axis, but I'm going to stay
with class and predictedclass. If I click on one of these
boxes, I can see where the errors are. There are two instances where the predicted class
is versicolor and the actual class is virginica. We can see these in the confusion matrix.
The actual class is virginica, and the predicted class is versicolor, that’s ‘b’. This 2 entry
in the confusion matrix is represented by these 2 instances here. If I look at another point,
say this one. Here I’ve got one instance which is in fact a setosa predicted to be a versicolor.
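The correspondence between the error plot and the confusion matrix is just counting: each matrix entry is the number of instances with that actual class (row) and predicted class (column). A small Python sketch of the idea, using made-up label pairs rather than real J48 output:

```python
from collections import Counter

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Hypothetical (actual, predicted) pairs, NOT real J48 output:
# two virginicas predicted as versicolor, like the "2" entry above,
# plus one correctly classified setosa.
pairs = [
    ("Iris-virginica", "Iris-versicolor"),
    ("Iris-virginica", "Iris-versicolor"),
    ("Iris-setosa", "Iris-setosa"),
]

counts = Counter(pairs)
# Rows are the actual class, columns the predicted class,
# just as in Weka's confusion matrix.
matrix = [[counts[(actual, pred)] for pred in classes] for actual in classes]
```

Clicking a square box on the plot is just picking out the instances behind one off-diagonal count in this matrix.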
I can look at this plot and find out where the misclassifications are actually occurring,
the errors in the confusion matrix. Get down and dirty with your data and visualize
it. You can do all sorts of things. You can clean it up, detect outliers. You can look
at the classification errors. For example, there’s a filter that allows you to add the
classifications as a new attribute. Let’s just go and have a look at that. I’m going
to go and find a filter. We’re going to add an attribute. It’s supervised because it uses
a class. Add an attribute, and AddClassification. Here I get to choose in the configuration
panel, the machine learning scheme. I'm going to choose J48, of course, and I'm going to
set outputClassification to True. That's configured it, and I'm going to apply it.
It will add a new attribute. It’s done it, and this attribute is the classification according
to J48. Weka is very powerful. You can do all sorts of things with classifiers and filters.
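Conceptually, what that filter just did is simple: run the chosen classifier over every instance and append its prediction as one more attribute. A pure-Python sketch of the idea (toy_classifier here is a made-up stand-in rule, not J48, and the data rows are illustrative):

```python
def add_classification(instances, classify):
    """Append a classifier's prediction to each instance,
    mimicking what Weka's AddClassification filter does."""
    return [row + (classify(row),) for row in instances]

# A stand-in rule for illustration only; the real filter applies
# the configured scheme (J48 in the lesson), not this.
def toy_classifier(row):
    petallength = row[2]
    return "Iris-setosa" if petallength < 2.5 else "Iris-versicolor"

data = [
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (6.0, 3.4, 4.5, 1.6, "Iris-versicolor"),
]
augmented = add_classification(data, toy_classifier)
# Each row now carries a sixth value: the predicted class.
```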
That’s the end of the first class. There’s a section of the book on Visualization. Please
go and do the activity associated with this lesson, and I’ll see you in the next class. Bye!