So in this lecture I shall discuss a few topics

which are very much related to pattern recognition techniques and which are in some senses they

are becoming part of the pattern recognition techniques. The topics are data condensation this is one

of the topic sand feature clustering which is again atopic in pattern recognition. But it is now being vastly used these techniques

have been are being vastly used in many other related subjects like machine learning and

there is one another thing about which I would like to talk about that is data visualization

which is very much in some way necessary to understand

the behavior of the data. But there are some problems and some troubles

in visualizing the data in higher dimensions so about which I would like to discuss little

bit. So initially let me talk about data condensation,

data condensation as you can see on the screen. Suppose there are the data set size is huge

now dealing with all the points in the data set that might be resulting in too many computations,

so one may want to reduce the size of the data set and there are also other reasons

the data sets often contain run and data. So what we would like to do is that we would

like to replace a larger tacit by a small subset of representative patterns, so that

the performance of pattern recognition algorithms when trained under reduced set should be comparable

to that obtained when trained on the entire data set. Now in this one performance of pattern generation

algorithms when trained on the reduced set that is fine I am that is understandable in

the sense that we can have some classification techniques or if we do not have any data that

is labeled data if you have unlabeled data then what we might do is that we might have

some clustering to be done on the entire data set. So instead of doing run the entire data set

we would like to do it on a reduced data set, okay. And so the third point as it is mentioned

here this is fine but then replacing a larger tacit by a small subset of representative

patterns that is the main a there are a few points which need to be discussed here, one

is what is the size of the reduced set, what is the size of the reduced set if our total

data set has let us just say 10,000 points then we can reduce it by half that is we can

have 5000 points in the reduced set. We can have 2,000 points that you set thousand

points the reduced set 500 points 100 points 50 points maybe two points in the reduced

set. So number one what is the size of the reduced

set and number two when we are going on reducing the sizes naturally when we apply the usual

pattern recognition techniques on the reduced set and we would like to have the same performance

as we can get when we use the original data set. Now when the reduction is small when the performance

would be expected to be very close to the actual performance, when the reduction becomes

more and more that means from 10,000 when we come to let us just say 1000 or 500 then

the performance may not be as good as what we expect because we have naturally reduced

too many, too many points we have naturally reduced too many points so and when you want

to make it from one 10,000 to just two points are so then it is going to be simply asked

reduction. Now we have already done some such thing like

this earlier 10,000 points reducing it to two points we have done it in clustering when

we have done k-means clustering and if you take k=2 and you divide the data set by doing

two means clustering and ultimately when the algorithm converges you will get two means

these two means we can take them to be in some way representative of the whole data

set. So we have done this thing already earlier

but we have not used this terminology as data condensation at that time we called it as

clustering we did not use the word data condensation naturally those two means they are in some

sense representing the whole data set. Now

so actually on one hand clustering is one end of this data condensation clustering is

one end of this data condensation and the another end is the entire data set whatever

data set that you have. And in between you have several different

stages where we are reducing the size of the datasets in between you have several different

shades where we are reducing the size of the data set. Now here there is another term that comes

into the mind that is scale, if 1000 points are reduced to 1 over 10,000 points are reduced

to two points then the scale is the ratio of 5,000 to 1, 1000 to 2 is same as 5,000

to 1. Whereas from 10,000 if you come to 1000 it

is 10:1 here it is 5000:1 and here it is 10:1 the same data set when we represent it by

a different scales now you would actually the same data set when it is represented by

different scales it looks in some sense different, let me just give you some examples. Look at this, this one when the scale becomes

coarser and coarser and the very last stage it might look like a like an ellipse because

the finer details here number 1 they are missing okay, and number 2 they are also getting slowly

and slowly joined you see this is your thing and it has some of these details they are

missing here and some more are missing and here some more are missing and when it comes

to this it has become like this and ultimately it has become this. It is like there is another example we are

standing here let us just say and you look at the sky during nighttimes we see too many

stars and there are some patterns that you see there and let us just say we start going

towards them we start going towards them and initially we have seen too many stars and

when you start going towards them naturally some starts they start getting faded away

and the one to which we are going nearer the details are going to become more and more

clear, right. And the closer you go there the finer representation

you are going to get and it may so happen that whatever patterns that you are seeing

from here quite a bit of maybe lost when you go nearer to them, I hope you are understanding

what I am trying to say. So the same data when you are representing

it in different scales probably different details you may be missing this is one point

that one needs to take into consideration and there are in fact a few other issues which

are related to this. Let me just tell you one of them I do not

know whether you people have used ink pens when you are in your childhood or writing

on the paper have you done it and probably so now how did you fill the tube probably

you might have taken a filler and you would have put it in the ink taken the ink like

this and put it here, right. Now you are supposed to fill up this whole

tube but then probably most of the times you have you are never completely filled it up

why, because you would see the last bubble that you are going to put you know how much

volume atleast you guess how much volume it might occupy. If you think that it is going to overflow

then you would not put it there right, are you understanding what I am trying to say

so it is confined to the size of I mean one I should say one drop of ink the amount of

volume that may occupy. So everything is limited to that size you

cannot reduce it further than that, similarly when you are doing digitization there also

to certain degree you will do the distillation. So everything is limited to that you are not

in a position to make it go beyond that okay, everything is limited to that size you are

not in a position to make it go beyond that, so with respect to that scale of digitization

there with respect to the volume of the drop of the ink when you are filling up the tube. Here also there is a scale at which you are

looking at the details so with respect to that scale there are some things which may

be changing but you would like to keep many things as they are. So totally here the problem is what is known

as multi scale data condensation at different scales you are going to reduce the sizes of

the data, what scale is needed that is the user who would say the scale that is needed

and once that scale is given you are supposed to reduce the data set to that size, so this

is the basic problem. Now there are in fact a few methods already

existing in the literature for this one of the methods is by Astrahan, what he did was

in order to for k-means algorithm he wanted to get the initial seed points Astrahan for

k-means algorithm he wanted to get the initial seed points. So what he did was now select to radii d1

and d2 and for every point in the data set find the number of other points lying within

d1 distance of that, basically for each point you consider a radius of d1 and then you find

how many points are lying in that data set within in that radius that means basically

you are measuring the density of every point, basically you are measuring the density. Now what you are going to do is that, you

are going to find that point having the highest density and once you find that point you retain

that point and you discard all points from the data set lying I mean within a distance

d2 from the selected point and repeat till the set is you go on doing it if you are trying

to get only K such points you just do it till such k such points are there and if you do

not have any such restriction you go on doing it till the whole data set is exhausted. What Astrahan was done it was he went on doing

it till he has got k-means or k representative points, so he used it for extracting the initial

seed points for k-means algorithm. So this method has been it is already existing

Astrahan’s method, but there is a small problem with this you have to select those

to radiate d1 and d2 and the results are going to change if you change the values of d1 and

d2 and there are some more methods. You have random sampling with replacement

and without replacement, now what happens is that sampling methods are fine if you are

really getting representatives I mean points from all over the data set. But if you are not getting fine if you are

not getting points from all over the dataset then what happens is that the set is not exactly

a representative of the whole thing. So you have these troubles with random sampling

methods with replacement or without replacement. And you also have similar problems with stratified

sampling you also have similar problems with stratified sampling. And there are some more methods I will not

going to all these details about these methods I will just tell you a way of doing it one

can surely use a Astrahan’s method and I will just tell a way of doing and my point

is that that is not the only way in which you can do it one can always improve up on

these methods to get better and better results. There are this learning vector quantization

and there are many other such methods. So this method that I am just going to tell

you it is a slight variant of the what is known as the k-nearest neighbor decision rule

which we have already discussed in one of the earlier classes in this lecture series. Basically what w eare going to do here is

you select an integer k, k actually we will tell you this it is that multi scale data

condensation it tells you the scale so once the scale the k is given for every point find

its distance to the kth nearest neighbor, okay. You take a point in the data set and finally

its distance to the k-nearest neighbor and if the point is xi you call the distance as

ri now select the point having the lowest value of ri select the point having the lowest

value of ri the lowest value of ri means it has something like the density is quite high

in a small radius you have many points lying so the density is quite high, then what we

do is that remove all points lying within 2ri of the selected point. Two times the radius in two times the radius

which ever point is there you just remove and repeat steps 2 and 4 till the data set

is exhausted. So you go on doing it till the data set is exhausted. This is one of the results that is obtained

when the data set I mean these are the representative points and these are discovered by this method. And there are several evaluation criteria

that are used you have something called Kullback Liebler information number okay, that is g2x

lng1(x)/g2(x) and you have log likelihood ratio that in some way gives you the distance

between two distributions and with these things you can try to evaluate the performance of

the algorithms and the last method that has said it is found to provide good results on

a few infect it is found to for provide good results on many data sets okay. And compared to several other algorithms also

compared to several other algorithms this is found to provide good results, but the

point that I want to make is that always there is a scope for improvement and my aim here

is to try to tell you what the problem is instead of trying to say what solutions are

there which is better which is worse that is not actually my aim is to just tell you

what the problem is and what are the main issues around it. And so this is one problem that this is one

problem very about which I wanted to tell you and there is another one that is clustering

of features. We have done already clustering of points

now if you want to do clustering of features basically you need to have some sort of a

distance measure are a dissimilarity measured you need to have between features a distance

measure are a dissimilarity measure you need to have between features. Now there is already a similarity measure

between features which is correlation coefficient there is already a similarity measure between

features which is correlation coefficient but the main problem with that is that. If you shift the data by some angle the whole

data by some angle ? then you would like to have the correlation value to be the same

but it is not going to remain the same at least one cannot guarantee that it will remain

the same. So this is one of the troubles and there are

a few I mean this is one of the main problems with the correlation coefficient. So basically what one would like to do is

that you would like to have some sort of a measure which even when the data is shifted

by some angle then it should not really change its value. And there is one way in which one can suggest

this one the way is one of the ways is for those two variables under consideration you

please look at it the covariance matrix for these two variables the covariance matrix

would look like. For the two variables the covariance matrix

would look like s11, s12, s12 and s22, now find eigen values and eigen vectors of this

matrix let us call the eigen values as ?1 and ?2 and ?1 is greater than ?2. Now if the variables are related let us just

say linearly then what is going to happen to the value of ?2, ?2 will be 0. If the variables are related linearly that

means there is a direct relationship between the values then this matrix will be of rank

1 since this matrix will be of rank 1 the determinant is 0 since the determinant is

0 the product of the eigen values should be 0, since this is a non-negative definite matrix

both the eigen values have to be positive I mean greater than or equal to 0, so the

second eigen value has to be equal to 0. So if the variables are linearly related then

the second eigen value is equal to 0, so somehow you can take the second eigen value to represent

the distance between two variables which is what has been done here. So once you take the second eigen value whether

you will take it in the absolute terms are you will write ?2/?1+?2 that is relatively

that you can have many ways of doing this thing either you can take just ?2 or you can

just take ?2/?1+?2 so that you can always make the distance between two variables lying

between 0 and 1 lying between 0. And in fact what is the maximum value of this

is the maximum value of ?2/?1+?2, ? 2 is if for example this cannot be more than half

can this be more than half the maximum value of ?2 is ?1right, when ?2=?1so this is ?2/

?+?2 this is 1/2 okay, then the ? 1/2 run down it is have so the maximum value of this

is it cannot be more than half. So then here we are trying to put everything

in between 0 and 1/2 by taking ?2/?1+?2. But again this is a way of defining dissimilarity

between two variables now once you define this thing then you can always do a clustering

of variable features you can always do a clustering of features which is I mean similar to the

way we have reduced the size of the number of the way we have done the data condensation

in the same way you can do freaker condensation, you take a point. And find its kth find a feature which lies

find its kth nearest neighbor find its kth nearest neighbor means see how chosen the

value of K let us just say k=10 then for this point you are going to find all features from

this point and you are going to find the distances and you are going to arrange those distances

in increasing order, you are going to arrange those distances in increasing order and you

will take the 10th such distance for the distance for which our variable which is going to provide

that that is the 10th nearest neighbor of this variable. So you can do all these things and you can

develop clustering algorithms using this sort of a similarity measure. Now this is only one such similarity measure

one can always look at other similarity measures and in fact it is very much needed that and

we need many other similarity measures which measure the similarity between two variables

are two features in one of my earlier talks I was mentioning about dependency and independency. We know the definition of independency of

two random variables, so we say that two random variables are dependent if they are not independent

but how much dependent they are we do not know. Similarly the distance between two random

variables which is what I am trying to tell you here it is not exactly a metric whatever

measure that was defined one can show that it is not a metric. I do not know whether I have given the definition

of metric corner did I give you okay, I think let me just define what a metric is. Let us just say x is a non-empty set and d

is a function, d is a function which is defined from XxX to 0 to 8 and d is said to be metric

on x, if this word metric is taken from mathematics

this gives you the definition of distance in a mathematical way. The first property is that the distance it

cannot be less than 0 the distance value it cannot be less than 0 it has to be greater

than or equal to 0 second one is that distance between x and y it should be same as distance

between y and x. And this is if distance between two points

that is equal to 0 then this those two points how to be same and distance between the same

points has to be equal to 0 and this last one is the triangular inequality distance

between x and y plus distance between y and z is greater than or equal distance between

x and z for all xyz belonging to x. Now the distance measure that is proposed

here it is not a metric that can be easily seen. Now the question is can we define a metric

between two features and if we can define one such metric does it really satisfy the

inclusion of our inclusion of distance between two features number one how to define this

a metric between two features okay, and the metric should be the definition of the metric

that is given there it should be satisfactory to all the persons who are working in this

thing in this field so there are two aspects. So these are all the problems on which people

have been working, so here also by defining a metric we are trying to define some sort

of a we are trying to measure the relationship between those two variables so by using this

definition that is ?2 or ?2/?1+?2 some works have already been done and there are in some

cases some improvements are also made on this one by modifying a few concepts. And

it is very much necessary to devise some more feature clustering algorithms in different

domains such as web mining domain and there is another domain that is bioinformatics where

in these two domains the number of the sizes of the vectors the feature vectors that is

huge and the number of points they become very small but the number of dimensions they

become really large. Now there is a third aspect on which I would

like to discuss little bit the third aspect is data visualization, when we are given points

in two dimensional plane. You created in two dimensional Euclidean space

so these are the points that are given to us we understand that this point is closer

to this then the distance between these two points is less than the distance between these

two points similarly the distance between these two points is less than the distance

between this and this. When we look at the data we understand these

distances very well we see which points are closer which points are far apart we understand

it, okay. And we somehow try to get the shape of the

data set for example or something like this we know that it is circular we somehow try

to get the shape of the data set in two dimensions when we plot like this we understand the meaning

of shape may not be always but many times we understand the meaning of distances between

points which two points are closer which two points are further apart and which distances

are smaller than which distances are larger than which distances all these things we understand

them nicely. But the moment the data is in three dimensions

are more than 3,4,5,6 since we are not in a position to plot we are not in a position

to visualize the data properly, so one of the problems that on which people have tried

to work up on is how do you visualize the data of in multiple dimensions, multi-dimensional

data in such a way that we understand those distances and there some distance is smaller

than other distance some distance is larger than other distances so these properties we

understand them well. So in order we need to represent the data

set in such a way that the property is regarding their distances are understood well by us

that is the main issue of data visualization. This is in essence connected with cohesive

self organizing map, I am not sure how many of you are ever of this cohesive self organizing

lab are you aware of it they are also the issue is the same mapped into one dimension

are two dimensions, right so that you would like to preserve the topology, topology preserving

do you know anything about it, okay. The issues are in some way related to cohesive

self-organizing map in the sense that in that map from any dimension you would like to reduce

the data set to two or three dimensions that is what Conan tried to do there and his aim

was to preserve the topology as much as possible, what is the meaning of preserving the topology

the meaning of preserving the topology is whatever I am mentioning regarding these distances

that is what he wanted to do that is if. That is if you have points say x1 to xn then

you want to define a map from this to another y1 to yn okay, so that f(xi)=yi now let us

just say this excise are in let us just say some10 dimensional space okay, and let us

just say this is yiR in two-dimensional space, xiR in 10 dimensional space yiR in two-dimensional

space. So from 10 dimensions you would like to have

a map from 10 dimensions two dimension, so that you would like to preserve the topology,

you would like to preserve the topology means if let us just say if distance between x1

and x2 is say less than distance between x3 and x4 then distance between y1 and y2 also

should be less than distance between y3 and y4. If

this is your map and suppose between x1 and x2 and x3 and x4 this relationship is satisfied

then correspondingly you want the similar relationship to be satisfied here. Now if there are n points there are totally

NC2 pairs that means NC2 distances now if you want to compare one distance with another

distance then two distances that is NC2 right, n into n-1/ 2, c2 so those many pairs you

have those many inequalities or equalities you have. And correspondingly you should have the same

equalities or inequalities to be satisfied in among this yiR’s also that is what you

want. Now there is a small twist here the twist

is you can show mathematically that it is always not possible to get it one can easily

provide contradictions where the policy preserving maps are non-existent,

you can easily provide contradictions in fact you can take your own data set I just want

you to take you take unit cube in three dimensions. How many points are there nine times 327 points

I am not unit cube so +1 and -1 so 2,2×2 so you have totally 27 points are there from

000 on each side you go +1 or -1. So you are going to get 27 points correspondingly

you try to get 27 points in two dimensions having the same equality or inequalities you

cannot get it you can try to do it you cannot get it, are you satisfied with this 27 points

you are not understanding right how the 27 has come. You see each one you have three ways of doing

it -1,0 and 1 each one you have three ways so 3x3x3, 27 right, so there are 27 points

you try to get 27 points corresponding to this in two dimensions having I mean between

any two vertices there are 27 vertices you can measure their distances and you can measure

those inequalities and corresponding inequalities you should get in r2 and you must choose 27

such points sighs having the same inequalities you try to do it I can assure you that you

cannot do it. You simply cannot do it, what Conan tried

to do was how much clothes you can go there, how much clothes you can go to that optimally

that is what he tried to do and he gave an algorithm with that you have got the self-organizing

feature map before Cohen came into the picture people tried it in too many ways some people

try to give a tree structure for I mean looking at the some people try to give a tree structure

and there are some people who just gave some faces. There is a famous article and Chernoff ‘s

faces, thesame Chernoff about which I was talking about Chernoff bounds there is a very

famous article on Chernoff faces you just go through Google on Chernoff face they are

also they try to represent data in multiple dimensions using phase. So like that you will find several different

works, but yes, I said on one hand for some things it is impossible I told you very clearly

but how close you can go to that which is what Conan try to do. And probably what coherent right to do people

can improve up on that but how to improve up on that and how to do that it is a matter

of further research, basically if topology preserving maps are always not possible there

are theorems and proofs corresponding to this statement if you go through some books on

topology you are going to get the theorems and proofs for the statement that I made. Topology is basically a subject in mathematics,

it is a subject in pure mathematics from higher dimension to lower dimension topology preserving

maps they are always not possible that this thing was proved by people long back. But data visualization is extremely important

data visualization is extremely and extremely important and okay, something is not possible

that is fine but to how much extent you can visualize are you in a position to develop

techniques whereto the extent that one can visualize these techniques will make us visualize. This is also another issue on which I mean

so I mean what is going on and especially when you have simply very high-dimensional

data present nowadays both in data mining literature and in bioinformatics literature

as a web mining where you are dealing with web pages. So data visualization has become really and

really important, I am stopping it here if you have any questions please ask me, please. So instead of data visualization like if we

just find out a manifold of the data is that an alternative. I do not want to say that it is not an alternative

I surely do not want to say that okay. Manifolds are useful in many situations I

am I should really agree that, I surely agree to that and if you can really have some idea

if you can really get some idea about the data using these manifolds it is fine there

are I mean I am always for it. But how do you generate this manifold it depends

on that depends on that very much okay, the generation of manifolds yes, I know that in

face recognition literature people have developed manifolds for I mean for changing the faces

by using manifolds they are doing many things and face recognition that I have seen those

papers. But it is not exactly data visualization you

see they have constructed a manifold for some specific purposes that is absolutely fine

and yes, if you can visualize or if you can get something more about the data than what

you know already using manifolds it is fine for me I simply do not have any problems with

that and I encourage that. It is an alternative or not I do not want

to use the word alternative you can use the word alternative when you are able to get

good success with that then you can call it alternative, but I do not know whether you

can achieve the success if you achieve the success value can call it an alternative I

do not have any problem with that okay, any other question. So can we do something like PCA and idea to

visualize means reduce the dimension and then try to visualize the data. No, I mean you can always reduce the dimension

and by reducing the dimension you are losing that information

many times you are losing the information the information that you are losing it may

be very small amount of information but you are surely losing the information yes, I mean

you can I am not saying that you should not do it because now the question is what is

the alternative. I do not see any other way of visualization

so when you are trying to do it in one way I cannot say that you should not do it if

I do not give you an alternative and I am not in a position to give you an alternative,

so you can always do that but that is not the end of it because you must understand

that you are losing the information whenever you are doing any sort of PCA, because after

all some Eigen values are greater than 0 and you are not taking them. You can say that the amount of information

lost is small that is a different matter but you are still losing the information whereas

here what we want to say is that we want to preserve those distances are you in a position

to preserve those distances when you are doing this PCA idea sort of thing are only the PCA

sort of stuff are you in a position to preserve those distances for this you cannot make any

guarantee you cannot give me any guarantee to the effect. So when you cannot give me any guarantee so

then you really do not know whether you are able to keep track of the distances or not,

so I mean you can always do that but you need to understand the limitations this is what

my point is okay, any other question, thank you.