Hey everybody, I'm Sam, and this is Entiversal. Today we're talking about machine learning, and in particular linear discriminant analysis, or as I'll refer to it from now on, LDA. We'll look at what LDA is, what it is trying to achieve, the different mathematical ideas it applies, and how it works. I won't be getting too much into the mathematics, even though I'll show it, because I really would like us to see how it works, why we're doing what we are doing, and what the effects of the different approaches are on our data. First of all, let me tell you that if you're looking for an overview of what machine learning and artificial intelligence are, how they work, and the different high-level ideas and methods that are used, you know, neurons, neural networks, I advise you to look at my previous videos on machine learning, which are in my technology and innovations playlist, linked in the top right of your screen in the card section, and so are the videos. There you can also see a very simple example with code and pretty pictures of a simple linear regression model, where I explained basically how machine learning works.
So just to remind ourselves, as we said in the previous videos, we looked at simple linear regression, and what linear regression does is it tries to map some kind of continuous input variable, or variables in a higher-dimensional space, to an output variable. So basically it tries to approximate some kind of function that maps input to output, and as you can see in the picture, the output is continuous, which is why it's called regression.
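Just to make that recap concrete, here is a minimal sketch of the linear regression idea in Python; the data and numbers below are made up purely for illustration and are not from the video:

```python
# Fit a straight line that maps a continuous input x to an output y.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)  # noisy line, made-up data

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit of a degree-1 polynomial
print(f"approximated function: y = {slope:.2f} * x + {intercept:.2f}")
```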
Now, for LDA, we apply it to a data set that has some kind of, most of the time small or just limited, number of output classes, and as you can see here in the pictures there are two classes represented. In the example that I'll show you in a moment there are also two classes, and in my next video I'll be doing LDA on the iris dataset, where we have three output classes.
And what LDA tries to do, as the name suggests, is to discriminate, to differentiate between the different classes. So what it does is it tries to find such vectors in the space of the system, as you can see here in the picture, that provide a better separation when the elements of the system are projected onto those new vectors, and we can see the separation they're providing: here is a bad vector, and here is a very good vector that provides a good separation for the system. Most of the time, initially, we're trying to find, for example here we have a two-dimensional problem, we try to find two vectors, right, and when the system is reordered according to those vectors, with them becoming the new axes of the system, we get a better separation. So if some of the classes were overlapping, we get them better separated in the space.
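To make the projection idea concrete, here is a tiny sketch: projecting 2D points onto a direction vector w turns each point into a single number, and it is on those numbers that the separation is judged. The points and the direction are arbitrary illustrative values:

```python
import numpy as np

points = np.array([[1.0, 2.0],
                   [1.5, 1.8],
                   [5.0, 6.0],
                   [5.5, 6.2]])          # four 2D points, two per "class"
w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)                # unit-length direction vector

projections = points @ w                 # one scalar per point
print(projections)
```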
But LDA actually has one other characteristic that makes it even more useful, and that is that it not only finds those good vectors, those good directions in the space, but it also gives us tools to grade them, to find out how useful they are, how much of the information they contain. And by doing that, we can choose only the vectors that are actually meaningful, right, that actually provide meaningful information, and so we can reduce the dimensionality of the problem. For example, in the picture that you can see here, only this vector, the best subspace, will be chosen, and that reduces the problem from two dimensions to one dimension. Here it is the same thing, only that vector will be chosen, and again from two dimensions, x1 and x2, the problem will be reduced to only one dimension, and here we can see a better representation of that. In the example later I'm doing the same thing, and in the next video on the iris data set the problem at the beginning is four-dimensional and I reduce it to two dimensions for better visual representation, and we will see that even then the information is enough to differentiate the classes.
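As a hedged sketch of that dimensionality-reduction step, assuming we already have candidate directions and a usefulness score for each (with LDA the score will be the eigenvalue, as discussed later), keeping only the top ones and projecting looks roughly like this; shapes and values are illustrative assumptions:

```python
import numpy as np

X = np.random.rand(100, 4)                       # 100 samples, 4 features
directions = np.random.rand(4, 4)                # one candidate direction per column
scores = np.array([3.2, 0.9, 0.05, 0.0])         # "usefulness" of each direction

k = 2                                            # how many dimensions to keep
top = np.argsort(scores)[::-1][:k]               # indices of the k best directions
W = directions[:, top]                           # 4 x k projection matrix

X_reduced = X @ W                                # 100 x k: problem reduced from 4D to kD
print(X_reduced.shape)
```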
And as you can imagine, that is just enormous, because it greatly reduces the amount of calculations in any derived systems. And you know, here we are making an obviously simple example, but in the real world we might have a system that depends on hundreds of features, on thousands of features, which means that it has thousands of dimensions, and we can reduce them, you know, by a factor of 10, of 100, of thousands, it depends, and that is just huge.
So let's get into how LDA actually works and finds those good vectors. OK, LDA uses a ratio that is called the Fisher ratio, and you can see it here. It basically tries to maximize the ratio of the difference between the means of the classes, which is on the top, and what that does is it tries to make those center points of the different classes, or distributions, as far away from each other as possible, which is very logical, right, we want them to be separated. The other thing that we need to pay attention to is the variances of the classes, which we can see on the bottom, and what that represents is how spread out each class is. So as you can see, we are trying to maximize that ratio, meaning that we find the best trade-off between means that are as far away as possible and classes that are as condensed as possible, which logically will give us the best separation.
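A minimal sketch of that ratio for a one-dimensional projection, with made-up numbers, could look like this: the squared difference of the class means divided by the sum of the class variances, so larger values mean "means far apart and classes condensed".

```python
import numpy as np

def fisher_ratio(proj_a, proj_b):
    """proj_a, proj_b: 1D arrays of projected values for the two classes."""
    mean_diff = proj_a.mean() - proj_b.mean()
    return mean_diff ** 2 / (proj_a.var() + proj_b.var())

# tiny made-up example: well-separated, condensed projections score high
a = np.array([0.9, 1.1, 1.0, 1.2])
b = np.array([4.8, 5.1, 5.0, 4.9])
print(fisher_ratio(a, b))
```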
And here again we can see the best vector, right, in the two dimensions, and in one dimension: when we reduce a higher-dimensional system to one dimension, we can actually see its histogram, which is something very useful, because it obviously shows us the separation between the red and the green classes, it also shows us the mean, as we can see here, where the highest point is, and it also shows us how condensed each class is, the variance of the classes.
OK, so we looked at what LDA is trying to achieve and the main ideas behind it, but how does it actually do it? What are the maths and the approaches that it uses, other than just some kind of abstract Fisher ratio? Well, here's my example, and I'll show you the code in a moment. As you can see, I have two classes, blue and green, and they're two-dimensional classes, so let's say this is x1 and this is x2, so we have two classes in a two-dimensional problem. I have called them Gaussian distributions because the classes are basically random numbers with given means, covariances, which are basically the variance of the classes, and of course counts. So here we can see class A, this is its mean, the same for B, and this is its covariance matrix, or, as we said, just the variance, how spread out the class is.
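A sketch of how two Gaussian classes like these could be generated with numpy; the exact means, covariances and counts used in the video aren't shown here, so these values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

mean_a, cov_a, n_a = [1.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], 100
mean_b, cov_b, n_b = [4.0, 4.0], [[1.0, -0.2], [-0.2, 1.0]], 100

class_a = rng.multivariate_normal(mean_a, cov_a, size=n_a)   # n_a points in 2D
class_b = rng.multivariate_normal(mean_b, cov_b, size=n_b)   # n_b points in 2D
print(class_a.shape, class_b.shape)
```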
And here I have eight weights, and by weights I mean just those vectors that we are trying to project our points onto, those directions. And here I have the histograms of my system projected onto those vectors, right, a very simple representation: here is the histogram against weight zero, and here we can see it, (0, 5), and down here I have how the vector actually looks, here is the 0 and here is the 5. It is quite simple, but when we apply it to our distribution, if we put it something like here, we can see that when we project our points onto it we get that histogram, and it is not a very good separation, right. So here I've just made eight arbitrary directions, trying to represent how well they separate the system, and we can see here we have better separation, here we have very bad separation, and we find out that actually with weight seven we have the best separation. But this is just arbitrary, right, I have just picked those directions at random. This plot represents the Fisher ratio, and as I said, our optimal vector will have the highest Fisher ratio, right, the biggest difference between the means and the least variance, and here we can see that with vector seven our Fisher ratio is almost 6, which is higher than any other point, which again makes sense, right, because we saw that with weight 7 we had the best separation.
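As a rough sketch of that experiment, here is how a handful of hand-picked directions could be graded with the Fisher ratio; the directions and data are illustrative, not the exact values from the video:

```python
import numpy as np

rng = np.random.default_rng(0)
class_a = rng.multivariate_normal([1, 1], np.eye(2), size=100)
class_b = rng.multivariate_normal([4, 4], np.eye(2), size=100)

weights = [np.array([0.0, 1.0]), np.array([1.0, 0.0]),
           np.array([1.0, 1.0]), np.array([1.0, -1.0])]   # candidate directions

for i, w in enumerate(weights):
    w = w / np.linalg.norm(w)               # work with unit directions
    pa, pb = class_a @ w, class_b @ w       # 1D projections of each class
    ratio = (pa.mean() - pb.mean()) ** 2 / (pa.var() + pb.var())
    print(f"weight {i}: Fisher ratio = {ratio:.2f}")
```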
And so, the actual mathematical approach used for finding the optimal vector, the direction in the space, is by using the scatter matrices of our distribution. And I want you to remember that the maths is just formulas, right, we can always look them up and just put them in there, but what matters is actually understanding what is happening and why we're doing what we're doing. The scatter matrices of our system, of our distribution, give us the eigen pairs, which are eigenvectors, right, and eigenvalues, and if you don't know what eigenvectors and eigenvalues are, I recommend looking them up on Wikipedia, there is a very good explanation. So here we can see our eigenvalues and our eigenvectors, and as I told you, LDA actually shows us how useful each vector is. Here we can see that we have a 0.0 eigenvalue, and the eigenvalue is actually what shows us how meaningful the vector is, the higher the eigenvalue, the more meaningful it is. So here we have 0.0, which means that this vector is basically useless, which means that we have the perfect separation on only one dimension, which is represented by this vector, by this direction. And here is just the optimal W vector, which is the same thing.
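A minimal sketch of that step, with S_B and S_W written out as small example matrices rather than computed from the data (that computation is shown further down): the scatter matrices give us eigen pairs, and the eigenvalue grades each eigenvector.

```python
import numpy as np

S_B = np.array([[4.0, 4.0],
                [4.0, 4.0]])        # between-class scatter (example values)
S_W = np.array([[2.0, 0.0],
                [0.0, 2.0]])        # within-class scatter (example values)

eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# sort eigen pairs by eigenvalue, biggest (most meaningful) first
order = np.argsort(eig_vals.real)[::-1]
for idx in order:
    print(f"eigenvalue {eig_vals[idx].real:.3f}  eigenvector {eig_vecs[:, idx].real}")
# an eigenvalue of (almost) zero means that direction carries essentially
# no discriminative information and can be dropped
```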
As you can see, here is a representation of the Gaussian distributions optimized for separation, right, this is the same system as before, only it's shifted, with our perfect vector being the axis, and here I took a screenshot of it, as you can see, and let me show you how it was in the beginning. So as you can see, it is the same distribution, only it's shifted a bit, right. And this is obviously in two dimensions, but if we had higher dimensions, for example five dimensions, we obviously cannot plot it, but what LDA does is it shifts, it changes our points in such a way that they have better separation. And here I have plotted the histogram against the optimal W, and as you can see, the peaks around the means are very high and we have the least variances, and that results in the best separation. Here I'll take a screenshot of it and I will show it against my best separation from the random directions: the density of the points around the means is higher, and even though, as you can see, the means are not as far away as here, the variances of the classes are much smaller, which again represents this best trade-off between the two things, and so grants us the best separation. And here it is, representing the optimal direction in the two-dimensional space, and as you can see, the Fisher ratio is over 6, so that's the maximum that we can achieve.
And you know, for our simple problem it is very easy to see that the optimal direction is something like what I'm doing at the moment with my mouse, you know, it is around the optimal vector, which is why I chose it here. But when we are looking at higher dimensions, as we said, in the real world with hundreds of dimensions, it is basically impossible to just approximate it, right, which is why LDA is very useful. This formula gives us the eigen pairs, which we later use to find our optimal direction, and you might ask, OK, but when do we apply the Fisher ratio? Well, the Fisher ratio is actually ingrained, it is part of this formula. S B is the between-class covariance matrix, which is basically trying to make the classes as far away from each other as possible, right, and S W is the total within-class covariance matrix, which is basically trying to make each class as condensed as possible. You can see here that for the between-class covariance matrix we use the means, and this is a transpose, and for the total within-class covariance matrix we use x n, which are the individual points of the system, and m 1 and m 2 here are the means of the classes, just as they are in S B. And later we are basically trying to find W, as this weight is called in machine learning, or just, you know, the optimal vector, the optimal direction, and we can find it by solving the eigenvalue problem represented here.
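As a hedged sketch of those two formulas for the two-class case, here is how S_B and S_W could be computed from the points and the eigenvalue problem solved; the data is illustrative, and names like class_a and w_opt are my own, not necessarily the ones used in the actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
class_a = rng.multivariate_normal([1, 1], np.eye(2), size=100)
class_b = rng.multivariate_normal([4, 4], np.eye(2), size=100)

m1, m2 = class_a.mean(axis=0), class_b.mean(axis=0)   # class means

# between-class scatter: S_B = (m2 - m1)(m2 - m1)^T
diff = (m2 - m1).reshape(-1, 1)
S_B = diff @ diff.T

# within-class scatter: S_W = sum over classes of sum_n (x_n - m_k)(x_n - m_k)^T
def class_scatter(points, mean):
    centred = points - mean
    return centred.T @ centred

S_W = class_scatter(class_a, m1) + class_scatter(class_b, m2)

# the optimal direction then comes from the eigen pairs of inv(S_W) @ S_B
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_opt = eig_vecs[:, np.argmax(eig_vals.real)].real
print("optimal direction:", w_opt)
```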
Now the piece of code, I'll just skim it very fast. Here, with this function, I am making the Gaussian distributions, as you can see, and here are my means and covariances. Here I actually use my functions to get my distributions, and here I plot them, which we saw, the plot down there. Later I group my two classes into separate lists so it's just easier for me to access them. Here I create the histograms for each of the classes against the different weights, and here I create the histograms that we saw, so this is our weights multiplied by our classes, and we are plotting the histograms. Here I'm finding the means and variances so I can calculate the Fisher ratio, which is represented by that, and here we can see the means and variances, and I plot it also. And then, after that, I start finding the optimal W. So here I have the means of the two classes, which are basically m 1 and m 2, and I find my between-class covariance matrix, so again, I'm using numpy, but it's just that little formula here, there is nothing to it, OK. And then I'm finding the total within-class covariance matrix, here I'm just creating my matrices, here is for A, here is for B, and here I get my eigenvalues and eigenvectors by solving the eigenvalue problem, and here I just arrange them into pairs so it's better looking, and basically that's it. After that I plot everything and find the optimal Fisher ratio, which we saw earlier.
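A sketch of what that final plotting step could look like: project both classes onto the optimal direction and draw their histograms. matplotlib is assumed, the data is illustrative, and w_opt here is a stand-in value rather than the computed optimum:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
class_a = rng.multivariate_normal([1, 1], np.eye(2), size=100)
class_b = rng.multivariate_normal([4, 4], np.eye(2), size=100)
w_opt = np.array([1.0, 1.0]) / np.sqrt(2)     # stand-in for the computed direction

proj_a, proj_b = class_a @ w_opt, class_b @ w_opt
plt.hist(proj_a, bins=20, alpha=0.6, label="class A")
plt.hist(proj_b, bins=20, alpha=0.6, label="class B")
plt.xlabel("projection onto w_opt")
plt.legend()
plt.show()
```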
I know that there is a lot more code, but I'll look at it in my next videos, where I'll show you LDA for the iris dataset, which is a four-dimensional problem with three classes, and later I'll also show you logistic regression, which is something very interesting and kind of sits in between linear regression and linear discriminant analysis. I hope that this video was useful to you, I hope that we are getting a better understanding of what LDA is and how it works, and if you have any questions, please ask them down in the comments, I'll be happy to try to answer them for you if I can. And you know, this is Entiversal, so stay tuned, I have a lot of interesting technology videos, a lot of financial education, a lot of entertainment, so look around, it's all for you. See you next time.