Hey guys, welcome back to the sixth module of this data science course. In today’s session, we will start by understanding what a confusion matrix is. Then we’ll look at some of the performance metrics which can be obtained from the confusion matrix, following which we will understand the concept of thresholding. And finally, we’ll implement the confusion matrix in R. So let’s get started.

Simply put, a confusion matrix helps you describe the performance of a classification model. To get a confusion matrix, all we have to do is create a table of the actual values and the predicted values. The confusion matrix itself is quite simple to understand, but the related terminology can be a bit difficult. We have different terms like true positive, false positive, true negative and false negative, so let us understand all of those terms through this example.

Let’s say there’s a dataset of all the patients in a hospital, and we build a logistic regression model on top of it to predict whether a patient has cancer or not. Now there could be four possibilities, so let’s look at all four. Let’s start with true positives. These are the cases in which the actual
value is true and the predicted value is also true; that is, the patient has been diagnosed with cancer and the model also predicted that the patient has cancer. Next we have true negatives, and these are the cases in which the actual value is false and the predicted value is also false. That is, the patient actually doesn’t have cancer, and the model also predicted that the patient doesn’t have cancer. Then we have false positives. These are the cases in which the predicted value is true but the actual value is false. That is, the model predicted that the patient has cancer, but in reality he doesn’t. This is also known as a Type I error. Finally we have false negatives, and these are the cases in which the actual value is true but the predicted value is false, i.e. the model predicted that the patient does not have cancer, but in reality he does.

Now consider the real-life implications of this. If there’s a patient who has cancer but you have incorrectly diagnosed that he doesn’t have cancer, this is really dangerous, isn’t it? That is why we need to reduce the number of false negatives. So those were all of the terms in the confusion matrix. Now we’ll look at some of the performance metrics.
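As a small illustration of the four outcomes described above, here is a minimal R sketch that cross-tabulates actual and predicted values. The vectors `actual` and `predicted` are made-up toy data, not the hospital example from the session, and the variable names are only illustrative.

```r
# Toy data (hypothetical): TRUE means "has cancer", FALSE means "does not".
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Fixing the factor levels keeps both rows and both columns in the table,
# even if one class never occurs.
cm <- table(
  Actual    = factor(actual,    levels = c(FALSE, TRUE)),
  Predicted = factor(predicted, levels = c(FALSE, TRUE))
)
print(cm)

tp <- cm["TRUE", "TRUE"]    # actual TRUE,  predicted TRUE
tn <- cm["FALSE", "FALSE"]  # actual FALSE, predicted FALSE
fp <- cm["FALSE", "TRUE"]   # actual FALSE, predicted TRUE (Type I error)
fn <- cm["TRUE", "FALSE"]   # actual TRUE,  predicted FALSE
```

The correct predictions sit on the diagonal of the table, and the two kinds of error sit off it.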
We will start with accuracy. We can get the accuracy by dividing the sum of the true positives and true negatives by all of the values, so dividing this left diagonal over here by all of the values gives us the accuracy. Now let’s take this example. Over here we have 100 true positives and 47 true negatives. We’ll add them up and divide the result by all of the values to get the accuracy, and that gives an accuracy of about 0.73.

The next performance metric is precision, and this helps us get the proportion of positive identifications which were actually correct. We can get the precision of the model by dividing the true positives by the sum of true positives and false positives. Over here we have 100 true positives and 30 false positives, so we’ll get the precision by dividing 100 by 130.
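The metric formulas in this part of the session can be sketched in R as follows. The counts are the ones quoted in the session (100 true positives, 47 true negatives, 30 false positives and 25 false negatives); recall, which the session covers right after precision, is included for completeness. Treating all four counts as one confusion matrix is an assumption, and the variable names are illustrative.

```r
# Counts quoted in the session, assumed to belong to one confusion matrix.
tp <- 100
tn <- 47
fp <- 30
fn <- 25

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 147 / 202
precision <- tp / (tp + fp)                   # 100 / 130
recall    <- tp / (tp + fn)                   # 100 / 125

# Rounded to three decimals: accuracy 0.728, precision 0.769, recall 0.8
print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```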
So the precision value is about 0.77, and the next performance metric is recall. This helps us get the proportion of actual positives which were identified correctly, and we can get recall by dividing the true positives by the sum of true positives and false negatives. Here we have 100 true positives and 25 false negatives, so we can get the recall value by dividing 100 by 125, and we get a recall value of 0.8. So those were the performance metrics.

Now we’ll understand why we need thresholding. When we build a logistic regression model, it returns a probability and not a direct result. So in order to map a logistic regression value to a binary category, we must define a classification threshold: any value above the threshold is taken to be true, and any value below the threshold is taken to be false. Let’s understand the concept of threshold with this example. Over here we are trying to find out whether it’ll rain or not on the basis of some other factors, and the threshold value which we have taken is 0.65. In the first case, the logistic regression model gives a probability of 0.73 for rain, and since it is greater than the threshold value of 0.65, it’ll be classified as yes. In the second case, the logistic regression model gives a probability of 0.59 for rain, and since it is less than the threshold value, it will be classified as no. So that was the concept of threshold.

Now it’s time to implement the confusion matrix in R, so let’s go to RStudio. Right. So we have RStudio right in front of us.
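The thresholding rule from the rain example can be written as a short R sketch. The two probabilities and the 0.65 threshold are the ones from the session; the variable names are illustrative, and treating a probability exactly equal to the threshold as "No" is an assumption, since the session does not cover that case.

```r
threshold <- 0.65
rain_prob <- c(first_case = 0.73, second_case = 0.59)

# Probabilities above the threshold map to "Yes", all others to "No".
rain_class <- ifelse(rain_prob > threshold, "Yes", "No")
print(rain_class)
```

Raising or lowering `threshold` changes which cases count as "Yes", which is why the same model can produce different confusion matrices.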
And this is our same old customer churn dataset. Right. Now, to build a logistic regression model, we already know that we need to use the glm function. Before that, again, we need to divide the entire dataset into train and test sets. For that we require the sample.split function, which is a part of the caTools package, so let me load that. I have loaded the caTools package, so now I can use the sample.split function. Right. This time I want to understand how the churn column varies with respect to the monthly charges column, so that is why I’ll be splitting with respect to the churn column. The split ratio which I will take would be 0.65, and I will store this in split_log.

Now I can divide the dataset by using the subset function. The first parameter is the entire dataset. Now I’ll use split_log, and wherever the value is true, I’ll select all of those observations and store them in the train set. Similarly, wherever the value of split_log is false, I’ll select all of those observations from the customer churn dataset and store them in test. Right. Now again, let me have a glance at the number of rows of train and test. nrow of train gives me 4578, and nrow of test gives me a value of 2465. Right. We have our training and testing sets ready.

Now it’s time to build a model. So I’ll use the glm function, and the first parameter
would be the formula. OK, so let me give the formula over here. It will be churn, which is the dependent variable. I’ll give it a tilde symbol (~), and the independent variable would be monthly charges. So this is the formula, guys. Next I need to give the dataset on which we are building the model, so data would be train. And finally we have the family. The family over here would be binomial, since we are building a logistic regression model. And I will store this in mod_log. OK, so we have built the model.

Now it’s time to predict the values, so I’ll use the predict function. The first parameter is the model which we’ve just built. Next we have the new dataset, which is nothing but our test set. And finally we have the type of prediction. Since we want the probabilities, I’ll give response here, and I will store this in result_log. There is a spelling mistake over here, so this needs to be response. Let me view result_log. So these are the predicted probabilities over here, guys. Let’s have a look at the range, so it’ll be range of result_log. Let me see what I get. Right. So this is the range of probabilities: it is from 14 percent to 44 percent. Right. So now I have got the range of probabilities, and now I can find the confusion matrix.
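The steps so far (split, fit, predict) might look roughly like the sketch below. The customer churn dataset is not included here, so the code simulates a stand-in data frame; the column names `Churn` and `MonthlyCharges`, the simulated coefficients, the object name `customer_churn` and the seed are all assumptions for illustration. It calls `caTools::sample.split` when that package is installed and otherwise falls back to a plain random split, which is a swapped-in substitute rather than what the session does.

```r
set.seed(1)

# Simulated stand-in for the customer churn data (hypothetical values).
n <- 2000
monthly <- runif(n, min = 20, max = 120)
churn_prob <- plogis(-2.5 + 0.02 * monthly)
customer_churn <- data.frame(
  MonthlyCharges = monthly,
  Churn = factor(ifelse(runif(n) < churn_prob, "Yes", "No"),
                 levels = c("No", "Yes"))
)

# Split roughly 65% / 35% into train and test.
if (requireNamespace("caTools", quietly = TRUE)) {
  # Splits with respect to the Churn column, as in the session.
  split_log <- caTools::sample.split(customer_churn$Churn, SplitRatio = 0.65)
} else {
  # Fallback (assumption): simple random split, not stratified by Churn.
  split_log <- runif(n) < 0.65
}
train <- subset(customer_churn, split_log == TRUE)
test  <- subset(customer_churn, split_log == FALSE)

# Logistic regression: Churn as a function of MonthlyCharges.
mod_log <- glm(Churn ~ MonthlyCharges, data = train, family = "binomial")

# type = "response" returns probabilities for the second level ("Yes").
result_log <- predict(mod_log, newdata = test, type = "response")
print(range(result_log))
```

Because the data are simulated, the row counts and the probability range will not match the figures quoted in the session.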
We’ve already learned how to build the confusion matrix: all we need to do is build a table of the actual values and the predicted values. All right. The actual values we get from the test set, so the churn column from the test set gives my actual values, and the predicted values are stored in result_log. Since these are actually probabilities, I now need to give a threshold value, so let’s say the threshold value which I give is 0.3. Anywhere the probability is greater than 0.3 will be given true, and anywhere the probability is less than 0.3 will be given false, and this will give me my confusion matrix. So this is our confusion matrix, guys. Now let me check the accuracy.
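A minimal sketch of the table step follows, using made-up labels and probabilities in place of the test set and the stored predictions. The helper name `confusion_at` is hypothetical. Wrapping the thresholded predictions in `factor()` with fixed levels is an addition not shown in the session: it keeps both columns in the table even when no probability exceeds the threshold.

```r
# Hypothetical stand-ins for the test set's churn column and the predictions.
actual <- factor(c("No", "No", "Yes", "No", "Yes", "Yes", "No", "No"),
                 levels = c("No", "Yes"))
prob <- c(0.15, 0.32, 0.41, 0.22, 0.28, 0.44, 0.35, 0.18)

# Build a confusion matrix for a given threshold.
confusion_at <- function(actual, prob, threshold) {
  predicted <- factor(prob > threshold, levels = c(FALSE, TRUE))
  table(Actual = actual, Predicted = predicted)
}

cm_low  <- confusion_at(actual, prob, 0.3)
cm_high <- confusion_at(actual, prob, 0.5)  # no toy probability is above 0.5
print(cm_low)
print(cm_high)
```

Without the fixed levels, a threshold above every probability would produce a table with a single column, and indexing the diagonal would then fail.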
For that, I need to divide the left diagonal by all of the values, so 1167 plus 339, divided by 1167 plus 339 plus 644 plus 315. Let me see what I get. Right. So the accuracy of the model we built is 0.61. Now what I’ll do is change the threshold value and make it 0.5. This time let me check the accuracy again. That’ll be 1459 plus 175, which are the left diagonal, divided by all of the values, which will be 1459 plus 175 plus 352 plus 479. This time the accuracy is 66 percent, guys. Okay. So when the threshold value is 0.3, the accuracy which we get is 61 percent, and when we take the threshold value to be 0.5, the accuracy which we get is 66 percent. So this is how we can work with the confusion matrix. And this brings us to the end of the session. Thanks for attending, and let us meet in the next class.
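To check the arithmetic from the session, the sketch below recomputes both accuracies from the counts read out by the instructor. Some of the counts for the 0.3 threshold are unclear in the transcript; 339 and 315 are the readings that make the total match the 2465 test rows, and they are treated as an assumption here. Which off-diagonal count is the false positives and which is the false negatives is not stated either, so their placement is also assumed; it does not affect accuracy. The helper name `accuracy_from` is illustrative.

```r
# Accuracy = sum of the diagonal divided by the sum of all cells.
accuracy_from <- function(cm) sum(diag(cm)) / sum(cm)

# Rows: actual; columns: predicted. Off-diagonal placement is assumed.
cm_first <- matrix(c(1167, 644,
                      315, 339), nrow = 2, byrow = TRUE)   # threshold 0.3
cm_second <- matrix(c(1459, 352,
                       479, 175), nrow = 2, byrow = TRUE)  # second threshold tried

print(round(accuracy_from(cm_first), 2))   # 0.61
print(round(accuracy_from(cm_second), 2))  # 0.66
```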

Confusion Matrix | How to Implement Confusion Matrix In R | Intellipaat