Hey guys, welcome back to the sixth module of this data science course. In today’s session we will start by understanding what the confusion matrix is. Then we’ll look at some of the performance metrics which can be obtained from the confusion matrix, following which we will understand the concept of thresholding. And finally we’ll implement the confusion matrix in R. So let’s get started.

Simply put, a confusion matrix helps you

to describe the performance of a classification model. To get a confusion matrix, all we have to do is create a table of the actual values and the predicted values. The confusion matrix itself is quite simple to understand, but the related terminology can be a bit difficult. We have different terms like true positive, false positive, true negative and false negative. So let us understand all of those terms through

this example. Let’s say there’s a dataset of all the patients in a hospital, and we build a logistic regression model on top of it to predict whether a patient has cancer or not. There are four possibilities, so let’s look at all four.

Let’s start with true positives. These are the cases in which the actual value is true and the predicted value is also true; that is, the patient has been diagnosed with cancer and the model also predicted that the patient has cancer. Next we have true negatives, and these are

the cases in which the actual value is false and the predicted value is also false; that is, the patient actually doesn’t have cancer, and the model also predicted that the patient doesn’t have cancer.

Then we have false positives. These are the cases in which the predicted value is true but the actual value is false; that is, the model predicted that the patient has cancer, but in reality he doesn’t. This is also known as a type 1 error. Finally we have false negatives, and these

are the cases in which the actual value is true but the predicted value is false; that is, the model predicted that the patient does not have cancer, but in reality he does.

Now consider the real-life implications of this. If there’s a patient who has cancer but you have incorrectly diagnosed that he doesn’t, this is really dangerous, isn’t it? That is why we need to reduce the number of false negatives. So those were all of the terms in the confusion matrix. Now we’ll look at some of the performance metrics.
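To make the four terms concrete before turning to the metrics, here is a minimal sketch in R. The `actual` and `predicted` vectors are hypothetical toy values invented for illustration (they are not the hospital data from the example); TRUE stands for "has cancer".

```r
# Hypothetical toy data: TRUE means the patient has cancer.
actual    <- c(TRUE, TRUE,  TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, TRUE, FALSE, TRUE,  FALSE, TRUE, FALSE)

# Count each of the four outcomes directly from its definition.
tp <- sum(actual & predicted)    # actual TRUE,  predicted TRUE
tn <- sum(!actual & !predicted)  # actual FALSE, predicted FALSE
fp <- sum(!actual & predicted)   # actual FALSE, predicted TRUE (type 1 error)
fn <- sum(actual & !predicted)   # actual TRUE,  predicted FALSE

# The same four counts arranged as a table of actual vs. predicted values.
conf_mat <- table(Actual = actual, Predicted = predicted)
print(conf_mat)
```

In the printed table the rows are the actual values and the columns are the predicted values, so the two correct outcomes sit on the diagonal and the two kinds of error sit off it.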

So we will start with accuracy. We get the accuracy by dividing the sum of the true positives and true negatives by the total of all the values; that is, dividing this left diagonal over here by all of the values gives us the accuracy. Let’s take this example. Over here we have 100 true positives and 47 true negatives. So we’ll add them up to get 147 and divide the result by the total of all the values, which is 202, and thus we get an accuracy of about 0.73.

The next performance metric is precision,

and this helps us to get the proportion of positive identifications which were actually correct. We get the precision of the model by dividing true positives by the sum of true positives and false positives. Over here we have 100 true positives and 30 false positives, so we get the precision by dividing 100 by 130, and the precision value is about 0.77.

The next performance metric is recall, and this helps us to get the proportion of

those actual positives which were identified correctly. We get the recall by dividing true positives by the sum of true positives and false negatives. Here we have 100 true positives and 25 false negatives, so we get the recall by dividing 100 by 125, which gives a recall value of 0.8. So those were the performance metrics.

Now we’ll understand why we need thresholding. When we build a logistic regression

model, it returns a probability and not a direct result. So in order to map a logistic regression value to a binary category, we must define a classification threshold: any value above the threshold is taken to be true, and any value below the threshold is taken to be false. Let’s understand the concept of threshold with this example. Over here we are trying to find out whether it’ll rain or not on the basis of some other factors, and the threshold value which we have taken is 0.65. In the first case the logistic regression

model gives a probability of 0.73 for rain, and since that is greater than the threshold value of 0.65, it’ll be classified as Yes. In the second case the logistic regression model gives a probability of 0.59 for rain, and since that is less than the threshold value, it will be classified as No. So that was the concept of threshold.

Now it’s time to implement the confusion matrix in R, so let’s go to RStudio. Right. So we have RStudio right in front of us

and this is our same old customer churn dataset. Right. Now, to build a logistic regression model, we already know that we need to use the glm function. But before that, again, we need to divide the entire dataset into train and test sets. For that we require the sample.split function, which is a part of the caTools package, so let me load that. I have loaded the caTools package, so now I can use the sample.split function. Right. This time I want to understand how the churn column varies with respect to the monthly charges column, so I’ll be splitting with respect to the churn column. The split ratio which I will take is 0.65, and I will store this in split log.

Now I can divide the dataset by using the subset

function. The first parameter is the entire dataset. Then I’ll use split log: wherever its value is true, I’ll select all of those observations and store them in the train set. Similarly, wherever the value of split log is false, I’ll select all of those observations from the customer churn dataset and store them in test. Right. Now again let me have a glance at the number of rows of train and test. So nrow of train gives me 4,578, and nrow of test gives me 2,465. Right. We have our training and testing sets ready.

Now it’s time to build a model. So I’ll use the glm function, and the first parameter

would be the formula. OK, so let me put the formula over here. It will be churn, which is the dependent variable; then I’ll give it a tilde symbol, and the independent variable would be monthly charges. So this is the formula, guys. Next I need to give the dataset on which we are building the model, so data would be train. And finally we have the family, which over here would be binomial, since we are building a logistic regression model. I will store this in mod log. OK, so we have built the model.

Now it’s time to predict the values, so I’ll use the predict function. The first parameter is the model which

you’ve just built. Next we have the new dataset, which is nothing but our test set. And finally we have the type of prediction, which will be response: since we want the probabilities, I’ll give response here, and I will store this in result log. There is a spelling mistake over here, so this needs to be response. Let me view result log. So these are the predicted probabilities, guys. Let’s have a look at the range, so it’ll be range of result log. Let me see what I get. Right. So this is the range of probabilities: it is from 14 percent to 44 percent.

Now that I have the range of probabilities, I can find the confusion matrix, and

we’ve already learned how to build the confusion matrix: all we need to do is build a table of the actual values and the predicted values. All right. The actual values we get from the test set, so the churn column from the test set gives my actual values, and the predicted values are stored in result log. Since these are actually probabilities, I now need to give a threshold value; let’s say the threshold value which I give is 0.3. So anywhere the probability is greater than 0.3 will be given true, and anywhere the probability is less

than 0.3 will be given false, and this will give me my confusion matrix. So this is our confusion matrix, guys. Now let me check the accuracy. For that, I need to divide the left diagonal by all of the values, so it’s 1167 plus 339, divided by 1167 plus 339 plus 644 plus 315. Let me see what I get. Right. So the accuracy of the built model is 0.61.

Now what I’ll do is change the threshold

value and make it 0.5. This time let me check the accuracy again. It’ll be 1459 plus 175, which are the left diagonal, divided by all of the values, which will be 1459 plus 175 plus 352 plus 479. This time the accuracy is 66 percent, guys. Okay. So when the threshold value is 0.3, the accuracy which we get is 61 percent, and when we take the threshold value to be 0.5, the accuracy which we get is 66 percent. So this is how we can work with the confusion matrix.

And this brings us to the end of the session. Thanks for attending, and let us meet in the next class.
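As a recap, the steps from this session can be sketched in one short R script. This is only a sketch under stated assumptions, not the code typed in the session: the `customers` data frame is simulated with made-up values instead of the real customer churn dataset, a base R `sample()` call stands in for `sample.split` from caTools so that the script runs without extra packages, and the names `mod_log` and `result_log` are one possible spelling of the spoken names "mod log" and "result log".

```r
set.seed(42)  # make the simulated data reproducible

# Worked metrics from the lecture's example counts
tp <- 100; tn <- 47; fp <- 30; fn <- 25
accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 147 / 202, about 0.73
precision <- tp / (tp + fp)                   # 100 / 130, about 0.77
recall    <- tp / (tp + fn)                   # 100 / 125 = 0.8

# Thresholding: the rain example with a cut-off of 0.65
rain_prob <- c(0.73, 0.59)
rain_pred <- ifelse(rain_prob > 0.65, "Yes", "No")  # "Yes" "No"

# Simulated stand-in for the churn data (hypothetical values, not the real dataset)
n <- 1000
monthly_charges <- runif(n, min = 20, max = 120)
churn <- rbinom(n, size = 1, prob = plogis(-3 + 0.03 * monthly_charges))
customers <- data.frame(Churn = churn, MonthlyCharges = monthly_charges)

# 65/35 split with base R; the lecture uses sample.split from caTools instead
train_idx <- sample(n, size = round(0.65 * n))
train <- customers[train_idx, ]
test  <- customers[-train_idx, ]

# Logistic regression of churn on monthly charges, then predicted probabilities
mod_log    <- glm(Churn ~ MonthlyCharges, data = train, family = binomial)
result_log <- predict(mod_log, newdata = test, type = "response")

# Confusion matrix at a threshold of 0.3; fixing the factor levels keeps the
# table 2 x 2 even if every prediction falls on one side of the threshold
actual    <- factor(test$Churn == 1, levels = c(FALSE, TRUE))
predicted <- factor(result_log > 0.3, levels = c(FALSE, TRUE))
conf_mat  <- table(Actual = actual, Predicted = predicted)

# Accuracy is the diagonal (correct predictions) over all predictions
model_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(model_accuracy)
```

Changing the 0.3 in the `predicted` line and rerunning the last few lines shows how the confusion matrix and the accuracy move with the threshold. The counts will differ from the ones in the session because the data here is simulated.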