Lecture Recording
- Twitch.tv (original lecture)
- Lark Minutes w/ Transcription: https://rong.feishu.cn/minutes/obcnq27eb1754w39x9ka2jln
The Lark Suite link might not work for everyone; see the resources for more details.
Text Transcription
The text transcription was performed by ByteDance Lark Suite's Meeting Minutes program, on the original lecture video. Below is the exported text version with timestamps.
October 9, 2021, 2:47 PM | 1 hour 16 minutes 35 seconds
Keywords:
Transcript:
Julian McAuley 02:25
Hope everything is going well. Only a couple of things to announce. First of all, I think the waitlist should be more or less cleared up by now; they're at least opening all of the seats for the remote section. If you're still stuck on an in-person section waitlist and you really want 100% certainty of getting in, you could switch to the remote section, but I think in practice it doesn't make much difference. The classroom in practice seems like it's not full, so if you're in the remote section and you'd really like to show up in person, there's probably space for you. OK, great. Also, I have been messing around with the audio settings on Twitch and so forth. I'm sorry.
Julian McAuley 03:14
The audio wasn't perfect in the first lecture. The recordings are also there on the podcast website. It is a little bit difficult this year because, you know, I have to do all of the broadcasting and everything from the one machine, so I can't really easily keep track of how everything's going while I'm lecturing. So please do tell me if there are any issues with the audio or anything, especially if you're remote. Good.
Julian McAuley 03:40
So where were we? We had started talking about supervised learning and regression and, really, just line fitting. Can I choose remote if I'm in person? Yes, if you're in person you can also attend remotely; attendance is not required. Yeah, so on Monday we got about this far, not very far. We talked about how we can build maybe the simplest possible prediction algorithm to estimate things like numerical quantities by:
Julian McAuley 04:15
Line fitting. If we can fit this line that finds the best association between height and weight, then, for a given feature such as height, we can find its position on the line, and that gives us an estimate for that person's weight. So it's kind of an approximation or prediction. We talked about how we can do that in more than a single variable: we could have many different input features from which we would like to predict weight or something else. And we'd like a model that automatically learns what these coefficients should be: which of these features is correlated or negatively correlated with the output variable we're trying to predict? And the question is, how can we fit all of these unknowns as accurately as possible?
Julian McAuley 05:04
OK, good. And we had sort of shown we can take our equation for a line or our equation for a plane, sort of y = m x + b, or y = m1 x1 + m2 x2 + b, and we can rewrite that:
Julian McAuley 05:21
As this kind of inner product, where the left-hand side of our inner product contains features, things we observe, and the right-hand side of the inner product contains unknowns, or model parameters, that we'd like to fit. And the point is, we'd like to choose those model parameters so that the predictions are as accurate as possible. Writing things out in terms of this inner product was convenient because it allows us to do linear algebra.
Julian McAuley 05:55
Right, we could then take this relationship between height and weight, and we could write it out for many, many observations in terms of a vector y of the quantity we're trying to predict, a matrix of features X that we're trying to predict from, and a vector of unknowns theta that we'd like to use to make the prediction.
Julian McAuley 06:22
OK, so, you know, we have this equation, which is an equation in 3 unknowns: theta nought, theta one, and theta two. Can we kind of solve this equation for theta? That's what we'd like to do: we have an equation that looks like this; find the value of theta that corresponds to the line of best fit. So, you know, let's get some linear algebra going, OK? What can we do? I think your first guess, if you haven't seen this type of equation before, might be to say:
Julian McAuley 06:53
Alright, we have a matrix X here. Let's solve for theta by multiplying both sides by X inverse. And, you know, maybe it's obvious, maybe it's not obvious, why that kind of doesn't work. The thing about X here is that it's not an invertible matrix. Why is it not invertible? Well, it's not square. OK, so this matrix has 3 columns, corresponding to our 3 features, or 2 features plus a constant, but it could have, you know, 10,000 rows corresponding to 10,000 different people. So we can't invert that type of matrix; that's not going to work. So we're going to have to apply some kind of trick. We'll say: here's our equation, I'll give myself some space. Let's multiply both sides of this equation by X transpose. OK, this is kind of a strange trick, but if X is a 10,000 by 3 matrix, then X transpose is a 3 by 10,000 matrix, and X transpose X is going to be a 3 by 3 matrix. OK, so it's at least square, to the extent that being square makes it invertible, and now we can multiply both sides by the inverse of that.
Julian McAuley 08:24
OK, and now on the left-hand side this will cancel out. Sorry, which part? Excellent. Yeah, X to the T is just X transpose. You may have to revise some linear algebra; we can point you in the right direction, but yes, you're just putting the matrix on its side, OK? You take a 10,000 by 3 matrix and you're converting it to a 3 by 10,000 matrix. If you forgot your linear algebra and you'd like not to remember it again, you know, I don't blame you; you can do all this with library functions, of course. This is really just the derivation of where a particular solution comes from. And this is a bit funny.
Julian McAuley 09:11
Actually, I think we did something weird here. This is what's called the pseudoinverse: you've taken this matrix, multiplied by its own transpose, and then inverted that. It's really bizarre. If you look at it for like 30 seconds and you remember linear algebra, maybe this makes sense to you; if you look at it for, I don't know, another 2 minutes, it stops making sense again, I think. Like, what did we do? I mean, we took this, you know, 10,000 by 3 matrix, which is a ton of information, like thousands and thousands of observations, and we converted it to this little 3 by 3 matrix. It's like we destroyed information by doing this. And of course, what did we get at the end of the day? We get like a line of best fit or a plane of best fit or something; the solution theta here does not go through the values exactly, it's an approximation. And it's not clear, you know, where in this equation did the approximation happen? Like, what happened when we destroyed data, and why is this a good solution? There are many different approximations you could have; why is this one a good one? That's something we'll get back to later on. For now:
Julian McAuley 10:25
This is just like a kind of magic equation that gives us one potential solution to this problem. OK, so that's our line of best fit; that's our solution for theta, in terms of what's called the pseudoinverse of X.
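A minimal NumPy sketch of this pseudoinverse solution; the toy data here is made up for illustration, not from the lecture notebook:

```python
import numpy as np

# Toy data: a column of ones plus two features, and the labels y
X = np.array([[1, 170, 1],
              [1, 160, 0],
              [1, 180, 1],
              [1, 175, 0]], dtype=float)
y = np.array([70, 60, 85, 78], dtype=float)

# theta = (X^T X)^{-1} X^T y, the pseudoinverse solution derived above
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice a library routine is more numerically stable:
theta2, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```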
Julian McAuley 10:42
OK, so yeah, there's the math version of it. Let's actually work through an example in some code. This is probably what's going to be most helpful to people who are trying to start their homeworks already and so forth. So let's try and build a simple regressor that says how:
Julian McAuley 11:00
People's preferences towards beer vary as a function of age. This is one of the data sets that's on the course webpage. I don't know that it's used for the homework yet, but it'll show up at some point, I'm sure. So this is a data set I'd scraped from BeerAdvocate; lots of people use it for their assignments. We have here, well, a bunch of users, a bunch of items, and a bunch of metadata associated with the users and the items. We have these reviews, where people evaluate beers in terms of their look, feel, taste, smell, and overall impression. We have information about the user, such as their gender and their age and their country, and information about the beer, such as its alcohol level, ABV, whatever you like.
Julian McAuley 11:46
All that good stuff, OK, and that's all available on the course webpage. So yeah, let's try to make this real-valued prediction using regression, either using linear algebra libraries or just using a regression library. So what are we trying to do? We're trying to say the rating of a beer is equal to theta nought plus theta one times... whoops, that's not age at all... age. So in other words, do ratings increase or decrease with age? We can build a simple predictor; we can also maybe learn something about how preferences change with age, or do some science, if you like. Someone's asking something about BetterTTV.
Julian McAuley 12:35
BetterTTV is enabled? I don't know what that is, sorry about that; you can explain it to me after class or something. OK, so let's give that a go. We have some code fired up here, and I'll release the source code of all of these notebooks for folks later on, and there's plenty of starter code on the website. But this could be one solution you might base your homework solutions off of. So, NumPy is a library for:
Julian McAuley 13:05
Linear algebra. Let me know if the font size is a problem for anyone, but hopefully that's kind of readable. This is a scientific optimization library; I think that's where we're getting our regression routines from.
Julian McAuley 13:23
This is a random number generator; I don't think we'll use that today. This is a utility function for reading structured data sets. So let's do those imports. Alright, so I'm going to read in a data set from a file; hopefully that's not too strange.
Julian McAuley 13:45
Can you up the font size? Yes, I can. Maybe that's better, OK, see how that goes. So this is a file I just downloaded and saved to my hard drive. Here it is; it's not much to look at. I'm sure the font of that is not readable, but it's giant, so it'll probably crash my browser if I try and change the font size. But basically it's a bunch of strings; each line of the file corresponds to a single review. It contains some information about the item, so we have things like the style of the beer, the rating the user gave to these various aspects, the palate, the taste, the name of the beer, the time when the review was entered, the alcohol level and ID of the beer, and probably the user somewhere.
Julian McAuley 14:35
The actual text of the review itself, whatever. So this is a JSON-structured file, which is basically equivalent to a Python dictionary, if you've ever used Python before, which really means it's a bunch of key-value pairs. So we have a string describing a particular key and then a value, and that can be recursively defined: for the time struct here, the key is the time struct and the value is itself another JSON object, or another set of key-value pairs, whatever. I mean, this is almost like a native type in Python, so we can read it using this magic command called eval, and that will just read that string, read this line, as though we had typed it into Python. And because this looks like a Python dictionary, the result will be a Python dictionary. There's always someone who complains and says you shouldn't be giving us random files to just run eval on: you could import a bunch of libraries and then, I don't know, steal all of our data and then format the hard drive or something. If you don't trust me that much, you can use this thing called literal_eval from, I guess, the abstract syntax tree library, which will make sure that the things it's reading are not doing import statements and so forth. It is safer, but slower. We don't want that, so we'll just use eval, right?
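A sketch of this kind of line-by-line parsing; the file name is hypothetical, and eval versus ast.literal_eval is exactly the trade-off just described:

```python
import ast

def parse_data(path):
    # Each line is a Python-dict-like string (one review per line);
    # eval() reads it directly as a dictionary: fast but unsafe.
    for line in open(path):
        yield eval(line)

def parse_data_safely(path):
    # ast.literal_eval only accepts literals (no imports, no calls):
    # safer than eval, but slower.
    for line in open(path):
        yield ast.literal_eval(line)

# data = list(parse_data('beer_50000.json'))  # hypothetical file name
```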
Julian McAuley 16:03
OK, good, so we call this function that just reads in all the lines from the file. It still takes a few seconds, not too long; maybe I should have done it before. There it goes, OK. And we can see it should be length 50,000, I think, and this is just the first entry, so you can see it's now a bunch of key-value pairs. I could extract a particular value if I wanted to. So, like this. What? No, I misspelled it. There you go.
Julian McAuley 16:53
What was your question? We were trying to estimate rating as a function of age, OK? So where is the age key-value pair here? I'm not sure it's there at all; I think not every user has to enter it, so let's find one who did. I think it's called age in seconds or something, so point it out if you see it. We'll get there. No one wants to tell us their age when they're reviewing a beer, I guess.
Julian McAuley 17:28
There it is, OK: user birthday raw. There's our age feature. This user apparently was born in 1901, fantastic. So, you know, when you work with data sets you usually have to filter them; that's nothing too shocking. Can you use other data processing libraries? Absolutely, whatever you like.
Julian McAuley 17:49
Now, this is just how I do it, and this tends to be the format of most of the data sets I build. OK, so we have this string of the birthday. I did also make this sort of convenience variable, which is the user's age in seconds. I don't know if it's the most convenient way to represent their age, but it's kind of difficult to manipulate these string-structured:
Julian McAuley 18:11
Date formats, right? So let's build a version of our data set that only has users with, I'm going to be ageist here, but you know, users with a reasonable age. We'll do that using a list comprehension. I'm going to destroy my data set by doing this. First of all, the user has to have entered their age, so:
Julian McAuley 18:37
'user/ageInSeconds'. So, first of all, the data point has to have an age, and the user's age in seconds has to be less than... what's a reasonable maximum age for someone to be drinking beer? Maybe 80, I don't know. So that's 80, converted to seconds: times 60 for the minutes, times 60 for the hours, times 24 for the days, times 365 for the years. Alright, so the user has entered an age and their age is less than 80. Alright, I don't know, you can think about whether that's good filtering or not. Let's see how much we have left: 10,000 such users, OK, so we lost four fifths of our data set.
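Roughly, the filtering comprehension being typed here might look as follows; the key name follows the dataset as described, but treat the exact cutoff as a sketch:

```python
# 80 years expressed in seconds: 80 * 60 * 60 * 24 * 365
MAX_AGE_SECONDS = 80 * 60 * 60 * 24 * 365

# Keep reviews whose user entered an age, and whose age is under 80
data2 = [d for d in data
         if 'user/ageInSeconds' in d
         and d['user/ageInSeconds'] < MAX_AGE_SECONDS]
```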
Julian McAuley 19:31
I mean, that is a problem when you're doing analysis on real data. OK, so we now have a data set where every single user has an age in seconds, and it should be something semi-reasonable: this person was born in 1958, believable enough. OK, so we'd like to build our matrix of features. So what are we going to have? We're going to have that column of ones in X, followed by the feature we're using to make a prediction. So our feature is just a one, and we'll append onto that... what was it? It was datum, 'user/ageInSeconds'. And we'll return it.
Julian McAuley 20:16
Return f. So this is just a function that, for a single data point, returns the feature vector corresponding to that data point. OK, so now that we have a feature vector, we want to build our feature matrix by extracting that vector for every single data point. So we want feature(d):
Julian McAuley 20:40
For d in data. I mean, this magic here is called a list comprehension. If you haven't seen it before, it's quite lovely; it's basically condensing a for loop into this list. Let's see how that goes. So we can look at a few data points.
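Putting the last two steps together, a sketch of the feature function and the comprehension (`data2` is the filtered dataset from above):

```python
def feature(datum):
    # A constant (intercept) term followed by the age in seconds
    return [1, datum['user/ageInSeconds']]

# The list comprehension: one feature vector per data point
X = [feature(d) for d in data2]
```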
Julian McAuley 20:58
OK, here is our feature matrix, right: we have a column of ones and we have our column of ages. You know, you might want to mess around with this and process it a little more; you can imagine, when you try and invert matrices where you have these really tiny values and really huge values, that could lead to not the most stable system. But let's see how we go. And our vector of predictions, the thing we're trying to predict, what was that going to be called? 'review/overall'. We could predict several other quantities, but that would be fine.
Julian McAuley 21:43
OK, what does that look like? So there are some predictions. Both of these, you know: y has length 10,000, and X has length 10,000 and width 2. So we can just feed all that into a nice library. Just a warning; I think it's fine, it's probably saying the matrix is badly conditioned or something. Anyway, this library will spit out some value of theta. Fantastic, so it says... what does this say? This is our value of theta nought, and this is our value of theta one, which says the rating is equal to 4.09 minus, this is a very tiny value, 1.58e-10, so that's times 10 to the power of -10, times the age in seconds. So every second you age, you like beer this much less, in other words. And you can kind of get a sense of that quantity.
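A sketch of the fit; the lecture's exact library call isn't visible in the transcript, but numpy.linalg.lstsq returns the same (solution, residuals, rank, singular values) tuple discussed below:

```python
import numpy as np

y = [d['review/overall'] for d in data2]

# Least-squares fit; equivalent to the pseudoinverse solution,
# but more numerically stable.
theta, residuals, rank, sv = np.linalg.lstsq(
    np.array(X, dtype=float), np.array(y, dtype=float), rcond=None)
# theta[0] is the intercept (about 4.09 in the lecture);
# theta[1] is the per-second age coefficient (about -1.58e-10).
```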
Julian McAuley 22:53
This is a pretty dumb model, but I mean, it's making some kind of prediction. So theta one: let's put it back into years, I guess. So that's per minute, per hour, per day, per year: that's about -0.005. So every year, you like beer this much less; if you live to be 100, half a star gets wiped off your rating.
Julian McAuley 23:18
Yeah, I mean, you can think of the problems with this: here we're fitting something with a linear function that probably isn't really linear. It's probably not true that you like beer the most on the day you're born; probably it peaks in the middle somewhere and then goes down again. Is that something we can capture with a linear function? Not clear yet. Alright, so yeah, that's that as a function of age. I mean, you can see how you can change the variable here.
Julian McAuley 23:45
We could do the same thing for the ABV feature, the alcohol level. Not very hard. What was that called? 'beer/ABV'. There you go; everything has an alcohol level, I suppose. OK, and it says your rating for a beer is 3.4 plus 0.06 times the ABV. So for every percentage point of alcohol it has, you like it this much more: if it's 10% alcohol, you give it an extra half star; at 100% alcohol, you'd give an extra 6 stars. Great.
Julian McAuley 24:27
And you can do all of this just with good old, you know, linear algebra operations, and see if you get kind of the same thing. I mean, I think roughly you do; there can be slight differences if the matrix is really ill-conditioned or something, but roughly speaking, what this library is doing is just performing that kind of matrix inversion like I described. You get the same answer here.
Julian McAuley 24:53
What are the rest of the values that it returns? I mean, we'd have to dig into the library documentation to figure that out. Residuals, I think, is a measurement of the amount of error per sample; if I remember right, that's the sum of squared errors, which is one of the error measures we'll use later in the class. Rank, I suppose, is the rank of the system. But OK, someone's posted the documentation before I can even start speaking. I don't remember what this last one is; something something, great. OK, not too bad, right? And, you know, you can think about what the interpretation of these values is: one is the intercept and one is a slope. When you're:
Julian McAuley 25:40
Interpreting the output of a linear model, you have to really think carefully about what that means. If we made a statement like, for every percentage increase of the ABV the rating goes up by 0.06, that's not always a precisely true statement. What a linear model is saying is that that is a statement that is true if none of the other features change. Once you have a model with many different features simultaneously, those features are going to be correlated with each other, and if you examine a single coefficient, you're saying: if nothing else changes, only the ABV changes and everything else remains fixed, then:
Julian McAuley 26:21
This parameter says how much my prediction would go up if that feature changed by a single unit. I don't know, take time to digest that wording to get an exact sense of it. Alright, that's the simplest case: we had a single real-valued feature and a single real-valued quantity that we were trying to predict. OK, so what happens with other feature types that are not real-valued? How do we predict preferences as a function of gender or something?
Julian McAuley 26:56
This is sort of starting to get us into the topic of what's called feature engineering, which we're going to spend half of the lecture on today, probably. So how can we estimate rating as a function of gender? I mean, the picture that we had before maybe doesn't make sense anymore, right? We don't observe gender values in a data set like this as being something that sort of lies along a continuum. I don't remember what BeerAdvocate has; I think it has a male, female, or don't-specify selection of genders, but of course, you know, one can have many more than that. So to begin with, maybe we'd like something along these lines, and, you know, we can make it more complex in a minute. So you're not really doing a line of best fit anymore; you're saying there should be a prediction for males and there should be a different prediction for females. So you'd like a model that's kind of like this, right, that says: the model should make one prediction for males and it should make a different prediction for females. OK, so we'd like theta nought to be our predicted rating for males and theta one to be the difference between males and females: how much higher does a female rate something than a male? That would be our interpretation of theta one in this model. OK, so yeah, we sort of took this picture of fitting a line and converted it to this picture of fitting a histogram or something, but it's still really line fitting; we're just fitting a line that only passes through 2 values, right?
Julian McAuley 28:38
And I think if you think about that in terms of what your feature matrix looks like now, we really have a feature matrix that looks like this. This would be a binary encoding of gender, essentially: you still have your column of ones corresponding to theta nought, then you have some zeros and ones, and whatever. And this would be a male user, and this would be a female user.
Julian McAuley 29:13
So let's go back to the binary encoding. Our feature is no longer a real value; it's just a binary measurement of whether somebody exhibits a particular attribute or not. Great, that's binary features; it wasn't too difficult. So, I mean, let's do it in the code, why not? So what is our gender attribute? 'user/gender', I think. I'll have to filter the data set a bit more to make sure I only get users who entered their gender: if 'user/gender' in d, and... something like that, I think. You know, we only lost a bit of our data, surprisingly, I suppose. And we can build this feature matrix. So how should we build it? We'll make the feature for a male user, and we'll say: if datum's 'user/gender' equals, I think it's just 'Female' with a capital F, you've got to get these details right, capital F, yeah, then f equals [1, 1]. Right, so this is our binary feature for a male user, and this is our binary feature for a female user, and then we return the value. OK, looks like it's a lot of dudes reviewing beers, OK, fine. There are our predictions. OK, so this is now our... sorry, something is killing my computer, and I think I'll have to restart this kernel or something. Or it could be this horrible giant JSON file that I've been trying to read. Let's see if that frees up some memory.
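A sketch of the binary gender feature as described; the key and the 'Female' label follow the transcript, the rest is illustrative:

```python
# Keep only reviews whose user specified a gender
data3 = [d for d in data if 'user/gender' in d]

def feature(datum):
    f = [1, 0]                            # [intercept, female indicator]: a male user
    if datum['user/gender'] == 'Female':  # capital F, per the dataset
        f = [1, 1]                        # a female user
    return f

X = [feature(d) for d in data3]
y = [d['review/overall'] for d in data3]
```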
Julian McAuley 31:17
This is my prediction for male users, and this is not my prediction for female users but the delta: it's saying females are going to rate things that much higher than males. Alright.
Julian McAuley 31:33
What is using all my CPU? Let's restart the kernel, shall we? Which is going... slowly.
Julian McAuley 31:50
These Windows laptops, I don't know.
Julian McAuley 32:00
Is that going smoothly again now? Maybe; let's hope everything doesn't crash. Which is still going. I didn't drop any frames, CPU usage 22%, fantastic. OK, good, sorry.
Julian McAuley 32:18
You know, the reason I like these Microsoft laptops is 'cause you can write with the pen thing. It doesn't always work that great, but I haven't found anything else that you can do this with, I don't know, and also run TensorFlow at the same time. So OK, let's do some sort of more complex feature representations. I think, in theory, this is where we were meant to end yesterday, but I ran out of time and got busy with lots of questions and stuff. So the next exercise, which we'll come to in a minute, is doing more complicated feature transforms. OK, yeah, so we'll get to doing months in a minute too.
Julian McAuley 33:03
Alright, what was our next exercise? Can we do something like quadratic functions of the ABV? Let's try that quickly and see if we can avoid crashing our computer too much. So that would be a feature like... well, let's see what we're trying to do. In our previous example we kind of said, look, we have rating as a function of ABV. We have a bunch of observations, and we fit it using a line, and we said the rating just goes up and up and up, and we think that's probably not realistic. Right, we would think it's more realistic that, well, I don't know what happens, maybe beers get better with a certain amount of ABV and then they start to get worse again. That seems like a more realistic shape, doesn't it? Yeah, alcohol by volume, sorry to use so much beer lingo; I'm from Australia and I just got used to it. So maybe this is a more realistic shape of function, and I think when you first see these kinds of linear regression algorithms you think: what is the use of linear regression if we're just fitting lines? We can't even fit something like a quadratic. But it turns out you can do that just fine. What it means for a model to be linear is that it's linear in the unknowns, OK? So you can write out the model as theta nought:
Julian McAuley 34:41
Plus theta one times ABV, plus theta two times ABV squared, plus theta three times ABV cubed. Right, so that's a linear model. We've taken nonlinear transforms of our features, but the model is still linear in the parameters, so we can do that just fine. So we would fit a quadratic by writing a feature vector that had:
Julian McAuley 35:12
f.append, datum 'beer/ABV'; f.append, datum 'beer/ABV' squared. We can go further and see if it works or not. You know, there's your feature matrix. We can see what values of theta we get. OK, there you go, we got a bunch of different values. The model's prediction of the rating is equal to 3, plus 0.17 times the ABV, so it goes up linearly with the amount of alcohol, but then it goes down slowly, quadratically. So eventually, if it has enough alcohol, this second term will take over and sort of tank the prediction of the rating. Makes more sense, right? This is a seemingly more sensible model.
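The quadratic feature vector being typed, roughly:

```python
def feature(datum):
    f = [1]                          # intercept
    f.append(datum['beer/ABV'])      # linear ABV term
    f.append(datum['beer/ABV'] ** 2) # quadratic ABV term
    return f
# Still linear in theta: only the features are transformed.
```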
Julian McAuley 36:14
OK, very good. So you can fit things like polynomial functions just fine with a linear model; you just have to understand the distinction between the model being linear in the parameters versus being able to take nonlinear transformations of the features. Wonderful. I always forget which slides I have animations on and which ones I'm writing all over.
Julian McAuley 36:50
Very good, excellent. So yeah, we can do things like this: we can take all kinds of nonlinear transforms. We can take quadratics, we could take an exponential function of our features, we could take a periodic function of our features; as long as we can write out the prediction in this form, it means it's linear in theta.
Julian McAuley 37:17
Thanks so much for the subscriptions. I should point out to people, you really do not have to subscribe to avoid the advertisements. I don't know if it makes any difference; you shouldn't be seeing those if you, I don't know, install the right ad blockers or something. No one's been complaining about them today, so hopefully people have figured that out.
Julian McAuley 37:36
OK, good. So what you can't do: I mean, this is what you can't do, when we're talking about this distinction between a model being linear in the features versus linear in the parameters. You can't fit a model like this, where you say, oh, you know, I want to transform my parameter in this way, by squaring the parameter or passing the parameter through some complex function. And these are both perfectly reasonable ideas to have. You might think, well, if I square my parameter, I will ensure that that model coefficient is always positive; I know there should be a positive relationship, so I'd like to fit a model that only has positive coefficients. And this funky operation here, which is called a sigmoid function, gives you a value between zero and one; maybe you know that a parameter, that a coefficient, should be between zero and one. So that's what we can't do with a linear model.
Julian McAuley 38:29
You could no longer solve this for theta by taking some nice matrix inverse. Wouldn't it be easier to do PCA and take the most causal variables rather than including every possible feature?
Julian McAuley 38:43
Yeah, I mean, maybe. Funnily enough, this year I decided to delete PCA from the whole course curriculum. I mean, PCA is this form of dimensionality reduction, and, you know, one way or another, not many people ended up using it to build their predictive algorithms for their assignments or anything. It's a technique to find low-dimensional structure in data sets, which can be used as a form of feature engineering, but, you know, I think we'd have to have a longer conversation as to whether that's in general a good idea or not. Let's not get ahead of ourselves too much.
Julian McAuley 39:28
OK, and yeah, the final point here is, you know, fitting these types of complex nonlinear relationships, doing something like this or like this, that's what this whole topic of deep learning is all about. It's about coming up with nonlinear transforms of our parameters and so forth, which is the difference between, you know, a neural network or something and a simple linear model.
Julian McAuley 39:49
OK. So where were we? Sorry, the slides are in a bit of a funny order, but, you know, what have we done so far? We've done simple linear regression, we've done polynomial transforms of our features, we've done binary features. What we'd like to do next, maybe, is really categorical features rather than binary features. So of course, you can have more than 2 possible values for a gender attribute; even in this beer data set we kind of do, because a lot of users just don't specify it, so we have at least 3. How could you fit that with a linear model? How can you write down a function that looks like this if you no longer have this binary male/female option? Right, so your first thought might be: let's do something like this, let's encode the gender information this way. We'll call male zero, female one, other two, not specified three, and so on and so forth.
Julian McAuley 41:01
OK, that would mean our model's predictions would look like this, right? Our prediction for males would be theta nought; then this is the difference between males and females, this is the difference between males and other, and this is the difference between males and not specified. It kind of seems OK on the surface of it.
Julian McAuley 41:22
OK, but you'd really be fitting a model like this, where again you have 4 values for that attribute and then you're just finding a line of best fit that goes through all 4 of them. And you can see, you know, it'll work fine: you can implement it and run it, but it's not going to be a very flexible function, right? So your code is not going to crash or anything, and that's kind of the tough part of machine learning.
Julian McAuley 41:52
But you've accidentally made some unreasonable assumptions here. You're saying, look, this difference between males and females, because this is a linear function, is identical to this difference between female and other, and so on and so forth. If I had encoded my variables differently, just by putting them in a different order, I'd learn a different function. So you couldn't fit something like this with that encoding, which is what you'd really like to fit: a different value for each possible variable.
Julian McAuley 42:25
Alright, so maybe this is the model we'd really like. It says there are actually 4 parameters, right: there's a prediction for males, and then different deltas for female, other, and not specified. And that kind of makes sense: you're trying to predict 4 possible quantities, so you should do so with 4 unknowns rather than just 2 unknowns. Maybe that makes more sense.
Julian McAuley 42:51
So, if you want to write something like that out as an inner product, that might look like this. You want to write it out as an inner product between theta and your features, and your encoding would look like the following.
Julian McAuley 43:06
It's a vector of length 3: females are coded 100, others are coded 010, not specified is 001. And if you take this encoding and you just expand it, you'll get the predictions from the previous slide, basically; take some time to do that if you're not following. I think there's sort of a small detail here; I think somebody already asked about this on Piazza. You know, we had 4 possible values, male, female, other, not specified, yet our feature dimension is only length 3, and that's kind of a weird and surprising thing the first time you see it. You know, why don't we use a length-4 encoding? And the very short answer is, well, we just don't have to. So yeah, we have a length-3 encoding, but we just encode males using theta nought. Yeah, so males are essentially 000, as someone says in the chat there. So with 3 values in our encoding, we can still specify 4 possibilities. And that's what's called a one-hot encoding, alright: it's an encoding where you have a vector of zeros with a single one corresponding to a particular category of object. Yeah, you only need 3 dimensions to establish 4 different categories. So this is something we'll see showing up lots of the time as a way of representing different types of features, whenever you have categorical objects, or even objects that belong to multiple categories simultaneously. Like if you were saying what the genre of a movie was or something, you could use this type of encoding, but a movie could belong to multiple genres simultaneously, perhaps.
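A sketch of this one-hot encoding as a feature function; the category labels 'Other' and 'Not specified' are assumptions about the dataset's exact strings:

```python
def feature(datum):
    g = datum.get('user/gender')
    f = [1]                                     # intercept: males are [1, 0, 0, 0]
    f.append(1 if g == 'Female' else 0)
    f.append(1 if g == 'Other' else 0)          # assumed label string
    f.append(1 if g == 'Not specified' else 0)  # assumed label string
    return f
```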
Julian McAuley 45:01
When do we use int and when do we use one-hot? So, you know, why do I use this encoding and why do I not use this encoding? Well, the short answer is that it corresponds to a wrong assumption: there is no linear relationship between this integer encoding and the value you're trying to predict. I think maybe a clearer example of that will come up when we're talking about modeling temporal data, just in a few minutes in this lecture, where you kind of could do it with either a linear encoding or a one-hot encoding, but one will kind of work better. Or just use a float? And we are using a float.
Julian McAuley 45:43
The prediction is not an int, I should say here; the prediction is still a float, right? These are the predictions; these values are real-valued quantities. It's only the features that are integers; the things we're predicting are still real-valued. Male would be 1 0 0 0, 'cause we always have that one at the beginning of our feature vector, and female would be 1 1 0 0. OK. Yeah, so, you know, what is the danger of using a redundant feature? So, just in the binary case, if we said male is equal to 1 1 0... It's going slow again; I'm just going to restart my kernel, can't hurt, right? You'd think I have enough memory that this wouldn't be a problem. Apparently not. Windows 11.
Julian McAuley 47:02
OK, I've called males 1 1 0, and females we'll call 1 0 1, OK? I'm going to bombard you with some linear algebra, but then our matrix X would look something like this: rows 1 through 5, where we have maybe some male users and we have some female users.
Julian McAuley 47:29
OK, it doesn't seem so bad yet. If we try to take X transpose X now, I can do this very fast: that's going to be [5, 2, 3; 2, 2, 0; 3, 0, 3]. And then we try and invert this thing: we would call this column a, and this column b, and this first column is a plus b. Which, I don't know, long story short, if you don't remember linear algebra: you can't invert this matrix.
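A quick NumPy check of this, under the redundant encoding just written on the board:

```python
import numpy as np

# Redundant encoding: male = [1, 1, 0], female = [1, 0, 1];
# the first (constant) column is the sum of the other two.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)

XtX = X.T @ X                      # [[5, 2, 3], [2, 2, 0], [3, 0, 3]]
print(np.linalg.matrix_rank(XtX))  # 2, not 3: X^T X is singular
# np.linalg.inv(XtX) would raise LinAlgError("Singular matrix")
```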
Julian McAuley 48:09
So where did we go wrong? We did something pretty simple and we ended up with a non-invertible system. I think there's kind of an easier way to see the problem with this, if you don't remember your linear algebra so well, which is to say, you know:
Julian McAuley 48:28
Pen, come on. Come on, pen, please. No, don't die on me. Is it out of battery? Maybe my pen has died; that would make me extremely sad. Can I write using a mouse? No, the computer has just frozen. It's not its day today, I'm sorry.
Julian McAuley 49:16
It’ll be alright. Let’s give it a quick reboot.
Julian McAuley 49:53
Ok. I think we’re almost back.
Julian McAuley 50:08
Welcome back where did you go?
Julian McAuley 50:22
OK, I'm very sorry about that. OK, we were saying: why is this matrix not invertible? It seems like a very harmless matrix; it seems like a reasonably harmless encoding that we tried to use here. Another way to think about that is to say, look, the rating:
Julian McAuley 50:42
Is equal to theta nought, plus theta one times male, plus theta two times female. So you could have some values there, like: the rating is equal to 4, plus 0.5 for males, plus 0.2 for females, or something. Or you could just as well say the rating is equal to, you know, 1000, plus, minus I guess, minus 995.5 for males, minus 995.8 for females, right? If I got that right. I think you can see those are both identical to each other.
Julian McAuley 51:39
These are 2 identical solutions. So the problem with this system, really, is that you have infinitely many identical solutions. It's kind of underspecified, and that's the main reason we want to avoid having this kind of:
Julian McAuley 51:58
Redundancy in our feature matrix. We didn't need to include this female column, 'cause we could fully determine the values in that column from the other column, and that's something you generally want to avoid when designing a model. In practice, with most of these libraries, it's not actually going to matter, but OK, that's why we generally don't have this:
Julian McAuley 52:18
Extra dimension. If we have a 12-value categorical variable, we'll usually use an 11-dimensional one-hot encoding. Yeah, that's exactly right, from the comment: the redundancy here is that one feature is just the negation of another feature. OK. So yeah, let's do a more complex example, using something like the month. How do we encode the month of the year as a way of doing the prediction? And that gets to someone's question previously about when it is suitable to just use a direct integer encoding versus using a one-hot encoding. And I think, on the surface of it:
Julian McAuley 52:56
This kind of starts to look reasonable again, right? If we're trying to predict how ratings change over time, fitting something like a line might seem perfectly OK: do the ratings go up or do they go down with the months of the year? So it looks alright; we're predicting a real-valued quantity from real-valued data, so the month is real-valued:
Julian McAuley 53:21
0 to 11 or something. But, you know, when does it not work? This is essentially how we tried to do it, right: we tried to do this with a simple feature transformation, rating equals theta nought plus theta one times month, and month is just integer-encoded. Yeah, that's what you get; that's kind of how it would look: we've just mapped the months to integer values from 0 to 11. OK, good. So where that starts to look bad is when you sort of stick 2 years together. If you think about what happens when the month goes from December to January again, what you end up with is fitting this function that looks kind of like this sawtooth curve. What does that say, really? It says the rating either goes up or goes down with the month of the year, but then at the beginning of the next year it must reset again, so it goes back to zero. That's the kind of function you would be trying to fit if you attempted to encode the month as an integer. Right, it seems reasonable, but when you sort of draw it that way, it looks like nonsense. You could of course fit a function like this too; you could just fit a relatively flat line or something, but it's always going to have that sort of shape because of the particular encoding you chose to use. Do I take attendance on Twitch? Absolutely not. There aren't student IDs either, I don't think. What happens if you look at multiple years? So yeah, it wraps around; you get this sawtooth pattern. So I think the second thought you might have, rather than fitting it using a linear function, might be to say: alright, it's periodic data, so we should fit it using a periodic function. Kind of makes sense. So you might think: let's use something like a trigonometric function like this, or, you know, convert the month to a number of degrees, and we have some offset term and a scale term, and you get something like this, maybe. You know, ignoring for the moment that you can't actually do that: this is not a linear function anymore, we have an unknown inside of a trigonometric function, so we can't do this. But even if we could do this, is it actually a good idea or not? Is this a reasonable model? So yeah, it's not linear. If we can't do that with a linear model, what can we do? What we can do with a linear model is essentially a piecewise function, and this is going to be easier than it sounds. We can fit something like this, which is actually a very flexible class of function that just makes a different prediction for each month. And that's essentially going to correspond to:
Julian McAuley 56:21
Using one-hot encodings for our temporal data, right. So, essentially, like we did for this sort of categorical gender attribute, we have a different prediction for each month, right? So that corresponds to basically mapping the months into a one-hot encoding. Yeah, so this is how we would encode the data: we have the column of ones, and then we have an encoding which says what month it is, is it February, up to December. Great, pretty easy. So yeah, that will allow us to fit this nice, flexible piecewise function that captures much more flexible shapes than what a sine wave or something would be capable of capturing. Actually it's not smooth, it's jagged, but it's very flexible. So yeah, I mean, that's about it, I think, for:
Julian McAuley 57:20
You know, encodings. You can of course stick all of these things together; I think there's a homework question that's kind of like this, this year. So you can say, well, maybe there are seasonal trends for beer preferences at the level of entire months or, you know, really, seasons. Maybe there are also trends at the level of days of the week: people just give different ratings because it's Friday versus it being Saturday. So we could model both of those things simultaneously just by concatenating 2 of these one-hot encodings together, as sketched below. Great, pretty easy. Yeah, I guess I did this in my free time at one point.
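A sketch of the month one-hot feature just described; the 'review/timeStruct' key and its 'mon' field are assumptions about this dataset's exact schema, and a day-of-week one-hot could be concatenated the same way:

```python
def feature(datum):
    f = [1]  # the intercept absorbs the January prediction
    month = datum['review/timeStruct']['mon']  # 1..12 (assumed schema)
    # 11 one-hot dimensions for February..December
    for m in range(2, 13):
        f.append(1 if month == m else 0)
    return f
```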
Julian McAuley 57:58
This is the actual seasonal trend of beer ratings as a function of the month, if you stick them through this type of linear regression model with the one-hot encoding. The y-axis is extremely compressed; it's not a very substantial effect. The difference is, what is that, even one 50th of a star or something? But there seems to be a trend, and the function here is actually pretty smooth. This is on the whole data set, not the 50,000 samples but like the 3,000,000 examples, so you have enough to actually get a smooth observation. People just don't like beer as much in August, they like it most at the end of the year, and then the function wraps around nicely.
Julian McAuley 58:41
Yeah, of course, if you do an analysis of your coefficients like this, you have to be pretty careful about how you interpret them. I mean, does this really mean people don't like beer in August? Probably not, right; there could be lots of things going on with that data. One could be that in August people drink different types of beer that are not very good or something, I don't know, certain seasonal beers: it's just what people are consuming that is changing the average rating, rather than the actual season changing the rating. Another could be that, yes, there's a seasonal effect, but there are more users in the Southern Hemisphere during August, and people in the Southern Hemisphere are just inherently negative, so they give really bad ratings or something. You know, there could be all sorts of other features you might include in the model, such as categorical information or geographical information, which would cause this temporal effect to actually disappear, basically. Good. Yeah, that's it for feature engineering. So the next topic is sort of what I'm going to call diagnostics: how do we actually analyze regression algorithms? And I should say, feature engineering is something we'll come back to a lot as we start to look at more complex features based on text or visual data or anything.
Julian McAuley 01:00:02
This is just a quick introduction. We've got real-valued features, we've got transforms like polynomials, we have binary features, and we have categorical features and piecewise functions. That's what we've covered. Was that a question over there?
Julian McAuley 01:00:15
Modeling the data for 2 consecutive years, for one-hot encoding? Well, compared to the linear function: the linear function would have had to fit something shaped like this, which means at the end of the year it has to wrap around back to the start position, right?
Julian McAuley 01:00:32
This function here, which is a piecewise function, still wraps around at the end of the year, but it's flexible enough that it can fit, at the end of the year, a very similar value to the beginning of the year. So this is still periodic; it still wraps around, it just wraps around to the same shape again. So we found that the trend is kind of shaped like that; that's the model we really fit with the piecewise function. I can't hear the question... that's a good point, I should remember to repeat the question for people on Twitch. Yeah, the question was just about how this piecewise function solves the issue of things wrapping around at the end of the year, and they still do wrap around; they just wrap around in a reasonable way now.
Julian McAuley 01:01:20
Should we be finding statistical power? I mean, it's not something I include in this class; it is in the textbook if you really want it. You'd want to look at p-values and so forth to actually measure the statistical significance of a particular coefficient: is, you know, a 0.002-star difference on a 50,000-datapoint data set really significant or not? I don't know, I mean, I have a maybe controversial opinion, which is that in this sort of data mining context that's usually not so interesting. You see that a lot in classical statistics, when you're dealing with these very small data sets where you actually surveyed a few hundred people or something. When you're dealing with these giant data sets with millions of observations, statistical significance is rarely a problem: the magnitude of the effect can be very small, but everything is always significant. So I don't know, I don't waste my time with that stuff. Good, so yeah, the next topic is kind of diagnostics, which is really saying:
Julian McAuley 01:02:22
How do we measure whether a particular set of predictions, in this case from a regression algorithm, is good or not? So let's just get right into it. The first form of evaluation we might consider is what's called the mean squared error. That's actually one of the outputs of the library function we called earlier: it spits out this thing called residuals, which I think is the sum of squared errors or something.
Julian McAuley 01:02:53
What we're actually doing here is saying: for every prediction our model makes, the feature times the parameter, how does that compare to the label? For every data point we're squaring that difference, and then we're taking the average. So this is what's called the mean squared error, the average squared error that our model made: MSE = (1/N) times the sum over i of (y_i - x_i dot theta) squared. Now, the norm notation is just shorthand for the same thing; if you've never seen this type of notation before, that's just saying the squared norm is equal to the sum of squared values.
Julian McAuley 01:03:35
OK, so yeah, this is one way of measuring the quality of the model's predictions: what are the squared errors the model makes, what is the average magnitude of the error, squared? So, you know, why do we normally do this? You may have seen this if you've done regression or anything in stats before; the question is why we choose the mean squared error. This does involve some maths, but it's something I thought was very interesting when I took my first machine learning class, so it's something I'd like to go into in a bit of depth. Why do we pick this? On the surface of it, it might be a bit of a funny choice. You have this model that predicts rating as a function of ABV or something; you made a line of best fit; you have a bunch of observations; and for every observation you have some error:
Julian McAuley 01:04:29
d_i, which is equal to y_i minus x_i dot theta. Those are all of your errors. We'll call that d subscript i; that's the error for the i-th data point. And we're saying the quality of our model is:
Julian McAuley 01:04:47
Proportional to all of these d_i's, squared. Right, why do we use that? Why do we use squared values? Why don't we use absolute values? Why don't we use a sum over i counting the number of times the error is greater than a half? You know, why do we prefer the first one? Why do we prefer this sum of squared errors rather than the sum of absolute errors, or counting the number of times our error is outside of a certain bound?
Julian McAuley 01:05:25
Where did the squared error come from? Someone says it's easy to find the gradient. I don't think so; I mean, this has an OK gradient, it's not too bad, but then why don't we use the error to the power of 4, you know? Why don't we use cubed absolute error or something?
Julian McAuley 01:05:42
Where does this squared error come from? I think it can be a bit weird if you think about it a little bit. This one is kind of the most normal thing, in a way: it's saying if we make a small error, we get a small penalty; if we make a big error, we get a big penalty. Fine. This one is saying: if we make a small error, we get a really small penalty; if we make a big error, we get a huge penalty. Why is that the right thing to do, rather than just having our penalty be directly proportional to our error? OK, so that's something interesting to dig into. What we're really doing is making an assumption about what kinds of errors are common or uncommon. So what we're doing here, by penalizing the error using a square, is saying we have a huge penalty for large errors, or in other words, large errors must be extremely uncommon and small errors must be extremely common. So this is our error size, d_i = y_i - x_i dot theta, and our assumption is that errors maybe follow this bell curve type of... sorry, a very badly drawn bell curve... they follow this sort of bell curve shape, where small errors are very common and large errors are extremely uncommon. Right, yeah, someone's already said it in the chat, but maybe that corresponds to something like a Gaussian distribution; that's what this sort of bell curve shape corresponds to.
Julian McAuley 01:07:20
So this is saying the error d_i is distributed according to a normal distribution with standard deviation sigma or something. It's really saying the label y_i is equal to:
Julian McAuley 01:07:39
The prediction, plus some normally distributed error. Alright, and what you can do is say: what is the probability that you would observe an error of a certain magnitude under some model? So you would say: what is the probability, under the model theta, that you would see the labels y? And that's equal, given the Gaussian, and I've always been very sloppy writing out probabilities, to the equation for this normal distribution, which we can just write out.
Julian McAuley 01:08:21
Again, this is something that is easy to forget with time, and I might get it wrong too, but the constants don't really matter: e to the minus (y_i - x_i dot theta) squared, divided by 2 sigma squared. Now, we would like to say: what is the best model? What is the model that makes the errors we observed most likely? What's the model that is most consistent with the errors, assuming that the errors follow this bell curve type shape? So that's going to be, basically, a max over theta. I should have had a summation in there somewhere; this should really be a product over all observations.
Julian McAuley 01:09:21
We can simplify that a little bit: get rid of all of our constants, because we're just maximizing.
Julian McAuley 01:09:36
And what else can we do? We can take the logarithm of that expression.
Julian McAuley 01:09:48
And that product will become a sum, and the exponentiation will disappear, and we can get rid of the negative and convert it to a min. And we're done, like magic. We said: if we assume that errors come from this normal distribution, this bell curve distribution, then the best possible model, or the most likely model, is the one that minimizes the mean squared error. That's kind of our proof. If you don't remember your probability, or never took a probability class before, that looks a bit strange, but there's not too much going on there. We're saying: if we assume that small errors are common and large errors are extremely uncommon, which corresponds to some particular shape of distribution, then the best possible model will be the one that minimizes the mean squared error. And that's a really nice association that all of these types of models have, between this optimization problem and actually maximizing a probability. If you don't care about probability, you can kind of ignore that and just keep going ahead and minimizing the mean squared error all the time; that's not a bad thing to do, but it is:
Julian McAuley 01:11:09
Interesting to see where that comes from. You know, the mean squared error is a specific choice that corresponds to a specific assumption about what kinds of errors you have, and it's not always a good choice. If you did occasionally have really large errors, if you had big outliers among your errors, the mean squared error would probably not be a good thing; you'd want to choose a different shape of error distribution that allowed for occasional large outliers or something. So you're optimizing for small errors, or you're assuming that small errors are very common and large errors are very uncommon. OK. We don't have too much time left; we've got a few more minutes. So that's sort of our justification for using the mean squared error as opposed to some other error measure.
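Written out cleanly, the argument on the board is the standard maximum-likelihood derivation (a sketch, assuming i.i.d. Gaussian errors):

```latex
% Assume y_i = x_i \cdot \theta + d_i with errors d_i \sim \mathcal{N}(0, \sigma^2).
\begin{align*}
P_\theta(y \mid X)
  &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\Big(-\frac{(y_i - x_i \cdot \theta)^2}{2\sigma^2}\Big) \\
\arg\max_\theta P_\theta(y \mid X)
  &= \arg\max_\theta \sum_{i=1}^{N} -\frac{(y_i - x_i \cdot \theta)^2}{2\sigma^2}
     && \text{(take logs; drop constants)} \\
  &= \arg\min_\theta \frac{1}{N}\sum_{i=1}^{N} (y_i - x_i \cdot \theta)^2
     && \text{(the MSE)}
\end{align*}
```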
Julian McAuley 01:12:03
The next question is, you know, what's a good mean squared error? If we fit some model and find a certain mean squared error, how do we know it's low enough? The answer, kind of, is that we don't: the mean squared error is going to depend on the variance of the data. So, just a little bit more math, and then that'll be all the math for a while, almost done. I'll just show you the relationship between the mean squared error and the mean and the variance, and then basically I think we can call it a day. So the mean, y bar, is just equal to one over N times the sum of all labels. The variance of y is equal to one over N times the sum of (y_i - y bar) squared.
Julian McAuley 01:13:06
If you remember your means and variances, this is already starting to look a bit similar to the MSE, right? The MSE is equal to one over N times the sum of (y_i - x_i dot theta) squared. So the only difference between the variance of the label and the mean squared error is that we are replacing the model's prediction with just the average value. It's basically saying: when you have data that is highly variable, you'll tend to have very large MSEs; when you have data that's very narrowly concentrated, you'll tend to have small MSEs. So if you're looking at things like rating data sets, if you have ratings in the range of one to 5, you'll tend to have pretty small MSEs. If you have another data set, like this one on the website of wine reviews, where ratings are in the range zero to 100, you would think you're going to get pretty big MSEs, because the variance is much higher. Then again, if every single person rates wines between 92 and 94, then actually the variance would be very low and the MSEs would be low. So there's no kind of absolute scale for what a good mean squared error is; it depends on the fundamental variability, or the variance, of the labels you're trying to predict: are they very spread out, or are they very close together? OK, that's just the same thing written out cleanly. People do complain about my handwriting, so I try to have these nice clean copies.
Julian McAuley 01:14:37
Yeah, that brings us to the very last thing, which is to say: one way we can deal with this relationship between the mean squared error and the variance is to come up with this evaluation criterion called the fraction of variance unexplained, where we just divide the mean squared error by the variance, OK? That's going to be a number between zero and one, where zero corresponds to perfect prediction, a mean squared error of zero, and one corresponds to a trivial prediction, so, always predicting the mean. Yeah, I've gone through the last part a little bit quickly, so we can kind of revisit this on Monday. I would still like to take any questions if there are any; I do try and leave time. I know it's a bit scary in a big classroom; if there are any more on Twitch, that's fine too, and if not, that's also OK. Yeah, I should say, regarding the homework: you should have pretty much everything you need to know to do at least the first half of the questions, if you'd like to get started on it, so all the questions about regression. OK, yeah, thanks, I'll see you next week.
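A sketch of the fraction of variance unexplained as just defined:

```python
import numpy as np

def fvu(y, y_pred):
    # Fraction of variance unexplained: MSE divided by Var(y).
    # 0 = perfect prediction; 1 = no better than predicting the mean.
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    return np.mean((y - y_pred) ** 2) / np.var(y)
```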
Julian McAuley 01:15:59
Yeah, there are tons of office hours; there are office hours nearly every day, and they're posted on Piazza. Just a second, just answering questions on Twitch. Yeah, you can hand in anything that's PDF-formatted, so take your Jupyter Notebook and hit print to PDF, basically. I think we have some instructions on that on Piazza too. OK, more complex questions maybe I'll get to next week. OK, good, thanks, folks.