Vlog: Linear Regression Explained with Henrik Nordmark
In this video, I’m going to cover linear regression. Linear regression is a so-called supervised learning technique, which means that we’re trying to make a prediction about something. Let’s imagine a very simple example: trying to predict house prices. In my example, we’re going to predict house prices in sunny San Diego, California, which is a coastal city. Because everybody wants to live as close as possible to the beach and have a lovely ocean view, the closer you are to the beach, the more expensive the houses will generally be. And those will be the two main variables we have: house price and distance from the beach.
So, we’ll have here on the y axis the cost of the house, let’s say in dollars, since this would be in the US. And then on the x axis, we have our predictor variable. And our predictor variable, as I said, is going to be distance from the beach.
And then we’ll plot some data that we’ve collected as well. We’ll get some data that already exists, or we’ll ask around, or use something like Zillow, and get some data about how much houses cost, as much data as we possibly can. There are always limitations here, but you could really go on a website like Zillow and just scrape some data, or use a variety of other data sources. And we’ll look at what that looks like. So what we can see here is that there’s some variation in the data. Not all houses cost the same, which seems logical. But most importantly, the trend that we spoke about earlier seems to hold true: the higher-costing houses are definitely clustered over here, pretty close to the beach, and there is a downward trend, and the further out you go, the cheaper the houses become. And so what linear regression tries to do is quantify that relationship, quantify that trend, and it will look for a line of best fit to go through all of these points.
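As a rough sketch of what "looking for a line of best fit" can mean in practice, the snippet below fits a straight line to a handful of made-up houses. The numbers are invented for illustration (they are not data from the video), distance is assumed to be in miles, and the fit uses NumPy's `polyfit`, which does a least-squares fit; the video does not say which fitting method it has in mind.

```python
import numpy as np

# Made-up data for illustration only: distance from the beach (assumed miles)
# and sale price (USD) for eight hypothetical houses.
distance = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0])
price = np.array([980_000.0, 900_000.0, 820_000.0, 700_000.0,
                  640_000.0, 520_000.0, 470_000.0, 300_000.0])

# Least-squares straight line; for deg=1, polyfit returns [slope, intercept].
slope, intercept = np.polyfit(distance, price, deg=1)

print(f"slope: {slope:,.0f} dollars per mile")
print(f"intercept: {intercept:,.0f} dollars")
```

With prices that fall as distance grows, the fitted slope comes out negative, which is the downward trend described above.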
And you might ask, well, okay, great, but what does this actually represent, other than a line that seems to run through all these dots? The line actually has a very clear interpretation, which is that if you ignore other factors besides distance, then this red line gives you essentially the expected value, the predicted value, of what a house should cost at different distances from the beach. So if you’re way over here, very close to the beach, then you can see your expected house price will be pretty high; it’ll be somewhere in this region. Then if you’re over here and you look at the house price, it’s much, much lower.
It won’t give you an exact prediction, or an exact result compared to your real data, because, as you can see, these two guys are basically at the same distance from the beach, and yet they have different prices. Now, why is that? Well, we don’t really know. But it could just be that there are other variables at play here, other factors. Maybe one house has a better ocean view, or one house has more bathrooms or bedrooms. And our model doesn’t take that into account; all it knows is that these two houses are roughly at the same distance and seem to have roughly the same price, not the exact same price but roughly the same price. And so if you want to make a prediction about some new house, also in this area, you would predict it to be essentially here. So this is a predicted value, not a real value.
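To make the "predicted value, not a real value" point concrete, here is a small sketch. The intercept and slope are hypothetical numbers picked for illustration, as are the two houses and the choice of miles as the unit; the function name `predict_price` is also just an illustrative name.

```python
# Hypothetical coefficients, chosen only for illustration.
INTERCEPT = 1_000_000.0  # predicted price right at the beach (distance 0)
SLOPE = -90_000.0        # change in predicted price per extra mile (assumed unit)


def predict_price(distance_miles: float) -> float:
    """Read the predicted price off the straight line at a given distance."""
    return INTERCEPT + SLOPE * distance_miles


# Two made-up houses at the same distance but with different sale prices.
houses = [(2.0, 850_000.0), (2.0, 790_000.0)]
for distance_miles, actual in houses:
    predicted = predict_price(distance_miles)
    print(f"actual {actual:,.0f}  predicted {predicted:,.0f}  "
          f"difference {actual - predicted:+,.0f}")
```

Both houses get the same predicted price because distance is the only thing the line knows about; whatever else separates them shows up as the difference between actual and predicted.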
So that’s this red cross. If you want to be a little bit more formal mathematically, we would write this up as an equation. We would say, okay, we have a house price, and this house price is going to be equal to some constant beta zero, plus beta one times x one, plus epsilon (price = β0 + β1·x1 + ε), and I’ll explain what all these things mean.
So, beta zero is what is known as the y-intercept. It tells you essentially what the house price would be here, right at the zero mark, so if you built a house right on the water. There may not be any houses right on the water; again, this is just sort of theoretical. But it basically tells you the position where this theoretical red line crosses the y axis.
And then this beta one x one is the distance term. So x one tells you how far away I am from the beach, and beta one tells you how much to subtract from beta zero, for each unit of distance, to get the right house price. So let’s say this is a million dollars or something. Then as you go further and further out, you’re subtracting more and more from that house price until you’re way down here, and let’s say the house price becomes 250,000. Beta one essentially represents how steep the slope is: if we had seen data that looked much steeper than this, the beta one coefficient would have been a much larger number in absolute value.
And what is this funny term here? This is called the random error term. It’s denoted by the symbol epsilon. What this is saying is that, for the most part, price follows this relationship with regard to distance, so it mostly falls on this beautiful red line. But it doesn’t perfectly fit the red line, right? There is some noise, some randomness, and this epsilon term takes account of that randomness. So if you look at this distance here, from the line to that true value there, that’s a bit of noise; or from here to this guy, that’s a bit of noise; or the arrow could go the other way around, from this predicted value to this real value here.
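The three pieces of that equation can be sketched in a few lines. This assumes the line is fitted by ordinary least squares, which is the usual choice but is not named in the video, and it uses invented data with distance in assumed miles. The variable names `beta0`, `beta1` and `residuals` simply mirror the spoken terms.

```python
import numpy as np

# Made-up data for illustration only (distance in assumed miles, price in USD).
distance = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0])
price = np.array([980_000.0, 900_000.0, 820_000.0, 700_000.0,
                  640_000.0, 520_000.0, 470_000.0, 300_000.0])

# Ordinary least squares for one predictor, written out by hand:
# beta1 is the slope, beta0 is where the line crosses the y axis.
x_mean, y_mean = distance.mean(), price.mean()
beta1 = np.sum((distance - x_mean) * (price - y_mean)) / np.sum((distance - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Points on the line, and the leftover "noise" for each house
# (real price minus predicted price), which plays the role of epsilon.
predicted = beta0 + beta1 * distance
residuals = price - predicted

print(f"beta0 (y-intercept): {beta0:,.0f}")
print(f"beta1 (slope): {beta1:,.0f}")
print("residuals:", np.round(residuals))
```

Some residuals are positive and some negative, matching the idea that the arrow can point either way between the line and a real value; with a least-squares fit that includes an intercept, they sum to zero up to rounding.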
And all of these are basically described by this epsilon term, which we hope to be normally distributed, to average out to zero, and to have a whole bunch of other properties. But the main point is that linear regression gets you this nice trend line and allows you to make numerical predictions about a whole variety of things. It doesn’t have to be house prices; it can be population growth, it can be sales, it can be any number of things. And you don’t need to use just one variable. If you have multiple variables, you can incorporate them. So in our example here, instead of just using distance to the beach, we could say, okay, we want distance to the beach and number of bedrooms, or distance to the beach, bedrooms, and bathrooms. The more relevant variables you add, the more likely you are to get accurate predictions, which means that your random error term will become smaller. It might not disappear completely. In fact, it’s highly unlikely that it will disappear completely. But it should get smaller as you get more and more information to make good predictions.
Alright, that’s all for this video. I hope you’ve enjoyed it. And let me know if you have any questions. Bye.
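Adding a second variable can be sketched as follows. The data is invented for illustration, and the helper name `fit_rss` is hypothetical. The fit uses NumPy's `np.linalg.lstsq` for least squares, which is one common way to fit several variables at once, not a method named in the video.

```python
import numpy as np

# Made-up data for illustration only.
distance = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0])   # assumed miles
bedrooms = np.array([2.0, 4.0, 3.0, 2.0, 4.0, 3.0, 5.0, 3.0])
price = np.array([1_010_000.0, 1_060_000.0, 930_000.0, 800_000.0,
                  810_000.0, 660_000.0, 670_000.0, 390_000.0])


def fit_rss(features, target):
    """Least-squares fit with an intercept.

    Returns the coefficients (intercept first) and the residual sum of squares.
    """
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return coef, float(np.sum(residuals ** 2))


coef_one, rss_one = fit_rss([distance], price)
coef_two, rss_two = fit_rss([distance, bedrooms], price)

print(f"distance only:       residual sum of squares = {rss_one:,.0f}")
print(f"distance + bedrooms: residual sum of squares = {rss_two:,.0f}")
```

On the data used for fitting, the leftover error cannot grow when a variable is added. Whether predictions for new houses actually improve depends on whether the extra variable is informative.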