Search This Blog

Everyday Statistics for Programmers: Regression Analysis

Now that we've covered most of the basics of statistics, from averages and standard deviations to confidence and significance, it's time to tackle the most-loved of all statistical tools – linear regression. In its most basic form, regression analysis is simply the practice of fitting a straight line to a set of data that consists of pairs of measurements. The measurements can be any of number of things—voltage and temperature, weight and age, GDP and productivity—as long as it's a pair of measurements that you're trying to figure out are dependent upon one another or not.

The conceptual simplicity of linear regression, and the ease of carrying it out, makes it equally easy to get into trouble by applying it where you shouldn't. We humans tend to think of most things in a linear way. If something is good, more is better. If something is bad, more is worse. If something is moving, we can assume it's going in a straight line to intercept it or avoid it, as the case may be. Linearity is a large part of how we experience our world, so we automatically assume that linearity can explain new experiences.

As a first approximation, this approach can be useful, but it's not always right. When using linear regression it's important to think hard about the data to decide if linear regression really makes sense. Making a scatter plot of the data points first is essential. If the points show some curvature, it may be necessary to fit the data to some curve other than a straight line. If the points don't show any dependency at all, i.e. they're all clumped together in the middle of your graph or scattered around like buckshot, then regression analysis is not going to tell you much.

Another thing to be careful of with regression analysis is making predictions about measurement values that lie outside the range of your data set. The samples that you have may look straight as an arrow, but if you extrapolate too far outside of your data set's range, you run the risk of making fantastical claims that aren't supported by your data. It may only be locally linear, and who knows, if you get far enough away from your measurement range, the results could curve and go in the opposite direction! Don't make claims about values that aren't in your data, or at least guard your statements liberally with disclaimers that they are only conjecture.

With those warnings out of the way, let's turn the discussion to a more concrete example. I happen to have a ton of data on the range of my Nissan Leaf, so we'll use that. One thing that I knew going into owning a fully electric car was that the range was likely to be dependent on temperature, so in my mileage log I kept track of the ambient temperature. A scatter plot of two years worth of data looks like this:

Plot of Leaf's Range Vs. Temperature

There is a definite trend to this data with lower temperatures reducing the range of the car, and higher temperatures increasing the range. This idea of a trend brings up one more note of caution. Make sure that you can legitimately claim the dependency that you're asserting. In this case it is fairly obvious that the temperature could cause changes in the range due to changing the capacity and efficiency of the battery. There are known mechanisms in lithium-ion batteries that would cause this behavior. It is also obvious (I hope) that the change in the car's range is not causing the change in ambient temperature. That would be preposterous. Things are not always this simple, though, and I'll get into that more next week when I cover correlation.

So we have a scatter plot that looks kind of linear, and we want to fit a line to it. How do we do that? Well, you could just dump the data into Excel and go to Tools → Regression Analysis, but we want to actually understand what we're doing, so we're going to look at the equations. From algebra we know that the equation for a line is

y = m*x + b

Where m is the slope of the line and b is the value where the line crosses the y-axis. Both x and y are variables. In the Leaf example, x is the ambient temperature and y is the range of the car. If we can figure out what m and b are, then we can plug any temperature into this equation for x and calculate an estimated range of the Leaf at that temperature. We want to calculate values for m and b that will minimize the distance between each of the data points and the resulting line that goes through them. The line parameters that result are called the least squares estimates of m and b.

The term least squares should give you a clue as to how we're going to figure out the best fit line. That's right, the sum of squared differences proves to be quite useful again. Basically, we want to minimize the sum of squared differences between the y values of the line, given by m*x + b, and the y values of the data points. Deriving the equations for m and b involves using calculus to compute derivatives with respect to m and b, setting them equal to zero to find the minimum, and solving for m and b. I won't show the full derivation here, but the calculations for the slope and intercept look like this when implemented in Ruby:
module Statistics
  def self.dot_product(x, y)
    (0...x.size).inject(0) { |s, i| s + x[i]*y[i] }
  end

  def self.s_xy(x, y)
    dot_product(x, y) - sum(x)*sum(y) / x.size
  end

  def self.s_xx(x)
    dot_product(x, x) - sum(x)**2 / x.size
  end

  def self.linear_reg(x, y)
    m = s_xy(x, y) / s_xx(x)
    b = mean(y) - m*mean(x)
    [m, b]
  end
end
The methods s_xy() and s_xx() that are used to calculate the slope m are fairly standard notation for these calculations in statistics, so that explains the terse naming. Notice that the slope calculation makes some sense because it is loosely taking the form of y/x. Once the slope is calculated, the y-intercept calculation is a straightforward solution of the linear equation using the averages of the x values and y values for the (x,y) point.

Now that we can calculate a best-fit line for a data set, we can see what such a line looks like for the Leaf data. Running the linear regression on the data yields a slope of about 0.3 miles/°F and a y-intercept of about 56 miles. That means at 0°F, we can expect this 2012 Leaf to get about 56 miles of range, and we can plug any temperature into the linear equation to see approximately what range to expect at that temperature. Pretty cool. Remember to be careful about plugging in values that are too far outside the range of the data. The range for temperatures above 100°F or below -10°F could be much different than this trend line predicts. Here's what the trend line looks like on the scatter plot:

Scatter plot of Leaf Range Vs. Temperature with Trend Line


You may have noticed that the data is pretty noisy, which makes it not unlike a lot of real-world data. Temperature is not the only variable that's influencing the Leaf's range. Other factors, like wind speed and direction, traffic conditions, variations in driving style and speed, differences in route, and measurement error can all play a role. There is a way to quantify this variation from the trend line to figure out exactly how much of the variation in the data is explained by changes in temperature, and that value is called the coefficient of determination, or r-squared value of the linear regression.

To calculate the r-squared value, we're going to bring back the old workhorse, the sum of squared errors. This time the error is the difference between the y-value of each data point and the corresponding y-value of the trend line. These differences are called the residuals of the linear regression. The other piece of information we need is the total sum of squares, denoted as SST, which is a similar calculation to the sum of squared errors, but it uses the difference between each data point's y-value and the mean of all the y-values. The implementation of the r-squared calculation in Ruby looks like this:
module Statistics
  def self.sse(x, y, m, b)
    dot_product(y, y) - m*dot_product(x, y) - b*sum(y)
  end

  def self.sst(y)
    dot_product(y, y) - sum(y)**2 / y.size
  end

  def self.r_squared(x, y, m, b)
    1 - sse(x, y, m, b) / sst(y)
  end
end
The actual calculation of a sum of squared errors is conspicuously missing from the code, and that's because the calculations of both the SSE and SST terms can be simplified into the above forms that conveniently use methods that are already defined.

With the r-squared value, we can put a number on the amount of variation in the data that is explained by the trend line. For the Leaf range data that number is 0.477. What does that mean? It means 47.7%, or nearly half of the data can be explained by the equation: range = 0.3*temp + 56. The other half of the variation is due to other sources. In general an r-squared value of 50% is okay, and linear regression is a reasonable way to analyze the data. An r-squared value over 90% means the data is very linear, and an r-squared value less than 10% means the data is not at all linear. Doing some other analysis or finding a different measurement to explain the variation in the variable under investigation would be a good idea with such a low r-squared value.

Regression analysis is a very powerful statistical tool, and it can be extended in many ways to answer even more complicated questions using more advanced statistical techniques. That is a topic best left for another day, though. For now remember that linear regression is only appropriate if your data is indeed linear, and be sure to check the r-squared value to quantify how closely your data tracks a trend line. Next week I'll explore another statistical tool that's closely related to regression, and that is the gnarly topic of correlation.