Lucid Mesh: October 2014

Math Book Face Off: Everyday Calculus Vs. How Not To Be Wrong

I'm deviating slightly from my normal Tech Book Face Off. I think it's kind of fun, so expect more deviations like this in the future. It's been a long time since I've done some serious math studying, and I wanted to get started again by dipping my toes in the shallow end of the field. I also wanted to find a good, entertaining popular math book that deals with real-world applications of mathematics and the development of mathematical thinking. My books of choice for this venture were Everyday Calculus by Oscar E. Fernandez and How Not To Be Wrong by Jordan Ellenberg, both of which came out earlier this year. Let's see how they tackle the problem of speaking math to the masses.

VS.

Everyday Calculus

I was really expecting to enjoy this book. Unlike most high school kids, I enjoyed learning calculus. It just made sense to me. After all, derivatives are essentially subtraction taken to an extreme, and integrals are addition taken to an extreme. Things get complicated when you actually have to derive equations using limits or when you get into proving theorems in calculus, but that's true of math in general. The basic ideas of calculus are fairly straightforward.

Fernandez did a decent job of bringing that across, but something about the book didn't sit well with me. I thought his treatment of concepts was too superficial, and he never went beyond the trivial exploration of applications of calculus to everyday life. The general format of the book was to walk through a day in the life of Fernandez from waking up to going to bed and examine things like the sleep cycle, coffee, and television from the perspective of calculus. The concept was okay, but I just couldn't get into it. I wanted to see some more substantial analysis than what was there, and I thought that should have been possible while still keeping the book accessible to the wider audience that it was intended for.

As an example, in chapter 3 he shows how the derivative can be used when calculating the efficiency of driving a car with different rates of acceleration, but he assumes that the efficiency of the car itself is constant when calculating fuel consumption for a given trip. It would have been more interesting, and more realistic, to model the efficiency of the car as dependent on acceleration and integrate over the resulting curve to get a more accurate fuel consumption for the trip. Then he could have shown how much of an impact the acceleration rate had on fuel economy. Granted, he didn't cover integration until later in the book, but he could have revisited the example then. I would have loved to see a more in-depth analysis of that.

Some parts of the book did pique my interest. I thought the equation for sustainability analysis was pretty cool, and I enjoyed contemplating the questions that came to mind when Fernandez was describing how integration was developed. It took 2000 years to fully develop the concepts and structure of calculus, which makes it seem like it was very difficult to do. Indeed, the use of limits and infinitesimals was rejected as a valid method of calculation for a long time, and Isaac Newton had to overcome a lot of resistance to his ideas. Now calculus is routinely taught to millions of students around the world, and it's a basic requirement for many fields of study.

It's amazing to me that calculus has gone from a field that took two millennia to develop to the point that even a few people could understand and use it, to something that a significant portion of the population is expected to know. Are we getting smarter as a species, or is it more a matter of us collectively standing on the shoulders of giants? Is it inherently easier to learn something once it is already known? That's likely, considering that once an idea is discovered or developed, it can be shaped and refined until it is much more easily accessible to more people. It's also interesting to think that we're getting smarter, not that we're all smarter than Newton, but on average we might be smarter than the average person a thousand years ago. It would be hard to determine if such a trend had to do with anything more than better hygiene, nutrition, and education. It's still fun to think about, though.

In any case, Everyday Calculus had a few interesting nuggets, but overall I didn't really enjoy it. However, it was a quick read, and it did make me want to get out my old calculus and numerical methods textbooks, so it wasn't a total loss.

How Not To Be Wrong

Despite its pretentious title, this book was full of awesome. With chapter titles like "Everyone Is Obese" and "Dead Fish Don't Read Minds," I knew I was in for an entertaining read, and I was not disappointed. Ellenberg does a great job covering a number of real-world issues from a mathematical perspective, and clearly explains the logic and reasoning behind many of the mathematical methods that are used in analysis.

The book had a definite focus on probability and statistics, which makes a lot of sense when talking about how to make money off of a poorly designed lottery or how the link between smoking and lung cancer became undeniable as the evidence mounted against cigarettes. Ellenberg periodically brought other fields of mathematics into the discussion, such as linear algebra and non-Euclidean geometry, but he always circled back to statistics. It's fitting since statistics is probably the most applicable field of mathematics to our everyday life, and it's something everyone would benefit from knowing more about. So many of our personal experiences that guide our intuition can lead us to the wrong conclusions when we try to extrapolate them into broader contexts, and statistics provides the tools to correct our thinking.

For example, Ellenberg goes into a discussion on Big Data and how when you're filtering for some particular type of person that's a very small percentage of the population (his example is terrorists, but it could be any small group), it doesn't matter how accurate your filter is. Even if it's 99% accurate in filtering out people not in the group you want to detect, you're going to end up with a lot of false positives because 1% of hundreds of millions of people is still millions of people. Then he lays out the argument for the right to privacy:

You might well think that Facebook would never cook up a list of potential terrorists (or tax cheats, or pedophiles) or make the list public if they did. Why would they? Where's the money in it? Maybe that's right. But the NSA collects data on people in America, too, whether they're on Facebook or not. Unless you think they're recording the metadata of all our phone calls just so they can give cell phone companies good advice about where to build more signal towers, there's something like the red list going on. Big Data isn't magic, and it doesn't tell the feds who's a terrorist and who's not. But it doesn't have to be magic to generate long lists of people who are in some ways red-flagged, elevated-risk, "people of interest." Most of the people on those lists will have nothing to do with terrorism. How confident are you that you're not one of them?

It doesn't matter if innocent people should have nothing to hide, which is one of the arguments made by the people who are trying to create these lists and provide Security For All. Innocent people should not live in fear that their government will wrongly implicate them in criminal activity, especially without their knowledge via these secret lists. It's not really improved security when a large number of citizens are in danger of wrongful incrimination by their government.
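The false-positive arithmetic in the Big Data example above can be sketched in a few lines of Ruby. The population size, the number of real targets, and the assumption that every real target gets flagged are made-up numbers for illustration, not figures from the book; only the 99% accuracy figure comes from the discussion above.

```ruby
# All of these inputs are hypothetical, chosen only to show the arithmetic.
population          = 300_000_000 # "hundreds of millions of people"
actual_targets      = 10_000      # assumed size of the small group
false_positive_rate = 0.01        # filter is 99% accurate on people NOT in the group
true_positive_rate  = 1.0         # generous assumption: every real target is flagged

innocents       = population - actual_targets
false_positives = (innocents * false_positive_rate).round
true_positives  = (actual_targets * true_positive_rate).round
flagged         = false_positives + true_positives

# Share of the flagged list that is actually in the target group
share_real = true_positives.to_f / flagged

puts "Flagged: #{flagged}, of which innocent: #{false_positives}"
puts "Share of flagged people who are real targets: #{(share_real * 100).round(2)}%"
```

Even with a perfect detection rate, the list in this sketch is dominated by people who have nothing to do with the target group, simply because the group is so small compared to the population.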

Ellenberg also talks eloquently about education in mathematics, both from his perspective as a professor teaching students the importance of practicing calculation and learning how to think mathematically, and his experience as a student learning about hard work:

The cult of the genius also tends to undervalue hard work. When I was starting out, I thought "hardworking" was a kind of veiled insult—something to say about a student when you can't honestly say they're smart. But the ability to work hard—to keep one's whole attention and energy focused on a problem, systematically turning it over and over and pushing at everything that looks like a crack, despite the lack of outward signs of progress—is not a skill everybody has. Psychologists nowadays call it "grit," and it's impossible to do math without it. It's easy to lose sight of the importance of work, because mathematical inspiration, when it finally does come, can feel effortless and instant. I remember the first theorem I ever proved; I was in college, working on my senior thesis, and I was completely stuck. One night I was at an editorial meeting of the campus literary magazine, drinking red wine and participating fitfully in the discussion of a somewhat boring short story, when all at once something turned over in my mind and I understood how to get past the block. No details, but it didn't matter; there was no doubt in my mind that the thing was done.

I can totally relate to this type of problem solving. I experience it all the time when I have a hard design problem I'm working on, and it's constantly turning over in the back of my mind. I call it bedtime debugging.

This passage also made me think about another aspect of studying that I pretty much avoided in high school and college. As I get older, I become more interested in the history of math and science, the lives of the great thinkers that discovered and developed the ideas we use today, and the difficulties involved in the process of discovery. This kind of knowledge gives context to the theorems and laws and ideas of math and science. Understanding where these ideas came from can give you a greater appreciation for the humanity involved in our development of knowledge, that it didn't all come about in the perfect form that it's presented in classrooms and textbooks. There was intense struggle, debate, and uncertainty behind it all, so you can take comfort in the fact that your own struggles are normal. The pursuit of knowledge has always been a battle.

Clearly, this book has made me think about some deep topics, and there were many more instances like these of discussions in the book sending me off thinking on wild tangents. I really enjoy books that do that, and Ellenberg was especially good at it. He's also an exceptional writer. He has a way of starting off a topic, developing it for a while, and then summarily dropping it on the floor to describe some other thing. At first you wonder what just happened, but you keep on reading because this new topic is also well written and interesting. Eventually, he'll pick the first topic up off the floor and tie it in with his current explanation, and everything becomes clear. Every once in a while he'll also drop little quips about old topics into the current discussion to keep you on your toes. It was completely engaging, and I burst out laughing a number of times while reading particularly witty parts. My wife probably thought I was crazy, laughing at a book about math like that. It truly was a treat to read.

Surrounded By Math

We are surrounded by math in our daily lives. It permeates everything we experience, and it can explain a lot about why things happen, if we're only willing to pay attention and understand. I went looking for a book that would capture the pervasiveness of mathematics in the real world, and I definitely found what I was looking for. While Everyday Calculus fell flat, and I couldn't get into it, How Not To Be Wrong delivered the goods. It has everything I want in a good popular math book. It's well written, engaging, and above all, it made me think hard on a number of topics. That always makes for a satisfying read. If you're in the mood to see how mathematics, and statistics in particular, shape our world and our understanding of it, go read How Not To Be Wrong. I can't recommend it enough.

Everyday Statistics for Programmers: Nonlinear Regression

Last week I talked about how to figure out if two variables in your data set are correlated, and the week before I talked about fitting a trend line to your data. What happens if you know your data is correlated, but the relationship doesn't look linear? Is there anything that can be done to estimate the trend in the data if it's not linear? Yes, depending on what the data looks like, we can either transform the data to make it linear, or do a polynomial regression on the data to fit a polynomial equation to it.

We'll take a closer look at data transformations, and then briefly cover polynomial regression. The idea with data transformations is to somehow make your data linear. There are a few data trends that this will work for, and luckily, these types of trends cover most of the nonlinear data that you're ever likely to see. These trends fall into the categories of exponential, power, logarithmic, and reciprocal relationships.

The exponential relationship is probably the most common of these, so let's go through an example of how to transform a set of data that exhibits an exponential trend. Say you've started a new website, and you're measuring the number of active users on your site each week. Your usage data might look something like this:

Graph of exponential data with linear trendline

The linear trend line is there to show that the data is not really linear. It's kind of linear, but in the middle of the scatter plot the data bows below the line, and at the end of the plot it looks to be taking off above the line. If we used the linear trend line to predict what might happen in the near future, it would likely underestimate the site's performance. To get a better estimate, we can still use linear regression, but first we want to transform the data to make it more linear. In this case that means taking the natural logarithm of the y-values of the data, which produces the following scatter plot of the transformed data:

Graph of transformed exponential data with linear trendline

The transformed data definitely looks more linear, as the trend line running right through the scatter plot shows. You can do a linear regression on it in exactly the same way as any other linear regression, ending up with the m (slope) and b (offset) coefficients. However, we want the coefficients for the original exponential relationship, which is of the form

y = α*exp(ß*x)

Since we took the logarithm of y, the form of the linear equation that we found the coefficients for is

ln(y) = ln(α) + ß*x

This looks just like the linear equation y = m*x + b, so we can conclude that α = exp(b) and ß = m. Now that we have the coefficients of the exponential trend line, we can plot it against the original data like so:

Graph of exponential data with exponential trendline

Pretty slick. The exponential trend line fits the data much better, and we can merrily be on our way, extrapolating future site growth (but not too far in the future, of course). One thing to be careful of, though. Notice how the data points get more spread out later in time? That's characteristic of many exponential trends, but not all. If the data points look evenly spread out over the entire range of the data (in statistical terms this would be stated as constant variance over the range), then transforming the data will cause the smaller values to be more heavily weighted in the linear regression. A more appropriate analysis to do in this case, if you need the additional accuracy, would be a true nonlinear regression using a nonlinear solver.
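The log-transform procedure for an exponential trend can be sketched as follows. The module name ExpFitSketch, the helper linear_fit(), and the synthetic user counts are assumptions made for this example; the helper is a plain least squares line fit and may not match the regression code from the earlier post line for line.

```ruby
module ExpFitSketch
  # Ordinary least squares fit of y = m*x + b; returns [m, b].
  def self.linear_fit(x, y)
    n = x.size.to_f
    x_mean = x.sum / n
    y_mean = y.sum / n
    s_xy = x.zip(y).sum { |xi, yi| (xi - x_mean) * (yi - y_mean) }
    s_xx = x.sum { |xi| (xi - x_mean)**2 }
    m = s_xy / s_xx
    [m, y_mean - m * x_mean]
  end

  # Fit y = alpha*exp(beta*x) by regressing ln(y) on x.
  # Requires every y-value to be positive. Returns [alpha, beta].
  def self.exp_fit(x, y)
    m, b = linear_fit(x, y.map { |yi| Math.log(yi) })
    [Math.exp(b), m]
  end
end

# Synthetic, noise-free data generated from y = 50*exp(0.3*x)
weeks = (1..10).to_a
users = weeks.map { |w| 50.0 * Math.exp(0.3 * w) }

alpha, beta = ExpFitSketch.exp_fit(weeks, users)
puts "alpha = #{alpha.round(3)}, beta = #{beta.round(3)}"
```

Because the synthetic data has no noise, the fit should recover the generating values of roughly 50 and 0.3. Real data will not be this clean, and the weighting caveat above still applies.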

A similar procedure to the logarithmic transform for exponential trends works for a power relationship of the form

y = α*x^ß

The difference is that the logarithm of the x-values must also be calculated to give the data a linear form of

ln(y) = ln(α) + ß*ln(x)

The original coefficients are found the same way as they were for the exponential relationship, with α = exp(b) and ß = m. That wasn't too difficult; just a slight tweak was needed to the data transformation.
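Here is a minimal sketch of the power-law version, where both the x-values and the y-values are log-transformed before the line fit. The module name PowerFitSketch, the linear_fit() helper, and the sample data are assumptions for illustration only.

```ruby
module PowerFitSketch
  # Ordinary least squares fit of y = m*x + b; returns [m, b].
  def self.linear_fit(x, y)
    n = x.size.to_f
    x_mean = x.sum / n
    y_mean = y.sum / n
    s_xy = x.zip(y).sum { |xi, yi| (xi - x_mean) * (yi - y_mean) }
    s_xx = x.sum { |xi| (xi - x_mean)**2 }
    m = s_xy / s_xx
    [m, y_mean - m * x_mean]
  end

  # Fit y = alpha*x**beta by regressing ln(y) on ln(x).
  # Requires every x-value and y-value to be positive. Returns [alpha, beta].
  def self.power_fit(x, y)
    ln_x = x.map { |xi| Math.log(xi) }
    ln_y = y.map { |yi| Math.log(yi) }
    m, b = linear_fit(ln_x, ln_y)
    [Math.exp(b), m]
  end
end

# Synthetic, noise-free data generated from y = 2.5*x**1.7
xs = (1..10).to_a
ys = xs.map { |xi| 2.5 * xi**1.7 }

alpha, beta = PowerFitSketch.power_fit(xs, ys)
puts "alpha = #{alpha.round(3)}, beta = #{beta.round(3)}"
```

Note that the logarithm restricts this approach to strictly positive data on both axes.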

The last two forms of logarithmic and reciprocal relationships are easier to calculate because only the x-values need to be transformed with a ln(x) and a 1/x operation, respectively. The linear coefficients are the same between the transformed and original data, so once the linear regression is done on the transformed data, the work is pretty much done.
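The logarithmic and reciprocal cases can be sketched the same way. I'm assuming the forms y = b + m*ln(x) and y = b + m/x here, which is my reading of the description above; the module name, helper, and sample data are likewise made up for the example.

```ruby
module XTransformSketch
  # Ordinary least squares fit of y = m*x + b; returns [m, b].
  def self.linear_fit(x, y)
    n = x.size.to_f
    x_mean = x.sum / n
    y_mean = y.sum / n
    s_xy = x.zip(y).sum { |xi, yi| (xi - x_mean) * (yi - y_mean) }
    s_xx = x.sum { |xi| (xi - x_mean)**2 }
    m = s_xy / s_xx
    [m, y_mean - m * x_mean]
  end

  # Fit y = b + m*ln(x); x-values must be positive.
  def self.log_fit(x, y)
    linear_fit(x.map { |xi| Math.log(xi) }, y)
  end

  # Fit y = b + m/x; x-values must be nonzero.
  def self.reciprocal_fit(x, y)
    linear_fit(x.map { |xi| 1.0 / xi }, y)
  end
end

xs = (1..10).to_a

# Synthetic, noise-free data from y = 3 + 2*ln(x) and y = 1 + 4/x
log_ys = xs.map { |xi| 3.0 + 2.0 * Math.log(xi) }
rec_ys = xs.map { |xi| 1.0 + 4.0 / xi }

m_log, b_log = XTransformSketch.log_fit(xs, log_ys)
m_rec, b_rec = XTransformSketch.reciprocal_fit(xs, rec_ys)
puts "log fit:        m = #{m_log.round(3)}, b = #{b_log.round(3)}"
puts "reciprocal fit: m = #{m_rec.round(3)}, b = #{b_rec.round(3)}"
```

Since only the x-values are transformed, m and b can be used directly in the original equation with no exp() step.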

That's all well and good as far as nonlinear regression goes, but what happens if your data looks like this:

This data doesn't fit any of the relationships discussed so far. It's an example of a quadratic relationship that also shows up fairly often in different types of data, especially economic data. When you see data like this, you can assume that there is some kind of tradeoff involved, and you probably want to optimize for the area on the scatter plot where the data peaks. To figure out exactly where that peak is, you need to find an equation for the trend line, and to do that, you need polynomial regression.

Before we get into polynomial regression, also note that data with a quadratic relationship doesn't have to look like the scatter plot above. It can also be flipped and shifted so that it looks more like an exponential relationship. The data would have a growth characteristic in this case, instead of a trade-off characteristic. It's often hard to tell the difference between quadratic and exponential growth, especially if the data doesn't span enough time. A trend line might show a misfit if the data increases much faster than a quadratic function. In that case, the exponential fit is definitely the way to go. Otherwise, you'll have to resort to your knowledge of the domain and use your judgement as to which relationship better describes the data.

Cubic relationships can also show up in data from time to time, but they are much rarer. Data that seems to need a polynomial of higher than third order is more of a warning sign than anything. If you're trying to fit a tenth-order polynomial to your data, you may want to rethink your analysis or question your data. You may be moving out of the realm of statistics and into the realm of signal processing.

Getting back to the problem at hand, how do we find the best fit quadratic curve for the data set shown above? Basically, you can construct a linear equation with matrices and solve for the coefficients. That actually makes polynomial regression a form of linear regression, but it's done on nonlinear data. The math is more involved than we'll get into here, but there is a nice implementation of the algorithm in Ruby that I found on a great programming algorithm website, called rosettacode.org. Here's the code adapted to the Statistics module I've been building:

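require 'matrix'

module Statistics
  # Least squares polynomial fit. Returns the coefficients in ascending
  # order of power: [c0, c1, ..., c_degree].
  def self.polyfit(x, y, degree)
    # Each row holds the powers of one x-value: [1, xi, xi**2, ...]
    x_data = x.map { |xi| (0..degree).map { |pow| (xi**pow).to_f } }

    mx = Matrix[*x_data]
    my = Matrix.column_vector(y)

    # Solve for the coefficients: (X^T * X)^-1 * X^T * y
    ((mx.t * mx).inv * mx.t * my).t.to_a[0]
  end
end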
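For reference, the calculation in the last line of polyfit() is the standard closed-form least squares solution, often called the normal equations. The symbols below (X, y, c, n data points, degree d) are my own labels rather than anything from the Rosetta Code listing:

```latex
X =
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^d \\
1 & x_2 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^d
\end{bmatrix},
\qquad
y =
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},
\qquad
\hat{c} = (X^T X)^{-1} X^T y
```

Each entry of the coefficient vector multiplies one power of x, starting with the constant term, which is why the fit is still linear in the coefficients even though the curve is not a straight line.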

The first three lines of the polyfit() method build up the X and Y matrices, and then the last line of the method does the actual calculation to produce the coefficients. If you call this method with the (x,y) data from the scatter plot above and set the degree to 2, you'll get three coefficients back, ordered from the lowest power to the highest (c, b, a), for the equation:

y = a*x^2 + b*x + c

If you then plot this quadratic equation as the trend line for the scatter plot, you get a nice fit of the data:

Scatter plot of quadratic data with trend line
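As a usage sketch, here is polyfit() applied to a small synthetic data set, followed by the peak calculation mentioned earlier. The module name PolyfitSketch and the data are made up for the example, and the peak formula x = -b/(2*a) is the standard vertex of a parabola rather than something taken from the Rosetta Code listing.

```ruby
require 'matrix'

module PolyfitSketch
  # Same calculation as Statistics.polyfit above; coefficients come back
  # in ascending order of power.
  def self.polyfit(x, y, degree)
    x_data = x.map { |xi| (0..degree).map { |pow| (xi**pow).to_f } }
    mx = Matrix[*x_data]
    my = Matrix.column_vector(y)
    ((mx.t * mx).inv * mx.t * my).t.to_a[0]
  end
end

# Synthetic, noise-free trade-off data from y = -2*x^2 + 20*x + 5
xs = (0..10).to_a
ys = xs.map { |xi| -2 * xi**2 + 20 * xi + 5 }

c, b, a = PolyfitSketch.polyfit(xs, ys, 2)

# The vertex of y = a*x^2 + b*x + c is at x = -b / (2*a)
peak_x = -b / (2 * a)
peak_y = a * peak_x**2 + b * peak_x + c

puts "a = #{a.round(3)}, b = #{b.round(3)}, c = #{c.round(3)}"
puts "peak near x = #{peak_x.round(3)}, y = #{peak_y.round(3)}"
```

With this data the peak should land at about x = 5. If the fitted a comes out positive, the parabola opens upward and the same formula gives the minimum instead.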

That wraps up the everyday ways of doing regression on nonlinear data. Doing nonlinear regression with transformations on the data or using a polynomial regression should cover the vast majority of nonlinear data that you would come across, and the analysis is fairly straightforward once you decide on which type of curve to fit.

I hope you've enjoyed this miniseries on statistics, and that it proves useful for your everyday data analysis. Happy estimating!

Everyday Statistics for Programmers: Correlation

Last week I went through the basics of linear regression, and I touched on correlation but didn't get into it. That's because correlation may seem fairly simple, but the simplicity of the calculation disguises a vast field of complexity and logical traps that even professional researchers, statisticians, and economists routinely fall into. I'm not claiming to know more than these people that live and breathe correlation every day, but the issues are well known. Knowing the issues involved can raise our general awareness of when certain conclusions are warranted and when they're going too far.

I Think, Therefore I Correlate

One of the reasons why correlation is so deceptive is that we see correlations all the time. A correlation is basically a relationship between two things. If two separate things grow or increase or change at the same time, they are correlated. If one thing increases and another thing decreases at the same time, they are also correlated.

Our pattern-matching brains are wired to detect this behavior. It's how we've survived and developed throughout our history. When our distant ancestors would eat something and then get sick, they recognized that they shouldn't eat that plant or animal anymore. When they would eat or drink certain things when they were sick, or they would apply different plants to wounds, and then get better, they recognized that as well and passed on that information as early medicine. When they started planting seeds and developed early agriculture, they paid attention to what practices would result in better crop yields. All of these things are correlations, and more often than not, seeing those correlations could mean the difference between life and death. It's ingrained in us.

Even though seeing correlations is a good thing, it's also very easy to develop false correlations. Our ancestors used to believe comets were bad omens. The coming of a comet meant that something terrible was about to happen. Many ancient cultures believed (and some still do) that various traditions, like dancing, could bring the rains so their crops would grow. There are innumerable examples of other false correlations different cultures have believed at different times, and plenty survive to this day. We even collect them as old wives' tales and pass them on half-jokingly to future generations.

Have you ever heard the joke about the guy who was walking down the street in a big city, flapping his arms wildly up and down? Another man walks up to him and asks why he's doing that. The first guy, still flapping his arms, says, "To keep the alligators away." The second guy says, "But there aren't any alligators here." The first guy responds, "See, it's working!" That's false correlation.

These false correlations come about when two things happen at the same time, and someone attaches too much significance to the fact that those two events just happened to coincide. The person witnessing the events doesn't realize that there could be some other explanation for what they experienced, and they go on to convince enough other people that these two things are related that it becomes common knowledge.

Correlate Wisely

When we're looking at data and calculating correlations, or even reading studies done by others, we need to be careful to not be fooled. It's harder than it sounds. Correlations are easy to calculate, and many things can seem correlated when they're not, either by coincidence or because something else that you're not paying attention to is systematically causing the correlation that you're observing. A false correlation can be all the more difficult to recognize when internal biases or external influences make you want to believe that it's real. On top of that, your brain is wired to see the pattern even if it's not the right explanation, and you have to fight against this instinct to get to the truth.

You should first decide whether the correlation makes any sense at all. Is there a plausible mechanism that could explain the connection? Are the two things related in any way that would make the correlation reasonable? Is there anything that would directly contradict the correlation that you're seeing? Is it extraordinary that these two things are correlated? Remember, extraordinary claims require extraordinary evidence.

If the correlation is plausible, then you might start thinking about which of the variables in the correlation is dependent on the other. While you can do this, you must be careful because this line of thought leads to another big way that correlations are used incorrectly. If you've read enough studies that use statistics, sooner or later you'll hear the phrase, "correlation does not imply causation." There are other analysis tools that can get closer to proving the claim that A causes B, but correlation is not one of them. All correlation says is that A and B are related in some way, nothing more. To make stronger claims about dependence, more work will be involved.

It may seem obvious that one thing causes the other in the system you're studying, but it may still be possible for causation to run in the other direction, or an entirely different cause may explain why both of the variables being measured are moving together. One recent, high-profile example of this issue occurred after the financial crisis of 2008. A couple of economists, Carmen Reinhart and Kenneth Rogoff, set out to measure the relationship between government debt and GDP growth. They found a negative correlation, meaning that as government debt went up, a country's GDP growth tended to go down, and they concluded that government debt was causing the slowdown in GDP growth.

It turned out that there were some errors in their data and calculations, so the correlation was actually much weaker than they thought, but even before that there was a big debate about which way causation actually ran. It's fairly easy to argue that for many of the countries in the study, they suffered lower GDP growth or even negative growth that drove up their government debts, instead of the other way around. In reality causation ran in both directions, and it was highly context dependent by country. Some countries started with high government debt before going into recession, and others went into recessions that ballooned their debt. Correlation couldn't show any of these details.

Correlation In Practice

The ease with which correlation analysis can go wrong doesn't mean correlation isn't useful. It is very useful as a starting point for further analysis. Correlations can show if a relationship actually exists so you can investigate it further, or if you're barking up the wrong tree and need to look elsewhere. So how do we calculate a correlation, given a set of data with X-values and Y-values for two different variables that we want to compare? Here it is, implemented in Ruby once again:

module Statistics
  def self.r(x, y)
    s_xy(x, y) / (Math.sqrt(s_xx(x))*Math.sqrt(s_xx(y)))
  end
end

I defined the methods s_xy() and s_xx() last week for calculating the r-squared value of a regression analysis, so at this point we're getting pretty far along in statistical equations, using earlier calculations to construct new equations. In fact, this correlation coefficient is closely related to the r-squared value. All you have to do is square it and that's what you get, which is why it's referred to as the r value in the code above.

Both the correlation coefficient and the r-squared value give an indication of how strongly linear a data set is, but the correlation coefficient has one property that the r-squared value doesn't have. It can be negative, and a negative correlation coefficient happens when one variable increases while the other decreases, i.e. the scatter plot shows a negative-trending slope. So if the correlation is +1 or -1, the data is perfectly linear with a positive or negative slope, respectively.

If the correlation is close to zero, then the data is either uncorrelated or it might have a nonlinear relationship; you have to check the scatter plot to decide. The closer the correlation is to +/-1, the more linear the relationship is, but to get a better idea of that linearity, you should square the correlation coefficient. This operation gives you the r-squared value, which is the fraction of the variation in the y-values that is explained by the linear fit.
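To make those three cases concrete, here is a self-contained sketch. The definitions of s_xy() and s_xx() below are the usual sums of products and squares of deviations from the mean; I'm assuming they match the methods from last week's post, and the module name CorrelationSketch and the sample data are made up for the example.

```ruby
module CorrelationSketch
  def self.mean(values)
    values.sum.to_f / values.size
  end

  # Sum of products of deviations from the means (assumed definition)
  def self.s_xy(x, y)
    x_mean = mean(x)
    y_mean = mean(y)
    x.zip(y).sum { |xi, yi| (xi - x_mean) * (yi - y_mean) }
  end

  # Sum of squared deviations from the mean (assumed definition)
  def self.s_xx(x)
    s_xy(x, x)
  end

  # Correlation coefficient, same formula as Statistics.r above
  def self.r(x, y)
    s_xy(x, y) / (Math.sqrt(s_xx(x)) * Math.sqrt(s_xx(y)))
  end
end

xs       = [-3, -2, -1, 0, 1, 2, 3]
rising   = xs.map { |xi| 2 * xi + 1 }     # perfectly linear, positive slope
falling  = xs.map { |xi| -0.5 * xi + 4 }  # perfectly linear, negative slope
parabola = xs.map { |xi| xi**2 }          # strongly related, but not linearly

r_rising   = CorrelationSketch.r(xs, rising)
r_falling  = CorrelationSketch.r(xs, falling)
r_parabola = CorrelationSketch.r(xs, parabola)

puts "rising:   r = #{r_rising.round(3)}"
puts "falling:  r = #{r_falling.round(3)}"
puts "parabola: r = #{r_parabola.round(3)}"
```

The parabola is the interesting case: y is completely determined by x, yet the correlation comes out at essentially zero because the relationship is symmetric rather than linear. That's why the scatter plot check matters.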

Go Forth And Correlate

If you take away anything from this discussion, remember that correlation is quite useful if used wisely, but it is also incredibly easy to misuse. Correlation does not imply causation, no matter how much you want to believe that causation runs a certain way in your data. Unless there is a known mechanism that explains why one variable depends on the other, you need to perform other analyses before you can claim causation. If your data has a time component to it, there are a number of time series analyses and cross-correlation analyses that could uncover more details in the data related to causation. Other types of data require other more advanced statistical techniques to figure out causation. In any case, examining the correlation of the data is a good place to start, and will generally point you in the right direction.

We're getting close to the end of this miniseries on statistical tools. Next week I'll wrap up with a couple ways to analyze data that's not linear, but looks like it could fit some other type of curve.

Everyday Statistics for Programmers: Regression Analysis

Now that we've covered most of the basics of statistics, from averages and standard deviations to confidence and significance, it's time to tackle the most-loved of all statistical tools: linear regression. In its most basic form, regression analysis is simply the practice of fitting a straight line to a set of data that consists of pairs of measurements. The measurements can be any number of things (voltage and temperature, weight and age, GDP and productivity) as long as they come in pairs and you're trying to figure out whether one depends on the other.

The conceptual simplicity of linear regression, and the ease of carrying it out, makes it equally easy to get into trouble by applying it where you shouldn't. We humans tend to think of most things in a linear way. If something is good, more is better. If something is bad, more is worse. If something is moving, we can assume it's going in a straight line to intercept it or avoid it, as the case may be. Linearity is a large part of how we experience our world, so we automatically assume that linearity can explain new experiences.

As a first approximation, this approach can be useful, but it's not always right. When using linear regression it's important to think hard about the data to decide if linear regression really makes sense. Making a scatter plot of the data points first is essential. If the points show some curvature, it may be necessary to fit the data to some curve other than a straight line. If the points don't show any dependency at all, i.e. they're all clumped together in the middle of your graph or scattered around like buckshot, then regression analysis is not going to tell you much.

Another thing to be careful of with regression analysis is making predictions about measurement values that lie outside the range of your data set. The samples that you have may look straight as an arrow, but if you extrapolate too far outside of your data set's range, you run the risk of making fantastical claims that aren't supported by your data. It may only be locally linear, and who knows, if you get far enough away from your measurement range, the results could curve and go in the opposite direction! Don't make claims about values that aren't in your data, or at least guard your statements liberally with disclaimers that they are only conjecture.

With those warnings out of the way, let's turn the discussion to a more concrete example. I happen to have a ton of data on the range of my Nissan Leaf, so we'll use that. One thing that I knew going into owning a fully electric car was that the range was likely to be dependent on temperature, so in my mileage log I kept track of the ambient temperature. A scatter plot of two years' worth of data looks like this:

There is a definite trend to this data with lower temperatures reducing the range of the car, and higher temperatures increasing the range. This idea of a trend brings up one more note of caution. Make sure that you can legitimately claim the dependency that you're asserting. In this case it is fairly obvious that the temperature could cause changes in the range due to changing the capacity and efficiency of the battery. There are known mechanisms in lithium-ion batteries that would cause this behavior. It is also obvious (I hope) that the change in the car's range is not causing the change in ambient temperature. That would be preposterous. Things are not always this simple, though, and I'll get into that more next week when I cover correlation.

So we have a scatter plot that looks kind of linear, and we want to fit a line to it. How do we do that? Well, you could just dump the data into Excel and go to Tools → Regression Analysis, but we want to actually understand what we're doing, so we're going to look at the equations. From algebra we know that the equation for a line is

y = m*x + b

where m is the slope of the line and b is the value where the line crosses the y-axis. Both x and y are variables. In the Leaf example, x is the ambient temperature and y is the range of the car. If we can figure out what m and b are, then we can plug any temperature into this equation for x and calculate an estimated range of the Leaf at that temperature. We want to calculate values for m and b that will minimize the vertical distances (measured along the y-axis) between the data points and the resulting line that goes through them. The line parameters that result are called the least squares estimates of m and b.

The term least squares should give you a clue as to how we're going to figure out the best fit line. That's right, the sum of squared differences proves to be quite useful again. Basically, we want to minimize the sum of squared differences between the y values of the line, given by m*x + b, and the y values of the data points. Deriving the equations for m and b involves using calculus to compute derivatives with respect to m and b, setting them equal to zero to find the minimum, and solving for m and b. I won't show the full derivation here, but the calculations for the slope and intercept look like this when implemented in Ruby:

module Statistics
  def self.dot_product(x, y)
    (0...x.size).inject(0) { |s, i| s + x[i]*y[i] }
  end

  def self.s_xy(x, y)
    # to_f avoids integer division when the data are integers
    dot_product(x, y) - sum(x)*sum(y) / x.size.to_f
  end

  def self.s_xx(x)
    dot_product(x, x) - sum(x)**2 / x.size.to_f
  end

  def self.linear_reg(x, y)
    m = s_xy(x, y) / s_xx(x)
    b = mean(y) - m*mean(x)
    [m, b]
  end
end

The methods s_xy() and s_xx() that are used to calculate the slope m are fairly standard notation for these calculations in statistics, so that explains the terse naming. Notice that the slope calculation makes some sense because it is loosely taking the form of y/x. Once the slope is calculated, the y-intercept calculation is a straightforward solution of the linear equation using the averages of the x values and y values for the (x,y) point.
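
For readers who want to see where those formulas come from, here is a compact sketch of the derivation that was skipped above. The symbol S(m, b) for the sum of squared differences and the bar notation for averages are introduced here for illustration; they are not names from the code.

```latex
\begin{aligned}
S(m, b) &= \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} \left(y_i - m x_i - b\right) = 0
  \;\Rightarrow\; b = \bar{y} - m \bar{x} \\
\frac{\partial S}{\partial m} &= -2 \sum_{i=1}^{n} x_i \left(y_i - m x_i - b\right) = 0
  \;\Rightarrow\; m = \frac{S_{xy}}{S_{xx}} \\
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2
\end{aligned}
```

The last two lines correspond to what s_xy() and s_xx() compute, and the expression for b matches the intercept line in linear_reg().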

Now that we can calculate a best-fit line for a data set, we can see what such a line looks like for the Leaf data. Running the linear regression on the data yields a slope of about 0.3 miles/°F and a y-intercept of about 56 miles. That means at 0°F, we can expect this 2012 Leaf to get about 56 miles of range, and we can plug any temperature into the linear equation to see approximately what range to expect at that temperature. Pretty cool. Remember to be careful about plugging in values that are too far outside the range of the data. The range for temperatures above 100°F or below -10°F could be much different than this trend line predicts. Here's what the trend line looks like on the scatter plot:

Scatter plot of Leaf Range Vs. Temperature with Trend Line
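
To make the caution about extrapolation concrete, here is a minimal, self-contained sketch that fits a line and flags predictions that fall outside the range of the fitted data. The module name RegressionSketch, the predict helper, and the data are all made up for illustration; the numbers are not the Leaf log, and the sum and mean helpers are stand-ins for the ones defined in earlier posts.

```ruby
# Self-contained sketch: least-squares fit plus a guarded prediction.
# The data below is invented and exactly linear; it is NOT the Leaf log.
module RegressionSketch
  def self.sum(values)
    values.inject(0.0) { |s, v| s + v }
  end

  def self.mean(values)
    sum(values) / values.size
  end

  def self.dot_product(x, y)
    x.zip(y).inject(0.0) { |s, (a, c)| s + a * c }
  end

  # Least squares estimates of slope m and intercept b.
  def self.linear_reg(x, y)
    n = x.size.to_f
    s_xy = dot_product(x, y) - sum(x) * sum(y) / n
    s_xx = dot_product(x, x) - sum(x)**2 / n
    m = s_xy / s_xx
    [m, mean(y) - m * mean(x)]
  end

  # Returns the predicted y and a flag saying whether x_new lies inside
  # the range of x values that the fit was based on.
  def self.predict(x, m, b, x_new)
    in_range = x_new >= x.min && x_new <= x.max
    [m * x_new + b, in_range]
  end
end

temps  = [20, 40, 60, 80]          # hypothetical temperatures in deg F
ranges = [62.0, 68.0, 74.0, 80.0]  # hypothetical ranges in miles

m, b = RegressionSketch.linear_reg(temps, ranges)
estimate, trusted = RegressionSketch.predict(temps, m, b, 120)

puts "slope=#{m.round(3)} intercept=#{b.round(3)}"
puts "estimate at 120 deg F: #{estimate.round(1)} miles (inside data range: #{trusted})"
```

The estimate at 120 degrees still comes out as a number, but the flag makes it clear that it rests on extrapolation rather than on measured data.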

You may have noticed that the data is pretty noisy, which makes it not unlike a lot of real-world data. Temperature is not the only variable that's influencing the Leaf's range. Other factors, like wind speed and direction, traffic conditions, variations in driving style and speed, differences in route, and measurement error can all play a role. There is a way to quantify this variation from the trend line to figure out exactly how much of the variation in the data is explained by changes in temperature, and that value is called the coefficient of determination, or r-squared value of the linear regression.

To calculate the r-squared value, we're going to bring back the old workhorse, the sum of squared errors. This time the error is the difference between the y-value of each data point and the corresponding y-value of the trend line. These differences are called the residuals of the linear regression. The other piece of information we need is the total sum of squares, denoted as SST, which is a similar calculation to the sum of squared errors, but it uses the difference between each data point's y-value and the mean of all the y-values. The implementation of the r-squared calculation in Ruby looks like this:

module Statistics
  def self.sse(x, y, m, b)
    dot_product(y, y) - m*dot_product(x, y) - b*sum(y)
  end

  def self.sst(y)
    dot_product(y, y) - sum(y)**2 / y.size.to_f
  end

  def self.r_squared(x, y, m, b)
    1 - sse(x, y, m, b) / sst(y)
  end
end

The actual calculation of a sum of squared errors is conspicuously missing from the code, and that's because the calculations of both the SSE and SST terms can be simplified into the above forms that conveniently use methods that are already defined.
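
For comparison, here is a rough sketch of the direct calculation, summing the squared residuals one point at a time, checked against the shortcut form. The module name RSquaredSketch, the method names, and the data are invented for this example, and the helpers are stand-ins for the ones defined in earlier posts.

```ruby
# Sketch: r-squared computed directly from residuals, compared with the
# shortcut SSE = sum(y^2) - m*sum(x*y) - b*sum(y). Data is made up.
module RSquaredSketch
  def self.sum(values)
    values.inject(0.0) { |s, v| s + v }
  end

  def self.mean(values)
    sum(values) / values.size
  end

  def self.linear_reg(x, y)
    x_bar = mean(x)
    y_bar = mean(y)
    s_xy = sum(x.zip(y).map { |a, c| (a - x_bar) * (c - y_bar) })
    s_xx = sum(x.map { |a| (a - x_bar)**2 })
    m = s_xy / s_xx
    [m, y_bar - m * x_bar]
  end

  # Sum of squared residuals: data y minus trend-line y at each x.
  def self.sse_direct(x, y, m, b)
    sum(x.zip(y).map { |a, c| (c - (m * a + b))**2 })
  end

  # Algebraically equivalent form, valid for the least squares m and b.
  def self.sse_shortcut(x, y, m, b)
    sum(y.map { |c| c * c }) - m * sum(x.zip(y).map { |a, c| a * c }) - b * sum(y)
  end

  # Total sum of squares: data y minus the mean of all y values.
  def self.sst_direct(y)
    y_bar = mean(y)
    sum(y.map { |c| (c - y_bar)**2 })
  end

  def self.r_squared(x, y, m, b)
    1 - sse_direct(x, y, m, b) / sst_direct(y)
  end
end

x = [1, 2, 3, 4, 5]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
m, b = RSquaredSketch.linear_reg(x, y)
direct = RSquaredSketch.sse_direct(x, y, m, b)
shortcut = RSquaredSketch.sse_shortcut(x, y, m, b)
r2 = RSquaredSketch.r_squared(x, y, m, b)

puts "SSE direct=#{direct.round(6)} shortcut=#{shortcut.round(6)} r-squared=#{r2.round(3)}"
```

Note that the shortcut only holds when m and b are the least squares estimates; for any other line the two SSE values will disagree.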

With the r-squared value, we can put a number on the amount of variation in the data that is explained by the trend line. For the Leaf range data that number is 0.477. What does that mean? It means 47.7%, or nearly half, of the variation in the range can be explained by the equation: range = 0.3*temp + 56. The other half of the variation is due to other sources. As a rough rule of thumb, an r-squared value of around 50% is okay, and linear regression is a reasonable way to analyze the data. An r-squared value over 90% means the data tracks the trend line very closely, and an r-squared value less than 10% means the trend line explains almost none of the variation. With such a low r-squared value, doing some other analysis or finding a different measurement to explain the variation in the variable under investigation would be a good idea.

Regression analysis is a very powerful statistical tool, and it can be extended in many ways to answer even more complicated questions using more advanced statistical techniques. That is a topic best left for another day, though. For now remember that linear regression is only appropriate if your data is indeed linear, and be sure to check the r-squared value to quantify how closely your data tracks a trend line. Next week I'll explore another statistical tool that's closely related to regression, and that is the gnarly topic of correlation.

Everyday Statistics for Programmers: Significance

So far we've covered the basic statistical tools of averages and distributions and gone a little deeper with standard deviations and confidence intervals. This time we're going to take a look at statistical significance.

Before getting too far into the details, it's important to understand what statistical significance means. If you have a set of data that you're comparing to an expected or desired value, or you have multiple sets of data taken under different conditions, you can calculate whether or not the data sets are different—either from the expected value or from each other—using a statistical test. Then you can say the results are statistically significant to a certain confidence level.

"Significant" has a very specific meaning in statistics, and it is somewhat different from the common usage of the term. It doesn't have anything to do with how large a difference there is between two values. You can very easily get a statistically significant result by using large sample sizes to accentuate small differences in a measurement.

You see this all the time when the media reports on scientific results, especially in health or medical studies. The headline is usually something sensational, and in the article the author is careful to say that the results are significant without giving any indication of the magnitude of differences. This wording is usually a red flag that the results were statistically significant, but not practically significant. The author very well knows that if he wrote that some treatment produced only 3% better outcomes, or that changing your lifestyle in some way would result in a 5% improvement in some metric, nobody would care. Conversely, if the practical significance is large, then you better believe the journalist is going to highlight it.

Statistical significance should be used as a tool to help you decide if your data is actually telling you what you think it is: is there really a measurable difference in the data? That's it. To determine whether the results are meaningful or actionable, you need to use your own judgement based on your knowledge of the domain you're working in. Now that we know what it means to be statistically significant, how do we calculate it?

Let's start with a single data set that is compared with an expected value, where you want to know whether the mean of the data is equivalent to or different from the expected value. Suppose you have a set of run times for a program you're working on. You have many different input files to test the program with various usage models, and the results of measuring the program's run time for each of these input files make up your data set. You want to see if the average run time meets or exceeds the performance requirements for the program.

The average run time is an approximation, and it would be different with a different set of input files. If you measured the average run times for many different sets of input files, you would find that the set of averages had its own distribution. What we want to know is if the average run time is less than or equal to the desired performance value, considering that we can't know the real average run time, only an estimate of its distribution.
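
A quick simulation can illustrate this distribution of averages. The sketch below is only an illustration under stated assumptions: uniform random numbers stand in for run times, the seed, sample size, and trial count are arbitrary, and the mean and stdev helpers are stand-ins for the ones defined in earlier posts. It compares the spread of the simulated averages with the population standard deviation divided by the square root of the sample size.

```ruby
# Sketch: the averages of many sample sets form their own distribution.
# Uniform random numbers stand in for run times; all names are illustrative.
def mean(values)
  values.inject(0.0) { |s, v| s + v } / values.size
end

# Sample standard deviation (n - 1 in the denominator).
def stdev(values)
  mu = mean(values)
  Math.sqrt(values.inject(0.0) { |s, v| s + (v - mu)**2 } / (values.size - 1))
end

rng = Random.new(42)   # fixed seed so the run is repeatable
sample_size = 25       # measurements per hypothetical set of input files
trials = 2000          # number of hypothetical sets

# One average per simulated set of measurements.
averages = Array.new(trials) do
  mean(Array.new(sample_size) { rng.rand })
end

spread = stdev(averages)
# A uniform distribution on [0, 1) has standard deviation sqrt(1/12).
predicted = Math.sqrt(1.0 / 12.0) / Math.sqrt(sample_size)

puts "spread of the averages: #{spread.round(4)}"
puts "sigma / sqrt(n):        #{predicted.round(4)}"
```

The two printed values should land close to each other, which is why a single data set is enough to estimate how much its average would wander.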

The mean of this distribution is the mean of the data set, and the standard deviation of the mean is the standard deviation of the data set divided by the square root of the number of samples in the data set. We can create a test statistic by taking the difference of the mean and expected value, and dividing this difference by the standard deviation of the mean. In Ruby we could calculate the test statistic like this:

module Statistics
  def self.test_statistic(data, expected_value)
    mu = mean data
    sigma = stdev data
    (mu - expected_value)/(sigma/Math.sqrt(data.size))
  end
end

I've defined the mean and stdev methods previously. The result of this calculation is a value in units of standard deviations of the mean. You can look up in a normal distribution table what the confidence level would be for a given number of standard deviations. In our example, we want the average run time to be no greater than an expected value, so the test statistic needs to stay below a critical value for the chosen confidence level. For a confidence level of 95%, the test statistic would have to be less than 1.645; a larger value means the average run time is significantly above the requirement. This is called an upper-tailed test. A lower-tailed test is similar with the inequality reversed. A two-tailed test is used when you want to know if the average is indistinguishable from the expected value (or alternatively, if it is significantly different from the expected value).
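
Here is a small worked example of the upper-tailed check, written as a self-contained sketch. The module name SignificanceSketch, the run times, and the 10.2 second requirement are made up, and the stdev helper assumes the sample standard deviation (n - 1 in the denominator), which may differ from the version defined in the earlier post.

```ruby
# Sketch: upper-tailed z-test of an average run time against a requirement.
# All data and names are hypothetical.
module SignificanceSketch
  def self.mean(data)
    data.inject(0.0) { |s, v| s + v } / data.size
  end

  # Sample standard deviation (n - 1 in the denominator); an assumption here.
  def self.stdev(data)
    mu = mean(data)
    Math.sqrt(data.inject(0.0) { |s, v| s + (v - mu)**2 } / (data.size - 1))
  end

  # Difference between the mean and the expected value, measured in
  # standard deviations of the mean.
  def self.test_statistic(data, expected_value)
    (mean(data) - expected_value) / (stdev(data) / Math.sqrt(data.size))
  end
end

run_times = [9.0, 10.0, 11.0, 10.0] * 16  # 64 made-up run times in seconds
requirement = 10.2                        # hypothetical performance target

z = SignificanceSketch.test_statistic(run_times, requirement)
meets_requirement = z < 1.645             # 95% confidence, upper-tailed

puts "test statistic: #{z.round(3)}, meets requirement: #{meets_requirement}"
```

With these numbers the average sits a little over two standard deviations of the mean below the target, so the test statistic is comfortably under the 1.645 threshold.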

In a two-tailed test, the critical value changes for a given confidence level. For example, for a confidence level of 95%, the test statistic would need to be within 1.96 standard deviations of zero. If the test statistic falls outside the desired range, you can say the difference between the mean and expected value is statistically significant with a confidence level of 95%. The following graph shows what these rejection regions look like for a two-tailed test. A single-tailed test would exclude one or the other of the regions.

Graph of Probability Distribution of Mean with Rejection Regions

In formal statistics the mean being equivalent to the expected value is referred to as the null hypothesis, and if the test statistic falls within a rejection region, it is said that the null hypothesis is rejected. I've resisted using these terms because I find the double negative and generic terminology confusing, but there it is for completeness.
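
Instead of a printed table, the confidence level for a given test statistic can be approximated in code. The sketch below uses Math.erf from Ruby's standard library to build the standard normal cumulative distribution; the method names are invented for this example, and it assumes the large-sample (normal) case discussed so far.

```ruby
# Sketch: converting a test statistic into a confidence level using the
# standard normal distribution. Method names are illustrative.

# Probability that a standard normal value is less than z.
def normal_cdf(z)
  0.5 * (1.0 + Math.erf(z / Math.sqrt(2.0)))
end

# Probability mass between -|z| and +|z|, i.e. outside both tails.
def two_tailed_confidence(z)
  2.0 * normal_cdf(z.abs) - 1.0
end

# True when the test statistic lands in one of the two rejection regions.
def significant_two_tailed?(z, confidence = 0.95)
  two_tailed_confidence(z) > confidence
end

puts "one-tailed confidence at 1.645: #{normal_cdf(1.645).round(4)}"
puts "two-tailed confidence at 1.96:  #{two_tailed_confidence(1.96).round(4)}"
puts "z = 2.5 significant at 95%?     #{significant_two_tailed?(2.5)}"
puts "z = 1.0 significant at 95%?     #{significant_two_tailed?(1.0)}"
```

Both critical values from the text come out at roughly 95%, one for a single tail and one for the two tails combined.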

Now that we have a method of comparing the mean of a data set to an expected value, we can extend the method for comparing the means of two data sets. The distributions of the two means will probably overlap somewhat, and we want to quantify that overlap to determine if they are significantly different (in the statistical sense). The following graph shows what we're dealing with:

Probability Distribution of Two Different Means

We can easily replace the expected value in the previous test statistic calculation with the second mean, but we have to do something more sophisticated with the standard deviations. We'll make use of our old friend, the square root of the sum of squares! This technique really is used a lot in statistics. In Ruby this calculation looks like this:

module Statistics
  def self.compare_test(data1, data2)
    mu1 = mean data1
    sigma1 = stdev data1
    mu2 = mean data2
    sigma2 = stdev data2
    (mu1-mu2)/Math.sqrt(sigma1**2/data1.size + sigma2**2/data2.size)
  end
end

The same kinds of rejection regions apply for this test, but to make that clearer, another example is in order. Suppose we are now optimizing the program we were measuring earlier and we want to determine if the optimizations really improved the run time. We have a second set of run times with its own mean and standard deviation. With the original run times passed as data1 and the optimized run times as data2, an improvement shows up as a positive test statistic. We would want to see if the calculated test statistic is greater than a certain number of standard deviations, given a desired confidence level. For a confidence level of 95%, the test statistic would need to be larger than 1.645 standard deviations, and this corresponds to an upper-tailed test. If you only need to show that the two data sets are different, you can use a two-tailed test.
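
The optimization scenario can be sketched with made-up numbers as follows. The module name CompareSketch and both data sets are hypothetical, and the stdev helper again assumes the sample standard deviation.

```ruby
# Sketch: comparing run times before and after an optimization.
# All data and names are hypothetical.
module CompareSketch
  def self.mean(data)
    data.inject(0.0) { |s, v| s + v } / data.size
  end

  # Sample standard deviation (n - 1 in the denominator); an assumption here.
  def self.stdev(data)
    mu = mean(data)
    Math.sqrt(data.inject(0.0) { |s, v| s + (v - mu)**2 } / (data.size - 1))
  end

  # Difference of the two means divided by the combined standard
  # deviation of the means (square root of the sum of squares).
  def self.compare_test(data1, data2)
    mu1 = mean(data1)
    mu2 = mean(data2)
    sigma1 = stdev(data1)
    sigma2 = stdev(data2)
    (mu1 - mu2) / Math.sqrt(sigma1**2 / data1.size + sigma2**2 / data2.size)
  end
end

before = [12.0, 13.0, 14.0, 13.0] * 16  # 64 made-up run times, original code
after  = [11.0, 12.0, 13.0, 12.0] * 16  # 64 made-up run times, optimized code

z = CompareSketch.compare_test(before, after)
improved = z > 1.645                    # 95% confidence, upper-tailed

puts "test statistic: #{z.round(3)}, significant improvement: #{improved}"
```

Swapping the two arguments flips the sign of the test statistic, so the order decides whether an improvement appears in the upper or the lower tail.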

One issue that I haven't addressed yet is that the relationship between the test statistic and confidence level changes depending on how large the data sets are. This whole discussion has assumed fairly large data sets of more than about 50 samples. The tests we have been using for these large data sets are called z-tests. If your data sets are smaller, you can do a very similar t-test that uses adjusted critical values for the test statistic. These values can be found in a t distribution table, so you can handle data sets with smaller sample sizes.
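
As a rough illustration of how the critical value grows for small data sets, the sketch below hard-codes a few one-tailed 95% entries from a standard t distribution table, keyed by degrees of freedom (sample size minus one). The constant and method names are invented for this example, the table is deliberately tiny, and rounding down to the nearest listed entry is a simplifying assumption that errs on the conservative side.

```ruby
# Sketch: looking up a one-tailed 95% critical value for a t-test.
# Only a handful of table rows are included; names are illustrative.
T_CRITICAL_95_ONE_TAILED = {
  5  => 2.015,
  10 => 1.812,
  20 => 1.725,
  30 => 1.697
}.freeze

# Picks the largest tabulated degrees of freedom that does not exceed the
# actual value, which gives a slightly larger (more conservative) threshold.
def t_critical(sample_size)
  df = sample_size - 1
  key = T_CRITICAL_95_ONE_TAILED.keys.select { |k| k <= df }.max
  raise ArgumentError, "this small table needs at least 6 samples" if key.nil?
  T_CRITICAL_95_ONE_TAILED[key]
end

[6, 11, 21, 31].each do |n|
  puts "n=#{n}: test statistic must exceed #{t_critical(n)}"
end
```

All of these thresholds are larger than the 1.645 used for the z-test, and they shrink toward it as the sample size grows.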

To wrap up, statistical significance is a way to determine if the mean of a data set is different from an expected value or the mean of another data set, given the variance of the data and the number of samples in the data. It's a useful tool for figuring out if it's appropriate to make claims about the data, or if there is too much variation or too little difference to say anything about the data.

This type of testing only scratches the surface of hypothesis testing, and analysis of variance is an even more advanced method of determining significance of differences in multiple data sets. I encourage you to explore these statistical methods in more detail, but z-tests and t-tests are solid tools for everyday statistical use. Next week we'll move on to another fundamental tool of statistical data analysis, linear regression.