Lucid Mesh: Statistics

Showing posts with label Statistics. Show all posts

Everyday Statistics for Programmers: Nonlinear Regression

Last week I talked about how to figure out if two variables in your data set are correlated, and the week before I talked about fitting a trend line to your data. What happens if you know your data is correlated, but the relationship doesn't look linear? Is there anything that can be done to estimate the trend in the data if it's not linear? Yes, depending on what the data looks like, we can either transform the data to make it linear, or do a polynomial regression on the data to fit a polynomial equation to it.

We'll take a closer look at data transformations, and then briefly cover polynomial regression. The idea with data transformations is to somehow make your data linear. There are a few data trends that this will work for, and luckily, these types of trends cover most of the nonlinear data that you're ever likely to see. These trends fall into the categories of exponential, power, logarithmic, and reciprocal relationships.

The exponential relationship is probably the most common of these, so lets go through an example of how to transform a set of data that exhibits an exponential trend. Say you've started a new website, and you're measuring the number of active users on your site each week. Your usage data might look something like this:

Graph of exponential data with linear trendline

The linear trend line is there to show that the data is not really linear. It's kind of linear, but in the middle of the scatter plot the data bows below the line and at the end of the plot it looks to be taking off above the line. One would assume that to get a better idea of what might happen in the near future, using a linear trend line will underestimate the site's performance. To get a better idea, we can use linear regression, but first we want to transform the data to make it more linear. In this case that means taking the logarithm of the y-values of the data, which produces the following scatter plot of the transformed data:

Graph of transformed exponential data with linear trendline

The transformed data definitely looks more linear, as the trend line running right through the scatter plot shows. You can do a linear regression on it in exactly the same way as any other linear regression, ending up with the m (slope) and b (offset) coefficients. However, we want the coefficients for the original exponential relationship, which is of the form

y = α*exp(ß*x)

Since we took the logarithm of y, the form of the linear equation that we found the coefficients for is

ln(y) = ln(α) + ß*x

This looks just like the linear equation y = m*x + b, so we can conclude that α = exp(b) and ß = m. Now that we have the coefficients of the exponential trend line, we can plot it against the original data like so:

Graph of exponential data with exponential trendline

Pretty slick. The exponential trend line fits the data much better, and we can merrily be on our way, extrapolating future site growth (but not too far in the future, of course). One thing to be careful of, though. Notice how the data points get more spread out later in time? That's characteristic of many exponential trends, but not all. If the data points look evenly spread out over the entire range of the data (in statistical terms this would be stated as constant variance over the range), then transforming the data will cause the smaller values to be more heavily weighted in the linear regression. A more appropriate analysis to do in this case, if you need the additional accuracy, would be a true nonlinear regression using a nonlinear solver.

A similar procedure to the logarithmic transform for exponential trends works for a power relationship of the form

y = α*x^ß

The difference is that the logarithm of the x-values must also be calculated to give the data a linear form of

ln(y) = ln(α) + ß*ln(x)

The original coefficients are found the same way as they were for the exponential relationship, with α = exp(b) and ß = m. That wasn't too difficult; just a slight tweak was needed to the data transformation.

The last two forms of logarithmic and reciprocal relationships are easier to calculate because only the x-values need to be transformed with a ln(x) and a 1/x operation, respectively. The linear coefficients are the same between the transformed and original data, so once the linear regression is done on the transformed data, the work is pretty much done.

That's all well and good as far as nonlinear regression goes, but what happens if your data looks like this:

This data doesn't fit any of the relationships discussed so far. It's an example of a quadratic relationship that also shows up fairly often in different types of data, especially economic data. When you see data like this, you can assume that there is some kind of tradeoff involved, and you probably want to optimize for the area on the scatter plot where the data peaks. To figure out exactly where that peak is, you need to find an equation for the trend line, and to do that, you need polynomial regression.

Before we get into polynomial regression, also note that data with a quadratic relationship doesn't have to look like the scatter plot above. It can also be flipped and shifted so that it looks more like an exponential relationship. The data would have a growth characteristic in this case, instead of a trade-off characteristic. It's often hard to tell the difference between quadratic and exponential growth, especially if the data doesn't span enough time. A trend line might show a misfit if the data increases much faster than a quadratic function. In that case, the exponential fit is definitely the way to go. Otherwise, you'll have to resort to your knowledge of the domain and use your judgement as to which relationship better describes the data.

Cubic relationships can also show up in data from time to time, but they are much more rare. Having data with polynomial relationships larger than third order are more of a warning sign than anything. If you're trying to fit a tenth-order polynomial to your data, you may want to rethink your analysis or question your data. You may be moving out of the realm of statistics and into the realm of signal processing.

Getting back to the problem at hand, how do we find the best fit quadratic curve for the data set shown above? Basically, you can construct a linear equation with matrices and solve for the coefficients. That actually makes polynomial regression a form of linear regression, but it's done on nonlinear data. The math is more involved than we'll get into here, but there is a nice implementation of the algorithm in Ruby that I found on a great programming algorithm website, called rosettacode.org. Here's the code adapted to the Statistics module I've been building:

require 'matrix'

module Statistics
  def self.polyfit(x, y, degree)
    x_data = x.map { |xi| (0..degree).map { |pow| (xi**pow).to_f } }
 
    mx = Matrix[*x_data]
    my = Matrix.column_vector(y)
 
    ((mx.t * mx).inv * mx.t * my).t.to_a[0]
  end
end

The first three lines of the polyfit() method are building up the X and Y matrices, and then the last line of the method does the actual calculation to produce the coefficients. If you call this method with the (x,y) data from the scatter plot above and set the degree to 2, you'll get the three coefficients out for the equation:

y = a*x^2 + b*x + c

If you then plot this quadratic equation as the trend line for the scatter plot, you get a nice fit of the data:

Scatter plot of quadratic data with trend line

That wraps up the everyday ways of doing regression on nonlinear data. Doing nonlinear regression with transformations on the data or using a polynomial regression should cover the vast majority of nonlinear data that you would come across, and the analysis is fairly straightforward once you decide on which type of curve to fit.

I hope you've enjoyed this miniseries on statistics, and that it proves useful for your everyday data analysis. Happy estimating!

Everyday Statistics for Programmers: Correlation

Last week I went through the basics of linear regression, and I touched on correlation but didn't get into it. That's because correlation may seem fairly simple, but the simplicity of the calculation disguises a vast field of complexity and logical traps that even professional researchers, statisticians, and economists routinely fall into. I'm not claiming to know more than these people that live and breathe correlation every day, but the issues are well known. Knowing the issues involved can raise our general awareness of when certain conclusions are warranted and when they're going too far.

I Think, Therefore I Correlate

One of the reasons why correlation is so deceptive is that we see correlations all the time. A correlation is basically a relationship between two things. If two separate things grow or increase or change at the same time, they are correlated. If one thing increases and another thing decreases at the same time, they are also correlated.

Our pattern-matching brains are wired to detect this behavior. It's how we've survived and developed throughout our history. When our distant ancestors would eat something and then get sick, they recognized that they shouldn't eat that plant or animal anymore. When they would eat or drink certain things when they were sick, or they would apply different plants to wounds, and then get better, they recognized that as well and passed on that information as early medicine. When they started planting seeds and developed early agriculture, they paid attention to what practices would result in better crop yields. All of these things are correlations, and more often than not, seeing those correlations could mean the difference between life and death. It's ingrained in us.

Even though seeing correlations is a good thing, it's also very easy to develop false correlations. Our ancestors used to believe comets were bad omens. The coming of a comet meant that something terrible was about to happen. Many ancient cultures believed (and some still do) that various traditions, like dancing, could bring the rains so their crops would grow. There are innumerable examples of other false correlations different cultures have believed at different times, and plenty survive to this day. We even collect them as old wives' tales and pass them on half-jokingly to future generations.

Have you ever heard the joke about the guy who was walking down the street in a big city, flapping his arms wildly up and down? Another man walks up to him and asks why he's doing that. The first guy, still flapping his arms, says, "To keep the alligators away." The second guy says, "But there aren't any alligators here." The first guy responds, "See, it's working!" That's false correlation.

These false correlations come about when two things happen at the same time, and someone attaches too much significance to the fact that those two events just happened to coincide. The person witnessing the events doesn't realize that there could be some other explanation for what they experienced, and he goes on to convince enough other people that these two things are related that it becomes common knowledge.

Correlate Wisely

When we're looking at data and calculating correlations, or even reading studies done by others, we need to be careful to not be fooled. It's harder than it sounds. Correlations are easy to calculate, and many things can seem correlated when they're not, either by coincidence or because something else that you're not paying attention to is systematically causing the correlation that you're observing. It may be all the more difficult to recognize a false correlation because you want to believe that it's there because of internal biases or external influences. On top of that, your brain is wired to see the pattern even if it's not the right explanation, and you have to fight against this instinct to get to the truth.
You should first decide whether the correlation makes any sense at all. Is there a plausible mechanism that could explain the connection? Are the two things related in any way that would make the correlation reasonable? Is there anything that would directly contradict the correlation that you're seeing? Is it extraordinary that these two things are correlated? Remember, extraordinary claims require extraordinary evidence.

If the correlation is plausible, then you might start thinking about which of the variables in the correlation is dependent on the other. While you can do this, you must be careful because this line of thought leads to another big way that correlations are used incorrectly. If you've read enough studies that use statistics, sooner or later you'll hear the phrase, "correlation does not imply causation." There are other analysis tools that can get closer to proving the claim that A causes B, but correlation is not one of them. All correlation says is that A and B are related in some way, nothing more. To make stronger claims about dependence, more work will be involved.

It may seem obvious that one thing causes the other in the system you're studying, but it may still be possible for causation to run in the other direction or an entirely different cause may explain why both of the variables being measured are moving together. One recent, high-profile example of this issue occurred after the financial crisis of 2008. A couple of economists, Carmen Reinhart and Kenneth Rogoff, set out to measure the relationship between government debt and GDP growth. They found a negative correlation, meaning that as government debt went up, a country's GDP growth tended to go down, and they concluded that government debt was causing the slow down in GDP growth.

It turned out that there were some errors in their data and calculations, so the correlation was actually much weaker than they thought, but even before that there was a big debate about which way causation actually ran. It's fairly easy to argue that for many of the countries in the study, they suffered lower GDP growth or even negative growth that drove up their government debts, instead of the other way around. In reality causation ran in both directions, and it was highly context dependent by country. Some countries started with high government debt before going into recession, and others went into recessions that ballooned their debt. Correlation couldn't show any of these details.

Correlation In Practice

The ease with which correlation analysis can go wrong doesn't mean correlation isn't useful. It is very useful as a starting point for further analysis. Correlations can show if a relationship actually exists so you can investigate it further, or if you're barking up the wrong tree and need to look elsewhere. So how do we calculate a correlation, given a set of data with X-values and Y-values for two different variables that we want to compare? Here it is, implemented in Ruby once again:

module Statistics
  def self.r(x, y)
    s_xy(x, y) / (Math.sqrt(s_xx(x))*Math.sqrt(s_xx(y)))
  end
end

I defined the methods s_xy() and s_xx() last week for calculating the r-squared value of a regression analysis, so at this point we're getting pretty far along in statistical equations, using earlier calculations to construct new equations. In fact, this correlation coefficient is closely related to the r-squared value. All you have to do is square it and that's what you get, which is why it's referred to as the r value in the code above.

Both the correlation coefficient and the r-squared value give an indication of how strongly linear a data set is, but the correlation coefficient has one property that the r-squared value doesn't have. It can be negative, and a negative correlation coefficient happens when one variable increases while the other decreases, i.e. the scatter plot shows a negative-trending slope. So if the correlation is +1 or -1, the data is perfectly linear with a positive or negative slope, respectively.

If the correlation is close to zero, then the data is either uncorrelated or it might have a nonlinear relationship—you have to check the scatter plot to decide. The closer the correlation is to +/-1, the more linear the relationship is, but to get a better idea of that linearity, you should square the correlation coefficient. This operation gives you the r-squared value, which shows the percentage of the variation in the data that is due to linearity.

Go Forth And Correlate

If you take away anything from this discussion, remember that correlation is quite useful if used wisely, but it is also incredibly easy to misuse. Correlation does not imply causation, no matter how much you want to believe that causation runs a certain way in your data. Unless there is a known mechanism that explains why one variable depends on the other, you need to perform other analyses before you can claim causation. If your data has a time component to it, there are a number of time series analyses and cross-correlation analyses that could uncover more details in the data related to causation. Other types of data require other more advanced statistical techniques to figure out causation. In any case, examining the correlation of the data is a good place to start, and will generally point you in the right direction.

We're getting close to the end of this miniseries on statistical tools. Next week I'll wrap up with a couple ways to analyze data that's not linear, but looks like it could fit some other type of curve.

Everyday Statistics for Programmers: Regression Analysis

Now that we've covered most of the basics of statistics, from averages and standard deviations to confidence and significance, it's time to tackle the most-loved of all statistical tools – linear regression. In its most basic form, regression analysis is simply the practice of fitting a straight line to a set of data that consists of pairs of measurements. The measurements can be any of number of things—voltage and temperature, weight and age, GDP and productivity—as long as it's a pair of measurements that you're trying to figure out are dependent upon one another or not.

The conceptual simplicity of linear regression, and the ease of carrying it out, makes it equally easy to get into trouble by applying it where you shouldn't. We humans tend to think of most things in a linear way. If something is good, more is better. If something is bad, more is worse. If something is moving, we can assume it's going in a straight line to intercept it or avoid it, as the case may be. Linearity is a large part of how we experience our world, so we automatically assume that linearity can explain new experiences.

As a first approximation, this approach can be useful, but it's not always right. When using linear regression it's important to think hard about the data to decide if linear regression really makes sense. Making a scatter plot of the data points first is essential. If the points show some curvature, it may be necessary to fit the data to some curve other than a straight line. If the points don't show any dependency at all, i.e. they're all clumped together in the middle of your graph or scattered around like buckshot, then regression analysis is not going to tell you much.

Another thing to be careful of with regression analysis is making predictions about measurement values that lie outside the range of your data set. The samples that you have may look straight as an arrow, but if you extrapolate too far outside of your data set's range, you run the risk of making fantastical claims that aren't supported by your data. It may only be locally linear, and who knows, if you get far enough away from your measurement range, the results could curve and go in the opposite direction! Don't make claims about values that aren't in your data, or at least guard your statements liberally with disclaimers that they are only conjecture.

With those warnings out of the way, let's turn the discussion to a more concrete example. I happen to have a ton of data on the range of my Nissan Leaf, so we'll use that. One thing that I knew going into owning a fully electric car was that the range was likely to be dependent on temperature, so in my mileage log I kept track of the ambient temperature. A scatter plot of two years worth of data looks like this:

There is a definite trend to this data with lower temperatures reducing the range of the car, and higher temperatures increasing the range. This idea of a trend brings up one more note of caution. Make sure that you can legitimately claim the dependency that you're asserting. In this case it is fairly obvious that the temperature could cause changes in the range due to changing the capacity and efficiency of the battery. There are known mechanisms in lithium-ion batteries that would cause this behavior. It is also obvious (I hope) that the change in the car's range is not causing the change in ambient temperature. That would be preposterous. Things are not always this simple, though, and I'll get into that more next week when I cover correlation.

So we have a scatter plot that looks kind of linear, and we want to fit a line to it. How do we do that? Well, you could just dump the data into Excel and go to Tools → Regression Analysis, but we want to actually understand what we're doing, so we're going to look at the equations. From algebra we know that the equation for a line is

y = m*x + b

Where m is the slope of the line and b is the value where the line crosses the y-axis. Both x and y are variables. In the Leaf example, x is the ambient temperature and y is the range of the car. If we can figure out what m and b are, then we can plug any temperature into this equation for x and calculate an estimated range of the Leaf at that temperature. We want to calculate values for m and b that will minimize the distance between each of the data points and the resulting line that goes through them. The line parameters that result are called the least squares estimates of m and b.

The term least squares should give you a clue as to how we're going to figure out the best fit line. That's right, the sum of squared differences proves to be quite useful again. Basically, we want to minimize the sum of squared differences between the y values of the line, given by m*x + b, and the y values of the data points. Deriving the equations for m and b involves using calculus to compute derivatives with respect to m and b, setting them equal to zero to find the minimum, and solving for m and b. I won't show the full derivation here, but the calculations for the slope and intercept look like this when implemented in Ruby:

module Statistics
  def self.dot_product(x, y)
    (0...x.size).inject(0) { |s, i| s + x[i]*y[i] }
  end

  def self.s_xy(x, y)
    dot_product(x, y) - sum(x)*sum(y) / x.size
  end

  def self.s_xx(x)
    dot_product(x, x) - sum(x)**2 / x.size
  end

  def self.linear_reg(x, y)
    m = s_xy(x, y) / s_xx(x)
    b = mean(y) - m*mean(x)
    [m, b]
  end
end

The methods s_xy() and s_xx() that are used to calculate the slope m are fairly standard notation for these calculations in statistics, so that explains the terse naming. Notice that the slope calculation makes some sense because it is loosely taking the form of y/x. Once the slope is calculated, the y-intercept calculation is a straightforward solution of the linear equation using the averages of the x values and y values for the (x,y) point.

Now that we can calculate a best-fit line for a data set, we can see what such a line looks like for the Leaf data. Running the linear regression on the data yields a slope of about 0.3 miles/°F and a y-intercept of about 56 miles. That means at 0°F, we can expect this 2012 Leaf to get about 56 miles of range, and we can plug any temperature into the linear equation to see approximately what range to expect at that temperature. Pretty cool. Remember to be careful about plugging in values that are too far outside the range of the data. The range for temperatures above 100°F or below -10°F could be much different than this trend line predicts. Here's what the trend line looks like on the scatter plot:

Scatter plot of Leaf Range Vs. Temperature with Trend Line

You may have noticed that the data is pretty noisy, which makes it not unlike a lot of real-world data. Temperature is not the only variable that's influencing the Leaf's range. Other factors, like wind speed and direction, traffic conditions, variations in driving style and speed, differences in route, and measurement error can all play a role. There is a way to quantify this variation from the trend line to figure out exactly how much of the variation in the data is explained by changes in temperature, and that value is called the coefficient of determination, or r-squared value of the linear regression.

To calculate the r-squared value, we're going to bring back the old workhorse, the sum of squared errors. This time the error is the difference between the y-value of each data point and the corresponding y-value of the trend line. These differences are called the residuals of the linear regression. The other piece of information we need is the total sum of squares, denoted as SST, which is a similar calculation to the sum of squared errors, but it uses the difference between each data point's y-value and the mean of all the y-values. The implementation of the r-squared calculation in Ruby looks like this:

module Statistics
  def self.sse(x, y, m, b)
    dot_product(y, y) - m*dot_product(x, y) - b*sum(y)
  end

  def self.sst(y)
    dot_product(y, y) - sum(y)**2 / y.size
  end

  def self.r_squared(x, y, m, b)
    1 - sse(x, y, m, b) / sst(y)
  end
end

The actual calculation of a sum of squared errors is conspicuously missing from the code, and that's because the calculations of both the SSE and SST terms can be simplified into the above forms that conveniently use methods that are already defined.

With the r-squared value, we can put a number on the amount of variation in the data that is explained by the trend line. For the Leaf range data that number is 0.477. What does that mean? It means 47.7%, or nearly half of the data can be explained by the equation: range = 0.3*temp + 56. The other half of the variation is due to other sources. In general an r-squared value of 50% is okay, and linear regression is a reasonable way to analyze the data. An r-squared value over 90% means the data is very linear, and an r-squared value less than 10% means the data is not at all linear. Doing some other analysis or finding a different measurement to explain the variation in the variable under investigation would be a good idea with such a low r-squared value.

Regression analysis is a very powerful statistical tool, and it can be extended in many ways to answer even more complicated questions using more advanced statistical techniques. That is a topic best left for another day, though. For now remember that linear regression is only appropriate if your data is indeed linear, and be sure to check the r-squared value to quantify how closely your data tracks a trend line. Next week I'll explore another statistical tool that's closely related to regression, and that is the gnarly topic of correlation.

Everyday Statistics for Programmers: Significance

So far we've covered the basic statistical tools of averages and distributions and gone a little deeper with standard deviations and confidence intervals. This time we're going to take a look at statistical significance.

Before getting too far into the details, it's important to understand what statistical significance means. If you have a set of data that you're comparing to an expected or desired value, or you have multiple sets of data taken under different conditions, you can calculate whether or not the data sets are different—either from the expected value or from each other—using a statistical test. Then you can say the results are statistically significant to a certain confidence level.

"Significant" has a very specific meaning in statistics, and it is somewhat different from the common usage of the term. It doesn't have anything to do with how large a difference there is between two values. You can very easily get a statistically significant result by using large sample sizes to accentuate small differences in a measurement.

You see this all the time when the media reports on scientific results, especially in health or medical studies. The headline is usually something sensational, and in the article the author is careful to say that the results are significant without giving any indication of the magnitude of differences. This wording is usually a red flag that the results were statistically significant, but not practically significant. The author very well knows that if he wrote that some treatment produced only 3% better outcomes, or that changing your lifestyle in some way would result in a 5% improvement in some metric, nobody would care. Conversely, if the practical significance is large, then you better believe the journalist is going to highlight it.

Statistical significance should be used as a tool to help you decide if your data is actually telling you what you think it is—is there really a measurable difference in the data. That's it. To determine whether the results are meaningful or actionable, you need to use your own judgement based on your knowledge of the domain you're working in. Now that we know what it means to be statistically significant, how do we calculate it?

Let's start with a single data set that is compared with an expected value, and you want to know whether the mean of the data is equivalent to or different from the expected value. Suppose you have a set of run times for a program you're working on. You have many different input files to test the program with various usage models, and the results of measuring the program's run time for each of these input files makes up your data set. You want to see if the average run time meets or exceeds the performance requirements for the program.

The average run time is an approximation, and it would be different with a different set of input files. If you measured the average run times for many different sets of input files, you would find that the set of averages had its own distribution. What we want to know is if the average run time is less than or equal to the desired performance value, considering that we can't know the real average run time, only an estimate of its distribution.

The mean of this distribution is the mean of the data set, and the standard deviation of the mean is the standard deviation of the data set divided by the square root of the number of samples in the data set. We can create a test statistic by taking the difference of the mean and expected value, and dividing this difference by the standard deviation of the mean. In Ruby we could calculate the test statistic like this:

module Statistics
  def self.test_statistic(data, expected_value)
    mu = mean data
    sigma = stdev data
    (mu - expected_value)/(sigma/Math.sqrt(data.size))
  end
end

I've defined the mean and stdev methods previously. The result of this calculation is a value in standard deviations. You can look up in a normal distribution table what the confidence level would be for a certain standard deviation. In our example, we're looking for the average run time to be less than an expected value, so the test statistic needs to be less than a certain standard deviation value for a certain confidence level. For a confidence level of 95%, the test statistic would have to be less than 1.645. This is called an upper-tailed test. A lower-tailed test is similar with the inequality reversed. A two-tailed test is used when you want to know if the average is indistinguishable from the expected value (or alternatively, if it is significantly different than the expected value).

In a two-tailed test, the required test statistic changes for a given confidence level. E.g. for a confidence level of 95%, the test statistic would need to be within 1.96 standard deviations. If the test statistic falls outside the desired range, you can say the difference between the mean and expected value is statistically significant with a confidence level of 95%. The following graph shows what these rejection regions look like for a two-tailed test. A single-tailed test would exclude one or the other of the regions.

Graph of Probability Distribution of Mean with Rejection Regions

In formal statistics the mean being equivalent to the expected value is referred to as the null hypothesis, and if the test statistic falls within a rejection region, it is said that the null hypothesis is rejected. I've resisted using these terms because I find the double negative and generic terminology confusing, but there it is for completeness.

Now that we have a method of comparing the mean of a data set to an expected value, we can extend the method for comparing the means of two data sets. The distributions of the two means will probably overlap somewhat, and we want to quantify that overlap to determine of they are significantly different (in the statistical sense). The following graph shows what we're dealing with:

Probability Distribution of Two Different Means

We can easily replace the expected value in the previous test statistic calculation with the second mean, but we have to do something more sophisticated with the standard deviations. We'll make use of our old friend, the square root of the sum of squares! This technique really is used a lot in statistics. In Ruby this calculation looks like this:

module Statistics
  def self.compare_test(data1, data2)
    mu1 = mean data1
    sigma1 = stdev data1
    mu2 = mean data2
    sigma2 = stdev data2
    (mu1-mu2)/Math.sqrt(sigma1**2/data1.size + sigma2**2/data2.size)
  end
end

The same kinds of rejection regions apply for this test, but to make that more clear, another example is in order. Suppose we are now optimizing the program we were measuring earlier and we want to determine if the optimizations really improved the run time. We have a second set of run times with its own mean and standard deviation. We would want to see if the calculated test statistic is greater than a certain standard deviation, given a desired confidence level. For a confidence level of 95%, the test statistic would need to be larger than 1.645 standard deviations, and this corresponds to an upper-tailed test. If you only need to show that the two data sets are different, you can use a two-tailed test.

One issue that I haven't addressed, yet, is that the relationship between the test statistic and confidence level changes depending on how large the data sets are. This whole discussion has assumed fairly large data sets of more than about 50 samples. The tests we have been using for these large data sets are called z-tests. If your data sets are smaller, you can do a very similar t-test that uses adjusted critical values for the test statistic. These values can also be found in a t distribution table, so you can handle data sets with smaller sample sizes.

To wrap up, statistical significance is a way to determine if the mean of a data set is different from an expected value or the mean of another data set, given the variance of the data and the number of samples in the data. It's a useful tool for figuring out if it's appropriate to make claims about the data, or if there is too much variation or too little difference to say anything about the data.

This type of testing only scratches the surface of hypothesis testing, and analysis of variance is an even more advanced method of determining significance of differences in multiple data sets. I encourage you to explore these statistical methods in more detail, but z-tests and t-tests are solid tools for everyday statistical use. Next week we'll move on to another fundamental tool of statistical data analysis, linear regression.

Everyday Statistics for Programmers: Standard Deviations and Confidence

Last week I laid some groundwork for understanding statistics that you can use everyday. Calculating averages and looking at histograms to get an idea of what kind of data you're dealing with is a great first step in statistical analysis. But there are so many more tools available, especially if your data approximates a normal distribution. A couple of those tools are standard deviations and confidence intervals, and they can help you understand the characteristics of your data in more detail.

It's important to remember that these tools work best when used with data sets that have a normal distribution because the equations of the tools describe properties of the normal distribution. The more your data set's distribution differs from a normal distribution, the less descriptive power these tools will have for your data. With that cautionary note out of the way, let's tackle some statistics.

The standard deviation is a measure of how tightly clustered a data set's samples are around the mean. As more samples occur closer to the mean, the standard deviation goes down. If the samples are more spread out, it goes up. The following graph shows what different normal distributions look like with standard deviations of 1, 2, and 0.5:

Different normal distributions with different standard deviations

The above description of standard deviation nearly tells you how to calculate it. You could memorize a formula, but it's good to understand the reasoning behind the tools you use so that you can derive the formula even if you can't remember it. Here's how I would build up the formula for the standard deviation:

From the description, we know that we need to start with the mean of the data set and then find the distance (i.e. difference) of every sample from the mean.
We don't care if the difference is positive or negative, so we'll square each difference to remove its sign.
We're looking for a general value representing the spread of the data, so we can calculate the average of all of the squared differences to come to a single representative value of the data. This value, as calculated so far, is actually called the variance of the data.
We're almost there. The only problem is that the units of the variance are squared. To get back to the sample units, take the square root of the variance.

This set of operations—taking the sum of squared differences or the mean of squared differences—is extremely common in statistics. It's a fundamental tool for measuring the spread of data points or for estimating errors from calculations on the data. Keep it in mind. We'll be coming back to it in later posts.

At this point you may be asking why we squared the differences and then took the square root of the variance. Why not just take the absolute value of the differences, find the mean, and call it a day? If you think of every sample as a separate dimension of the variance of the data set, then the difference between the mean and each sample can be considered a vector. The way to find the magnitude of a vector is to calculate the square root of the sum of squares of its elements. That's exactly what the standard deviation formula does.

Now that we have figured out how to calculate the standard deviation of a data set, here's what it would look like in Ruby code:

module Statistics
  def self.sum(data)
    data.inject(0.0) { |s, val| s + val }
  end

  def self.mean(data)
    sum(data) / data.size
  end

  def self.variance(data)
    mu = mean data
    squared_diff = data.map { |val| (mu - val) ** 2 }
    mean squared_diff
  end

  def self.stdev(data)
    Math.sqrt variance(data)
  end
end

One thing to note with this calculation of the standard deviation is that it is biased slightly low because the variance is found by dividing by the number of samples. This bias is negligible for sample sizes larger than about 10, but to get a slightly less biased estimate, you can divide by the sample size minus one.

The mean and standard deviation together are very useful for describing data. If you talk about a data set as the mean plus or minus two standard deviations, you get a pretty good idea of the range of values for a particular variable. For example, if you have a set of temperature data with a mean of 25°C and a standard deviation of 2.5°C, you can say that the data has a value of 25±5°C to show that it represents a range of temperature values that mostly vary between 20°C and 30°C.

Using two standard deviations to represent the range of a data set is fairly typical. Two standard deviations happens to cover slightly more than 95% of a data set with a normal distribution, so it's a good way of showing the typical values that the data are expected to take on. One standard deviation covers about 68% of the normal distribution. Three standard deviations cover nearly the entire normal distribution at 99.7%, so it's correspondingly much less likely to see values outside of three standard deviations. The following graph shows what one and two standard deviations look like on the normal distribution:

1 and 2 standard deviations of normal distribution

Another way to use the standard deviation is to describe how confident we are that the mean of the data lies within a certain range of values. Our confidence will also depend on the size of the data set for which we're calculating the mean. This makes sense. If you have 10,000 samples you should be much more confident of where the average is than if you only have 100 samples to work with. The range of values that should include the mean to a certain confidence level is referred to as a confidence interval.

To calculate a confidence interval, all you have to do is find the number of standard deviations that will give you the desired confidence level and divide by the square root of the number of samples in the data set. A 95% confidence level is commonly used, and that corresponds to 2 standard deviations (well, actually 1.96, but 2 is easier to remember for quick and dirty calculations). Adding to our example Ruby code, we can use the following method for calculating a 95% confidence interval:

module Statistics
  def self.confidence_interval_95(data)
    mu = mean data
    delta = 1.96 * stdev(data) / Math.sqrt(data.size)
    [mu - delta, mu + delta]
  end
end

In the previous temperature example with a standard deviation of 2.5°C, if we had 1000 samples, we could say that we are 95% confident that the average temperature lies within 25±0.16°C. Notice that, because of the square root operation, to shrink the confidence interval by a significant amount, you have to dramatically increase your sample size. To cut the interval in half, you would have to take four times as many samples. To reduce the interval by an order of magnitude, you would have to increase your sample size a hundred fold!

We should be careful here not to say that there is a 95% chance that the average temperature lies in the calculated range. It's a subtle distinction from the above phrasing, but it's wrong. It also starts down a slippery slope to making dubious claims about the data.

The problem with saying there's a 95% chance is that you are no longer talking about confidence. You are talking about probability, and saying there is a probability that the average temperature is within that range is not correct. The average may or may not be within this confidence interval, but it is a definite value, whether you know it or not. That means once you have a confidence interval, the true mean is either in it or not. There is no probability involved. If you don't know what the true mean is—and many times you can't know because you're only sampling from a larger population—then you have a confidence level that the mean is within the range that you calculated.

There are many more details and subtleties contained within these concepts, but with these basics you can find a lot of practical applications of standard deviations and confidence intervals when working with data. Standard deviations help describe the spread of data that follows a normal distribution, and confidence intervals give you a sense of how meaningful the mean of a data set is. Both of these tools can give you a deeper understanding of the data you're analyzing.

Next week we'll take a look at significance, a tool that goes further into figuring out if your data is really saying what you think it is.

Everyday Statistics for Programmers: Averages and Distributions

Programmers work with lots of data, and when you work with any amount of data, statistics will quickly come into play. A basic working knowledge of statistics is a good tool for any programmer to have in their toolbox, one that you will likely use regularly once you're comfortable with it.

In this mini-series I'll go through some of the statistical tools that I use everyday, what to look for when deciding on which tool to use, and how to avoid mistakes that can mess up your results. I'll start with the basics of the basics—the humble average and a few of the most common distributions.

The average, formally called the mean, is the most fundamental concept in statistics. It's the hammer in the statistical toolbox. The trick is to use it on the right nails. An average is easy to calculate. Given a data set, all you need to do is add up all of the samples and divide by the number of samples in the set. Voila, you have a value that lies in the middle of your data set.

In Ruby you could calculate a mean from an array of values with a method like this:

module Statistics
  def mean(data)
    data.inject(0.0) { |sum, val| sum + val } / data.size
  end
end

Even though the average is a basic tool that's super easy to use, if you're not careful with what you use it on, it can give misleading results. For example, you would not want to use an average to analyze data about the growth of a web site's user base over time. The average number of users is a meaningless number because it's strongly related to another variable, time. Depending on the growth rate, calculating the average number of users over a month or even a week would tell you very little about the number of users on your site. A more sophisticated tool, like a curve fit, is more applicable in this case.

The average is best used on a data set where the samples are independent, meaning they are not related to each other through some dominant parameter of the system, and they are close together relative to their magnitude. In statistics this is called the deviation from the mean, and I'll get into deviations more in a later post. The important thing to remember is that the smaller the deviation is, the more meaningful the mean is as a representation of the data set. If a certain data set is a set of temperature measurements with an average of 200°C and varies from 199°C to 201°C, the average is going to say a lot more about the temperature of the data than if the measurements varied from 100°C to 300°C.

The other important thing to consider when using averages is the distribution of the data. The best case scenario for using averages is if the data has a normal distribution. This type of distribution results from a data set whose samples are clustered around the mean value in a way that's not biased to one side or the other. If you group samples into ranges of values and plot the results as a histogram, you would get something like this for a normal distribution:

You can calculate the histogram bin counts in Ruby with a method like this:

module Statistics
  def self.histogram(data, bins, min_val=nil, max_val=nil)
    min_val ||= data.min
    max_val ||= data.max
    step = (max_val - min_val) / bins.to_f
    counts = [0]*bins
    data.each do |value|
      bin = [((value-min_val)/step).floor, bins - 1].min
      counts[bin] += 1
    end
    counts
  end
end

The each block runs through each data value and calculates which bin it should be counted in using the minimum value and the step size between bins. Any value that's less than the lowest bin will be counted in that bin, and any value greater than the highest bin will be counted in that bin. The bins can be automatically calculated from the minimum and maximum data values, but if there are outliers, the bins could be very spread out, making the histogram look weird. In that case, you can specify where you want the range of the histogram to be, and any outliers will be counted in the bins on either side.

Histograms can tell you a lot about the characteristics of a data set. They will clearly show whether the average is a meaningful representation of the data, how spread out the samples are within the data set, and if there are any peculiarities hidden in the data set. If your data looks like a normal distribution, there are a lot of statistical tools at your disposal for analyzing it.

The mean is also well-defined. It sits right at the peak of the histogram. Half of the samples are larger than the mean, and half of the samples are smaller than the mean. If the histogram is narrow and tall, it shows that the distribution is tight, and a lot of samples are very close to the mean value. The mean is a good value to represent the data set in this case, and its usefulness in predicting the value of future samples with the same characteristics is high.

When you expect your data to have a normal distribution, but it doesn't, it's common for the histogram to show one or more additional peaks. This is called a multimodal distribution, or in the most common case of two peaks, a bimodal distribution. A histogram of such a distribution might look like this:

There are a lot of possibilities here, and most bimodal histograms will not look exactly like this one. The peaks could be further apart, one could be narrower than the other, or one could be taller than the other. In any case, the bimodal nature of the data makes an average less characteristic of the data than normal. Depending on how far apart the peaks are, the mean could merely be shifted from where it should be, or it could fall in an area of the data where there are hardly any data points at all—not a terribly useful place for a mean to be.

What this histogram is telling you is that something is affecting the data in a binary way, meaning that the parameter that is affecting the data is takes on two values. When it's one of the two values, the samples will cluster around one mean, and when it's the other value, the samples will cluster around a different mean. Each value of the parameter will create a separate normal distribution, and the combination of samples at two different values of this unknown parameter creates the bimodal distribution. In fact, I created the above graph by adding two normal distributions together with two different means.

If you have data with a bimodal distribution, you can approach it in a variety of ways. Normally, the first thing you want to do, if you don't know already, is figure out what is causing the distribution to split. Did conditions change while taking the data? Is there some unknown variable that you should filter on to split the data into two sets? What can you control better when taking the data? Can this unknown variable take on more than two values, and if so, does that have a deterministic effect on the data? Can you control this variable to optimize things in the system? Exploring and answering these questions can lead to important new insights into the behavior of the system you're analyzing.

Once you know more about what's causing the bimodal distribution, you can decide whether you want to eliminate the cause or use it to enhance the system. That will depend on whether this variable has a desirable effect on the system or not. In either case, bimodal distributions shouldn't be feared. You can learn a lot about a system by exploring why such a distribution exists.

One last very common distribution is a power-law distribution, also known as a Pareto distribution. This kind of distribution shows up all the time when measuring things with a hard one-sided limit, and it generally looks like this:

The initial sharp decline and then gradual sloping fade away of the curve gives the right side of this distribution the name "the long tail." Many times there is no bounded upper limit to how large whatever is being measured can get. Some examples of things that follow a power-law distribution include people's incomes, internet response times, and user scores on websites like Stackoverflow.com. Some distributions, like people's incomes, have a peak that occurs above zero, but they all exhibit the long tail.

The long tail does some interesting things to the mean of the distribution. A few extremely large data points can pull the mean towards them, giving the mean a potentially large bias. Instead of having a mean being balanced between two equal halves of the distribution, you'll have a distribution with less data points that are larger than the mean and more points that are smaller than the mean.

If you are interested in the halfway point of the distribution, you'll have to calculate the median instead. The median is the value that occurs at the exact halfway point when the samples are sorted. The distance between the median and the mean can be used as a measure of how skewed the distribution is towards the tail.

While the mean is not a very useful parameter of a power-law distribution, knowing that you have such a distribution can guide your use of the data. If the goal is to minimize whatever you're measuring—internet response times is a good example—it can be quite fruitful to target the large outliers to see if they can be eliminated. If those super long response times can't be reduced, they will be a constant drag on the average response time, no matter how much you reduce the faster response times. If instead your goal is to increase the metric, finding some way to affect a large portion of the samples that are below the median will have a correspondingly large effect on the average.

To wrap up, the three main types of distributions that tend to show up when analyzing data are the normal, bimodal, and power-law distributions. When starting to analyze a new data set, it's best to figure out what kind of distribution you're dealing with before diving into more complicated statistical analysis. If the data follows a normal distribution, the mean can be used to talk about what a typical sample is like. With other distributions, you should be much more careful when talking about the average value of the data. That's enough for today. Now that the basics are covered, next time I'll get into standard deviations and confidence.