Everyday Statistics for Programmers: Significance

So far we've covered the basic statistical tools of averages and distributions and gone a little deeper with standard deviations and confidence intervals. This time we're going to take a look at statistical significance.

Before getting too far into the details, it's important to understand what statistical significance means. If you have a set of data that you're comparing to an expected or desired value, or you have multiple sets of data taken under different conditions, you can calculate whether or not the data sets are different—either from the expected value or from each other—using a statistical test. Then you can say the results are statistically significant to a certain confidence level.

"Significant" has a very specific meaning in statistics, and it is somewhat different from the common usage of the term. It doesn't have anything to do with how large a difference there is between two values. You can very easily get a statistically significant result by using large sample sizes to accentuate small differences in a measurement.

You see this all the time when the media reports on scientific results, especially in health or medical studies. The headline is usually something sensational, and in the article the author is careful to say that the results are significant without giving any indication of the magnitude of the differences. This wording is usually a red flag that the results were statistically significant, but not practically significant. The author knows very well that if the article said some treatment produced only 3% better outcomes, or that some lifestyle change would result in a 5% improvement in some metric, nobody would care. Conversely, if the practical significance is large, you'd better believe the journalist is going to highlight it.

Statistical significance should be used as a tool to help you decide if your data is actually telling you what you think it is: is there really a measurable difference in the data? That's it. To determine whether the results are meaningful or actionable, you need to use your own judgement based on your knowledge of the domain you're working in. Now that we know what it means to be statistically significant, how do we calculate it?

Let's start with a single data set that is compared with an expected value, where you want to know whether the mean of the data is equivalent to or different from the expected value. Suppose you have a set of run times for a program you're working on. You have many different input files that exercise the program under various usage models, and the measured run times for each of these input files make up your data set. You want to see if the average run time meets or exceeds the performance requirements for the program.

The average run time is an approximation, and it would be different with a different set of input files. If you measured the average run times for many different sets of input files, you would find that the set of averages had its own distribution. What we want to know is if the average run time is less than or equal to the desired performance value, considering that we can't know the real average run time, only an estimate of its distribution.

The mean of this distribution is the mean of the data set, and the standard deviation of the mean is the standard deviation of the data set divided by the square root of the number of samples in the data set. We can create a test statistic by taking the difference of the mean and expected value, and dividing this difference by the standard deviation of the mean. In Ruby we could calculate the test statistic like this:
module Statistics
  def self.test_statistic(data, expected_value)
    mu = mean data      # sample mean
    sigma = stdev data  # sample standard deviation
    # Difference from the expected value, measured in units of the
    # standard deviation of the mean (sigma / sqrt(n))
    (mu - expected_value) / (sigma / Math.sqrt(data.size))
  end
end
I've defined the mean and stdev methods previously. The result of this calculation is a value in standard deviations. You can look up the confidence level that corresponds to a given number of standard deviations in a normal distribution table. In our example, we're looking for the average run time to be less than an expected value, so the test statistic needs to fall below a certain negative value for a given confidence level. For a confidence level of 95%, the test statistic would have to be less than -1.645. This is called a lower-tailed test. An upper-tailed test is similar with the inequality reversed: the test statistic has to be greater than 1.645. A two-tailed test is used when you want to know if the average is indistinguishable from the expected value (or alternatively, if it is significantly different from the expected value).
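
Going back to the run-time example, here's a minimal sketch of that lower-tailed check. The run times and the 2.5 second target are made up for illustration, and mean and stdev are assumed to come from earlier in the series:
run_times = Array.new(60) { 2.2 + rand * 0.4 }  # stand-in data: 60 run times in seconds
target = 2.5

z = Statistics.test_statistic(run_times, target)

# Lower-tailed test at 95% confidence: reject if z < -1.645
if z < -1.645
  puts "Run time is significantly below the target (z = #{format('%.2f', z)})"
else
  puts "Can't claim the run time beats the target (z = #{format('%.2f', z)})"
end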

In a two-tailed test, the required test statistic changes for a given confidence level. For example, at a confidence level of 95%, the test statistic would need to be within ±1.96 standard deviations. If the test statistic falls outside that range, you can say the difference between the mean and the expected value is statistically significant with a confidence level of 95%. The following graph shows what these rejection regions look like for a two-tailed test. A single-tailed test would exclude one or the other of the regions.

Graph of Probability Distribution of Mean with Rejection Regions
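
In code, the two-tailed check is just a test on the magnitude of the test statistic. A minimal sketch, where data and expected_value stand in for your measurements and target:
z = Statistics.test_statistic(data, expected_value)
# Two-tailed test at 95% confidence: reject if |z| > 1.96
significantly_different = z.abs > 1.96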

In formal statistics the mean being equivalent to the expected value is referred to as the null hypothesis, and if the test statistic falls within a rejection region, it is said that the null hypothesis is rejected. I've resisted using these terms because I find the double negative and generic terminology confusing, but there it is for completeness.

Now that we have a method of comparing the mean of a data set to an expected value, we can extend the method to compare the means of two data sets. The distributions of the two means will probably overlap somewhat, and we want to quantify that overlap to determine if they are significantly different (in the statistical sense). The following graph shows what we're dealing with:


Probability Distribution of Two Different Means

We can easily replace the expected value in the previous test statistic calculation with the second mean, but we have to do something more sophisticated with the standard deviations. We'll make use of our old friend, the square root of the sum of squares! This technique really is used a lot in statistics. In Ruby this calculation looks like this:
module Statistics
  def self.compare_test(data1, data2)
    mu1 = mean data1      # sample means
    sigma1 = stdev data1  # and sample standard deviations
    mu2 = mean data2
    sigma2 = stdev data2
    # Difference of the means, in units of the combined standard
    # deviation of the means (square root of the sum of squares)
    (mu1 - mu2) / Math.sqrt(sigma1**2 / data1.size + sigma2**2 / data2.size)
  end
end
The same kinds of rejection regions apply for this test, but to make that clearer, another example is in order. Suppose we are now optimizing the program we were measuring earlier, and we want to determine if the optimizations really improved the run time. We have a second set of run times with its own mean and standard deviation. Passing the original run times as data1 and the optimized run times as data2, we would want to see if the calculated test statistic is greater than a certain number of standard deviations, given a desired confidence level. For a confidence level of 95%, the test statistic would need to be larger than 1.645 standard deviations, which corresponds to an upper-tailed test. If you only need to show that the two data sets are different, you can use a two-tailed test.
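
Here's a hypothetical sketch of that comparison, with made-up run times for the original and optimized builds:
before = Array.new(60) { 2.5 + rand * 0.3 }  # stand-in data: original run times in seconds
after  = Array.new(60) { 2.3 + rand * 0.3 }  # stand-in data: optimized run times

z = Statistics.compare_test(before, after)

# Upper-tailed test at 95% confidence: the optimization helped if z > 1.645
if z > 1.645
  puts "Statistically significant improvement (z = #{format('%.2f', z)})"
else
  puts "No statistically significant improvement (z = #{format('%.2f', z)})"
end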

One issue that I haven't addressed yet is that the relationship between the test statistic and the confidence level changes depending on how large the data sets are. This whole discussion has assumed fairly large data sets of more than about 50 samples. The tests we have been using for these large data sets are called z-tests. If your data sets are smaller, you can do a very similar t-test that uses adjusted critical values for the test statistic. Those critical values can be found in a t distribution table, so you can handle data sets with smaller sample sizes.
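
As a sketch of how that adjustment might look in code, here's a small lookup built from a handful of one-tailed 95% entries in a standard t distribution table. The exact entries kept and the cutoff for switching back to a z-test are choices made for illustration:
module Statistics
  # Selected one-tailed 95% critical values from a standard t table,
  # keyed by degrees of freedom (sample size minus one)
  T_CRITICAL_95 = { 1 => 6.314, 2 => 2.920, 5 => 2.015, 10 => 1.812,
                    20 => 1.725, 30 => 1.697, 50 => 1.676 }

  def self.critical_value_95(sample_size)
    df = sample_size - 1
    raise ArgumentError, "need at least two samples" if df < 1
    return 1.645 if df > 50  # large enough to treat as a z-test
    # Use the largest tabulated df that doesn't exceed ours; rounding
    # the critical value up errs on the conservative side
    key = T_CRITICAL_95.keys.sort.reverse.find { |k| k <= df }
    T_CRITICAL_95[key]
  end
end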

To wrap up, statistical significance is a way to determine if the mean of a data set is different from an expected value or the mean of another data set, given the variance of the data and the number of samples in the data. It's a useful tool for figuring out if it's appropriate to make claims about the data, or if there is too much variation or too little difference to say anything conclusive.

This type of testing only scratches the surface of hypothesis testing; analysis of variance is an even more advanced method for determining the significance of differences across multiple data sets. I encourage you to explore these statistical methods in more detail, but z-tests and t-tests are solid tools for everyday statistical use. Next week we'll move on to another fundamental tool of statistical data analysis, linear regression.
