Everyday Statistics for Programmers: Correlation

Last week I went through the basics of linear regression, and I touched on correlation but didn't get into it. That's because correlation may seem fairly simple, but the simplicity of the calculation disguises a vast field of complexity and logical traps that even professional researchers, statisticians, and economists routinely fall into. I'm not claiming to know more than the people who live and breathe correlation every day, but the issues are well known. Knowing them can raise our general awareness of when certain conclusions are warranted and when they go too far.

I Think, Therefore I Correlate

One of the reasons why correlation is so deceptive is that we see correlations all the time. A correlation is basically a relationship between two things. If two separate things grow or increase or change at the same time, they are correlated. If one thing increases and another thing decreases at the same time, they are also correlated.

Our pattern-matching brains are wired to detect this behavior. It's how we've survived and developed throughout our history. When our distant ancestors would eat something and then get sick, they recognized that they shouldn't eat that plant or animal anymore. When they would eat or drink certain things when they were sick, or they would apply different plants to wounds, and then get better, they recognized that as well and passed on that information as early medicine. When they started planting seeds and developed early agriculture, they paid attention to what practices would result in better crop yields. All of these things are correlations, and more often than not, seeing those correlations could mean the difference between life and death. It's ingrained in us.

Even though seeing correlations is a good thing, it's also very easy to develop false correlations. Our ancestors used to believe comets were bad omens. The coming of a comet meant that something terrible was about to happen. Many ancient cultures believed (and some still do) that various traditions, like dancing, could bring the rains so their crops would grow. There are innumerable examples of other false correlations different cultures have believed at different times, and plenty survive to this day. We even collect them as old wives' tales and pass them on half-jokingly to future generations.

Have you ever heard the joke about the guy who was walking down the street in a big city, flapping his arms wildly up and down? Another man walks up to him and asks why he's doing that. The first guy, still flapping his arms, says, "To keep the alligators away." The second guy says, "But there aren't any alligators here." The first guy responds, "See, it's working!" That's false correlation.

These false correlations come about when two things happen at the same time, and someone attaches too much significance to the fact that those two events just happened to coincide. The person witnessing the events doesn't realize that there could be some other explanation for what they experienced, and they go on to convince enough other people that the two things are related that it becomes common knowledge.

Correlate Wisely

When we're looking at data and calculating correlations, or even reading studies done by others, we need to be careful not to be fooled. It's harder than it sounds. Correlations are easy to calculate, and many things can seem correlated when they're not, either by coincidence or because something else that you're not paying attention to is systematically causing the correlation that you're observing. A false correlation can be all the more difficult to recognize when internal biases or external influences make you want to believe that it's there. On top of that, your brain is wired to see the pattern even if it's not the right explanation, and you have to fight against this instinct to get to the truth.

You should first decide whether the correlation makes any sense at all. Is there a plausible mechanism that could explain the connection? Are the two things related in any way that would make the correlation reasonable? Is there anything that would directly contradict the correlation that you're seeing? Is it extraordinary that these two things are correlated? Remember, extraordinary claims require extraordinary evidence.

If the correlation is plausible, then you might start thinking about which of the variables in the correlation is dependent on the other. While you can do this, you must be careful because this line of thought leads to another big way that correlations are used incorrectly. If you've read enough studies that use statistics, sooner or later you'll hear the phrase, "correlation does not imply causation." There are other analysis tools that can get closer to proving the claim that A causes B, but correlation is not one of them. All correlation says is that A and B are related in some way, nothing more. To make stronger claims about dependence, more work will be involved.

It may seem obvious that one thing causes the other in the system you're studying, but it may still be possible for causation to run in the other direction, or an entirely different cause may explain why both of the variables being measured are moving together. One recent, high-profile example of this issue occurred after the financial crisis of 2008. A couple of economists, Carmen Reinhart and Kenneth Rogoff, set out to measure the relationship between government debt and GDP growth. They found a negative correlation, meaning that as government debt went up, a country's GDP growth tended to go down, and they concluded that government debt was causing the slowdown in GDP growth.

It turned out that there were some errors in their data and calculations, so the correlation was actually much weaker than they thought, but even before that there was a big debate about which way causation actually ran. It's fairly easy to argue that for many of the countries in the study, they suffered lower GDP growth or even negative growth that drove up their government debts, instead of the other way around. In reality causation ran in both directions, and it was highly context dependent by country. Some countries started with high government debt before going into recession, and others went into recessions that ballooned their debt. Correlation couldn't show any of these details.

Correlation In Practice

The ease with which correlation analysis can go wrong doesn't mean correlation isn't useful. It is very useful as a starting point for further analysis. Correlations can show if a relationship actually exists so you can investigate it further, or if you're barking up the wrong tree and need to look elsewhere. So how do we calculate a correlation, given a set of data with X-values and Y-values for two different variables that we want to compare? Here it is, implemented in Ruby once again:

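module Statistics
  # Correlation coefficient of the paired samples x and y.
  # Relies on s_xy() and s_xx() as defined in last week's regression post.
  def self.r(x, y)
    # s_xx(y) is the same sum-of-squares calculation applied to the y-values.
    s_xy(x, y) / (Math.sqrt(s_xx(x)) * Math.sqrt(s_xx(y)))
  end
end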

I defined the methods s_xy() and s_xx() last week for calculating the r-squared value of a regression analysis, so at this point we're getting pretty far along in statistical equations, using earlier calculations to construct new equations. In fact, this correlation coefficient is closely related to the r-squared value. Square the correlation coefficient and you get the r-squared value of the linear regression, which is why the method is named r in the code above.
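Since s_xy() and s_xx() aren't repeated in this post, here is a minimal self-contained sketch of the whole calculation. It assumes those helpers are the usual sums of products of deviations from the mean; the module name CorrelationSketch, the mean() helper, the helper bodies, and the sample data are all my own stand-ins for illustration, not necessarily what last week's code looked like.

```ruby
# Sketch only: the helper bodies below are assumed, not copied from the earlier post.
module CorrelationSketch
  # Arithmetic mean, forced to floating point to avoid integer division.
  def self.mean(values)
    values.sum.to_f / values.size
  end

  # Sum of products of deviations: sum((x_i - mean_x) * (y_i - mean_y)).
  def self.s_xy(x, y)
    mean_x = mean(x)
    mean_y = mean(y)
    x.zip(y).sum { |xi, yi| (xi - mean_x) * (yi - mean_y) }
  end

  # Sum of squared deviations from the mean.
  def self.s_xx(x)
    s_xy(x, x)
  end

  # Correlation coefficient, same formula as Statistics.r above.
  def self.r(x, y)
    s_xy(x, y) / Math.sqrt(s_xx(x) * s_xx(y))
  end
end

# Made-up data, purely for illustration.
hours  = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = CorrelationSketch.r(hours, scores)
puts r.round(4)        # prints 0.7746
puts (r**2).round(4)   # the corresponding r-squared value, 0.6
```

Note that this sketch makes no attempt to handle edge cases, such as a variable that never changes (which would make the denominator zero) or arrays of different lengths.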

Both the correlation coefficient and the r-squared value give an indication of how strongly linear a data set is, but the correlation coefficient has one property that the r-squared value doesn't have. It can be negative, and a negative correlation coefficient happens when one variable increases while the other decreases, i.e. the scatter plot shows a negative-trending slope. So if the correlation is +1 or -1, the data is perfectly linear with a positive or negative slope, respectively.

If the correlation is close to zero, then the data is either uncorrelated or it might have a nonlinear relationship, and you have to check the scatter plot to decide which. The closer the correlation is to +/-1, the more linear the relationship is, but to get a better idea of that linearity, you should square the correlation coefficient. This operation gives you the r-squared value, which is the fraction of the variation in one variable that is accounted for by a straight-line fit against the other.
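To make the last two paragraphs concrete, here is a small sketch with made-up data. It uses a compact stand-in for the r method (called pearson_r here, a name I picked for this example) so that it runs on its own. The point to notice is the parabola: y is completely determined by x, yet the correlation coefficient comes out as zero because the relationship isn't linear.

```ruby
# Compact stand-in for Statistics.r so this snippet runs by itself.
# The name pearson_r is hypothetical, chosen for this example.
def pearson_r(x, y)
  mean_x = x.sum.to_f / x.size
  mean_y = y.sum.to_f / y.size
  dx = x.map { |v| v - mean_x }   # deviations of x from its mean
  dy = y.map { |v| v - mean_y }   # deviations of y from its mean
  s_xy = dx.zip(dy).sum { |a, b| a * b }
  s_xx = dx.sum { |a| a * a }
  s_yy = dy.sum { |b| b * b }
  s_xy / Math.sqrt(s_xx * s_yy)
end

x = [-2, -1, 0, 1, 2]

rising   = x.map { |v| 3 * v + 1 }    # perfectly linear, positive slope
falling  = x.map { |v| -3 * v + 1 }   # perfectly linear, negative slope
parabola = x.map { |v| v * v }        # fully determined by x, but not linear

puts pearson_r(x, rising)     # very close to +1
puts pearson_r(x, falling)    # very close to -1
puts pearson_r(x, parabola)   # very close to 0
```

A near-zero result like the last one is exactly the case where the scatter plot tells you something the single number can't.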

Go Forth And Correlate

If you take away anything from this discussion, remember that correlation is quite useful if used wisely, but it is also incredibly easy to misuse. Correlation does not imply causation, no matter how much you want to believe that causation runs a certain way in your data. Unless there is a known mechanism that explains why one variable depends on the other, you need to perform other analyses before you can claim causation. If your data has a time component to it, there are a number of time series analyses and cross-correlation analyses that could uncover more details in the data related to causation. Other types of data require other more advanced statistical techniques to figure out causation. In any case, examining the correlation of the data is a good place to start, and will generally point you in the right direction.

We're getting close to the end of this miniseries on statistical tools. Next week I'll wrap up with a couple of ways to analyze data that's not linear, but looks like it could fit some other type of curve.

Lucid Mesh
