I’m using this at work, and I’m having to explain it a lot, so I’m writing this entry to clarify and simplify my own understanding of the principle, as well as to have a place to send people who ask.
Pearson’s Correlation Coefficient (r, for samples) measures the degree and direction of the linear relationship between two variables. It ranges from perfect positive correlation (1.0) through no correlation (0.0) to perfect negative correlation (-1.0). If the correlation is positive, the first variable increases or decreases as the second increases or decreases. If the correlation is negative, the first variable increases as the second variable decreases, and vice versa.
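For reference, here is the usual textbook form of the sample coefficient. The notation is my assumption, since none appears above: x_i and y_i are the two values for case i, x-bar and y-bar are the means of each variable, and n is the number of cases.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to sit on the same side of their means together and negative when they tend to sit on opposite sides. The denominator scales the result into the range from -1.0 to 1.0.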
Correlation can be understood, to an extent, by the rank order of the first and second variables. If Johnny and Jane score 75% and 80% respectively on a pretest, and then 85% and 95% respectively on the post test, r would be a perfect 1.0 because Johnny’s score was lower than Jane’s on both tests. This would hold true even if Johnny scores 94% (still lower than Jane’s) on the post test, but not if his post test score exceeds Jane’s. At that point, r is a perfect -1.0. With only two cases, r will always be 1.0 or -1.0 (as long as the two scores differ on each test), but of course, such clear cases are rare.
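The Johnny and Jane example can be checked with a few lines of Python. This is only a minimal sketch: pearson_r is a helper name I made up for illustration (not a library function), and it assumes the standard sample formula, the sum of paired deviations from the means divided by the square root of the product of the summed squared deviations.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


pretest = [75, 80]  # Johnny, Jane

print(pearson_r(pretest, [85, 95]))  # Johnny lower on both tests
print(pearson_r(pretest, [94, 95]))  # Johnny still lower, but barely
print(pearson_r(pretest, [96, 95]))  # Johnny overtakes Jane
```

The three calls print 1.0, 1.0, and -1.0. With two cases, only the order matters, not the size of the gap.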
Extreme r can be visualized by plotting a line for each case from its first variable value to its second variable value. Here are three graphs for simplistic 1.0, 0.0, and -1.0 correlations. It’s a bit more difficult to spot non-extreme correlations with large numbers of cases:
166 cases, r= .71
166 cases, r= -.70
166 cases, r= -.13
For a real-world example, we looked at two variables (intensity of study and score on admissions test) for 166 Middle East language students. We found basically no correlation between the two variables (r = -.04). Here’s the graph, and a close-up of the chaos.
Notes:
It’s cliché, but correlation does not mean causation. The sun coming up may correlate with an increase in sewer pressure, but the sun has little direct effect on this. All r indicates is how two variables relate to each other, not what causes the relationship.
I once heard an M.D. on the news say, “80% of type-2 diabetics are obese, so it doesn’t take a Ph.D. to see the connection.” Well, actually, I guess it does. A Ph.D. (sh|w)ould point out that the two conditions may actually be caused by a third, present in both cases. (I’m not saying obesity doesn’t lead to diabetes, just that this statement was not a good argument for it.) Sadly, we as a society are conditioned to think in terms of percentages, “grade levels”, etc., and very few of us are not convinced by their face validity.
Correlation cannot be determined if the sample size is 1, because the standard deviation of a single case is 0, and computing correlation requires dividing by the standard deviation. One cannot divide by 0.
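Here is a rough Python sketch of where the division by zero shows up. The function name try_pearson_r is hypothetical, and the denominator is written with sums of squared deviations, which are zero exactly when the standard deviation is zero.

```python
import math


def try_pearson_r(xs, ys):
    """Return r, or None when the denominator of the formula is zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    try:
        return cross / math.sqrt(ss_x * ss_y)
    except ZeroDivisionError:
        # At least one variable has no spread at all, so r is undefined.
        return None


print(try_pearson_r([75], [85]))          # one case
print(try_pearson_r([75, 80], [85, 95]))  # two cases
```

The first call prints None, because a single score is its own mean and every deviation is zero. The second prints 1.0. The same failure occurs with any number of cases if one of the variables never changes.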
r can be tested for significance, but its square is often considered more important because it can be interpreted as the percentage of variance in the second variable that can be accounted for by the first (and vice versa). Thus r² is called the coefficient of determination.
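To make the “variance accounted for” reading concrete, here is a small Python sketch that computes r squared two ways: directly from the correlation formula, and as the share of variance in the second variable explained by a least-squares straight line fitted on the first. The function name and the five data points are made up purely for illustration and are not from any real study.

```python
import math


def r_squared_two_ways(xs, ys):
    """Return (r squared, share of variance in ys explained by a fitted line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)

    # Way 1: square the correlation coefficient.
    r = cross / math.sqrt(ss_x * ss_y)

    # Way 2: fit y = intercept + slope * x by least squares, then compare
    # the leftover (residual) variation with the total variation in ys.
    slope = cross / ss_x
    intercept = mean_y - slope * mean_x
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    explained = 1 - ss_residual / ss_y

    return r ** 2, explained


hours = [1, 2, 3, 4, 5]         # invented example data
scores = [52, 60, 57, 70, 68]   # invented example data

print(r_squared_two_ways(hours, scores))
```

The two numbers agree up to floating-point rounding, which is the sense in which r squared is a proportion of variance.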