More on reliability

Lessons
 Posted by jeremy on November 16th, 2009

My last post questioned the analogy that calls a broken clock reliable because it is consistent. I claimed that reliability is consistency vis-à-vis the variable being measured, so a clock that doesn’t change as the time changes is not reliable. Additionally, I noted that a broken clock is even inconsistent in the degree of inaccuracy it displays: Twice a day the clock will have no error, but at all other times it will have varying degrees of error in its measure of time.

I’d like to continue with another analogy to further explore this construct of reliability.

Jon used a broken scale in his illustration of reliability:

If a person weighs 200 pounds and the bathroom scale says they weigh 200 pounds, then the measurement is valid. However, if the scale indicates different weights each time a person steps on it (even though their weight hasn’t changed), the measurement isn’t reliable. On the other hand, if the scale consistently indicates that a 200 pound person weighs only 150 pounds every time they weigh themselves, the measure is reliable (consistent) but not valid (accurate).

While the scale analogy is just as flawed as the clock analogy (a functioning scale would have to be classified as unreliable), extrapolating from it will reveal some limitations in the concept of reliability. Clocks measure a unique variable in that time is circular. Every twelve hours we come back to the same time, so even a broken clock is right twice a day. Weight, on the other hand, is a linear variable, and there is no instrument that produces reliable measures of weight across the range of possible weight values.

Don’t believe me? Drive your car across your bathroom scale and see what happens; or go weigh yourself on a truck scale.

Every good instrument has a range of values in which it reliably measures the target variable. Outside that range the results may be extremely consistent, but they are not consistent reflections of the target variable. For example, your bathroom scale will read anything over its maximum reading as equal to that maximum value. Though I may weigh 472 pounds, your scale tells me I’m only 300.
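To make the ceiling effect concrete, here is a minimal sketch of a scale that is accurate up to its capacity and pegged at the maximum beyond it. The function name and the 300-pound capacity are assumptions for illustration (the capacity is borrowed from the example above), not a model of any real scale.

```python
def bathroom_scale_reading(true_weight, capacity=300.0):
    """Hypothetical scale: accurate up to `capacity`, pegged at `capacity` above it."""
    return min(max(true_weight, 0.0), capacity)

# Illustrative true weights in pounds: two people in range, one at the limit,
# one over the limit, and something roughly car-sized.
true_weights = [150, 200, 300, 472, 3500]
readings = [bathroom_scale_reading(w) for w in true_weights]

for w, r in zip(true_weights, readings):
    print(f"true weight {w:>5} lb -> scale reads {r:>5} lb")
```

Every weight above the capacity produces the same reading every time. The readings are perfectly consistent with each other, yet they no longer change when the weight changes.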

The same is true of any assessment or psychological scale: Though the test manual may declare that past results had a .93 alpha coefficient or that trained raters agreed 87% of the time, these figures represent averages across all the scores that were recorded. It may be that the raters agreed mostly on the scores that were well above the cutoff score, or that the lowest error occurred for students with very low trait levels. Coefficient alpha, in particular, should be interpreted as a maximum level of score reliability.
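A small sketch shows how an overall agreement figure can hide uneven agreement across score levels. The rating pairs, the 1-to-10 scale, and the cutoff of 5 below are all made up for illustration; they are not data from any real assessment.

```python
# Made-up (rater_a, rater_b) pairs on a hypothetical 1-10 scale. Not real data.
ratings = [
    (9, 9), (8, 8), (9, 9), (10, 10), (8, 8), (9, 9),  # well above the cutoff
    (5, 6), (6, 4), (5, 5), (4, 6),                    # close to the cutoff
]
CUTOFF = 5  # hypothetical pass/fail cutoff

def agreement_rate(pairs):
    """Share of pairs in which both raters gave exactly the same score."""
    return sum(a == b for a, b in pairs) / len(pairs)

# Split on rater A's score: within one point of the cutoff, or further away.
near_cutoff = [p for p in ratings if abs(p[0] - CUTOFF) <= 1]
far_from_cutoff = [p for p in ratings if abs(p[0] - CUTOFF) > 1]

overall = agreement_rate(ratings)
near = agreement_rate(near_cutoff)
far = agreement_rate(far_from_cutoff)

print(f"overall: {overall:.0%}, near cutoff: {near:.0%}, far from cutoff: {far:.0%}")
```

In this invented example the overall rate looks respectable, but almost all of the agreement comes from scores far from the cutoff, which is exactly where agreement matters least.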

Estimates of reliability, which are the product of group scores, are thus less applicable to an individual student’s score.

But given an adequate sample size we could estimate a scale’s reliability across the levels of the target variable. Item response theory has a function for that: It’s called the information curve.
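For readers who want to see how such a curve is computed, here is a rough sketch under the two-parameter logistic (2PL) model, where an item with discrimination a and difficulty b contributes information a² · P · (1 − P) at trait level theta, and the curve is the sum over items. The choice of the 2PL model and the item parameters are assumptions for illustration; a self-rating instrument would more likely call for a polytomous model, and none of these numbers come from a real assessment.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def total_information(theta, items):
    """Information curve value at theta: the sum of the item informations."""
    return sum(item_information(theta, a, b) for a, b in items)

# Made-up (discrimination, difficulty) pairs. Not real item parameters.
items = [(1.2, -1.0), (1.5, -0.5), (1.0, 0.0), (0.8, 0.5)]

# Evaluate the curve on a grid from -3 to 3 in steps of 0.5.
thetas = [x / 2 for x in range(-6, 7)]
curve = [(t, total_information(t, items)) for t in thetas]

# More information means a smaller standard error: SE = 1 / sqrt(information).
peak_theta, peak_info = max(curve, key=lambda pair: pair[1])
standard_error_at_peak = 1.0 / math.sqrt(peak_info)

for t, info in curve:
    print(f"theta {t:+.1f}: information {info:.3f}")
```

The standard error of a trait estimate is smallest where the curve peaks, so the peak marks the trait levels at which scores are most consistent.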

[Figure: information curve]

This particular information curve is from the results of my department’s dispositions assessment. Teacher candidates self-rate on important professional attitudes, and we use the anonymous results for program improvement. What’s useful in this information curve is that the most consistent scores are for students slightly below average (the average is zero on the x-axis). Since we can do something about these low scores, this curve is encouraging. Were the most informative scores to come from students with high levels of professional dispositions, the scores would be less useful for advising programmatic improvements.
