Archive for November, 2009

More on reliability

Lessons
 Posted by jeremy on November 16th, 2009

My last post questioned the analogy that calls a broken clock reliable because it is consistent. I claimed that reliability is consistency vis-à-vis the variable being measured, so a clock that doesn’t change as the time changes is not reliable. Additionally, I noted that a broken clock is even inconsistent in the degree of inaccuracy it displays: Twice a day the clock will have no error, but at all other times it will have varying degrees of error in its measure of time.

I’d like to continue with another analogy to further explore this construct of reliability.

Jon used a broken scale in his illustration of reliability:

If a person weighs 200 pounds and the bathroom scale says they weigh 200 pounds, then the measurement is valid. However, if the scale indicates different weights each time a person steps on it (even though their weight hasn’t changed), the measurement isn’t reliable. On the other hand, if the scale consistently indicates that a 200 pound person weighs only 150 pounds every time they weigh themselves, the measure is reliable (consistent) but not valid (accurate).

While the scale analogy is just as flawed as the clock analogy (a functioning scale would have to be classified as unreliable), extrapolating from it will reveal some limitations in the concept of reliability. Clocks measure a unique variable in that time is circular. Every twelve hours we come back to the same time, so even a broken clock is right twice a day. Weight, on the other hand, is a linear variable, and there is no instrument that produces reliable measures of weight across the range of possible weight values.

Don’t believe me? Drive your car across your bathroom scale and see what happens; or go weigh yourself on a truck scale.

Every good instrument has a range of values in which it reliably measures the target variable. Outside that range the results may be extremely consistent, but they are not consistent reflections of the target variable. For example, your bathroom scale will read anything over its maximum reading as equal to that maximum value. Though I may weigh 472 pounds, your scale tells me I’m only 300.
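To make the ceiling concrete, here is a minimal sketch of an idealized bathroom scale. The 300-pound maximum comes from the example above; the function name and the assumption that the scale is perfectly accurate inside its range are mine, purely for illustration.

```python
SCALE_MAX_LB = 300.0  # assumed ceiling, matching the 300-pound example above


def bathroom_scale_reading(true_weight_lb: float) -> float:
    """Idealized scale: exact within its range, pegged at the maximum above it."""
    return min(max(true_weight_lb, 0.0), SCALE_MAX_LB)


for true_weight in (150, 200, 300, 472, 3000):
    reading = bathroom_scale_reading(true_weight)
    error = reading - true_weight
    print(f"true={true_weight:>5} lb  reading={reading:>6.1f} lb  error={error:>8.1f} lb")
```

Every weight above the ceiling produces the same reading, so the readings are perfectly consistent with each other while the error grows with the true weight.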

The same is true of any assessment or psychological scale: Though the test manual may declare that past results had a .93 alpha coefficient or that trained raters agreed 87% of the time, these figures represent averages across all the scores that were recorded. It may be that the raters agreed mostly on the scores that were well above the cutoff score, or that the lowest error occurred for students with very low trait levels. Coefficient alpha, in particular, should be interpreted as a maximum level of score reliability.
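Here is a rough sketch of how a single agreement figure can hide that kind of unevenness. The scores below are invented for illustration (a 1-to-5 rubric with a cutoff of 3); they are not from any real assessment, and the helper name is hypothetical.

```python
# Invented scores on a 1-5 rubric, for illustration only.
rater_a = [5, 5, 4, 5, 4, 5, 4, 5, 3, 2, 3, 2]
rater_b = [5, 5, 4, 5, 4, 5, 4, 5, 2, 3, 3, 1]
CUTOFF = 3  # assumed cutoff score


def agreement_rate(pairs):
    """Share of score pairs on which the two raters gave the same score."""
    pairs = list(pairs)
    return sum(a == b for a, b in pairs) / len(pairs)


pairs = list(zip(rater_a, rater_b))
# Split the papers by whether the average of the two ratings clears the cutoff.
above_cutoff = [(a, b) for a, b in pairs if (a + b) / 2 > CUTOFF]
at_or_below_cutoff = [(a, b) for a, b in pairs if (a + b) / 2 <= CUTOFF]

print(f"overall agreement:   {agreement_rate(pairs):.0%}")
print(f"above the cutoff:    {agreement_rate(above_cutoff):.0%}")
print(f"at or below cutoff:  {agreement_rate(at_or_below_cutoff):.0%}")
```

With these made-up numbers the overall agreement is 75%, but it splits into 100% for papers well above the cutoff and 25% for papers at or below it, which is exactly where the decision matters.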

Estimates of reliability, which are the product of group scores, are thus less applicable to an individual student’s score.

But given an adequate sample size, we could estimate a scale’s reliability across the levels of the target variable. Item response theory has a function for that: It’s called the information curve.

[Figure: information curve]

This particular information curve is from the results of my department’s dispositions assessment. Teacher candidates self-rate on important professional attitudes, and we use the anonymous results for program improvement. What’s useful in this information curve is that the most consistent scores are for students slightly below average (average is zero on the x-axis). Since we can do something about these low scores, this curve is encouraging. Were the most informative scores to come from students with high levels of professional dispositions, the scores would be less useful for advising programmatic improvements.
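For readers who want to see where such a curve comes from, here is a minimal sketch that assumes a two-parameter logistic (2PL) model, where an item’s information at trait level θ is a²·P(θ)·(1 − P(θ)) and the test information is the sum over items. The choice of model and the item parameters are assumptions for illustration; they are not the parameters of the dispositions assessment.

```python
import math


def item_information(theta: float, a: float, b: float) -> float:
    """2PL item information: a^2 * P * (1 - P), with P the logistic response curve."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


# Hypothetical items as (discrimination a, difficulty b), skewed slightly below average.
ITEMS = [(1.2, -1.0), (1.5, -0.5), (1.0, 0.0), (0.8, 0.5)]


def total_information(theta: float, items=ITEMS) -> float:
    """Test information is the sum of the item informations at theta."""
    return sum(item_information(theta, a, b) for a, b in items)


def standard_error(theta: float, items=ITEMS) -> float:
    """Standard error of the trait estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))


thetas = [x / 2 for x in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
for theta in thetas:
    print(f"theta={theta:+.1f}  info={total_information(theta):.3f}  SE={standard_error(theta):.3f}")

peak = max(thetas, key=total_information)
print(f"most informative trait level on this grid: {peak:+.1f}")
```

With these invented items the curve peaks at θ = −0.5, slightly below average, and the standard error grows as you move toward either extreme. That is the sense in which the information curve describes precision level by level instead of as one number for the whole group.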

No, a broken clock is not reliable

Lessons
 Posted by jeremy on November 14th, 2009

There’s a very common – and very inaccurate – analogy about reliability that spreads like a cancer. Though it is easy to understand, the metaphor conveys several misconceptions about reliability, and it’s worth debunking whenever you hear or read it.

First, a little background. The terms reliability and validity are old and used in several contexts, each of which adds distinct nuances. Most scholars learn the concepts in the context of research methods (sorry, philosophers), and it is from that field that this horrible analogy originates.

A broken clock is reliable because reliability is consistency of measure. A broken clock always gives you the same reading; it’s very consistent. But a broken clock would only be valid twice a day: when the actual time happens to be the time displayed on the broken clock.

I won’t go into validity today (I think I’ve beaten that drum to death), but I will use this flawed analogy to demonstrate why reliability can only refer to a specific type of consistency. My one-line response to this analogy is…

So a working and calibrated clock wouldn’t be reliable because it’s inconsistent? I mean, every time I look at it, it gives me a different reading.

The consistency in reliability refers to the consistency in measuring a variable. In the context of the analogy, the variable is time, and a clock that doesn’t change as the time changes does not consistently measure the target variable. Twice a day, when the clock reflects the actual time, the measure is accurate (it has no error), but five minutes later it is less accurate (it has some error), and six hours later it is completely inaccurate (it has extreme error). Because of this inconsistency in its measure of time, a broken clock cannot be deemed reliable.
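If you’d like to check the arithmetic, here is a small sketch that tracks a stopped clock’s error across one trip around a 12-hour dial. The frozen time of 3:00 and the function name are hypothetical, and error is measured as the shortest distance around the dial in minutes.

```python
CYCLE_MIN = 12 * 60  # minutes in one trip around a 12-hour dial


def clock_error_minutes(displayed_min: int, actual_min: int) -> int:
    """Shortest distance around the dial between displayed and actual time."""
    diff = abs(displayed_min - actual_min) % CYCLE_MIN
    return min(diff, CYCLE_MIN - diff)


STOPPED_AT = 3 * 60  # hypothetical: hands frozen at 3:00

actual_times = range(0, CYCLE_MIN, 5)  # every five minutes across one cycle
stopped_errors = [clock_error_minutes(STOPPED_AT, t) for t in actual_times]
working_errors = [clock_error_minutes(t, t) for t in actual_times]

print("stopped clock: min error", min(stopped_errors), "max error", max(stopped_errors))
print("working clock: min error", min(working_errors), "max error", max(working_errors))
```

The stopped clock’s reading never changes, yet its error runs from 0 minutes up to 360 minutes (six hours) over the cycle. The working clock’s reading changes constantly, and its error stays at zero. Consistency of the reading and consistency in measuring the variable are two different things.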

The case of the broken clock also reveals an important but often overlooked detail about reliability: It’s not the instrument (the clock) that is more or less reliable, but the results, readings, or information that the instrument produces. It would be more appropriate to say that the time displayed on a broken clock is not reliable.

If anyone asks, “but what about test information in item response theory?” that’s going to be my next post. Let’s see if I can paint myself out of that corner.

News flash: Computers unable to judge quality; can humans?

In the News
 Posted by jeremy on November 13th, 2009

Today the London Times reports that submitting some great work by Hemingway and Golding to a paper-grading computer resulted in low marks on all counts. If a youngster completes an exam with the same flair and style as these established classics, they fail. What, then, would pass?

Mr Herbert said that some children in America had “cracked the code” by learning to write in a style that the computer recognised. This was called “schmoozing the computer”, he said. “At the moment we do not have a reliable and valid way of assessing English language using a software package, although this is something for which there is demand.”

Technophobes out there will see this as more fodder for ditching computers in education, but shouldn’t we compare the computer’s work to the output of the humans it was meant to replace?

A few years ago a British writer submitted chapters of Kipling and Milne for publication as if they were his own creation. Each was rejected with comments about how it would never sell in today’s marketplace. I wish I could find a link…

Additionally, the rubrics for many state tests require a single essay format (the five-paragraph essay) and will dock points for anything marginally creative. By the way, not one of Michel de Montaigne’s essays was in the five-paragraph style.

So, while it has been established that computers are lousy at judging literary quality, there is no evidence to support humans’ ability to do so.