Archive for the ‘Lessons’ Category

The Research Design of Teacher Evaluations

 Posted by jeremy on September 28th, 2011

This week’s session of Understanding Educational Research included Trochim’s great online primer on research design. Before diagramming the various designs in articles we had found, we reviewed the classic models:

The controlled, randomized experiment (R = random assignment, O = observation, X = treatment):

R O X O
R O   O

The quasi-experiment (N = nonequivalent groups):

N O X O
N O   O

The longitudinal study:

O O O O O

And, of course, the non-experiment:

X O

As we discussed the strengths and weaknesses of each model, we tried to think of research purposes for which each design was suited. Of course, only one of these models can by itself establish causality, which is why it is the “gold standard” of educational research. Both quasi-experiments and longitudinal studies are useful for establishing trends and formulating hypotheses (which are then tested with the first model), but what about that last model?

The non-experiment is pretty much useless. Sure, you give some treatment and you gather some data, but one of two fatal flaws lingers: Either (1) the point of the study was to reveal something about the treatment, but, without a pretest, we can’t say anything about it; or (2) the treatment wasn’t the focus of the study, in which case it shouldn’t be part of the design.

Towards the end of our discussion, one of the literacy masters students shouted, “The treatment is irrelevant!” I thought that was so astute that I added it to the slide with an APA-formatted citation to the student’s comment.

The students, most of whom are in-practice teachers, saw this two slides later:

Teacher Evaluations → X O

A study that claimed to establish the effect of any treatment without at least a pre-test would have a hard time getting published in an academic journal. Yet that is the model we’ve been using for ten years to determine whether teachers, schools, districts, and states are effective.

Some states have moved to the “growth model” for teacher evaluations, which would look like this:

O X O

The major improvement with this model is that we can now establish the change that occurred during the treatment period. On the other hand, it does not tell us the degree to which the treatment was responsible for the observed change. This model is still not appropriate if its results are to be used to hold individuals and groups accountable.
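
To see why the growth model still can’t attribute the observed change to the treatment, here is a toy Python simulation (all numbers hypothetical, not from any real evaluation): students in a one-group pretest-posttest (O X O) design mature over the year regardless of instruction, so “growth” shows up even when the treatment effect is exactly zero.

```python
import random

random.seed(1)

def simulate_growth(n=1000, treatment_effect=0.0, maturation=5.0):
    """Average observed gain in a one-group pretest-posttest (O X O) study.

    Each student matures by about 5 points over the year no matter what
    the teacher does, so the pre-to-post change looks the same whether
    the treatment effect is zero or not.
    """
    gains = []
    for _ in range(n):
        pretest = random.gauss(50, 10)
        posttest = pretest + maturation + treatment_effect + random.gauss(0, 3)
        gains.append(posttest - pretest)
    return sum(gains) / n

print(f"Observed 'growth' with ZERO treatment effect: "
      f"{simulate_growth(treatment_effect=0.0):.1f} points")
```

Without a comparison group, the ~5-point “growth” is indistinguishable from an effective treatment; a randomized control group is what lets us subtract the maturation out.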

It is becoming sadly commonplace to see policymakers misuse social science methods, but, in this case, there is also a high degree of hypocrisy: At the same time state offices of education have attempted to discover teacher effectiveness via research models that cannot produce such evidence, those same offices have called for teachers to use only “evidence-based methods” of instruction. It’s a classic case of do-as-we-say-not-as-we-do.

Conference on ed tech leads me to wonder about historical revisionism in the field

Lessons Opinion Reports
 Posted by jeremy on May 27th, 2011

I attended a regional conference on instruction and technology this week. I’ll admit I didn’t know what to expect – I had never been to anything smaller than a national conference – but I should have been able to predict that the sessions would be highly variable in their quality. Of course, there were the vendor demonstrations and the how-to reports devoid of data, but there were also more discussion-oriented presentations on issues of common-yet-unexplored techniques, and business reports on local issues.

I was satisfied with the quality of several of the presentations I attended, especially one on rubrics wherein, once the presenter discovered (by – gasp! – asking the audience questions) that most of the attendees had experience developing rubrics, she decided to let us do the talking. She would pose a question (which she knew to be the topic of her next slide), and then she would moderate our conversation. After four or five comments, she would flip to the next slide and show us that we had covered everything on it. Her effort in organizing the information wasn’t wasted just because the group in the room already knew it; see, we only knew it collectively, and she helped move that information from collective to individual.

Sadly, the worst hour I spent was during one keynote address. I’ve known for a while that the educational technology field is driven more by marketing hype (and the artificial demand it creates) than by the effectiveness of the technology. This presentation made me wonder about the degree to which historical revisionism has helped maintain that unhealthy producer-consumer relationship.

The presenter stated as fact that the early computer-assisted educational programs translated the production mentality of early computers into a one-size-fits-all approach to education. While there were mass-produced products of this type, it’s wrong to ignore the more instructionally appropriate efforts that tracked individual students and adjusted automatically to their needs.

Ten years ago I sat on a committee with some very bright professors in the field. We discussed requirements for instructional authoring systems and, as each “cutting edge” feature was brought up, someone would ask a distinguished professor, “Didn’t [insert thirty-year-old system] have that?” He would nod or give a simple, “yep” in response.

The ed tech field’s collective ignorance of the depth and meaningfulness of their roots is not only wrong because it slights the visionaries of decades past; it’s also wrong because it allows us to be wooed by technologies that appear new, but are only new in that industry has never pushed them before.

This ignorance is most observable in our field’s unabashed embrace of the mobile “app” model. Those of us who have given decades of effort to establishing open, standards-based technology thoroughfares should be revolted that making our wares available on monopolized toll-roads is now seen as a requirement. (For more reading, see The App is Crap.)

Assessation & Evalument

 Posted by jeremy on April 16th, 2011

I participated in a national web conference on assessment sponsored by a respected organization. Many of the talks were good, and one or two made some great points, but one in particular had some serious flaws. It’s generally my policy not to call people out, and my purpose here is not to critique the speaker, so I’m going to do my best not to identify him/her.

When this person presented his/her definition of “Assessment,” a participant sitting next to me asked me what I thought of it. “It’s half-right, but gets too much into evaluation,” was my response. As the presentation progressed, I noticed that the line between assessment and evaluation became so blurred that the speaker was using the terms interchangeably.

When the presentation was over, there was an opportunity for us to submit questions via a chat form. Mine was first in line: “What’s the difference between assessment and evaluation?” The speaker hummed and hawed, claimed that this was an open question that “we’re still working on,” and stated that he/she used the terms loosely.

No, no, no. It’s only an open question to people who are trying to figure it out for themselves rather than leveraging the mountains of theoretical and empirical work that several professional organizations have already completed. There are still various definitions for each term, but here are two that draw a distinct difference:

Assessment is the systematic process of gathering data to inform decisions (adapted from Nitko & Brookhart, 2005).

Evaluation is a judgment of quality, merit, or worth.

Obviously evaluative judgments can be used as assessment data, and assessment data can inform evaluation judgments. But the assumptions, theories, and methods are tailored to their respective end goals (judgments or data).

As these fields have developed, assessment has become a bastion of positivist assumptions and quantitative methods, while evaluation was a safe harbor for relativism and qualitative methods almost from its inception. Perhaps because of this bifurcation (and the persistent classification of qualitative methods as non-scientific), colleges and universities have offices of “assessment” that really perform evaluation.

In common speech, there is little harm in not distinguishing between assessment and evaluation. But when we want to get our hands dirty and accomplish something with these processes, and especially if we’re elevated to the point of lecturing others on how to conduct this work, we ought to have the distinction clear in our minds.

A Few Good Scientists

Lessons Opinion Teaching
 Posted by jeremy on March 29th, 2011

This week’s topic in Understanding Educational Research was ethics. I always enjoy such topics that, despite being infinitely open-ended, have been reduced to a set of guidelines that are – at their origin – completely arbitrary. As I muddled through readings on traditional, contemporary, Kantian, Foucauldian, moral, methodological, and situational ethics, I envisioned a young, well-coiffed Tom Cruise screaming a post-modern critique at Stanley Milgram (played by Jack Nicholson).

Cruise: I want the (socially-constructed) REALITY!

Nicholson: You can’t handle reality!

Son (or Daughter), we live in a world of science. And that science must be carried out by people with data.

Who’s gonna do it? You?

We, scientists, have a greater responsibility than you can possibly fathom.

You weep for hurt participants and you curse quantitative methods because you have that luxury.

You have the luxury of not knowing what I know: That harming those participants, while unfortunate, yielded data.

And our analyses, while demeaning and tautological to you, bring good from those data.

You don’t want reality.

Because deep down in places you don’t talk about in your autoethnographies, you want our data.

You need our data.

We use words like rigor, significance, validity… we use these words as the pathway to scientific progress.

You use ‘em as a punchline.

I have neither the time nor the inclination to explain our methods to someone who is fed, clothed, transported, and even kept alive by the very science we carry out and then questions the manner in which we carry it out!

I’d rather you just said thank you and wrote your critical incident papers.

Otherwise, I suggest you gather your own data and do some real research!

Either way, I don’t give a damn how you think science should be done.

We performed this as a readers’ theater in my grad class, with interesting results. Eighty percent of my students in that section were science education grads, and, while they agreed with Nicholson’s lines, they understood that he was the villain in A Few Good Men.

How some standards are messed up

Lessons Opinion
 Posted by jeremy on May 5th, 2010

People in assessment are often at the pointy-end of standards. They have to translate national, regional, and even institutional standards into measurable terms. The problem is that very few standards are written to be measured; rather, they embody a committee-negotiated collective set of values.

Consider the National Board of Professional Teaching Standards. Here’s one from adolescent (high school) science that doesn’t look too daunting:

Accomplished Adolescence and Young Adulthood/Science teachers employ a deliberately sequenced variety of research-driven instructional strategies and select, adapt, and create instructional resources to support active student exploration and understanding of science.

At first glance, it is obvious that this standard contains more than one outcome; it’s a comma-delimited list of outcomes. But a bigger problem soon becomes apparent: There are two separate lists that are meant to be cross-tabulated. The verbs select, adapt, and create each relate to the pair of objects exploration and understanding. This multiplies the number of measurable outcomes contained in the standard. I count a total of seven, but there may be more:

Accomplished Adolescence and Young Adulthood/Science teachers…

  1. employ a deliberately sequenced variety of research-driven instructional strategies
  2. select instructional resources to support active student exploration of science.
  3. select instructional resources to support active student understanding of science.
  4. adapt instructional resources to support active student exploration of science.
  5. adapt instructional resources to support active student understanding of science.
  6. create instructional resources to support active student exploration of science.
  7. create instructional resources to support active student understanding of science.

There are only 6 NBPTS for adolescent science (for comparison, there are 12 for adolescent math), but those 6 standards break out into 28 distinct outcomes (including the seven listed above):

Accomplished Adolescence and Young Adulthood/Science teachers…

  1. know how students learn.
  2. know their students as individuals.
  3. determine students’ understandings of science.
  4. determine students’ individual backgrounds. (What is this supposed to mean?)
  5. have a broad and current knowledge of science and science education.
  6. have in-depth knowledge of one of the subfields of science.
  7. use their subfield knowledge to set important learning goals.
  8. use their subfield knowledge to set appropriate learning goals.
  9. employ a deliberately sequenced variety of research-driven instructional strategies.
  10. select instructional resources to support active student exploration of science.
  11. select instructional resources to support active student understanding of science.
  12. adapt instructional resources to support active student exploration of science.
  13. adapt instructional resources to support active student understanding of science.
  14. create instructional resources to support active student exploration of science.
  15. create instructional resources to support active student understanding of science.
  16. spark student interest in science.
  17. promote active learning so all students achieve meaningful growth toward learning goals.
  18. promote active learning so all students achieve demonstrable growth toward learning goals.
  19. promote sustained learning so all students achieve meaningful growth toward learning goals.
  20. promote sustained learning so all students achieve demonstrable growth toward learning goals.
  21. create safe learning environments that foster high expectations for each student’s successful science learning.
  22. create safe learning environments in which students experience and incorporate the values inherent in the practice of science.
  23. create supportive learning environments that foster high expectations for each student’s successful science learning.
  24. create supportive learning environments in which students experience and incorporate the values inherent in the practice of science.
  25. create stimulating learning environments that foster high expectations for each student’s successful science learning.
  26. create stimulating learning environments in which students experience and incorporate the values inherent in the practice of science.
  27. ensure that all students succeed in the study of science (including those from groups that have historically not been encouraged to enter the world of science and that experience ongoing barriers).
  28. ensure that all students understand the importance and relevance of science (including those from groups that have historically not been encouraged to enter the world of science and that experience ongoing barriers).

Suddenly the task of gathering evidence that an individual has or has not met this standard is enormous.
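
The cross-tabulation in that first standard can even be generated mechanically. A Python sketch (the outcome wording is quoted from the standard above):

```python
from itertools import product

# The standard's two embedded lists, cross-tabulated.
verbs = ["select", "adapt", "create"]
objects = ["exploration", "understanding"]

outcomes = ["employ a deliberately sequenced variety of "
            "research-driven instructional strategies"]
outcomes += [f"{verb} instructional resources to support active student "
             f"{obj} of science"
             for verb, obj in product(verbs, objects)]

for i, outcome in enumerate(outcomes, 1):
    print(f"{i}. {outcome}")

# 1 standalone verb + (3 verbs x 2 objects) = 7 measurable outcomes
print(len(outcomes))
```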

Ten Steps to Great Rubrics

Lessons Teaching
 Posted by jeremy on May 5th, 2010

I sat down this morning to write out some thoughts on a rubric I developed for a course this semester. The more I wrote, the more I rambled. I’ve concluded that each of these points needs to be elaborated individually, but for now, here’s a brain-dump.



Last semester, on very short notice, I taught a course entitled “Understanding Educational Research.” It’s essentially a thesis prep class, but because different advisors have different concepts of the thesis, I chose to play it safe by basing the coursework on a topical review of literature that may or may not lead into the student’s thesis. Because this lit review would be a major assignment (50% of the final grade), I knew I needed a solid rubric, and I set out deliberately to develop one through the following steps.

1. Start with good evidence and theory

I believe the best rubrics embody the best thinking in their respective field. Unless you are the leader in your field, this means you need to go out and see what others are saying. Find something someone else has done, whether empirical or theoretical, and build your rubric around it. Generally speaking, I’m a fan of both versions of Bloom’s Taxonomy and the lesser-known Krathwohl’s Taxonomy.

Specific to the topic of master’s thesis lit reviews, I found an unpublished article by two friends to be hugely helpful. These friends adapted a rubric from Doing a Literature Review: Releasing the Social Science Imagination (Hart, 1998), and then used it to evaluate 30 theses. Their article was the first assigned reading for my course, and my students spent much more time discussing the rubric than they did going over their evaluation results.

2. Involve other people

Whether you talk to colleagues, your students, or (preferably) both, get someone else to look over the rubric early and often. In my case, the students worked in groups to determine which of Hart’s criteria were applicable to our class assignment, and then collaboratively crafted draft rubrics during the next three class sessions. I served primarily as a sieve, sorting out the contributions of each group and keeping the standards adequately elevated. Which is a nice segue into…

3. Aim high

If your rubric doesn’t describe the heights to which you believe your students may soar, you can only blame yourself when their work disappoints you. Few students will ever do more than they’re told. And why should they? It is not their responsibility to guess what extra work will get them a higher grade. It is imperative that your rubrics include what you know they can do, even if they don’t know it yet.

I had to employ a little subterfuge to raise my students’ expectations. OK, I flat-out lied to them. The article (described above) I assigned at the beginning of class wasn’t really a pre-pub version of some friends’ article; it was Scholars Before Researchers: On the Centrality of the Dissertation Literature Review in Research Preparation (Boote & Beile, 2005). I had removed every mention of “doctoral” and “dissertation” and replaced them with “master’s” and “thesis.” So when my master’s students were contemplating which criteria they would meet for their lit reviews, they were working from suggestions for doctoral students. I changed the names in the reference so they wouldn’t find the original article and catch me in my ruse.

It worked perfectly. It wasn’t until after all the lit reviews were submitted that I revealed the intrigue to my students. Yes, I saw a metaphorical dagger or two being flung my direction, but I haven’t fielded a single formal complaint. And, I believe, their work was much better when they held themselves to such a high standard.

4. Avoid subjective terms and judgments of quantity

Most rubrics fail to achieve greatness in part because they rely on overly subjective judgments. Terms like rarely, some, clearly, and (my personal favorite) nearly always are often used to distinguish between levels of performance. But these terms leave so much latitude to the rater that nearly every result is debatable. Other rubrics avoid this pitfall by quantifying degrees of frequency (e.g. “Students correctly cite their sources 70%-89% of the time”). This practice only conveys the impression of objectivity because the criteria are typically not actually measured. Neither subjective terms nor pseudoquantification is advisable.

This was an issue with many of Hart’s original performance levels for lit reviews. Consider the following levels for one of his criteria (emph. added):

Criterion: Placed the research in the historical context of the field.

  1. History of topic not discussed.
  2. Some mention of history of topic.
  3. Critically examined history of topic.

Notice the subjective terms in the first two performance levels. The difference between no discussion, some mention, and critical examination is endlessly debatable. But this is what we typically see, even on good rubrics. What other options do we have?

Rather than vary the degree to which a student has performed the same verb, we can find different verbs that describe more acceptable performance. In my case I grabbed verbs from Bloom’s original taxonomy. Here is the row from our rubric that corresponds to Hart’s row above:

Criterion: Placed the research in the historical context of the field.

  1. Mentions the history of the topic, but does not describe it.
  2. Describes the topic’s history in isolation from external influences.
  3. Frames the history of the topic in relevant social, scientific, and educational events/attitudes.
  4. Compares the target topic’s history with histories of related topics.

5. Purposefully weight each criterion

Many rubrics assign the same value to each criterion. While it is possible that all criteria are equally important, I believe that most of the time this phenomenon is the result of laziness on our part. We don’t want to think about how much “grammar and spelling” should be worth compared to “addressing the topic.” How we determine the weight assigned to each criterion (importance? difficulty? frequency?) is a topic for another blog post.

The students’ input was invaluable for this issue on our rubric. They conveyed sincerity in their arguments for why one row should be worth more than another, and the final rubric – by which their work was judged – represents their collective opinions. The weights ranged from 6% for defining key terms to 20% for summarizing the methods researchers have used to explore the topic.
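
Mechanically, weighted scoring is just a dot product of criterion scores and weights. A Python sketch; the weights below are hypothetical (only the 6% for key terms and 20% for methods come from our actual rubric):

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (each 0-100) using criterion weights."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[criterion] * w for criterion, w in weights.items())

# Hypothetical weights for illustration.
weights = {
    "defines key terms": 0.06,
    "historical context": 0.24,
    "critiques prior research": 0.25,
    "summarizes methods": 0.20,
    "rhetoric": 0.25,
}

# A student who scores 85 on every criterion earns 85 overall.
scores = {criterion: 85 for criterion in weights}
print(weighted_score(scores, weights))
```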

6. Use non-linear performance levels

I would say 95% of the rubrics I have seen attempt to fit their performance levels to an equal-interval scale. That is, they put the same distance between each level. For example, Hart’s rubric (shown above) used a 1-2-3 scale. But what if the space – perhaps measured by effort – between the second and third levels isn’t the same as the space between the first and second? Rather than blindly following this convention, great rubrics may deliberately space out their performance levels unequally.

For our lit review rubric, we chose 70-80-85-100 for two reasons: First, in our opinion, an A-level paper needed to meet the highest criteria. A 70-80-90-100 distribution would have allowed someone to claim an “A-” without ever performing at the highest level. Second, the effort required to move from the second to third levels was consistently less than that required to move from the third to fourth level.
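
A quick arithmetic check of that first reason, sketched in Python (a hypothetical student hitting the second-highest level on every one of ten criteria):

```python
def average_grade(levels, values):
    """Mean grade when each criterion is scored at the given level."""
    return sum(values[level] for level in levels) / len(levels)

equal_interval = {1: 70, 2: 80, 3: 90, 4: 100}   # the rejected 70-80-90-100
ours           = {1: 70, 2: 80, 3: 85, 4: 100}   # the 70-80-85-100 we chose

all_threes = [3] * 10   # second-highest level on every criterion
print(average_grade(all_threes, equal_interval))  # an A- without top-level work
print(average_grade(all_threes, ours))            # a solid B
```

Under the equal-interval scale the student averages 90% without ever performing at the highest level; under our scale the same performance averages 85%.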

7. Either include zero as a performance level, or do not describe no-performance

One of my biggest pet peeves is rubrics that assign a value of 1 to the lowest level of performance and contain a description of null performance at that level. Looking at Hart’s rubric above, notice that a student who doesn’t do anything under that criterion still receives a 1-out-of-3 score. This would allow students to claim A-level credit when they neglected a criterion that had been important enough to include on the rubric. Taken to the extreme, a blank paper would earn 33% credit.

A better way would be to choose between 1) including a null description with a zero-credit performance level, or 2) letting your lowest performance level be greater than zero, but describe some minimal performance at that level.

For our lit review rubric, we chose the latter. The value of the minimum performance level is 70%, but it is possible that the student will not even accomplish that level. There is a note at the bottom of the rubric which states that students will receive a score lower than 70% if they fail to fulfill those minimum requirements.

8. Check for understanding (both before and after the assignment)

When a rubric is handed out at the same time the task is assigned, it makes sense to check that the students actually understand what is being asked of them. If students’ literature reviews were scored with Hart’s rubric, the students would need to know what is meant by “critically examines the history of the topic.” Additionally, as the assignments are scored, the rater should watch for common misunderstandings so they can be cleared up the next time the rubric is used.

Because the students helped develop our lit review rubric, I assumed they had a good understanding of what was expected. I was wrong. For example, given that these are master’s students in a department of education and that they are all current, former, or future educators, there appeared to be confusion surrounding the term method. Some interpreted it, as I had intended, to mean research methods, but others took it to mean teaching methods. I will be clarifying this distinction on future versions of the rubric.

9. Analyze the results

No assessment tool works well the first time it is used. Commercial tests go through rounds of pilot testing before they are released. State tests… usually need more, but let’s hold ourselves to a high standard. Rubric-derived scores need to be tabulated for each criterion and each level of performance, and then the resulting patterns should be evaluated for their appropriateness. Was there one criterion on which many students scored very low? How can we fix that next time? Do we need to adjust the rubric or the instruction?

If you are concerned that the results may depend on who scored the assignment, you should have multiple raters independently score the same students’ work. This check for inter-rater reliability will tell you if more work needs to be done on the rubric, or perhaps on training scorers to use it.
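
Percent agreement alone is inflated by chance, so a chance-corrected statistic such as Cohen’s kappa is the usual check. A minimal Python sketch; the two raters’ scores below are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same papers."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical scores two raters might assign on a 4-level rubric row.
a = [4, 3, 3, 2, 4, 1, 3, 2, 4, 3]
b = [4, 3, 2, 2, 4, 1, 3, 3, 4, 3]
print(round(cohens_kappa(a, b), 2))  # 0.71
```

Values near 1 suggest the rubric and rater training are working; values near 0 mean the raters agree no more often than chance would produce.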

In most cases an internal consistency reliability analysis (Cronbach’s alpha, KR-20, etc.) is not appropriate for rubric results. Internal consistency checks that your high-scorers aren’t losing points on the easy sections, and that your low-scorers are not getting high marks on the difficult sections. The criteria on a rubric are chosen because they represent various ways in which the quality of the student’s work may vary. We would expect criterion scores to be relatively independent, a trait which internal consistency would mistake as a lack of reliability.
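
To illustrate the point with simulated data: if five rubric criteria are scored independently of one another, coefficient alpha comes out near zero even though nothing is wrong with the rubric. A Python sketch (all data randomly generated):

```python
import random

random.seed(0)

def cronbach_alpha(items):
    """items: one list of per-student scores for each rubric criterion."""
    k, n = len(items), len(items[0])

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(item) for item in items) / var(totals))

# Five criteria scored independently for 200 students: each row may be
# scored perfectly reliably, but alpha is near zero because the rows
# don't covary.
independent = [[random.gauss(0, 1) for _ in range(200)] for _ in range(5)]
print(round(cronbach_alpha(independent), 2))
```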

For our lit review rubric, I shaded each cell according to how many students ended up in each performance level. The shading revealed some welcome and some disconcerting patterns. First, average scores for each criterion ranged between 81% and 92%, with an average total score of 83%. This is an appropriate result for such a difficult assignment. Second, very few students achieved the highest performance level for two criteria, which I believe was due to an interaction between the criteria and the specific topics students chose for their lit reviews. Third, students did not do well on the rhetoric criterion, which included organization, grammar/spelling, and APA style.

10. Revise as necessary

The problems uncovered by the analysis (if you didn’t find any problems, look again) can be classified into two gross categories: problems with the rubric, and problems with the instruction. If the issues concern clarity or applicability of the criteria, then revise them. On the other hand, if students (or a certain set of students) scored lower than you expected, it might not be a problem with the rubric, but with the students’ preparedness. It would be a shame to revise or scrap a functioning rubric just because it didn’t give the results we wanted. Instead, consider altering the instructional activities, and then reusing the rubric to track any changes in student performance.

On my shaded copy of our lit review rubric, I highlighted the problematic cells and inserted endnotes describing the problems and possible solutions. I am not teaching this course next semester, so I need to record my concerns right away. With these notes, I can pick up and revise the rubric the next time I use it.


Whether having a great rubric is worth all this work is a valid question. But a wonderful aspect of rubrics is that they can be reused each semester (or even multiple times within a semester) without impacting the students’ results. So you can put in a little time now, then a little time next year, and develop the rubric in baby steps. So then, it’s not a question of whether it’s worth the time, but of how much time it’s worth.

The Old/New Threat to Open Source Adoption

Lessons Opinion Technology
 Posted by jeremy on April 15th, 2010

I have been satisfied with the level of acceptance open source products have earned during the last decade. But I fear that their progress may be impeded by the continued confusion between “free” and “open.” This is not a new argument, but the premises have shifted slightly. Basically, open source advocates must constantly remind the public that YouTube, Twitter, Facebook, etc. are NOT open source.

I sat in a meeting last week listening to faculty and staff debate the future of our campus’ learning management system. Our provider has been purchased by Blackboard, which gives us four years to either migrate to the Beast, or find another way. If you know me, you know what my opinion is.

I’ve been pushing Jon Mott’s model: Let’s focus on the core functionality of what the institution must do, while supporting faculty and students’ use of non-campus (“cloud”) tools. This maintains the security over student data and restricted material without sacrificing the advantage of the emerging technologies. As we were discussing how the college IT staff could support such a model, one person spoke up:

Several faculty have started using open source sites, and they’re starting to realize that when the system goes away, they’ll lose their work.

Anyone in open source, or even with a cursory knowledge of its pros and cons, would have done a double take. I think my brow furrowed into the ceiling panels. They continued.

When sites like YouTube make changes, or just close-up shop, a lot of the work you’ve put into building your content is just gone.

I usually hold my tongue in meetings, especially when deans or higher-ups are in attendance, but this was one misconception I dearly wanted to nip in the bud. Before the moderator had a chance to call on anyone, I corrected:

Just a correction, if I may? YouTube is NOT open source. It’s free to use, but the very nature of open source would actually mitigate the risks you bring up.

I spoke with the person after the meeting to give them a quick rundown on what open source is, is not, its advantages and disadvantages, and why it was so important that I correct them during the meeting.

Today I received my subcommittee’s report on the issues we’ve researched. My view is well represented in the report, but our subcommittee chair made a similar mistake. She equated cloud computing with the use of open source tools. I sent a polite correction, which was applied to the report.

Most of my involvement in open source has assumed that when the products were good enough, they would break into the mainstream. I felt that the movement had succeeded when campus IT guys installed Firefox on the lab computers. But it appears I missed the next glass ceiling: Overcoming mid-level decision-makers’ misconceptions.

More on reliability

 Posted by jeremy on November 16th, 2009

My last post questioned the analogy that calls a broken clock reliable because it is consistent. I claimed that reliability is consistency vis-à-vis the variable being measured, so a clock that doesn’t change as the time changes is not reliable. Additionally, I noted that a broken clock is even inconsistent in the degree of inaccuracy it displays: Twice a day the clock will have no error, but at all other times it will have varying degrees of error in its measure of time.

I’d like to continue with another analogy to further explore this construct of reliability.

Jon used a broken scale in his illustration of reliability:

If a person weighs 200 pounds and the bathroom scale says they weigh 200 pounds, then the measurement is valid. However, if the scale indicates different weights each time a person steps on it (even though their weight hasn’t changed), the measurement isn’t reliable. On the other hand, if the scale consistently indicates that a 200 pound person weighs only 150 pounds every time they weigh themselves, the measure is reliable (consistent) but not valid (accurate).

While the scale analogy is just as flawed as the clock analogy (a functioning scale would have to be classified as unreliable), extrapolating from it will reveal some limitations in the concept of reliability. Clocks measure a unique variable in that time is circular. Every twelve hours we come back to the same time, so even a broken clock is right twice a day. Weight, on the other hand, is a linear variable, and there is no instrument that produces reliable measures of weight across the range of possible weight values.

Don’t believe me? Drive your car across your bathroom scale and see what happens; or go weigh yourself on a truck scale.

Every good instrument has a range of values in which it reliably measures the target variable. Outside that range the results may be extremely consistent, but they are not consistent reflections of the target variable. For example, your bathroom scale will report anything over its maximum as equal to that maximum value. Though I may weigh 472 pounds, your scale tells me I’m only 300.
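To make the ceiling concrete, here is a minimal sketch (the 300-pound maximum is just the number from the example above): a clipped scale produces readings that are perfectly consistent beyond its range, yet those readings say nothing about the true weights.

```python
# Hypothetical sketch of a bathroom scale that maxes out at 300 pounds.
def scale_reading(true_weight, max_reading=300):
    """Return the displayed weight, clipped at the scale's maximum."""
    return min(true_weight, max_reading)

# Within range, readings track true weight; beyond it, they are
# consistent (always 300) but unrelated to the target variable.
for w in (150, 200, 299, 301, 472, 3500):
    print(w, "->", scale_reading(w))
```

A car and a sumo wrestler get the same reading, which is exactly the kind of “consistency” that should not be mistaken for reliability.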

The same is true of any assessment or psychological scale: Though the test manual may declare that past results had a .93 alpha coefficient or that trained raters agreed 87% of the time, these figures are averages across all the scores that were recorded. It may be that the raters agreed mostly on scores well above the cutoff, or that the lowest error occurred for students with very low trait levels. Coefficient alpha, in particular, is a single summary of the whole sample and should not be assumed to describe reliability at any particular score level.
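For readers who have only ever seen alpha as a number in a test manual, it may help to see how it is computed. This is a sketch with made-up data (rows are students, columns are items); it uses the standard formula alpha = k/(k−1) · (1 − Σ item variances / variance of totals):

```python
# Illustrative sketch: coefficient alpha from raw item scores.
# The data below are invented for demonstration only.
def cronbach_alpha(scores):
    k = len(scores[0])                      # number of items
    totals = [sum(row) for row in scores]   # each student's total score

    def var(xs):                            # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    return k / (k - 1) * (1 - sum(item_vars) / var(totals))

data = [
    [4, 5, 4, 4],
    [2, 3, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]
print(round(cronbach_alpha(data), 2))  # prints 0.94
```

Note that a single number comes out no matter how unevenly the items behave across the score range, which is precisely the limitation discussed above.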

Estimates of reliability, which are the product of group scores, are thus less applicable to an individual student’s score.

But given an adequate sample size we could estimate a scale’s reliability across the levels of the target variable. Item response theory has a function for that: It’s called the information curve.

[Figure: test information curve from the department’s dispositions assessment]

This particular information curve is from the results of my department’s dispositions assessment. Teacher candidates self-rate on important professional attitudes, and we use the anonymous results for program improvement. What’s useful in this information curve is that the most consistent scores are for students slightly below average (zero on the x-axis). Since we can do something about these low scores, this curve is encouraging. Were the most informative scores to come from students with high levels of professional dispositions, the scores would be less useful for advising programmatic improvements.

No, a broken clock is not reliable

 Posted by jeremy on November 14th, 2009

There’s a very common – and very inaccurate – analogy about reliability that spreads like a cancer. Though it is easy to understand, the metaphor conveys several misconceptions about reliability, and it’s worth debunking whenever you hear or read it.

First, a little background. The terms reliability and validity are old and used in several contexts, each of which adds distinct nuances. Most scholars first learn the concepts in research methods courses (sorry, philosophers), and it is from that field that this horrible analogy originates.

A broken clock is reliable because reliability is consistency of measure. A broken clock always gives you the same reading; it’s very consistent. But a broken clock would only be valid twice a day: when the actual time happens to be the time displayed on the broken clock.

I won’t go into validity today (I think I’ve beaten that drum to death), but I will use this flawed analogy to demonstrate why reliability can only refer to a specific type of consistency. My one-line response to this analogy is…

So a working and calibrated clock wouldn’t be reliable because it’s inconsistent? I mean, every time I look at it, it gives me a different reading.

The consistency in reliability refers to consistency in measuring a variable. In the context of the analogy, the variable is time, and a clock that doesn’t change as the time changes does not consistently measure the target variable. Twice a day, when the clock happens to match the actual time, the measure is accurate (no error), but five minutes later it is less accurate (some error), and six hours later it is completely inaccurate (extreme error). Because of this inconsistency in its measure of time, a broken clock cannot be deemed reliable.
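The varying error is easy to make concrete. A quick sketch (the 6:00 reading is arbitrary) measures how far a stuck clock is from the actual time over a circular 12-hour dial:

```python
# Sketch: error of a clock stuck at 6:00, across a 12-hour cycle.
STUCK = 6 * 60  # stuck reading, in minutes past 12:00

def clock_error(actual_minutes):
    """Smallest distance, in minutes, between the stuck reading and
    the actual time on a circular 720-minute (12-hour) dial."""
    diff = abs(actual_minutes % 720 - STUCK)
    return min(diff, 720 - diff)

# Error is zero at 6:00 but grows to a full six hours at 12:00 --
# the "consistent" clock is anything but consistent in its accuracy.
for label, t in [("6:00", 360), ("6:05", 365), ("9:00", 540), ("12:00", 0)]:
    print(label, clock_error(t))
```

An instrument whose error sweeps from zero to its maximum possible value every six hours is about as far from “consistent measurement” as one can get.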

The case of the broken clock also reveals an important but often overlooked detail about reliability: It’s not the instrument (the clock) that is more or less reliable, but the results/readings/information the instrument produces. It would be more appropriate to say that the time displayed on a broken clock is not reliable.

If anyone asks, “but what about test information in item response theory?” that’s going to be my next post. Let’s see if I can paint myself out of that corner.

Freshman Advice: What do you call the person who teaches your college courses?

 Posted by jeremy on September 3rd, 2009

You probably called your high school teachers “Mr. Smith,” “Mrs. Smith,” or “Ms. Smith.” That’s true even if they had higher titles: My biology teacher had a PhD in entomology – he was really “Dr. Cope” – but we still called him “Mr. Cope.” Just about everything is more complicated in college, and teachers’ titles are no exception.

When someone earns a doctoral degree (PhD, EdD, MD, DO, DVM, etc.), it is appropriate to call them “Doctor” (e.g., “Dr. Smith”) in professional and academic settings. There are some exceptions (we no longer call lawyers, who hold JD degrees, “Doctors”), but in academics, it’s always best to use that title.

You can also use the title “Professor” (“Professor Smith”), if that’s the teacher’s professional rank. They may be a Visiting Professor, Assistant Professor, Associate Professor, Full Professor, or Distinguished Professor. The only exception would be Associate Faculty, also called Lecturers or Adjuncts, who are not generally considered “Professors.”

The bottom line is this: I’ve never known anyone to get offended when a student called them “Doctor” without their having earned a doctorate, or “Professor” when they weren’t a professor; they usually offer a polite correction. On the other hand, after working for years to earn a doctoral degree and competing for a professorship, some will get offended if a student calls them “Mr. (or Ms. or Mrs.) Smith.”

PS – The use of these titles is different in other countries.