Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
p. 741
The traditional conception of validity divides it into three separate and substitutable types—namely, content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use.
[I don't think he had self-report scales, like self-efficacy, in mind with this. It's obviously bent towards "testing," both classical and modern. We should look at what's been published since 1995 on this vis-à-vis self-efficacy.]
unified validity integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships, including those of an applied and a scientific nature.
[La ilaha illa allah. There is no validity but construct validity.]
[But construct validity has its aspects.]
These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity.
Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment (Messick, 1989b). [...] not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment.
validity is an evolving property and validation a continuing process.
[But my dissertation must be defended by February!]
p. 742
As a salient social value, validity assumes both a scientific and a political role that can by no means be fulfilled by a simple correlation coefficient between test scores and a purported criterion (i.e., classical criterion-related validity) or by expert judgments that test content is relevant to the proposed test use (i.e., traditional content validity).
[Like SDR, we cannot avoid politics. Let's deal with it instead.]
Therefore, it is fundamental that score validation is an empirical evaluation of the meaning and consequences of measurement.
[In strict operationalism,]
the measure is viewed as just one of an extensible set of indicators of the construct. Convergent empirical relationships reflecting communality among such indicators are taken to imply the operation of the construct to the degree that discriminant evidence discounts the intrusion of alternative constructs as plausible rival hypotheses.
—Sources of Invalidity—
construct underrepresentation
[If we cover the NETS-T, is that a priori representation?]
construct-irrelevant variance:
++construct-irrelevant difficulty for individuals and groups is a major source of bias in test scoring and interpretation and of unfairness in test use.
++construct-irrelevant easiness occurs when extraneous clues in item or task formats permit some individuals to respond correctly or appropriately in ways irrelevant to the construct being assessed.
p. 743
But both communication skill and mathematical knowledge are considered relevant parts of the higher-order construct of mathematical power, according to the content standards delineated by the National Council of Teachers of Mathematics (1989).
[Ah, so we can use the NETS-T to dictate what is relevant. Also, given that teachers in most states must pass a basic English test, can we assume that reading is a relevant trait as well?]
—Sources of Evidence in Construct Validity—
explanatory concepts that account for both test performance and score relationships with other variables
Almost any kind of information about a test can contribute to an understanding of score meaning, but the contribution becomes stronger if the degree of fit of the information with the theoretical rationale underlying score interpretation is explicitly evaluated (Cronbach, 1988; Kane, 1992; Messick, 1989b).
Probably even more illuminating in regard to score meaning are studies of expected performance differences over time, across groups and settings, and in response to experimental treatments and manipulations.
[This is what we're doing with the pre/post tests.]
Possibly most illuminating of all, however, are direct probes and modeling of the processes underlying test responses, which are becoming both more accessible and more powerful with continuing developments in cognitive psychology (Frederiksen, Mislevy, & Bejar, 1993; Snow & Lohman, 1989).
[Again, I do not think this would be that useful on a self-report format.]
construct validity, as previously indicated, also subsumes content relevance and representativeness as well as criterion-relatedness.
p. 744
In other words, empirical relationships between predictor scores and criterion measures should make theoretical sense in terms of what the predictor test is interpreted to measure and what the criterion is presumed to embody (Gulliksen, 1950).
[This is the crux of the assignment score-gain score correlation.]
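[A minimal sketch of the gain score vs. assignment score correlation, assuming paired pre/post self-efficacy means and one assignment score per student. All data values and the names pearson_r and gain_scores are hypothetical, for illustration only. A coefficient by itself is not validity evidence; per the quote above, it still has to make theoretical sense.]

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(ss_x * ss_y)


def gain_scores(pre, post):
    """Simple gain score per respondent: post-test minus pre-test."""
    return [after - before for before, after in zip(pre, post)]


# Hypothetical data: five students, scale means on pre/post, assignment percentages.
pre = [2.0, 3.0, 2.5, 4.0, 3.5]
post = [3.0, 3.5, 3.5, 4.5, 3.5]
assignment = [85, 70, 90, 75, 60]

gains = gain_scores(pre, post)
r = pearson_r(gains, assignment)
print(f"gain vs. assignment r = {r:.2f}")
```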
And to appraise how well a test does its job, one must inquire whether the potential and actual social consequences of test interpretation and use are not only supportive of the intended testing purposes, but also at the same time consistent with other social values.
[We must consider the consequences to the program and instructors, but they are not the ones taking the test.]
In the language of the Cronbach and Meehl (1955) seminal manifesto on construct validity, the intended consequences of the testing are strands in the construct’s nomological network representing presumed action implications of score meaning.
the general construct validity evidence may need to be buttressed in applied instances by specific evidence of relevance and utility.
[We have gathered a little evidence of this already. Perhaps an interview with a dean, another professor at another school?]
**Good opening Lines***
From the discussion thus far, it should also be clear that test validity cannot rely on any one of the supplementary forms of evidence just discussed. However, neither does validity require any one form, granted that there is defensible convergent and discriminant evidence supporting score meaning.
What is required is a compelling argument that the available evidence justifies the test interpretation and use,
*******
[This is also free license to ignore any of the aspects of construct validity.]
—Aspects of Construct Validity—
p. 745
[Each section of the lit review should begin with the corresponding description.]
+The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989b);
[The NETS-T + item/reliability/factor analyses + relevancy scale]
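[A minimal sketch of the reliability piece of the item analyses, using Cronbach's alpha as one common internal-consistency estimate. This is my choice of statistic, not something Messick prescribes. The response matrix and the name cronbach_alpha are hypothetical.]

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows (one item score per column)."""
    if len(responses) < 2:
        raise ValueError("need at least 2 respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least 2 items")
    columns = list(zip(*responses))  # one tuple of scores per item
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical data: five respondents answering three Likert-type items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```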
+The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks;
[This is a bit more difficult. Does CFA via SEM work? A more formal think-aloud? We might have to argue that this is a testing aspect, not a self-report aspect.]
+The structural aspect appraises the fidelity of the scoring structure to the structure of the construct domain at issue (Loevinger, 1957; Messick 1989b);
[CFA goes here? Concurrent vs. convergent? Comparing gain and assignment scores? Or is that under external?]
+The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test criterion relationships (Hunter, Schmidt, & Jackson, 1982);
[Compare majors? Elementary vs. Secondary vs. Special Ed?]
+The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965);
[Is this where the gain vs. assignment scores goes?]
+The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness,
[We'll explain why we're not addressing this.]
—Content Aspect—
especially [useful is] domain theory, in other words, scientific inquiry into the nature of the domain processes and the ways in which they combine to produce effects or outcomes.
[Does the NETS-T have an empirical basis?]
it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense.
[The NETS-T does do this.]
Functional importance can be considered in terms of what people actually do in the performance domain
[But what about what we would like them to do? Current practice may not be the target.]
—Substantive Aspect—
Two important points are involved: One is the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content; the other is the need to move beyond traditional professional judgment of content to accrue empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance.
[Again, we rely on the NETS-T for this. Are they valid?]
the need for empirical evidence of response consistencies or performance regularities reflective of domain processes (Loevinger, 1957).
[CFA?]
Such evidence may derive from a variety of sources, for example, from “think aloud” protocols or eye movement records during task performance; from correlation patterns among part scores; from consistencies in response times for task segments; or from mathematical or computer modeling of task processes (Messick, 1989b, pp. 53-55; Snow & Lohman, 1989).
[CFA? But one would think that "Structural Equation Modeling" would go under "structural aspect."]
—Structural Aspect—
scoring models should be rationally consistent with what is known about the structural relations inherent in behavioral manifestations of the construct in question (Loevinger, 1957; Peak, 1953)
p. 746
also the rational development of construct-based scoring criteria and rubrics.
[Averages, factor scores, IRT]
the internal structure of the assessment (i.e., interrelations among the scored aspects of task and subtask performance) should be consistent with what is known about the internal structure of the construct domain (Messick, 1989b). [=] structural fidelity (Loevinger, 1957).
[Perhaps we need to move from the NETS-T and compare the results with known self-efficacy expectancies.]
—Generalizability Aspect—
Evidence of such generalizability depends on the degree of correlation of the assessed tasks with other tasks representing the construct or aspects of the construct.
[This isn't "external"?]
there is a conflict in performance assessment between time-intensive depth of examination and the breadth of domain coverage needed for generalizability of construct interpretation. This conflict between depth and breadth of coverage is often viewed as entailing a trade-off between validity and reliability (or generalizability).
[It's easier to score an instrument written to a specific task, but that's less generalizable. Have we succeeded in finding a happy medium?]
—External Aspect—
the extent to which the assessment scores’ relationships with other measures and nonassessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed.
[This is the crux of our study.]
That is, the constructs represented in the assessment should rationally account for the external pattern of correlations.
[A regression analysis might be necessary to remove the good-student effect from the domain performance. Still, it could be that assignments are not increasing self-efficacy and the TICS is measuring them correctly. How do we know?]
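[A minimal sketch of removing a good-student effect by regression, assuming prior GPA is an acceptable stand-in for it (my assumption, not something we have measured yet). Assignment scores are regressed on GPA and the residuals are correlated with the gain scores, i.e. a semipartial correlation. All data and function names are hypothetical. This does not settle the rival explanation that the assignments simply are not raising self-efficacy.]

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(ss_x * ss_y)


def residualize(y, covariate):
    """Residuals of y after a simple least-squares regression on the covariate."""
    mc, my = mean(covariate), mean(y)
    ss_c = sum((c - mc) ** 2 for c in covariate)
    if ss_c == 0:
        raise ValueError("covariate must vary")
    slope = sum((c - mc) * (v - my) for c, v in zip(covariate, y)) / ss_c
    intercept = my - slope * mc
    return [v - (intercept + slope * c) for c, v in zip(covariate, y)]


# Hypothetical data: five students.
gpa = [3.8, 2.9, 3.5, 3.1, 2.6]          # assumed proxy for the good-student effect
assignment = [85, 70, 90, 75, 60]        # domain performance
gains = [1.0, 0.5, 1.0, 0.5, 0.0]        # post minus pre self-efficacy

assignment_resid = residualize(assignment, gpa)
semipartial_r = pearson_r(gains, assignment_resid)
print(f"gain vs. assignment, GPA removed from assignment: r = {semipartial_r:.2f}")
```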
Discriminant evidence is particularly critical for discounting plausible rival alternatives to the focal construct interpretation.
[Perhaps a survey of some self-esteem trait other than self-efficacy? Or just a general self-efficacy measure?]
—Consequential Aspect—
The primary measurement concern with respect to adverse consequences is that any negative impact on individuals or groups should not derive from any source of test invalidity,
[If the test is used to monitor the program, and the students will neither see their scores nor be affected by them, does this apply?]
—Validity as Integrative Summary—
[an evaluation]
p. 747
This relation is embodied in theoretical rationales or persuasive arguments that the obtained evidence both supports the preferred inferences and undercuts plausible rival inferences. From this perspective, as Cronbach (1988) concluded, validation is evaluation argument.
The six aspects of construct validity afford a means of checking that the theoretical rationale or persuasive argument linking the evidence to the inferences drawn touches the important bases; if the bases are not covered, an argument that such omissions are defensible must be provided.
[So we can ignore the consequential aspect so long as we justify its omission.]
The challenge in test validation is to link these inferences to convergent evidence supporting them and to discriminant evidence discounting plausible rival inferences.
p. 748
[Figure 1 is interesting, but of no import if we're not concerned with the consequential aspect.]
the evidence and rationales supporting the trustworthiness of score meaning are what is meant by construct validity,
Counterproposals to a proposed test use might involve quite different assessment techniques, such as observations or portfolios when educational performance standards are at issue.
[The high-stakes assessments use these, but they are time-consuming and costly. Our instrument is a low-stakes, low-cost, medium-fidelity measure.]