Archive for the ‘Projects’ Category

Prospectus Defense – Passed?
Sunday, January 14th, 2007

It’s Sunday. I defended my prospectus, essentially the first three chapters of my dissertation, on Thursday. I still haven’t decided whether I passed.

Technically, I passed because the committee agreed they did not have to meet again. They told me everything they needed to see was in the 84-page document I sent to them a week earlier, but it needed to be revised.

Their suggested revisions constitute a re-visioning of my project. I will no longer be doing what I had planned, and the new project will have far less impact.

My original proposal was to investigate the validity of the inferences drawn from the scores of a test I created and will be presenting at a conference in March. Validity is a huge question, one that frequently takes years to investigate, so I had limited my working definition of the domain to a single theorist’s seminal article on the topic. This limitation (a) turned my otherwise mundane study into an examination of this theorist’s ideas and (b) acted as a levee to hold back the tide of endless possibility that tends to delay graduation.

With one sentence, a highly respected member of my committee demolished that dam.

Instead of limiting the scope of my project through the definition of validity, the committee decided to limit the types of inferences I was trying to validate. Before, I had planned to gather all types of validity evidence (described by this particular theorist), and then look back and deduce the types of inferences the evidence supported (or didn’t). Now, I will be specifying the inferences I intend to draw from the test scores, and then only gathering the evidence that supports (or refutes) the validity of those inferences.

In short, rather than conducting an investigation that both results in an argument for validity and informs measurement theory, I am simply conducting a validation exercise.

In another meeting with a peer and one of my committee members, my peer asked me how the defense went, and I told her. Because she understood what I had planned to do, and my motivations for doing it, she was shocked and stated, “That totally changes your project.”

The committee member looked me in the eye and commented, “You sound disappointed.”

It is apparent to me that my committee did not understand the study I intended to do because they did not understand my motivations for doing it. They read my proposal, assumed I wanted to conduct a simple validation study, and are now directing me to do that.

Additionally, having now been back through my document, I dispute the assumption that everything I need “is in the prospectus.” My research had cut across the field horizontally, gathering all types of evidence described by a single theorist. The new design cuts through the field vertically, gathering specified types of evidence described by any theorist. The only work that I can recycle from the original design is the small intersection of these two slices.

I’ve been told Chapter One (Introduction) must be rewritten. I can assume Chapter Two (Review of Literature) will also have to be rewritten to expand the definition of validity. Chapter Three (Methods) might remain the same, but be trimmed as surplus evidence-gathering methods are removed.

So, here it is, three days later, and I still don’t know if I really passed the defense, or if the whole thing is an epic misunderstanding.

Notes on Kane 2002
Monday, January 1st, 2007

Kane, 2002

Inferences About Variance Components and Reliability-Generalizability Coefficients in the Absence of Random Sampling

P. 165

“G studies typically employ random-effects ANOVA models to estimate the variance components for various conditions of observation that are likely to have an impact on subsequent D studies.”

P. 166

“The estimation of variance components in G studies, using random-effects ANOVA, is the fundamental estimation issue in G theory.”
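
To make that estimation step concrete, here is a minimal sketch for the simplest crossed design: persons × items with one score per cell. The function names (`variance_components`, `g_coefficient`) and the toy data are hypothetical and do not come from Kane's article. The estimators are the standard expected-mean-square solutions for a two-way random-effects ANOVA, which is only one of several ways to estimate these components.

```python
import numpy as np

def variance_components(scores):
    """Estimate variance components for a crossed persons x items design.

    scores: 2-D array, rows = persons, columns = items, one score per cell.
    Solves the expected-mean-square equations of a two-way random-effects ANOVA.
    Estimates can come out negative in small samples; no correction is applied here.
    """
    x = np.asarray(scores, dtype=float)
    n_p, n_i = x.shape
    grand = x.mean()
    person_means = x.mean(axis=1)
    item_means = x.mean(axis=0)

    # Mean squares for persons, items, and the residual (interaction + error).
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = x - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    return {
        "person": (ms_p - ms_res) / n_i,
        "item": (ms_i - ms_res) / n_p,
        "residual": ms_res,
    }

def g_coefficient(components, n_items):
    """Relative G coefficient for a D study that averages over n_items items."""
    error = components["residual"] / n_items
    return components["person"] / (components["person"] + error)

# Hypothetical toy data: 3 persons x 2 items, purely additive (no residual).
toy = [[0, 2], [1, 3], [2, 4]]
vc = variance_components(toy)
print(vc, g_coefficient(vc, n_items=2))
```

The toy data are deliberately additive so the residual component is zero. Real G-study data would add a persons × items interaction, and further facets such as occasions would need a larger design.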

[What if we used occasions as the object of measure, and persons as a facet (a source of variance)? That would put us in the same predicament because then the persons facet would account for a lot of variance.]

The object of measurement and the facets are generally assumed to be random. That is, they could have been any persons, any items, or any occasions.

“The random sampling assumptions are problematic for both persons and conditions of observation, but the sampling assumptions are especially implausible for some conditions of observation (e.g., items, occasions).”

P. 167

[The TICS is only as valid as the sample from which the analyzed data is collected.]

“First, the methods are always applied to populations in which the units that compose the population and the units that are excluded from the population are explicitly defined.
Second, every potential sample of a given size that could be drawn from the population, together with the probability of selecting each sample, must be specifiable prior to sampling taking place . . . .
Third, every sampling unit in the population must have a probability of selection that is greater than zero; . . . Samples selected through probability sampling procedures are said to represent the population from which they were drawn. (Jaeger, 1984, pp. 28-29)

All of these conditions are typically violated in G and D studies. The universes associated with most facets are not clearly defined. The probabilities associated with the selection of conditions of a facet are generally unequal and unknown. And many conditions have zero probability of being selected.”

First, items are selected by experts, not “sampled” from a population. Second, items are distinct from objects usually classified as populations. (Loevinger, 1965)

These issues are magnified in D-studies.

P. 168

“In the absence of random sampling of conditions of a facet, strict statistical inferences about the variance components associated with the universe of generalization are not possible.”

Physics, chemistry, etc., all sample conveniently… Why not social science? Because we’re more variable.

P. 169

“The essential requirement for reasonable application of sampling theory is that the universe be defined clearly enough that one can recognize elements belonging to it. This is a requirement of operational definition. (Cronbach et al., 1972, p. 368)”

[It seems that most of his universe definition and sampling concerns will be investigated in the content aspect.]

Drawing random samples from infinite universes is impossible. [Like in language testing.]

P. 170

Two steps of inference
Step 1: From sample to population of possibly-sampled participants (inferential statistics takes care of this.)
Step 2: From sampled population to universe (Logical leap?)

Shavelson & Webb: It’s not that the items were chosen randomly, but they are interchangeable with any other item from the defined universe.

P. 171

“A common practice is to observe a group of persons who are conveniently available to the investigator and then to generalize to a population of persons “like these.” . . . Scientists have found it better to apply statistical inference to samples obtained haphazardly than to refuse to use information from those samples or to take the sample data as purely descriptive and relevant only to the sample in hand. (Cronbach et al., 1972, p. 367)”

[So, we frequently complete the first step, but not the second.]

We need some substantive evidence to justify the second step.

P. 172

Representative sampling: “If it can be plausibly claimed that the sample of conditions in the G study is representative of a specified universe of admissible observations, then the G-study variance components can be linked to this universe… Guion (1978) considers the representativeness of sampling in his analysis of claims for content validity.”

!!!!!
“If a serious attempt to identify selective forces that could introduce substantial bias has failed to identify any reasonable suspects, we can provisionally treat the sample as representative.” (Popper, 1968)
!!!!!

P. 173

“in those situations where probability sampling is not possible (e.g., in most G and D studies and reliability studies), the minimization of the influence of selective forces in sampling may be the best available option.”

P. 174

“The sampling problem disappears to the extent that it is reasonable to assume that the universe is homogeneous.”

[I don't think any of the NETS-T is homogeneous.]

[Too often, research in our field approaches the chaos present in more anthropological domains. Where the physicist can predict the direction the billiard balls will travel, those predictions grow less and less accurate as time and space are increased. In psychology, simple tasks can be represented while complex tasks cannot.... But the simple tasks can be predicted, while the complex cannot...]

P. 175

“Fixed facets play a particularly important role in evaluating sampling assumptions because they are often employed in stratified sampling plans. For example, if an achievement test includes a certain number of items on each of several content areas (as specified in a test plan), the content areas are likely to be considered fixed. The items within each content area will generally be considered to be randomly sampled, but the content areas are always the same. Items are not randomly sampled from the full universe, but rather are sampled within each stratum or content area.”

[HUH? This is an issue of granularity. If a "universe" is defined as all acceptable examples, then this paragraph is moot.]

“if the sampling plan divides the total universe of items associated with the item facet into a number of fixed strata (defined by item content, format, cognitive level, etc.) and specifies the proportion of items from each cell in the plan, the items in each test will cover the different parts of the universe.”

[Since I do not consider the NETS-T to be unitary, should this be done for each?]

“The items in each category are still written by item writers rather than being sampled from a universe of items in that category, but the potential impact of selective forces in sampling within cells is smaller than the potential impact of selective forces in sampling from the full universe.”

[I'm not convinced that items being "generated" for a test can/should be constrained to the "sampling" techniques. I understand that G-theory may assume "randomness," but... hmmm. I'll have to think about this.]

Because random sampling is not possible, stratified sampling is imperative. (Lindquist, 1940)

P. 176

“Cook and Campbell (1979) suggest “that external validity is enhanced more by a number of smaller studies with haphazard samples than by a single study with initially representative samples if the latter could be implemented” (p. 73).”

“In the context of G theory, external validity involves inferences from the G study to subsequent D studies.”

[It seems his whole point was investigating the validity of D-studies based on the data of G-studies. At the end of the day, this may be a bit deeper than I want to go on a single facet.]

Notes from Kane 1999
Monday, January 1st, 2007

The Role of Generalizability in Validity

P. 2

“Generalizability analyses play a central role” in four ways:

1. G coefficients indicate “upper bounds” on validity
2. Generalizability is part of most interpretive arguments
3. This argument “determines the appropriate estimate of generalizability” [Cyclical?]
4. “Generalizability provides justification for the syntactical content of construct labels.” [Wouldn't dimensionality analyses be a better choice for this?]

√α = Max. correlation with criterion. You cannot more accurately represent another criterion than you do the test itself.
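
The bound follows from the classical correction for attenuation: an observed test-criterion correlation cannot exceed the square root of the product of the two reliabilities, and so cannot exceed the square root of the test's own reliability. Here is a minimal sketch, assuming Cronbach's α as the reliability estimate. The function names and the toy scores are hypothetical, not from Kane.

```python
import math
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items score matrix."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]                          # number of items
    item_vars = x.var(axis=0, ddof=1)       # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)   # variance of the total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def max_validity(reliability):
    """Upper bound on the test-criterion correlation implied by a reliability."""
    return math.sqrt(reliability)

# Hypothetical toy data: 3 persons x 2 items.
toy = [[1, 1], [2, 3], [3, 2]]
alpha = cronbach_alpha(toy)
print(round(alpha, 3), round(max_validity(alpha), 3))
```

The bound is only as good as the reliability estimate that feeds it, which is the ambiguity taken up on P. 3.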

P. 3

“A test-score interpretation does not have multiple validities.” [Note he is talking about the "interpretation," not the test or scores. I assume he includes "respondents," "settings," and "raters" in "interpretation."]

Going from α to the upper bound of validity is “ambiguous” because there are multiple possible measures of reliability (Cronbach’s α, inter-rater, etc.), which represent various sources of variance. G-theory and its G-coefficient take into account all included sources of variance and, therefore, remove this ambiguity.

The more facets defined in G-theory, the higher the error estimates, the lower the G-coefficient.

Test-retest is not an issue for tests that attempt to establish the “state” of the respondent regarding a trait. Low test-retest reliability would not necessarily be bad.

P. 4

!!!!!
“If the test scores and the criterion vary in more-or-less the same way as a function of occasion, this dependency could enhance the test-criterion correlation as it lowers test-retest reliability.”
!!!!!

Interpretations = Chain of inferences leading from scores to conclusions and decisions. Those inferences + their assumptions = interpretive argument, provides explicit, detailed, specification of inferences, ergo, a path to validation. A validity argument would address each link in that chain. [This is WAY far from Messick.]

“Almost all interpretive arguments require generalization as one major inference.”

P. 5

The inference of generalizability rests on the assumption of invariance of performance, items, raters, etc. G-theory tests this assumption.

P. 6

[G-coefficient needs to be adjusted for this project because we expect certain components to be large that are normally small. We could use the G if we only included a single occasion, but then we have a single-facet design and reporting α would suffice.]

P. 7

“We can think of a failure of generalizability as invalidating one or more invariance assumptions, and therefore, as invalidating the interpretive argument.”

P. 8

The proposed interpretation governs the considered sources of variance.

P. 9

If scores are not invariant across a facet, then the inferences cannot be said to hold across the facet.

Notes from Osterlind
Friday, December 29th, 2006

Osterlind

  1. 371

    1. Unidimensionality

    2. See chapter 7

    3. Appraise all tests for their unidimensionality

      1. Requires

        1. Stats

        2. Cog Psych

    4. Findings difficult to interpret

    5. Hattie (1985) Assessing unidimensionality of tests and items, Applied Psychological Measurement 9, 139-164.

  2. 372

    1. Begin with author’s intent, and the grounding theory of the measure

    2. “My focus in this discussion is more on how the methods may be used by measurement professionals for investigating dimensionality and less on explaining the procedures themselves, beyond the obvious overlap.”

  3. 373

    1. Psychological interpretations should predominate the analysis.

    2. Interpretations should be grounded in theory

    3. Do not:

      1. Give the results more meaning than the method allows

        1. Eg. They do not reveal all structure for the test

      2. Generalize beyond what the method allows

    4. “Factoring is part of building an evaluative argument for test validity.”

  4. 374

    1. Exploratory FA to discover commonalities (and identify malfunctioning items)

    2. Confirmatory FA to test hypothesized commonalities

      1. Use Principal axis factoring or SEM

    3. Carefully consider which one is appropriate

  5. 377

    1. Factor selection methods

      1. Cattell’s Scree Criterion = Visual (scree plot)

        1. Arbitrary

      2. Kaiser’s criterion = eigenvalues of 1.0

  6. 379

    1. Rotation (Kaiser) finds more commonalities

    2. Rotation is SOP

  7. 381

    1. Orthogonal rotation produces uncorrelated factors

    2. Use oblique rotations if theory says factors may be correlated

  8. 382

    1. In exploring a whole test in the early stages of development, use PCA with varimax. This will indicate the most malfunctioning items.

  9. 383

    1. PCA is, however, unsuited for investigating unidimensionality.

    2. If a test has subtests each with subscales, PCA may only identify the first stratum.

  10. 386

    1. Limitations of FA and PCA

      1. Matrix must be positive definite (no diagonal values can exceed 1.0)

        1. Else negative eigenvalues may result

      2. Both use maximum likelihood, which may reveal inflated communalities.

      3. Heywood condition

        1. Unique to binary data

        2. Present in extremely easy or difficult data (lack of possible covariance = high correlations)

        3. Results in non-positive definite matrices.

  11. 387

    1. Full information Item FA (IRT-based, reflects cognitive theory)

      1. Factors are identified as dimensions of the scale, not just by commonalities in variance.

      2. Range of factors is infinite, the items only test a fraction of their range

      3. Doesn’t use Pearson’s correlations

      4. Uses tetrachoric correlations (Described in next section.)

      5. Hattie (1985) justifies this approach

      6. (Must still test local independence, as Erickson, 2000)

      7. Procedurally complex

        1. Marginal maximum likelihood procedures.

        2. See TESTFACT

  12. 391

    1. Tetrachoric correlations

      1. Used when the two dichotomous variables are theoretically assumed to be “estimates of two continuous latent variables.”

    2. Indices of unidimensionality

      1. Local independence index

      2. Pattern index

      3. Ratio of difference index
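
Of the factor selection methods in the notes from p. 377, Kaiser's criterion is the easiest to sketch: retain the factors whose eigenvalues of the correlation matrix exceed 1.0. The function name `kaiser_retained` and the correlation matrix below are hypothetical, and the matrix is built by hand so that two item pairs each form a cluster. This only counts eigenvalues. It is not a full factor analysis, and it does nothing about the rotation or tetrachoric issues noted above.

```python
import numpy as np

def kaiser_retained(corr):
    """Count eigenvalues of a correlation matrix that exceed 1.0 (Kaiser's criterion).

    Returns the count and the eigenvalues sorted from largest to smallest.
    """
    # eigvalsh is meant for symmetric matrices and returns ascending eigenvalues.
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))[::-1]
    return int(np.sum(eigenvalues > 1.0)), eigenvalues

# Hypothetical correlation matrix: items 1-2 correlate, items 3-4 correlate,
# and the two pairs are unrelated to each other.
corr = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
n_factors, eigenvalues = kaiser_retained(corr)
print(n_factors, eigenvalues)
```

Plotting the same sorted eigenvalues against their rank gives the scree plot used for Cattell's criterion.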

Something I had never done before
Sunday, December 24th, 2006

I hesitate to admit it, but I’ve been in school since 1994, minus three years off. I’m not hesitant to say the one piece of college life (besides binge drinking) that I’ve never tasted is pulling an all-nighter. Until this week.

Sure, as an undergrad I had stayed up late, but usually not to study. I always budgeted my time well, even attempting study/sleep intervals of 6 hours each to meet an insane final exam schedule. But I had never needed to stay up all night until now.

On Wednesday, my chair informed me that he had promised his wife he would not work between Christmas and New Year’s…. Great. I mean, that’s great for him, and I fully support him spending more time with his family (when do I get to do that?), but where did that leave me?

He told me he would read my draft before Christmas if I could get it under his door by 8AM on Friday.

So, I worked the rest of Wednesday, and most of Thursday. When we tucked the kids in Thursday night, I told Brooke I was going to stay up until I was done, set up on the kitchen table, and went to work. The task: Finish Chapter 2, adding 35 new references I had found in the last week, and write Chapter 3. I had the outline, my notes, and all my references, so it was just a matter of putting everything into a semi-coherent format.

2AM: I started feeling it. I got a bowl of left-over chowder about the time Miles woke up. Brooke beat me to his room, so I got them a bottle and went back to work.

4AM: I was pretty close. I decided to skip updating my bibliography for this draft, and started noting, but not fixing, my APA heading and pagination issues (I was way too tired to play with Word… When will I learn LaTeX?).

4:30AM: I finished typing and started reading. Since I had been writing each chapter individually, I needed to clean up my acronyms, remove identical paragraphs, etc. The document was 56 pages long without references or appendices.

6:00AM: My draft was complete. Now it was time to run to the lab, print it, and tuck it under his office door. Wait. The color printer in the lab is horrible, so I needed to print the four pages with color figures (including a plot of Partial Credit Model conditional probability curves for five response categories) at home.

6:20AM: After my laptop refused to network to my printer (I was in no mood to play), I tried to open the file on my desktop forgetting I run OpenOffice on that box. OO does fine with Word docs, but the formatting isn’t 100% identical, and that’s a big deal. I finally used PDFcreator to get the four pages printed.

6:40AM: I arrived on campus and parked in a faculty stall. I had 20 minutes before they would enforce parking. By this time, I could hardly feel my legs, and would shiver if I stood still for too long. That had nothing to do with it being 20 degrees outside.

6:50AM: I printed the entire document (now almost 70 pages long), swapped out the color pages, put a note on the front, binder clipped it, and tried to slide it under his door. That wasn’t going to work… The document was too thick. So, I placed my life in the faculty’s box, and left a note on his door.

7:15AM: Home. Tired. I cleaned up the kitchen, and got ready for a long winter’s nap.

7:30AM: I got in bed.

7:31AM (approx.): I was asleep.

It felt like 9AM when I rolled over and saw it was 1:30PM. I stayed in bed until 2:30. I wouldn’t recommend it, though my chair sent me a note today saying he was impressed with my work. Here’s the note I attached to the document:

————–

Dr. X,

Here it is in all its glory.

It’s missing the bibliography.

There are some issues with the APA headings (I use the all caps level 5, but not the lowest level).

The pagination is not right (I shouldn’t have a “1” on the title page).

I did move the section you suggested from the lit review to the introduction.

I’m most interested in how well this fits what you want. Give it a read. Mark this copy up as much as you want.

Let me know.

It’s 6:27am, and I haven’t slept.

–jeremy

PS – Does this letter look like some whacked out haiku to you too?

Don’t you just hate it when…
Saturday, December 16th, 2006

The library’s federated periodical search page is down (how did we ever live without it?), so I started googling for articles using think-aloud protocols, Samuel Messick’s (1995) recommended method of gathering evidence for the substantive aspect of construct validity, specifically pertaining to self-efficacy instruments. Here’s what I found:

The top two results are from my blog, simply telling me what I already know.

This isn’t as bad as when I was trying to learn Linux, and would Google for answers only to find my initial questions, which I had posted to Internet bulletin boards days earlier.

So, I guess Picasso was wrong when he said, “Computers are useless. They can only give you answers.”

How accurate do you have to be?
Wednesday, November 8th, 2006

I just lost a point on an exam for being 0.00094 off the correct answer.

However, the professor had a good point about why I was that far off, and I will never forget not to round my intermediate values.
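
For the record, here is a small sketch of how rounding an intermediate value shifts a final answer. The numbers are hypothetical and are not the ones from the exam. The sketch computes a z-score twice, once at full precision and once after rounding the standard deviation to one decimal place.

```python
import statistics

# Hypothetical data and score; not the values from the exam.
data = [2, 3, 5, 7, 11]
x = 11

mean = statistics.mean(data)
sd = statistics.stdev(data)       # sample standard deviation (n - 1)

# Full precision carried through to the end.
z_exact = (x - mean) / sd

# The same calculation with the intermediate values rounded first.
mean_r = round(mean, 1)
sd_r = round(sd, 1)
z_rounded = (x - mean_r) / sd_r

print(round(z_exact, 4), round(z_rounded, 4), round(z_exact - z_rounded, 4))
```

Rounding only when reporting the final value keeps this kind of drift out of the answer.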

Some notes on Messick’s construct validity
Saturday, September 9th, 2006

“In essence, construct validity comprises the evidence and rationales supporting the trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and score relationships with other variables” (Messick, 1995).

Messick’s construct validity is a much larger umbrella than even the traditional idea of validity. Evidence from criterion measures, expert ratings of item relevancy and other appearance traits, evidence supporting theoretical relationships, and arguments for the appropriateness of the interpretations of the test results all contribute to construct validity. More useful than these traditional sources of evidence are cognitive exercises that tie the content and results of the instrument to the construct it claims to measure. Such exercises include think-aloud protocols to discover how participants are processing each item, longitudinal and treatment-control studies to ensure that the measure is sensitive to perceived changes in the measured trait, and the traditional criterion-oriented validity activities (Messick, 1995).

In each of his aspects, Messick re-evaluates the established validity theory with an emphasis on what its evidence “can contribute to an understanding of score meaning” (Messick, 1995). Thus, the questions move beyond questions of correlation between one measure and another, to include issues of construct representativeness and construct-(ir)relevant variance. While Messick introduces no new sources of validity evidence, he reorganizes the whole of validity theory.

–Content Aspect–

“The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989b)” (Messick, 1995).

The content aspect involves defining the “boundary of the construct domain to be assessed,” and then assessing how well the items represent that domain (coverage) and not some other construct. A domain theory-based analysis of the construct fulfills the first task, and expert judgments of the coverage and relevance of the items, a method usually assigned to face validity, accomplish the second.

[NEED another DOMAIN THEORY ANALYSIS EXAMPLE.]

Often, the processes used to gather evidence that would fit under this aspect ignore domain analysis and gather both types of evidence from expert raters. In developing a scale of teacher self-efficacy in teaching tobacco use prevention, Perry (1996) sent her instrument to a national panel of experts for review. Interestingly, this was not a simple content coverage analysis because she asked the experts to judge whether the items fit the description of self-efficacy as defined by Bandura (1977). Because this process addressed issues of the underlying psychological trait the instrument was designed to measure, it clearly fits under the content aspect of construct validity, but there was no effort to address relevance or coverage. This seems to be a less effective use of resources, as evidence gathered from subject matter experts would be better used to judge coverage of their domain of expertise than the relation between the test and the psychological trait it purports to measure, with which they may or may not be familiar. In fairness to the study, it was not built with Messick’s model in mind.

Although the developers of the New General Self-Efficacy Scale (NGSE; Hough et al., 1990) likewise do not cite Messick, they used two panels of graduate students to judge whether their items represented self-efficacy, self-esteem, or some other trait. This is considered superior to Perry’s validation effort because they used construct matter experts to judge construct relevance, though they admitted that they lacked evidence of “the degree to which the domain is sufficiently sampled.” Because their target was “general” self-efficacy, there was no need for task-specific subject matter expert ratings.

The development of Pearson’s (2001) Self-efficacy for Musical Studies (SEMS) scale provides an example of a complete content aspect validation. It included a domain analysis that was approved by subject matter experts before item development was completed. This greatly aided the development of her items and facilitated the process of gathering evidence of domain representativeness. After completing item development, she asked individual experts to categorize each item under the subscales identified in her domain analysis. The percentage of raters who classified the item in its a priori subscale was interpreted as an item-domain congruence rating, and some indication of domain representativeness. Pearson also asked her experts to rank each item according to its relevancy to the domain outlined in the analysis. She then reported, for each item, Aiken’s V indices (19[80?]), a coefficient that summarizes these ratings and can be tested for significance. The only shortcoming in Pearson’s validation vis-à-vis Messick’s content aspect is that, even with the congruence rating, the only discussion of domain coverage was each item’s representativeness of and relevance in the domain. There was no comment on how well the scale’s items covered the entire domain.
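
The two item-level indices described above are simple to compute. Below is a minimal sketch, assuming a 1-to-5 relevance scale and made-up ratings, since the scale Pearson actually used is not given here. Aiken's V is the sum of each rating's distance from the lowest category, divided by the maximum possible sum, so it runs from 0 to 1. The function names and subscale labels are hypothetical, and the significance test is not included.

```python
def aikens_v(ratings, lowest, highest):
    """Aiken's V for one item: V = sum(r - lowest) / (n * (c - 1)).

    ratings: one relevance rating per expert.
    lowest, highest: end points of the rating scale, so c - 1 = highest - lowest.
    """
    n = len(ratings)
    s = sum(r - lowest for r in ratings)
    return s / (n * (highest - lowest))

def congruence(assigned, intended):
    """Proportion of raters who placed an item in its a priori subscale."""
    return sum(1 for a in assigned if a == intended) / len(assigned)

# Hypothetical ratings from three experts on an assumed 1-5 relevance scale.
print(aikens_v([3, 4, 5], lowest=1, highest=5))

# Hypothetical subscale assignments from four experts for one item.
print(congruence(["A", "A", "B", "A"], intended="A"))
```

Neither index says anything about how well the full set of items covers the domain, which is the gap noted above.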

–Substantive Aspect–

“The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks” (Messick, 1995).

If the content aspect regards how well the items sampled the content domain, the substantive aspect regards how well the items sample the cognitive processes expected to impact the trait in question. Response consistencies that are predicted by the theory, “think aloud” protocols, and structural equation modeling would provide evidence of this aspect (Messick, 1995).

–Structural Aspect–

“The structural aspect appraises the fidelity of the scoring structure to the structure of the construct domain at issue (Loevinger, 1957; Messick 1989b)” (Messick, 1995).

This ambiguously named aspect simply requires a rational connection between the construct’s domain structure, possibly revealed in the domain analysis, and the scale’s scoring system. Messick does not list any empirical evidence, but suggests the argument be made “rationally.”

–Generalizability Aspect–

“The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test criterion relationships (Hunter, Schmidt, & Jackson, 1982)” (Messick, 1995).

Measures should be tested across targeted populations, on multiple occasions, and, if applicable, with different raters. Such testing would reveal the “limits of score meanings.” Ideally, every psychological measure would also be tested for its correlation with the behavior it is meant to predict. Although this would demonstrate the extent to which the scale generalized to its targeted construct, observing even relatively simple tasks can be time-intensive and not always practical.

–External Aspect–

“The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965)” (Messick, 1995).

No other aspect is so compatible with standard validation procedures. It is important to note that Messick requires the typical criterion-oriented evidence be evaluated in light of the interpretation of the scores (Messick, 1995). Typically, correlations between the scores and convergent measures are reported as if their value were obvious. Discriminant correlations tend to be evaluated more in the light of score interpretations simply because the researcher must explain why a low correlation is expected.

The creators of the NGSE compared results of their scale across participants in ten different occupations, in three different countries, and correlated the results with task-specific self-efficacy measures. They also administered the scale before and after a midterm examination in a university psychology course, comparing the effect the testing experience had on the students’ self-efficacy.

In creating the NGSE, the developers were very concerned that the scale not measure self-esteem, which is a theoretically distinct trait. Therefore, they administered self-esteem scales concurrently, hoping the two would not correlate strongly. Because self-efficacy and self-esteem are related traits, a sizable correlation could be expected, but the resulting .[WHAT WAS IT] was deemed acceptable.

–Consequential Aspect–

The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness, and distributive justice (Messick, 1980, 1989b). (Messick, 1995)

This is his most controversial and original aspect. Messick had argued for validators to consider the consequences of a test for three decades before publishing his unified model (Messick 1964, 1965, 1975, 1980). He (Messick, 1980) divided the consequential aspect into consequences of test use and consequences of test interpretation, and pointed out that more has been done on the former than the latter. Simply put, the consequential aspect requires that “any negative impact on individuals or groups should not derive from any source of test invalidity” (Messick, 1995). The validator should investigate both immediate and long-term consequences, and expend effort on imagining any possible unforeseen consequences.

When investigating the consequences of test use, the validator may use what Churchman (1971) termed Kantian inquiry, wherein the potential benefits of the test are contrasted against the potential negative consequences. A Hegelian approach, where ethical and antiethical perspectives are juxtaposed, may also be fruitful (Messick, 1980). Evaluating the consequences of score interpretation is more difficult: “That social values impinge upon theoretical interpretation may not be as obvious, but it is no less serious” (Messick, 1980). In either case, the broader the scope of the construct being measured, the more difficult fully evaluating the consequences of test interpretation becomes.

Notes from Messick (1995)
Thursday, August 31st, 2006

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.

p. 741

The traditional conception of validity divides it into three separate and substitutable types—namely, content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use.
[I don't think he had self-report scales, like self-efficacy, in mind with this. It's obviously bent towards "testing," both classical and modern. We should look at what's been published since 1995 on this vis-à-vis self-efficacy.]

unified validity integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships, including those of an applied and a scientific nature.
[La ilaha illa allah. There is no validity but construct validity.]

[But construct validity has its aspects.]
These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity.

Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment (Messick, 1989b). [...] not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment.

validity is an evolving property and validation a continuing process.
[But my dissertation must be defended by February!]

p. 742

As a salient social value, validity assumes both a scientific and a political role that can by no means be fulfilled by a simple correlation coefficient between test scores and a purported criterion (i.e., classical criterion-related validity) or by expert judgments that test content is relevant to the proposed test use (i.e., traditional content validity).
[Like SDR, we cannot avoid politics. Let's deal with it instead.]

Therefore, it is fundamental that score validation is an empirical evaluation of the meaning and consequences of measurement.

[In strict operationalism,]
the measure is viewed as just one of an extensible set of indicators of the construct. Convergent empirical relationships reflecting communality among such indicators are taken to imply the operation of the construct to the degree that discriminant evidence discounts the intrusion of alternative constructs as plausible rival hypotheses.

—Sources of Invalidity—

construct underrepresentation
[If we cover the NETS-T, is that a priori representation?]

construct-irrelevant variance:
++construct-irrelevant difficulty for individuals and groups is a major source of bias in test scoring and interpretation and of unfairness in test use.
++construct-irrelevant easiness occurs when extraneous clues in item or task formats permit some individuals to respond correctly or appropriately in ways irrelevant to the construct being assessed.

p. 743

But both communication skill and mathematical knowledge are considered relevant parts of the higher-order construct of mathematical power, according to the content standards delineated by the National Council of Teachers of Mathematics (1989).
[Ah, so we can use the NETS-T to dictate what is relevant. Also, given that teachers in most states must pass a basic English test, can we assume that reading is a relevant trait as well?]

—Sources of Evidence in Construct Validity—

explanatory concepts that account for both test performance and score relationships with other variables

Almost any kind of information about a test can contribute to an understanding of score meaning, but the contribution becomes stronger if the degree of fit of the information with the theoretical rationale underlying score interpretation is explicitly evaluated (Cronbach, 1988; Kane, 1992; Messick, 1989b).

Probably even more illuminating in regard to score meaning are studies of expected performance differences over time, across groups and settings, and in response to experimental treatments and manipulations.
[This is what we're doing with the pre/post tests.]

Possibly most illuminating of all, however, are direct probes and modeling of the processes underlying test responses, which are becoming both more accessible and more powerful with continuing developments in cognitive psychology (Frederiksen, Mislevy, & Bejar, 1993; Snow & Lohman, 1989).
[Again, I do not think this would be that useful on a self-report format.]

construct validity, as previously indicated, also subsumes content relevance and representativeness as well as criterion-relatedness.

p. 744

In other words, empirical relationships between predictor scores and criterion measures should make theoretical sense in terms of what the predictor test is interpreted to measure and what the criterion is presumed to embody (Gulliksen, 1950).
[This is the crux of the assignment score-gain score correlation.]

And to appraise how well a test does its job, one must inquire whether the potential and actual social consequences of test interpretation and use are not only supportive of the intended testing purposes, but also at the same time consistent with other social values.
[We must consider the consequences to the program and instructors, but they are not the ones taking the test.]

In the language of the Cronbach and Meehl (1955) seminal manifesto on construct validity, the intended consequences of the testing are strands in the construct’s nomological network representing presumed action implications of score meaning.

the general construct validity evidence may need to be buttressed in applied instances by specific evidence of relevance and utility.
[We have gathered a little evidence of this already. Perhaps an interview with a dean, another professor at another school?]

**Good opening Lines***
From the discussion thus far, it should also be clear that test validity cannot rely on any one of the supplementary forms of evidence just discussed. However, neither does validity require any one form, granted that there is defensible convergent and discriminant evidence supporting score meaning.

What is required is a compelling argument that the available evidence justifies the test interpretation and use,
*******
[This is also free license to ignore any of the aspects of construct validity.]

—Aspects of Construct Validity—

p. 745

[Each section of the lit review should begin with the corresponding description.]

+The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989b);
[The NETS-T + item/reliability/factor analyses + relevancy scale]

+The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks;
[This is a bit more difficult. Does CFA via SEM work? A more formal think-aloud? We might have to argue that this is a testing aspect, not a self-report aspect.]

+The structural aspect appraises the fidelity of the scoring structure to the structure of the construct domain at issue (Loevinger, 1957; Messick 1989b);
[CFA goes here? Concurrent vs. convergent? Comparing gain and assignment scores? Or is that under external?]

+The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test criterion relationships (Hunter, Schmidt, & Jackson, 1982);
[Compare majors? El vs. 2nd vs. Special Ed?]

+The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965);
[Is this where the gain vs. assignment scores goes?]

+The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness,
[We'll explain why we're not addressing this.]

—Content Aspect—

especially [useful is] domain theory, in other words, scientific inquiry into the nature of the domain processes and the ways in which they combine to produce effects or outcomes.
[Does the NETS-T have an empirical basis?]

it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense.
[The NETS-T does do this.]

Functional importance can be considered in terms of what people actually do in the performance domain
[But what about what we would like them to do? Current practice may not be the target.]

—Substantive Aspect—

Two important points are involved: One is the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content; the other is the need to move beyond traditional professional judgment of content to accrue empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance.
[Again, we rely on the NETS-T for this. Are they valid?]

the need for empirical evidence of response consistencies or performance regularities reflective of domain processes (Loevinger, 1957).
[CFA?]

Such evidence may derive from a variety of sources, for example, from “think aloud” protocols or eye movement records during task performance; from correlation patterns among part scores; from consistencies in response times for task segments; or from mathematical or computer modeling of task processes (Messick, 1989b, pp. 53-55; Snow & Lohman, 1989).
[CFA? but one would think that "Structural Equation Modeling" would go under "structural aspect".]

—Structural Aspect—

scoring models should be rationally consistent with what is known about the structural relations inherent in behavioral manifestations of the construct in question (Loevinger, 1957; Peak, 1953)

p. 746

also the rational development of construct-based scoring criteria and rubrics.
[Averages, factor scores, IRT]

the internal structure of the assessment (i.e., interrelations among the scored aspects of task and subtask performance) should be consistent with what is known about the internal structure of the construct domain (Messick, 1989b). [=] structural fidelity (Loevinger, 1957).
[Perhaps we need to move from the NETS-T and compare the results with known self-efficacy expectancies.]

—Generalizability Aspect—

Evidence of such generalizability depends on the degree of correlation of the assessed tasks with other tasks representing the construct or aspects of the construct.
[This isn't "external"?]

there is a conflict in performance assessment between time-intensive depth of examination and the breadth of domain coverage needed for generalizability of construct interpretation. This conflict between depth and breadth of coverage is often viewed as entailing a trade-off between validity and reliability (or generalizability).
[It's easier to score an instrument written to a specific task, but that's less generalizable. Have we succeeded in finding a happy medium?]

—External Aspect—

the extent to which the assessment scores’ relationships with other measures and nonassessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed.
[This is the crux of our study.]

That is, the constructs represented in the assessment should rationally account for the external pattern of correlations.
[A regression analysis might be necessary to remove the good-student effect from the domain performance. Still, it could be that assignments are not increasing self-efficacy and the TICS is measuring them correctly. How do we know?]
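
[One way to sketch the "remove the good-student effect" idea is a partial correlation: the correlation between gain scores and assignment scores after the linear effect of a general-achievement variable has been regressed out of both. The data below are invented, and using GPA as the stand-in for "good student" is my assumption, not something settled in the study design.]

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs from both."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data for eight students (not real results).
assignment_scores = [78, 85, 92, 70, 88, 95, 81, 74]
gain_scores = [0.4, 0.6, 0.9, 0.2, 0.5, 1.0, 0.7, 0.3]
gpa = [2.9, 3.2, 3.8, 2.5, 3.4, 3.9, 3.0, 2.7]  # assumed proxy for "good student"

raw_r = pearson(assignment_scores, gain_scores)
adjusted_r = partial_correlation(assignment_scores, gain_scores, gpa)
print(f"raw r = {raw_r:.2f}, r controlling for GPA = {adjusted_r:.2f}")
```

[If the adjusted correlation drops sharply compared with the raw one, much of the assignment/gain relationship may simply reflect general achievement.]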

Discriminant evidence is particularly critical for discounting plausible rival alternatives to the focal construct interpretation.
[Perhaps a survey of some self-esteem trait other than self-efficacy? Or just a general self-efficacy measure?]

—Consequential Aspect—

The primary measurement concern with respect to adverse consequences is that any negative impact on individuals or groups should not derive from any source of test invalidity,
[If the test is used to monitor the program, and the students will neither see their scores nor be affected by them, does this apply?]

—Validity as Integrative Summary—
[an evaluation]

p. 747

This relation is embodied in theoretical rationales or persuasive arguments that the obtained evidence both supports the preferred inferences and undercuts plausible rival inferences. From this perspective, as Cronbach (1988) concluded, validation is evaluation argument.

The six aspects of construct validity afford a means of checking that the theoretical rationale or persuasive argument linking the evidence to the inferences drawn touches the important bases; if the bases are not covered, an argument that such omissions are defensible must be provided.
[So we can ignore the consequential aspect so long as we justify its omission.]

The challenge in test validation is to link these inferences to convergent evidence supporting them and to discriminant evidence discounting plausible rival inferences.

p. 748

[Figure 1 is interesting, but of no import if we're not concerned with the consequential aspect.]

the evidence and rationales supporting the trustworthiness of score meaning are what is meant by construct validity,

Counterproposals to a proposed test use might involve quite different assessment techniques, such as observations or portfolios when educational performance standards are at issue.
[The high-stakes assessments use these, but they are time-consuming and costly. Our instrument is a low-stakes, low-cost, medium-fidelity measure.]

Structural equation modeling notes
Wednesday, July 26th, 2006

Here are a couple of quotes from the first three chapters of A Beginner’s Guide to Structural Equation Modeling (Schumacker and Lomax, 1996).

Chapter 1: Introduction

P. 3

“[Pairwise deletion] may lead to non-positive definite covariance matrix.”
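
To see why this can happen, here is a minimal sketch with made-up data. Each correlation is computed from a different subset of cases, so the three coefficients need not be mutually consistent. The data are deliberately extreme and hypothetical; real data would show the problem less dramatically.

```python
import math

def pairwise_pearson(xs, ys):
    """Pearson r using only the cases where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical data: None marks a missing value.
x = [1, 2, 3, 1, 2, 3, None, None, None]
y = [1, 2, 3, None, None, None, 1, 2, 3]
z = [None, None, None, 1, 2, 3, 3, 2, 1]

r_xy = pairwise_pearson(x, y)  # based on cases 1-3 only
r_xz = pairwise_pearson(x, z)  # based on cases 4-6 only
r_yz = pairwise_pearson(y, z)  # based on cases 7-9 only

corr = [[1.0, r_xy, r_xz],
        [r_xy, 1.0, r_yz],
        [r_xz, r_yz, 1.0]]
determinant = det3(corr)
print(r_xy, r_xz, r_yz, determinant)
```

A proper correlation matrix cannot have a negative determinant, so a negative value here signals that the pairwise matrix is not positive definite.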

P. 4

“[The regression replacement] method as well as Method 3 may lead to heteroscedastic error variances or to non-normal distributions of the now complete data despite normally distributed incomplete data.”

P. 5

“Outliers often alter the covariance matrix, can seriously impact the results of SEM.”
But retain them if they are accurate observations.
Run covariance matrix with and without extreme case, observe effect.
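
A minimal sketch of that check, using invented data in which the last case is the extreme one:

```python
def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data; the final case (30, 1) is the suspected outlier.
x = [2, 3, 4, 5, 6, 30]
y = [1, 2, 2, 4, 5, 1]

cov_with_outlier = covariance(x, y)
cov_without_outlier = covariance(x[:-1], y[:-1])
print(f"with: {cov_with_outlier:.2f}, without: {cov_without_outlier:.2f}")
```

In this made-up example a single case is enough to flip the sign of the covariance, which is the kind of effect worth looking for before running the SEM.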

Check for skewness and kurtosis of distribution.
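
A rough sketch of both statistics, using the simple moment-based formulas. Statistical packages often apply small-sample corrections, so their output may differ slightly. The data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments, with no bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skew, excess_kurtosis

symmetric_scores = [1, 2, 3, 4, 5]   # hypothetical
skewed_scores = [1, 1, 2, 2, 3, 10]  # hypothetical, long right tail

print(skewness_and_kurtosis(symmetric_scores))
print(skewness_and_kurtosis(skewed_scores))
```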

P. 6

If data not normal, normalize it.
(Why not just normalize it to begin with?)

Chapter 2: Correlation

P. 25

Non-positive definite covariance matrices (NPD)
Read Wothke (1993)

P. 26

“If the determinant of the matrix is zero, then the matrix is known as a singular or non-positive definite matrix. This also means that the inverse of the covariance matrix does not exist.”
In other words, there is at least one zero/negative eigenvalue.

NPD usually comes from pairwise deletion or a linear dependency in the observed variables (e.g., collinearity). For example, having one variable that is the sum of two or more other variables, and including them all in the covariance matrix.
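
The sum-score case is easy to reproduce. In this sketch (hypothetical data), the third variable is the sum of the first two, and the determinant of the resulting covariance matrix comes out as zero up to floating-point error.

```python
def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical subscale scores and their total.
subscale_a = [2, 4, 5, 7, 9]
subscale_b = [1, 3, 2, 6, 4]
total = [a + b for a, b in zip(subscale_a, subscale_b)]

variables = [subscale_a, subscale_b, total]
cov_matrix = [[covariance(u, v) for v in variables] for u in variables]
determinant = det3(cov_matrix)
print(determinant)  # zero apart from rounding error, so the matrix is singular
```

Dropping either the total or one of the subscales removes the dependency.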

Chapter 3: Structural Equation Modeling Approach to Regression, Path, and Factor Analysis

P. 39

Path Analysis

“Path analysis does not provide a way to specify the model [of relationships between variables], but rather estimates the effects among the variables once the model has been specified a priori on the basis of theoretical considerations.”

P. 40

“In path analysis, one or more multiple regression analyses are performed depending upon the variable relationships in the path model.”
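
As a sketch of what that means, take a hypothetical three-variable model in which X affects M, and both X and M affect Y. With standardized variables the path coefficients are just the standardized regression weights, which can be written in terms of the correlations. The model, the variable names, and the data are all my own illustration, not an example from the book.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def path_coefficients(x, m, y):
    """Standardized paths for the model X -> M, X -> Y, M -> Y."""
    r_xm = pearson(x, m)
    r_xy = pearson(x, y)
    r_my = pearson(m, y)
    denom = 1 - r_xm ** 2
    return {
        "x_to_m": r_xm,                         # regression 1: M on X
        "x_to_y": (r_xy - r_my * r_xm) / denom, # regression 2: Y on X and M
        "m_to_y": (r_my - r_xy * r_xm) / denom,
    }

# Hypothetical data for eight cases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
m = [2, 1, 4, 3, 6, 5, 8, 7]
y = [1, 3, 2, 5, 4, 6, 8, 7]

paths = path_coefficients(x, m, y)
direct = paths["x_to_y"]
indirect = paths["x_to_m"] * paths["m_to_y"]
print(paths, direct + indirect)
```

Because this particular model uses every possible path, the direct plus indirect effect of X on Y reproduces the observed correlation between X and Y exactly.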

P. 44

Model fit is a chi-square statistic comparing the original correlation between two variables with their model-generated correlation. A significant result means the model does not fit the data.
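
A rough sketch of such a test, under several assumptions of mine: a model X -> M -> Y with no direct path from X to Y (so the implied X-Y correlation is the product of the two paths, leaving 1 degree of freedom), the maximum-likelihood discrepancy function, correlations treated as if they were covariances, and invented values for the correlations and sample size. Real SEM software does considerably more than this.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = det3(m)
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

def ml_chi_square(observed, implied, n_cases):
    """(N - 1) * [ln|Sigma| - ln|S| + trace(S Sigma^-1) - p] for 3x3 matrices."""
    p = 3
    implied_inv = inv3(implied)
    trace = sum(observed[i][j] * implied_inv[j][i] for i in range(p) for j in range(p))
    fit = math.log(det3(implied)) - math.log(det3(observed)) + trace - p
    return (n_cases - 1) * fit

# Hypothetical observed correlations among X, M, Y and a hypothetical N.
r_xm, r_my, r_xy = 0.50, 0.40, 0.35
n_cases = 100
observed = [[1.0, r_xm, r_xy], [r_xm, 1.0, r_my], [r_xy, r_my, 1.0]]
implied_r_xy = r_xm * r_my  # what the model says r_xy should be
implied = [[1.0, r_xm, implied_r_xy], [r_xm, 1.0, r_my], [implied_r_xy, r_my, 1.0]]

chi_square = ml_chi_square(observed, implied, n_cases)
p_value = math.erfc(math.sqrt(chi_square / 2))  # upper tail of chi-square with 1 df
print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}")
```

A small p-value here would mean the model-generated correlation is too far from the observed one, that is, the model does not fit.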

P. 45

Factor Analysis

P. 46

When diagramming an hypothesized factor relationship, F = factor, X = observed variable, U = error.

P. 47

“In confirmatory factor analysis, a reproduced factor model is compared with the original sample matrix to test model fit.” (Covariance structure analysis)

P. 48

Results in the percentage of factor variance explained by the model.
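
A small sketch of how I read this, for a one-factor model with standardized loadings. The loadings are invented, and the formulas are the standard ones for a single factor rather than anything quoted from the book: the model reproduces each correlation as the product of the two loadings, and the proportion of variance explained is the mean of the squared loadings.

```python
from itertools import combinations

# Hypothetical standardized loadings of three observed variables on one factor F.
loadings = {"X1": 0.8, "X2": 0.7, "X3": 0.6}

# Variance in each observed variable explained by the factor (communality);
# the remainder is the unique/error variance U.
communalities = {name: value ** 2 for name, value in loadings.items()}
uniquenesses = {name: 1 - c for name, c in communalities.items()}

# Model-reproduced correlations, to be compared with the sample correlations.
reproduced = {
    (a, b): loadings[a] * loadings[b] for a, b in combinations(loadings, 2)
}

proportion_explained = sum(communalities.values()) / len(loadings)
print(reproduced)
print(f"{proportion_explained:.1%} of the variance explained by the factor")
```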

P. 49

Structural equation modeling approach

“[Structural equation models] typically consist of two parts: the measurement model, and the structural equation model. The measurement model specifies how the latent variables or hypothetical constructs are measured in terms of the observed (measured) variables and describes their measurement properties (reliability and validity). The structural equation model specifies the direct and indirect relationships among the latent variables and is used to describe the amount of explained and unexplained variance.”

P. 50

Confirmatory factor analysis is a type of SEM.