The Legal Climate
Equal employment laws, regulations, and guidelines have pressured employers using tests to develop evidence of their validity. Underlying these pressures in many cases is substantial financial exposure. Public employers are probably under the most difficult set of constraints. Many have civil service regulations which require that a large number of "examinations" be developed and given each year, involving both entry-level jobs and promotions. The amount and complexity of the research required to accumulate the evidence needed to demonstrate that the inferences from these test scores are valid far exceed the resources and budgets of the agencies involved. The professional definition of validity is quoted at the beginning of the chapter by Goldstein and Zedeck. It is taken from the Standards for Educational and Psychological Testing (1985), usually referred to as the APA Standards.
"Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences from the test scores. Test validation is the process of accumulating evidence to support such inferences (p. 9)."
Probably the only feasible strategy open to these agencies, if they wish to satisfy both professional and legal standards, is to form consortia so that the costs of the research can be shared. Unfortunately, rather than adopt a research-based strategy, many agencies have decided to claim that their tests are content valid by stretching the definition of content validity well beyond professionally acceptable limits. The most primitive approach is referred to here as "semantic validity."
The process is simple. The exam writer does a job analysis and then labels the knowledges, skills, and abilities (KSAs) that he or she theorizes are needed to perform the job duties. The next step is to write test items and use the same set of labels that were used in the job analysis. The result is a single set of labels that are assigned to both domains. This result can be described to non-professionals as a demonstration that the content of the test is the same as the content of the job and that the test is therefore content valid. Some of these exam writers have used the term "rational validity" to enhance their claim to legitimacy. Although when the process is described in this way its absurdity seems evident, it can be surprisingly difficult to convince some laymen (e.g., judges) that it is not an acceptable showing of job-relatedness. Even more difficult to challenge are the pseudo validation strategies that are complex extensions of face validity.
Professionals have consistently distinguished between actual validity and face validity. Anastasi (1988) begins a section on face validity as follows:
"Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it. the administrative personnel who decide on its use, and other technically untrained observers (p.144)."
Describing the "administrative personnel" or the "technically untrained observers" as subject matter experts (SMEs) and asking them to offer an opinion on whether the test "looks valid" does not alter the methodology. A non-professional is being asked to determine whether the test is valid or not. Labeling the non-professional an SME does not transform face validity into an acceptable validation strategy.
Collecting the opinions of the non-professional SMEs on forms and asking them to assign numbers to their opinions produces "data," but does not remove the process from the face validity category. The data is simply a quantification of opinion. It allows the calculation of means, standard deviations, interrater correlations, and many other possible statistics. Once the trappings of empirical research are applied to the SMEs' opinions it is easy to lose sight of the fact that they are, after all, the opinions of laymen about the degree to which the test "looks valid" to them.
The label: "quantitative face validity" was chosen as a name for this procedure to emphasize the fact that despite the "scientific" appearance of the report, it is still only face validity.
I have examined many of these reports. Although it is only an anecdotal finding, I have concluded that SMEs will almost always report that the test (and/or its individual items) "looks valid" to them. Thus a quantitative face validity procedure will almost invariably provide apparent support for a validity claim. This is true regardless of whether or not the test is actually valid.
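The quantification process described above is easy to sketch. The following is a minimal illustration with entirely hypothetical SME ratings (the names and numbers are invented for this example): the means and standard deviations are real statistics, but they remain nothing more than quantified lay opinion, and the uniformly high means mirror the anecdotal finding that items almost always "look valid."

```python
import statistics

# Hypothetical SME item-relevance ratings (1-5 scale), one list per SME.
# These numbers are illustrative only; real reports collect such ratings
# on forms and then compute summary statistics like those below.
ratings = {
    "sme_1": [5, 4, 5, 5, 4],
    "sme_2": [4, 4, 5, 4, 5],
    "sme_3": [5, 5, 4, 5, 4],
}

# Per-item means and standard deviations across the three raters.
columns = list(zip(*ratings.values()))
per_item_means = [statistics.fmean(col) for col in columns]
per_item_sds = [statistics.stdev(col) for col in columns]

# The "data" look scientific, but each number is still only an opinion
# about whether the item "looks valid" -- face validity in disguise.
print(per_item_means)  # every item rated highly by every rater
```

Note that nothing in this computation touches job performance; the statistics summarize opinions, not evidence of validity.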
Defining Content Validity
Definitions of content validity by the Society for Industrial and Organizational Psychology (SIOP) and by the federal Uniform Guidelines are quoted at the beginning of the chapter by Goldstein and Zedeck. The Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, [SIOP], 1987) state that content validity is an appropriate strategy when the "job domain is defined through job analysis by identifying important tasks, behaviors, or knowledge and the test . . . is a representative sample of tasks, behaviors, or knowledge drawn from that domain." (p. 19). The Uniform Guidelines on Employee Selection Procedures (1978) state that "To demonstrate the content validity of a selection procedure, a user should show that the behavior(s) demonstrated in the selection procedure are a representative sample of the behavior(s) of the job in question or that the selection procedure provides a representative sample of the work product of the job." (Section 14C(4)).
Notice that in both cases the key to the definition is the idea of a representative sample. Just as measures of relationships (such as the correlation coefficient) are at the core of evidence supporting criterion-related validity, the nature and quality of the sampling process is central in providing evidence of content validity. The most important implication of the centrality of the sampling process is the truism that whatever is sampled is a member of the domain from which the sample is drawn. Thus the relationship between the sample and the domain is "same as." Since the test and the job domain sampled are the same, there is no need to collect empirical data to determine their relationship. The other aspect of the implied "sameness" between the test and the job domain is that if they are not the same, then content validity cannot be demonstrated. The relationship between the two domains must then be determined using empirical research. This situation arises frequently when a content-oriented test development strategy is used.
Content-Oriented Test Development
As long as the critical difference between test development and test validation is recognized, content-oriented test development offers a rich set of possibilities for innovation. Simulations, theoretical measures that attempt to replicate the elements thought to underlie superior performance, and many other creative measurement approaches that can be derived from thoughtfully observing job content and job performance become possibilities. However, as with any other test development strategy, an empirical validation process must then determine that the inferences about job performance are valid.
What Is A Link?
In many of the situations where it cannot be shown that the test items are sampled from a job domain, and the test developer wishes to avoid the expense of empirical research, a procedure is devised to "link" the test domain and the job domain. Professionals know what sampling is and they know what a correlation coefficient is, but what is a link? A survey of dictionary definitions leads to the conclusion that it is some sort of connection. What are the methodological or psychometric characteristics of a link? How does one determine when an attempt to establish a link has failed? To say that a "link," as a scientific construct, falls well short of minimal professional standards is stating the obvious. The use of the word "link," or a synonym claiming to connect the job and test domains, is an almost infallible indicator of semantic or face validity.
A Sophisticated Example Of Apparent Content Validity
As an example of how some of these issues can come together, the chapter by Goldstein and Zedeck serves as an excellent vehicle. Most importantly, it is a creative, sophisticated approach by two eminent I/O psychologists with well-deserved reputations for excellence. See also Goldstein, Zedeck, and Schneider (1993).
First of all, it is a good model for content-oriented test development. The fire-scene simulation that they refer to is an example of the way that content-oriented test development can produce measures that theoretically should have higher validities than the standard ability and aptitude tests. The authors also point out some of the ways that content-oriented test development can go astray.
I would argue, however, that there is still one gap that needs to be closed. The support for the inferences required by the definition of validity rests entirely on judgments by professionals and/or the opinions of SMEs.
I have had direct experience with tests that were developed using the content-oriented approach but which produced opposite results. The Berger Programming Test, which begins by defining a small, highly abstract programming language, produced empirical validities well above those of "programmer aptitude tests" and standard ability tests. Exactly the opposite occurred in the development of a selection battery for power plant control room operators. A computer simulation which "looked valid" both to seasoned I/O psychologists and to the SMEs who were consulted failed to show a relationship to job performance. More traditional ability tests showed substantial validities, so neither the research design nor the performance measures were at fault. The problem, apparently, was that the simulation didn't accurately simulate.
It is probably only a matter of time before enough examples of failed judgments and/or opinions have occurred to discredit what I believe to be a promising procedure. A way must be found to move the process from the quantitative-face-validity category to a methodology that can correct for overly enthusiastic professionals and/or the apparent positive bias of SMEs. Perhaps the best way to do this would involve a combination of synthetic validation and construct validation (Schmitt & Landy, 1993). Meta-analytic techniques might eventually also become useful. The specifics of how this might work are probably better developed by an evolutionary process based on real research than by attempting to define them in the abstract. The general approach would be to focus on constructs covering parts of jobs rather than using the whole job as the unit.
Content Validity as Defined by the Uniform Guidelines on Employee Selection Procedures (UGESP)
The Guidelines were written and adopted as regulations by the four federal agencies with civil rights enforcement responsibilities. Since these agencies disagreed sharply among themselves on some issues (content validity was one of them), the final wording was negotiated and thus is more convoluted than is desirable. There are actually two varieties of content validity discussed in the Guidelines. I shall refer to them as "classic" and "extended" content validity. The standards that are applied to them are, in some instances, substantially different.
Classic Content Validity
The theme that unifies all of the content validity documentation requirements is that the user is expected to provide the detail and specificity needed to clearly relate the content of the test to the content of the job so that the "inferential leap" is very small. The classic approach is appropriate only if a test can be constructed by taking a representative sample of job behavior(s) or of a work product. An example of a work product would be a properly welded angle joint as one item in a sample drawn from a welder's job which is then used in a test for selecting welders. A work behavior is something that the worker does. The standards for demonstrating classic content validity require that all aspects of the test closely resemble the job. Classic content validity is essentially the same as the basic professional view.
Extending Content Validity to Knowledges, Skills, and Abilities
The adopting agencies reacted to the comments received in response to the publication of a draft by attempting to specify how KSAs can be justified using content validity. The task of writing regulations which extend content validity to include KSAs is extremely difficult. The goal is to allow content validation in the situations where it is appropriate and to require criterion-related validation in the situations where it is not.
First, a user must show "that the selection procedure measures and is a representative sample of that knowledge, skill, or ability" (see Section 14C(4)). Two conditions were added to cover the problem that a test can be valid with respect to a knowledge domain even though the knowledge is not needed for successful performance. The first is that the KSA can be "operationally defined" using the restrictive standards in 14C(4). The second is a showing that the KSA is "used in and is a necessary prerequisite to performance of critical or important work behavior(s)" (14C(4)). Another method used "to put a fence around KSAs," as it was referred to at the time, was to define the three terms in a much more restricted way than their standard dictionary or professional definitions (16(M), 16(T), and 16(A)).
Many of the abuses of content validity are attributable to the use of broad dictionary definitions of KSAs which, if accepted by the adopting agencies, would allow claims of content validity in almost any situation. In many of these situations where content validity is inappropriately used, a criterion-related study would show that the test is, in fact, not job-related.
"A selection procedure based upon inferences about mental processes cannot be supported solely or primarily on the basis of content validity. Thus, a content strategy is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs, such as intelligence, aptitude, personality, commonsense, judgment, leadership, and spatial ability." (14C(1))
Content validity is paradoxical because, if it is appropriate, it should be used before any other validation design; however, in the great majority of situations, it is not appropriate. If the test is a sample, and thus has a "same as" relation to a job domain, administering the test is equivalent to being able to obtain a performance evaluation before the decision is made to hire or promote. The validity coefficient equals the reliability coefficient. So the recommendation to employers is: if you can use content validity, you should use it. The issues then focus on using it in accordance with professional standards and the Uniform Guidelines. My assumption is that most employers will want to do both.
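The claim that the validity coefficient equals the reliability coefficient follows from classical test theory: if the test and the performance measure are both representative samples of the same domain, they are parallel measures of the same true score, and their correlation equals the reliability. The simulation below is a minimal sketch of this logic under assumed numbers (unit true-score and error variances, so reliability = 0.5); the variable names and parameters are illustrative, not from the source.

```python
import random
import statistics

random.seed(0)
n = 20000

# Classical test theory sketch: test and criterion share the same
# true score (the job domain) and differ only by independent error.
true_scores = [random.gauss(0, 1) for _ in range(n)]
test = [t + random.gauss(0, 1) for t in true_scores]       # test = domain sample + error
criterion = [t + random.gauss(0, 1) for t in true_scores]  # performance = same domain + error

def corr(x, y):
    """Pearson correlation, computed from scratch."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Reliability = var(true) / (var(true) + var(error)) = 1 / (1 + 1) = 0.5.
reliability = 0.5
validity = corr(test, criterion)
print(round(validity, 2))  # close to the reliability of 0.5
```

The point of the sketch is the "same as" relation: because the test already is a sample of the job domain, the validity coefficient can rise no higher than, and in the parallel-measures case equals, the reliability.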
Classic Content Validity
My recommendation to employers is to simply follow the provisions in the Uniform Guidelines on classic content validity. This will automatically include adherence to professional standards. The definitions of skills and abilities in the Guidelines are so restrictive that they can simply be included under classic content validity.
Knowledge Tests: A Special Case
Knowledge tests are different. The domain sampled is a knowledge domain, not a job domain. From the professional perspective, whether the knowledge domain is an appropriate selection requirement is determined by the job analysis. The "operational definitions" required by the Guidelines call for a showing that the knowledge is "used in" and is "a necessary prerequisite to performance of . . . work behaviors." This presents a problem because job analyses do not normally include lists of all the "work behaviors."
In working with a test publisher on knowledge tests, I have evolved an approach which is both helpful to those writing the test items and which I believe satisfies both professional standards and Guidelines requirements. It is applied on an item-by-item basis, and thus avoids broad generalizations. Four standards are applied to each item:
1. The knowledge measured by the question is clearly defined. This provides a succinct definition of the knowledge element to be tapped by the item. The list of these elements provides a clear, detailed definition of the knowledge domain which is sampled by the test. This information can also be used to eliminate items if a particular job does not require the knowledge that the item represents.
2. The way that the question represents the knowledge is clearly explained. This provides information on the relationship between the fact required by the question and the knowledge as defined in the prior standard. In many cases the response here will simply be "direct inquiry." In other words the knowledge element is the correct answer to the item.
3. How and when the knowledge is used in various work behavior(s) is clearly explained. This defines the work behaviors in which the knowledge is used. It is the first part of the required operational definition.
4. A clear explanation of why the knowledge is a necessary prerequisite to successful performance on the job is provided. This is the second part of the operational definition. It provides extremely important information on the general issue of "job-relatedness" which in some ways goes beyond the Guidelines.
The use of these standards changes the focus from just writing items to a more direct concern with what the job behaviors require. The resulting test items tend to be more straightforward.
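One way to apply the four standards is to keep a documentation record per item. The record below is a hypothetical sketch (the item, job, and field names are invented for illustration, not taken from any actual test publisher's format) showing how standard 1 also supports screening out items whose knowledge a particular job does not require.

```python
# Hypothetical documentation for one knowledge-test item, one field
# per standard. All content here is illustrative.
item_record = {
    "item_id": "ELEC-014",
    # Standard 1: the knowledge measured, clearly defined.
    "knowledge_defined": "Ohm's law relating voltage, current, and resistance",
    # Standard 2: how the question represents the knowledge.
    "representation": "direct inquiry: the knowledge element is the correct answer",
    # Standard 3: how and when the knowledge is used in work behaviors.
    "used_in_behaviors": "calculating safe loads when selecting circuit breakers",
    # Standard 4: why the knowledge is a necessary prerequisite.
    "prerequisite_rationale": "miscalculated loads create fire hazards, so the "
                              "knowledge must be held before unsupervised work",
}

# Standard 1 doubles as a screen: drop the item if a particular job's
# analysis shows the knowledge element is not required.
required_knowledge = {"Ohm's law relating voltage, current, and resistance"}
keep_item = item_record["knowledge_defined"] in required_knowledge
print(keep_item)  # True
```

Keeping the rationale with the item, rather than in a separate report, is what makes the item-by-item approach auditable.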
A Practical Rule-Of-Thumb
The first consideration in deciding to use content validity evidence to support a job-relatedness claim is to apply the "same as" criterion: the test content must be the same as the job domain or, for a knowledge test, the knowledge domain. Otherwise empirical evidence must be added.
As a final step in deciding whether using a test can be supported by content validity, you might apply my rule of thumb: "If you could use the test as part of an incumbent's performance evaluation, then it is probably content valid." For example, if the job description for a typist sets an expected typing speed, then administering a typing test as a performance measure might be justified, but administering a mental ability test would not be appropriate.
Anastasi, A. (1988). Psychological testing. New York, NY: Macmillan.
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An exploration of the job analysis-content validity process. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations. San Francisco, CA: Jossey-Bass.
Schmitt, N., & Landy, F. J. (1993). The concept of validity. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations. San Francisco, CA: Jossey-Bass.
Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the validation and use of personnel selection procedures. (Third Edition). College Park, MD: Author.
Standards for educational and psychological testing. (1985). Washington, DC: American Psychological Association.
Uniform guidelines on employee selection procedures. 29 C.F.R.