Mar 3, 2026 · Skills Assessment
Understanding Reliability and Validity in Skill Assessment Design
What makes a hiring test accurate? We explain the critical difference between reliability and validity, and why a consistent test is not automatically an accurate one.
Dr. Russell T. Warne, Chief Scientist

When organizations select a skill assessment for hiring or development, they rarely scrutinize the reliability and validity data before making a purchase. This oversight is understandable, as these psychometric concepts are highly technical and frequently buried in vendor manuals. However, reliability and validity are far more than academic concerns; they provide the only empirical evidence that an assessment actually measures what it claims to measure and that its scores predict meaningful outcomes. Without this foundational evidence, an assessment is simply an expensive mechanism for generating data that may have no actual relationship to job performance.
Reliability: Consistency Before Anything Else
Reliability refers to the degree to which an assessment produces consistent, stable scores. If a candidate takes a well-constructed evaluation twice under similar conditions, their scores should be nearly identical. This is known as test-retest reliability. Furthermore, if the test utilizes multiple items to measure a single underlying skill, the examinee's responses across those items should reveal a consistent pattern, demonstrating internal consistency. Finally, in scenarios requiring human judgment, such as evaluating a work sample, two trained raters should reach the same conclusion, which establishes inter-rater reliability. This last form is notoriously difficult to achieve, as human judgment introduces natural variability that even highly structured rubrics can only partially constrain.
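To make these coefficients concrete, the sketch below computes two of them from a small, entirely hypothetical dataset: Cronbach's alpha for internal consistency and a simple correlation for test-retest reliability. The item scores and administration totals are invented for illustration only.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency (Cronbach's alpha) for an examinee-by-item matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: six examinees answering four items scored 0-5.
items = np.array([
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
])
print(f"internal consistency (alpha): {cronbach_alpha(items):.2f}")

# Test-retest reliability: correlate the same examinees' total scores
# from two administrations of the same test (hypothetical values).
time1 = np.array([18, 11, 17, 6, 9, 19])
time2 = np.array([17, 12, 16, 7, 10, 19])
print(f"test-retest reliability (r):  {np.corrcoef(time1, time2)[0, 1]:.2f}")
```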
Reliability is quantified as a coefficient between 0 and 1, and professional assessments target an internal consistency of 0.80 or higher for high-stakes decisions. Any value below 0.60 strongly suggests that the test items fail to measure a coherent construct. Yet, while reliability is a strictly necessary condition for a functioning assessment, it is not sufficient on its own. A test measuring how quickly a candidate clicks through a questionnaire might yield highly consistent scores, but if click speed is irrelevant to the job, that consistency is useless. Reliability essentially sets a ceiling on the predictive power an assessment can achieve, but it does not guarantee it.
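That ceiling can be stated precisely: under classical test theory, the correlation between test scores and any criterion cannot exceed the square root of the product of the two measures' reliabilities. The brief sketch below illustrates the bound, holding criterion reliability fixed at a hypothetical 0.80.

```python
import math

# Classical test theory bound: validity <= sqrt(test reliability x criterion reliability).
criterion_reliability = 0.80   # hypothetical, held fixed for illustration
for test_reliability in (0.95, 0.80, 0.60, 0.40):
    ceiling = math.sqrt(test_reliability * criterion_reliability)
    print(f"test reliability {test_reliability:.2f} -> "
          f"maximum attainable validity {ceiling:.2f}")
```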
Validity: What the Scores Actually Mean
Validity is not a singular property that a test either possesses or lacks; rather, it is a cumulative body of evidence supporting a specific interpretation of test scores for a particular purpose. Modern psychometrics evaluates this through multiple lenses. Content evidence asks whether the test items accurately represent the domain being evaluated—for example, whether a data analysis test actually requires the candidate to manipulate datasets. Construct evidence examines whether the internal structure of the scores aligns with theoretical expectations. Most importantly for hiring, criterion-related evidence asks whether the scores accurately predict real-world outcomes, such as future job performance.
Criterion-related validity is typically expressed as a correlation coefficient. In personnel selection, these coefficients rarely exceed 0.50 even for the absolute strongest predictors. In real-world hiring environments, where the variance in candidate ability is naturally narrowed, observed validity coefficients generally fall between 0.20 and 0.40. While vendors sometimes imply higher accuracy, a coefficient of 0.30 still translates into hiring outcomes that are meaningfully better than chance.
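One way to see why is a rough simulation under simplifying assumptions: bivariate normal test scores and job performance, a true validity of 0.30, and hiring the top 20% of applicants by test score. Even that modest correlation noticeably raises the share of above-median performers among hires relative to the roughly 50% expected from random selection.

```python
import numpy as np

rng = np.random.default_rng(0)
validity = 0.30        # assumed test-performance correlation
n_candidates = 100_000

# Simulate standardized test scores and job performance with the assumed
# correlation (bivariate normality is a simplifying assumption).
cov = [[1.0, validity], [validity, 1.0]]
scores, performance = rng.multivariate_normal([0.0, 0.0], cov, n_candidates).T

# Hire the top 20% of candidates by test score.
cutoff = np.quantile(scores, 0.80)
hired = performance[scores >= cutoff]

# Compare the share of above-median performers among hires with the
# ~50% a purely random selection process would deliver.
print(f"above-median performers, test-selected hires: {(hired > 0).mean():.0%}")
print(f"above-median performers, random selection:    {(performance > 0).mean():.0%}")
```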
How Validity Estimates Have Shifted
For over two decades, the landmark 1998 Schmidt and Hunter meta-analysis served as the industry standard for validity estimates, positioning general cognitive ability as the single strongest predictor of job performance with an operational validity of 0.51 for medium-complexity roles. However, a major 2022 reanalysis by Sackett and colleagues identified systemic flaws in how previous researchers corrected for range restriction. This updated research concluded that historical estimates had been substantially inflated. In the revised analysis, structured interviews emerged as the strongest predictor at 0.42, while cognitive ability estimates were adjusted downward.
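The mechanics of that dispute can be illustrated with the standard correction for direct range restriction (Thorndike's Case II). The corrected coefficient depends heavily on u, the assumed ratio of applicant-pool variability to the variability observed among incumbents, and it was precisely such assumptions that Sackett and colleagues argued were too aggressive. The figures below are hypothetical.

```python
import math

def correct_range_restriction(r_restricted: float, u: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    u is the ratio of the applicant-pool standard deviation to the
    standard deviation in the restricted (incumbent) sample.
    """
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))

# Hypothetical example: a validity of 0.30 observed among incumbents whose
# score spread is two-thirds of the applicant pool's (u = 1.5).
print(f"corrected validity: {correct_range_restriction(0.30, 1.5):.2f}")
# A more aggressive assumption about restriction (u = 2.0) inflates it further.
print(f"corrected validity: {correct_range_restriction(0.30, 2.0):.2f}")
```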
This recalibration does not invalidate cognitive testing, but it highlights the substantial contextual variability in validity coefficients. For instance, intelligence tests show much higher predictive validity for cognitively intensive roles, like software engineering, than for manual labor positions. Consequently, validity estimates should be understood as population averages with meaningful variance, not as fixed properties that apply universally to every job.
The Criterion Problem and Job Analysis
A frequently overlooked complication in validating skill assessments is that the benchmark for success—usually a supervisor's performance rating—is itself flawed. Supervisor ratings are not objective ground truth; they are heavily influenced by observation frequency, interpersonal dynamics, and inherent biases. This measurement error in the criterion statistically constrains how highly an assessment can correlate with it, a phenomenon researchers call attenuation. When vendors report statistically corrected validity coefficients, they are attempting to estimate what the relationship would look like if job performance could be measured perfectly. Practitioners must remember that these numbers are estimates with built-in uncertainty, not absolute guarantees of accuracy.
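The standard disattenuation formula shows how such corrections work: divide the observed validity by the square root of the criterion's reliability. In the sketch below, both the observed validity of 0.30 and the supervisor-rating reliability of 0.52 are illustrative figures, the latter being a value often cited in the meta-analytic literature.

```python
import math

def correct_for_criterion_unreliability(r_observed: float, r_yy: float) -> float:
    """Estimated validity if the criterion (job performance) were measured
    without error; r_yy is the reliability of the criterion measure."""
    return r_observed / math.sqrt(r_yy)

# Illustrative figures: observed validity of 0.30, supervisor-rating
# inter-rater reliability of 0.52.
print(f"corrected validity: {correct_for_criterion_unreliability(0.30, 0.52):.2f}")
```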
Before any of this statistical evidence can be gathered, an assessment must be built upon a coherent foundation, which is achieved through job analysis. This systematic process identifies the exact tasks, knowledge, and abilities a specific role requires. A rigorous skill assessment draws its content directly from these findings rather than relying on a test creator's intuition. When developers skip this step, they often produce assessments featuring questions that seem intellectually interesting but lack content validity, ultimately yielding scores with little or no predictive value.
What to Look For in Assessment Documentation
An assessment provider unable to produce comprehensive documentation of its psychometric evidence should not be trusted with high-stakes hiring decisions. Minimum documentation must include internal consistency coefficients, test-retest stability metrics, the foundational job analysis, and criterion-related validity studies using actual job performance data rather than proxy metrics. Furthermore, blanket marketing claims stating an assessment is universally "validated" are scientifically meaningless. Valid assessments are those with documented evidence supporting specific uses in specific populations.
These stringent requirements are codified in the Standards for Educational and Psychological Testing, published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. Creating an instrument that meets this bar requires deep expertise. The Reasoning and Intelligence Online Test (RIOT), developed by Dr. Russell Warne after more than 15 years of intelligence research, exemplifies this level of rigor. Featuring expert content review, systematic item analysis, and the first properly representative US-based norm sample for an online cognitive test, RIOT demonstrates what true psychometric validity requires in practice: an accumulated, documented body of scientific evidence.
Ultimately, the relationship between reliability and validity is asymmetrical. A highly unreliable test cannot be valid, but a highly reliable test can still measure the wrong thing entirely. When evaluating a vendor, the critical question is not simply whether the test is valid, but rather what specific validity evidence exists, for which outcomes, in which populations, and what the reliability coefficients actually are. Providers who can answer these questions with documented studies deserve serious consideration; those who cannot should be met with heavy skepticism.
Author: Dr. Russell T. Warne, Chief Scientist