
Are Intelligence Tests Perfect?

Russell T. Warne
Jun 9, 2025
Standardized tests ... are too limited, too imprecise, and too easily misunderstood to form the basis of crucial decisions about students. (D. W. Miller, 2001, p. A14)

. . . talent is great, but tests of talent stink. There’s certainly an argument to be made that tests of talent – and tests of anything else psychologists study ... are highly imperfect.
(Duckworth, 2016, p. 34, emphasis in original)


Nothing in this world is perfect, and that includes intelligence tests. Though they are a useful tool for a variety of purposes, intelligence tests – and other tests that measure g – sometimes produce inaccurate scores for individual examinees. And inaccurate scores can lead to incorrect decisions. Sometimes the consequences of using an inaccurate test score can have a lasting impact on an examinee, such as in college admissions testing, diagnosing a disability, or employee selection or promotion. Under extreme circumstances, test score accuracy can be a matter of life and death. After the US Supreme Court ruled in Atkins v. Virginia (2002) that executing someone with an intellectual disability is unconstitutional, an accurate estimate of an inmate's IQ can literally save that person's life.

The question is not whether intelligence tests are perfect – everyone agrees that they have flaws. Rather, the misconception I address in this chapter is that intelligence tests (and other measures of g) are so flawed that they cannot be used for research or practical decision making. In this chapter, I show that intelligence tests are good enough for these purposes.



Measuring Score Imperfections


Professional test creators – called psychometricians – have long been aware that no test is a perfect tool for measuring the trait that it is designed to measure. This awareness is summed up in the fundamental equation of the scientific field of testing:

X = T + E

In this equation, X is the score that a person obtains on a test. T is the examinee’s actual level of the trait being measured, which is called true score. Finally, E stands for error, which is anything that influences a test score that is different from the trait that the examiner wants to measure (Allen & Yen, 1979; R. P. McDonald, 1999). While the equation appears simple, it has the profound implication that any observed score on a test is the result of a mix of the trait being measured and other, irrelevant influences (i.e., error). Test creators are very aware that their tests are imperfect and that score inaccuracies happen. The goal of every psychometrician is to reduce error and maximize a test’s ability to measure the true score of a person’s trait.
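To make the equation concrete, here is a minimal sketch in Python (my own illustration, not from the book; the true score of 110 and the 5-point error standard deviation are arbitrary choices) showing how a fixed true score plus random error yields a different observed score at each administration:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# X = T + E: the examinee's true score T is fixed, but each
# administration adds a random error E, so the observed score X
# differs from one testing occasion to the next.
T = 110                                  # true score (IQ metric); arbitrary
E = rng.normal(loc=0, scale=5, size=8)   # random error with mean zero
X = T + E                                # observed scores on 8 occasions

print(np.round(X, 1))  # eight scores scattering around T = 110
```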

Error can be positive (and boost a person’s observed score) or negative (which would decrease a person’s observed score). Positive error might arise from a lenient test scorer, a lucky guess on a test question, or other favorable circumstances. Negative error may result from a hungry examinee, a stressful event on the way to the testing location, or a distracting environment. Across test items, test versions, administration times, settings, etc., error is theorized to be random. As such, it cancels out across test items because the positive error and the negative error counteract each other. When a test is designed to minimize error, this cancelling out can happen very quickly and consistently.
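The cancelling-out claim can be checked with a quick simulation. In this sketch (again my own illustration, with item-level error drawn from a standard normal distribution), the spread of test scores around the true trait level shrinks as items are added, roughly in proportion to one over the square root of the number of items:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Each simulated item score mixes a fixed trait level with random
# error. Averaging over more items lets positive and negative
# errors cancel, so the test score hugs the true level more tightly.
true_level = 0.0
for n_items in (5, 20, 80, 320):
    item_error = rng.normal(0.0, 1.0, size=(10_000, n_items))
    test_score = true_level + item_error.mean(axis=1)
    print(f"{n_items:4d} items: score SD = {test_score.std():.3f}")
# The SD falls roughly as 1 / sqrt(n_items): more items, less error.
```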

This cancelling out was apparent in the section in Chapter 7 about reliability. Error is the source of score instability, which means that high consistency requires low error in a test score. Chapter 7 also showed that reliability increases (and, therefore, error decreases) as test length increases. A reliability value of 1 is an unobtainable ideal because it would unrealistically indicate that error somehow does not influence the observed score on a test.
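The link between test length and reliability is usually quantified with the Spearman-Brown prophecy formula from classical test theory. The formula itself is not quoted in this excerpt, so treat this short sketch as a standard-textbook supplement rather than the author's own example:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by
    length_factor, per the classical Spearman-Brown formula."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

# Lengthening a test whose scores have reliability 0.70:
for k in (1, 2, 4):
    print(f"{k}x length -> reliability {spearman_brown(0.70, k):.2f}")
# 1x -> 0.70, 2x -> 0.82, 4x -> 0.90: longer tests carry less error.
```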

Most intelligence tests tend to produce scores with high reliability. As an example, the ACT produces scores that have a reliability value of 0.94 (ACT, Inc., 2017, Table 10.1). The overall SAT score has a similar reliability value of 0.96 (College Board, 2017, Table A-6.2). Because many colleges and universities use this score to decide who is admitted (a hugely important decision for applicants), this high reliability value is important. On the other hand, in one study I did on how adolescents solve difficult cognitive test items, the reliability values ranged from 0.681 to 0.886 (Warne et al., 2016), but because the test scores were only used in a research setting, this lower reliability was acceptable.


Table 9.1 Standard error of measurement (SEM) of IQ scores, given reliability values

Reliability    SEM
0.00           ± 15.0 points
0.50           ± 10.6 points
0.70           ± 8.2 points
0.80           ± 6.7 points
0.85           ± 5.8 points
0.90           ± 4.7 points
0.95           ± 3.4 points
0.98           ± 2.1 points
0.99           ± 1.5 points
1.00           ± 0.0 points


Because error is random, reliability is a measurement of how consistent observed scores are across time points, test versions, test questions, etc. Reliability statistics can be used to estimate how much an examinee's score would be expected to vary upon retaking a test; this expected variation is expressed as the standard error of measurement (SEM). On the IQ scale, where the average is 100 and the standard deviation (see the Introduction) is 15, Table 9.1 shows the SEM values that correspond to different reliability values.
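Table 9.1 follows from the standard classical test theory formula SEM = SD × √(1 − reliability). This short sketch (my own, using the IQ standard deviation of 15) reproduces the table and shows how an SEM is used in practice:

```python
import math

def sem(reliability: float, sd: float = 15.0) -> float:
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Reproduce Table 9.1.
for r in (0.00, 0.50, 0.70, 0.80, 0.85, 0.90, 0.95, 0.98, 0.99, 1.00):
    print(f"reliability {r:.2f}: ± {sem(r):4.1f} points")

# Usage: with reliability 0.90 the SEM is about 4.7 points, so
# (assuming normally distributed error) roughly 95% of retest scores
# should land within about 2 * 4.7 = 9.4 points of the examinee's
# expected score.
```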

In the table, notice how high reliability values are paired with low SEM values, which confirms that high reliability indicates high consistency (and therefore low error) in scores. But there is always some degree of error in scores, as long as reliability is not 1. Additionally, it is important to note that for reliability values that are typical for tests used to make decisions (about .85 or higher), test scores are fairly consistent.



Decision Accuracy


Thus, the critics of tests are correct that the tests are not perfect. But tests used for practical purposes tend to produce highly consistent data. As a result, the question is whether the tests are good enough to use as part of making decisions. The evidence is overwhelming that they are.

Academic tests also support highly accurate decisions. As an example, the psychometricians who create the ACT use their test scores to estimate whether examinees are “college ready” (defined as having at least a 50% chance of earning a B and a 75% chance of earning a C in a freshman-level college general education course). Across four subjects – English, mathematics, reading, and science – the accuracy of these classifications ranged from 85% to 89% (ACT, Inc., 2017, Table 10.4), an impressive level of accuracy. It is unlikely that most humans (especially if they have not met a student) would be able to classify students’ college readiness accurately 85–89% of the time.

The college admissions research also shows that to predict college grades, admissions test scores are about as accurate as high school grades (Zwick, 2007). This does not mean that college admissions test scores are redundant, though. Combining both grades and test scores to make a prediction is better than using either alone; therefore, high school grades and college admissions test scores each provide information that the other does not, as the simulation below illustrates. In addition to measuring knowledge, grades measure students’ long-term behaviors and non-cognitive traits that lead to academic success (e.g., ability to meet deadlines, attention to assignment requirements). College admissions test scores measure g and also provide a common score that can be compared across high schools or state lines, which compensates for inconsistencies in grading systems (e.g., Warne, Nagaishi, Slade, Hermesmeyer, & Peck, 2014). Results are similarly impressive in hiring job applicants (see Chapter 23).
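Here is a small sketch of that incremental validity. It is entirely synthetic (no real admissions data; the path coefficients are arbitrary choices): simulated grades and test scores each partly reflect ability, grades also reflect non-cognitive habits, and the combined model predicts the outcome better than either predictor alone:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 5_000

# Synthetic data: grades and test scores share an ability component,
# but grades also carry a non-cognitive component that tests miss.
ability   = rng.normal(size=n)
diligence = rng.normal(size=n)                        # non-cognitive habits
grades = 0.6 * ability + 0.5 * diligence + rng.normal(0, 0.6, n)
tests  = 0.8 * ability + rng.normal(0, 0.6, n)
gpa    = 0.5 * ability + 0.4 * diligence + rng.normal(0, 0.7, n)

def r_squared(predictors: np.ndarray, outcome: np.ndarray) -> float:
    """Proportion of outcome variance explained by a linear model."""
    X = np.column_stack([np.ones(len(outcome)), predictors])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    resid = outcome - X @ beta
    return 1.0 - resid.var() / outcome.var()

print("grades alone:", round(r_squared(grades[:, None], gpa), 3))
print("tests alone :", round(r_squared(tests[:, None], gpa), 3))
print("combined    :", round(r_squared(np.column_stack([grades, tests]), gpa), 3))
# The combined R^2 exceeds either alone: each predictor carries
# information the other lacks.
```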



The Perfect as the Enemy of the Good


These classification accuracy studies are impressive, but none reach 100% accuracy. Errors still occur, and they can have unfortunate consequences for examinees (Lubinski & Humphreys, 1996). However, this is not a reason to eliminate tests. The standard for usefulness is not whether the tests have perfect accuracy; rather, tests should be judged by whether they are more accurate than alternative decision-making strategies. Decades ago, when discussing using tests for selecting job applicants, Paterson (1938, pp. 44, 45) criticized the proposal to judge tests by the standard of perfect accuracy:

Strangely enough, those who demand perfect tests are the very ones who are complacent in the face of the far larger errors being committed daily in school and shop through sole reliance upon traditional methods ... Our perfectionists however show another strange symptom. They survey with hypercritical eyes existing tests and measurements and find them wanting when tested by the severe standard of perfect validity ... I refer to those who reject tests and measurements but parade before the public an array of guidance techniques that are far less reliable and valid. What is the reliability and validity of a guidance interview? Of an occupational pamphlet? Of a lad’s earnest but misguided desire to study medicine?

Paterson (1938) also applied this logic to medicine and showed the absurdity of demanding perfection before intervening in people’s lives. If medical tests and interventions had to be perfect before they could replace existing treatments, then no medical advances would be possible at all. Demanding perfection from scientific tests and treatments sets an unrealistic standard that would block any scientific progress or improvement in people’s lives.



Conclusion


Thus, even though every intelligence test produces an imperfect score, the tests are still highly useful in making decisions. Indeed, demanding that the tests be perfect before they can be used is such an unrealistic standard that it would prevent any intelligence test from ever being used (Gottfredson, 2009). Holding any tool used for decision making to this standard would be the equivalent of banning the tool completely. For some critics of intelligence tests, that is probably the point.

Whether intelligence tests can be used to make decisions does not depend on whether the tests are perfect. Rather, whether to use a test for decision making depends on whether the test is better than alternative methods of decision making. The need to select individuals (for jobs, college admission, promotions, or gifted programs) does not magically disappear if tests are banned. Any time that the number of applicants exceeds the number of positions available, selection has to occur. If intelligence tests can make more accurate judgments than other tools – as is often the case – then the tests should be used whenever possible (especially in combination with other variables). Doing so will result in fewer errors, fairer selection, and more successful experiences in educational programs and jobs.



From Chapter 9 of "In the Know: Debunking 35 Myths About Human Intelligence" by Dr. Russell Warne (2020)




We hope you found this information useful. For further questions, please join our Discord server to ask a Riot IQ team member or email us at support@riotiq.com. If you are interested in IQ and intelligence, we co-moderate a related subreddit and have started a YouTube channel. Please feel free to join us.


Author: Dr. Russell T. Warne
LinkedIn: linkedin.com/in/russell-warne
Email: research@riotiq.com