
Is it Difficult to Measure Intelligence?

Russell T. Warne
Jun 7, 2025
Equally challenging (as defining intelligence) has been finding ways of measuring intelligence. (Pastorino & Doyle-Portillo, 2016, p. 328)

Given the variety of approaches to the components of intelligence, it is not surprising that measuring intelligence has proved challenging. (R. S. Feldman, 2015, p. 270)


The two quotes above come from introductory psychology textbooks, and the authors clearly believe that intelligence is difficult to measure. If this belief is true, then the intelligence testing enterprise is fraught with uncertainty, and any interpretations of IQ scores are tentative at best (and misleading at worst). As a result, people who believe that intelligence is difficult to measure also often believe that intelligence research is not trustworthy.

In reality, these textbook authors are completely wrong. Intelligence is extremely easy to measure because – as stated in the Introduction – any task that requires some degree of cognitive work or judgment measures intelligence (Jensen, 1998). All of these tasks correlate positively with one another and measure g (see Chapter 1). As a result, all it takes to measure intelligence is to administer at least one task (preferably more) that requires cognitive work; the resulting score is an estimate of the examinee’s intelligence level.



(Accidentally) Measuring Intelligence


Intelligence is so easy to measure that some people have created tests with the intention of measuring another trait or ability and accidentally created a test to measure intelligence instead. Chapter 1 already gave two examples of this occurring: the Cognitive Assessment System (CAS) and the Cognitive Abilities Measurement (CAM) battery. Both the CAS and CAM were designed to measure cognitive processes – not g. But they still measure g anyway (Keith et al., 2001; Stauffer et al., 1996).

It should not be completely surprising if tests of “cognitive processes” measure g because intelligence is the ability to solve problems, which is clearly a cognitive process. But other psychological tests measure g, even though that was not their creators’ intention. For example, a popular test of moral reasoning, called the Defining Issues Test (DIT), is designed to measure moral development and reasoning in examinees. Yet the test correlates with measures of verbal intelligence (Sanders, Lubinski, & Benbow, 1995). The best evidence indicates that examinees reason about the questions on the DIT, even though the test’s content does not relate to academic topics. Even studies designed to demonstrate the integrity of the DIT as a measure of moral development show that DIT scores are moderately good measures of verbal intelligence (e.g., Derryberry, Jones, Grieve, & Barger, 2007), though DIT scores are not completely interchangeable with IQ scores.

The accidental creation of intelligence tests does not just happen in psychology. Gottfredson (2004) described how the National Adult Literacy Survey (NALS) functions as a test of intelligence. The NALS is designed to measure reading comprehension, but NALS scores have the same pattern of correlations with life outcomes that intelligence test scores have. Moreover, factor analysis of NALS data shows that the items produce just one general factor, which is exactly what happens when intelligence tests are subjected to factor analysis. When the staff and researchers at the US Department of Education created the NALS as a measure of adult literacy, they did not intend for NALS scores to mimic intelligence test scores so closely. But they do nonetheless, and NALS scores can function as proxies for IQ scores (e.g., Gottfredson, 1997b). As a result, the interpretation of low NALS scores as being a product of low literacy is insufficient. In reality, the deficits of people with low NALS scores extend beyond low literacy because the scores are a manifestation of low general intelligence (Humphreys, 1988).

Even more specific than a literacy test like the NALS is the Test of Functional Health Literacy in Adults (TOFHLA), a short test that measures examinees’ ability to comprehend health-related texts, such as doctor’s orders and prescription instructions. Just like the NALS, the TOFHLA functions in exactly the same way that an intelligence scale does. TOFHLA scores correlate with traditional intelligence test scores (r = .53 to .74) and measure a general ability, even after controlling for a person’s years of education, quality of education, occupational prestige, age, race, and gender (Apolinario, Mansur, Carthery-Goulart, Brucki, & Nitrini, 2014; Gottfredson, 2004). Moreover, the pattern of TOFHLA correlations with health outcomes mimics the correlations between IQ scores and health outcomes (Gottfredson, 2004). In fact, these results have led some researchers to suggest that “health literacy” is nothing more than g manifested in a health-care setting (Reeve & Basalik, 2014). The fact that the TOFHLA’s content is strictly health-related while intelligence tests are more general in content is irrelevant.

These facts are surprising to many people. A test question that requires examinees to read a prescription label (like the TOFHLA) or a bus schedule (like the NALS) appears to have little in common with the tasks on traditional intelligence tests, which often require little – if any – reading. The reason these other tests can function like intelligence tests is that the surface content of a test is not what determines the trait that it measures. Rather, what a test measures is determined by the mental abilities or functions that it requires examinees to use (Gottfredson & Saklofske, 2009; Warne, Yoon, & Price, 2014). Because intelligence is such a general ability, many different tasks require examinees to use their intelligence. As a result, a wide variety of test question formats can measure intelligence – even if these tests do not resemble each other at all. Lubinski and Humphreys recognized that these tests can take many different forms and carry different labels when they wrote, “Many psychological measures with different names and distinct items (such as academic ability, aptitude, scholastic ability, scholastic achievement) can, and often do, measure essentially the same thing” (Lubinski & Humphreys, 1997, p. 163).



The Indifference of the Indicator


The fact that the CAS, CAM, DIT, NALS, TOFHLA, and many more tests all measure intelligence is more evidence for the indifference of the indicator, a concept that Charles Spearman (1927, pp. 197–198) proposed (see Chapter 1). Today, the evidence is overwhelming that he was right: any task that requires cognitive effort or work measures g, regardless of the task’s appearance (Cucina & Howardson, 2017; Jensen, 1998). It is because of the indifference of the indicator that scores on different tests are positively correlated – they all measure intelligence. In fact, tasks don’t even have to appear on an intelligence test for people to use their intelligence to respond (Gordon, 1997; Gottfredson, 1997b; Lubinski & Humphreys, 1997).

Although the indifference of the indicator is widely accepted as fact among intelligence researchers, it is a concept that is poorly understood outside the field. Many individuals who try to analyze tests by the surface content produce incomplete or incorrect interpretations of intelligence test data. For example, some individuals (e.g., Helms, 1992; K. Richardson, 2002) have claimed that intelligence tests do not measure reasoning ability at all, but instead measure culturally specific knowledge that is acquired through formal schooling or exposure to certain cultural experiences. While this interpretation could be viable for information items and vocabulary items, it cannot explain why these tasks correlate with test formats that have no apparent cultural content (e.g., matrix reasoning tests) and reaction time tasks. This interpretation also does not explain why items with cultural content correlate with biological variables (see Chapter 3).

Another consequence of a poor understanding of the indifference of the indicator is that it leads to misinterpretations of test scores. For example, most law schools try to select students with the highest scores on the Law School Admission Test (LSAT). After finishing their education, law school graduates must pass a bar exam to practice law. Both tests are designed to measure reasoning ability, though bar exam content is drawn from legal principles and information (Bonner, 2006; Bonner & D’Agostino, 2012) while LSAT content is more general and abstract. The two tests correlate with one another (Kuncel & Hezlett, 2007) and are measures – at least partially – of g. Law schools often publish their students’ bar exam passing rates as evidence of the quality of education they provide. But they are largely ignoring the fact that bar exam passing rates are, to an extent, a consequence of the intelligence level of the students that the law school enrolled. (That is, schools that select smarter students, as judged by LSAT scores, have higher bar passing rates.) Therefore, ranking law schools on the basis of bar exam passing rates is largely an exercise in ranking schools by the intelligence level of their students.

A similar misinterpretation happens when K-12 schools are ranked by their students’ average scores on end-of-year academic achievement tests. A higher score does not indicate a better school (or better teachers) because these tests mostly measure students’ g levels. It is likely that some schools with lackluster test scores have dedicated teachers, good funding, and superb educational programs; likewise, high test scores at some schools are probably just a product of the high levels of intelligence that the student body would exhibit in almost any typical educational environment. Thus, interpretations of test scores that are widespread in the accountability movement or in college rankings probably do not reflect educational quality to the extent that policy makers, legislators, or educators believe, because these groups do not realize that the tests are measuring g.

The indifference of the indicator has an important implication about g. Because test item content and appearance do not matter when measuring g, the nature of g is independent of any test item. In other words, g is not a product of test design. Instead, test questions elicit g by encouraging people to demonstrate the behaviors that are caused by g, such as abstract problem solving and engaging in complex cognitive work (Jensen, 1980a). There is no known test that measures cognitive abilities without also measuring g, and even test creators who attempt to minimize the influence of g and emphasize broad Stratum II abilities fail in their attempts and end up creating tests that mostly measure g (Canivez & Youngstrom, 2019).



Lengthy Testing Not Needed


Traditional intelligence tests, such as a Stanford–Binet or a Wechsler test, take approximately 90–120 minutes to administer. Many group-administered tests of intelligence, such as the SAT, take a few hours. As a result, there is an impression that measuring intelligence is a long, drawn-out process. There is value in using a lengthy test to measure intelligence, but a long test is often not needed. One of the reasons intelligence is easy to measure is that even a relatively short test produces highly stable scores.

The technical term for the stability of scores is reliability. High reliability is vital for any score that supposedly measures a stable trait like intelligence. If scores have poor reliability, then either (1) the trait is not stable, or (2) the scores fluctuate too much to provide a useful measure of the trait. Poor reliability also depresses correlations, making them artificially closer to zero; this makes it harder to detect a correlation involving a score with low reliability (R. M. Kaplan & Saccuzzo, 2018).
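To see why, consider the classic correction for attenuation from classical test theory: the correlation that can actually be observed between two measures is approximately the correlation between the underlying traits multiplied by the square root of the product of the two measures’ reliability coefficients. For example, if two traits truly correlate .50 but each is measured with a reliability of only .70, the observed correlation will be roughly .50 × √(.70 × .70) = .35.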

Reliability is usually measured on a scale from 0 (corresponding to purely random scores) to 1 (corresponding to perfectly stable scores, which is not attainable in practice). The desired level of reliability depends on how scores will be used. If scores will not be used to make decisions that affect examinees’ lives, or if the decisions are temporary and/or easily reversible, then lower reliability is acceptable. For high-stakes situations, though, reliability should be much higher. A common rule of thumb is that reliability should be at least .70 for scores that will only be used for research purposes. Reliability of .85 or .90 might be necessary for diagnostic purposes. And when scores are to be used for a decision that is extremely important and/or irrevocable – like whether someone is mentally competent to stand trial – reliability should be at least .97.

By itself, a single intelligence test item produces a score that is not reliable: only about .25 (Lubinski, 2004; Lubinski & Humphreys, 1997). This means that a score on a 1-item test is too unstable to be useful. However, when items are combined, the total reliability based on those items increases. With 7 items, score reliability increases to .70 – good enough for research purposes. An intelligence test with 12 items has an estimated reliability of .80. And it only takes 27 items (about the length of a single-subject academic test for children) to reach a reliability of .90. Thus, it does not take many questions on an intelligence test to produce reliable scores – another way that intelligence is not a difficult trait to measure.
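These figures are consistent with the Spearman–Brown prophecy formula from classical test theory, which predicts the reliability of a test built from k equally reliable items: reliability(k) = k × r / (1 + (k − 1) × r), where r is the reliability of a single item. With r = .25 and k = 7, the formula gives 1.75 / 2.50 = .70, matching the value above.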

These numbers show that – generally – longer tests produce more reliable scores. But the relationship is not linear: at higher levels of reliability, it takes more and more items to produce small increases in reliability. To raise reliability from .90 to .95 requires a test to expand from 27 items to 57 items. Reliability of .97 requires 97 items, while reliability values of .98 and .99 require 147 and 297 items, respectively. This is why tests of g that are used to make very important decisions (e.g., college admissions, diagnosing a disability) tend to be very long. Still, a 297-item test is not unreasonable. Examinees would need to take breaks, and the test might be spread across multiple testing sessions, but a 297-item test is still shorter than some other tests in psychology.
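For readers who want to check these item counts themselves, here is a minimal Python sketch based on the Spearman–Brown formula, assuming every item has a reliability of about .25 (the single-item figure cited above). Real test development is far more involved, so treat this as a back-of-the-envelope illustration.

def spearman_brown(k, item_rel=0.25):
    # Reliability of a k-item test composed of equally reliable items
    return k * item_rel / (1 + (k - 1) * item_rel)

def items_needed(target, item_rel=0.25):
    # Smallest number of items whose combined reliability reaches the target
    k = 1
    while spearman_brown(k, item_rel) < target:
        k += 1
    return k

for target in (0.70, 0.80, 0.90, 0.95, 0.97, 0.98, 0.99):
    print(f"Reliability of {target:.2f} requires {items_needed(target)} items")

# Output: 7, 12, 27, 57, 97, 147, and 297 items, matching the figures in the text.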



Caveats


There is one important caveat to this discussion: while any cognitive task measures g to some extent, different tasks are not equally good at measuring intelligence. In other words, some are better measures of g than others (Jensen, 1980b, 1985). Matrix reasoning and vocabulary knowledge tasks are extremely good measures of intelligence, which is why they appear on many intelligence tests. Other tasks are not as good, such as maze tests (e.g., Porteus, 1915), which require examinees to complete a two-dimensional maze without errors. Maze tests used to appear on some intelligence tests, but they are a much poorer measure of g than other widely available question formats. As a result, maze tasks have been eliminated from most tests and are no longer in widespread use.

Another caveat to remember is that professional test development consists of more than just writing and administering items (Schmeiser & Welch, 2006). Although writing items that measure g is not difficult (especially when a test creator uses formats that have been shown to measure g well), it takes a lot of training and work to create an intelligence test that is good enough for professional use. The professional standards of test creation – established by the American Educational Research Association (AERA), American Psychological Association, and the National Council on Measurement in Education (2014) – are complex and must be met for ethical testing practices to occur.



Conclusion


Nonetheless, because of the indifference of the indicator and the fact that high reliability does not require many test items, it is not true that intelligence is difficult to measure. In fact, intelligence is incredibly easy to measure. K-12 school accountability tests, licensing tests for jobs, college admissions tests, spelling bees, driver’s license tests, and many other tests are all measures of g – though many also measure other abilities (e.g., job knowledge, memorization), and they are not all equally good measures of g.

Ranking examinees on these tests from highest score to lowest score will produce a rank order that is similar to a rank order based on the IQ scores or level of g of the same examinees. Thus, compared to other psychological traits, measuring intelligence is relatively easy. It is likely that most readers have taken a test that measures intelligence without realizing it.


From Chapter 7 of "In the Know: Debunking 35 Myths About Human Intelligence" by Dr. Russell Warne (2020)





