. . . many psychologists simply accept an operational definition of intelligence by spelling out the procedures they use to measure it ... Thus, by selecting items for an intelligence test, a psychologist is saying in a direct way, “This is what I mean by intelligence.” A test that measures memory, reasoning, and verbal fluency offers a very different definition of intelligence than one that measures strength of grip, shoe size, hunting skills, or the person’s best Candy Crush mobile game score. (Coon & Mitterer, 2016, p. 290)
I found that quotation in a general psychology textbook written for college students. Setting aside the question of why anyone today would use grip strength or shoe size to measure intelligence, almost everything that the authors state in this quotation is incorrect. But it is a common belief, even among psychologists, that intelligence is nothing more than an arbitrary collection of abilities (Warne, Astle, & Hill, 2018).
Gottfredson (2009, p. 30) stated that people who believe this idea are arguing, “Intelligence is a marble collection.” They see intelligence as being like a bag of marbles, where each marble represents a different mental ability. In this view, the only reason why “intelligence” seems to exist is that a psychologist put all these abilities together and forced them to produce one overall IQ score. Under this incorrect reasoning, intelligence is the sum of a collection of tasks that a psychologist arbitrarily chooses to put on an intelligence test. Scientists who think that memory is important will create a test that emphasizes that ability; others who believe that logical reasoning or language abilities are important will emphasize those abilities. If this idea were correct, there would be no way to know which abilities are the “right” or “wrong” components of intelligence, and it would be theoretically possible for two people to each create their own intelligence test measuring completely different abilities. The resulting scores from these intelligence tests would be – theoretically – unrelated.
The first reason why this reasoning is wrong is that g itself is not a simple sum of a set of mental abilities (Jensen, 1998). Rather, factor analysis (a statistical procedure explained in the Introduction) finds the overlap of the variances of scores from different tasks and eliminates the unique component of each of these scores. This overlapping portion across all scores is the general ability factor, or g. Because g is made up of the ability that is measured across all tasks on an intelligence test, the measure of g (in other words, an IQ score) has little to do with specific tasks. Anything unique to any specific task is pulled out of g during the course of factor analysis (B. Thompson, 2004). One way of explaining this distinction is as follows:
It is also important to understand what g is not. It is not a mixture or average of a number of diverse tests representing many different abilities. Rather, it is a distillate, representing the single factor that all different manifestations of cognition have in common ... It does not reflect the tests’ contents per se, or any particular kind of performance. (Arthur Jensen, quoted in D. H. Robinson & Wainer, 2006, p. 331)
It is because all these tasks have a common characteristic – g – that measuring a global mental ability like intelligence is even possible. Additionally, because factor analysis distills g and removes the unique portions from a score, the collection of tasks on a test really does not matter much, as long as there are several types of tasks on a test and they are all cognitive in nature. All cognitive tasks measure g to some degree.
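For readers who want to see this distillation in action, the short Python sketch below uses simulated data to show what it means for g to be the overlap among tasks rather than a sum of them. It is only an illustration under assumed numbers: the task weights are invented, and a first unrotated principal component stands in for the formal factor analysis described above.

```python
# Hypothetical illustration (simulated data, invented weights): how a general
# factor is "distilled" from the variance that cognitive tasks share, while each
# task's unique variance is largely stripped away. The first unrotated principal
# component is used here as a simple stand-in for formal factor analysis.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tasks = 1000, 8

# Latent general ability plus a task-specific ("unique") influence for each task.
g_true = rng.normal(size=n_people)
unique = rng.normal(size=(n_people, n_tasks))

# Each simulated task blends g with its own unique variance (weights are arbitrary).
weights = np.array([0.8, 0.75, 0.7, 0.7, 0.65, 0.6, 0.55, 0.5])
scores = g_true[:, None] * weights + unique * np.sqrt(1 - weights**2)

# Standardize the scores and extract the first principal component of the
# correlation matrix; the resulting composite reflects only shared variance.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
g_estimate = z @ eigvecs[:, -1]        # eigenvector with the largest eigenvalue

# The extracted factor recovers the latent general ability, while any one task's
# unique component contributes very little to it.
print("correlation with latent g:            ",
      round(abs(np.corrcoef(g_estimate, g_true)[0, 1]), 2))
print("correlation with task 1's unique part:",
      round(abs(np.corrcoef(g_estimate, unique[:, 0])[0, 1]), 2))
```

In this simulation, every task, regardless of how strongly it is weighted, carries some information about the general factor, and the composite tracks the latent ability rather than any single task.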
This last point was discovered by Charles Spearman (1927, pp. 197–198), and he named this principle the indifference of the indicator. For Spearman, the indicator was the surface content of a test. For example, in the Introduction, I discussed vocabulary, matrix, digit span, information, spatial reasoning, and coding items. Each of these types of items would be what Spearman called an indicator. When using the word “indifference,” Spearman wasn’t saying that psychologists didn’t care about test content. Instead, the phrase “indifference of the indicator” means that the surface content of the test does not matter; all cognitive items measure intelligence, and g is indifferent to the format of a test item. Spearman’s claim was radical in 1927, but it has since been strongly supported by research (Carroll, 1993; Cucina & Howardson, 2017; Gottfredson, 1997b).
However, this does not mean that every cognitive task on an intelligence test is an equally good measure of g (Jensen, 1980b). Some tasks are better than others at measuring intelligence. How well a task measures intelligence is called its g loading, a value ranging from 0 to 1 that is produced by factor analysis. Vocabulary and matrix reasoning items tend to have very high g loadings (up to .80 on many professionally developed tests), while short-term recall and reaction-time tasks tend to have low g loadings (Carroll, 1993). Generally, more complex tasks have higher g loadings, while simpler tasks have lower g loadings (Gottfredson, 1997b). Test creators don’t have to choose tasks with high g loadings when they create their tests, but a test that consists of tasks with high g loadings can be shorter and still produce a better estimate of a person’s intelligence than a test made up of tasks with low g loadings. Tasks that have g loadings of 0 (and therefore do not measure g) are tasks that are clearly not cognitive in nature – like running speed. Thus, because every cognitive task measures g – at least to some extent – it does not matter much what tasks are on an intelligence test, though there is a strong preference among psychologists for tasks with high g loadings.
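The hypothetical sketch below shows what a g loading is numerically: the correlation between a task and the extracted general factor. The task names and the assumed strength of each task’s g influence are invented for illustration, and a first unrotated principal component again stands in for formal factor analysis.

```python
# Hypothetical illustration of g loadings: complex tasks (e.g., vocabulary,
# matrices) are simulated with a stronger g influence than simple tasks
# (e.g., digit span, reaction time), so their loadings come out higher.
import numpy as np

rng = np.random.default_rng(1)
n_people = 1000
g_true = rng.normal(size=n_people)

# Assumed g influence per task (invented values for illustration only).
tasks = {"vocabulary": 0.80, "matrices": 0.75, "information": 0.70,
         "spatial": 0.65, "digit span": 0.45, "reaction time": 0.30}
weights = np.array(list(tasks.values()))
unique = rng.normal(size=(n_people, len(tasks)))
scores = g_true[:, None] * weights + unique * np.sqrt(1 - weights**2)

# Extract the general factor and compute each task's loading on it.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
g_estimate = z @ eigvecs[:, -1]

# A task's g loading is its correlation with the general factor (between 0 and 1 here).
for name, column in zip(tasks, z.T):
    loading = abs(np.corrcoef(column, g_estimate)[0, 1])
    print(f"{name:>13s}: g loading ≈ {loading:.2f}")
```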
Apart from test construction, there is another source of evidence showing that the claim that intelligence is merely the sum of an arbitrary set of test items is not correct. This evidence comes from studies that administer two or more intelligence tests to the same sample of people in order to determine how strongly the tests’ g factors are correlated. If there is a strong correlation between two tests’ g factors, it would indicate that the g factor in each test is the same – even if the tasks on the tests are different. A correlation near zero would indicate that (a) what each test labels as g is different, (b) the combination of tasks on each test produces two very different measures of intelligence that are not interchangeable, and (c) what each test measures is just a unique combination of the tasks that a test creator chose to put on the test.
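A simplified simulation of this study design looks like the sketch below: the same simulated examinees take two batteries that share no tasks, a g factor is extracted from each battery separately, and the two g estimates are correlated. The battery composition and weights are invented, and a first unrotated principal component stands in for the factor analysis used in the actual studies.

```python
# Hypothetical sketch of the two-battery design: if both batteries tap the same
# latent general ability, their separately extracted g factors should correlate
# strongly even though they share no tasks.
import numpy as np

rng = np.random.default_rng(2)
n_people = 500
g_true = rng.normal(size=n_people)      # shared latent ability of the examinees

def simulate_battery(weights):
    """Generate subtest scores that mix latent g with task-unique variance."""
    w = np.asarray(weights)
    noise = rng.normal(size=(n_people, len(w)))
    return g_true[:, None] * w + noise * np.sqrt(1 - w**2)

def extract_g(scores):
    """First principal component of the correlation matrix, as a simple g proxy."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    _, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    return z @ eigvecs[:, -1]

battery_a = simulate_battery([0.8, 0.75, 0.7, 0.7, 0.65, 0.6, 0.6, 0.55])   # e.g., paper-and-pencil tasks
battery_b = simulate_battery([0.75, 0.7, 0.65, 0.65, 0.6, 0.55, 0.5, 0.5])  # e.g., computerized tasks

r = abs(np.corrcoef(extract_g(battery_a), extract_g(battery_b))[0, 1])
print(f"correlation between the two batteries' g factors: r ≈ {r:.2f}")
```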
The authors of one of the earliest studies of this type (Stauffer, Ree, & Carretta, 1996) gave the same group of examinees 10 common pencil-and-paper intelligence subtests and a series of 25 computerized tasks called the Cognitive Abilities Measurement (CAM) battery. The CAM battery was intended to measure processing speed, working memory, declarative knowledge (i.e., information that the person can state that they know), and procedural knowledge (i.e., knowledge of how to complete tasks). The intelligence subtests and the CAM battery each produced a g factor, and the two factors correlated almost perfectly (r = .950 to .994).
In a more recent study (Keith, Kranzler, & Flanagan, 2001), a team of psychologists administered two intelligence tests, the Woodcock–Johnson III (WJ-III) and the Cognitive Assessment System (CAS), to a sample of 155 children. Keith et al. (2001) used factor analysis to identify each test’s g factor and found that the correlation between the two was r = .98 (p. 108). What makes this result more remarkable is that the CAS was created by psychologists who did not intend to create a test that measured g. As a result, most of the tasks on the CAS do not resemble tasks on the WJ-III at all. Nevertheless, the CAS still produced a g factor, and that factor is essentially identical to the g of the WJ-III.
Floyd, Reynolds, Farmer, and Kranzler (2013) conducted a more elaborate follow-up with six samples of children or adolescents, each of which took two intelligence tests out of a group of six tests: the Differential Ability Scales (DAS), DAS-II, Wechsler Intelligence Scale for Children (WISC-IV), WISC-III, WJ-III, and Kaufman Assessment Battery for Children II. The sample sizes ranged from 83 to 200, and the correlations between these tests’ g factors ranged from r = .89 to r = 1.00 and averaged r = .95. Again, this shows that the g factors produced by different tests are largely identical. Additionally, Floyd et al. (2013) found that corresponding Stratum II factors from the different tests were also largely the same (e.g., the processing speed factor on one test was highly correlated with another test’s processing speed factor). This means that Stratum II abilities in the Cattell–Horn–Carroll model can also have a high degree of similarity across tests.
A team headed by psychologist Wendy Johnson found similar results with even larger samples. In a group of 436 adults who took three test batteries (the Comprehensive Ability Battery, the Hawaii Battery supplemented with some additional tests, and the Wechsler Adult Intelligence Scale), the different g factors from these test batteries all correlated r = .99 or r = 1.00 (W. Johnson, Bouchard, Krueger, McGue, & Gottesman, 2004). The researchers summed up their findings by saying that across these three tests there was “Just one g” (p. 95). Johnson and her colleagues followed up this work with another study of 500 Dutch seamen. With four different tests (a test battery for the Royal Dutch Navy, a battery of 12 subtests from the Twente Institute of Business Psychology, the General Aptitude Test Battery, and the Groninger Intelligence Test), the correlations of their g factors were all between r = .95 and r = 1.00 (W. Johnson, te Nijenhuis, & Bouchard, 2008, p. 88).
The idea that intelligence is just a set of arbitrarily chosen tasks that are thrown together on an intelligence test is simply not true. Regardless of the content that psychologists choose to put on a test, any cognitive task measures intelligence to some extent. When the scores from these tasks are combined via factor analysis, the unique aspects of each task are stripped away, and only a score based on the common variance among the tasks – the g factor – remains. The g factors produced by different tests correlate so highly that they can be considered equivalent. As a result, the idea that intelligence is an arbitrary collection of test items is completely false. Instead, intelligence, as measured by the g factor, is a unitary ability, regardless of what tasks are used to measure it.
From Chapter 1 of "In the Know: Debunking 35 Myths About Human Intelligence" by Dr. Russell Warne (2020)
