What Are IQ Tests and How Do They Work?


Dr. Russell T. Warne, Chief Scientist

People encounter the phrase "IQ test" constantly — in news coverage, in school evaluations, in job application processes, in popular culture. But very few have a clear picture of what an IQ test actually is at a mechanical level: what kinds of tasks appear on them, how those tasks are scored, why test developers choose some tasks over others, and what turns a collection of questions into a single, meaningful number. This article answers those questions directly.

The Foundation: What IQ Tests Are Trying to Measure

Every IQ test is built on a theoretical model of intelligence. Without a model, there is no principled basis for deciding which tasks belong on the test, how they should be weighted, or what the resulting score actually represents.

The model that underlies virtually all modern professional IQ tests is the Cattell-Horn-Carroll (CHC) theory of intelligence. CHC theory organizes cognitive abilities into a three-level hierarchy. At the top sits g, or general intelligence: a broad factor that influences performance across every type of cognitive task. Research consistently finds that g accounts for roughly 40 to 50 percent of the variance in performance across a diverse battery of cognitive tests, which is why an overall IQ score carries real predictive power across diverse domains of life.

Directly below g are what CHC theorists call "broad abilities" — clusters of related cognitive skills that share common characteristics. These include fluid reasoning (the ability to solve novel problems without relying on prior knowledge), crystallized intelligence (accumulated verbal knowledge and learned skills), working memory (the capacity to hold and manipulate information in real time), processing speed (how quickly the mind can perform simple cognitive operations), and spatial ability (the capacity to mentally manipulate two- or three-dimensional objects and relationships). At the lowest level of the hierarchy are narrow, highly specific abilities — for example, the specific skill of mentally rotating objects, which sits within the broader category of spatial ability.

This three-level structure is the blueprint from which test developers work. A comprehensive IQ test battery samples from multiple broad abilities, not just one, because g — the target — is better estimated when measured through diverse cognitive channels.

The Indifference of the Indicator: Why Task Content Doesn't Dictate Validity

One of the most counterintuitive findings in intelligence research is that the specific tasks on an IQ test matter less than most people assume. Charles Spearman called this principle "the indifference of the indicator," and modern research has confirmed it. As long as a task requires genuine reasoning, judgment, or problem solving, it will measure g to some meaningful degree — regardless of whether it involves words, numbers, shapes, or symbols.

This is why IQ tests can look so different from one another on the surface and still produce correlated scores. A vocabulary task and a matrix reasoning task seem to have nothing in common. And yet both tap into the same positive manifold — the finding, introduced by Spearman in his landmark 1904 study, that performance across all cognitive tasks correlates positively. A person who performs well on one tends to perform well on the others, regardless of surface content.
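The positive manifold can be illustrated with a small correlation matrix. The sketch below uses made-up correlations among four hypothetical subtests (the values and subtest labels are illustrative assumptions, not data from any real test). It checks that every off-diagonal correlation is positive, then uses power iteration to estimate the share of total variance captured by the first principal component. That share is only a rough stand-in for g: a proper factor analysis would be used in practice, and a principal-component share typically runs somewhat higher than a common-factor estimate.

```python
# Hypothetical correlations among four subtests (illustrative values only,
# not taken from any real test).
CORRELATIONS = [
    [1.00, 0.55, 0.45, 0.35],  # vocabulary
    [0.55, 1.00, 0.50, 0.40],  # matrix reasoning
    [0.45, 0.50, 1.00, 0.40],  # digit span
    [0.35, 0.40, 0.40, 1.00],  # symbol search
]


def has_positive_manifold(corr):
    """Return True if every off-diagonal correlation is above zero."""
    n = len(corr)
    return all(corr[i][j] > 0 for i in range(n) for j in range(n) if i != j)


def first_component_share(corr, iterations=200):
    """Share of total variance captured by the first principal component.

    Power iteration approximates the largest eigenvalue of the correlation
    matrix. Dividing by the number of subtests (the matrix trace) gives the
    proportion of variance.
    """
    n = len(corr)
    vector = [1.0 / n ** 0.5] * n  # unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vector[j] for j in range(n)) for i in range(n)]
        # Rayleigh quotient: current estimate of the largest eigenvalue.
        eigenvalue = sum(v * p for v, p in zip(vector, product))
        norm = sum(p * p for p in product) ** 0.5
        vector = [p / norm for p in product]
    return eigenvalue / n


if __name__ == "__main__":
    print("Positive manifold:", has_positive_manifold(CORRELATIONS))
    print("First-component share:", round(first_component_share(CORRELATIONS), 3))
```

With these assumed correlations, a single dimension absorbs well over half of the total variance even though the four tasks look nothing alike on the surface.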

The practical implication is significant: test developers have genuine flexibility in choosing task formats. A task that works well in an online environment but poorly in a face-to-face clinical setting can simply be replaced by a different task that is better suited to that format — without sacrificing the ability to measure g. This is one reason that intelligence testing has been able to adapt to new technologies and delivery formats over its 120-year history.

What Subtests Actually Look Like

A comprehensive IQ test is not a single type of question asked repeatedly. It is a battery of subtests, each targeting a different cognitive ability. The diversity is intentional: sampling broadly across the ability hierarchy produces a more accurate estimate of g than any single task could. Here are the most common subtest types found on professional IQ batteries.

Verbal reasoning and vocabulary. These subtests ask examinees to define words, identify relationships between word pairs, or explain the logic underlying a verbal statement. They are strong measures of crystallized intelligence — the accumulated body of learned knowledge a person has built up through exposure to language and education. Vocabulary subtests are among the most stable measures of intelligence across the adult lifespan, declining more slowly with age than most other cognitive tasks.

Matrix reasoning. The best-known format in this category was developed by John Raven in the 1930s. The examinee views a 3x3 grid of geometric patterns with one cell missing and must identify which of several options correctly completes the pattern. Matrix reasoning is considered one of the best single measures of fluid reasoning and g, because it requires discerning abstract rules without any reliance on learned knowledge.

Working memory tasks. These subtests ask examinees to hold and manipulate information simultaneously. A common format is digit span: the examiner reads a sequence of numbers aloud, and the examinee must repeat them in forward or reverse order. More complex working memory tasks might ask examinees to alphabetize a set of letters while keeping a number sequence in mind. Working memory capacity is strongly linked to fluid reasoning and predicts academic achievement independently of g.

Processing speed tasks. These measure how quickly the mind can execute simple, repetitive cognitive operations. A common format involves scanning a row of symbols and marking specific targets as quickly as possible within a time limit. Processing speed subtests are especially sensitive to neurological changes and are among the first abilities to decline with normal aging.

Spatial reasoning tasks. These require the mental manipulation of two- or three-dimensional objects — rotating shapes, identifying which puzzle pieces fit together, or determining what a folded-up paper would look like when unfolded. Males tend to show slightly higher average performance on spatial tasks than females, though the mean IQ difference across the full test is trivial.

Reaction time tasks. Some tests, including the RIOT, measure how quickly examinees respond to simple stimuli. Reaction time is a biological index of cognitive processing efficiency and correlates meaningfully with g. It is one of the few IQ-related variables that connects intelligence directly to a measurable neurological property.

How Individual Subtests Become a Single IQ Score

Translating performance across multiple different subtests into a single, interpretable number requires several layers of statistical work. Understanding that process clarifies what an IQ score actually represents and why it should be taken seriously.

The first step is converting raw subtest scores into standardized scores. A raw score — the number of items answered correctly, or the number completed within a time limit — is meaningless in isolation. To interpret it, the score must be compared to a reference group. Every professionally developed IQ test is administered to a large, demographically representative norm sample before public release. This sample's performance establishes what "average" means on each subtest, and each examinee's raw score is then expressed as a standard score showing how far above or below that average they fell.
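As a minimal sketch of that first step, the code below converts a raw score to a z-score and then to a scaled score. The norm-group mean and standard deviation are hypothetical numbers, and the scaled-score metric (mean 10, standard deviation 3) is one convention used by some batteries, assumed here for illustration. Real tests usually rely on age-specific norm tables rather than a single linear formula.

```python
def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in SD units."""
    return (raw - norm_mean) / norm_sd


def scaled_score(raw, norm_mean, norm_sd, scale_mean=10.0, scale_sd=3.0):
    """Re-express a raw score on a subtest scale (assumed mean 10, SD 3)."""
    return scale_mean + scale_sd * z_score(raw, norm_mean, norm_sd)


if __name__ == "__main__":
    # Hypothetical norms: the reference group averaged 25 items correct, SD 6.
    print(z_score(31, norm_mean=25, norm_sd=6))       # 1.0
    print(scaled_score(31, norm_mean=25, norm_sd=6))  # 13.0
```

The same raw score of 31 would translate to a different standard score in a norm group with a different mean or spread, which is why the raw number means nothing on its own.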

The standardized subtest scores are combined, typically through a weighted formula derived from factor analysis, to produce index scores for each broad ability domain (e.g., a Fluid Reasoning Index or a Working Memory Index). These index scores, in turn, are combined to produce the overall IQ — also called the Full Scale IQ — which reflects the examinee's general cognitive ability relative to their age group.
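One way to picture the combination step is as a weighted sum of standardized scores that is then re-standardized. The sketch below assumes the weights and the correlations among the parts are already known (both are hypothetical inputs here) and expresses the result on a scale with mean 100 and standard deviation 15. Published tests typically use norm tables built from the norm sample, so this is an illustration of the logic, not any particular test's scoring procedure.

```python
import math


def composite_standard_score(z_scores, weights, corr, mean=100.0, sd=15.0):
    """Combine standardized part scores into one composite standard score.

    The weighted sum of correlated z-scores does not have a variance of 1,
    so it is divided by its own standard deviation before rescaling.
    """
    n = len(z_scores)
    weighted_sum = sum(w * z for w, z in zip(weights, z_scores))
    variance = sum(
        weights[i] * weights[j] * corr[i][j] for i in range(n) for j in range(n)
    )
    return mean + sd * weighted_sum / math.sqrt(variance)


if __name__ == "__main__":
    # Two parts, each one SD above average, assumed to correlate at 0.5.
    corr = [[1.0, 0.5], [0.5, 1.0]]
    print(round(composite_standard_score([1.0, 1.0], [1.0, 1.0], corr), 1))
```

In this example the composite lands farther above 100 than either part alone would suggest, because scoring one standard deviation above average on both parts is rarer than doing so on just one.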

The final IQ scale is set so that the population mean equals 100 and one standard deviation equals 15 points. Assuming scores are normally distributed, roughly 68 percent of people score between 85 and 115, about 95 percent score between 70 and 130, and only about 2 percent score above 130, with a similar share scoring below 70.
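These percentages follow from the normal curve and can be checked with Python's standard library. The sketch assumes IQ scores are exactly normally distributed with mean 100 and standard deviation 15, which is an idealization of real norm data.

```python
from statistics import NormalDist

IQ_SCALE = NormalDist(mu=100, sigma=15)


def share_between(low, high):
    """Proportion of the population expected to score between two values."""
    return IQ_SCALE.cdf(high) - IQ_SCALE.cdf(low)


def share_above(score):
    """Proportion of the population expected to score above a value."""
    return 1.0 - IQ_SCALE.cdf(score)


def percentile(score):
    """Percentile rank of a score under the assumed normal distribution."""
    return 100.0 * IQ_SCALE.cdf(score)


if __name__ == "__main__":
    print(f"85 to 115: {share_between(85, 115):.1%}")
    print(f"70 to 130: {share_between(70, 130):.1%}")
    print(f"Above 130: {share_above(130):.1%}")
    print(f"Percentile of 115: {percentile(115):.0f}")
```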

One critical detail that professional tests make explicit is the standard error of measurement (SEM), a statistical estimate of how much the obtained score might differ from the examinee's true score due to random measurement error. A well-constructed IQ test with a reliability coefficient of .95 has a SEM of roughly 3 to 4 points on the 15-point standard deviation scale, which means that, at about 95 percent confidence, the examinee's true score falls within roughly 6 to 8 points on either side of the reported number. Reputable tests communicate this as a confidence interval around the score, not as a single point estimate.
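The SEM figure comes from the classical formula SEM = SD × sqrt(1 − reliability). The sketch below applies it and builds a simple confidence interval centered on the obtained score. That centering is a simplifying assumption: some test manuals instead center the interval on an estimated true score that is pulled slightly toward the mean.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score, sd=15.0, reliability=0.95, level=0.95):
    """Symmetric confidence interval around an obtained score."""
    sem = standard_error_of_measurement(sd, reliability)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95 percent
    return score - z * sem, score + z * sem


if __name__ == "__main__":
    print(round(standard_error_of_measurement(15.0, 0.95), 2))
    low, high = confidence_interval(110)
    print(f"Obtained score 110, 95% interval: {low:.1f} to {high:.1f}")
```

Lowering the reliability widens the interval quickly: at a reliability of .80, the same formula gives a SEM of almost 7 points.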

How Items Are Developed and Screened

A common misconception is that IQ test items are simply invented by researchers who think they sound like good measures of intelligence. The actual process is far more rigorous — and far more expensive — than that, which is part of why legitimate IQ tests cost money to develop.

The process begins with item writing: constructing a large pool of candidate questions, typically many times more than will appear on the final test. Item writers must understand the cognitive demands of the ability being measured, write items at the appropriate difficulty level, and avoid introducing irrelevant factors — cultural assumptions, specialized knowledge, linguistic complexity unrelated to the target ability — that would contaminate the measurement.

Candidate items then go through a pilot testing phase. A diverse sample of examinees takes the pilot items, and the resulting data is analyzed using item response theory (IRT) — a statistical framework that models the probability of a correct response as a function of both item difficulty and the examinee's underlying ability level. IRT allows test developers to identify items that are too easy, too hard, or that fail to discriminate reliably between examinees of different ability levels. Items that perform poorly by these criteria are revised or discarded.
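To make the IRT idea concrete, here is a minimal sketch of one widely used model, the two-parameter logistic (2PL). The article does not specify which IRT model any given test uses, so the choice of 2PL and the parameter values below are assumptions for illustration. Ability (theta) and difficulty sit on the same scale, and discrimination controls how sharply the probability of a correct answer rises as ability passes the item's difficulty.

```python
import math


def p_correct(theta, difficulty, discrimination=1.0):
    """2PL probability that an examinee of ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta, difficulty, discrimination=1.0):
    """How much measurement precision the item contributes at ability theta."""
    p = p_correct(theta, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


if __name__ == "__main__":
    # Hypothetical items: one sharp, one weak, both of average difficulty.
    for theta in (-1.0, 0.0, 1.0):
        sharp = p_correct(theta, difficulty=0.0, discrimination=2.0)
        weak = p_correct(theta, difficulty=0.0, discrimination=0.3)
        print(f"theta={theta:+.1f}  sharp={sharp:.2f}  weak={weak:.2f}")
```

An item like the weak one, whose success rate barely changes between low-ability and high-ability examinees, is the kind that would be revised or discarded during screening.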

A critical part of the screening process is differential item functioning (DIF) analysis — a statistical test for bias. An item shows DIF when examinees of similar overall ability but different demographic backgrounds respond to it at systematically different rates. Items showing meaningful DIF are flagged and typically removed before the test is released. This is standard practice on all professionally developed IQ tests and has been since the 1980s.
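One common way to test for DIF is the Mantel-Haenszel procedure, sketched below. The article does not say which DIF method a given test uses, so treat this as one possible approach and the counts as invented. Examinees are first grouped into strata of similar total score, and the item's odds of success are then compared between a reference group and a focal group within each stratum. An odds ratio near 1 (a delta near 0) suggests the item behaves the same way for both groups.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across ability strata.

    Each stratum is a tuple of counts for examinees with similar total scores:
    (reference_correct, reference_incorrect, focal_correct, focal_incorrect).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty stratum carries no information
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    return numerator / denominator


def mh_delta(strata):
    """Odds ratio on the delta metric; negative values favor the reference group."""
    return -2.35 * math.log(mantel_haenszel_odds_ratio(strata))


if __name__ == "__main__":
    # Invented counts: within each ability stratum, the focal group
    # succeeds less often on this item than the reference group.
    suspect_item = [(40, 10, 30, 20), (45, 5, 35, 15)]
    print(round(mantel_haenszel_odds_ratio(suspect_item), 2))
    print(round(mh_delta(suspect_item), 2))
```

Matching on total score is the key design choice: it separates a genuine difference in ability between groups from an item that is unfairly harder for one group at the same ability level.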

After the final item pool is assembled, the complete test is administered to the norm sample. For a U.S.-normed test, that sample is stratified by age, sex, race and ethnicity, geographic region, and educational attainment to reflect the actual population. The norm sample's data establishes the scoring scale and provides the baseline against which all future examinees are compared.

Administration: How a Test Session Actually Works

The experience of taking an IQ test varies somewhat depending on the format, but certain structural features are common across professionally developed tests.

Individually administered tests are conducted face-to-face — either in person or via video call — by a trained clinician. The session typically begins with rapport-building and an explanation of what the test involves. Subtests are presented in a standardized order, with standardized instructions read from a script. Standardization is not bureaucratic formality — it ensures that every examinee's score is comparable, because the conditions under which the test was taken match the conditions under which the norm sample took it.

Most individually administered tests use what are called "basal" and "ceiling" rules to limit unnecessary testing. A basal rule establishes that once an examinee answers a certain number of consecutive items correctly, the tester can assume they would have answered easier items correctly and skip past them. A ceiling rule stops a subtest when an examinee has failed a certain number of consecutive items — there is no value in continuing to administer questions the examinee cannot answer, and doing so would introduce unnecessary frustration.
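The logic of basal and ceiling rules can be sketched as a short simulation. The run lengths used here (two consecutive correct answers for the basal, three consecutive misses for the ceiling) are hypothetical, since each test sets its own values, and the backward search is simplified compared with the reverse rules in real test manuals.

```python
def administer_subtest(answers, start, basal_run=2, ceiling_run=3):
    """Simulate one subtest with simplified basal and ceiling rules.

    answers: booleans ordered from easiest to hardest item, giving the
             response the examinee would give if the item were presented.
    start:   index of the starting item for this examinee.
    Returns (raw_score, administered_item_indices).
    """
    administered = {}

    # Phase 1: present the first basal_run items from the starting point.
    forward = start
    first_block_end = min(start + basal_run, len(answers))
    while forward < first_block_end:
        administered[forward] = answers[forward]
        forward += 1
    basal_met = all(administered[i] for i in range(start, first_block_end))

    # Phase 2: if the basal was not met, work backward through easier items
    # until basal_run consecutive correct answers are recorded (or item 0).
    lowest = start
    if not basal_met:
        run = 0
        i = start - 1
        while i >= 0 and run < basal_run:
            administered[i] = answers[i]
            run = run + 1 if answers[i] else 0
            lowest = i
            i -= 1

    # Phase 3: continue forward until ceiling_run consecutive misses.
    while forward < len(answers):
        misses = 0
        i = forward - 1
        while i >= lowest and not administered[i]:
            misses += 1
            i -= 1
        if misses >= ceiling_run:
            break
        administered[forward] = answers[forward]
        forward += 1

    # Items below the lowest administered item are credited as correct.
    raw_score = lowest + sum(administered.values())
    return raw_score, sorted(administered)


if __name__ == "__main__":
    answers = [True] * 10 + [False] * 5
    score, seen = administer_subtest(answers, start=4)
    print(score, seen)  # 10 [4, 5, 6, 7, 8, 9, 10, 11, 12]
```

In this example the examinee sees 9 of the 15 items yet earns the same raw score that the full set would have produced, which is the efficiency the rules are designed to deliver.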

Group-administered tests follow a simpler protocol: all examinees receive the same instructions, work through the same items, and are held to the same time limits. These tests sacrifice the flexibility of individual administration in exchange for efficiency and scale.

Online IQ tests, when professionally developed, can preserve many of the features of group-administered tests — standardized instructions, time limits, and a fixed item sequence — while adding the convenience of self-pacing within those constraints. The Reasoning and Intelligence Online Test (RIOT) allows examinees to pause between subtests and resume at a different time, a design choice that accommodates real-world scheduling without compromising the standardized administration within each subtest. Online delivery is simply a method of administration; it does not, by itself, determine whether a test is trustworthy or valid.

What IQ Tests Do Not Measure

Understanding what a test measures is incomplete without understanding its boundaries. IQ tests are not designed to capture everything that matters about a person's cognitive life, and professional test creators are clear about this.

IQ tests are optimized for measuring general cognitive ability: the capacity to reason, learn, and solve novel problems. They do not measure motivation, effort, personality, or the accumulated expertise that comes from years of practice in a specific domain. A person with an IQ of 105 who works hard, plans carefully, and builds skills systematically over years can outperform a person with an IQ of 125 who does not. Intelligence is a resource, not a guarantee of outcome.

IQ tests also do not capture creativity in its fullest sense, or the ability to generate genuinely novel ideas in domains like art, music, or entrepreneurship. Fluid reasoning tasks on IQ tests measure the ability to find rules and apply them — which overlaps with some aspects of creative thinking but is not the same thing.

Finally, IQ tests are calibrated for the populations they were designed for. Using a test outside its intended population — administering a test normed on American adults to people in a country with a very different educational system, for example — introduces the risk of misinterpretation. This is not a flaw in the test; it is a limitation of any precision measurement instrument used beyond its specified application. Professional test documentation is explicit about the populations for whom valid score interpretations have been established.

The First Professional Online IQ Test

For most of its history, professional-grade IQ testing was limited to clinical and educational settings — not because online testing was impossible in principle, but because no one had taken the time to build an online test that met the same rigorous standards as traditional in-person assessments.

The Reasoning and Intelligence Online Test (RIOT) changes that. I developed it after 15 years of research in intelligence and psychological testing. It is built on the CHC model and follows the same development pipeline described in this article: expert content review, pilot testing, IRT item analysis, DIF bias screening, and norming on a representative U.S. sample. Its development met the Standards for Educational and Psychological Testing from the APA, AERA, and NCME.

The RIOT measures verbal reasoning, fluid reasoning, spatial ability, working memory, processing speed, and reaction time — the full range of CHC broad abilities covered here. Examinees receive a Full Scale IQ, index scores for each broad ability domain, and a detailed score report. The rigor of the development process is what makes the results meaningful.

Sources

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press. Referenced in McGrew, K. S. (2009). CHC theory and the human cognitive abilities project. Intelligence, 37(1), 1–10.
Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15(2), 201–293. https://doi.org/10.2307/1412107
Kvist, A. V., & Gustafsson, J. E. (2008). The relation between fluid intelligence and the general factor as a function of cultural background. Intelligence, 36(5), 463–470. https://doi.org/10.1016/S0160-2896(03)00062-X
Gottfredson, L. S. (1997). Mainstream science on intelligence. Intelligence, 24, 13–23. https://doi.org/10.1016/S0160-2896(97)90011-8
Warne, R. T. (2020). In the know: Debunking 35 myths about human intelligence. Cambridge University Press. https://doi.org/10.1017/9781108593298
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Kluwer Academic. Referenced in Borsboom, D. (2016). Psychometric perspectives on diagnostic systems. Journal of Clinical Psychology, 64(9), 1089–1108.
Raven, J. C., & Penrose, L. S. (1936). Mental tests used in genetic studies. British Journal of Medical Psychology, 16, 97–104. https://doi.org/10.1111/j.2044-8341.1936.tb00690.x
Warne, R. T. (2025). Technical manual for the Reasoning and Intelligence Online Test, version 1.0. RIOT IQ.
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. https://www.testingstandards.net/
Jensen, A. R. (1998). The g factor: The science of mental ability. Praeger. Referenced in: https://doi.org/10.1037/a0036503
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology. Psychological Bulletin, 124(2), 262–274. https://doi.org/10.1037/0033-2909.124.2.262
Deary, I. J., Strand, S., Smith, P., & Fernandes, C. (2007). Intelligence and educational achievement. Intelligence, 35(1), 13–21. https://doi.org/10.1016/j.intell.2006.02.001