

Are Intelligence Tests Biased Against Diverse Populations?

Russell T. Warne
Jun 10, 2025
It has already been established that standardized tests are biased and unfair to persons from [non dominant] cultural and socio-cultural groups since most psychometric tests reflect largely white, middle class values and they do not reflect the experiences of and the linguistic, cognitive and cultural styles and values of minority or foreign groups. (Zindi, 2013, p. 164)

The literature presents an abundance of data and criticism indicating that such [traditional intelligence] tests, standardized as they are on White middle-class norms, show bias in favor of Whites … (Harris & Ford, 1991, p. 6; see also Ford, 1995, p. 56)


Of the 35 misconceptions in this book, one of the most common is the belief that intelligence tests are biased against African Americans, Hispanics, and Native Americans. In one study of introductory psychology textbooks, this was the most common inaccuracy that authors perpetuated (Warne et al., 2018). Indeed, this belief often extends to academic tests and tests used for hiring and promotion (Reeve & Charles, 2008). Because these tests also measure g (see Chapter 7), it is unsurprising that people often believe that these tests, too, are biased against diverse groups.

Like many incorrect beliefs discussed in this book, the idea that intelligence tests are biased against diverse examinees is not completely unrealistic. Most of these groups score – on average – lower on intelligence tests than examinees of European descent (Gottfredson, 1997a; Neisser et al., 1996). Generally, within the United States, European Americans have an average IQ of approximately 100, followed by Hispanic Americans and Native Americans (average IQ ≈ 90), and African Americans scoring lowest (average IQ ≈ 85). Conversely, Asian Americans tend to score higher than all other large racial groups (average IQ ≈ 105). Given these differences, it is natural for some people to suspect that something is wrong with how the tests measure intelligence in examinees with Hispanic, Native American, or African ancestry.



Figure 10.1 Distribution of IQ scores for the major American racial groups. Left to right, these are African Americans, Hispanics and Native Americans, European Americans, and Asian Americans. Note that there is a lot of overlap among the distributions and that people from all groups can be found at all IQ score levels.


It is important to note, though, that these are merely averages, and these numbers do not apply to every member of these groups. As Figure 10.1 shows, there is tremendous overlap among these groups, and it is possible to find people from every group at every intelligence level (Frisby, 2013; Gottfredson, 1997a). In other words, there are some people with low IQ scores who belong to groups with a higher average, and there are some people with high IQ scores who belong to a group with a lower average score. These group averages, therefore, often do not apply to particular individuals.

No one disputes that average IQ scores differ across racial groups, and this rank order of groups’ averages is remarkably consistent across tests of g (Humphreys, 1988). The real dispute is over what causes these different mean scores. One proposed explanation for these average differences is that intelligence tests are not functioning correctly, and that the tests are biased against low-scoring examinees, thereby penalizing them and underestimating their true level of intelligence. (Chapters 28–30 will discuss other proposed causes of these mean group differences.) This belief that a fundamental problem with the test is the cause of these different average scores is at the core of the arguments that intelligence tests are biased against diverse examinees.



A Professional View of Test Bias


What Is Test Bias? In contrast to the widespread belief that intelligence tests are biased, the mainstream viewpoint among psychometricians and psychologists who use tests is that “the issue of test bias is scientifically dead” (Hunter & Schmidt, 2000, p. 151) and that professionally developed intelligence and academic tests are not biased against native English speakers who are born in the United States – regardless of their racial heritage (Reeve & Charles, 2008; Reynolds, 2000). The same holds in other multicultural societies: it is standard practice to design tests that can be administered without bias, as long as examinees are born in that country and speak the test language as natives.

The reason that professional test creators and the public have opposite beliefs may stem from the different ways in which the two groups use the word “bias” (Kuncel & Hezlett, 2010; Reynolds & Lowe, 2009; Warne, Yoon, & Price, 2014). In everyday usage, “bias” is a synonym for “unfairness,” and when one group scores higher than another, that can easily seem biased in the sense that it is unfair. But the statistical definition of bias is more complex. In the testing world, bias occurs when two people with equal levels of a trait consistently obtain different scores solely because they belong to different groups. For example, if men and women with the same intelligence level take an intelligence test, but women consistently receive scores 5 points higher solely because they are female, then test bias would be present. This is a much more nuanced definition of bias than the everyday definition.
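The statistical definition above can be made concrete with a small simulation. The sketch below is illustrative only – the trait level, error size, and 5-point shift are all hypothetical numbers, not estimates from any real test. It shows that on an unbiased test, two examinees with identical true trait levels earn the same score in the long run, whereas a biased test produces a systematic gap solely because of group membership.

```python
import random

random.seed(42)

def observed_score(true_iq, group, bias_shift=0.0):
    """Hypothetical scoring model: observed score = true trait level
    plus random measurement error, plus (on a biased test) a constant
    shift applied solely because of group membership."""
    error = random.gauss(0, 3)  # ordinary measurement noise
    shift = bias_shift if group == "B" else 0.0
    return true_iq + error + shift

# Two examinees from different groups with IDENTICAL true intelligence.
true_iq = 110

# Unbiased test: long-run average scores converge for both groups.
a_scores = [observed_score(true_iq, "A") for _ in range(100_000)]
b_scores = [observed_score(true_iq, "B") for _ in range(100_000)]
gap_unbiased = sum(b_scores) / len(b_scores) - sum(a_scores) / len(a_scores)

# Biased test: group B systematically receives 5 extra points.
b_biased = [observed_score(true_iq, "B", bias_shift=5.0) for _ in range(100_000)]
gap_biased = sum(b_biased) / len(b_biased) - sum(a_scores) / len(a_scores)

print(f"gap on unbiased test: {gap_unbiased:+.2f}")  # approximately 0
print(f"gap on biased test:   {gap_biased:+.2f}")    # approximately +5
```

The key point the simulation illustrates: bias is defined by what happens to examinees of *equal* trait level, not by whether group averages differ.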

What Test Bias Is Not. The everyday definition of bias – though intuitive – is not adequate for scientific purposes for two reasons. The first is that what is fair or unfair is a value judgment. People with different ethical or moral values may have different opinions about what is fair (or unfair), and there is no scientific way to decide which values are best.

A second problem is that the existence of differences in average scores is not enough to prove that bias is present on a test because the score differences might reflect real differences in what the test measures (Clarizio, 1979; Frisby, 2013; Jensen, 1980a). In an example I have used before, imagine that a psychologist gives a test of job satisfaction to two groups: medical interns and tenured college professors. When the tests are scored, the medical interns receive job satisfaction scores that are – on average – lower than the scores of tenured college professors.

A typical intern’s schedule includes 80 hours of work per week, nights on call, and very stressful working conditions, while tenured university faculty have a great deal of work flexibility, high job security, and tend to enjoy their jobs. Under these circumstances, the professors should outscore the medical interns on a test of job satisfaction. Any other results would lead a reasonable observer to strongly question the validity of the test results. The lesson from this thought experiment is that mean score gaps are not evidence of test bias because there may be other explanations of score gaps. In fact, score gaps may indicate that the test is operating exactly as it should and measures real differences among groups ... (Warne, Yoon, & Price, 2014, p. 572, emphasis in original)

This is not to say that differences in average scores are irrelevant. Often these differences are an indication that the possibility of bias should be investigated. But the average differences are not – by themselves – sufficient evidence of test bias (Dyck, 1996; Frisby & Henry, 2016; Jensen, 1980a; Linn & Drasgow, 1987; Sackett et al., 2008; Scarr, 1994).

Professional Reactions to Test Bias. The statistical procedures used to identify test bias are too complex to explain in detail. In short, all of these methods attempt to match examinees from different groups on their actual ability level and then ascertain whether test scores or test items are functioning the same way for both groups. The ethical standards of the testing profession require professional test creators to screen tests for the presence of bias (AERA et al., 2014). To comply with this mandate, test creators routinely perform examinations for bias, and biased tests are revised to remove the bias long before test creators release them to the public. As a result, professionally developed tests are unbiased against all major racial and ethnic groups that make up the examinee population in the countries that the tests are designed for.

Individual items – not just entire tests – can also show bias. When professional test creators find individual items that are biased, they can take one of two courses of action. One option is to eliminate the item from the test. (This is often viable because test creators usually write many more items than they intend to put on the final version of a test.) Another response is to balance biased items so that there is the same number of items favoring one group as there are items disadvantaging that group. Thus, the individual bias in different items cancels out (Warne, Yoon, & Price, 2014).

Because procedures to identify and eliminate test bias are so routine – and mandated as part of the profession’s ethical code – it is nearly impossible to sell a test that hasn’t been subjected to careful scrutiny for bias. If anyone tries, there are two likely consequences. First, the test would not be commercially successful, because consumers would be so concerned about the potential for bias that they would not purchase and use the test. Any company selling a biased test would have to contend with competitors touting their unbiased test as being superior to the biased test. Second, any customers who use the test for decision making – especially in education or employment – would be vulnerable to a lawsuit because using a biased test to make decisions about people in the group whose scores are underestimated would be discriminatory. Thus, there are very strong ethical, legal, and economic incentives for test developers to create and sell unbiased tests. As a result, it is incorrect to make blanket claims that intelligence tests are biased.

Caveat. It is important to note that this discussion about test bias – and its absence from professionally designed tests – only applies to groups that speak the test language as a native and who were born in the country the test was designed for. In the United States, this means that tests of g are unbiased for native English speakers born in that country. Everyone in the debate about test bias agrees that it is inappropriate to administer a test to a person who does not speak the language of the test and then interpret the low score as evidence of low intelligence. Indeed, this is a gross violation of the ethical standards of the field (AERA et al., 2014), and professional test creators are very specific about the language proficiency that is needed to take a test. Research shows that it takes about three years of residency in the United States for foreign-born children to be able to take a test in English without having their educational achievement test scores penalized for poor language proficiency. Native-born bilingual children whose parents speak a non-English language are not disadvantaged by taking an intelligence test in English (Akresh & Akresh, 2011).

Another common point of agreement is that the test content must be culturally appropriate to the examinee for a test to produce an interpretable score. When tests are used in a new culture, often they must be adapted to ensure that culturally loaded test content is understandable to examinees in the new culture (AERA et al., 2014). Professional test creators have known this for over 100 years. For example, when Binet’s test was translated into English for use in the United States, it was obvious to American psychologists that an arithmetic task that required knowledge of French money needed to be changed because there were no 1⁄2-cent or 2-cent coins in the United States, and using French money would be baffling to American children (Terman, 1916). When comparing scores of examinees from different backgrounds, it is generally recognized that the test content must be culturally appropriate for all examinees. It is standard practice when translating or adapting a test to another culture, language, or country to ensure that all test items are culturally accessible to examinees.



Exploring Critics’ Claims of Test Bias


Criticism of Test Content. Some critics of intelligence tests (and similar tests) have arguments about test bias that are more sophisticated than merely claiming that different average scores are proof of test bias. One common claim – exemplified by the quotes at the beginning of this chapter – is that test content is decided by people who are generally middle-class individuals of European descent in a Western culture, which means that intelligence tests merely measure one’s conformity or exposure to this culture. As a result, racially diverse individuals or people from other cultures are disadvantaged by the tests because their culturally specific ways of thinking are not rewarded on intelligence tests (e.g., Kwate, 2001; Moore, Ford, & Milner, 2005; Ogbu, 2002).

This argument has been thoroughly disproved. One problem with the claim that intelligence tests merely measure knowledge or conformity to Western middle-class culture is the fact that the racial group with the highest average on these tests is not Europeans, but East Asians (Gottfredson, 1997a); this has been true since the 1920s (e.g., Goodenough, 1926). Another piece of contradictory evidence comes from testing indigenous people in other nations. If intelligence tests measure acculturation to Western culture, then indigenous communities who have had more contact with Europeans should score higher than people in communities in the same nation who have had less contact. However, this is not the case (Porteus, 1965). Finally, the content of many test formats (e.g., matrix tests, digit span) contains very little information that is unique to Western culture. It is not clear how numbers or geometric patterns are special to Western middle-class individuals.

Many claims of biased test content are based on examinations of item content, with critics of intelligence testing claiming that a particular item is so culturally loaded that it cannot measure intelligence in diverse populations. Critics sometimes cherry-pick a few items to argue that intelligence tests are biased, but rarely mention that items that appear biased are a small fraction of intelligence test items. Testing opponents seldom have much to say about non-verbal items, for example (Elliott, 1987; Jensen, 1980a). A classic example of this cherry-picking tendency is the “fight item,” which appeared on a now-obsolete version of the WISC: “What is the thing to do if a fellow much smaller than yourself starts a fight with you?” Based on little more than reading the item, people have attacked the WISC as being entirely biased because, in their view, it might be appropriate in some cultures – such as African American culture – for a child to fight back if someone acts aggressively towards them (Reschly, 1980). However, this item functions nearly identically for African American and European American children (Miele, 1979), which indicates that there is no unique influence (e.g., a cultural bias) that makes the item easier or harder for one group or another.

This example shows that it is not possible to judge by merely reading an item whether cultural differences influence responses. An illustration of this fact occurred in a court case (PASE v. Hannon, 1980) in which the Chicago school system was sued for using intelligence tests to identify minority children for special education. Of the hundreds of items on three intelligence tests (the 1960 version of the Stanford–Binet, the WISC, and the first revised version of the WISC), the judge identified just nine (including the “fight item”) that he believed were culturally biased against minority students. Seven of these items were on two subtests on the WISC; when a team of psychologists (Koh, Abbatiello, & McLoughlin, 1984) examined the WISC items for bias, they found that none of them was biased (in the statistical sense of the term) against African Americans. Moreover, for three of the seven items, the response that the judge believed that African American children would be culturally disposed to give was actually given more frequently by European American children (Koh et al., 1984). Therefore, merely reading test items provides no clues about whether a test question really is biased against a cultural group (Jensen, 1980a).

In a more constructive vein, some people have made suggestions to try to change test content to reduce or eliminate average score gaps between racial groups while maintaining the ability of a test to measure intelligence. Unfortunately, these efforts have been unsuccessful. One suggested technique is to eliminate the test questions that show the largest differences in passing rates for different racial groups. The problem with this method is that it eliminates the items that tend to be the best at measuring intelligence while retaining test questions that are poorer measures of intelligence (Linn & Drasgow, 1987; Phillips, 2000). The result is a test that correlates poorly with important criteria, such as success in school.

Another proposal has been to design tests that have content that is culturally relevant to non-European American examinees. The most famous example is the Black Intelligence Test of Cultural Homogeneity (BITCH). Originally announced in 1972 by psychologist Robert L. Williams, the BITCH is a culturally specific test designed to measure knowledge of concepts and language that are unique to African Americans. The test items were all multiple-choice questions in which the examinee had to select the correct definition of a word or phrase taken from African American dialects or culture at the time (R. L. Williams, 1972). For example, Question 22 requires examinees to select whether “Deuce-and-a-Quarter” refers to (a) money, (b) a car, (c) a house, or (d) dice (Long & Anthony, 1974, p. 311).

Just as Williams expected, his test was difficult for European American examinees, whereas African Americans excelled (Matarazzo & Wiens, 1977; R. L. Williams, 1972, 1975). He interpreted this as evidence that a test designed for one culture could not be used on a population from a different culture (R. L. Williams, 1972). But evidence that the BITCH measures intelligence is non-existent. BITCH scores for African American examinees correlate weakly (r = .04 to .39) with traditional intelligence and academic tests (Long & Anthony, 1974; Matarazzo & Wiens, 1977; R. L. Williams, 1972), though this is exactly what would occur if traditional tests are grossly inappropriate for African Americans (R. L. Williams, 1975). However, there is no evidence that BITCH performance correlates with successful functioning in an African American culture or context, which is necessary for a culturally specific test to be a better measure of African Americans’ intelligence than traditional intelligence tests (R. L. Williams, 1972, 1975). The same is true for similar tests (Jensen, 1980a). The BITCH most likely measures knowledge of 1970s African American slang and idioms, but there is no evidence that it measures anything else.

A modern – and more promising – approach is to ensure that the format of a test is culturally appropriate for examinees. An example of a culturally sensitive test that does this is the Panga Munthu test, developed in Zambia as a way of measuring African children’s intelligence (Kathuria & Serpell, 1998). Whereas R. L. Williams (1970) and others (e.g., Ford, 1995; Harris & Ford, 1991) argue that thinking and learning styles in disparate cultures are so different that each group must have its own tests that are developed, scored, and interpreted in culturally specific ways, the Panga Munthu’s creators believe that intelligence is universal, but that traditional tests need to be modified if examinees are not familiar with the demands of the test. Instead of responding to questions verbally or using a pencil and paper (often unavailable in rural Zambia), the Panga Munthu requires children to sculpt a human figure in clay or wire – a common activity for children in Zambia. The examiner then scores the figure, with more complex figures indicating higher intelligence. Thus, the creators of the Panga Munthu believe that intelligence is part of Zambian psychology (as it is for Westerners), but that the tasks on a test must be understandable and appropriate for examinees’ culture for an intelligence test to produce meaningful results.

Unlike the BITCH, research supports the claim that the Panga Munthu measures intelligence in its African examinee population. For example, scores correlate positively (r = .19 to .44) with the highest grade that examinees complete and their literacy scores in English and their native language (r = .29 to .43; Serpell & Jere-Folotiya, 2008). The Panga Munthu is not the only test that is adapted to the practices of a specific culture. I believe that cross-cultural testing would benefit from more customization of tests to examinees’ cultures (Warne & Burningham, 2019), especially in light of evidence that g likely exists in all human groups (see Chapter 4).

Tests as Tools of Oppression. A more serious claim is that intelligence tests are designed with the explicit goal of oppressing non-European populations (e.g., Carter & Goodwin, 1994; Helms, 1992; Mercer, 1979; Moss, 2008). This is basically a conspiracy theory that would require decades of complicity from thousands of individuals who work in the testing industry and even more people in education, employment, and law who decide when and how tests are used. In reality, “No reputable standardized ability test was ever devised expressly for the purpose of discriminating [against] racial, ethnic, or social-class groups” (Jensen, 1980a, p. 42). And some tests of g were developed explicitly to tear down social barriers to education or jobs (see Chapter 21).

One common example of how intelligence tests were supposedly designed to discriminate is in the immigration process in the United States in the early twentieth century (e.g., Gould, 1981, 1996). It is true that in the 1910s, American immigration inspectors started using intelligence tests to help in identifying “feeble-minded” individuals (who could not legally immigrate to the United States). But these tests were not designed to discriminate against any nationality of immigrants. In reality, the government physicians at Ellis Island developed some of the earliest non-verbal intelligence tests to create a fair method of measuring intelligence that did not disadvantage people who were unfamiliar with the English language or American culture (J. T. E. Richardson, 2011; Mullan, 1917).

It would not have been feasible to give every immigrant at the time an intelligence test. Instead, the American government instituted a screening process to identify immigrants who were ineligible for entry into the country. The process for identifying people at Ellis Island (the most common point of entry for potential immigrants) with low intelligence is shown in Figures 10.2 through 10.5. First, immigrants were medically and psychologically screened while in processing lines. As part of the examination, two physicians individually asked each immigrant in their native language basic questions, such as their name, their nationality, their occupation, or simple addition problems (Mullan, 1917). About 9% of immigrants failed this screening procedure, and these individuals received another brief examination in a separate room to screen for low intelligence and psychological conditions like delusions, hallucinations, dementia, or (in modern terminology) bipolar disorder, and schizophrenia.

Those who failed (about 11–22% of those who failed the original screening procedure and about 1–2% of all prospective immigrants) were tested again after 1–7 days of rest. On that later date, a physician screened the immigrants again with basic questions, and those who passed were released. Those who did not pass received an individual 20–60-minute examination from a different physician later that same day. Non-passers received a third examination on a later date; failing this resulted in some immigrants being labeled “feeble-minded” and barred from entering the United States. Others received a fourth or fifth examination at another time, and those who passed this latest examination were allowed to enter the country.

Figure 10.2 Initial medical screening of potential immigrants at Ellis Island in the early twentieth century. As part of this screening, physicians (shown in this photograph standing with their backs turned towards the camera) would ask immigrants in their native language basic questions. Immigrants who struggled with these questions or who acted erratically or otherwise abnormally were led to a large room for a brief mental examination, shown in Figure 10.3.

Source: National Institutes of Health, https://bit.ly/2W1IMXv


Immigrants could not be diagnosed as “feeble-minded” unless they failed the original screening procedure, a brief examination the same day, and at least three later individual examinations (Mullan, 1917). But to be labeled as “normal” and allowed to enter the United States, an examinee only had to pass once. The onus was on the inspecting physicians to show that the immigrant indeed had low intelligence. According to the inspection manual, “The immigrant should be given the benefit of any doubt which may arise as to his mental status and therefore regarded as normal until it has been clearly shown that he is not” (United States Public Health Service, 1918, p. 35).

Figure 10.3 Potential immigrants at Ellis Island awaiting a brief psychological exam – including an intelligence test – after failing an initial screening procedure. The seated uniformed men wearing hats are government employees, possibly physicians and/or interpreters.
Source: Mullan, 1917, Figure 3.


The individual examinations were conducted in the immigrant’s native language, and were a mix of verbal questions and tasks that had few or no language demands. Some of the questions were derived from Binet’s tests, and others were designed for the immigrants specifically (United States Public Health Service, 1918). The tasks included watching an examiner tap a set of blocks in a specific order and then repeating the sequence (J. T. E. Richardson, 2011) and putting together simple wooden puzzles, such as the one pictured in Figure 10.6.

Figure 10.4 Two government employees (the seated men in the foreground), at least one of whom is a physician, test a potential immigrant at Ellis Island after she had failed the original screening procedure and the brief follow-up examination on a previous day. The seated individuals in the rear of the photograph are other potential immigrants awaiting their examinations. They had also failed the screening procedure and brief follow-up examination.

Source: Mullan, 1917, Figure 4.


Based on official government statistics, only a tiny proportion of prospective immigrants were turned away due to low intelligence. Between 1892 and 1931 – when 21,862,790 immigrants arrived in the United States – a total of 4,303 prospective immigrants were turned away for being (in the language of the time) “idiots,” “imbeciles,” or “feeble-minded.” Therefore, a total of 0.02% of immigrants were rejected for low intelligence. The annual percentage of immigrants who were rejected for low intelligence reached its peak in 1915, when 0.103% of immigrants were turned away for this reason (data from Unrau, 1984, Vol. 1, pp. 185, 200–202). If intelligence tests really were designed to discriminate against some groups of immigrants, they were remarkably ineffective. Far more potential immigrants were turned away for carrying contagious diseases, having a physical disability, or being stowaways (Unrau, 1984, Vol. 1, pp. 200–202).
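The rejection-rate arithmetic above is easy to verify directly from the two totals cited from Unrau (1984):

```python
# Reproducing the rejection-rate arithmetic from the official statistics
# cited above (Unrau, 1984, Vol. 1).
total_immigrants = 21_862_790   # arrivals, 1892-1931
rejected_low_iq = 4_303         # excluded as "idiots," "imbeciles," or "feeble-minded"

rate = rejected_low_iq / total_immigrants
print(f"{rate:.4%}")  # 0.0197% -- roughly 0.02%, or about 1 in 5,000 arrivals
```

That is roughly one rejection for low intelligence per five thousand arrivals over four decades, which underscores how small this category of exclusions was.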

Figure 10.5 Two government employees (the seated men), at least one of whom is a physician, test a potential immigrant at Ellis Island who seems to be the same examinee as shown in Figure 10.4. This woman has already failed the screening procedure and follow-up examination on the day she arrived at Ellis Island. She failed two individual examinations on a later day and in this picture is taking her third or fourth individual examination (on yet another day). According to the photograph’s original caption, she failed this examination too and was designated “feeble-minded.”

Source: Mullan, 1917, Figure 5.



Unbiased ≠ Fair


While the common assertion that intelligence tests are biased is not supported by data, that does not mean that society has a blank check to use intelligence tests. This is because using the test may still be unfair – even if the test is unbiased in the technical sense of the word. Whereas bias is a scientific issue, fairness is an ethical or moral issue, and the two ideas are not interchangeable (Jensen, 1980a). People will inevitably have different moral or ethical values; when these values clash, there may be disagreements about whether and how to use tests. Some people may have good reasons to not use intelligence tests – even if they are unbiased (e.g., to foster a more diverse workforce). Unlike bias, fairness cannot be settled scientifically because science is morally neutral; its tools can be used for a variety of beneficial or harmful purposes. Ethical and moral arguments are best resolved by public decision making through the mechanisms of a free society – such as open debate, legislatures, and the court system. Chapters 33 and 34 will discuss the issue of fairness in more detail.

Figure 10.6 Wooden puzzle that served as a non-verbal intelligence test for immigrants at Ellis Island whom physicians suspected were “feeble-minded.” The physician would assemble the puzzle two or three times as the immigrant watched and then ask the immigrant to assemble it.

Source: National Park Service (https://bit.ly/2WyEmrw).



Conclusion


Among people without training in psychological testing, there is a widespread belief that intelligence tests (and many academic or employment tests) are biased against racially diverse examinees – especially people of African, Hispanic, and Native American descent. Sometimes these arguments are based on the mere fact that the average score on these tests varies across racial groups; sometimes the arguments are more sophisticated and are based on test content or the appropriateness of testing diverse examinees. But the evidence is overwhelming that professionally designed tests are not statistically biased against native speakers of the test language who are born in the country that the test is designed for. Professional developers go to great lengths to ensure that bias is minimized and that the content of professionally designed tests is appropriate for diverse individuals. Nevertheless, it may not be fair to use an unbiased test for some examinees, and values and ethics are important in determining fairness of test use.


From Chapter 10 of "In the Know: Debunking 35 Myths About Human Intelligence" by Dr. Russell Warne (2020)




We hope you found this information useful. For further questions, please join our Discord server to ask a Riot IQ team member or email us at support@riotiq.com. If you are interested in IQ and intelligence, we co-moderate a related subreddit and have started a YouTube channel. Please feel free to join us.


Author: Dr. Russell T. Warne
LinkedIn: linkedin.com/in/russell-warne
Email: research@riotiq.com