When discussing the question of human races yesterday, I wondered if anyone had ever tried to diagnose an individual based on his/her complement of genes (“genotype”), and said that I was unaware of any such attempt. Clearly I haven’t been keeping up with the human-genetics literature, because several people called my attention to a paper in Nature (2008) by John Novembre et al. (free at the link), which does just that. It doesn’t really bear on the question of “races”—except showing that discrete racial groups don’t exist in Europe—but it does show that you can do a pretty good job telling where people came from by looking at their DNA.
I’ll be brief here: Novembre et al. did a “high-throughput” DNA analysis of variable bits of DNA in the genomes of 1387 people from every country in Europe, ranging from Russia to Portugal. For each of these individuals they looked at an astounding 197,146 bits of genes: variable nucleotide sites known as “SNPs” (single nucleotide polymorphisms). Yes, this degree of analysis is possible on a single chip; in fact, one can examine variation at 500,568 sites! From the genetic differences among people they used a statistical algorithm to group together individuals with similar genotypes. They also knew the “ancestry” of each individual, defined as the country of origin of each individual’s grandparents, as well as the place where the sampled individual lived.
The first observation is that statistical analysis clearly showed individuals falling into clusters corresponding to their geographic location. This figure, a plot from “principal components analysis”, is a way to get the most information out of individuals’ genotypes using two axes of differences. The plot shows where each individual fell on the combination of axes (the big dots are the median values for individuals from each country):
The resulting figure bears a notable resemblance to a geographic map of Europe (Fig. 1a). Individuals from the same geographic region cluster together and major populations are distinguishable. Geographically adjacent populations typically abut each other, and recognizable geographical features of Europe such as the Iberian peninsula, the Italian peninsula, southeastern Europe, Cyprus and Turkey are apparent.
In other words, geographically closer populations are more genetically similar, as expected if individuals tend to mate with other individuals from the same country or from close by. This is an “isolation by distance” model: genetic similarity falls off gradually with distance. As the authors note, this does not support the existence of “discrete, well-differentiated populations,” i.e., there are no races. None are expected in such a small area, particularly because biological “races” are those populations that (at least at one time) were geographically isolated and genetically differentiated. That geographical isolation never happened in Europe.
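To make the idea concrete, here is a toy sketch in Python. It is not the authors' analysis and uses no real data: it assumes a made-up model in which each SNP's allele frequency changes smoothly across a square "map", simulates genotypes (0, 1 or 2 copies of an allele per site), and then runs a plain principal components analysis. The function names, the `slope` parameter and the sample sizes are all invented for illustration.

```python
import numpy as np

def simulate_genotypes(coords, n_snps, rng, slope=0.1):
    """Toy model (an assumption, not the paper's): every SNP's allele
    frequency changes linearly across the map in a random direction."""
    directions = rng.normal(size=(2, n_snps))      # cline direction per SNP
    base = rng.uniform(0.3, 0.7, size=n_snps)      # frequency at map center
    freqs = np.clip(base + slope * (coords @ directions), 0.05, 0.95)
    # Each person carries 0, 1 or 2 copies of the allele at each site.
    return rng.binomial(2, freqs)

def top_two_pcs(genotypes):
    """Scores of each individual on the first two principal components."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

def pairwise_distances(points):
    """Euclidean distances between all distinct pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)
    return dist[upper]

rng = np.random.default_rng(0)
coords = rng.uniform(-1, 1, size=(200, 2))         # 200 people on a square map
genotypes = simulate_genotypes(coords, n_snps=5000, rng=rng)
pcs = top_two_pcs(genotypes)

# Isolation by distance: pairs who live far apart should also be far
# apart in the two-dimensional genetic (PC) space.
r = np.corrcoef(pairwise_distances(coords), pairwise_distances(pcs))[0, 1]
print(f"correlation of geographic and genetic distance: {r:.2f}")
```

The comparison uses distances between pairs of people rather than the raw axes because principal components come out in an arbitrary rotation; a real analysis would have to align the genetic plot with the map, as the authors did for their figure.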
The authors also note that “the data reveal structure even among French-, German- and Italian-speaking groups within Switzerland.” Here’s what that small land looks like genetically:
How about using an individual’s DNA to predict his/her ancestry? The analysis here involved “training” a computer algorithm on the centers of each country of ancestral origin and then, using that and a multiple-regression approach (presumably based on the decay of genetic similarity with distance observed by the authors), “predicting” the ancestral origin of each individual’s genes (i.e. the location of his/her grandparents). The prediction did pretty well: here’s a figure showing the “predicted” location of each individual, labeled by actual country of origin. Note the close correspondence between prediction based on genes and actual country of origin based on self-report, and how individuals of the same color group together (meaning that genetically similar individuals tend to have geographically similar origins):
And here’s a bar graph showing how accurately one can predict the geographic origin of each individual from his/her genes. The “accuracy” shows the discrepancy between the individual’s actual origin and the place of origin predicted by his/her DNA. It’s cumulative (accuracies must sum to 1 over the total distance, 2500 km), but the darkest bar at the bottom, for instance, shows the proportion of individuals in each country whose place of origin was assigned, by genetics, to within 400 km of their actual place of origin.
As the fine-scale spatial structure evident in Fig. 1 [first figure shown above] suggests, European DNA samples can be very informative about the geographical origins of their donors. Using a multiple-regression-based assignment approach, one can place 50% of individuals within 310 km of their reported origin and 90% within 700 km of their origin (Fig. 2 and Supplementary Table 4, results based on populations with n > 6). Across all populations, 50% of individuals are placed within 540 km of their reported origin, and 90% of individuals within 840 km.
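For readers who want to see how numbers like “50% within 310 km” come about, here is a rough Python sketch of the general recipe: fit a regression from genetic coordinates to map coordinates on some individuals, predict the location of others, and summarize the errors in kilometers. I don’t know the exact regression the authors used, so the simple linear fit below is my assumption, and the “principal component” values are fabricated (rotated, rescaled latitudes and longitudes plus noise), so the printed distances say nothing about real Europeans.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def fit_linear_map(pcs, latlon):
    """Least-squares regression of (lat, lon) on PC scores plus intercept."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, latlon, rcond=None)
    return coef

def predict_latlon(coef, pcs):
    design = np.column_stack([np.ones(len(pcs)), pcs])
    return design @ coef

rng = np.random.default_rng(1)
n = 1000
# Made-up "true" origins scattered over a Europe-sized box (degrees).
latlon = np.column_stack([rng.uniform(37, 60, n), rng.uniform(-9, 30, n)])

# Fabricated PC scores: a rotated, rescaled copy of the map plus noise.
theta = np.radians(16)  # arbitrary rotation, chosen only for illustration
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
pcs = (latlon - latlon.mean(axis=0)) @ rotation * 0.01
pcs = pcs + rng.normal(scale=0.03, size=(n, 2))

# Train on the first 700 individuals, predict the remaining 300.
train = np.arange(n) < 700
test = ~train
coef = fit_linear_map(pcs[train], latlon[train])
pred = predict_latlon(coef, pcs[test])

errors_km = haversine_km(latlon[test, 0], latlon[test, 1],
                         pred[:, 0], pred[:, 1])
median_km, p90_km = np.percentile(errors_km, [50, 90])
print(f"50% placed within {median_km:.0f} km, 90% within {p90_km:.0f} km")
```

The two percentiles are the same kind of summary as the figures quoted above: the distance within which half, and then nine-tenths, of the held-out individuals were placed.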
Obviously, Europe has not been so intermixed genetically that you can’t diagnose where an individual’s ancestors came from. This also means that if an individual whose ancestors came from Europe, but who was unsure of where, was subjected to this kind of genetic analysis, you could tell that individual with high probability where his/her ancestors resided. Unfortunately, this is an expensive procedure, far more accurate than the DNA tests you can buy for about $125, but eventually you’ll be able to do this for a reasonable amount of money.
Given that the genetic differences between worldwide populations are substantially larger than differences among European countries, this method could obviously be used to diagnose an individual’s recent ancestry from any place in the world, assuming of course that one had huge samples of human genotypes, analyzed for many SNPs, from many places on Earth. That hasn’t been done yet, but I’m sure it’s in the works. When that happens, you’ll be able to plunk down a hundred bucks and find out with pretty good accuracy where your ancestors resided.
As I said, this doesn’t show that there are discrete “races” in Europe, and I don’t think there are obviously discrete “races” anywhere these days, though there is large-scale genetic differentiation among worldwide populations, suggesting that such races once existed as relatively discrete and geographically isolated populations. The discreteness that once existed, or so I think, is now blurring out as transportation and migration are beginning to mix the discrete groups into not a melting pot, but sort of a lumpy pudding of humanity.
What is clear is that, with considerable accuracy, you can diagnose an individual’s geographic origin from his/her genes. Nearly everyone’s DNA contains reliable information about their recent and ancient past. We are not all genetically alike. If we were, you couldn’t do studies like the one of Novembre et al. But neither are we radically different genetically, for if we were, you wouldn’t need hundreds of thousands of SNPs for such accurate predictions.
Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. 2008. Genes mirror geography within Europe. Nature 456:98-101.