Noah Rosenberg laboratory at the University of Michigan

Abstracts of Rosenberg lab publications


[48] JM VanLiere, NA Rosenberg (2008) Mathematical properties of the r^2 measure of linkage disequilibrium. Theoretical Population Biology 74: 130-137.

Statistics for linkage disequilibrium (LD), the non-random association of alleles at two loci, depend on the frequencies of the alleles at the loci under consideration. Here, we examine the r^2 measure of LD and its mathematical relationship to allele frequencies, quantifying the constraints on its maximum value. Assuming independent uniform distributions for the allele frequencies of two biallelic loci, we find that the mean maximum value of r^2 is ~0.43051, and that r^2 can exceed a threshold of 4/5 in only ~14.232% of the allele frequency space. If one locus is assumed to have known allele frequencies --- the situation in an association study in which LD between a known marker locus and an unknown trait locus is of interest --- we find that the mean maximum value of r^2 is greatest when the known locus has a minor allele frequency of ~0.30131. We find that in 1/4 of the space of allowed values of minor allele frequencies and haplotype frequencies at a pair of loci, the unconstrained maximum r^2 allowing for the possibility of recombination between the loci exceeds the constrained maximum assuming that no recombination has occurred. Finally, we use r^2_max to examine the connection between r^2 and the D' measure of linkage disequilibrium, finding that r^2/r^2_max = D'^2 for ~72.683% of the space of allowed values of (p_a, p_b, p_ab). Our results concerning the properties of r^2 have the potential to inform the interpretation of unusual LD behavior and to assist in the design of LD-based association-mapping studies.
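
As a rough numerical illustration of the quantities in this abstract (a sketch, not the paper's derivation), the snippet below assumes the standard definitions D = pab - pa*pb and r^2 = D^2 / (pa(1-pa)pb(1-pb)), bounds D by requiring all four haplotype frequencies to be nonnegative, and averages the resulting maximum over a midpoint grid on the unit square. The function names, the grid size, and the use of a grid rather than exact integration are illustrative assumptions.

```python
def r2(pa, pb, pab):
    """r^2 for two biallelic loci, from allele frequencies pa, pb and
    haplotype frequency pab (standard definition; names are illustrative)."""
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def r2_max(pa, pb):
    """Largest r^2 attainable for fixed allele frequencies 0 < pa, pb < 1."""
    # D is bounded by requiring all four haplotype frequencies to be >= 0.
    d_min = max(-pa * pb, -(1 - pa) * (1 - pb))
    d_max = min(pa * (1 - pb), (1 - pa) * pb)
    return max(d_min ** 2, d_max ** 2) / (pa * (1 - pa) * pb * (1 - pb))


def grid_summary(n=400, threshold=0.8):
    """Midpoint-rule estimates of the mean of r2_max and of the share of
    the (pa, pb) unit square where r2_max exceeds the threshold."""
    pts = [(i + 0.5) / n for i in range(n)]
    vals = [r2_max(pa, pb) for pa in pts for pb in pts]
    mean = sum(vals) / len(vals)
    share = sum(v > threshold for v in vals) / len(vals)
    return mean, share


if __name__ == "__main__":
    mean, share = grid_summary()
    print(f"mean r2_max ~ {mean:.4f}, share above 4/5 ~ {share:.4f}")
```

On a reasonably fine grid, the two estimates should land close to the ~0.43051 and ~14.232% values quoted in the abstract.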


[47] TJ Pemberton*, M Jakobsson*, DF Conrad, G Coop, JD Wall, JK Pritchard, PI Patel, NA Rosenberg (2008) Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India. Annals of Human Genetics 72: 535-546. [Data]

When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis --- such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.


[46] O François, MGB Blum, M Jakobsson, NA Rosenberg (2008) Demographic history of European populations of Arabidopsis thaliana. PLoS Genetics 4: e1000075. [Full text at journal website] [PDF] [Supplement]

The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east ~10,000 years ago, with a rate of westward spread of ~0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.


[45] JM Macpherson, J Gonzalez, DM Witten, JC Davis, NA Rosenberg, AE Hirsh, DA Petrov (2008) Nonadaptive explanations for signatures of partial selective sweeps in Drosophila. Molecular Biology and Evolution 25: 1025-1042.

A beneficial mutation that has nearly but not yet fixed in a population produces a characteristic haplotype configuration, called a partial selective sweep. Whether nonadaptive processes might generate similar haplotype configurations has not been extensively explored. Here, we consider 5 population genetic data sets taken from regions flanking high-frequency transposable elements in North American strains of Drosophila melanogaster, each of which appears to be consistent with the expectations of a partial selective sweep. We use coalescent simulations to explore whether incorporation of the species' demographic history, purifying selection against the element, or suppression of recombination caused by the element could generate putatively adaptive haplotype configurations. Whereas most of the data sets would be rejected as nonneutral under the standard neutral null model, only the data set for which there is strong external evidence in support of an adaptive transposition appears to be nonneutral under the more complex null model and in particular when demography is taken into account. High-frequency, derived mutations from a recently bottlenecked population, such as we study here, are of great interest to evolutionary genetics in the context of scans for adaptive events; we discuss the broader implications of our findings in this context.


[44] NA Rosenberg, R Tao (2008) Discordance of species trees with their most likely gene trees: the case of five taxa. Systematic Biology 57: 131-140. [Full text at journal website] [PDF] [Supplement]

Under a coalescent model for within-species evolution, gene trees may differ from species trees to such an extent that the gene tree topology most likely to evolve along the branches of a species tree can disagree with the species tree topology. Gene tree topologies that are more likely to be produced than the topology that matches that of the species tree are termed anomalous, and the region of branch-length space that gives rise to anomalous gene trees (AGTs) is the anomaly zone. We examine the occurrence of anomalous gene trees for the case of five taxa, the smallest number of taxa for which every species tree topology has a nonempty anomaly zone. Considering all sets of branch lengths that give rise to anomalous gene trees, the largest value possible for the smallest branch length in the species tree is greater in the five-taxon case (0.1934 coalescent time units) than in the previously studied case of four taxa (0.1568). The five-taxon case demonstrates the existence of three phenomena that do not occur in the four-taxon case. First, anomalous gene trees can have the same unlabeled topology as the species tree. Second, the anomaly zone does not necessarily enclose a ball centered at the origin in branch-length space, in which all branches are short. Third, as a branch length increases, it is possible for the number of AGTs to increase rather than decrease or remain constant. These results, which help to describe how the properties of anomalous gene trees increase in complexity as the number of taxa increases, will be useful in formulating strategies for evading the problem of anomalous gene trees during species tree inference from multilocus data.


[43] M Jakobsson*, SW Scholz*, P Scheet*, JR Gibbs, JM VanLiere, H-C Fung, ZA Szpiech, JH Degnan, K Wang, R Guerreiro, JM Bras, JC Schymick, DG Hernandez, BJ Traynor, J Simon-Sanchez, M Matarin, A Britton, J van de Leemput, I Rafferty, M Bucan, HM Cann, JA Hardy, NA Rosenberg, AB Singleton (2008) Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998-1003.

Genome-wide patterns of variation across individuals provide a powerful source of data for uncovering the history of migration, range expansion, and adaptation of the human species. However, high-resolution surveys of variation in genotype, haplotype and copy number have generally focused on a small number of population groups. Here we report the analysis of high-quality genotypes at 525,910 single-nucleotide polymorphisms (SNPs) and 396 copy-number-variable loci in a worldwide sample of 29 populations. Analysis of SNP genotypes yields strongly supported fine-scale inferences about population structure. Increasing linkage disequilibrium is observed with increasing geographic distance from Africa, as expected under a serial founder effect for the out-of-Africa spread of human populations. New approaches for haplotype analysis produce inferences about population structure that complement results based on unphased SNPs. Despite a difference from SNPs in the frequency spectrum of the copy-number variants (CNVs) detected --- including a comparatively large number of CNVs in previously unexamined populations from Oceania and the Americas --- the global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size. Our results produce new inferences about inter-population variation, support the utility of CNVs in human population-genetic research, and serve as a genomic resource for human-genetic studies in diverse worldwide populations.


[42] K Zhang, NA Rosenberg (2007) On the genealogy of a duplicated microsatellite. Genetics 177: 2109-2122.

When a microsatellite locus is duplicated in a diploid organism, a single pair of PCR primers may amplify as many as four distinct alleles. To study the evolution of a duplicated microsatellite, we consider a coalescent model with symmetric stepwise mutation. Conditional on the time of duplication and a mutation rate, both in a model of completely unlinked loci and in a model of completely linked loci, we compute the probabilities for a sampled diploid individual to amplify one, two, three, or four distinct alleles with one pair of microsatellite PCR primers. These probabilities are then studied to examine the nature of their dependence on the duplication time and the mutation rate. The mutation rate is observed to have a stronger effect than the duplication time on the four probabilities, and the unlinked and linked cases are seen to behave similarly. Our results can be useful for helping to interpret genetic variation at microsatellite loci in species with a very recent history of gene and genome duplication.


[41] S Wang*, CM Lewis Jr*, M Jakobsson*, S Ramachandran, N Ray, G Bedoya, W Rojas, MV Parra, JA Molina, C Gallo, G Mazzotti, G Poletti, K Hill, AM Hurtado, D Labuda, W Klitz, R Barrantes, MC Bortolini, FM Salzano, ML Petzl-Erler, LT Tsuneto, E Llop, F Rothhammer, L Excoffier, MW Feldman, NA Rosenberg, A Ruiz-Linares (2007) Genetic variation and population structure in Native Americans. PLoS Genetics 3: 2049-2067. [Full text at journal website] [PDF] [Supplement] [Data] [Readme for datafile]

We examined genetic diversity and population structure in the American landmass using 678 autosomal microsatellite markers genotyped in 422 individuals representing 24 Native American populations sampled from North, Central, and South America. These data were analyzed jointly with similar data available in 54 other indigenous populations worldwide, including an additional five Native American groups. The Native American populations have lower genetic diversity and greater differentiation than populations from other continental regions. We observe gradients both of decreasing genetic diversity as a function of geographic distance from the Bering Strait and of decreasing genetic similarity to Siberians --- signals of the southward dispersal of human populations from the northwestern tip of the Americas. We also observe evidence of: (1) a higher level of diversity and lower level of population structure in western South America compared to eastern South America, (2) a relative lack of differentiation between Mesoamerican and Andean populations, (3) a scenario in which coastal routes were easier for migrating peoples to traverse in comparison with inland routes, and (4) a partial agreement on a local scale between genetic similarity and the linguistic classification of populations. These findings offer new insights into the process of population dispersal and differentiation during the peopling of the Americas.


[40] M Jakobsson, NA Rosenberg (2007) CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23: 1801-1806. [Full text at journal website] [PDF] [Software]

Motivation: Clustering of individuals into populations on the basis of multilocus genotypes is informative in a variety of settings. In population-genetic clustering algorithms, such as BAPS, STRUCTURE and TESS, individual multilocus genotypes are partitioned over a set of clusters, often using unsupervised approaches that involve stochastic simulation. As a result, replicate cluster analyses of the same data may produce several distinct solutions for estimated cluster membership coefficients, even though the same initial conditions were used. Major differences among clustering solutions have two main sources: (1) "label switching" of clusters across replicates, caused by the arbitrary way in which clusters in an unsupervised analysis are labeled, and (2) "genuine multimodality," truly distinct solutions across replicates.
Results: To facilitate the interpretation of population-genetic clustering results, we describe three algorithms for aligning multiple replicate analyses of the same data set. We have implemented these algorithms in the computer program CLUMPP (CLUster Matching and Permutation Program). We illustrate the use of CLUMPP by aligning the cluster membership coefficients from 100 replicate cluster analyses of 600 chickens from 20 different breeds.
Availability: CLUMPP is freely available at http://rosenberglab.bioinformatics.med.umich.edu/clumpp.html
Contact: Mattias Jakobsson
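
The label-switching problem described above can be illustrated with a toy sketch. This is not CLUMPP's implementation or its similarity statistic: it simply tries every relabeling of the clusters in one replicate and keeps the one with the smallest sum of squared differences from a reference replicate. The function names and the squared-difference criterion are assumptions made for illustration.

```python
from itertools import permutations


def align_labels(reference, replicate):
    """Return the column order of `replicate` that best matches `reference`.

    Both arguments are membership matrices (rows = individuals, columns =
    clusters). Every permutation of columns is tried, so this is only
    practical for a small number of clusters.
    """
    k = len(reference[0])

    def distance(order):
        # Sum of squared differences after reordering the replicate columns.
        return sum(
            (ref_row[j] - rep_row[order[j]]) ** 2
            for ref_row, rep_row in zip(reference, replicate)
            for j in range(k)
        )

    return min(permutations(range(k)), key=distance)


def reorder(matrix, order):
    """Apply a column order to a membership matrix."""
    return [[row[j] for j in order] for row in matrix]


if __name__ == "__main__":
    q1 = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
    # Same solution as q1, but with the cluster labels shuffled.
    q2 = [[0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.8, 0.0, 0.2]]
    order = align_labels(q1, q2)
    print(order, reorder(q2, order) == q1)
```

Exhaustive search grows factorially with the number of clusters, which is one reason a program for this task would offer more than one algorithm.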


[39] MGB Blum, NA Rosenberg (2007) Estimating the number of ancestral lineages using a maximum likelihood method based on rejection sampling. Genetics 176: 1741-1757.

Estimating the number of ancestral lineages of a sample of DNA sequences at time t in the past can be viewed as a variation on the problem of estimating the time to the most recent common ancestor. To estimate the number of ancestral lineages, we develop a maximum-likelihood approach that takes advantage of a prior model of population demography, in addition to the molecular data summarized by the pattern of polymorphic sites. The method relies on a rejection sampling algorithm that is introduced for simulating conditional coalescent trees given a fixed number of ancestral lineages at time t. Computer simulations show that the number of ancestral lineages can be estimated accurately, provided that the number of mutations that occurred since time t is sufficiently large. The method is applied to 986 present-day human sequences located in hypervariable region 1 of the mitochondrion to estimate the number of ancestral lineages of modern humans at the time of potential admixture with the Neanderthal population. Our estimates support a view that the proportion of the modern population consisting of Neanderthal contributions must be relatively small, less than ~5%, if the admixture happened as recently as 30,000 years ago.
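
To make the quantity being estimated concrete, the sketch below simulates how many ancestral lineages remain at time t under the simplest setting: a standard coalescent with constant population size and time measured in coalescent units. It does not reproduce the paper's maximum-likelihood method, its rejection sampler, or its demographic model; the function names and parameters are illustrative assumptions.

```python
import random


def lineages_at_time(n, t, rng):
    """Simulate how many ancestral lineages of n sampled lineages remain at
    time t in the past (time in coalescent units, constant population size)."""
    k, elapsed = n, 0.0
    while k > 1:
        # With k lineages, the next coalescence arrives at rate k(k-1)/2.
        elapsed += rng.expovariate(k * (k - 1) / 2)
        if elapsed > t:
            break
        k -= 1
    return k


def lineage_distribution(n, t, reps=10000, seed=1):
    """Monte Carlo estimate of P(k lineages at time t) for k = 1..n."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(reps):
        counts[lineages_at_time(n, t, rng)] += 1
    return {k: c / reps for k, c in enumerate(counts) if c}


if __name__ == "__main__":
    print(lineage_distribution(n=20, t=0.5))
```

This gives only the distribution expected before any sequence data are considered; the approach in the abstract additionally uses the pattern of polymorphic sites.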


[38] NA Rosenberg (2007) Counting coalescent histories. Journal of Computational Biology 14: 360-377.

Given a species tree and a gene tree, a valid coalescent history is a list of the branches of the species tree on which coalescences in the gene tree take place. I develop a recursion for the number of valid coalescent histories that exist for an arbitrary gene tree/species tree pair, when one gene lineage is studied per species. The result is obtained by defining a concept of m-extended coalescent histories, enumerating and counting these histories, and taking the special case of m=1. As a sum over valid coalescent histories appears in a formula for the probability that a random gene tree evolving along the branches of a fixed species tree has a specified labeled topology, the enumeration of valid coalescent histories can considerably reduce the effort required for evaluating this formula.
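
The definition of a valid coalescent history can be made concrete in one special case. The sketch below is a brute-force enumeration, not the recursion developed in the paper, and it assumes that the gene tree and species tree are the same caterpillar tree with one lineage per species. Under that assumption a history is valid when each coalescence lies at or above its matching species-tree node and no coalescence lies below one of its descendants. The node and branch numbering is an illustrative convention.

```python
from itertools import product


def count_caterpillar_histories(n):
    """Count coalescent histories for an n-taxon caterpillar gene tree that
    matches an n-taxon caterpillar species tree (one lineage per species).

    Internal nodes are numbered 1..n-1 from the cherry up to the root, and
    branch j denotes the species-tree branch directly above species node j.
    """
    m = n - 1
    total = 0
    for history in product(range(1, m + 1), repeat=m):
        # Coalescence i cannot happen below species node i ...
        high_enough = all(b >= i for i, b in enumerate(history, start=1))
        # ... and a parent coalescence cannot lie below its child.
        ordered = all(x <= y for x, y in zip(history, history[1:]))
        if high_enough and ordered:
            total += 1
    return total


if __name__ == "__main__":
    print([count_caterpillar_histories(n) for n in range(2, 7)])
```

The enumeration visits (n-1)^(n-1) candidate lists, so it is only usable for small n; avoiding that kind of explicit listing is what a recursion is for.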


[37] NA Rosenberg, MGB Blum (2007) Sampling properties of homozygosity-based statistics for linkage disequilibrium. Mathematical Biosciences 208: 33-47.

Homozygosity-based statistics such as Ohta's identity-in-state (IIS) excess offer the potential to measure linkage disequilibrium for multiallelic loci in small samples. However, previous observations have suggested that for independent loci, in small samples these statistics might produce values that more frequently lie on one side rather than on the other side of zero. Here we investigate the sampling properties of the IIS excess. We find that for any pair of independent polymorphic loci, as sample size n approaches infinity, the sampling distribution of the IIS excess approaches a normal distribution. For large samples, the IIS excess tends towards symmetry around zero, and the probabilities of positive and of negative IIS excess both approach 1/2. Surprisingly, however, we also find that for sufficiently large n, independent loci can be chosen so that the probability of a sample having positive IIS excess is arbitrarily close to either 0 or 1. The results are applied to interpretation of data from human populations, and we conclude that before employing homozygosity-based statistics to measure LD in a particular sample, especially for loci with either very small or very large homozygosities, it is useful to verify that loci with the observed homozygosity values are not likely to produce a large bias in IIS excess in samples of the given size.


[36] M Jakobsson, NA Rosenberg (2007) The probability distribution under a population divergence model of the number of genetic founding lineages of a population or species. Theoretical Population Biology 71: 502-523.

The composition of genetic variation in a population or species is shaped by the number of events that led to the founding of the group. We consider a neutral coalescent model of two populations, where a derived population is founded as an offshoot of an ancestral population. For a given locus, using both recursive and nonrecursive approaches, we compute the probability distribution of the number of genetic founding lineages that have given rise to the derived population. This number of genetic founding lineages is defined as the number of ancestral individuals that contributed at the locus to the present-day derived population, and is formulated in terms of interspecific coalescence events. The effects of sample size and divergence time on the probability distribution of the number of founding lineages are studied in detail. For 99.99% of the loci in the derived population to each have one founding lineage, the two populations must be separated for >=9.9N generations. However, only ~0.87N generations must pass since divergence for 99.99% of the loci to have <6 founding lineages. Our results are useful as a prior expectation on the number of founding lineages in scenarios that involve the evolution of one population from the splitting of an ancestral group, such as in the colonization of islands, the formation of polyploid species, and the domestication of crops and livestock from wild ancestors.


[35] L David, NA Rosenberg, U Lavi, MW Feldman, J Hillel (2007) Genetic diversity and population structure inferred from the partially duplicated genome of domesticated carp, Cyprinus carpio L. Genetics Selection Evolution 39: 319-340.

Genetic relationships among eight populations of domesticated carp (Cyprinus carpio L.), a species with a partially duplicated genome, were studied using 12 microsatellites and 505 AFLP bands. The populations included three aquacultured carp strains and five ornamental carp (koi) variants. Grass carp (Ctenopharyngodon idella) was used as an outgroup. AFLP-based gene diversity varied from 5% (grass carp) to 32% (koi) and reflected the reasonably well understood histories and breeding practices of the populations. A large fraction of the molecular variance was due to differences between aquacultured and ornamental carps. Further analyses based on microsatellite data, including cluster analysis and neighbor-joining trees, supported the genetic distinctiveness of aquacultured and ornamental carps, despite the recent divergence of the two groups. In contrast to what was observed for AFLP-based diversity, the frequency of heterozygotes based on microsatellites was comparable among all populations. This discrepancy can potentially be explained by duplication of some loci in Cyprinus carpio L., and a model that shows how duplication can increase heterozygosity estimates for microsatellites but not for AFLP loci is discussed. Our analyses in carp can help in understanding the consequences of genotyping duplicated loci and in interpreting discrepancies between dominant and co-dominant markers in species with recent genome duplication.


[34] KB Schroeder, TG Schurr, JC Long, NA Rosenberg, MH Crawford, LA Tarskaia, LP Osipova, SI Zhadanov, DG Smith (2007) A private allele ubiquitous in the Americas. Biology Letters 3: 218-223.

The three-wave migration hypothesis of Greenberg et al. has permeated the genetic literature on the peopling of the Americas. Greenberg et al. proposed that Na-Dene, Aleut-Eskimo and Amerind are language phyla which represent separate migrations from Asia to the Americas. We show that a unique allele at autosomal microsatellite locus D9S1120 is present in all sampled North and South American populations, including the Na-Dene and Aleut-Eskimo, and in related Western Beringian groups, at an average frequency of 31.7%. This allele was not observed in any sampled putative Asian source populations or in other worldwide populations. Neither selection nor admixture explains the distribution of this regionally specific marker. The simplest explanation for the ubiquity of this allele across the Americas is that the same founding population contributed a large fraction of ancestry to all modern Native American populations.


[33] NA Rosenberg (2007) Statistical tests for taxonomic distinctiveness from observations of monophyly. Evolution 61: 317-323.

The observation of monophyly for a specified set of genealogical lineages is often used to place the lineages into a distinctive taxonomic entity. However, it is sometimes possible that monophyly of the lineages can occur by chance as an outcome of the random branching of lineages within a single taxon. Thus, especially for small samples, an observation of monophyly for a set of lineages --- even if strongly supported statistically --- does not necessarily indicate that the lineages are from a distinctive group. Here I develop a test of the null hypothesis that monophyly is a chance outcome of random branching. I also compute the sample size required so that the probability of chance occurrence of monophyly of a specified set of lineages lies below a prescribed tolerance. Under the null model of random branching, the probability that monophyly of the lineages in an index group occurs by chance is substantial if the sample is highly asymmetric, that is, if only a few of the sampled lineages are from the index group, or if only a few lineages are external to the group. If sample sizes are similar inside and outside the group of interest, however, chance occurrence of monophyly can be rejected at stringent significance levels (P < 10^{-5}) even for quite small samples (~20 total lineages). For a fixed total sample size, rejection of the null hypothesis of random branching in a single taxon occurs at the most stringent level if samples of nearly equal size inside and outside the index group --- with a slightly greater size within the index group --- are used. Similar results apply, with smaller sample sizes needed, when reciprocal monophyly of two groups, rather than monophyly of a single group, is of interest. The results suggest minimal sample sizes required for inferences to be made about taxonomic distinctiveness from observations of monophyly.
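
The chance-monophyly probability discussed above can be computed with a short recursion. The sketch below is not taken from the paper: it assumes that random branching is modeled by joining uniformly random pairs of lineages backward in time, with a lineages in the index group and b lineages outside it. The function name and the use of exact fractions are illustrative choices.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def prob_monophyly(a, b):
    """Probability that a index-group lineages form a monophyletic group
    when a + b lineages are joined in uniformly random pairs back in time."""
    if a <= 1 or b == 0:
        return Fraction(1)
    pairs = comb(a + b, 2)
    # Two index-group lineages join: monophyly is still possible.
    p = Fraction(comb(a, 2), pairs) * prob_monophyly(a - 1, b)
    # Two outside lineages join: the index group is unaffected.
    p += Fraction(comb(b, 2), pairs) * prob_monophyly(a, b - 1)
    # An index-group lineage joining an outside lineage breaks monophyly,
    # so that case contributes nothing.
    return p


if __name__ == "__main__":
    for a, b in [(2, 18), (10, 10), (18, 2)]:
        print(a, b, float(prob_monophyly(a, b)))
```

Comparing a balanced sample such as (10, 10) with an asymmetric one such as (2, 18) shows the pattern the abstract describes: the balanced case gives a far smaller probability of monophyly by chance.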


[32] NA Rosenberg, S Mahajan, C Gonzalez-Quevedo, MGB Blum, L Nino-Rosales, V Ninis, P Das, M Hegde, L Molinari, G Zapata, JL Weber, JW Belmont, PI Patel (2006) Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genetics 2: 2052-2061. [Full text at journal website] [PDF] [Supplementary Tables 1-3 (DOC)] [Supplementary Tables 1-3 (PDF)]

Ongoing modernization in India has elevated the prevalence of many complex genetic diseases associated with a western lifestyle and diet to near-epidemic proportions. However, although India comprises more than one sixth of the world's human population, it has largely been omitted from genomic surveys that provide the backdrop for association studies of genetic disease. Here, by genotyping India-born individuals sampled in the United States, we carry out an extensive study of Indian genetic variation. We analyze 1,200 genome-wide polymorphisms in 432 individuals from 15 Indian populations. We find that populations from India, and populations from South Asia more generally, constitute one of the major human subgroups with increased similarity of genetic ancestry. However, only a relatively small amount of genetic differentiation exists among the Indian populations. Although caution is warranted due to the fact that United States-sampled Indian populations do not represent a random sample from India, these results suggest that the frequencies of many genetic variants are distinctive in India compared to other parts of the world and that the effects of population heterogeneity on the production of false positives in association studies may be smaller in Indians (and particularly in Indian-Americans) than might be expected for such a geographically and linguistically diverse subset of the human population.


[31] DF Conrad*, M Jakobsson*, G Coop*, X Wen, JD Wall, NA Rosenberg, JK Pritchard (2006) A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics 38: 1251-1260. [PDF] [Supplement (methods, note, and figures)] [Supplementary Table 1] [Data]

Recent genomic surveys have produced high-resolution haplotype information, but only in a small number of human populations. We report haplotype structure across 12 Mb of DNA sequence in 927 individuals representing 52 populations. The geographic distribution of haplotypes reflects human history, with a loss of haplotype diversity as distance increases from Africa. Although the extent of linkage disequilibrium (LD) varies markedly across populations, considerable sharing of haplotype structure exists, and inferred recombination hotspot locations generally match across groups. The four samples in the International HapMap Project contain the majority of common haplotypes found in most populations: averaging across populations, 83% of common 20-kb haplotypes in a population are also common in the most similar HapMap sample. Consequently, although the portability of tag SNPs based on the HapMap is reduced in low-LD Africans, the HapMap will be helpful for the design of genome-wide association mapping studies in nearly all human populations.


[30] NA Rosenberg (2006) Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Annals of Human Genetics 70: 841-847. [PDF] [Supplement] [Data] [Spreadsheet with recommended subsets (txt format)] [Spreadsheet with recommended subsets (xls format)]

The HGDP-CEPH Human Genome Diversity Cell Line Panel is a widely-used resource for studies of human genetic variation. Here, pairs of close relatives that have been included in the panel are identified. Together with information on atypical and duplicated samples, the inferred relative pairs suggest standardized subsets of the panel for use in future population-genetic studies.


[29] NA Rosenberg, M Nordborg (2006) A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed or spatially distributed populations. Genetics 173: 1665-1678. [PDF]

In linkage disequilibrium mapping of genetic variants causally associated with phenotypes, spurious associations can potentially be generated by any of a variety of types of population structure. However, mathematical theory of the production of spurious associations has largely been restricted to population structure models that involve the sampling of individuals from a collection of discrete subpopulations. Here, we introduce a general model of spurious association in structured populations, appropriate whether the population structure involves discrete groups, admixture among such groups, or continuous variation across space. Under the assumptions of the model, we find that a single common principle --- applicable to both the discrete and admixed settings as well as to spatial populations --- gives a necessary and sufficient condition for the occurrence of spurious associations. Using a mathematical connection between the discrete and admixed cases, we show that in admixed populations, spurious associations are less severe than in corresponding mixtures of discrete subpopulations, especially when the variance of admixture across individuals is small. This observation, together with the results of simulations that examine the relative influences of various model parameters, has important implications for the design and analysis of genetic association studies in structured populations.


[28] JH Degnan, NA Rosenberg (2006) Discordance of species trees with their most likely gene trees. PLoS Genetics 2: 762-768. [Full text at journal website] [PDF]

Because of the stochastic way in which lineages sort during speciation, gene trees may differ in topology from each other and from species trees. Surprisingly, assuming that genetic lineages follow a coalescent model of within-species evolution, we find that for any species tree topology with five or more species, there exist branch lengths for which gene tree discordance is so common that the most likely gene tree topology to evolve along the branches of a species tree differs from the species phylogeny. This counterintuitive result implies that in combining data on multiple loci, the straightforward procedure of using the most frequently observed gene tree topology as an estimate of the species tree topology can be asymptotically guaranteed to produce an incorrect estimate. We conclude with suggestions that can aid in overcoming this new obstacle to accurate genomic inference of species phylogenies.
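
For contrast with the result above, the simplest case can be written out directly. The sketch below uses the standard three-taxon coalescent result, which is assumed here rather than taken from the paper: for species tree ((A,B),C) with internal branch length t in coalescent units and one lineage per species, the A and B lineages fail to coalesce on the internal branch with probability exp(-t), after which all three topologies are equally likely. The function name is illustrative.

```python
from math import exp


def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C), where t
    is the internal branch length in coalescent units and one lineage is
    sampled per species."""
    # The A and B lineages fail to coalesce on the internal branch with
    # probability exp(-t); the three topologies are then equally likely.
    deep = exp(-t)
    return {
        "((A,B),C)": 1 - 2 * deep / 3,
        "((A,C),B)": deep / 3,
        "((B,C),A)": deep / 3,
    }


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0):
        print(t, three_taxon_gene_tree_probs(t))
```

With three taxa the matching topology is never less probable than either alternative, so the discordance described in the abstract cannot arise in this case and requires larger trees.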


[27] NA Rosenberg (2006) The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees. Annals of Combinatorics 10: 129-146. [PDF]

The Yule model is a frequently used evolutionary model that can be utilized to generate random genealogical trees. Under this model, using a backwards counting method differing from the approach previously employed by Heard (Evolution 46: 1818-1826), for a genealogical tree of n lineages, the mean number of nodes with exactly r descendants is computed (2 ≤ r ≤ n-1). The variance of the number of r-pronged nodes is also obtained, as are the mean and variance of the number of r-caterpillars. These results generalize computations of McKenzie and Steel for the case of r=2 (Math. Biosci. 164: 81-92, 2000). For a given n, the two means are largest at r=2, equaling 2n/3 for n ≥ 5. However, for n ≥ 9, the variances are largest at r=3, equaling 23n/420 for n ≥ 7. As n → ∞, the fraction of internal nodes that are r-caterpillars for some r approaches (e^2-5)/4 ≈ 0.59726.
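
As a quick numerical check of the limiting constant quoted at the end of this abstract, read here as (e^2 - 5)/4, the following minimal Python sketch evaluates the expression. The variable name is illustrative only and does not come from the paper.

```python
import math

# Limiting fraction of internal nodes that are r-caterpillars for some r,
# using the expression quoted in the abstract: (e^2 - 5) / 4.
limit_fraction = (math.e ** 2 - 5) / 4

print(f"{limit_fraction:.5f}")  # 0.59726
```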


[26] NA Rosenberg, S Mahajan, S Ramachandran, C Zhao, JK Pritchard, MW Feldman (2005) Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics 1: 660-671. [Full-text at journal website] [PDF] [Data]

Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables --- sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample --- on the "clusteredness" of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.


[25] NA Rosenberg (2005) Algorithms for selecting informative marker panels for population assignment. Journal of Computational Biology 12: 1183-1201. [PDF]

Given a set of potential source populations, genotypes of an individual of unknown origin at a collection of markers can be used to predict the correct source population of the individual. For improved efficiency, informative markers can be chosen from a larger set of markers to maximize the accuracy of this prediction. However, selecting the loci that are individually most informative does not necessarily produce the optimal panel. Here, using genotypes from eight species — carp, cat, chicken, dog, fly, grayling, human, and maize — this univariate accumulation procedure is compared to new multivariate "greedy" and "maximin" algorithms for choosing marker panels. The procedures generally suggest similar panels, although the greedy method often recommends inclusion of loci that are not chosen by the other algorithms. In seven of the eight species, when applied to five or more markers, all methods achieve at least 94% assignment accuracy on simulated individuals, with one species — dog — producing this level of accuracy with only three markers, and the eighth species — human — requiring ~13-16 markers. The new algorithms produce substantial improvements over use of randomly selected markers; where differences among the methods are noticeable, the greedy algorithm leads to slightly higher probabilities of correct assignment. Although none of the approaches necessarily chooses the panel with optimal performance, the algorithms all likely select panels with performance near enough to the maximum that they all are suitable for practical use.


[24] S Ramachandran, O Deshpande, CC Roseman, NA Rosenberg, MW Feldman, LL Cavalli-Sforza (2005) Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proceedings of the National Academy of Sciences USA 102: 15942-15947. [PDF] [Supplementary Figure 6] [Supplementary Table 2] [Supplementary text] [Data]

Equilibrium models of isolation by distance predict an increase in genetic differentiation with geographic distance. Here we find a linear relationship between genetic and geographic distance in a worldwide sample of human populations, with major deviations from the fitted line explicable by admixture or extreme isolation. A close relationship is shown to exist between the correlation of geographic distance and genetic differentiation (as measured by Fst) and the geographic pattern of heterozygosity across populations. Considering a worldwide set of geographic locations as possible sources of the human expansion, we find that heterozygosities in the globally distributed populations of the data set are best explained by an expansion originating in Africa and that no geographic origin outside of Africa accounts as well for the observed patterns of genetic diversity. Although the relationship between Fst and geographic distance has been interpreted in the past as the result of an equilibrium model of drift and dispersal, simulation shows that the geographic pattern of heterozygosities in this data set is consistent with a model of a serial founder effect starting at a single origin. Given this serial-founder scenario, the relationship between genetic and geographic distance allows us to derive bounds for the effects of drift and natural selection on human genetic variation.

[23] NA Rosenberg (2005) A sharp minimum on the mean number of steps taken in adaptive walks. Journal of Theoretical Biology 237: 17-22. [PDF]

It was recently conjectured by H.A. Orr [2003. A minimum on the mean number of steps taken in adaptive walks. J. Theor. Biol. 220, 241-247] that from a random initial point on a random fitness landscape of alphabetic sequences with one-mutation adjacency, chosen from a larger class of landscapes, no adaptive algorithm can arrive at a local optimum in fewer than on average e-1 steps. Here, using an example in which the mean number of steps to a local optimum equals (A-1)/A, where A is the number of distinct "letters" in the "alphabet" from which sequences are constructed, it is shown that as originally stated, the conjecture does not hold. It is also demonstrated that (A-1)/A is a sharp minimum on the mean number of steps taken in adaptive walks on fitness landscapes of alphabetic sequences with one-mutation adjacency. As the example that achieves the new lower bound has properties that are not often considered as potential attributes for fitness landscapes --- non-identically distributed fitnesses and negative fitness correlations for adjacent points --- a weaker set of conditions characteristic of more commonly studied fitness landscapes is proposed under which the lower bound on the mean length of adaptive walks is conjectured to equal e-1.

[22] M Nordborg, TT Hu, Y Ishino, J Jhaveri, C Toomajian, H Zheng, E Bakker, P Calabrese, J Gladstone, R Goyal, M Jakobsson, S Kim, Y Morozov, B Padhukasahasram, V Plagnol, NA Rosenberg, C Shah, JD Wall, J Wang, K Zhao, T Kalbfleisch, V Schulz, M Kreitman, J Bergelson (2005) The pattern of polymorphism in Arabidopsis thaliana. PLoS Biology 3: 1289-1299. [Full-text at journal website] [PDF]

We resequenced 876 short fragments in a sample of 96 individuals of Arabidopsis thaliana that included stock center accessions as well as a hierarchical sample from natural populations. Although A. thaliana is a selfing weed, the pattern of polymorphism in general agrees with what is expected for a widely distributed, sexually reproducing species. Linkage disequilibrium decays rapidly, within 50 kb. Variation is shared worldwide, although population structure and isolation by distance are evident. The data fail to fit standard neutral models in several ways. There is a genome-wide excess of rare alleles, at least partially due to selection. There is too much variation between genomic regions in the level of polymorphism. The local level of polymorphism is negatively correlated with gene density and positively correlated with segmental duplications. Because the data do not fit theoretical null distributions, attempts to infer natural selection from polymorphism data will require genome-wide surveys of polymorphism in order to identify anomalous regions. Despite this, our data support the utility of A. thaliana as a model for evolutionary functional genomics.

[21] H Innan, K Zhang, P Marjoram, S Tavaré, NA Rosenberg (2005) Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites. Genetics 169: 1763-1777. [PDF] [Software]

Several tests of neutral evolution employ the observed number of segregating sites and properties of the haplotype frequency distribution as summary statistics and use simulations to obtain rejection probabilities. Here we develop a "haplotype configuration test" of neutrality (HCT) based on the full haplotype frequency distribution. To enable exact computation of rejection probabilities for small samples, we derive a recursion under the standard coalescent model for the joint distribution of the haplotype frequencies and the number of segregating sites. For larger samples, we consider simulation-based approaches. The utility of the HCT is demonstrated in simulations of alternative models and in application to data from Drosophila melanogaster.

[20] NA Rosenberg, PP Calabrese (2004) Polyploid and multilocus extensions of the Wahlund inequality. Theoretical Population Biology 66: 381-391. [PDF]

Wahlund's inequality informally states that if a structured and an unstructured population have the same allele frequencies at a locus, the structured population contains more homozygotes. We show that this inequality holds generally for ploidy level P, that is, the structured population has more P-polyhomozygotes. Further, for M randomly chosen loci (M greater than or equal to 2), the structured population is also expected to contain more M-multihomozygotes than an unstructured population with the same single-locus homozygosities. The extended inequalities suggest multilocus identity coefficients analogous to FST. Using microsatellite genotypes from human populations, we demonstrate that the multilocus Wahlund inequality can explain a positive bias in "identity-in-state excess."
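
The single-locus inequality described in this abstract can be illustrated with a small Python sketch. It assumes random union of alleles within each subpopulation, takes a P-polyhomozygote to be an individual whose P allele copies are all identical, and uses hypothetical allele frequencies and subpopulation weights chosen only for illustration. The function names are not from the paper.

```python
def homozygosity(freqs, ploidy=2):
    """Probability that all `ploidy` allele copies are identical,
    assuming random union of alleles with frequencies `freqs`."""
    return sum(p ** ploidy for p in freqs)


def structured_homozygosity(subpops, weights, ploidy=2):
    """Weighted average of within-subpopulation homozygosities."""
    return sum(w * homozygosity(f, ploidy) for f, w in zip(subpops, weights))


def pooled_frequencies(subpops, weights):
    """Allele frequencies of an unstructured population with the same
    overall allele frequencies as the structured one."""
    n_alleles = len(subpops[0])
    return [sum(w * f[a] for f, w in zip(subpops, weights))
            for a in range(n_alleles)]


# Hypothetical example: two equally weighted subpopulations, two alleles.
subpops = [[0.9, 0.1], [0.3, 0.7]]
weights = [0.5, 0.5]
pooled = pooled_frequencies(subpops, weights)

results = {}
for ploidy in (2, 4):
    structured = structured_homozygosity(subpops, weights, ploidy)
    unstructured = homozygosity(pooled, ploidy)
    results[ploidy] = (structured, unstructured)
    print(ploidy, round(structured, 4), round(unstructured, 4))
```

In this hypothetical example the structured population has homozygosity of about 0.70 against 0.52 for the pooled population at ploidy 2, and about 0.45 against 0.16 at ploidy 4, in the direction the inequality predicts.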

[19] MM Tanaka, NA Rosenberg, PM Small (2004) The control of copy number of IS6110 in Mycobacterium tuberculosis. Molecular Biology and Evolution 21: 2195-2201. [PDF]

Insertion sequence (IS) elements are bacterial genes that are able to transpose to different locations in the genome. These elements are often used in molecular epidemiology as genetic markers that track the spread of pathogens. Transposable elements have frequently been described as "selfish DNA" because they facilitate their own transposition, causing damage when they insert into coding regions, while contributing little if anything to the bacterial host. According to this hypothesis, the expansion of copy number of insertion sequences is opposed by negative selection against high copy numbers. From an alternative point of view, we might expect IS elements to intrinsically regulate transposition within cells, thereby limiting damage to their bacterial host. Here, we report evidence that the copy number of IS6110 in Mycobacterium tuberculosis is controlled by selection against the element. We first construct 12 different models of marker change resulting from a combination of possible transposition functions and selective regimes. We then compute the Akaike Information Criterion for each model to identify the models that best explain data consisting of serial isolates of M. tuberculosis genotyped with IS6110. We find that the best performing models all include selection against the accumulation of copies. Specifically, our analysis points to the interaction of separate copies of the element causing lethal effects. We discuss the implications of these findings for genome evolution and molecular epidemiology.

[18] S Ramachandran, NA Rosenberg, LA Zhivotovsky, MW Feldman (2004) Robustness of the inference of human population structure: a comparison of X-chromosomal and autosomal microsatellites. Human Genomics 1: 87-97. [PDF]

In this paper, data on 20 X-chromosomal microsatellite polymorphisms from the HGDP-CEPH cell line panel are used to infer human population structure. Inferences from these data are compared to those obtained from autosomal microsatellites. Some of the major features of the structure seen with 377 autosomal markers are generally visible with the X-linked markers, although the latter provide less resolution. Differences between the X-chromosomal and autosomal results can be explained without requiring major differences in demographic parameters between males and females. The dependence of the partitioning on the number of individuals sampled from each region and on the number of markers used is discussed.

[17] NA Rosenberg (2004) Distruct: a program for the graphical display of population structure. Molecular Ecology Notes 4: 137-138. [PDF] [Software]

In analysis of multilocus genotypes from structured populations, individual coefficients of membership in subpopulations are often estimated using programs such as structure. Distruct provides a general method for visualizing these estimated membership coefficients. Subpopulations are represented as colours, and individuals are depicted as bars partitioned into coloured segments that correspond to membership coefficients in the subgroups. Distruct, available at http://www.cmb.usc.edu/~noahr/distruct.html, can also be used to display subpopulation assignment probabilities when individuals are assumed to have ancestry in only one group.

[16] NA Rosenberg, LM Li, R Ward, JK Pritchard (2003) Informativeness of genetic markers for inference of ancestry. American Journal of Human Genetics 73: 1402-1422. [PDF] [Supplement] [Microsatellite data] [SNP data] [SNP data readme] [Solution to Problem 11039 required in appendix of paper (American Mathematical Monthly 112: 572-573, 2005)] [Software]

Inference of individual ancestry is useful in various applications, such as admixture mapping and structured-association mapping. Using information-theoretic principles, we introduce a general measure, the informativeness for assignment (In), applicable to any number of potential source populations, for determining the amount of information that multiallelic markers provide about individual ancestry. In a worldwide human microsatellite data set, we identify markers of highest informativeness for inference of regional ancestry and for inference of population ancestry within regions; these markers, which are listed in online-only tables in our article, can be useful both in testing for and in controlling the influence of ancestry on case-control genetic association studies. Markers that are informative in one collection of source populations are generally informative in others. Informativeness of random dinucleotides, the most informative class of microsatellites, is five to eight times that of random single-nucleotide polymorphisms (SNPs), but 2%-12% of SNPs have higher informativeness than the median for dinucleotides. Our results can aid in decisions about the type, quantity, and specific choice of markers for use in studies of ancestry.
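
One way to make an information-theoretic measure of this kind concrete is sketched below in Python. The sketch assumes that informativeness for assignment can be computed as the mutual information between the source population (taken as equally likely among K candidates) and the allele drawn at a locus; the exact definition is given in the paper and should be checked there. The function name and the input layout are hypothetical.

```python
import math


def informativeness_for_assignment(freqs_by_pop):
    """Mutual information (natural log) between source population and allele.

    Assumes each of the K populations is equally likely a priori.
    `freqs_by_pop` is a list of K lists of allele frequencies, one list
    per population, each summing to 1.
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K populations.
        p_j = sum(pop[j] for pop in freqs_by_pop) / k
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for pop in freqs_by_pop:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total


# Identical frequencies carry no information about ancestry.
uninformative = informativeness_for_assignment([[0.5, 0.5], [0.5, 0.5]])
# Fully diagnostic alleles reach the maximum, log K.
diagnostic = informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]])
print(uninformative, diagnostic)
```

Under this assumption the measure ranges from 0, when all populations share the same allele frequencies, to log K, when every allele is private to a single population.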

[15] NA Rosenberg, AE Hirsh (2003) On the use of star-shaped genealogies in inference of coalescence times. Genetics 164: 1677-1682. [PDF]

Genealogies from rapidly growing populations have approximate "star" shapes. We study the degree to which this approximation holds in the context of estimating the time to the most recent common ancestor (TMRCA) of a set of lineages. In an exponential growth scenario, we find that unless the product of population size (N) and growth rate (r) is at least 10^5, the "pairwise comparison estimator" of TMRCA that derives from the star genealogy assumption has bias of 10-50%. Thus, the estimator is appropriate only for large populations that have grown very rapidly. The "tree-length estimator" of TMRCA is more biased than the pairwise comparison estimator, having low bias only for extremely large values of Nr.

[14] NA Rosenberg (2003) The shapes of neutral gene genealogies in two species: probabilities of monophyly, paraphyly, and polyphyly in a coalescent model. Evolution 57: 1465-1477. [PDF]

The genealogies of samples of orthologous regions from multiple species can be classified by their shapes. Using a neutral coalescent model of two species, I give exact probabilities of each of four possible genealogical shapes — reciprocal monophyly, two types of paraphyly, and polyphyly. After the divergence that forms two species, each of which has population size N, polyphyly is the most likely genealogical shape for the lineages of the two species. At ~1.300N generations after divergence, paraphyly becomes most likely, and reciprocal monophyly becomes most likely at ~1.665N generations. For a given species, the time at which 99% of its loci acquire monophyletic genealogies is ~5.298N generations, assuming all loci in its sister species are monophyletic. The probability that all lineages of two species are reciprocally monophyletic given that a sample from the two species has a reciprocally monophyletic genealogy increases rapidly with sample size, as does the probability that the most recent common ancestor (MRCA) for a sample is also the MRCA for all lineages from the two species. The results have potential applications for the testing of evolutionary hypotheses.

[13] NA Rosenberg, JK Pritchard, JL Weber, HM Cann, KK Kidd, LA Zhivotovsky, MW Feldman (2003) Response to comment on "Genetic structure of human populations." Science 300: 1877. [PDF] [Data]

Our higher within-group variance component estimate in relation to comparable past studies is due to our use of allelic indicator variables, inclusion of tetranucleotide loci, and analysis of a sample that contained proportionately fewer geographically well-separated populations. The 83.4% estimate of Excoffier and Hamilton employs a subset of groups that are nearly maximally differentiated within regions, and it can therefore be regarded as a lower bound.

[12] NA Rosenberg, AG Tsolaki, MM Tanaka (2003) Estimating change rates of genetic markers using serial samples: applications to the transposon IS6110 in Mycobacterium tuberculosis. Theoretical Population Biology 63: 347-363. [PDF]

In infectious disease epidemiology, it is useful to know how quickly genetic markers of pathogenic agents evolve while inside hosts. We propose a modular framework with which these genotype change rates can be estimated. The estimation scheme requires a model of the underlying process of genetic change, a detection scheme that filters this process into observable quantities, and a monitoring scheme that describes the timing of observations. We study a linear "birth-shift-death" model for change in transposable element genotypes, obtaining maximum-likelihood estimators for various detection and monitoring schemes. The method is applied to serial genotypes of the transposon IS6110 in Mycobacterium tuberculosis. The estimated birth rate of 0.0161 (events per copy of the transposon per year) and death rate of 0.0108 are both significantly larger than the estimated shift rate of 0.0018. The sum of these estimates, which corresponds to a "half-life" of 2.4 years for a typical strain that has 10 copies of the element, substantially exceeds a previous estimate of 0.0135 total changes per copy per year. We consider experimental design issues that enable the precision of estimates to be improved. We also discuss extensions to other markers and implications for molecular epidemiology.
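
The "half-life" figure in this abstract can be reproduced from the three quoted rates with the short Python sketch below. It assumes that the half-life is the time at which a strain has probability 1/2 of having undergone no change, with changes arriving at a constant rate equal to the per-copy total multiplied by the number of copies. That interpretation and the variable names are assumptions made for illustration.

```python
import math

# Estimated rates quoted in the abstract (events per copy per year).
birth_rate = 0.0161
death_rate = 0.0108
shift_rate = 0.0018

# Total change rate per copy per year.
total_rate_per_copy = birth_rate + death_rate + shift_rate  # about 0.0287

# A typical strain with 10 copies of the element.
copies = 10
strain_rate = copies * total_rate_per_copy

# Assumption: exponential waiting time to the first change, so the
# half-life is ln(2) divided by the strain-level rate.
half_life_years = math.log(2) / strain_rate

print(round(half_life_years, 1))  # 2.4
```

The per-copy total of about 0.0287 changes per year is roughly twice the previous estimate of 0.0135 mentioned in the abstract.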

[11] LA Zhivotovsky, NA Rosenberg, MW Feldman (2003) Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. American Journal of Human Genetics 72: 1171-1186. [PDF] [Data]

We study data on variation in 52 worldwide populations at 377 autosomal short tandem repeat loci, to infer a demographic history of human populations. Variation at di-, tri-, and tetranucleotide repeat loci is distributed differently, although each class of markers exhibits a decrease of within-population genetic variation in the following order: sub-Saharan Africa, Eurasia, East Asia, Oceania, and America. There is a similar decrease in the frequency of private alleles. With multidimensional scaling, populations belonging to the same major geographic region cluster together, and some regions permit a finer resolution of populations. When a stepwise mutation model is used, a population tree based on TD estimates of divergence time suggests that the branches leading to the present sub-Saharan African populations of hunter-gatherers were the first to diverge from a common ancestral population (~71-142 thousand years ago). The branches corresponding to sub-Saharan African farming populations and those that left Africa diverge next, with subsequent splits of branches for Eurasia, Oceania, East Asia, and America. African hunter-gatherer populations and populations of Oceania and America exhibit no statistically significant signature of growth. The features of population subdivision and growth are discussed in the context of the ancient expansion of modern humans.

[10] NA Rosenberg, JK Pritchard, JL Weber, HM Cann, KK Kidd, LA Zhivotovsky, MW Feldman (2002) Genetic structure of human populations. Science 298: 2381-2385. [Full Text at Science website] [PDF] [Supplement] [Data in Excel] [Data in structure and NEXUS formats] [Software for drawing figures] [Español]

We studied human population structure using genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 populations. Within-population differences among individuals account for 93 to 95% of genetic variation; differences among major groups constitute only 3 to 5%. Nevertheless, without using prior information about the origins of individuals, we identified six main genetic clusters, five of which correspond to major geographic regions, and subclusters that often correspond to individual populations. General agreement of genetic and predefined populations suggests that self-reported ancestry can facilitate assessments of epidemiological risks but does not obviate the need to use genetic information in genetic association studies.

[9] NA Rosenberg, M Nordborg (2002) Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms. Nature Reviews Genetics 3: 380-390. [PDF] [article at NRG website (includes "bullet point" summary)]

Improvements in genotyping technologies have led to the increased use of genetic polymorphism for inference about population phenomena, such as migration and selection. Such inference presents a challenge, because polymorphism data reflect a unique, complex, non-repeatable evolutionary history. Traditional analysis methods do not take this into account. A stochastic process known as "the coalescent" presents a coherent statistical framework for analysis of genetic polymorphisms.

[8] NA Rosenberg (2002) The probability of topological concordance of gene trees and species trees. Theoretical Population Biology 61: 225-247. [PDF]

The concordance of gene trees and species trees is reconsidered in detail, allowing for samples of arbitrary size to be taken from the species. A sense of concordance for gene tree and species tree topologies is clarified, such that if the "collapsed gene tree" produced by a gene tree has the same topology as the species tree, the gene tree is said to be topologically concordant with the species tree. The term speciodendric is introduced to refer to genes whose trees are topologically concordant with species trees. For a given three-species topology, probabilities of each of the three possible collapsed gene tree topologies are given, as are probabilities of monophyletic concordance and concordance in the sense of N. Takahata (1989), Genetics 122, 957-966. Increasing the sample size is found to increase the probability of topological concordance, but a limit exists on how much the topological concordance probability can be increased. Suggested sample sizes beyond which this probability can be increased only minimally are given. The results are discussed in terms of implications for molecular studies of phylogenetics and speciation.

[7] NA Rosenberg, MW Feldman (2002) The relationship between coalescence times and population divergence times. Chapter 9 in M Slatkin and M Veuille, eds. Modern Developments in Theoretical Population Genetics. Oxford: Oxford University Press, pp. 130-164. [PDF of final version]

The divergence time of two populations is the amount of time that has elapsed since the populations arose from an ancestral group, while the coalescence time of a set of copies of a gene is the amount of time that has elapsed since the most recent common ancestor of the gene copies lived. We briefly review the methods that have been used to infer divergence times and coalescence times from genetic data. We then consider the relationship between divergence times and coalescence times in a population genetic model that includes divergence followed by migration between two descendant populations, paying particular attention to the fact that migration can cause coalescence to occur more recently than divergence. Insights gained from the model and its special cases are applied to four examples: the divergences of humans and chimpanzees, modern humans and Neanderthals, Africans and non-Africans, and Native Americans and Asians. For each example, we discuss the connection between hypothesized divergence times and estimated coalescence times.

[6] NA Rosenberg, T Burke, MW Feldman, P Friedlin, MAM Groenen, J Hillel, A Mäki-Tanila, M Tixier-Boichard, A Vignal, K Wimmers, S Weigend (2001) Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159: 699-713. [PDF] [Data] [Photo]

We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8-10 highly variable markers from among the 27 markers genotyped also achieved >95% clustering success. When 12-15 highly variable markers and only 15-20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units useful in problems of conservation.

[5] MM Tanaka, NA Rosenberg (2001) Optimal estimation of transposition rates of insertion sequences for molecular epidemiology. Statistics in Medicine 20: 2409-2420. [PDF]

Outbreaks of infectious disease can be confirmed by identifying clusters of DNA fingerprints among bacterial isolates from infected individuals. This procedure makes assumptions about the underlying properties of the genetic marker used for fingerprinting. In particular, it requires that each fingerprint changes sufficiently slowly within an individual that isolates from separate individuals infected by the same strain will exhibit similar or identical fingerprints. We propose a model for the probability that an individual's fingerprint will change over a given period of time. We use this model together with published data in order to estimate the fingerprint change rate for IS6110 in human tuberculosis, obtaining a value of 0.0139 changes per copy per year. Although we focus on insertion sequences (IS), our method applies to other fingerprinting techniques such as pulsed-field gel electrophoresis (PFGE). We suggest sampling intervals that produce the least error in estimates of the fingerprint change rate, as well as sample sizes that achieve specified levels of error in the estimate.

[4] NA Rosenberg, E Woolf, JK Pritchard, T Schaap, D Gefel, I Shpirer, U Lavi, B Bonné-Tamir, J Hillel, MW Feldman (2001) Distinctive genetic signatures in the Libyan Jews. Proceedings of the National Academy of Sciences, USA 98: 858-863. [PDF] [Data]

Unlinked autosomal microsatellites in six Jewish and two non-Jewish populations were genotyped, and the relationships among these populations were explored. Based on considerations of clustering, pairwise population differentiation, and genetic distance, we found that the Libyan Jewish group retains genetic signatures distinguishable from those of the other populations, in agreement with some historical records on the relative isolation of this community. Our methods also identified evidence of some similarity between Ethiopian and Yemenite Jews, reflecting possible migration in the Red Sea region. We suggest that high-resolution statistical methods that use individual multilocus genotypes may make it practical to distinguish related populations of extremely recent common ancestry.

[3] L Jin, ML Baskett, LL Cavalli-Sforza, LA Zhivotovsky, MW Feldman, NA Rosenberg (2000) Microsatellite evolution in modern humans: a comparison of two data sets from the same populations. Annals of Human Genetics 64: 117-134. [PDF] [Data]

We genotyped 64 dinucleotide microsatellite repeats in individuals from populations that represent all inhabited continents. Microsatellite summary statistics are reported for these data, as well as for a data set that includes 28 out of 30 loci studied by Bowcock (1994) in the same individuals. For both data sets, diversity statistics such as heterozygosity, number of alleles per locus, and number of private alleles per locus produced the highest values in Africans, intermediate values in Europeans and Asians, and low values in Americans. Evolutionary trees of populations based on genetic distances separated groups from different continents. Corresponding trees were topologically similar for the two data sets, with the exception that the (δμ)^2 genetic distance reliably distinguished groups from different continents for the larger data set, but not for the smaller one. Consistent with our results from diversity statistics and from evolutionary trees, population growth statistics Sk and β, which seem particularly useful for indicating recent and ancient population size changes, confirm a model of human evolution in which human populations expand in size and through space following the departure of a small group from Africa.

[2] JK Pritchard, M Stephens, NA Rosenberg, P Donnelly (2000) Association mapping in structured populations. American Journal of Human Genetics 67: 170-181. [PDF]

The use in association studies of the forthcoming dense genome-wide collection of SNPs has been heralded as a potential breakthrough in studying the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favour of family-based tests of association, such as the TDT, but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings, and may substantially outperform it if there are conflicting associations in different subpopulations.

[1] JK Pritchard, NA Rosenberg (1999) Use of unlinked genetic markers to detect population stratification in association studies. American Journal of Human Genetics 65: 220-228. [PDF]

We examine the issue of population stratification in association mapping studies. In case-control studies of association, population subdivision or recent admixture of populations can lead to spurious associations between a phenotype and unlinked candidate loci. Using a model of sampling from a structured population, we show that if population stratification exists, it can be detected using unlinked marker loci. We show that the case-control study design using unrelated control individuals is a valid approach for association mapping, provided that marker loci unlinked to the candidate locus are included in the study in order to test for stratification. We suggest guidelines for how many unlinked marker loci should be used.
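
One simple way to test for stratification with unlinked markers can be sketched as follows: compare allele counts between cases and controls at each marker with a Pearson chi-square statistic, then sum the statistics and their degrees of freedom across markers. The abstract does not spell out the test statistic, so this choice, the function names, and the input layout are assumptions made for this example.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square for a 2 x k table of allele counts at one marker.

    `case_counts` and `control_counts` map allele label -> count
    (hypothetical input format). Both groups are assumed non-empty.
    Returns (statistic, degrees_of_freedom).
    """
    alleles = [
        a
        for a in set(case_counts) | set(control_counts)
        if case_counts.get(a, 0) + control_counts.get(a, 0) > 0
    ]
    n_case = sum(case_counts.get(a, 0) for a in alleles)
    n_control = sum(control_counts.get(a, 0) for a in alleles)
    total = n_case + n_control

    statistic = 0.0
    for a in alleles:
        column_total = case_counts.get(a, 0) + control_counts.get(a, 0)
        for observed, row_total in (
            (case_counts.get(a, 0), n_case),
            (control_counts.get(a, 0), n_control),
        ):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(alleles) - 1


def stratification_statistic(markers):
    """Sum per-marker chi-square statistics and degrees of freedom.

    `markers` is a list of (case_counts, control_counts) pairs, one per
    unlinked marker. Under no stratification, the summed statistic is
    approximately chi-square distributed with the summed degrees of freedom.
    """
    results = [allele_chi_square(case, control) for case, control in markers]
    return sum(s for s, _ in results), sum(df for _, df in results)


if __name__ == "__main__":
    # Toy allele counts, invented for illustration only.
    markers = [
        ({"A": 50, "a": 50}, {"A": 50, "a": 50}),
        ({"B": 60, "b": 40}, {"B": 40, "b": 60}),
    ]
    print(stratification_statistic(markers))
```

A large summed statistic relative to its degrees of freedom suggests that cases and controls differ in ancestry.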