High-resolution mtDNA evidence for the late-glacial resettlement of Europe from an Iberian refugium
The advent of complete mitochondrial DNA (mtDNA) sequence data has ushered in a new phase of human evolutionary studies. Even quite limited volumes of complete mtDNA sequence data can now be used to identify the critical polymorphisms that define sub-clades within an mtDNA haplogroup, providing a springboard for large-scale high-resolution screening of human mtDNAs. This strategy has in the past been applied to mtDNA haplogroup V, which represents <5% of European mtDNAs. Here we adopted a similar approach to haplogroup H, by far the most common European haplogroup, which at lower resolution displayed a rather uninformative frequency distribution within Europe. Using polymorphism information derived from the growing complete mtDNA sequence database, we sequenced 1580 base pairs of targeted coding-region segments of the mtDNA genome in 649 individuals harboring mtDNA haplogroup H from populations throughout Europe, the Caucasus, and the Near East. The enhanced genealogical resolution clearly shows that sub-clades of haplogroup H have highly distinctive geographical distributions. The patterns of frequency and diversity suggest that haplogroup H entered Europe from the Near East 20,000–25,000 years ago, around the time of the Last Glacial Maximum (LGM), and some sub-clades re-expanded from an Iberian refugium when the glaciers retreated 15,000 years ago. This shows that a large fraction of the maternal ancestry of modern Europeans traces back to the expansion of hunter-gatherer populations at the end of the last Ice Age.
Haplogroup H accounts for 40%–50% of the mtDNA pool in most of Europe, and 20%–30% in the Near East and the Caucasus region (Richards et al. 2000; Achilli et al. 2004; Loogväli et al. 2004). It is thought to have evolved in the vicinity of the Near East 23,000–28,000 years ago, and to have spread into Europe 20,000 years ago. Founder analysis, based on control-region HVS-I sequences and limited restriction typing of coding-region markers in about 3000 samples, suggested that, in a manner similar to haplogroup V, some or all of European haplogroup H may then have re-expanded from a European glacial refuge 15,000 years ago (Torroni et al. 1998, 2001; Richards et al. 2000). However, the phylogenetic resolution of haplogroup H mitochondrial DNAs (mtDNAs) identified by control-region sequencing is poor, and the haplogroup as a whole does not display the marked frequency gradient present in haplogroup V. This meant that the use of haplogroup H as a whole to locate the likely refuge(s) was not possible.
We therefore decided upon a screening strategy based on complete mtDNA sequence information. Recent complete sequence data have indicated a number of highly informative coding-region polymorphisms that resolve distinct sub-clades of haplogroup H. Finnilä et al. (2001) defined two sub-haplogroups, H1 and H2, based, respectively, on the polymorphisms G3010A and A4769G (along with A1438G) in a sample of Finns. Herrnstadt et al. (2002), studying a large sample of haplogroup H individuals from the UK and US, described two further sub-groups, H3 and H4. H3 was the next most common sub-haplogroup after H1, characterized by the polymorphism T6776C. H4 was rarer, with eight polymorphisms (including C3992T and A4024G) separating it from the root of the haplogroup. Quintáns et al. (2004) further identified H5, defined by T4336C, H6, defined by G3915A (and a further sub-branch by A4727G), and H7, defined by A4793G, and additional sub-clades were recently proposed by Loogväli et al. (2004) and Achilli et al. (2004).
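The marker definitions above lend themselves to a simple lookup. The following is a minimal sketch, not the assignment procedure used in the study: the names `DIAGNOSTIC_MARKERS` and `assign_subclade` are hypothetical, H4 is represented by C3992T alone (an assumption, since the text lists eight polymorphisms for it), and recurrent mutation is handled only by flagging conflicting markers.

```python
# Diagnostic coding-region polymorphisms as listed in the text.
# Using a single marker per sub-clade is a simplifying assumption.
DIAGNOSTIC_MARKERS = {
    "G3010A": "H1",
    "A4769G": "H2",
    "T6776C": "H3",
    "C3992T": "H4",
    "T4336C": "H5",
    "G3915A": "H6",
    "A4793G": "H7",
}


def assign_subclade(variants):
    """Assign a haplogroup H sample to a sub-clade from its variant set.

    Returns "unassigned" when no diagnostic marker is present and
    "ambiguous" when markers of more than one sub-clade are observed.
    """
    hits = sorted({DIAGNOSTIC_MARKERS[v] for v in variants
                   if v in DIAGNOSTIC_MARKERS})
    if not hits:
        return "unassigned"
    if len(hits) == 1:
        return hits[0]
    return "ambiguous"


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    print(assign_subclade({"G3010A"}))
    print(assign_subclade({"T6776C", "G3010A"}))
    print(assign_subclade(set()))
```

A real classification would place each sequence on the full phylogeny, because a position such as 3010 can mutate more than once.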
Distribution of the major sub-clades within haplogroup H
We sequenced 1580 base pairs of coding region in 649 samples belonging to haplogroup H from 20 populations from Europe, the Caucasus, and the Near East (Table 1) and combined them with published data (Finnilä et al. 2001; Herrnstadt et al. 2002). A phylogenetic network for the variation scored in the 894 haplogroup H mtDNAs is shown in Figure 1. In addition to the seven major clades defined previously, we identified a further minor sub-haplogroup defined by A4745G (recently labeled as H13 by Achilli et al. 2004). The two most frequent sub-haplogroups, H1 and H3, each show a rather star-like phylogeny. We refer to the paraphyletic collection of H mtDNAs outside these eight main sub-clades as H*.
Table 1. Distribution of H sub-haplogroups
Three sequences from Ingman et al. (2000), included in some analyses, are excluded here.
Figure 1. Reduced-median network (Bandelt et al. 1995) for coding-region polymorphisms found in 894 samples belonging to haplogroup H, in the four segments described in the text. Mutations that define the H sub-haplogroups are shown in bold, and sub-clades are labeled. The area of each circle is proportional to the number of mtDNAs in the total sample harboring the corresponding haplotype. Geographic regions are as follows: AJ, Ashkenazi Jews; SW, Iberia; NW, France, Ireland, UK/US sample, Norway; Med, mainland Italy, Sardinia, Crete; NE, Finland, Russia, Chuvash; Ca, North Caucasus; CSE, Poland, Czech Republic, Romania, Bulgaria; NRE, Near East. Further details of samples are given in Table 1.
The frequencies of haplogroup H as a whole, and its sub-haplogroups, are reported in Table 1 for the 22 populations analyzed here, with age estimates in Table 2. The majority of the European populations have an overall haplogroup H frequency of 40%–50%. Frequencies decrease in the southeast of the continent, reaching 20% in the Near East and Caucasus, and <10% in the Gulf (Fig. 2A). Thus, haplogroup H as a whole displays a broadly southeast-northwest frequency pattern, reminiscent of the first principal component of classical marker frequencies (Cavalli-Sforza et al. 1994). However, genealogical dissection into sub-clades reveals a quite different sub-structure, showing that this overall pattern is something of a chimera.
Table 2. Ages for sub-clades of haplogroup H based on coding-region (1580 bp) and HVS-I (nps 16090–16365) variation (excluding UK/US data)
|Age in years (SE)
||Age in years (SE)
|Haplogroup (sample size)
||Haplogroup (sample size)
|(n = 247)||(4,000)||(3,500)||(n = 15)||(5,000)||(6,000)|
|Western Europe||14,000||16,000||(n = 32)||(4,000)||(2,000)|
|(n = 195)||(5,000)||(3,500)|
|(n = 148)||(6,000)||(3,000)||(n = 27)||(43,000)||(5,000)|
|Eastern Europe||(6,000)||(5,000)||(n = 19)||8,000||16,000|
|(n = 52)||(4,000)||(5,000)|
|(n = 39)||(8,000)||(9,000)||(n = 12)||(6,000)||(6,000)|
|H3||9,000||11,000||All H less H1 and H3||33,000||23,000|
|(n = 75)
||(n = 360)
Figure 2. Frequency distributions of haplogroups H (A), H1 (B), H3 (C), and H less H1 and H3 (D) in Europe, the Caucasus and the Near East.
The distribution of H1, the largest sub-clade, displays two peaks, one in Iberia and another in Scandinavia (Fig. 2B). However, the Norwegian sample size is low (n = 18) and haplogroup H is overrepresented (70%, while larger data sets for Norway point to a frequency of 50%: Richards et al. 2000). When we removed the Norwegian sample, the Scandinavian peak disappeared, and the picture showed only the decreasing frequency of sub-haplogroup H1 from the southwest to the north and east. H1 is almost exclusively European, with its only incursion into the Near East being a few Palestinian individuals bearing the most common haplotype. This absence of derived lineages in the Near East sample suggests that the H1 sub-clade had its origin in Europe. H1 has an age of 14,000 years (SE 4000) using coding-region data and 16,000 years (SE 3500) using HVS-I. No significant difference between its diversity in western and eastern Europe was manifest.
The distribution of the second most frequent sub-clade, H3 (Fig. 2C), shows a very similar pattern, again suggesting a European origin. The frequency difference between west and east is highly significant (χ² = 28.2; P < 0.000001), as it is also for H1 (χ² = 137.1; P < 0.000001). H3 is exclusively European, with no Near Eastern representatives, and is 9000 years old (SE 3000) based on the coding-region data and 11,000 years old (SE 3000) using HVS-I.
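A west-versus-east frequency comparison of this kind can be framed as a test on a 2 × 2 table of carriers and non-carriers. The sketch below assumes a Pearson test without continuity correction and one degree of freedom; the text does not state which variant was used, the function name `chi_square_2x2` is hypothetical, and the counts in the usage lines are invented for illustration and are not the study's data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Rows could be regions (west, east) and columns carrier / non-carrier.
    Returns (statistic, p_value) for 1 degree of freedom, with no
    continuity correction.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 df, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30 of 200 carriers in the west, 5 of 200 in the east.
    stat, p = chi_square_2x2(30, 170, 5, 195)
    print(f"chi-square = {stat:.2f}, P = {p:.2g}")
```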
Minor sub-clades within haplogroup H
The remaining sub-clades occur at low frequency, and it is difficult to detect any geographical patterns (Fig. 2D). Within haplogroup H HVS-I lineages, the most frequent sub-clade (4% of Europeans) is defined by T16304C (Richards et al. 2000). This sub-clade encompasses all of H5 and a fraction of H* lineages, indicating that the T16304C transition may have happened only once within haplogroup H (although see Loogväli et al. 2004) and occurred before the H5-defining transition at np 4336. Thus sub-haplogroup H5 can be broadened to include the 16304 transition, as suggested by Loogväli et al. (2004), within which T4336C defines a further sub-clade, H5a. The frequency of H5a appears to be highest on the central European plain (Table 1), and dates to 7000–8000 years (Table 2). It is fairly evenly distributed at low levels across Europe but is absent from the Caucasus and the Near East, again suggesting a European origin. In contrast, the H5 clade is present at low levels (1%–3%) throughout the Near East and may have evolved there, spreading later into Europe. Its age based on HVS-I variation is 11,500 (SE 2700) years, and its ancestor was identified as a putative late-glacial founder type by Richards et al. (2000). However, the HVS-I database indicates that it is common (>4%) not only in Iberia but also in central, eastern, and southeast Europe, and rather less frequent in northwest Europe.
In contrast, H2 and H6 are both common in eastern Europe and the Caucasus, although there are hints that they may have dispersed from western Europe. In particular, the basal type of H6 is exclusively European, and there is a single derived type that is common in eastern Europe and the Caucasus. Neither H2 nor H6 is found in our Near Eastern sample. The infrequent sub-clades H4, H7, and H13 occur in both Europe and the Near East, and the latter is also present in the Caucasus.
Origins of haplogroup H
The paraphyletic ancestral cluster, H*, is the main Near Eastern representative of haplogroup H, in agreement with the suggestion that the haplogroup evolved in the Near East and spread subsequently into Europe. Its distribution is to some extent the inversion of the distributions for H1 and H3: It is most frequent in east-central Europe and the Balkans, but is also well represented on the western fringes of Europe, including Iberia and Ireland. The age of H* is best estimated as the age of the haplogroup as a whole, which comes to 29,900 (SE 7700) years using the present coding-region data set (excluding the 3010 variant which renders the tree very non-star-like). Using the complete coding-region sequence data of Finnilä et al. (2001) and Herrnstadt et al. (2002), the age estimate of H (including 3010) is rather less, at 17,600 (SE 2200) years. This may be because no Near Eastern lineages are included, or it may simply reflect the high uncertainty of the estimate from our coding-region segments.
It seems likely, on the basis of this evidence, that haplogroup H entered Europe not much more than 20,000–25,000 years ago, and dispersed rapidly to the southwest of the continent. Although this was at the peak of the last Ice Age, a passage into Europe at this time is not implausible from an archaeological perspective, since there is evidence for extensive contacts between people of the Badegoulian culture of east-central Europe and those of southwest Europe. Indeed, it now seems likely that the west European Magdalenian culture had its roots in the Badegoulian, and not in the local Solutrean of the western glacial refugium. It is the Magdalenian culture that is seen to expand dramatically from the Iberian refugium from 15,000 years ago in the radiocarbon record for western Europe, although Europe was probably never completely depopulated during the LGM (Housley et al. 1997; Terberger and Street 2002; Gamble et al. 2004).
Haplogroup V was identified, on the basis of control-region sequences, as a likely marker of a human dispersal in Late Pleistocene Europe (Torroni et al. 1998). Higher phylogenetic resolution of the lineages concerned clarified the geographic pattern by distinguishing the more derived haplogroup V from its ancestor, pre-V, which could now be seen to display a quite distinct phylogeographic pattern (Torroni et al. 2001). Haplogroup pre-V appeared to have entered Europe from the east sometime around 20,000–25,000 years ago, at the time of the LGM. However, the diversity and frequency of the derived haplogroup V suggested that it had evolved from pre-V in western Europe, with its age suggesting an expansion from a glacial refuge in Iberia 15,000 years ago, accompanying the Magdalenian expansion.
It is clear that the phylogeographic patterns displayed by sub-haplogroups H1 and H3 both closely resemble that of haplogroup V. The star-like phylogenies, geographic distribution, and estimated ages of all three clades suggest that they all took part in a major expansion from southwest to northeast Europe 12,000–14,000 years ago. Between them H1 and H3 amount to around half of the haplogroup H samples in our coding-region database. They comprise 65% of haplogroup H lineages in Iberia, 46% in the northwest, 27% in central and eastern Europe, and 5%–15% in the Near East/Caucasus, falling to zero in the Gulf. It is notable that the diversity does not fall within H1 moving from west to east, unlike the situation with haplogroup V (Torroni et al. 2001), but a rapid expansion within the time-frame of the Magdalenian would in fact not be expected to result in a west-east diversity gradient. The cline seen in haplogroup V diversities most likely has its explanation in more recent founder events in the east.
The remaining haplogroup H lineages present a more complex pattern. The explanation must include the evolution of haplogroup H from its ancestor haplogroup HV, probably in the vicinity of the Near East (Richards et al. 2000; Loogväli et al. 2004), and subsequent founder events in Europe, seen in H*. Minor sub-clades found in both Europe and the Near East (H4, H7, and H13) may also have entered Europe around the LGM, and/or during later dispersals from the Near East, such as the Neolithic. H* must have given rise to H1 and H3 in the western refuge (analogous to ancestral lineages within haplogroup pre-V giving rise to haplogroup V; Torroni et al. 2001), and itself appears very likely to have been partly redistributed alongside them by the late-glacial re-expansion, since an Atlantic European cluster clearly forms part of the H* phylogeny. Several other minor sub-clades (H2, H5a, H6) also seem likely to have taken part in this process, and may also have evolved in western Europe: More data will be needed to trace their phylogeographic patterns more closely. Interestingly, however, the frequency profile of H5a suggests that, if indeed it has largely been distributed by late-glacial dispersals, this sub-haplogroup may trace a distinct dispersal route into central and eastern Europe. In contrast, H1 and H3 appear at least in part to have spread northwards fairly close to the Atlantic coastline, into the British Isles.
The mtDNA evidence therefore correlates well with Y-chromosome evidence for late-glacial expansions from a south-west European refugium (Semino et al. 2000; Rootsi et al. 2004). It indicates that the major demographic signal in the modern European mtDNA pool is the result of the expansion of hunter-gatherer populations at the end of the Palaeolithic, although this has not entirely erased the traces of earlier processes.
Samples and sequencing
We dissected haplogroup H variation in 649 samples from 20 populations from Europe, the Caucasus, and the Near East (see Table 1) previously analyzed only for HVS-I sequence variation and some haplogroup-diagnostic RFLPs. We sequenced four mtDNA coding-region segments encompassing the principal diagnostic positions in haplogroup H samples: 3001–3360, 3661–4050, 4281–4820, and 6761–7050 (a total of 1580 base pairs; numbering as in Andrews et al. 1999). Primers used were, respectively: L2978, 5′-GTCCATATCAACAATAGGGT-3′, and H3361, 5′-CGTTCGGTAAGCATTAGGAA-3′; L3640, 5′-TCTAGCCACCTCTAGCCTAG-3′, and H4051, 5′-TAGAGTTCAGGGGAGAGTGC-3′; L4264, 5′-CATTCCCCCTCAAACCTAAG-3′, and H4821, 5′-AGAGGGGTGCCTTGGGTAAC-3′; L6740, 5′-TGGTCTGAGCTATGATATCA-3′, and H7051, 5′-GATGGCAAATACAGCTCCTA-3′. The PCR temperature profile for the third primer pair was 95°C for 10 sec, 64°C for 30 sec, and 72°C for 30 sec, for 35 cycles; the other pairs used the same profile with an annealing temperature of 58°C. We carried out automated sequencing on an ABI 3100, using the BigDye Terminator Cycle Sequencing Ready Reaction Kit (Applied Biosystems).
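The segment arithmetic can be checked directly. The sketch below treats the four boundaries as inclusive (an assumption that reproduces the stated 1580 base pairs) and reports which segment covers a given nucleotide position; the helper names are hypothetical.

```python
# Inclusive segment boundaries as given in the text.
SEGMENTS = [(3001, 3360), (3661, 4050), (4281, 4820), (6761, 7050)]


def segment_length(start, end):
    """Number of positions in an inclusive interval."""
    return end - start + 1


def total_length(segments=SEGMENTS):
    """Total number of sequenced positions across all segments."""
    return sum(segment_length(s, e) for s, e in segments)


def covering_segment(position, segments=SEGMENTS):
    """Return the (start, end) segment containing position, or None."""
    for start, end in segments:
        if start <= position <= end:
            return (start, end)
    return None


if __name__ == "__main__":
    print(total_length())
    # Diagnostic positions named in the text for the H sub-clades.
    for pos in (3010, 3915, 3992, 4336, 4769, 4793, 6776, 1438):
        print(pos, covering_segment(pos))
```

On these boundaries, position 1438, mentioned alongside A4769G for H2, falls outside the sequenced segments, so H2 would be scored here through position 4769 alone.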
Including 31 complete sequences from Finland (Finnilä et al. 2001), as well as (for the phylogenetic analyses) the US/UK coding-region data of Herrnstadt et al. (2002), we analyzed a total of 894 haplogroup H mtDNAs for these coding-region segments. The three haplogroup H sequences from Ingman et al. (2000) were also included in some analyses. The inclusion of both the control-region and 1580-base pair coding-region segments in the majority of individuals in our database allowed us to estimate clade ages using the ρ statistic (Saillard et al. 2000) in two ways, using a calibration of 1 transition per 20,180 years for HVS-I (Forster et al. 1996) and 1 substitution per 50,200 years (µ = 1.26 × 10⁻⁸ substitutions per year per base: Mishmar et al. 2003) for the coding region. We constructed reduced-median networks (Bandelt et al. 1995) separately for the coding-region segments and HVS-I (between positions 16090–16365) and estimated ages of clades using the program “Network” (Shareware Phylogenetic Network Software, version 4.0).
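The two calibrations can be illustrated with a short calculation. The sketch below first derives the coding-region rate from the per-base rate and the 1580 sequenced positions, then converts a ρ value into an age. The variance formula follows the usual description of the Saillard et al. (2000) estimator (sum of squared descendant counts per mutation, divided by n²) and should be read as an assumption rather than as the exact routine in the Network program. The function names and the toy tree are hypothetical.

```python
import math

HVS1_YEARS_PER_TRANSITION = 20_180   # calibration quoted in the text
CODING_MU_PER_BASE_YEAR = 1.26e-8    # rate quoted in the text
CODING_SITES = 1580                  # base pairs sequenced per sample


def years_per_substitution(mu_per_base_year, n_sites):
    """Expected waiting time for one substitution across n_sites."""
    return 1.0 / (mu_per_base_year * n_sites)


def rho_and_sigma(branches, n):
    """Estimate rho and its standard error from a rooted tree.

    branches: iterable of (n_descendants, n_mutations), one entry per
              link below the root.
    n:        total number of sampled sequences in the clade.
    """
    branches = list(branches)
    rho = sum(k * m for k, m in branches) / n
    sigma = math.sqrt(sum(k * k * m for k, m in branches)) / n
    return rho, sigma


def age_estimate(rho, sigma, years_per_mutation):
    """Convert rho and its standard error into years."""
    return rho * years_per_mutation, sigma * years_per_mutation


if __name__ == "__main__":
    rate = years_per_substitution(CODING_MU_PER_BASE_YEAR, CODING_SITES)
    print(f"coding region: one substitution per {rate:,.0f} years")
    # Toy tree of 10 samples, invented for illustration.
    rho, sigma = rho_and_sigma([(3, 1), (2, 1), (1, 2)], n=10)
    age, se = age_estimate(rho, sigma, rate)
    print(f"rho = {rho:.2f}, age = {age:,.0f} years (SE {se:,.0f})")
```

The derived coding-region interval rounds to the 50,200 years quoted above. For a perfectly star-like clade the standard error reduces to the square root of ρ/n, which is one reason star-like clades such as H1 and H3 yield comparatively tight estimates.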