The African continent is regarded as the cradle of modern humans, and African genomes contain more genetic variation than those from any other continent, yet only a fraction of the genetic diversity among African individuals has been surveyed (ref. 1). Here we performed whole-genome sequencing analyses of 426 individuals, comprising 50 ethnolinguistic groups and including previously unsampled populations, to explore the breadth of genomic diversity across Africa. We uncovered more than 3 million previously undescribed variants, most of which were found among individuals from newly sampled ethnolinguistic groups, as well as 62 previously unreported loci that are under strong selection, which were predominantly found in genes that are involved in viral immunity, DNA repair and metabolism. We observed complex patterns of ancestral admixture and of putatively damaging and novel variation, both within and between populations, alongside evidence that Zambia was a likely intermediate site along the routes of expansion of Bantu-speaking populations. Pathogenic variants in genes that are currently characterized as medically relevant were uncommon, but in other genes, variants denoted as ‘likely pathogenic’ in the ClinVar database were commonly observed. Collectively, these findings refine our current understanding of continental migration, identify gene flow and the response to human disease as strong drivers of genome-level population variation, and underscore the scientific imperative for a broader characterization of the genomic diversity of African individuals to understand human ancestry and improve health.
WGS data used in this paper are available through the European Genome-phenome Archive (EGA) under study accession number: EGAS00001002976. The data include genomic (BAMs and VCFs) and minimal phenotypic data from appropriately consented individuals. In compliance with current international standards to protect participant confidentiality, the H3Africa-generated data are available to bona fide researchers within the wider scientific community through a controlled access process. Some of the DNA samples are archived in H3Africa biorepositories as part of the H3Africa Consortium agreement. To gain access to data in the EGA or biospecimens in the biorepositories, requests must be submitted to email@example.com, or requested through the H3Africa Data and Biospecimen Catalogue (https://catalogue.h3africa.org). Requests are subject to approval by an independent H3Africa Data and Biospecimen Access Committee (DBAC). Novel SNVs identified and reported here will be deposited into dbSNP. The H3Africa Initiative is committed to providing research data generated by the H3Africa research projects to the entire research community. H3Africa research seeks to promote fair collaboration between scientists in Africa and those from elsewhere. The H3Africa Consortium Data Sharing, Access and Release Policy outlines a policy framework that places a firm focus on African leadership and capacity building as guiding principles for African genomics research. The policy and related documents are available here: https://h3africa.org/index.php/consortium/consortium-documents/.
Code for the implementation of PROCRUSTES is available at https://github.com/dshriner/Procrustes, licensed under the GNU General Public License v.3.0.
Nielsen, R. et al. Tracing the peopling of the world through genomics. Nature 541, 302–310 (2017).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Tishkoff, S. A. et al. The genetic structure and history of Africans and African Americans. Science 324, 1035–1044 (2009).
Gurdasani, D. et al. The African Genome Variation Project shapes medical genetics in Africa. Nature 517, 327–332 (2015).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Posey, J. E. et al. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet. Med. 21, 798–812 (2019).
Landry, L. G., Ali, N., Williams, D. R., Rehm, H. L. & Bonham, V. L. Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Aff. 37, 780–785 (2018).
H3Africa Consortium. Enabling the genomic revolution in Africa. Science 344, 1345–1346 (2014).
Patin, E. et al. Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America. Science 356, 543–546 (2017).
Hanchard, N. et al. Classical sickle beta-globin haplotypes exhibit a high degree of long-range haplotype similarity in African and Afro-Caribbean populations. BMC Genet. 8, 52 (2007).
Ranciaro, A. et al. Genetic origins of lactase persistence and the spread of pastoralism in Africa. Am. J. Hum. Genet. 94, 496–510 (2014).
Genovese, G. et al. Association of trypanolytic ApoL1 variants with kidney disease in African Americans. Science 329, 841–845 (2010).
Sabeti, P. C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837 (2002).
Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012).
Scheinfeldt, L. B. et al. Genomic evidence for shared common ancestry of East African hunting-gathering populations and insights into local adaptation. Proc. Natl Acad. Sci. USA 116, 4166–4175 (2019).
Skoglund, P. et al. Reconstructing prehistoric African population structure. Cell 171, 59–71 (2017).
Choudhury, A. et al. Whole-genome sequencing for an enhanced understanding of genetic variation among South Africans. Nat. Commun. 8, 2062 (2017).
Ilboudo, H. et al. Introducing the TrypanoGEN biobank: a valuable resource for the elimination of human African trypanosomiasis. PLoS Negl. Trop. Dis. 11, e0005438 (2017).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012).
Semo, A. et al. Along the Indian Ocean coast: genomic variation in Mozambique provides new insights into the Bantu expansion. Mol. Biol. Evol. 37, 406–416 (2020).
Loh, P.-R. et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics 193, 1233–1254 (2013).
Patin, E. et al. The impact of agricultural emergence on the genetic history of African rainforest hunter-gatherers and agriculturalists. Nat. Commun. 5, 3163 (2014).
Shriner, D. & Rotimi, C. N. Genetic history of Chad. Am. J. Phys. Anthropol. 167, 804–812 (2018).
Campbell, I. M. et al. Multiallelic positions in the human genome: challenges for genetic analyses. Hum. Mutat. 37, 231–234 (2016).
Campbell, M. C. & Tishkoff, S. A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9, 403–433 (2008).
Pavlidis, P., Živkovic, D., Stamatakis, A. & Alachiotis, N. SweeD: likelihood-based detection of selective sweeps in thousands of genomes. Mol. Biol. Evol. 30, 2224–2234 (2013).
Szpiech, Z. A. & Hernandez, R. D. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol. Biol. Evol. 31, 2824–2827 (2014).
Vitti, J. J., Grossman, S. R. & Sabeti, P. C. Detecting natural selection in genomic data. Annu. Rev. Genet. 47, 97–120 (2013).
Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78 (2010).
Retshabile, G. et al. Whole-exome sequencing reveals uncaptured variation and distinct ancestry in the southern African population of Botswana. Am. J. Hum. Genet. 102, 731–743 (2018).
Lim, E. T. et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 10, e1004494 (2014).
World Health Organization. WHO Influenza (Seasonal): Fact Sheet https://www.who.int/news-room/fact-sheets/detail/influenza-(seasonal) (2016).
Kalia, S. S. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet. Med. 19, 249–255 (2017).
Manjurano, A. et al. African glucose-6-phosphate dehydrogenase alleles associated with protection from severe malaria in heterozygous females in Tanzania. PLoS Genet. 11, e1004960 (2015).
Howes, R. E., Battle, K. E., Satyagraha, A. W., Baird, J. K. & Hay, S. I. G6PD deficiency: global distribution, genetic variants and primaquine therapy. Adv. Parasitol. 81, 133–201 (2013).
Kimuda, M. P. et al. No evidence for association between APOL1 kidney disease risk alleles and human African trypanosomiasis in two Ugandan populations. PLoS Negl. Trop. Dis. 12, e0006300 (2018).
Rotimi, C. N. & Jorde, L. B. Ancestry and disease in the age of genomic medicine. N. Engl. J. Med. 363, 1551–1558 (2010).
Phillipson, D. W. Iron Age history and archaeology in Zambia. J. Afr. Hist. 15, 1–25 (1974).
Schlebusch, C. M. & Jakobsson, M. Tales of human migration, admixture, and selection in Africa. Annu. Rev. Genomics Hum. Genet. 19, 405–428 (2018).
Mulindwa, J. et al. High levels of genetic diversity within Nilo-Saharan populations: implications for human adaptation. Am. J. Hum. Genet. 107, 473–486 (2020).
Shiroya, O. J. E. The Lugbara states — politics, economics and warfare in the eighteenth and nineteenth centuries. TransAfrican J. Hist. 10, 125–183 (1981).
R Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org/ (R Foundation for Statistical Computing, 2017).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
O’Connell, J. et al. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 10, e1004234 (2014).
Loh, P. R., Palamara, P. F. & Price, A. L. Fast and accurate long-range phasing in a UK Biobank cohort. Nat. Genet. 48, 811–816 (2016).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Buchmann, R. & Hazelhurst, S. Genesis PCA and Admixture Plot Viewer. Version 0.2.6 http://www.bioinf.wits.ac.za/software/genesis (2014).
Jakobsson, M. & Rosenberg, N. A. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23, 1801–1806 (2007).
Wang, C. et al. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Stat. Appl. Genet. Mol. Biol. 9, 13 (2010).
Pickrell, J. K. & Pritchard, J. K. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012).
Pickrell, J. K. et al. Ancient west Eurasian ancestry in southern and eastern Africa. Proc. Natl Acad. Sci. USA 111, 2632–2637 (2014).
Browning, B. L. & Browning, S. R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471 (2013).
Atzmon, G. et al. Abraham’s children in the genome era: major Jewish diaspora populations comprise distinct genetic clusters with shared Middle Eastern ancestry. Am. J. Hum. Genet. 86, 850–859 (2010).
Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
Haber, M. et al. Chad genetic diversity reveals an African history marked by multiple Holocene Eurasian migrations. Am. J. Hum. Genet. 99, 1316–1324 (2016).
Weissensteiner, H. et al. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res. 44, W58–W63 (2016).
Van Geystelen, A., Decorte, R. & Larmuseau, M. H. D. AMY-tree: an algorithm to use whole genome SNP calling for Y chromosomal phylogenetic applications. BMC Genomics 14, 101 (2013).
Pemberton, T. J. et al. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91, 275–292 (2012).
Fumagalli, M. Assessing the effect of sequencing depth and sample size in population genetics inferences. PLoS ONE 8, e79667 (2013).
Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).
Amberger, J. S., Bocchini, C. A., Scott, A. F. & Hamosh, A. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019).
Stelzer, G. et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr. Protoc. Bioinformatics 54, 1.30.31–1.30.33 (2016).
Pybus, M. et al. 1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans. Nucleic Acids Res. 42, D903–D909 (2014).
Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
Pickrell, J. K. et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19, 826–837 (2009).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Cingolani, P. et al. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front. Genet. 3, 35 (2012).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Mazandu, G. K., Chimusa, E. R., Mbiyavanga, M. & Mulder, N. J. A-DaGO-Fun: an adaptable Gene Ontology semantic similarity-based functional analysis tool. Bioinformatics 32, 477–479 (2016).
Bindea, G. et al. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25, 1091–1093 (2009).
Balasubramanian, S. et al. Using ALoFT to determine the impact of putative loss-of-function variants in protein-coding genes. Nat. Commun. 8, 382 (2017).
Smedley, D. et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589–W598 (2015).
Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
Babbi, G. et al. eDGAR: a database of disease–gene associations with annotated relationships among genes. BMC Genomics 18, 554 (2017).
Davis, A. P. et al. The Comparative Toxicogenomics Database: update 2019. Nucleic Acids Res. 47, D948–D954 (2019).
ACMG Board of Directors. ACMG policy statement: updated recommendations regarding analysis and reporting of secondary findings in clinical genome-scale sequencing. Genet. Med. 17, 68–69 (2015).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
We thank the members of the wider H3Africa Consortium (www.h3africa.org) for their support and input, particularly J. Troyer and A. Duncanson; S. Tishkoff, J. Lupski, J. Belmont and C. Tyler-Smith for comments and feedback on the manuscript; K. Garson, A. Gillum and K. Schulze for their help with figure visualizations and for giving permission for the use of these figures; M. Cherif Rahimy for their assistance with recruitment in Benin; and L. Sergeevna Mainzer, G. Rendon and V. Jongeneel from the HPCBio team at the University of Illinois Urbana-Champaign for the initial processing and variant calling of the high depth H3A-Baylor dataset using the Blue Waters supercomputing centre. WGS in H3Africa cohorts was supported by a grant from the National Human Genome Research Institute, National Institutes of Health (NIH/NHGRI) U54HG003273. The African Collaborative Center for Microbiome and Genomics Research (ACCME) is funded by NIH/NHGRI grant U54HG006947. The AWI-Gen Collaborative Centre is funded by NIH grant U54HG006938. The Exploring Perspectives on Genomics and Sickle Cell Public Health Interventions was funded by NHGRI/NIH grant U01HG007459. The Clinical and Genetic Studies of Hereditary Neurological Disorders in Mali study was funded by the NHGRI/NIH grant U01HG007044. The Collaborative African Genomics Network (CAfGEN) is funded by the National Institute of Allergy and Infectious Diseases (NIAID) of NIH and the NHGRI of the NIH (U54AI110398). ‘TrypanoGEN: an integrated approach to the identification of genetic determinants of susceptibility to trypanosomiasis’, was funded by the Wellcome Trust (099310/Z/12/Z). L.R.B. was supported by the CERCA Programme/Generalitat de Catalunya and by the Spanish Ministry of Economy and Competitiveness, through the ‘Severo Ochoa Programme for Centres of Excellence in R&D’ 2016–2019 (SEV-2015-0533). N.M. (principal investigator), S.A., G.B., G.W., J.K., Y.J.F., T.O., O.F., E.A., S.H., G. Mazandu, M. Mbiyavanga, A.B., S.K.K., E.R.C. and A. Moussa are funded by the NIH H3ABioNet grant under award number U24HG006941. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the African Academy of Sciences, the National Institutes of Health or the Wellcome Trust.
The authors declare no competing interests.
Peer review information Nature thanks Laura Gauthier, Joanna Mountain and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Existing African datasets from AGVP (ref. 4), the 1000 Genomes Project (ref. 2), SAHGP (ref. 17) and previously published studies (refs. 9, 14) and a representative European population (CEU) from the 1000 Genomes Project are included as reference panels. K values from 2 to 10 are shown. See Supplementary Table 22 for definitions of abbreviations.
a, CLR score distributions in known selected genes (significant population-specific outlier scores (that is, with P < 0.01) for the window overlapping the gene are indicated by an asterisk). b, Summary of PBS comparisons. Genes with longer branch lengths in WGR compared to BOT and CAM are circled in blue; longer branch lengths in BOT and CAM in comparison to the other two populations are encircled in brown and dark green, respectively. c, Overlap between the proportion of KS ancestry (%) and CLR score across chromosome 6 in BOT.
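The PBS comparison summarized in panel b can be illustrated with a minimal sketch. It assumes the commonly used definition of the population branch statistic, in which each pairwise FST is transformed to T = −log(1 − FST) and the focal branch length is half of the two focal distances minus the distance between the other two populations. The function names and the FST values below are hypothetical and are not taken from the paper; only the population labels (WGR, BOT, CAM) come from the legend.

```python
import math


def fst_to_branch_time(fst: float) -> float:
    """Transform a pairwise FST into a divergence time T = -log(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def pbs(fst_focal_a: float, fst_focal_b: float, fst_a_b: float) -> float:
    """Branch length of the focal population in a three-population tree.

    fst_focal_a, fst_focal_b: FST between the focal population and each
    comparison population; fst_a_b: FST between the two comparison populations.
    """
    t_focal_a = fst_to_branch_time(fst_focal_a)
    t_focal_b = fst_to_branch_time(fst_focal_b)
    t_a_b = fst_to_branch_time(fst_a_b)
    return (t_focal_a + t_focal_b - t_a_b) / 2.0


# Hypothetical per-gene FST values (illustrative only, not from the paper).
fst = {("WGR", "BOT"): 0.30, ("WGR", "CAM"): 0.25, ("BOT", "CAM"): 0.05}

# Branch length for WGR as the focal population.
pbs_wgr = pbs(fst[("WGR", "BOT")], fst[("WGR", "CAM")], fst[("BOT", "CAM")])
# Branch length for BOT as the focal population, using the same three FSTs.
pbs_bot = pbs(fst[("WGR", "BOT")], fst[("BOT", "CAM")], fst[("WGR", "CAM")])
```

With these illustrative inputs the focal branch for WGR is longer than that for BOT, which is the kind of contrast the coloured circles in panel b would mark for a given gene.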
a, EFO traits from the GWAS catalogue reflected by highly divergent SNVs within 50 kb of GWAS hits. From left to right, ribbons illustrate the relative representation of variants across pairwise population comparisons, GWAS ancestry, EFO top label, EFO trait or disease label, and disease or traits mapped to the EFO label. b, Distribution and sharing of common (MAF > 5%) putative LOF variants between two or more populations (coloured bars) and between all populations surveyed (red bars). c, Specific disease classes to which 5% or more genes with putative LOF variants shared between all populations were mapped. d, Correlation (Pearson) between WHO mortality rates for influenza and ratio of putative LOF variants in direct (n = 181) compared with indirect (n = 1,842) influenza-associated genes (red solid line, all populations; red dotted line, west African populations). The blue dotted line represents the mean correlation for the same correlations generated using 1,000 permutations of random genes; the s.e.m. for all populations is shown in grey. e, Correlation statistics (adjusted R²) for the putative LOF ratio for genes related to hepatitis C (HCV, n = 190 direct genes, n = 1,837 indirect genes), HIV (n = 724 direct genes, n = 1,351 indirect genes), influenza in west African countries (CAM, MAL, FNB and BRN), and malaria (n = 484 direct genes, n = 1,554 indirect genes) are shown as red dots against the box plot distributions of correlation statistics (adjusted R²) generated using 1,000 permutations of random genes (Supplementary Table 18). Box plots show the median value (centre line), whiskers indicate the limits of the highest (fourth) and lowest (first) quartiles of the data; distribution outliers are shown as dots.
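The permutation comparison described in panels d and e can be sketched roughly as follows. This is only an illustration under stated assumptions: the input layout (one gene-to-count dictionary per population), the function names and the use of plain Pearson r rather than adjusted R² are hypothetical choices, not the authors' implementation. The idea is to compute the per-population ratio of putative LOF counts in two gene sets, correlate it with an outcome such as a mortality rate, and build a null distribution by redrawing random gene sets of the same sizes.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either input is constant
    (a simplifying assumption, since r is undefined in that case)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def lof_ratio(lof_counts, set_a, set_b):
    """Per-population ratio of LOF counts in gene set A to gene set B.

    lof_counts: list with one dict (gene -> LOF variant count) per population.
    """
    ratios = []
    for pop_counts in lof_counts:
        a = sum(pop_counts[g] for g in set_a)
        b = sum(pop_counts[g] for g in set_b)
        ratios.append(a / b if b else 0.0)
    return ratios


def permutation_null(lof_counts, genes, n_a, n_b, outcome, n_perm=1000, seed=0):
    """Null correlations from random gene sets of sizes n_a and n_b."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        drawn = rng.sample(genes, n_a + n_b)  # disjoint random gene sets
        ratios = lof_ratio(lof_counts, drawn[:n_a], drawn[n_a:])
        null.append(pearson_r(ratios, outcome))
    return null
```

The observed correlation for the real direct and indirect gene sets would then be compared against the spread of the values returned by `permutation_null`, in the same spirit as the red dots plotted against the box plots in panel e.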
Extended Data Fig. 4 Distribution of G6PD variants and ClinVar pathogenic variants across H3Africa populations.
a, Frequency distribution of pathogenic and likely pathogenic variants (n = 287) in H3Africa HC-WGS populations. Disease genes with variants that had an allele frequency > 5% across multiple populations (shown in Fig. 4c) are highlighted. Box plots show the median value (centre line), whiskers indicate the limits of the highest (fourth) and lowest (first) quartiles of the data; distribution outliers are shown as dots. b, Relative frequencies of 11 G6PD deficiency-associated alleles within each population separated by sex. G6PD A− 202A and 376G refer to the A-deficiency associated with either rs1050828 (c.202G>A) or rs1050829 (c.376A>G) (MIM 305900).
This file contains Supplementary Notes 1–5, Supplementary Figures 1–20, Supplementary Methods Figures 1–3 and Supplementary References.
This file contains Supplementary Methods Tables 1–2 and 23 Supplementary Tables (referred to in the main Supplementary Information file).
About this article
Cite this article
Choudhury, A., Aron, S., Botigué, L.R. et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020). https://doi.org/10.1038/s41586-020-2859-7