Genetic architecture in Greenland is shaped by demography, structure and selection

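The study was approved by the Scientific Ethics Committee in Greenland (projects 505-42 and 505-95; project 2011-13 (ref. no. 2011–056978); project 2014-08 (ref. no. 2014-098017); project 2017-5582; and project 2015-22 (ref. no. 2015-164[…])) and was conducted in accordance with the Declaration of Helsinki, second revision. All participants gave written informed consent.

User studies were carried out as public meetings held in connection with the population studies in Greenland, and as sharing circles among genotype-based carriers of CSID and TBC1D4 variants. In brief, these user studies concluded that research on chronic diseases is considered important and necessary. This includes understanding how genetics and lifestyle interact. There is a need to study the importance of traditional lifestyle for health, including the role of traditional practices in treatment. Moreover, it is vital to explore areas for which no treatment currently exists. Research is a prerequisite for finding new ways of prevention and treatment, and explanations can help to relieve anxiety related to diseases and symptoms. Finally, there was broad support for investigating hereditary causes of health and disease. Users expressed that identifying known genetic causes can enable early prevention and treatment, especially for family members. Lack of representation was considered a major problem.

We included Greenlanders living in Greenland from the Greenland population health surveys B99 (ref. 47) (recruited 1999–2001; n = 1,317), Inuit Health in Transition48 (IHIT; recruited 2005–2010; n = 3,108) and B2018 (refs. 49,50) (recruited 2017–2019). We also included a cohort of Greenlanders living in Denmark, collected as part of the B99 survey (BBH; n = 739). Some participants took part in several surveys, giving a total of 5,996 unique participants across cohorts. As described previously47,48,49, participants were recruited in representative towns and settlements in each region of Greenland, where random samples were drawn from the central population register.

WGS data were processed as described in ref. 14, where they were used to screen for new variants in 14 genes linked to maturity-onset diabetes of the young. In brief, 448 WGS samples were sequenced using Illumina 150-bp paired-end sequencing, with an average sequencing depth of 35×. After adapter trimming, reads were mapped to GRCh38 with BWA-MEM and genotypes were called with GATK HaplotypeCaller. Only variants in the T98 tranche of the variant recalibration were used. Moreover, we used genotype data for 5,548 people generated with the Multi-Ethnic Global Array (MEGA chip, Illumina); of these, 4,182 were previously described in ref. 20 (deposited in the European Genome-phenome Archive under accession number EGAD00010002057) and the remaining 1,366 are part of the newer B2018 study. The two datasets were merged and, after variant quality control, the data consisted of 1.6 million autosomal variants.

To prepare a good reference panel for downstream imputation analyses, we phased the WGS data with SHAPEIT2 (ref. 51), using trio and duo information. We refer to this as the Greenlandic WGS panel; it includes all sites found in the Greenlandic WGS data. In addition, we prepared a second reference panel consisting of all participants in the 1000 Genomes (1KG) project, which we refer to as the 1KG panel. For this panel, we included the sites overlapping with the Greenlandic panel together with all sites with MAF > 1% in populations of East Asian and European ancestry (CDX, CEU, CHB, CHS, GBR, TSI and IBS52). We then imputed the MEGA-chip data with IMPUTE2 (refs. 51,53) using the option ‘-merge_ref_panels’, which makes use of both prepared reference panels to improve imputation performance. Finally, we created two datasets, with and without phasing information, by merging the imputed (and phased) MEGA-chip data with the phased WGS data, resulting in 5,996 participants. For downstream analyses, we applied variant quality control on genotype missingness and allele frequency as described for each analysis. Furthermore, to investigate the improved imputation accuracy of the Greenlandic WGS panel, we also ran IMPUTE v.2 with only the 1KG reference panel, and show imputation performance only at overlapping sites.

All variants were annotated using VEP54 with the following additional custom annotations: dbSNP (build 155), gnomAD v.3.0.0 allele frequencies and genome coverage (gnomADover15: proportion of gnomAD participants with depth > 15, gnomADover50: proportion of gnomAD participants with depth > 50, gnomAD filter: the gnomAD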
filter column being either PASS, AC0, AS_VQSR, InbreedingCoeff or NA if not polymorphic in gnomAD), 1KG frequencies (v.20201028_3202_phased)52, ancestral alleles from the Ensembl Homo sapiens ancestral sequence, and the loss-of-function transcript effect estimator (LOFTEE)55. We did not use gnomAD v.4 because it does not have depth and filter information for sites not included in the release.

To find new variants, we kept only variants with good coverage in gnomAD (gnomADover15 > 80%, gnomADover50 < 10% and gnomAD filter = PASS or NA). Moreover, we did not allow for spanning deletions (variant call format (VCF) asterisk allele), variants with missingness greater than 20% or several variants 1 bp apart in the Greenlandic WGS data. All these filters were used to minimize the number of false-positive variants. For plotting, variants were polarized according to the minor allele in gnomAD African populations.

To compare the number of SNPs not found in other populations, we used only SNPs with good coverage in gnomAD (as defined above) and found the maximum allele frequency in any of the following gnomAD populations: AFR, NFE, FIN, SAS, EAS and ASJ, covering African, European and Asian populations. SNPs with a maximum gnomAD frequency of more than 0.01% were excluded, and the remaining SNPs were counted as SNPs not in Africa, Europe or Asia for both the 448 admixed Greenlandic WGS samples and a random sample of 448 samples from the Americas in the 1KG PEL, CLM, MXL and PUR populations.

Height, weight, systolic blood pressure, diastolic blood pressure, and hip and waist circumference were measured, and body mass index and waist-hip ratio were calculated. All IHIT participants above 18 years, B99 participants above 35 years and a subset of B2018 participants underwent an oral glucose tolerance test, in which blood samples were drawn after an overnight fast of at least 8 h, and at 30 min (only for B2018) and 2 h after receiving 75 g glucose. Plasma glucose was measured at fasting, 30 min and 2 h, and haemoglobin A1c at fasting, as previously described56. Concentrations of serum cholesterol, high-density lipoprotein cholesterol and triglycerides were measured, and LDL-cholesterol was calculated.
Type 2 diabetes was defined based on the World Health Organization 1999 criteria57 and controls were defined as normal glucose tolerant based on the oral glucose tolerance test data.

The Olink protein data for the Greenlandic participants used for protein quantitative trait loci analysis are from ref. 31. Using the Olink Target 96 Inflammation and Cardiovascular II panels, relative plasma levels of 184 proteins were measured in 3,732 participants across the population surveys. The two batches were bridged and normalized based on 16 control samples using the OlinkAnalyze R package (https://cran.r-project.org/web/packages/OlinkAnalyze/index.html). Normalized protein expression values on a log2 scale were inverse-rank normalized, including normalized protein expression data below the limit of detection. Samples with a quality control warning were excluded.

The Danish data used in assessment of polygenic scores consist of 6,182 people from the Danish population-based Inter99 cohort, where metabolic phenotypes were measured as described previously58, and genotyping was done using the Infinium OmniExpress-24 v.1.3 Chip (Illumina Inc.). Genotypes were called using GenCall in GenomeStudio (v.2011.1; Illumina) and quality control was performed according to a standard procedure as previously described59. Data were imputed using the Haplotype Reference Consortium reference panel on the Michigan imputation server.

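Inuit and European admixture proportions were calculated using the software ADMIXTURE60 on a subset of variants with MAF > 5% and missingness less than 1%, LD-pruned using Plink v.1.9.0 (ref. 61) by removing variants with R² > 0.8 within 1 Mb.

For fine-scale structure analyses of the Inuit ancestry, we used the neural network framework HaploNet on the phased and imputed data of all 5,996 Greenlandic participants, with MAF > 5%. First, we performed window-based haplotype clustering using a Gaussian mixture variational autoencoder. A window size of 1,024 SNPs was used to generate haplotype cluster likelihoods for all samples, which we used to infer fine-scale population structure through ancestry estimation and principal component analysis. We performed unsupervised ancestry estimation allowing for two ancestral sources (K = 2) and ran it with several seeds to ensure that the expectation-maximization algorithm of HaploNet had converged. The convergence criterion was defined as having two runs within five log-likelihood units of the best seed. The two ancestries were assumed to reflect Inuit and European ancestry.

Next, we inferred local ancestry of the haplotypes using HaploNet FATASH. FATASH estimates the posterior probability of local ancestry in each genomic window from the haplotype cluster likelihoods and the genome-wide admixture estimates obtained from HaploNet train and HaploNet admix. The model is based on a hidden Markov model with instantaneous rate changes. FATASH is implemented in Python/Cython as a submodule of the HaploNet software suite. Local ancestry tracts were inferred using three different window sizes (1,024, 512 and 256 variants) to improve accuracy around recombination events. We used the smaller windows (512 or 256 variants) only if the fit improved by more than 50 log-likelihood units, to balance overfitting against the detection of true recombination events.

Finally, on the basis of the local ancestry inferred by FATASH, we masked haplotype clusters in genomic windows with a posterior probability of less than 0.95. After masking, we excluded from downstream analyses people on a missingness threshold of 0.90 and genomic windows in which the haplotype frequency after masking was less than 0.01. The masked dataset was used for fine-scale structure analyses of the Inuit ancestry with HaploNet PCA and HaploNet admix. For HaploNet admix, we used the same convergence criterion as above and modelled from two to eight ancestral sources. An admixture model with nine ancestral sources failed to meet our convergence criterion.

From the imputed and phased data, we first excluded variants with missingness greater than 1% and MAF < 5%. Then, we inferred local ancestry for the admixed participants using RFmix v.2 with unadmixed Greenlandic Inuit (N = 99) and participants with European genetic ancestry (N = 313; CEU, IBS and TSI) from the 1KG project as reference populations. This gives us local ancestry tracts for each person. To analyse the genetic architecture of the Greenlandic Inuit, we constructed a masked WGS dataset, using vcfppR62, in which regions with any European local ancestry were excluded, resulting in an average of 240 (minimum, 201; maximum, 285) Greenlandic Inuit participants at each site. To create the masked dataset, for each person we kept only sites in a local ancestry region inferred to be of Inuit ancestry on both alleles. In this way, we can also assess rare variants without relying on correct phasing. We manually excluded the HLA region (Chr.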
6: 28510120–33480577), local ancestry tracts with fewer than 200 variants per megabase and local ancestry tracts with extreme inferred mean Inuit ancestry (less than 63% or more than 71%), as these are all potentially problematic regions for inferring local ancestry accurately.

Reference and alternative allele counts were obtained using Plink v.1.9.0 (ref. 61), keeping allele order, and projected to the wanted number of participants using the formula binom(m, j) × binom(n − m, k − j)/binom(n, k), where k is the observed number of alternative alleles, n is the total number of alleles, m is the number of alleles to project to (that is, two times the number of participants) and j is the site frequency spectrum (SFS) bin. For each site, this gives the probability of observing j alternative alleles in a subsample of m alleles. The probabilities were summed across variants and folded to get the folded SFS.

The derived sum was calculated from an SFS polarized by ancestral or derived allele, using only SNPs with a high-confidence ancestral allele match. Derived alleles were counted and projected to the needed number of participants as for the SFS above, but not folded.

To measure the number of segregating SNPs as a function of the number of participants sequenced, we projected the SFS to the wanted number of participants, folded the SFS and summed across all the non-zero SFS bins. In this way, we get the number of segregating SNPs for all possible subsamples of participants from our data.

Only variants with good coverage in gnomAD v.3.0.0, as described above, and predicted to be LoF on the canonical transcript were used. For each participant, variants with a MAF above 0.1% in any gnomAD population were excluded. For Greenlandic people, an additional count was made in which variants could be excluded based on either the MAF in any gnomAD population or the MAF in the Greenlanders excluding the participant in question. Figure 2b shows the results for different MAF thresholds.
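The projection formula for the SFS given above is a hypergeometric probability, and the whole procedure (project each site, sum, fold, count non-zero bins) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the `(k, n)` input layout are assumptions made here, not the authors' code.

```python
from math import comb


def project_site(k, n, m):
    """Probability of seeing j = 0..m alternative alleles in a subsample of
    m alleles, given k alternative alleles among n observed alleles."""
    if not (0 <= k <= n and 0 < m <= n):
        raise ValueError("need 0 <= k <= n and 0 < m <= n")
    total = comb(n, k)
    probs = []
    for j in range(m + 1):
        # comb() returns 0 when the lower index exceeds the upper one,
        # but raises for negative arguments, so guard k - j explicitly.
        ways = comb(m, j) * comb(n - m, k - j) if k - j >= 0 else 0
        probs.append(ways / total)
    return probs


def projected_sfs(sites, m):
    """Sum per-site projection probabilities into an unfolded SFS.
    `sites` is an iterable of (k, n) pairs (hypothetical input layout)."""
    sfs = [0.0] * (m + 1)
    for k, n in sites:
        for j, p in enumerate(project_site(k, n, m)):
            sfs[j] += p
    return sfs


def fold(sfs):
    """Fold an unfolded SFS so that bin i holds minor-allele count i."""
    m = len(sfs) - 1
    return [sfs[i] if i == m - i else sfs[i] + sfs[m - i]
            for i in range(m // 2 + 1)]


def expected_segregating(sites, m):
    """Expected number of segregating sites in a subsample of m alleles:
    the sum of the non-zero bins of the folded, projected SFS."""
    return sum(fold(projected_sfs(sites, m))[1:])
```

Calling `expected_segregating(sites, 2 * n_participants)` for a range of participant numbers would trace out the number of segregating SNPs as a function of sample size.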
Numbers of people in genetic ancestry groups were as follows: Greenlandic, 448; Greenlandic (unadmixed), 31; Nigerian (YRI + ESN), 207; Han Chinese (CHB + CHS), 208; and British (+CEU) (GBR + CEU), 190.

A total of 190 participants with the least European ancestry were sampled from the Greenlandic WGS data (on average 12% and at most 20% European ancestry) and 190 participants were sampled randomly from the East Asian populations CHB and CHS. The European GBR + CEU group had a total of 190 unrelated samples. All variants were annotated with an allele frequency based on the African populations from 1KG without European admixture (LWK, ESN, YRI, MSL and GWD)52. Only SNPs with good coverage in gnomAD as defined above, predicted to be missense or LoF variants in the canonical transcript of a constrained gene, and with an African MAF < 0.01% were kept. Constrained genes were defined as genes for which the expected number of pLoF and the LOEUF score estimated by Karczewski et al.55 were less than 10 and 1, respectively. This resulted in 9,533 constrained genes from canonical transcripts. In each gene, two burdens were calculated per person: (1) the gene burden, which is 0 if no SNPs were present and 1 if at least one SNP was present, and (2) the most common burden, which is 1 if the person carries the most common variant in the gene and 0 otherwise. If the most common burden was less than 50% of the gene burden, we categorized the gene burden as ‘informative’; if the most common burden was at least 90% of the gene burden, we categorized the gene as ‘one variant dominates’. All calculations were done independently in each of the three populations.

We randomly sampled 190 unadmixed (inferred European ancestry less than 1%) and unrelated Greenlanders. From these, we calculated pairwise LD using Plink v.1.9.0 in a 10-Mb region. The average number of tag-SNPs was calculated by counting the number of variants that were in high LD (R² > 0.8) in regions of varying sizes.

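The two per-gene burdens described earlier in this section, and the 50% and 90% cut-offs used to categorize a gene, can be illustrated with a small Python sketch. It assumes that "most common variant" means the variant with the most carriers in the population, and that the comparison is made on burdens summed over people; the function name, the input layout and the labels `"intermediate"` and `"no carriers"` are hypothetical and not from the study.

```python
def categorize_gene(carriers_by_variant):
    """Categorize one gene in one population.

    `carriers_by_variant` maps a variant ID to the set of people carrying it
    (hypothetical layout). Returns one of 'informative',
    'one variant dominates', 'intermediate' or 'no carriers'.
    """
    # Gene burden summed over people: carriers of at least one variant.
    all_carriers = set().union(*carriers_by_variant.values())
    gene_burden = len(all_carriers)
    if gene_burden == 0:
        return "no carriers"

    # Most common burden summed over people: carriers of the top variant.
    most_common_burden = max(len(c) for c in carriers_by_variant.values())

    # Integer comparisons avoid floating-point issues at the thresholds.
    if 2 * most_common_burden < gene_burden:          # less than 50%
        return "informative"
    if 10 * most_common_burden >= 9 * gene_burden:    # at least 90%
        return "one variant dominates"
    return "intermediate"
```

Running the function separately on each population's carrier sets mirrors the statement that all calculations were done independently per population.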
Association tests were run using a linear mixed model, with the estimated genetic relationship matrix (GRM) as a random effect, taking population structure and relatedness into account, using GEMMA v.0.98.5 (ref. 63). The GRM was estimated from a set of ‘good’ variants with MAF > 5%, missingness less than 1%, INFO > 0.95 and in Hardy–Weinberg equilibrium when accounting for population structure (PCAngsd, excluding sites with |siteF| greater than 0.05 and P < 1 × 10⁻⁶). We used a leave-one-chromosome-out scheme in which the GRM used for testing associations on a given chromosome was estimated using all the other chromosomes. All phenotypes were tested using a score test after sex-stratified rank-based inverse normal transformation, including age, sex and cohort as covariates. As in previous studies14, the odds ratio for diabetes was calculated using the logistic mixed model implemented in GMMAT64. For each phenotype, the GRMs were estimated only for people with no missing phenotype or covariate information. To identify independent association signals in the Greenlandic cohorts, we first calculated the LD-adjust65 correlation, R², between pairs of SNPs, accounting for the population structure. Then, we performed LD clumping on the above association signals using PCAone v.0.4.4 (ref. 66) adjusted for five principal components with the options ‘--clump-p1 1e-6 --clump-p2 1e-6 --clump-r2 0.001 --clump-bp 10000000 --pc 5’.

European independent genome-wide significant (P < 5 × 10⁻⁸) signals within 10 Mb were extracted from summary statistics for the 13 metabolic traits using the extract_instruments function with default parameters from the R package TwoSampleMR v.0.5.10 (ref. 67) (Supplementary Table 10). The independent signals for type 2 diabetes were extracted from the curated list of Mahajan et al.68.

For the sample-size-matched GWAS with UK Biobank, we randomly sampled 5,996 people with no missingness on the phenotypes.
For associations on protein abundances, we randomly sampled 3,707 of the 5,996, matching the number of people in the Greenlandic cohorts with protein abundances. The GRMs on the UK Biobank were estimated on variants with MAF > 5% and missingness less than 1%, resulting in a total of 4.5 million variants. For all protein abundance associations, we extracted the variant with the lowest P value, removed all variants within ±2.5 Mb of that variant and repeated until no variants were left with a P value < 5 × 10⁻⁸. This was done independently for the additive and recessive models, and each signal was assigned to the model with the lowest P value. To avoid duplicated signals of strong associations that were found in both models, we excluded associations within ±1 Mb of an association with a lower P value.

For continuous traits under the additive model, the variance explained was calculated as $\mathrm{PVE} = \beta_{\mathrm{add}}^{2} \times 2\,\mathrm{AF}(1 - \mathrm{AF})$, where $\beta_{\mathrm{add}}$ is the additive effect size and AF is the allele frequency of the effect allele69. For the recessive model, the variance explained was calculated as $\mathrm{PVE} = \beta_{\mathrm{rec}}^{2} \times F_{\mathrm{hom}}(1 - F_{\mathrm{hom}})$, where $\beta_{\mathrm{rec}}$ is the recessive effect size and $F_{\mathrm{hom}}$ is the expected homozygous frequency, $F_{\mathrm{hom}} = \mathrm{AF}^{2}$. This method was used for plotting, as we could calculate CIs for the variance explained using the same formulas but with the lower and upper CI of the effect size. For choosing the variants with more than 1% variance explained, we used the formula $\mathrm{PVE} = \beta^{2}/(\beta^{2} + \mathrm{SE}(\beta)^{2} \times N)$, where SE(β) is the standard error of β and N is the number of participants70. For binary traits, we calculated the liability-scale variance explained using the R package Mangrove (v.1.21) as previously described14.

PGS files from Weissbrod et al.71 were downloaded from the PGS catalogue (Supplementary Table 11). All variants were lifted over from Hg19 to Hg38 using the UCSC Liftover command line interface while keeping track of strand-flipping.
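The three variance-explained formulas given earlier in this section translate directly into code. The sketch below is only an illustration of those formulas; the function names are chosen here for readability and are not taken from the study or from any package.

```python
def pve_additive(beta_add, af):
    """Variance explained under the additive model:
    beta_add^2 * 2 * AF * (1 - AF)."""
    return beta_add ** 2 * 2.0 * af * (1.0 - af)


def pve_recessive(beta_rec, af):
    """Variance explained under the recessive model, using the expected
    homozygous frequency F_hom = AF^2."""
    f_hom = af ** 2
    return beta_rec ** 2 * f_hom * (1.0 - f_hom)


def pve_from_standard_error(beta, se, n):
    """Variance explained from the effect size, its standard error and
    the number of participants: beta^2 / (beta^2 + SE^2 * N)."""
    return beta ** 2 / (beta ** 2 + se ** 2 * n)
```

For example, an additive effect of 0.5 standard deviations at an allele frequency of 0.5 gives `pve_additive(0.5, 0.5)`, that is 0.25 × 0.5 = 0.125.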
For the Greenlandic, Danish and UK Biobank (non-British Europeans) datasets, overlapping variants were identified by matching on chromosome and position, and by matching the effect and other alleles against both the alternative and reference alleles. The PGS was calculated on the genotype dosages using the score function in Plink2 v.2.0.0 (ref. 61). Each phenotype was rank-based inverse normal transformed separately for males and females. Next, two linear models were fitted: (1) the null model, phenotype ~ age × sex + PC1–10, and (2) the PGS model, phenotype ~ age × sex + PC1–10 + PGS. Then, we calculated incr. R² = adjusted R² (PGS model) − adjusted R² (null model). For the Greenlandic cohorts, an additional covariate for cohort was included, and for the UK Biobank an additional covariate for assessment centre was included. Body mass index was included as a covariate for waist:hip ratio. The CIs for the incr. R² were calculated using Olkin and Finn’s approximation for the s.e. To summarize across all phenotypes, we calculated the relative incr. R² using the UK Biobank as baseline. The higher incr. R² for LDL-cholesterol in the Danish cohort was probably due to differences in age distributions (age range: UK Biobank, 46–80 years; Danish, 30–60 years). The population-based Greenlandic cohort design followed the population-based design of the Danish Inter99. The UK Biobank had a completely different cohort design and is not fully comparable; for example, the mean age in UK Biobank is 64 years, whereas the mean ages in the Danish and Greenlandic cohorts are 46 and 45 years, respectively.

As described previously72, we estimated relatedness using a filtered set of genetic variants with MAF > 5% and missingness < 5% that was LD-pruned (Plink v.1.9.0 --indep-pairwise 1000kb 1 0.8), along with the inferred admixture proportions, as input to NGSremix73.
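For each pair, NGSremix calculates pairwise relatedness as the fraction of loci sharing zero, one or two alleles identical by descent (represented by k0, k1 and k2, respectively). Parent–offspring pairs were defined as relationships with k1 + k2 > 0.95 and k1 > 0.75, with the parent inferred from the ages of the participants. Full sibling pairs were defined as relationships with 0.3 < k1 < 0.7 and k2 > 0.1. Among the 5,828 people with age and location information, we identified 1,727 parent–offspring and 1,841 full sibling relationships. Relationships were normalized by the number of possible pairs. For relationships between regions, for example regions 1 and 2, the number of possible pairs was calculated as $N_{\mathrm{possible}(1,2)} = N_{\mathrm{region1}} \times N_{\mathrm{region2}}$, where $N_{\mathrm{region1}}$ and $N_{\mathrm{region2}}$ are the numbers of participants in regions 1 and 2, respectively. Within a region, the number of possible parent–offspring pairs was calculated as $N_{\mathrm{possible}(1,1)} = N_{\mathrm{region1}} \times (N_{\mathrm{region1}} - 1)$, and the number of possible full sibling pairs as $N_{\mathrm{possible}(1,1)} = (N_{\mathrm{region1}} \times (N_{\mathrm{region1}} - 1))/2$.

To calculate the expected frequency of homozygous carriers under the current fine-scale structure (structured), we estimated regional allele frequencies, $\mathrm{AF}_{\mathrm{region}}$, based on sample locations, calculated the expected number of homozygous carriers in each region as $N_{\mathrm{hom}(\mathrm{region})} = N_{\mathrm{region}} \times \mathrm{AF}_{\mathrm{region}}^{2}$, summed the homozygous carriers across regions, $N_{\mathrm{hom}} = N_{\mathrm{hom}(\mathrm{region1})} + N_{\mathrm{hom}(\mathrm{region2})} + \cdots + N_{\mathrm{hom}(\mathrm{region8})}$, and divided by the total number of participants, $F_{\mathrm{hom}(\mathrm{structured})} = N_{\mathrm{hom}}/N_{\mathrm{total}}$. The expected frequency of homozygous carriers in a panmictic population (panmictic) was estimated as $F_{\mathrm{hom}(\mathrm{panmictic})} = \mathrm{AF}^{2}$. The CIs of the homozygous frequencies were estimated from the s.d. of the frequency estimates across 10,000 bootstrap samples.

We estimated the frequencies of different variants in the Greenlandic Inuit population using the masked participants from this study. In addition, we investigated the allele frequencies of these variants in a range of available datasets12,17,26,29,30,74,75,76,77,78,79,80,81 (Supplementary Tables 7 and 8). We used SAMtools82 and BGT83 to extract allele counts for the relevant variants from these datasets.

The ancestral recombination graph (ARG) was estimated in two versions: (1) using all 448 WGS Greenlanders (used for analysis of FADS2 and CPT1A) and (2) dividing the genome into 6-Mb chunks, finding people with only Inuit ancestry in that chunk and randomly subsampling 150 of those people, resulting in a Greenlandic Inuit masked ARG from 150 mosaic participants (used for all other analyses). For the masked version, only chunks with more than one variant per 5,000 bp were used. Both ARGs were estimated using the default Relate v.1.2.1 pipeline: for each chromosome (or chunk, in the masked version), convert phased VCF to haps/sample file format, prepare input using the provided PrepareInputFiles script, with the high-confidence Ensembl H.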
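sapiens ancestral sequence as ancestral reference, and estimate the ARG for the chromosome/chunk using the provided RelateParallel.sh script with the effective population size of haplotypes set to 2,000 and the mutation rate per generation set to 2.5 × 10⁻⁸. Then, the population size was estimated jointly across all chromosomes/chunks using the provided population-size estimation script, resulting in the estimated population history.

To estimate variant ages, we drew 10,000 branch-length samples of the local tree from the estimated ARG using the SampleBranchLengths script provided by Relate. Note that Relate does not allow changes in the tree topology when resampling. From each sample, we extracted the time to the most recent common ancestor carrying the variant and to the ancestor preceding it. This gave the minimum and maximum age of the variant, measured in generations, in each branch-length sample. From these intervals, we estimated the age of the variant and its credible interval. We calculated the probability density as a weighted average over the intervals, in which the weight is the inverse of the length of each interval. In doing so, we assume that the age is equally likely to lie anywhere within each interval and we give equal weight to each of the 10,000 sampled branch lengths. The estimated variant age is the median of the probability density, and the 95% credible interval was calculated as the 2.5% and 97.5% quantiles. Ages in generations were converted to years by multiplying by a generation time of 28 years.

As for the variant age estimation above, we tested for selection using branch-length samples of the local trees of the estimated ARG. These branch-length samples were used as input to CLUES (ref. 44) to infer allele frequency trajectories and test for selection. To obtain empirical P values, we tested for selection on 999 other variants with a derived allele frequency within ±10% of that of each variant. The empirical P value was then calculated as the log-likelihood rank divided by the total number of variants tested. Random variants were not sampled within ±5 Mb of any of the tested variants.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.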
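The variant-age summary described above amounts to an equally weighted mixture of uniform distributions, one per branch-length sample, from which the median and the 2.5% and 97.5% quantiles are read off. The following Python sketch shows one way to compute this by bisection on the mixture's cumulative distribution. The function names and the `(min_age, max_age)` input layout are assumptions for illustration, each interval is assumed to have non-zero length, and this is not the authors' implementation.

```python
def mixture_cdf(t, intervals):
    """CDF at age t of an equally weighted mixture of uniform
    distributions, one per (min_age, max_age) interval."""
    total = 0.0
    for lo, hi in intervals:
        # Each interval contributes a uniform CDF clipped to [0, 1].
        total += min(max((t - lo) / (hi - lo), 0.0), 1.0)
    return total / len(intervals)


def mixture_quantile(q, intervals, tol=1e-9):
    """Quantile of the mixture, found by bisection on the CDF."""
    a = min(lo for lo, _ in intervals)
    b = max(hi for _, hi in intervals)
    while b - a > tol:
        mid = 0.5 * (a + b)
        if mixture_cdf(mid, intervals) < q:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)


def variant_age(intervals, years_per_generation=28.0):
    """Median age and 95% credible interval, in generations and years.

    `intervals` holds one (min_age, max_age) pair in generations per
    branch-length sample (hypothetical layout)."""
    median = mixture_quantile(0.5, intervals)
    lower = mixture_quantile(0.025, intervals)
    upper = mixture_quantile(0.975, intervals)
    return {
        "median_generations": median,
        "ci_generations": (lower, upper),
        "median_years": median * years_per_generation,
        "ci_years": (lower * years_per_generation,
                     upper * years_per_generation),
    }
```

With 10,000 sampled intervals, `variant_age(intervals)` would return the point estimate and credible interval in both units, using the 28-year generation time stated in the text as the default.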
