A draft human pangenome reference

We identified parent–child trios from the 1KG cohort whose child cell lines, held at the Coriell Institute for Medical Research in the NHGRI Sample Repository for Human Genetic Research, were listed with zero expansions and two or fewer passages, and we rank-ordered candidate representative individuals as follows. Loci with a minor allele frequency (MAF) below 0.05 were removed. MAF was measured across the whole cohort (that is, 2,504 individuals and 26 subpopulations), irrespective of each individual's subpopulation label. For each chromosome, principal component analysis (PCA) was performed to reduce dimensionality. This yielded a matrix with 2,200 features, which was then centred and scaled using SmartPCA normalization. The matrix was further reduced to 100 features by another round of PCA.

We defined representative individuals of a subpopulation as those that are similar to the other members of their group (in this case, the subpopulation to which they belong) and dissimilar to individuals outside the group. Groups were defined by the previous 1KG population labels (for example, 'Gambian in Western Division'). We did this as follows. For each sample, we first computed the intra-group distance, D_intra, as the mean of the L2-norms between the sample and the samples of the same subpopulation. The inter-group distance, D_inter, was defined analogously as the mean of the L2-norms between the sample and the samples in all other subpopulations. The L2-norms were derived in the PCA feature space. The score of the sample was then defined as 10 × D_intra + D_inter/(n − 1), where n is the number of subpopulations. For each subpopulation, if fewer than three trios were available, all of them were selected. Otherwise, trios were selected by sorting on max(paternalRank, maternalRank), where paternalRank and maternalRank are the respective ranks of each parent's score, and taking the top three. We ranked by parental score because, in the first year of the effort, the child samples had no sequencing data and therefore had to be represented by their parents.

Ideally, we would select the same number of candidates from each subpopulation, with equal numbers of candidates of both sexes. To correct imbalances, we applied the following criteria to each subpopulation's candidate set: (1) when the sexes were imbalanced (that is, off by more than one sample), we attempted to swap in the next-best candidate of the less-represented sex, or did nothing if that was not possible; and (2) if a subpopulation had fewer individuals than the required selection size (that is, all candidates were selected), its unused slots were distributed among the other, unsaturated subpopulations. The latter choice was arbitrary but had little effect on the overall results.

The genetic information used in this study comes from publicly available cell lines in the NHGRI Sample Repository for Human Genetic Research and the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. This study was therefore exempt from human-subjects approval, as the proposed work involved the collection or study of data or specimens that are already publicly available.

Lymphoblastoid cell lines (LCLs) used for sequencing from the 1KG collection (Supplementary Table 1) were obtained from the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical Research. The HG002 (GM24385) and HG005 (GM24631) LCLs were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. All expansions for sequencing were made from the original expansion culture lots to ensure the lowest possible passage number and to reduce overall time in culture. Cells used for HiFi, nanopore, Omni-C, Strand-seq, 10x Genomics and Bionano data generation, as well as for G-banded karyotyping and Illumina Omni2.5 microarrays, were expanded to a total culture size of 4 × 10⁸ cells, which amounted to five passages from cell-line establishment. Cells were divided into vials specific to each production assay as follows: HiFi, 2 × 10⁷ cells; nanopore, 5 × 10⁷ cells; Omni-C, 5 × 10⁶ cells; Strand-seq, 1 × 10⁷ cells; 10x Genomics, 4 × 10⁶ cells; and Bionano, 4 × 10⁶ cells. Cells for Strand-seq were stored in 65% RPMI-1640, 30% FBS and 5% DMSO and frozen as viable cultures. All other cells were washed in PBS and frozen as dry cell pellets. Cells for ONT-UL production were separately expanded from the original expansion culture lots to five vials of 5 × 10⁶ cells. One vial was subsequently expanded to a total culture size of 4 × 10⁸ cells, for eight passages in total. Cells were also retained for G-banded karyotyping and Illumina Omni2.5 microarrays.

G-banded karyotyping was performed at passage 5 (for HiFi, nanopore and Omni-C) and at passage 8 (for ONT-UL). For all cell lines, 20 metaphase cells were counted, and at least 5 metaphase cells were analysed and karyotyped. Chromosome analysis was performed at a resolution of 400 bands or greater. Pass/fail criteria were applied before cell lines proceeded to sequencing. Cell lines with a normal karyotype (46,XX or 46,XY), or with benign polymorphisms frequently seen in apparently healthy individuals, were classified as passing. A cell line was classified as failing if two or more cells had the same chromosomal abnormality. DNA for microarrays was isolated from frozen cell pellets (3 × 10⁶ to 7 × 10⁶ cells) using the Maxwell RSC Cultured Cells DNA kit on a Maxwell RSC instrument (Promega). DNA was genotyped on the Infinium Omni2.5-8 v.1.3 BeadChip (Illumina) at the Center for Applied Genomics at the Children's Hospital of Philadelphia.

PacBio HiFi sequencing was distributed between two centres: the University of Washington and Washington University in St. Louis. We describe the protocols used at each centre separately.

High-molecular-weight DNA was isolated from frozen cell pellets using the Qiagen MagAttract HMW DNA kit and sheared to a 20-kb mode size using a Diagenode Megaruptor I. At all steps, DNA quantity was checked on a Qubit fluorometer with the dsDNA HS assay kit (Thermo Fisher), and size was checked on a Femto Pulse (Agilent Technologies) using the genomic DNA 165-kb kit. SMRTbell libraries were prepared for sequencing according to the protocol 'Procedure & Checklist – Preparing HiFi SMRTbell Libraries using the SMRTbell Express Template Prep Kit 2.0'. After SMRTbell generation, material was size-selected on a SageELF system (Sage Science) using the '0.75% 1-18 kb' program (target of 3,450 bp in well), combining fractions for sequencing (for example, fraction 3, average size 15-21 kb; fraction 2, average size 16-27 kb; or fraction 1, average size 20-31 kb) depending on empirical size measurements and the available mass. Selected library fractions were bound with sequencing primer v.2 and Sequel II polymerase v.2.0 and sequenced on SMRT Cells on a Sequel II instrument (PacBio) using sequencing plate chemistry v.2.0, diffusion loading, a 2-h pre-extension and 30-h movie times. Samples were sequenced on four SMRT Cells to a minimum HiFi data quantity of 108.5 Gbp (35× estimated genome coverage).

High-molecular-weight DNA was isolated from frozen cell pellets using a modified Gentra Puregene approach and sheared to a 20-kb mode size using a gTUBE (Covaris). At all steps, DNA quantity was checked by fluorometry on a DS-11 FX instrument (DeNovix) with the Qubit dsDNA HS assay kit (Thermo Fisher), and size was checked on a Femto Pulse (Agilent Technologies) using the genomic DNA 165-kb kit. SMRTbell libraries were prepared for sequencing according to the protocol 'Procedure & Checklist – Preparing HiFi SMRTbell Libraries using the SMRTbell Express Template Prep Kit 2.0'. After SMRTbell generation, material was size-selected on a SageELF system (Sage Science) using the '0.75% 1-18 kb' program (target of 3,400 bp in well), sequencing fraction 2 (average size 17-20 kb) or fraction 1 (average size 18-20 kb) depending on empirical size measurements and the available mass. For some samples, the SageELF program '0.75% agarose, 10 kb-40 kb' (target of 10,000 bp in well) was used, with fractions 6 and 7 combined for sequencing (average size 17-21 kb). Selected library fractions were bound with sequencing primer v.2 and Sequel II polymerase v.2.0 and sequenced on SMRT Cells on a Sequel II instrument (PacBio) using sequencing plate chemistry v.2.0, diffusion loading, a 3-4-h pre-extension and 30-h movie times. Samples were sequenced on at least four SMRT Cells to a minimum HiFi data quantity of 96 Gbp (30× estimated genome coverage).
Although there were minor differences in HiFi data production methods between the University of Washington and Washington University in St. Louis, the final data were very similar, with overlapping assembly statistics for most samples. These initial genomes were sequenced while protocols optimized for HiFi sequencing were still being refined, as HiFi was a relatively new process at the time. The main differences between the protocols were in the nucleic-acid isolation, fragmentation and size-selection portions, with the downstream sequencing-specific applications being more consistent. The two teams interacted closely with each other, and with our corporate colleagues, including New England Biolabs (NEB), Qiagen, Diagenode and Sage Science, to deliver the best possible final product.

For the other 18 samples, we used the unsheared nanopore long-read sequencing protocol described previously16. This yielded approximately 60× coverage of unsheared sequencing from three PromethION flow cells, with an N50 value of approximately 44 kb. For the 29 newly selected HPRC samples (Results), we used the protocol outlined below.

Approximately 50 million pelleted cells were resuspended in 200 µl of PBS, and the resuspended cells were aliquoted (40 µl each) into five 1.5-ml DNA lo-bind Eppendorf tubes. The DNA extraction process below was completed for each of the five aliquots. Each tube contained enough DNA for three library loadings onto one flow cell. The following reagents were added sequentially to each tube, mixing by pipetting (up and down five times) with a P200 wide-bore pipette: 40 μl of proteinase K, 40 μl of buffer CS and 40 μl of CLE3. Samples were then incubated at room temperature (18-25 °C) for 30 min. Next, 40 μl of RNase A was added to each tube with pipette mixing (ten times) with a P200 wide-bore pipette, and the samples were incubated at room temperature for 3 min. In a 1.5-ml Eppendorf tube, 200 µl of BL3 was mixed with 200 μl of PBS. Four hundred microlitres of the BL3-PBS mixture was then added to each sample, and the samples were mixed ten times with a P1000 wide-bore pipette set to 600 μl.

Samples were incubated at room temperature for 10 min, pipette-mixed five times, incubated at room temperature for another 10 min, pipette-mixed five times, and then incubated at room temperature for a further 10 min. A white precipitate may form after the addition of BL3; this is normal. A Nanobind disk was added to the cell lysate, followed by 600 μl of isopropanol. Mixing was performed by inverting the tubes five times. The tubes were mixed further on a tube rotator (9 r.p.m. for 10 min at room temperature). The tubes were then placed on a magnetic tube rack with the Nanobind disk positioned near the top of the tube to avoid unintended binding of DNA to the disk. The supernatant was discarded using a pipette, and 700 μl of buffer CW1 was added to each tube. The tubes were then inverted four times in the magnetic rack to mix. Second and third washes were performed with 500 μl of buffer CW2 (four inversions per wash). After the second CW2 wash, liquid was removed from the tube cap, and the tubes were spun for 2 s in a microcentrifuge and then replaced on the magnetic rack. Residual liquid was removed from the bottom of the tube, taking care not to remove the DNA associated with the Nanobind disk. DNA was eluted by adding 160 μl of Circulomics elution buffer (EB) plus 0.02% Triton X-100 (a mix comprising 316.8 μl of EB and 3.2 μl of 2% Triton X-100) and incubating at room temperature for at least 1 h. The tubes were tapped gently during elution. DNA was collected by transferring it with a P200 wide-bore pipette to a new 1.5-ml microcentrifuge tube. After pipetting, some liquid and DNA remained on the Nanobind disk. The tube containing the Nanobind disk was centrifuged at 10,000g for 5 s, and any additional liquid that came off the disk was transferred to the elution tube. This process was repeated as necessary until all of the DNA had been recovered. Samples were pipette-mixed five times (approximately 10 s per aspiration and 10 s per dispense) with a wide-bore P200 pipette. Samples were then left to rest overnight at room temperature to allow the DNA to solubilize (disperse).

Circulomics EB+ (EB buffer with 0.02% Triton X-100) was prepared, and 140.82 μl of EB+ was aliquoted into a 1.5-ml Eppendorf DNA lo-bind tube. The UHMW DNA from above (300 μl) was aliquoted into the same tube with a wide-bore P200 pipette. The mixture was slowly pipette-mixed three times with a wide-bore P200 pipette set to 150 µl. In a separate 1.5-ml Eppendorf DNA lo-bind tube, the following reagents were added: 144 μl of FRA dilution buffer, 9.18 μl of 1 M MgCl2 and 6 μl of FRA. The tube was tapped to mix and spun using a microcentrifuge. The EB-Triton X-100-DNA mixture was added to the FRA dilution buffer-MgCl2-FRA mixture with a wide-bore P200 pipette. This mixture was then mixed 15-20 times with a wide-bore P1000 pipette set to 600 µl. The mixture appeared homogeneous once pipette mixing was complete. The tube was then incubated at room temperature for 15 min. The mixture was then mixed five times with a wide-bore P1000 pipette set to 600 µl and incubated at room temperature for another 15 min. The mixture was incubated at 30 °C for 1 min and then at 80 °C for 1 min, and then held at 4 °C.

Clean-up used a Nanobind disk. A 5-mm Nanobind disk was added to the reaction mixture above, followed by 300 µl of Circulomics buffer NAF10. The tube was gently tapped 10-20 times to mix. The mixture was placed on a platform rocker at 20 r.p.m. for 2 min at room temperature. A DNA 'cloud' was visible on the Nanobind disk. The tube was spun for 1-2 s using a benchtop microcentrifuge and placed on a magnetic rack. The binding solution was removed and discarded. The Nanobind disk was washed by adding 350 µl of ONT long fragment buffer (LFB) and gently tapping five times to mix. The tube was spun for 1-2 s using a microcentrifuge and placed on a magnetic rack. The ONT LFB was removed and discarded, taking care not to pipette DNA attached to the Nanobind disk. This LFB wash was repeated. The tube was then spun briefly (microcentrifuge) to move the Nanobind disk to the bottom of the tube. DNA was eluted from the Nanobind disk by adding 125 µl of ONT EB to the tube. The tube was incubated at room temperature for 30 min, gently tapped five times (to mix) and then incubated at room temperature for another 30 min. Before the eluate was removed from the tube, the fluid was slowly aspirated four times. The eluate was transferred to a new 1.5-ml Eppendorf DNA lo-bind tube with a wide-bore P200 pipette. The eluate was then pipette-mixed twice with a wide-bore P200 pipette.

The rapid adapter (RAP) was then added to the DNA preparation. To 120 µl of the eluate (from above), 3 µl of ONT RAP was added. The mixture was pipette-mixed eight times with a wide-bore P200 pipette. The mixture was then incubated at room temperature for 15 min and pipette-mixed another eight times with a wide-bore P200 pipette.

A final library clean-up step removed unligated adapters. In brief, 120 µl of Circulomics EB was added to the 123 µl of RAP reaction mixture from above. The mixture was slowly pipette-mixed three times with a wide-bore P1000 pipette set to 240 µl; each aspiration took approximately 10 s and each dispense approximately 10 s. A 5-mm Nanobind disk was added to the reaction mixture, followed by 120 µl of Circulomics buffer NAF10. Mixing was accomplished by gentle tapping. The tube was incubated at room temperature for 5 min without agitation or rotation. During the 5-min incubation, the tube was gently tapped five times (for 2-3 s each time). The tube was spun for 1-2 s using a microcentrifuge and placed on a magnetic rack. The binding solution was discarded. Next, 350 µl of ONT LFB was added to the tube and mixed by gentle tapping five times. The tube was then spun for 1-2 s using a microcentrifuge and placed on a magnetic rack. The ONT LFB was removed and discarded. Next, the Nanobind disk was washed by adding another 350 µl of ONT LFB. The tube was gently tapped five times to move the LFB over the surface of the disk and then incubated at room temperature for 5 min. The tube was then spun for 1-2 s using a microcentrifuge and placed on a magnetic rack, and the ONT LFB was removed and discarded. The tube was spun briefly in a microcentrifuge to move the Nanobind disk to the bottom of the tube. To elute the DNA from the Nanobind disk, 126 µl of ONT EB was added to the tube. The tube was incubated at room temperature for 30 min, gently tapped 5-10 times and incubated at room temperature for another 1-2 h. The eluate was then transferred to a new 1.5-ml Eppendorf DNA lo-bind tube using the same technique described above, spinning any remaining eluate off the Nanobind disk before removing it from the tube. The mixture was then pipette-mixed 2-3 times with a wide-bore P200 pipette. Before sequencing, the library was stored overnight at 4 °C to promote maximal DNA solubilization.

ONT sequencing buffer (SQB) (68 µl) was added to 82 µl of the eluate from above. The mixture was pipette-mixed four times with a wide-bore P200 pipette set to 150 µl; each 150-µl aspiration took 10-20 s, as did each 150-µl dispense. The sample was then incubated at room temperature for 10 min. Next, the sample was pipette-mixed eight times with a wide-bore P200 pipette set to 150 µl. Before the library was loaded, the flow cell was primed with a flush buffer/flush tether mixture. The library was then added to the flow cell. The mixture was viscous but loaded smoothly within about 1 min.
Some samples took up to 2 min to load. Sequencing runs were refuelled every 6 h. Base calling was performed using Guppy (v.4.0.11) with default parameters and the high-accuracy PromethION model (dna_r9.4.1_450bps_hac_prom.cfg).

We prepared Omni-C libraries from each cell line using the Dovetail Omni-C kit (Dovetail Genomics), with the following modifications. First, we aliquoted 1 million cells for fixation with formaldehyde and DSG. We digested the chromatin with DNase I until DNA fragments of the desired length were obtained. Following the protocol, we end-repaired the chromatin, ligated a biotinylated bridge oligonucleotide and then ligated the free chromatin ends. We reversed the crosslinks and purified the proximity-ligated DNA. We converted the DNA into an Illumina sequencing library with a Y-adaptor using the NEB Ultra II library preparation kit (NEB). We enriched for ligation products with a streptavidin bead capture on the final library. Each capture reaction was then split into two replicates before the final PCR enrichment step to maintain complexity. All libraries were uniquely dual-indexed and sequenced on an Illumina NovaSeq platform with a read length of 2 × 150 bp.

We describe the main automated and manual steps taken before, during and after assembly. A set of Workflow Description Language (WDL)-formatted assembly workflows, capturing each step used to filter reads for adapters and to run Hifiasm, is available from Dockstore (https://dockstore.org/organizations/humanpangenome/collections/hifiasm). All assemblies were produced with this workflow, run on AnVIL71. Cleaning the assemblies and fixing a handful of structural issues were performed through a combination of automated workflows and manual curation. Notebooks documenting the manual curation are available at GitHub (https://github.com/human-pangenomics/hpp_production_workflows/tree/master/assembly/y1-notebooks).

Before producing the assemblies, we detected and removed reads containing PacBio adapters using the bash script in the HiFiAdapterFilt repository72 (commit 64d1c7b). The script first creates a database of PacBio adapter sequences, as follows:

It is then run with tuned parameters to detect adapter-containing reads, as follows:

For 43 of the 47 samples, we removed fewer than 0.15% of the reads. All 29 HPRC-selected samples were among these 43 samples, indicating the low level of adapter contamination in the HiFi data produced by the HPRC. HG005, one of the other 18 samples, had the highest percentage of adapter contamination, at about 1% (Supplementary Fig. 39).

The removed reads were then aligned to the T2T-CHM13 (v.2.0) reference to ensure that there was no chromosome- or locus-specific bias in the filtering process. Supplementary Fig. 40 shows a snapshot from the IGV browser73 illustrating the coverage of adapter-containing reads along the genome. The positions of the reads were distributed almost uniformly along the genome and, with the exception of the centromeres, we found almost no regions covered by more than two adapter-containing reads, even in HG005, which had the highest contamination percentage.

The trio mode of Hifiasm requires haplotype-specific k-mers to phase the assembly graph. To generate these k-mers, we used the parental Illumina short reads of the 47 HPRC samples, which are publicly available from the 1KG dataset19. For each parental short-read sample, we used yak (v.0.1)74 to generate a k-mer hash table, run once for each paternal and maternal read set:

The adapter-filtered HiFi reads were then assembled with Trio-Hifiasm (v.0.14), together with the parental k-mer tables, to produce haplotype-resolved assembly graphs. Only sample HG002 was reassembled with Trio-Hifiasm (v.0.14.1), as explained in more detail in the next section.

Hifiasm produces each haplotype in GFA format. Each haplotype-specific GFA file was then converted to FASTA format using gfatools75. The assemblies produced by Trio-Hifiasm (v.0.14) were released as v.2 after the three cleaning steps described at the end of this section had been performed.

We used asmgene from the minimap2 repository (https://github.com/lh3/minimap2/tree/master/misc)76 to count the apparent gene duplications in each assembly produced by Trio-Hifiasm (v.0.14). asmgene cannot distinguish true duplications from false ones; by reviewing its results, we were able to find duplication trends and detect any outliers. This evaluation is a proxy for detecting large-scale duplication errors. We used the Ensembl v.99 cDNA sequences77 as the input genes for running asmgene.

Three samples were considered outliers on the basis of their numbers of gene duplications. To determine the cause of this issue, we aligned the HiFi reads back to these assemblies and examined the depth of coverage and the mapping quality. This showed that samples HG01358, HG01123 and HG002 contained false duplications on at least one contig, of about 55 Mb (centromeric), 14 Mb and about 70 Mb in length, respectively. In the assembly graphs of HG01358 and HG01123, HiFi reads occurring more than once in the duplicated regions were used as anchors to manually determine the exact boundaries of the duplicated region within the contig. The two contigs were then fixed manually by breaking them at the duplication start and end points and discarding the duplicated sequence from the assembly. In detail, for HG01123 we discarded the duplicated interval; for HG01358, we broke the affected contig at the duplicated interval, discarded that interval and renamed the resulting pieces. To address the false duplication in HG002, we reassembled it using a newer version of Trio-Hifiasm (v.0.14.1), which was reported not to have this issue.

We also assessed the phasing accuracy of the assemblies using yak (see below). We detected one large misjoin in a maternal contig of the HG02080 assembly. It contained an approximately 22-Mb-long paternal block in the middle of the contig, which caused two switch errors at the edges of this block. This block was manually discarded from the assembly, and the contig was split into two smaller blocks; in detail, in HG02080 we discarded the misjoined interval and kept the two flanking intervals as renamed contigs.

Finally, we searched for chromosome-to-chromosome misjoins using the Minigraph pangenome (see below for construction details). A 'chromosome-misjoin junction' was defined by a chimeric Minigraph alignment (see below) composed of ≥1-Mb sub-alignments to different chromosomes.

To clean the raw assemblies, we performed three additional steps: masking the remaining HiFi adapters, dropping contigs that were entirely contamination, and removing any redundant mitochondrial contigs.

In the first cleaning step, the sequence of the PacBio SMRTbell adapter was aligned to each assembly with tuned parameters. We extracted only hits with two or fewer mismatches and a hit length greater than 42 nt. In addition, eukaryotic adapters in each assembly were identified with the screening tool of ref. 78. The combined adapter hits (when present) were hard-masked in the assemblies using a WDLized bedtools maskfasta command (https://dockstore.org/workflows/github.com/human-pangenomics/hpp_production_workflows/maskassembly:master:tab=info).

In the second cleaning step, we detected mitochondrial contigs and contigs of non-human sequence from other organisms (for example, bacteria, viruses and fungi). These contigs were then removed from the assemblies using a WDLized version of samtools faidx. Notably, contigs containing nuclear mitochondrial DNA were not discarded.

In the final cleaning step, we selected one contig as the best mitochondrial contig for each diploid assembly. To do so, the mitochondrial DNA sequence (RefSeq identifier NC_012920.1) was aligned to each diploid assembly with tuned parameters. We then selected the contig with the highest mapping score and the lowest number of mismatches as the best mitochondrial contig (when more than one best contig existed, we chose one at random). This contig was then rotated and flipped (if necessary) to match the start position and orientation of NC_012920.1 and added to the maternal assembly of the corresponding sample. Only sample HG01071 did not yield any identifiable mitochondrial contig.

The masked and cleaned assemblies were then submitted to GenBank, where they underwent another round of adapter masking and removal of contamination, which was mostly Epstein-Barr virus (EBV), used to generate LCLs. The final cleaned assemblies were downloaded from GenBank, and the contig identifiers were prefixed with the sample name and a haplotype integer (where 1 = paternal and 2 = maternal). For example, the contig named JAGYVH010000025 in sample HG02257 was renamed HG02257#2#JAGYVH010000025.
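This renaming step is mechanical; the sketch below (not the HPRC release code; the file names are hypothetical) shows how the sample#haplotype# prefix can be applied to every contig header of a FASTA file:

```python
# Minimal sketch: prefix FASTA contig names with "<sample>#<haplotype>#",
# for example JAGYVH010000025 -> HG02257#2#JAGYVH010000025.
def rename_contigs(in_fasta, out_fasta, sample, haplotype):
    with open(in_fasta) as fin, open(out_fasta, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                contig = line[1:].split()[0]  # keep the identifier, drop any description
                fout.write(f">{sample}#{haplotype}#{contig}\n")
            else:
                fout.write(line)

# Hypothetical usage (haplotype 2 = maternal):
# rename_contigs("HG02257.maternal.fa", "HG02257.renamed.fa", "HG02257", 2)
```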
The renamed assemblies were then released to our Amazon Simple Storage Service (S3) and Google Cloud Platform (GCP) buckets. During the download from GenBank, three assemblies (HG00733 paternal, HG02630 paternal and NA21309 maternal) were truncated prematurely, which resulted in missing sequence. Notably, NA21309 is missing its mitochondrial contig. Details can be found in the HPRC's year 1 assembly GitHub repository (https://github.com/human-pangenomics/hpp_year1_assemblies). The assemblies held at the International Nucleotide Sequence Database Collaboration were not truncated, but the truncated copies were retained in S3 and GCP because they were used to construct the pangenomes.

After submission to GenBank, the assemblies were aligned to CHM13 using Winnowmap, and multiple contigs were found to be unmapped. These contigs were BLASTed and found to be almost entirely EBV sequence. GenBank confirmed (personal communication) that these unplaced contigs should have been flagged as contamination, but because the genomes were already in active use, they elected not to remove them at that point. A list of the contigs that should be removed can be found in the year 1 GitHub repository (https://github.com/human-pangenomics/hpp_year1_assemblies/blob/main/genbank_changes/y1_genbank_remaining_potential_contamination.txt).

Several QC steps were performed using workflows written in WDL, run on AnVIL and available on Dockstore (https://dockstore.org/workflows/github.com/human-pangenomics/hpp_production_workflows/standardqc). The individual tools in the workflow run in Docker containers with pinned tool versions, for consistency and reproducibility; details are available in the deposited workflows on Dockstore. The workflow takes the short-read data of the parents and the child sample, along with the two assembled haplotypes, and produces analyses of the various quality metrics generated by the tools described below. For each task, the workflow produces a small human-readable summary file, which is also easily parsed for summarization steps, as well as the full output of the tool for manual inspection. The specific tool invocations can be determined from the deposited workflows and are described in the subsequent sections.

Contigs were aligned to CHM13 (v.2.0) with Minigraph (v.0.18) and processed using the following command line:

The 'misjoin' command reported a chromosome join if a contig had two ≥1-Mb alignments to two different chromosomes.

The assembly continuity of each haplotype was assessed using QUAST79. These statistics include the total assembled sequence, the number of assembled contigs and the contig NG50 (assuming a genome size of 3.1 Gb). All reference-based analyses were skipped.

QUAST was invoked through the following command:

Assembly QV was determined using two separate k-mer-based tools. The first was yak20. Yak's QV estimation was performed on each haplotype separately. The k-mer databases for yak were generated using the following command:

QV estimates with yak were generated using the following command:

Assembly QV was also determined using Meryl and Merqury80. Meryl generates the k-mer databases, and Merqury uses the two haplotypes jointly to determine haplotype-specific QVs.

The k-mer databases for Meryl were generated using the following commands. Databases were generated separately for each read file and then merged, and parent-specific k-mers (hap-mers) were generated.

QV estimates with Merqury were generated using the following command:

As a complementary and stratified evaluation of assembly quality, we used the GIAB assembly benchmarking pipeline to compare assembly-based variant calls with GIAB's small-variant benchmark (v.4.2.1) for the two GIAB samples assembled in this work: HG002 and HG005. We evaluated the HG002 and HG005 HPRC assemblies aligned to GRCh38. Variants were called using Dipcall (v.0.3) (which uses minimap2 (v.2.24))38. We used the -z200000,10000 parameters to improve alignment contiguity, which, as described previously, can improve variant recall in regions with dense variation, such as the major histocompatibility complex81. Small-variant performance was evaluated with hap.py (v.3.15)82 against the v.4.2.1 benchmarks of high-confidence SNPs, small indels and homozygous reference calls for the GIAB samples HG002 and HG005. Comparisons were intersected with the associated Dipcall region file (dip.bed) to evaluate recall inside and outside the assembled regions. To better compare complex variants, hap.py was run with vcfeval83. Variant calls were stratified using the GIAB stratifications (v.3.0)84, which stratify true-positive, false-positive and false-negative variants in challenging and targeted regions of the genome.

Assembly-wide phasing was assessed using yak and described with two statistics: the switch error rate and the Hamming error rate. The switch error describes the number of times two adjacent phased variants are incorrectly switched between the maternal and paternal haplotypes. The Hamming error rate relates to the total number of mis-phased variants per assembled contig. Yak uses the parental short-read sequencing to generate phasing statistics for each haplotype separately.

Yak k-mer databases were generated for the sample and the two parental read sets (as described above). We generated the phasing metrics using the following command:

Another way to compute this assessment is to use Hi-C reads, which do not require trio information. We computed the switch error rate for local phasing assessment and the Hamming error rate for global phasing assessment. We implemented an efficient k-mer-based method as in ref. 24 and used the majority of Hi-C read support to detect switch errors at heterozygous sites. In this procedure, we first identified heterozygous (het) k-mers from the phased assemblies using 31-mers. We then used these 31-mers to map the Hi-C reads to the assemblies. We considered a haplotype switch to have occurred if five or more reads supported a switch between consecutive het sites in the assembly. For each het pair, we noted whether the Hi-C data supported or did not support the phasing. A switch error was counted when the support at a het site favoured the opposite phase relative to the previous heterozygous site. The switch error rate is the number of local switches divided by the number of heterozygous sites. We performed the switch calculation across the whole contig for all contigs. For the Hamming error calculation, we considered the Hamming distance at the whole-contig level divided by the number of heterozygous sites. This measure gives a global view of phasing errors and implicitly penalizes any long switches within a contig.

The following describes the generation and cleaning of the HiFi alignments of the HPRC assemblies and the running of Flagger (v.0.1), a read-based pipeline for evaluating diploid (dual) assemblies. All of the WDL-based workflows for running these steps are deposited in a Dockstore collection (https://dockstore.org/organizations/humanpangenome/collections/flagger-secphase).

We aligned the HiFi reads of each sample to its diploid assembly. Alignments were produced using Winnowmap (v.2.03) with the following command:

For all samples, we used the full HiFi read sets listed in Supplementary Table 1, except for HG002, for which we downsampled the read set to 35×.

To exclude unreliable alignments, we removed all chimeric alignments and all alignments shorter than 2 kb or with a gap-compressed mismatch ratio higher than 1%. Because the assemblies are diploid, and reads aligned to homozygous regions are expected to have low mapping quality, we did not filter alignments on the basis of their mapping quality. In Supplementary Fig. 41, we plot a histogram of the mapping qualities and the distribution of the alignment identities for one sample, HG00438. Statistics are plotted for three sets of alignments: to the diploid assembly and to each haploid assembly (maternal and paternal) separately. They show that when the diploid assembly is used as the reference, the reads have higher identity, but a large fraction of the reads have a mapping quality below 10.

Typically, in highly homozygous regions, the aligner may fail to pick the correct haplotype as the primary alignment because of read errors or heterozygosity. To detect these cases, we searched for secondary alignments whose scores were almost as high as the primary alignment of the same read. For each such read, we produced a pseudo-multiple alignment of the read sequence and the assembly blocks captured by all of the secondary and primary alignments. Using this alignment, we searched for read bases that mismatched in at least one alignment but not in all alignments. We call such bases single-nucleotide markers. For each alignment, we computed a consistency score by summing, with a negative sign, the base qualities of the inconsistent single-nucleotide markers. We then sorted the alignments (whether primary or secondary) by this score. If the best alignment was a secondary alignment, we assigned the primary flag to this alignment and removed the other alignments. Across the 47 HPRC samples, the total percentage of re-phased reads ranged from 0.03% (HG03453) to 0.44% (HG005).
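To make the marker-based scoring concrete, the following is a minimal sketch of the consistency score described above (not the Secphase implementation); the per-alignment marker summaries are hypothetical inputs:

```python
# Minimal sketch of the single-nucleotide-marker consistency score: each
# alignment is summarized by the base qualities of the markers at which it
# disagrees with the assembly, and the alignment with the highest (least
# negative) score is promoted to primary.
def marker_score(mismatched_marker_quals):
    # sum of base qualities at inconsistent markers, with a negative sign
    return -sum(mismatched_marker_quals)

def choose_primary(alignments):
    """alignments: list of (target_contig, is_primary, mismatched_marker_quals)."""
    scored = [(marker_score(quals), contig, is_primary)
              for contig, is_primary, quals in alignments]
    best_score, best_contig, was_primary = max(scored)
    return best_contig, was_primary

alns = [("hap1_ctg7", True, [30, 25, 40]),  # primary, three inconsistent markers
        ("hap2_ctg3", False, [20])]         # secondary, one inconsistent marker
print(choose_primary(alns))  # -> ('hap2_ctg3', False): the secondary wins
```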
These rates indicate that only a small fraction of the reads were relocated by this method. This step was performed with the Secphase (v.0.1) workflow, which is available in the Dockstore collection (https://dockstore.org/organizations/humanpangenome/collections/flagger-secphase).

By calling variants, it is possible to detect the regions that need polishing (that is, errors) or that have reads aligned from the wrong haplotype because of mis-phasing. We used DeepVariant (v.1.3.0) with the parameter --model_type=PACBIO to call variants from these alignments. The variants were then filtered to include only biallelic SNPs with a frequency higher than 0.3 and a genotype quality higher than 10.

Given the biallelic SNPs, we found the alignments that supported the alternative alleles and removed them from the BAM file. For this step, we implemented and used a program that runs the following command:

To evaluate the read mappings produced by our diploid alignment process, we used the following five steps, which are combined into a pipeline that we call Flagger. Flagger essentially fits a mixture model to the depth of coverage of the read-to-diploid-assembly alignments and then classifies each block into one category, predicting the accuracy of the assembly at that position.

First, after producing and cleaning the HiFi alignments, we computed the depth of coverage for each assembly base (outputting the bases with zero coverage):

The output of samtools depth was then converted into a more efficient format with a dedicated suffix. This format is implemented specifically for Flagger and is more efficient because consecutive bases with the same coverage take only one line. We implemented a program that converts the output of samtools depth into this format.

In the second step, the frequencies of the coverage values were computed. The output file, which has its own suffix, is a two-column tab-delimited file: the first column shows the coverage values and the second column shows the frequencies of those coverage values.

A Python script takes this frequency file, fits a Gaussian mixture model and finds the best parameters through expectation-maximization. This mixture model consists of four main components, each representing a specific type of region:

Note that, because of coverage differences between regions, the model components may change across regions, and the resulting systematic differences can affect the accuracy of the partitioning process. To make the coverage analysis more sensitive to local patterns, the diploid assembly was split into windows (5-10 Mb in length), and a distinct model was fitted for each window. Before fitting, we split the whole-genome coverage file produced in the first step into one coverage file per window. We implemented and ran the splitting as follows:

This produced a list of coverage files, one per window.

We then repeated the steps above for each resulting coverage file.

One important observation was that, for short contigs, the coverage distribution is often too noisy for the mixture model to be fitted satisfactorily. To address this issue, we performed the window-specific coverage analysis only for contigs longer than 5 Mb. For shorter contigs, we used the results of the whole-genome analysis.

Using the fitted models, we assigned each coverage value to one of the four components (erroneous, duplicated, haploid and collapsed). For each coverage value, we chose the component with the highest probability. For example, a coverage value of 0 is almost always assigned to the erroneous component. In Supplementary Fig. 42, the coverage intervals are coloured according to their assigned components.

According to the article describing the complete human genome, there are some satellite arrays (for example, HSat1, HSat2 and HSat3) in which the HiFi coverage drops or increases systematically because of biases in sample preparation and sequencing. Such platform-specific biases can mislead the pipeline. As a result, the falsely duplicated component may contain a mixture of false duplications and coverage-biased blocks. A similar effect occurs for the collapsed component.

To incorporate such coverage biases and correct the results in the corresponding regions, we first found, for each haploid assembly, the regions in which the coverage was expected to be biased. To find these regions, we aligned the assembly contigs to the T2T-CHM13 (v.1.1) and GRCh38 (chromosome Y) references and projected the HSat coordinates onto the assemblies (using Python scripts). We then adjusted the mixture model for the blocks assigned to each HSat type according to the coverage expected in the corresponding HSat, adjusting the parameters that serve as the starting point of the expectation-maximization process. For HSat1, HSat2 and HSat3, we set the mean sequencing coverage to 0.75×, 1.25× and 1.25× of the global mean, respectively. Finally, we re-partitioned each HSat according to the inferred coverage thresholds and replaced the previously assigned components with the new ones.

In some cases, the duplicated component was mixed in with the haploid one, typically when the coverage dropped systematically within a haploid block or when most of a long contig was falsely duplicated. To address this issue, we used another indicator of false duplications: the accumulation of alignments with very low mapping quality (MAPQ). We generated another coverage file using only the alignments with MAPQ > 20. Whenever we found a region flagged as falsely duplicated with more than five high-quality alignments, we changed its flag to haploid.

After the corrections made in step 5, we merged the blocks of each component that were within 1,000 bp of each other and marked the overlap of any two components after merging as 'unknown' to indicate that these blocks could not be assigned correctly. The BED files produced by Flagger are available in the HPRC S3 bucket (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/e9ad8022-1b30-11ec-ab04-0a13c5208311-COVERAGE_ANALYSIS_Y1_GENBANK/FLAGGER/APR_08_2022/final_hifi_based/flagger_hifi_asm_simplified_beds/).

To estimate the false-positive rate of Flagger, we applied it to the T2T-CHM13 (v.1.1) reference. The direct output of Flagger showed that about 12.77 Mb (about 0.41%) of the T2T-CHM13 reference assembly was flagged as potentially unreliable. The HPRC assemblies contain hardly any rDNA arrays, but there are modelled sequences for the rDNA arrays in the T2T-CHM13 (v.1.1) reference. These arrays were flagged as either falsely duplicated or collapsed, which suggests that Flagger, when using HiFi reads, may not evaluate rDNA arrays correctly. Therefore, for a fair comparison, we excluded the rDNA arrays (about 9.92 Mb in total) from the reference evaluation, which reduced the number of unreliable bases to 5.58 Mb (about 0.18%). We also identified about 2.76 Mb of regions beside the HSat2 array of chromosome 1 that were mis-flagged as collapsed. This mis-flagging was an effect of the systematic coverage elevation in the adjacent HSat2, which altered the fitted mixture model. After manually fixing this mis-flagging, we were left with about 2.82 Mb (0.09%) of unreliable blocks in T2T-CHM13 (v.1.1). This number is about 9.3-fold lower than the average for the HPRC assemblies. These unreliable blocks were mainly a combination of 'unknown' blocks, which could not be assigned correctly, and regions with HiFi-specific coverage dropouts. The results of this analysis are available in the HPRC S3 bucket (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/e9ad8022-1b30-11ec-ab04-0a13c5208311-COVERAGE_ANALYSIS_Y1_GENBANK/FLAGGER/APR_08_2022/final_hifi_based/t2t-chm13/).
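As an illustration of the coverage-model idea at the heart of Flagger, the sketch below is not the Flagger code (which fits a constrained mixture per window by expectation-maximization on coverage frequencies); it simply fits scikit-learn's GaussianMixture to synthetic per-base depths, with the four components initialized near 0×, 0.5×, 1× and 2× the haploid coverage:

```python
# Minimal sketch of coverage-based block classification (assumes scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_coverage(depths, haploid_cov=35.0):
    X = np.asarray(depths, dtype=float).reshape(-1, 1)
    # initialize the four components near 0x, 0.5x, 1x and 2x haploid coverage
    init = np.array([[0.0], [haploid_cov / 2], [haploid_cov], [2 * haploid_cov]])
    gmm = GaussianMixture(n_components=4, means_init=init, random_state=0).fit(X)
    rank = np.argsort(np.argsort(gmm.means_.ravel()))  # rank of each component's mean
    names = np.array(["erroneous", "duplicated", "haploid", "collapsed"])
    return names[rank][gmm.predict(X)]

rng = np.random.default_rng(0)
depths = np.concatenate([rng.poisson(35, 9000),      # correctly assembled haploid bases
                         rng.poisson(17, 500),       # falsely duplicated (half coverage)
                         rng.poisson(70, 400),       # collapsed (double coverage)
                         np.zeros(100, dtype=int)])  # erroneous (no support)
print(dict(zip(*np.unique(classify_coverage(depths), return_counts=True))))
```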
Repeat masking of each assembly was performed iteratively with RepeatMasker (v.4.1.2-p1). The first pass masked using the default human repeat library, and the second pass used a repeat library augmented with CHM13 satellite DNA sequences, applied to the original assembly after hard-masking the DNA masked in the initial pass. The augmented repeat library (final_consensi_gap_nohsat_teucer.embl.txt) is available on Zenodo (https://doi.org/10.5281/zenodo.5537107), and a Snakefile for parallel repeat masking (RepeatMaskGenome.snakefile) is available at GitHub (https://github.com/chaissonlab/segdupannotation). The combination of these two passes produced the complete repeat masking.

After the repetitive sequences in each assembly had been masked, SDs were annotated using SEDEF85. Duplications annotated with more than 20 copies correspond to unannotated mobile elements and were excluded from the analysis. The pipeline for annotating SDs is available at GitHub (https://github.com/chaissonlab/segdupannotation/releases/tag/vhprc).

The reliable regions of all haplotype assemblies were aligned to T2T-CHM13 (v.2.0), subdivided into 5-kb windows and intersected with the SD annotations of T2T-CHM13. At the time of analysis, SD annotations were not available for chromosome Y of T2T-CHM13 (v.2.0); moreover, the chromosome Y added to T2T-CHM13 comes from the HPRC HG002 sample. Chromosome Y was therefore excluded. For each class of unreliable region (unknown, erroneous, duplicated, collapsed and haploid), we computed the average number of base pairs overlapping SDs on the haplotype assemblies and annotated each 5-kb window with its most representative overlapping SD (the SD with the highest identity and length). Then, using the most representative SDs, we computed the average length and identity of the SDs across all 94 haplotypes and compared the lengths and identities of the SDs overlapping the different error types in the assemblies. The code for this analysis is available at GitHub (https://gist.github.com/mrvollger/3bdd2d34f312932c12917a4379a555973).

To create high-confidence annotations, a new Ensembl annotation pipeline was developed. The pipeline clusters and maps spatially proximal genes in parallel (to avoid the problem of mapping individual genes near identical paralogues) and attempts to resolve inconsistent mappings by considering both the synteny of the GRCh38 annotation in the gene neighbourhood and the identity and coverage of the underlying mappings.

A reference gene set was created from a subset of GENCODE (v.38) genes and mapped to the HPRC assemblies through a two-pass process. The set excluded readthrough genes and genes on patches or haplotypes, and it included only one copy of each gene on the X/Y PAR regions (in the Ensembl representation of PAR genes, the genes are modelled only once, on chromosome X).

First, to minimize the difficulty of mapping genes near identical paralogues, a jumping window of 100 kb in length was used to identify clusters of genes to map in parallel (Supplementary Fig. 43). The initial window was positioned at the start of the 5′-most gene of each chromosome in the GRCh38 reference and extended 100 kb from the start of the gene. Any gene overlapping the window, fully or partially, was included in the cluster. The next non-overlapping 3′ gene was then identified, a new window was created and the process was repeated. This resulted in clustered genes and non-clustered genes (a gene was considered non-clustered when its window contained only one gene). The regions to map were then identified on the basis of the start of the 5′-most gene and the end of the 3′-most gene in each cluster (or the 5′ and 3′ ends of the gene in the case of non-clustered genes).

For each region defined in the previous step, anchors were then selected to help map the region onto the target genome. Two 10-kb anchors were created from the 5′ and 3′ edges of the region, along with a central 10-kb anchor around the midpoint of the region in the GRCh38 genome. The sequences of these anchors were mapped to the target genome using minimap2 (ref. 86) with the following command:

The resulting hits were examined to identify high-confidence regions in the target genome. High-confidence regions were those in which all three anchors were on the same top-level sequence, in colinear order, each with ≥99% sequence identity and ≥50% hit coverage, and with distances between the anchors similar to those in the reference genome. If no suitable candidate region was found with all three anchors, pairs of mapped anchors were assessed in a similar manner.

The sequence of the selected region or regions was then retrieved and aligned to the corresponding GRCh38 region using MAFFT. For each gene, the corresponding exons were retrieved and projected through the alignment of the two regions. Transcripts were then reconstructed from the projected exons. For each transcript, the coverage and identity were computed from an alignment to the parent transcript in GRCh38.

If the resulting transcript had a coverage of <98% or an identity of <99%, the parent transcripts were aligned to the target region using minimap2 in splice-aware mode, with the high-quality setting for Iso-Seq/cDNA transcripts enabled. The maximum intron size was set to 100 kb by default. For transcripts with reference introns larger than 100 kb, the maximum intron size was scaled and set as 1.5 times the length of the longest intron (to allow some variability):

For each transcript that mapped to the target genome, the quality of the mapping was assessed on the basis of aligning the original reference sequence with the newly identified target sequence. Again, if the coverage or identity of the aligned sequence was <98% or <99%, respectively, the reference transcript sequence was re-aligned to the target region, this time using Exonerate87. Exonerate, although slower than minimap2, has the ability to handle very small exons and can incorporate CDS data to preserve the CDS (introducing pseudo-introns as needed). The following command was used:

When more than one approach was used to model the transcript, the mapping with the highest combined identity and coverage was selected.

For genes not mapped through the initial regional anchors, a second approach was used. The expected location of the gene was located using high-confidence genes mapped during the first phase. High-confidence mappings were those for which there was a single mapped copy of the gene, all the transcripts had mapping scores of 99% coverage and identity on average, and the gene also had a gene neighbourhood similar to the neighbourhood in the reference (at least 80% of the same genes in common for the 100 closest neighbouring genes in the reference). After this step, the entire genome region underlying the missing gene, including a 5-kb flanking sequence, was mapped against the target genome using minimap2:

The resulting hits were then filtered on the basis of overlap with the expected region in which the missing gene should lie.
If no expected region had been calculated (cases in which no pair of high-confidence genes could be found to define the 5′ and 3′ boundaries of the expected location of the missing gene, for example, at the edge of a scaffold), or if no hit overlapping the expected region was found, the top reported hit was used, provided that it passed an identity cut-off of 99%. The selected hit or hits were then extended on the basis of how much of the original reference gene they covered, to ensure that minor local variants between the reference and target regions did not lead to the target region being truncated. Once extended, the remaining hits were then clustered on the basis of genomic overlap and merged into unique regions. We then attempted to map the missing genes to these regions using a process identical to the one described above for the initial mappings, involving MAFFT, minimap2 and Exonerate.

To minimize the occurrence of mis-mapped paralogues, each gene was checked for exon overlap in both the target and the reference. If the overlapping genes were not identical at a locus between the reference and the target, a conflict was identified. For each gene present, filtering was done to reduce or remove the conflict on the basis of a number of factors, including whether the genes were in the expected location, whether the genes were high-confidence mappings, the average per cent identity and coverage of the transcripts of the genes, and the neighbourhood score. When it was not possible to resolve a conflict between two genes, both were kept. This concluded the primary mapping process.

After this process, potential recent duplications were identified. To search for recent duplications, the canonical transcript of each gene (the longest transcript in the case of noncoding genes, or the transcript with the longest translation followed by the longest overall sequence for protein-coding genes) was selected and aligned across the genome using minimap2 in a splice-aware manner:

Mappings that had exon overlap with existing annotations from the primary mapping process on the target genome were removed. For new mappings that did not overlap existing annotations, the quality of the alignment was then assessed by aligning the mapped transcript sequence to the corresponding reference transcript to calculate the coverage and per cent identity of the mapping. Different coverage and per cent identity cut-offs were used for these mappings on the basis of the type of transcript mapped. Protein-coding and small noncoding transcripts used a coverage and identity cut-off of 95%, whereas long noncoding transcripts used a coverage and identity cut-off of 90%. Pseudogene transcripts had a lower coverage cut-off of 80%, but the same identity cut-off of 90% as long noncoding transcripts.

When looking for new paralogues, for cases in which multiple canonical transcripts mapped to a locus, a single representative transcript was selected. This was based on the following hierarchy of gene biotype groups: coding, long noncoding, pseudogene, small noncoding, and miscellaneous or undefined.

If there were multiple transcripts for the highest represented group, the transcript with the longest sequence was selected as the representative.

For the Ensembl and CAT gene annotation sets, we identified the locations of frameshifting indels by iterating over the coding sequence of each transcript and looking for any gaps in the alignment.
If the gap had a length that was not a multiple of 3, and its length was <30 bp (to remove probable introns from consideration), the gap was determined to be a frameshift and its location was saved to a BED file.

We also analysed the number of nonsense mutations that would cause early stop codons in both the Ensembl and CAT gene annotation sets. We identified nonsense mutations by iterating through each codon in the coding sequence of the predicted transcripts. If there was an early stop codon before the canonical stop codon at the end of the transcript, we saved the location in a BED file.

For both sets of mutations, we then lifted over the coordinates of the mutations onto the GRCh38 reference so that we could use existing variant call sets on GRCh38. We used to lift over each set of coordinates, using the GRCh38-based HAL file from the MC alignment. Then we used to intersect with the variant call file for each of the assemblies.

The following sample commands were used:

The VCF files used in this intersection were downloaded from the 1KG (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr$i.recalibrated_variants.vcf.gz), where $i was replaced with each chromosome number. From there, each chromosome VCF was split so that each sample was in its own file using . The chromosome files for each sample were combined into one VCF using .

Duplicated genes were detected as multi-mapped coding sequences using Liftoff88, supplemented by a complementary approach (gb-map) with multi-mapped gene bodies. The combined set was formed by including all Liftoff gene duplications and the duplicated genes detected by gb-map.

We ran Liftoff (v.1.6.3) to annotate extra gene copies in each of the assemblies. Liftoff was run with the flag to find additional copies of genes, with an identity threshold of at least 90%. An example command is below:

The additional copies of the genes were identified as such in the output gff3 by the field (equal to anything other than 0). For this analysis, we considered only genes that were multi-exon, protein-coding genes. The additional gene copies were further filtered to remove any genes outside the 'reliable' haploid regions as determined by the Flagger pipeline.

The gene-body mapping pipeline identifies duplicated genes by first aligning transcripts of protein-coding genes and pseudogenes (GENCODE v.38) to each assembly and then multi-mapping the genomic sequences of each corresponding gene. Alignments of at least 90% identity and 90% of the length of the original duplication were considered candidate duplicated genes. Candidates were removed if they overlapped previously mapped transcripts from other genes, low-quality duplications and genes identified through the CAT and Liftoff analyses.

To account for gene duplications in high-identity gene families, gene families were identified on the basis of the sequence alignments from gb-map. Genes that mapped reciprocally with 90% identity and 90% length were considered a gene family. A single gene was selected as the representative gene for the family, and any gene duplication in the family was counted towards that gene.
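The reciprocal-mapping criterion above amounts to building connected components over genes; the following minimal sketch (hypothetical gene names and pair list, not the gb-map code) groups genes into families with union-find and picks one representative per family:

```python
# Minimal sketch: cluster genes into families from reciprocal mappings
# (>=90% identity and length in both directions).
def gene_families(genes, reciprocal_pairs):
    parent = {g: g for g in genes}
    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g
    for a, b in reciprocal_pairs:          # union the two family roots
        parent[find(a)] = find(b)
    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    # choosing the alphabetically first member as representative is a sketch
    # convenience; duplications in any member count towards the representative
    return {min(members): sorted(members) for members in families.values()}

genes = {"SMN1", "SMN2", "NOTCH2", "NOTCH2NLA", "NOTCH2NLB", "RHD"}
pairs = [("SMN1", "SMN2"), ("NOTCH2", "NOTCH2NLA"), ("NOTCH2NLA", "NOTCH2NLB")]
print(gene_families(genes, pairs))
```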
Minigraph can rapidly perform assembly-to-graph mappings using a generalization of the minimap2 algorithm34. New SVs of at least 50 bp detected in the mapping can then be added to the graph. To construct a pangenome graph, one chosen reference assembly, GRCh38 in this case, was used as a starting graph, and the mapping and SV-addition steps were repeated greedily for each additional assembly. This iterative approach is analogous to partial order alignment (POA)89. Graphs constructed in this way describe the structural variation within the samples and provide a coordinate system across the reference and all insertions. Minigraph does not produce self-alignments; that is, it will never align a portion of the reference assembly onto another portion of the reference assembly. In this way, all reference positions have a unique location within the created pangenome. Minigraph (v.0.14) was used with options. The input order was GRCh38, CHM13 and then the remainder in lexicographic order by sample name.

Graphs constructed by Minigraph only contain structural variation (≥50 bp) by default. The aim of the MC pipeline is to refine the output of Minigraph to include smaller variants, down to the SNP level. Doing so allows the graph to comprehensively represent most variation, as well as to embed the input haplotypes within it as paths, which is important for some applications49. To remove noisy alignments from the MC pangenome, long (≥100 kb) non-reference sequences identified as being satellite, unassignable to a reference chromosome or apparently unaligned to the remainder of the assemblies were removed from the graph. This resulted in a pangenome with significantly reduced complexity that nevertheless maintained all sequences of the starting reference assembly and the large majority of those in the additional haplotypes. The MC pipeline is composed of the five steps described below and in more detail in ref. 35. The script and commands to reproduce this process can be found at GitHub (https://github.com/ComparativeGenomicsToolkit/cactus/blob/81903cb82ae80da342515109cdee5a85b2fde625/doc/pangenome.md#hprc-version-10-graphs). A newer, simpler version of the pipeline that no longer requires satellite masking can be found at GitHub (https://github.com/ComparativeGenomicsToolkit/cactus/blob/5fed950471f04e9892bb90531e8f63be911857e1/doc/pangenome.md#hprc-graph).

Paths from the reference, GRCh38, are acyclic in the MC graph. Paths from any other haplotypes can contain cycles (as a result of different query segments mapping to the same target), but they are relatively rare.

The PGGB uses a symmetric, all-to-all comparison of genomes to generate and refine a pangenome. We applied it to build a pangenome graph from all genome assemblies and references (both GRCh38 and CHM13). The resulting PGGB graph represents all alignment relationships between the input genomes in a single graph. The PGGB graph is a lossless model of the input assemblies that represents all of them equivalently. This arrangement enables all of our pangenome assemblies to be used as reference systems, a property that we used to explore the scope of pangenome variation in a total way. Owing to the ambiguous placement of variation in all-to-all pairwise alignments, many SV hotspots, including the centromeres, are transitively collapsed into loops through a subgraph representing a single repeat copy, a feature that tends to reduce the size of variants found in repetitive sequences. In contrast to MC, PGGB does not filter rapidly evolving satellite sequences or the regions that do not reliably align.
This increases its size and complexity relative to the MC graph and adds a significant amount of singleton sequence relative to the Minigraph and MC graphs. However, this property enables the annotations and coordinates of all assemblies in the pangenome to be related to the graph structure and utilized in subsequent downstream analyses. We applied the PGGB model to investigate the full pangenome and to integrate annotations established de novo on the diverse assemblies into a single model for analyses of pangenome diversity and of complex structurally variable loci (the MHC and the 8p inversion).

PGGB generates a pangenome graph in three phases. (1) Alignment: in the first phase, the wfmash aligner95 is used to generate all-vs-all alignments of the input sequences. This method, wfmash, applies the mapping algorithm of MashMap2 to find homologies at a specified length and per cent identity. It then derives base-level alignments using a high-order version of the WFA algorithm (wflign), which first aligns sequences in segments of 256 bp and then patches up the base-level alignment with local application of WFA. wfmash was designed and developed specifically for the problem of building all-to-all alignments for large pangenomes. (2) Graph induction: the input FASTA sequences and the PAF-format alignments produced by wfmash are converted to a graph (in GFA format) using seqwish96. This losslessly transforms the input alignments and sequences into a graph. (3) Graph normalization: we applied a normalization algorithm, smoothxg97, to simplify complex motifs that occur in STRs and other repetitive sequences, as well as to mitigate underalignment. The graph is first sorted using a path-guided stochastic gradient descent method98 that organizes the graph in one dimension to optimize path distances and graph distances. This sort provides a way to partition the graph into smaller pieces, over which we applied a multiple sequence alignment algorithm (abPOA)91. These pieces were laced back into a final graph. We iterated this process twice using different target POA lengths to remove boundary effects caused at the borders of the MSA problems. Finally, we applied GFAffix94 to remove redundant furcations from the topology of the graph.

To build the HPRC PGGB graph, we used both the CHM13 and GRCh38 references as a target and mapped all contigs against these with wfmash, requiring a full-length mapping at 90% total identity, and collecting all contigs that mapped to a given chromosome. Contigs that did not map under this arrangement were then partitioned using a split-mapping approach, requiring 90% identity over 50 kb to seed the mappings, and putting each contig into the chromosome bin for which it had the best split mapping. We thus initially partitioned the data into 25 chromosome sets: one for each autosome, one for each sex chromosome and, finally, the mitochondria.

We then applied PGGB (v.0.2.0+531f85f) to each partition to build a chromosome-specific graph. Run in parallel over 6 PowerEdge R6515 AMD EPYC 7402P 24-core nodes with 384 GB of RAM, this process requires 22.49 system days, or around 3.7 days wallclock. To develop a robust process for building the HPRC graph, the PGGB team iterated the build 88 times. The final chromosome graphs were compacted into a single ID space using , and then, for each reference (GRCh38 and CHM13), a combined VCF file was generated from the graph with (v.1.36.0/commit 375cad7).
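A minimal sketch of the split-mapping chromosome binning described above (not the actual partitioning code; the mapping triples are hypothetical stand-ins for wfmash PAF records) assigns each unplaced contig to the chromosome that collects its largest total of matched bases:

```python
# Minimal sketch: bin contigs by their best split mapping.
from collections import defaultdict

def bin_contigs(split_mappings):
    """split_mappings: iterable of (contig, chromosome, matched_bases)."""
    totals = defaultdict(lambda: defaultdict(int))
    for contig, chrom, matched in split_mappings:
        totals[contig][chrom] += matched
    # each contig goes into the chromosome bin with the most matched bases
    return {contig: max(per_chrom, key=per_chrom.get)
            for contig, per_chrom in totals.items()}

mappings = [("HG00438#1#ctg12", "chr1", 4_200_000),
            ("HG00438#1#ctg12", "chr2", 150_000),
            ("HG00438#1#ctg99", "chrX", 900_000)]
print(bin_contigs(mappings))  # -> ctg12 binned to chr1, ctg99 to chrX
```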
A handful of key parameters defined the shape of the resulting graph. First, in wfmash, we required >100 kb mappings at 98% identity. We mapped each HPRC assembly contig and reference chromosome (both GRCh38 and CHM13) to all the other 89 input haplotypes. To reduce complexity, and false-positive SNPs resulting from misaligned regions, we applied a minimum match length filter (in seqwish) of 311 bp. This meant that the graph that we induced was relatively 'underaligned' locally, and only through normalization in smoothxg did we compress the bubble structures that are produced. For smoothxg, our first iteration attempts to generate 13,033-bp-long POA problems, whereas the second uses 13,177 bp. These lengths provided a balanced trade-off between run time and variant detection accuracy.

In addition to a graph (in GFA), PGGB generates visualizations of the graph in one and two dimensions, which show both the topology (two dimensional) and the path-to-graph relationship (one dimensional). A code-level description of the build process is provided at GitHub (https://github.com/pangenome/HPRCyear1v2genbank).

Variant sites in the Minigraph graph and in the MC and PGGB graphs were discovered using (v.0.5)75 and 99, respectively. Large (>10 Mb) spurious deletions were removed from the MC and PGGB graphs using vcfbub (v.0.1.0)100 with options. Next, the variant sites were divided into small-variant (<50 bp) and SV (≥50 bp) sites. The SV sites were then annotated as described in the methods section of the article that describes Minigraph34. In brief, the longest allele sequence of each SV site was extracted and stored in the FASTA format. The interspersed repeats, low-complexity regions, exact tandem repeats, centromeric satellites and gaps in the longest allele sequences were then identified using RepeatMasker (v.4.1.2-p1) with the NCBI/RMBLAST (v.2.10.0) search engine and Dfam (v.3.3) database, SDUST (v.0.1)101, ETRF (commit fc059d5)102, dna-brnn (v.0.1)90 and (v.1.3)103, respectively. SDs were identified if the total node length in a site was ≥1,000 bp and ≥20% of the bases of these nodes were annotated as SD in the reference or in an individual assembly ('SD annotation' subsection). To find hits to the GRCh38 reference genome, minimap2 (v.2.24) with options was used to align the longest allele sequences to the reference genome. On the basis of the identified features, SV sites were classified into various repeat classes using (https://github.com/lh3/minigraph/blob/master/misc/mgutils.js) with minor modifications to enable it to work with the files derived from the MC and PGGB graphs.

We used the heaps tool of the odgi pangenome analysis toolkit98 to estimate how the euchromatic autosomal pangenome grows with each additional genome assembly added. Here we approximated euchromatic regions by non-satellite DNA, which was identified by dna-brnn in the construction of the MC graph (see the 'MC' subsection). Although the non-reference haplotypes of the MC graph do not contain satellite DNA, the PGGB graph does. Consequently, we subset the PGGB graph to the segments contained in the MC graph. We additionally excluded the reference haplotypes (GRCh38 and CHM13) from the analysis. We then sampled permutations of the 88 non-reference (neither GRCh38 nor T2T-CHM13) haplotypes. In each permutation, we calculated the size of the pangenome after adding the first 1, 2, …, N haplotypes in both graphs. This produced a collection of saturation curves, from which we derived a median saturation curve onto which we fitted a power-law function known as Heaps' law. The exponent of this function is generally understood to represent the degree of openness (or diversity) of a pangenome39.
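A minimal sketch of such a power-law fit, assuming a median saturation curve has already been computed across permutations (the numbers below are synthetic, not HPRC results):

```python
# Minimal sketch: fit Heaps'-law-style growth to per-genome pangenome gain.
import numpy as np
from scipy.optimize import curve_fit

def heaps(n, k, alpha):
    # new sequence gained by the n-th added genome ~ k * n^(-alpha);
    # alpha < 1 is conventionally read as an "open" pangenome
    return k * n ** (-alpha)

n = np.arange(1, 89)                                # 88 non-reference haplotypes
median_curve = 3.0e9 * n ** 0.15                    # synthetic median pangenome size (bp)
growth = np.diff(median_curve, prepend=0.0)         # per-genome gain
(k, alpha), _ = curve_fit(heaps, n[1:], growth[1:], p0=(1e8, 0.8))
print(f"alpha = {alpha:.2f} ({'open' if alpha < 1 else 'closed'} pangenome)")
```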
Summing up, we called to generate pangenome saturation curves for 200 permutations. In addition to calculating a non-permuted cumulative base count, we also counted the number of common (≥5% of all non-reference haplotypes) and core (≥95% of all non-reference haplotypes) bases in the pangenome graphs. To this end, we used a tool called panacus104 and supplied a list of the samples in which they are grouped according to their assigned superpopulation (). We repeated the count, this time including only segments of depth ≥2, that is, contained at least twice in any haplotype sequence.

Pangenome graphs were decomposed topologically into a set of nested subgraphs, termed snarls, that each correspond to one or a collection of genetic variants. These snarls were then converted to VCF format using 99. Large (>100 kb) deletions were removed from the MC and PGGB graphs using vcfbub (v.0.1.0)100 with options. To simplify the comparison of each individual's variants with other call sets, the multi-sample VCF file was converted into per-sample VCF files, and multiallelic sites were split into biallelic records. Owing to the limitations of the snarl decomposition, a snarl can contain multiple variants that cannot be decomposed further into nested snarls. If such a snarl were compared with truth calls directly, the evaluation would be inaccurate. We addressed this by comparing the reference and alternative allele traversals of each snarl to infer a minimal representation of the variants (Supplementary Fig. 44).

Variant sites in the MC and PGGB graphs were discovered using 99 and then decomposed on the basis of the allele traversals (see the subsection on decomposing pangenome graphs based on allele traversals). Multi-nucleotide polymorphisms and complex indels were further decomposed into SNPs and simple indels using RTG Tools (v.3.12.1)83 so that they could later be annotated with gnomAD. For comparison, read-based variant calls made with DeepVariant and calls made with Dipcall from the haplotype-resolved assemblies were also used. For each discovery method, small variants (<50 bp) were extracted and normalized using . Next, all per-sample VCF files were combined into one VCF file using , after dropping individual genotype information using . To annotate small variants with AFs from gnomAD105, the gnomAD (v.3.1.2) per-chromosome VCF files were downloaded and concatenated into one VCF file using . The VCF file was then compressed into a file in the gnotate format using from (v.0.2.7)106 with options . The small variants were annotated with gnomAD using .

The PacBio HiFi reads were aligned to the GRCh38 human reference genome with no alternatives using Winnowmap2 (v.2.03)107 with . The MD tags required by Sniffles were calculated using . The resulting BAM files were sorted and indexed using SAMtools.

For small variants, the two-pass mode of DeepVariant (v.1.1.0)107 with WhatsHap (v.1.1)108 was used to call SNPs and indels from the PacBio HiFi read alignments. The resulting VCF files were used as truth sets for small-variant benchmarking.

Three discovery methods were used to call SVs from the PacBio HiFi read alignments. For PBSV (v.2.6.2)41, SV signatures were identified using with to improve the calling performance in repetitive regions. SVs were then detected using with from the signatures. For SVIM (v.2.0.0)44, SVs were called using with . In contrast to PBSV and Sniffles, SVIM outputs all calls no matter their quality. To determine the threshold used for filtering low-quality calls, a precision–recall curve was generated across various quality scores by comparing with the GIAB (v.0.6) Tier 1 SV benchmark set for HG002 (Supplementary Fig. 45). Consequently, SVIM calls with a quality score lower than ten were excluded. For Sniffles (v.1.0.12b)42, SVs were discovered with . Unlike PBSV and SVIM, Sniffles does not generate consensus sequences of insertions by aggregating multiple supporting reads. Therefore, Iris (v.1.0.4)43 was used to refine the breakpoints and insertion sequences with . All resulting VCF files were sorted and indexed using BCFtools.

Three discovery methods were used to call SVs from the haplotype-resolved assemblies generated using Trio-Hifiasm.
For SVIM-asm (v.1.0.2)45, assemblies were aligned to the GRCh38 human reference genome with no alternatives using minimap2 (v.2.21)86 with and then sorted and indexed using SAMtools. SVs were called using with . The resulting VCF files were sorted and indexed using BCFtools.

For PAV (v.0.9.1)5, assemblies were aligned to the GRCh38 human reference genome with no alternatives using minimap2 (v.2.21)86 with options . These alignments were then trimmed to reduce the redundancy of records and to increase the contiguity of alignments. SVs, indels and SNPs were called by CIGAR string parsing of the trimmed alignments. Inversion calling in PAV uses a new k-mer density assessment to resolve the inner and outer breakpoints of flanking repeats, which does not rely on alignment breaks to identify inversion sites. This is designed to overcome limitations in alignment methodologies and to expand inversion calls, which result in duplications and deletions of sequence on the boundaries.

The Hall-lab pipeline is as documented in the WDL workflow (https://github.com/hall-lab/competitive-alignment/blob/master/call_assembly_variants.wdl) (commit 830260a). In brief, the maternal and paternal assemblies were aligned to the GRCh38 human reference genome using minimap2 (v.2.1)86 with options . Large indels (>50 bp) were detected using the 'call_small_variants' task, which is based on paftools (v.2.17-r949-dirty). Larger SVs were classified from the split alignments of the assembly contigs mapped to the reference genome using a series of custom Python scripts in the 'call_sv' task. Breakpoint-mapped SVs were then filtered on the basis of the reference-genome coverage of the assembly contigs (computed using (v.2.28.0)). For each haplotype assembly, a BED file of 'excluded regions' was defined, comprising genomic regions covered by more than one distinct contig, or covered at greater than 3× depth by a single contig. Breakpoint-mapped SVs for which a breakpoint, or >50% of the outer span, intersected the excluded regions were filtered out.

To integrate the per-sample VCF files generated by the three HiFi-based and the three assembly-based SV callers, svtools109 was used. For each individual, the VCF files from the six callers were sorted together and then merged with svtools, first using strict criteria () and then with a more lenient second merge (). Autosomal SV calls supported by at least two callers were included in the consensus SV call set for comparison.

For SVs, confident regions were generated using Dipcall. Although useful for small variants, current benchmarking tools (for example, hap.py/vcfeval) cannot correctly compare different representations of small variants in and around SVs. Therefore, for each sample, the confident regions from Dipcall were further processed as follows:

Variant sites in the MC and PGGB graphs were discovered using vg 99. Variant sites with alleles larger than 100 kb were then removed using vcfbub (v.0.1.0)100 with options . The resulting VCF files were further processed using vcfwave from vcflib110. In brief, vcfwave realigns the alternative allele of each variant site against the reference allele using the bidirectional wavefront alignment (BiWFA) algorithm, decomposing complex alleles into primitive ones. The multi-sample VCF file was then converted into per-sample VCF files using , and multiallelic sites were split into biallelic records using . Next, the autosomal small variants (<50 bp) from a given pangenome graph (query set) were compared with the HiFi-DeepVariant call set (truth set) using vcfeval from RTG tools (v.3.12.1)83 with options . Note that the multi-nucleotide polymorphisms and complex indels were reduced to SNPs and simple indels using from RTG tools (v.3.12.1)83. The comparison was performed independently for each individual. Recall and precision were calculated within the refined Dipcall confident regions ('Defining confident regions for variant benchmarking' subsection) and then stratified using the GIAB (v.3.0) genomic context. To evaluate the SV (≥50 bp) calling performance, the autosomal SVs from a given pangenome graph (query set) were compared to the consensus SV call set (truth set) for each individual using (v.3.2.0)112 with options . Recall and precision were then stratified using the GIAB (v.3.0) genomic context and by variant length.

PacBio HiFi reads from 44 HPRC samples (excluding the held-out samples) were aligned to the MC graph using GraphAligner (v.1.0.13)113 with option and stored in the GAF format34. For each read that aligned to multiple places in the graph, the alignment with the highest score was retained. To remove low-quality alignments, a read with <80% of its read length aligned to the graph was discarded.
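A minimal sketch of this GAF filtering (not the vg or GraphAligner code): it assumes standard GAF columns (column 1 read name, column 2 read length, columns 3-4 query start/end) and an 'AS' alignment-score tag, keeps each read's best-scoring alignment and drops reads with <80% of their length aligned:

```python
# Minimal sketch: keep the best alignment per read, then apply the
# 80%-of-read-length cut-off.
def filter_gaf(lines):
    best = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        name, read_len = f[0], int(f[1])
        qstart, qend = int(f[2]), int(f[3])
        # alignment score from the optional tags (assumed to be an AS tag)
        score = next((float(t.split(":", 2)[2]) for t in f[12:]
                      if t.startswith("AS:")), 0.0)
        frac_aligned = (qend - qstart) / read_len
        if name not in best or score > best[name][0]:
            best[name] = (score, frac_aligned, line)
    return [line for score, frac, line in best.values() if frac >= 0.8]
```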
After filtering the read-to-graph alignments, the read depth of each edge was calculated using (v.1.33.0)114 with options . Note that the resulting GAF files did not contain a mapping quality (encoded as 255 for missing) for each alignment; therefore, the option was given to to ensure that these alignments were used during the read-depth calculation. Next, the edges of each sample were classified as either on-target or off-target depending on whether or not they were on the sample paths (encoded as W-lines in the MC GFA files).

ONT reads obtained from 29 HPRC samples (samples labelled HPRC in Supplementary Table 1) were aligned against the MC graph. The alignments were produced using GraphAligner (v.1.0.13) with parameter settings . The number of reads in these datasets ranges between 1 million and 5.4 million, with an average read length of 28.4 kb. On average, 99.68% of the reads received hits from one or more locations in the graph. For each read, we determined its best hit on the basis of alignment score and discarded all its lower-scoring alignments in subsequent analyses. The alignment identities of these best hits peaked above 95%, with an average ratio of alignment length to read length (ALRL) of 0.880 (s.d. = 0.302) and an average MAPQ value of 59.35. The alignment set was further quality-pruned by discarding alignments that had either an ALRL lower than 0.8 or a MAPQ value lower than 50. The surviving alignments had an overall average ALRL of 0.968 (s.d. = 0.047) and yielded an overall genome coverage of between 10.5-fold and 43-fold across the 29 samples (Supplementary Fig. 46).

We ran CAT46 to annotate each of the genomes within a pangenome graph. CAT projects a reference annotation, in this case GENCODE (v.38), to each of the haplotypes using the underlying alignments within the graph. CAT (commit eb2fc87) was run on both the GRCh38-based and the CHM13-based MC graphs. For each graph, the autosomes were first run all together, and then the sex chromosomes were run on the appropriate haplotypes. The parameters used were the defaults, except as shown below. An example CAT command run is:

Comparisons were made between the resulting CAT annotations and those from the Ensembl pipeline by looking at the parent GENCODE identifiers for each gene and transcript in the sets. The numbers of shared and unique identifiers between the sets were tabulated. Because the two annotation sets used slightly different versions of GENCODE (v.38), only those identifiers that both pipelines attempted to map were considered. Additionally, features were considered to be at the same locus if their genomic intervals overlapped.

SV sites in Minigraph were discovered using . To obtain the number of observed alleles per site, per-sample alleles were called using (Supplementary Table 21). SV sites with alleles larger than 10 kb and at least five observed alleles were selected as complex SV sites. The complex SV sites were further filtered on the basis of whether they overlapped medically relevant protein-coding genes47 using . To understand whether the medically relevant complex SV sites were known in previous studies, the coverage of SVs from the 1KG call sets10,19 was computed using (Supplementary Table 16). All complex SVs were examined using Bandage115 and visually compared to previous short-read SV call sets10,19,68,116 using IGV.
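A minimal sketch of the complex-site selection rule above (the site records are hypothetical stand-ins for the Minigraph call output, not the actual call format):

```python
# Minimal sketch: keep SV sites with a maximum allele length >10 kb
# and at least five observed alleles.
def complex_sites(sites, min_len=10_000, min_alleles=5):
    """sites: iterable of (site_id, allele_lengths, n_observed_alleles)."""
    return [site_id for site_id, allele_lens, n_obs in sites
            if max(allele_lens) > min_len and n_obs >= min_alleles]

sites = [("chr1_site_1", [52, 61], 2),                  # small biallelic site
         ("chr6_site_2", [15_000, 22_000, 9_000], 9)]   # large, many alleles
print(complex_sites(sites))  # -> ['chr6_site_2']
```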
We extracted subgraphs and paths for five loci in the MC and PGGB graphs using gfabase (v.0.6.0)117 and odgi (v.0.6.2)98 with the following example commands:

We then visualized the graph structures of the subgraphs using Bandage (v.0.8.1)115.

We aligned Ensembl (release 106)77 GRCh38 gene sequences to the MC graph and the PGGB graph using GraphAligner (v.1.0.13)113 with parameter settings to identify the gene positions within the graphs. To show the locations of genes on the Bandage plots, we applied colour gradients from green to blue to the nodes of each gene. Lines alongside the Bandage plots showing approximate gene positions, exons and transcription start sites based on Ensembl canonical transcripts were drawn by hand.

The sequences of each assembly are represented by paths in a GFA file. We identified SVs in each assembly by tracing these paths through different 'big' bubbles (>5,000 bp) in either the MC graph or the PGGB graph within those gene regions. We selected the 5,000-bp bubble size on the basis of manual inspection of the Bandage plots. An example command to identify big bubbles at an RH locus is as follows:

To identify gene conversion events (as gene conversions are not shown as bubbles in the graphs), we identified nodes that differed between a gene and its homologous gene (for example, RHD and RHCE) on the basis of the GraphAligner alignments described above. We refer to these as paralogous sequence variants. A gene conversion event was detected if the path of a gene goes through more than four paralogous sequence variants of its homologous gene in a row.

We counted the number of assemblies for each structural haplotype and computed their frequency. We visualized linear haplotype structures (for example, in Fig. 5c) using gggenes (v.0.4.1)118 on the basis of the structural haplotypes determined for each assembly from the pangenome graphs. The length of the intervals between genes is fixed (except for TMEM50A and RHCE, because those two genes are immediately next to each other). The lengths of genes are shown as proportional to the gene lengths in GRCh38.

The short reads were first split into chunks to parallelize the read mapping to the 'allele-filtered graph' pangenome, as defined in the 'MC' subsection. This pangenome is included within the dataset accompanying this paper and can be identified as 'clip.d9.m1000.D10M.m1000'. Mapping was performed with Giraffe49 from vg release v.1.37.0. For trio-based runs, the trio-sample sets of short reads were mapped to the pangenome using Giraffe from vg release v.1.38.0. Note that the core vg algorithms for Giraffe mapping and surjection (conversion from graph space to linear space) are the same in both vg v.1.37.0 and v.1.38.0. The output alignments, surjected to GRCh38 in BAM format as explained below, are available at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/samples in the bam directory of each sample's directory, and are organized by aligner.

To perform variant calling, GAM alignments were surjected onto the chromosomal paths from GRCh38 (chromosomes 1-22, X and Y) using and the option to prune short and low-complexity anchors during realignment. The BAM files were sorted and split by chromosome using SAMtools (v.1.3.1)119. The reads were realigned, first using from FreeBayes (v.1.2.0)120, and then with ABRA (v.2.23)121 on target regions that were identified using from GATK (v.3.8.1)122 and expanded by 160 nucleotides with (v.2.21.0)123.
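Returning to the gene-conversion rule defined earlier in this subsection (a gene path crossing more than four paralogous sequence variants (PSVs) of its homologous gene in a row), the following is a minimal sketch with a hypothetical per-PSV encoding, not the analysis code used here:

```python
# Minimal sketch: flag a gene-conversion event wherever a run of five or
# more consecutive PSV sites carries the homologous gene's allele.
def gene_conversion_events(psv_path, min_run=5):
    """psv_path: one symbol per PSV site, 'self' or 'homolog'."""
    events, run, start = [], 0, None
    for i, allele in enumerate(psv_path + ["self"]):  # sentinel flushes the last run
        if allele == "homolog":
            run, start = run + 1, (start if run else i)
        else:
            if run >= min_run:
                events.append((start, i - 1))  # inclusive PSV index range
            run, start = 0, None
    return events

path = ["self"] * 3 + ["homolog"] * 6 + ["self"] * 4
print(gene_conversion_events(path))  # -> [(3, 8)]: one conversion-like run
```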
To perform variant calling with DeepVariant and DeepTrio, we trained machine-learning models specific to our graph reference and Giraffe alignment pipeline on the basis of our alignments. For all models, chromosome 20 was entirely held out from all input samples to provide a control.

Training was performed on Google's internal cluster, using unreleased Google tensor processing unit (TPU) accelerators, from a cold start (that is, without using a pre-trained model as input). We believe that nothing about the way in which we executed the training is essential to the results obtained. Cold-start training is estimated to be feasible outside the Google environment, so the claims we present here are falsifiable, but it is not expected to be cost-effective. Researchers looking to independently replicate our training should consider doing warm-start training from a base model trained on other data, using commercially available graphics processing unit (GPU) accelerators. An example procedure can be found in the DeepVariant training tutorial at GitHub (https://github.com/google/deepvariant/blob/r1.3/docs/deepvariant-training-case-study.md). We predict that this more accessible method would produce equivalent results.

For both DeepVariant and DeepTrio, the true variant calls trained against came from the GIAB benchmark (v.4.2.1).

For DeepVariant, we trained on the HG002, HG004, HG005, HG006 and HG007 samples, with HG003 held out. The trained DeepVariant model is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepVariant/models/DEEPVARIANT_MC_Y1.

For DeepTrio, we trained two sets of models: one on HG002, HG003, HG004, HG005, HG006 and HG007, with HG001 held out; and one on HG001, HG005, HG006 and HG007, with the HG002, HG003 and HG004 trio held out. Each DeepTrio model set included parental and child models. The two trained child DeepTrio models are available at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio/child and https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio-no-HG002-HG003-HG004/child, respectively. The two trained parental DeepTrio models are available at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio/parent and https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio-no-HG002-HG003-HG004/parent, respectively.

DeepVariant (v.1.3) was evaluated on HG003, using the model we trained with HG003 held out (see 'Model training'). We used the flag (introduced to support this analysis) and a minimum mapping quality of 1 in the step, before calling the variants with . Both VCFs and gVCFs were produced. The WDL workflow used for single-sample mapping and variant calling was deposited into Zenodo (https://doi.org/10.5281/zenodo.6655968).

Small variants were also called using a more traditional pipeline. We aligned reads with BWA-MEM50 to GRCh38 with decoys but no ALTs. DeepVariant then called small variants from the aligned reads. The same version and parameters were used for DeepVariant; only the model was changed, to the default DeepVariant model.

Small variants were also called using DeepTrio (v.1.3). For HG001, we used the DeepTrio models we trained with HG001 held out (see 'Model training').
For the HG002, HG003 and HG004 trio and the HG005, HG006 and HG007 trio, we used the models trained with the HG002, HG003 and HG004 trio held out; the HG005, HG006 and HG007 trio (except for chromosome 20) was still included in the training set. We used the and a minimum mapping quality of 1 in the step before calling the variants with . Both VCFs and gVCFs were produced and are available at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/samples in the vcf directory of each sample's directory, and are organized by mapping and calling condition. The WDL workflow used for trio-based mapping and variant calling was deposited into Zenodo (https://doi.org/10.5281/zenodo.6655962).

For DeepTrio, small variants were also called using a more traditional pipeline and a graph-based implementation of Illumina's Dragen platform (v.3.7.5). The conditions evaluated were each a combination of a mapper and a reference. The Giraffe-HPRC condition used Giraffe (v.1.38.0)49 to align reads to the HPRC reference. The BWA-MEM condition used BWA-MEM (v.0.7.17-r1188)50 to align reads to the hs38d1 human reference genome with decoys but no ALTs. The Dragen-DeepTrioCall condition used Illumina's Dragen platform (v.3.7.5)51 against their default graph, which was constructed using the same GRCh38 reference with decoys but no ALTs, plus population contigs, SNPs and liftover sequences from datasets internal to their platform. DeepTrio then called small variants from the aligned reads. The same version and parameters were used for DeepTrio (v.1.3); only the default model was used for these conditions. We also applied the native Dragen caller and joint genotyper to the Dragen-Graph-based alignments for comparison purposes, referred to as Dragen-DragenCall and Dragen-DragenJointCall, respectively. Dragen-DragenCall implements a single-sample method and is the default use case for processing Dragen-Graph-mapped data. Dragen-DragenJointCall uses a pedigree-backed implementation that determines, given the genotype information of the parents, which variants are likely to be de novo and which are erroneous. To make a fairer comparison with Dragen, we tested these configurations to assess which implementation of Dragen variant calling produced the best results given the available trio data.

The small-variant calls were evaluated using HG001-HG007 with the GIAB benchmark (v.4.2.1)124, on HG002 in challenging medically relevant autosomal genes47, and on HG002 using a preliminary draft assembly-based benchmark. For the draft assembly-based benchmark, we used Dipcall38 to align a scaffolded, high-coverage Trio-Hifiasm assembly21,38 to GRCh38 and call variants, and we then excluded structurally variant regions from the dip.bed file, as described above for the benchmarking of small variants from the pangenome graph. The comparison between the call sets and the truth set was made with RTG's vcfeval83 and Illumina's hap.py tool82 on the confident regions of the benchmark. We used high-coverage read sets of the GIAB HG001, HG002 and HG005 trio child samples and evaluated performance within the held-out chromosome 20 for the GIAB (v.4.2.1) truth set, or within the entire genome for the reduced truth set of the challenging medically relevant autosomal genes. The evaluation was also stratified using the set of regions provided by the GIAB at GitHub (https://github.com/genome-in-a-bottle/genome-stratifications)125.
We applied our small variant calling pipeline to the high-coverage read sets for the 3,202 samples of the 1KG19. The output alignments, in the GAM format, and the VCFs were saved in public buckets at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/vg/graph_to_grch38. We selected 100 trios among those samples to further evaluate the quality of the calls. We tested all variants that have at least one alternative allele in a trio for Mendelian consistency. In addition, for each variant, we only considered trios for which the child's genotype differed from the genotype of at least one of the parents, to minimize bias created by systematic calls (for example, all homozygous or all heterozygous). We looked at the fraction of variant–trio pairs that failed Mendelian consistency across the entire genome and in sites that do not overlap simple repeats as defined by the 'simpleRepeat' track downloaded from the UCSC Genome Browser. The results were compared with the Mendelian consistency of calls provided by the 1KG, which used GATK HaplotypeCaller on the reads aligned to GRCh38. We also repeated this analysis on the two trios of the GIAB (v.4.2.1) benchmark (HG002–HG007) and across the different methods of our evaluation described above (BWA-MEM and DragenGraph mappers; DeepVariant, DeepTrio and Dragen variant callers).
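A minimal sketch of the two checks described above, on unordered genotype tuples (a simplified representation, not the production code): a trio is considered informative when the child's genotype differs from at least one parent, and a genotype is Mendelian-consistent when one allele can be inherited from each parent.

```python
def informative(child, mother, father):
    """Only consider trios where the child differs from at least one parent,
    to minimize bias from systematic calls (e.g. all-heterozygous sites)."""
    return sorted(child) != sorted(mother) or sorted(child) != sorted(father)

def mendelian_consistent(child, mother, father):
    """True if one child allele can come from the mother and the other
    from the father; genotypes are unordered allele tuples, e.g. (0, 1)."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

trio = ((0, 1), (0, 0), (1, 1))  # child, mother, father
if informative(*trio):
    print("consistent" if mendelian_consistent(*trio) else "Mendelian error")
```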
We used a VCF file created on the basis of snarl traversal of the MC graph as a basis for genotyping. The records contained in this VCF represent bubbles in the underlying pangenome graph and their nested variants, derived from the snarl tree. Each variant was marked according to its level in this tree. Variants annotated with 'LV=0' correspond to the top-level bubbles. We used vcfbub (v.0.1.0)100 to filter the VCF. This removed all non-top-level bubbles from the VCF unless they were nested inside a top-level bubble with a reference length exceeding 100 kb; that is, top-level bubbles longer than that are replaced by their child nodes in the snarl tree. The VCF also contained the haplotypes of all 44 assembly samples, representing paths in the pangenome graph. We additionally removed all records for which more than 20% of all 88 haplotypes carried a missing allele ('.'). This resulted in a set of 22,133,782 bubbles. In the next step, we used PanGenie (v.1.0.0)54 to genotype these bubbles across all 3,202 samples from the 1KG based on high-coverage Illumina reads19.

We genotyped all top-level bubbles across all 1KG samples. Whereas biallelic bubbles can easily be classified as representing SNPs, indels or SVs, this becomes more difficult for the multiallelic bubbles contained in the VCF. In particular, larger multiallelic bubbles can contain a high number of nested variant alleles overlapping across haplotypes, represented as a single bubble in the graph. This is especially problematic when comparing the genotypes computed for the entire bubble to external call sets, as the coordinates of the bubble do not necessarily represent the exact coordinates of the individual variant alleles carried by a sample in such a region (Supplementary Fig. 20).

To tackle this problem, we implemented a decomposition approach that aimed to detect all variant alleles nested inside multiallelic top-level records. The idea was to detect variants from the node traversals of the reference and alternative alleles of all top-level bubbles. Given the node traversals of a reference and an alternative path through a bubble, our approach was to match each reference node to its leftmost occurrence in the alternative traversal, resulting in an alignment of the node traversals (Supplementary Fig. 21a). Nested alleles could then be determined based on indels and mismatches in this alignment. As the node traversals of the alternative alleles can visit the same node more than once (which is not the case for the reference alleles of the MC graph), this approach is not guaranteed to reconstruct the optimal sequence alignment underlying the nodes in these repeated regions.
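One way to read this description is the following simplified sketch, with hypothetical integer node IDs: each reference node is matched to its leftmost occurrence at or after the previous match, and the unmatched segments between consecutive matches are emitted as candidate nested alleles. This is an illustration of the idea, not the actual implementation, which differs in details such as how repeated node visits are handled.

```python
def align_traversals(ref_nodes, alt_nodes):
    """Return (ref_segment, alt_segment) pairs lying between matched nodes;
    each pair corresponds to a candidate nested variant."""
    pairs = []
    r0 = a0 = 0
    for r, node in enumerate(ref_nodes):
        try:
            a = alt_nodes.index(node, a0)  # leftmost occurrence at or after a0
        except ValueError:
            continue  # reference node absent from the alternative path
        if (r0, a0) != (r, a):  # a gap or mismatch lies between the anchors
            pairs.append((ref_nodes[r0:r], alt_nodes[a0:a]))
        r0, a0 = r + 1, a + 1
    if r0 < len(ref_nodes) or a0 < len(alt_nodes):  # trailing difference
        pairs.append((ref_nodes[r0:], alt_nodes[a0:]))
    return pairs

# Example: the alternative path inserts nodes 10 and 11 after node 1 and
# replaces node 4 with node 12.
print(align_traversals([1, 2, 3, 4, 5], [1, 10, 11, 2, 3, 12, 5]))
# -> [([], [10, 11]), ([4], [12])]
```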
As output, the decomposition process generated two VCF files. The first is a multiallelic VCF that contains exactly the same variant records as the input VCF, except that annotations for all alternative alleles of a record were added to the identifier tag in the INFO field. For each alternative allele, the identifier tag contains identifiers encoding all nested variants it is composed of, separated by a colon. The second VCF is biallelic and contains a separate record for each nested variant identifier, defining the reference and alternative alleles of the respective variant (Supplementary Fig. 21b). Both VCFs are different representations of the same genomic variation, that is, before and after decomposition. We applied this decomposition method to the MC-based VCF file, used the multiallelic output VCF as input for PanGenie to genotype bubbles, and used the biallelic VCF as well as the identifiers to translate PanGenie's genotypes for bubbles into genotypes for all individual nested variant alleles. All downstream analyses of the genotypes are based on this biallelic representation (that is, after decomposition).

Although the majority of short bubbles (<10 bp) are biallelic, particularly large bubbles (>1,000 bp) tend to be multiallelic. Sometimes each of the 88 non-reference (neither GRCh38 nor T2T-CHM13) haplotypes contained in the graph covered a different path through such a bubble (Supplementary Fig. 22a), leading to a VCF record with 88 alternative alleles listed. We determined the number of variant alleles located inside biallelic and multiallelic bubbles in the pangenome after decomposition. As expected, the majority of SV alleles were located inside the more complex, multiallelic regions of the pangenome (Supplementary Fig. 22b).

We conducted a leave-one-out experiment to evaluate PanGenie's genotyping performance for the call set samples. For this purpose, we repeatedly removed one of the panel samples from the MC VCF and genotyped it using only the remaining samples as an input panel for PanGenie. We later used the genotypes of the left-out sample as ground truth for evaluation. We repeated this experiment for five of the call set samples (HG00438, HG00733, HG02717, NA20129 and HG03453) using 1KG high-coverage Illumina reads19. PanGenie is a re-genotyping method; like every other re-typer, it can only genotype variants contained in the input panel VCF, that is, it is not able to detect variants unique to the genotyped sample. For this reason, we removed all variant alleles (after decomposition) unique to the left-out sample from the truth set used for evaluation. To evaluate genotyping performance, we used the weighted genotype concordance54. Extended Data Fig. 9 shows the results stratified by different regions. Extended Data Fig. 9a shows concordances in biallelic and multiallelic regions of the MC VCF. The biallelic regions include only bubbles with two branches; the multiallelic regions include all bubbles in which the haplotypes cover more than two different paths. Extended Data Fig. 9b shows the same results stratified by genomic regions defined by GIAB, which we obtained from the following locations: easy (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/union/GRCh38_notinalldifficultregions.bed.gz); low-mappability (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/union/GRCh38_alllowmapandsegdupregions.bed.gz); repeats (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/LowComplexity/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz); and other-difficult (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/OtherDifficult/GRCh38_allOtherDifficultregions.bed.gz).

Here and in the following, we considered results for SNPs, indels (1–49 bp), SV deletions, SV insertions and other SV alleles, defined as follows: SV deletions include all alleles for which length(REF) ≥ 50 bp and length(ALT) = 1 bp; SV insertions include all alleles for which length(REF) = 1 bp and length(ALT) ≥ 50 bp; all other alleles with a length ≥50 bp are included in 'others'.
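These class definitions map directly onto simple length checks. The sketch below (illustrative only, not the study's code) classifies a biallelic REF/ALT pair accordingly.

```python
def classify(ref: str, alt: str) -> str:
    """Assign a variant class from REF/ALT allele lengths, following the
    definitions in the text (SV deletion, SV insertion, other SV, indel, SNP)."""
    lr, la = len(ref), len(alt)
    if lr == 1 and la == 1:
        return "SNP"
    if lr >= 50 and la == 1:
        return "SV deletion"
    if lr == 1 and la >= 50:
        return "SV insertion"
    if max(lr, la) >= 50:
        return "other SV"
    return "indel"  # 1-49 bp

for ref, alt in [("A", "G"), ("A" * 60, "A"), ("A", "T" * 80), ("ACGT", "A")]:
    print(ref[:8], "->", alt[:8], ":", classify(ref, alt))
```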
Overall, weighted genotype concordances were high for all variant types. In particular, variant alleles in biallelic regions of the graph were easily genotypable. Alleles inside multiallelic bubbles were more difficult to genotype correctly, as PanGenie needs to decide between several possible alternative paths, whereas there are only two such paths in biallelic regions (Extended Data Fig. 9a). Furthermore, genotyping accuracy depended on the genomic context (Extended Data Fig. 9b). Regions with low mappability, repetitive regions and other difficult regions were harder to genotype than regions classified as 'easy' by GIAB.

We generated genotypes for all 3,202 1KG samples with PanGenie and defined a high-quality subset of SV alleles that we could reliably genotype. For this purpose, we applied a machine-learning approach similar to what we have previously presented5,54. We defined positive and negative subsets of variants based on the following filters: ac0_fail, a variant allele was genotyped with an AF of 0.0 across all samples; mendel_fail, the Mendelian consistency across trios is <80% for a variant allele (here, we use a strict definition of Mendelian consistency, which excludes all trios with only 0/0, only 0/1 or only 1/1 genotypes); gq_fail, <50 high-quality genotypes were reported for this variant allele; self_fail, the genotyping accuracy of a variant allele across the panel samples is <90%; nonref_fail, not a single non-0/0 genotype was genotyped correctly across all panel samples.

The positive set included all variant alleles that passed all five filters. The negative set contained all variant alleles that passed the ac0_fail filter but failed at least three of the other filters. We trained a support vector regression approach based on these two sets that used multiple features, including AFs, Mendelian consistencies and the number of alternative alleles transmitted from parents to children. We applied this method to all remaining variant alleles genotyped with an AF > 0, resulting in a score between −1 (bad) and 1 (good) for each. We finally defined a filtered set of variants that included the positive set and all variant alleles with a score of ≥ −0.5.

We show the number of variant alleles contained in the unfiltered set, the positive set and the filtered set in Supplementary Table 18. As our focus is on SVs and as 65% of all SNPs and indels are already contained in the positive set, we applied our machine-learning approach only to SVs. We found that 50%, 33% and 26% of all deletion, insertion and other alleles, respectively, were contained in the final, filtered set of variants. Note that these numbers take all distinct SV alleles contained in the call sets into account. Especially for insertions and other SVs, many of these alleles are highly similar, sometimes differing by only a single base pair. Therefore, it is probable that many of them actually represent the same events. Our genotyping and filtering approach helped to remove such redundant alleles.

To evaluate the quality of the PanGenie genotypes, we compared the AFs observed for the SV alleles across all 2,504 unrelated 1KG samples to their AFs observed across the 44 assembly samples in the MC call set. Supplementary Figs. 23–25 show the results for SV deletions, insertions and other SV alleles. We observed that the AFs between both sets matched well, resulting in Pearson correlations of 0.93, 0.87 and 0.81 for deletions, insertions and other alleles, respectively, contained in the unfiltered set. For the filtered set, we observed correlations of 0.96, 0.93 and 0.90, respectively. We also analysed the heterozygosity of the PanGenie genotypes across all 2,504 unrelated 1KG samples and observed a relationship close to what is expected under Hardy–Weinberg equilibrium (Supplementary Figs. 23–25, lower panels).
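The AF comparison itself is straightforward. Below is an illustrative computation on random placeholder genotype matrices (values 0/1/2 are alternative-allele counts per diploid sample), where the real analysis used PanGenie genotypes for the 2,504 unrelated samples and the 44-sample panel; with random data the correlation is near zero, which is expected.

```python
import numpy as np

rng = np.random.default_rng(0)
panel = rng.integers(0, 3, size=(1000, 44))     # 44 assembly samples (placeholder)
cohort = rng.integers(0, 3, size=(1000, 2504))  # 2,504 unrelated samples (placeholder)

# AF per variant = alt-allele count / (2 * number of diploid samples).
af_panel = panel.sum(axis=1) / (2 * panel.shape[1])
af_cohort = cohort.sum(axis=1) / (2 * cohort.shape[1])

r = np.corrcoef(af_panel, af_cohort)[0, 1]
print(f"Pearson r = {r:.3f}")
```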
We compared our filtered set of variant alleles to the HGSVC PanGenie genotypes (v.2.0 lenient set)5 and Illumina-based SV genotypes19. A direct comparison of the three call sets is difficult. The HGSVC and HPRC call sets are based on variant calls produced from haplotype-resolved assemblies of 32 and 44 samples, respectively5. For each call set, variants were re-genotyped across all 3,202 1KG samples. Note that the call set samples for HPRC and HGSVC are disjoint. As re-genotyping cannot discover new variants, both call sets will miss variants carried by the 3,202 samples that were not seen in the assembly samples. By contrast, the 1KG call set contains short-read-based variant calls produced for each of the 3,202 1KG samples. Another difference between the HGSVC and HPRC call sets is that, in the HGSVC call set, highly similar alleles are merged into a single record to correct for representation differences across different samples or haplotypes. The HPRC call set, however, keeps all these alleles separate even if there is only a single base pair difference between them. To make the call sets more comparable, we merged clusters of highly similar alleles in the HPRC filtered set before comparison with the other call sets. This was done with Truvari (v.3.1.0)112.

To properly compare the call sets despite their differences, we counted the number of SV alleles present in each sample (genotype 0/1 or 1/1) in each call set and plotted the corresponding distributions stratified by genome annotations from GIAB (same as above, Fig. 6d). We also generated the same plot including only common SV alleles with an AF > 5% across all 3,202 samples (Supplementary Fig. 28). Both plots indicate that the two assembly-based call sets (HPRC and HGSVC) provide access to more SVs across the genome than the short-read-based 1KG call set, especially deletions <300 bp and insertions (Fig. 6e). This result confirms that SV callers based on short reads alone miss a large proportion of SVs located in regions inaccessible to short-read alignments, as previously reported by several studies5,8. In the 'easy' regions, the number of SVs per sample was consistent across all three call sets. For the other regions, however, the results indicated that the HPRC-filtered genotypes provided access to more variant alleles than the HGSVC lenient set, especially insertions and variants in regions of low mappability and tandem repeats (Fig. 6d,e).

To evaluate the new SVs in our filtered HPRC call set, we revisited the leave-one-out experiment we had previously performed on the unfiltered set of variants (see above). We restricted the evaluation to the following subsets of variants: (1) those that are in our filtered set but not in the 1KG Illumina calls (novel); (2) those in our filtered set and in the 1KG Illumina call set (known); and (3) all variants in our filtered set. To find matches between our set and the Illumina calls, we used a criterion based on a reciprocal overlap of at least 50%. The results are shown in Supplementary Fig. 29. We generated two versions of this figure: the first (top) excludes variants that are unique to the left-out sample and therefore not typable by any re-genotyping method, and the second (bottom) includes these variants. In general, the genotype concordances of all lenient variants (brown, dark purple) were slightly higher than the concordances we observed for the unfiltered set (Supplementary Fig. 29). Furthermore, the concordances of the known variants were highest. This is expected, as these variants tend to lie in regions that are easier to access with short reads. Concordances for novel variants were slightly worse. This was also expected, as these variants tend to be located in more complex genomic regions that are generally harder to access. However, even for these variants, concordances were still high, indicating that the PanGenie genotypes for them are of high quality.
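The matching criterion can be illustrated with a small helper, a sketch that assumes half-open interval coordinates (not the actual comparison code, which operated on VCF records): two intervals match if each covers at least 50% of the other.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, frac=0.5):
    """True if intervals [a_start, a_end) and [b_start, b_end) overlap
    reciprocally by at least `frac` of each interval's length."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap >= frac * (a_end - a_start)
            and overlap >= frac * (b_end - b_start))

print(reciprocal_overlap(100, 300, 180, 350))  # True: 120 bp shared
print(reciprocal_overlap(100, 300, 280, 900))  # False: too little of each
```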
In addition to all 3,202 1KG samples, we genotyped sample HG002 based on Illumina reads from ref. 18. We used the GIAB CMRG benchmark containing medically relevant SVs47, downloaded from https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/, for evaluation. As for the 1KG samples, we used the MC-based VCF (see above) containing variant bubbles and haplotypes of the 44 assembly samples as an input panel for PanGenie. We extracted all variant alleles with a length ≥50 bp from our genotyped VCF (biallelic version, after decomposition). We converted the ground-truth VCF into a biallelic representation and kept all alleles with a length ≥50 bp. We used Truvari (v.3.1.0)112 to compare our genotype predictions to the medically relevant SVs. The results are shown in Supplementary Table 22. As PanGenie is a re-typing method, it can only genotype variants provided in the input and therefore cannot detect novel alleles. As HG002 is not among the panel samples, the input VCF misses variants unique to this sample. Thus, these unique variants cannot be genotyped by PanGenie and will be counted as false negatives during evaluation. Therefore, we computed an adjusted version of the recall that excluded SV alleles unique to HG002 (that is, alleles not in the graph) from the truth set used for evaluation. To identify which SV alleles were unique, we compared each of the 44 panel samples to the ground-truth VCF using Truvari to identify the false negatives for each sample. We then computed the intersection of the false-negative calls across all samples. The resulting set contained all variant alleles unique to the HG002 ground-truth set. We found 15 such unique SV alleles among the GIAB CMRG variants. We removed these alleles from the ground-truth set and recomputed the precision–recall statistics for our genotypes. Adjusted precision–recall values are shown in Supplementary Table 22.

All genotyping results produced by PanGenie are available at Zenodo (https://doi.org/10.5281/zenodo.6797328).

Raw VNTR coordinates on GRCh38 (chromosomes 1–22 and sex chromosomes only) were generated using TRF (v.4.09)56. Only repeats with a period size between 6 and 10,000 bp and a total length >100 bp, and not overlapping centromeric regions, were selected, giving a total of 98,021 non-overlapping loci. Using the raw VNTR coordinates on GRCh38 as input, the boundaries of orthologous VNTR regions across the 96 haplotypes (including GRCh38 and CHM13) were determined using the build module in danbing-tk (v.1.3)55 (dist_scan = 700, dist_merge = 1, TRwindow = 100000, MBE_th1 = 0.6, MBE_th2 = 0.3).

For each genome, error-free paired-end whole-genome reads were simulated at approximately 30× coverage, or approximately 15× for each haplotype. A read pair was generated every 20 bp, with a fragment size of 500 bp and a read length of 150 bp. Paired-end read mapping to the MC graph was done with Giraffe (v.1.39.0)49, and mapping to GRCh38 was done with BWA-MEM (v.0.7.17-r1188)50. For a fair comparison, GRCh38 plus decoys minus the ALT/HLA contigs was used as the reference, matching the paths included in the MC graph.

To evaluate the performance of read mapping with the MC graph plus Giraffe, each node in the graph was annotated with the VNTR information from danbing-tk by traversing each haplotype path. Each node covering a VNTR region has a tuple denoting the intersecting interval; any aligned read overlapping the interval is considered mapped to the VNTR. Similarly, reads simulated from a region overlapping a VNTR interval are considered to be derived from the VNTR. To evaluate the performance of GRCh38 plus BWA-MEM, the mapped region of each read was obtained using the bamtobed submodule of BEDTools (v.2.30.0)123. The VNTR annotations on GRCh38 were used to determine whether a read mapped to a VNTR.

For each read, we tracked its source VNTR and its mapped VNTR, and used this information to compute accuracy. Only VNTRs in the danbing-tk annotation were tracked; other regions were labelled untracked, the same as non-VNTR regions. A true positive denotes a mapping from a VNTR to its original VNTR. An exogenous false positive denotes a mapping from an untracked region to a VNTR. An endogenous false positive denotes a mapping from a VNTR to another VNTR. A true negative denotes a mapping from an untracked region to an untracked region. A false negative denotes a mapping from a VNTR to an untracked region (Supplementary Fig. 30). Any alignment without a mapping field in the Giraffe JSON output was considered unmapped. Read pairs whose two ends were not mapped to the same chromosome by BWA-MEM were also considered unmapped.

WGS samples of the 35 genomes (HG00438, HG00621, HG00673, HG00733, HG00735, HG01071, HG01106, HG01109, HG01117, HG01175, HG01952, HG01978, HG02055, HG02080, HG02145, HG02148, HG02257, HG02572, HG02622, HG02630, HG02717, HG02723, HG02818, …, HG03453, HG03486, HG03492, HG03579, NA18906 and NA19240) were mapped to the MC graph using Giraffe as described in the 'Pangenome point genotyping' subsection. Using the VNTR annotations described in the previous section, the number of reads mapped to each VNTR region in the MC graph was computed as a proxy for VNTR length. VNTRs with invariant length across the 35 genomes were removed from the analysis, leaving a total of 60,861 loci. The ground truth for a VNTR in a genome was computed as the number of bases spanned by the VNTR, averaged over the two haplotypes.

As a baseline control, the read depth over each VNTR region was also computed for the 35 WGS samples mapped to GRCh38, using mosdepth (v.0.3.1)126. To enable a comparison with the graph-based approach, VNTRs with missing annotations on GRCh38 were further removed, leaving a total of 60,386 VNTRs.
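The five read-level categories defined above can be sketched as a simple classifier (illustrative only; `source` and `target` are hypothetical tracked-VNTR identifiers, with None standing for an untracked region):

```python
def classify_read(source, target):
    """Classify a read by its simulated source locus and its mapped locus.
    `None` means the read comes from, or maps to, an untracked region."""
    if source is None and target is None:
        return "true negative"
    if source is None:
        return "exogenous false positive"   # untracked -> VNTR
    if target is None:
        return "false negative"             # VNTR -> untracked
    if source == target:
        return "true positive"              # VNTR -> its original VNTR
    return "endogenous false positive"      # VNTR -> another VNTR

for src, tgt in [("vntr7", "vntr7"), ("vntr7", None), (None, "vntr3"),
                 ("vntr7", "vntr3"), (None, None)]:
    print(src, "->", tgt, ":", classify_read(src, tgt))
```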
We augmented the allele-filtered graph (see the 'Pangenome point genotyping' subsection) with splice junction edges to create a spliced pangenome graph, using the rna subcommand in the vg toolkit (v.1.38.0)57 with the maximum node length set to 32. The transcript annotations used to define the splice junctions consisted of the CAT transcript annotations on each assembly together with splice junctions from the GENCODE (v.38) annotation29. Transcripts from the GENCODE (v.38) annotation were further added as paths to the spliced pangenome graph. For comparison, we also created a spliced reference built from the reference sequence, again using the GENCODE (v.38) transcript annotation. For both graphs, we created the indexes needed for mapping using the vg toolkit (v.1.38.0) with default parameters, except that edges on embedded paths were restored after pruning. In addition, stricter pruning parameters had to be used for the spliced HPRC pangenome graph. For the spliced reference, we also created the indexes needed by the RNA-seq mapper STAR using default parameters.

We simulated RNA-seq reads with a pipeline designed to preserve complex genomic variation in the simulated data. The transcript sequences used for simulation were derived from the GENCODE (v.38) transcript annotation projected onto the assembled haplotypes of HG002. Specifically, we used MC to create an alignment between GRCh38 and the two HPRC HG002 assembly haplotypes, which were left out of the main pangenome graph for benchmarking. We then used CAT to lift the transcript annotation over to these haplotypes. We built a spliced personal genome graph and then simulated reads (commit 2cea1e2) using an Illumina NovaSeq cDNA read set (SRR18109271) to fit the model parameters. This is essentially equivalent to simulating directly from the projected transcript sequences. Transcripts were simulated with uniform expression, split evenly between the two haplotypes. This expression profile is not biologically realistic, but it avoids the difficulty of picking a specific expression profile as representative of all tissues and developmental stages; moreover, existing estimated profiles would be biased towards the tools used to estimate them. We simulated 5,000,000 paired-end 150-bp RNA-seq reads.

Both simulated and real Illumina RNA-seq reads were mapped to the graphs using default parameters (commit c0c4816). In addition, reads were mapped to the spliced reference using STAR (v.2.7.10a) with default parameters58. For the real data, we used previously published NA12878 RNA-seq data (SRR1153470)127 and data from the ENCODE project (ENCSR000AED, replicate 1)128,129.

We evaluated the alignments using the same approach as previously described57. Briefly, for the simulated data, graph alignments were compared with the true alignments by estimating their overlap on the reference genome paths, onto which the graph alignments were projected. A uniquely aligned read (MAPQ ≥ 30) was considered correct if it overlapped 90% of the true alignment. A multi-mapped read was considered correct if any of its mappings was correct under the same criterion. For the real data, the mean read coverage of each exon, computed on the reference paths from the projected graph alignments, was compared with the corresponding coverage estimated from long-read alignments. For the long-read data, we used PacBio Iso-Seq alignments from the ENCODE project (all replicates of ENCSR706), which are from the same sample as the Illumina data. The long-read alignments were used to define the exons, and only primary long-read alignments with a mapping quality of at least 30 were used. The alignments of the four Iso-Seq replicates (ENCFF247TLH, ENCFF431IOE, ENCFF520MMC and ENCFF626GWM) were merged and filtered using SAMtools (v.1.15)130. The alignments were converted to exon coordinates using BEDTools (v.2.30.0)123.

We also used the results of the mapping experiments to quantify allelic bias. We identified variant sites in the MC graph from the HG002 haplotypes, excluding deletions larger than 10 kb to avoid spurious variants. Variants overlapping exons were selected and normalized using BCFtools (v.1.16)130. Next, we counted the number of mapped reads supporting each of the two haplotypes at each heterozygous exonic variant57. Specifically, the read count for each allele was computed as the mean of the counts at the allele's two breakpoints. This was done to treat different variant types and lengths equally. We then tested variants with a read coverage of at least 20 for allelic bias using a two-sided binomial test with α = 0.01. Any test that rejected the null hypothesis was a false positive, as the reads were simulated without allelic bias. The results were stratified by variant class and compared against the number of sites reaching a coverage of at least 20, which is a rough indicator of mapping sensitivity. Indels larger than 50 bp were excluded.
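A minimal sketch of this test (not the authors' code; the counts are invented): at a heterozygous site with read counts for the two alleles and total coverage of at least 20, a two-sided binomial test against p = 0.5 is applied at α = 0.01.

```python
from scipy.stats import binomtest

ALPHA = 0.01
MIN_COVERAGE = 20

def biased(allele1_reads, allele2_reads):
    """Return True if the site shows significant allelic bias,
    None if coverage is too low to test."""
    n = allele1_reads + allele2_reads
    if n < MIN_COVERAGE:
        return None
    result = binomtest(allele1_reads, n, p=0.5, alternative="two-sided")
    return result.pvalue < ALPHA

for a, b in [(12, 10), (25, 5), (6, 8)]:  # invented read counts
    print(f"{a} vs {b}: {biased(a, b)}")
```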
We also compared gene expression inference using the mapped reads. For the vg mpmap (v.1.43.0) graph mappings, we used rpvg (commit 1d91a9e)57 in transcript inference mode to quantify expression. Two different transcript annotations were used as input to rpvg. The MC pantranscriptome was created from the CAT transcript annotations on each assembly and the GENCODE (v.38) transcriptome. The mpmap-rpvg pipeline was compared with Salmon (v.1.9.0)131 and RSEM (v.1.3.3)132. These methods were given the GENCODE (v.38) transcriptome as input, including the transcripts on the GRCh38 alternative contigs. Any genes exclusive to the alternative contigs were filtered out. For Salmon, the GRCh38 reference was used as a decoy and duplicate transcripts were kept. Bowtie2 (v.2.4.5)133 was used as the mapper for RSEM. Gene expression values were computed by summing the corresponding transcript-level values for each gene. gffread (v.0.12.7) was used to create the table of gene names and transcript identifiers from the transcript annotation134. Accuracy was measured in two ways: (1) the Spearman correlation between the simulated and inferred expression values; and (2) the mean absolute relative difference between the simulated and inferred expression values.

The scripts used for graph construction, read simulation, mapping and evaluation are available at GitHub (https://github.com/jonassibbesen/hprc-rnaseq-analyses-scripts).

We aligned H3K4me1, H3K27ac and ATAC-seq data obtained from monocyte-derived macrophages of 30 individuals59 to both the GRCh38 reference graph and the HPRC pangenome graph13. We then called peaks on both sets of alignments using Graph Peak Caller (v.1.2.3)135 for the 30 H3K4me1, H3K27ac and ATAC-seq samples. To identify HPRC-only peaks, we projected the HPRC peak coordinates onto the GRCh38 paths using Graph Peak Caller and compared the two sets using BEDTools123. We labelled HPRC peaks that overlapped GRCh38 peaks as common peaks, and those that did not as HPRC-only peaks. We computed the expected frequency (as a reverse cumulative distribution) of common and HPRC-only peaks across the 30 samples by resampling each sample's peaks from the peaks of all samples and recounting the overlaps. We repeated this simulation 100 times and plotted the mean curve. We identified heterozygous variants in each sample by aligning the sample's WGS dataset and genotyping the variants. We narrowed this down to the list of heterozygous SVs larger than 50 bp in each sample, with the aim of finding allele-specific peaks. For each epigenomic sample, we obtained allele-specific read counts within peaks overlapping the previously identified loci from the epigenomic HPRC alignments, which report the number of reads on each path through the bubble (the DP and AD fields in the VCF output). We then assigned peaks to the SV allele, the reference allele or both alleles using a two-tailed binomial test on the reads summed over each allele, with P = 0.05. Any peak with reads on one allele and none on the other was assigned to that allele. The expected proportions were adjusted for the length difference between the reference and SV alleles.

The processed data, scripts and code for the above steps are available at Zenodo (https://doi.org/10.5281/zenodo.6564396).

Although the population sample represented in our pangenome is small, it provides access to previously underrepresented regions of the genome. We sought to understand the potential utility of these regions for future population genetic studies using regional PCA relative to the CHM13 and GRCh38 references. For these analyses, we considered the PGGB (whole pangenome, combined) and MC (reference-based, CHM13 and GRCh38) graphs. For both graph models, the CHM13-based VCFs provide access to regions not previously observed in GRCh38-based studies, in which short-read-based studies may struggle to reliably align reads and call variants. In combination, the two graphs provide cross-validation of the population genetic patterns implied in these new regions, which we explore here.

To understand chromosome-specific patterns of variation, we applied PCA independently to the VCF of each autosome from PGGB (PGGB-CHM13 and PGGB-GRCh38).

To ensure that the observed patterns do not derive from higher assembly error rates in the repetitive regions of the acrocentric p-arms, we pruned the PGGB graphs using our Flagger confident-region annotations (injecting the confident regions as subpaths and then replacing the full original paths, which include the unreliable regions, with the confident-region-only subpaths). We then called variants again on this pruned graph to obtain a new set of SNPs (the code to prune the graph and call variants can be found at https://github.com/pangenome/HPRCyear1v2genbank/blob/main/workflows/confident_variants.md). Genome-wide, we found that pruning reduced the number of called SNPs by only 1.188% (before, n = 23,272,652; pruned, n = 22,996,113). The overall reduction in the acrocentrics was higher, with 6.29% fewer SNPs (before, n = 3,735,605; pruned, n = 3,676,746), indicating the difficulty of assembling these regions. We note that the PCA sample distributions remained nearly identical (data not shown), suggesting that the patterns observed in the full graph hold despite the assembly issues. In these filtered PGGB-CHM13 and PGGB-GRCh38 VCFs, we considered all biallelic SNPs relative to the chosen reference, regardless of the level of variant nesting (data not shown; filtering to only SNPs with LV = 0 or LV > 0 produced nearly identical results). A qualitative evaluation indicated no marked differences in the PCA patterns of the metacentric chromosomes (Supplementary Fig. 47). However, in the p-arms of the acrocentrics (chromosomes 13, 14, 15, 21 and 22), which are accessible in the PGGB-CHM13 VCF, we observed reduced population differentiation and a higher rate of variance in the lowest principal components.

For a quantitative investigation, we measured the number of clusters implied by the PCA of the PGGB-CHM13 VCFs using k-means clustering, automatically determining the optimal number of clusters for each PCA (gap_stat clustering metric in the fviz_nbclust function available in the factoextra R package; analysis at https://github.com/silviabuonaiuto/hprcpopgeranalysis). Applying this method to three PCAs per chromosome VCF, we obtained optimal cluster counts for the p-arm, the q-arm and the whole chromosome. In metacentric chromosomes, we generally observed an optimal number of clusters roughly corresponding to the number of world population groups expected in the input genomes (3–5, as shown in Supplementary Fig. 48). In the p-arms of the acrocentrics, however, we usually observed only one cluster, in contrast to the other parts of the acrocentric chromosomes, suggesting reduced population differentiation. This pattern was evident only in the CHM13-based PGGB graph. To quantitatively evaluate the difference, we applied the Wilcoxon rank-sum test to compare the distributions of cluster counts between metacentric and acrocentric chromosomes for the whole chromosome, the q-arm and the p-arm. At the whole-chromosome scale and in the q-arms, the difference between the distributions for acrocentric and metacentric chromosomes was non-significant, but for the acrocentric p-arms the difference was significant (Wilcoxon P = 0.013) (Supplementary Fig. 49).

This analysis suggests that substantial challenges remain for the use of these new regions in population genetic studies. Across all chromosomes, the patterns observed in the PCA projections of the pangenome suggest an apparent process of variant sharing between populations within the acrocentric short arms. Indeed, when using the CHM13 assembly as a reference, we observed a more homogeneous population in these regions; this reference contains actual sequence there, whereas GRCh38 contains gaps, which makes the analysis impossible. The apparent population homogenization could be driven by errors. We mitigated this issue by using only SNPs found in Flagger-confident regions, but this does not protect against potential sources of alignment error, which could be amplified by the repetitive sequences at these loci. The chromosome-specific partitioning process applied by both graph models may also fail to correctly partition contigs on these short arms. The known homology between the short arms raises the possibility of ongoing exchange of sequence information between non-homologous chromosomes, which would be consistent with the patterns we observe. In summary, this analysis shows that, when using CHM13 as a reference, sequences on the acrocentric short arms in the PGGB graph do not behave like other sequences in the pangenome.
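For illustration, the rank-sum comparison reduces to the following sketch, with invented cluster counts standing in for the study's per-chromosome values:

```python
from scipy.stats import ranksums

# Placeholder optimal k-means cluster counts per chromosome arm; the real
# values came from the gap-statistic analysis described above.
metacentric_p_arm_clusters = [4, 3, 5, 4, 3, 4, 5, 3, 4, 4]
acrocentric_p_arm_clusters = [1, 1, 1, 2, 1]  # chr13, 14, 15, 21, 22

stat, pvalue = ranksums(metacentric_p_arm_clusters, acrocentric_p_arm_clusters)
print(f"Wilcoxon rank-sum statistic = {stat:.3f}, P = {pvalue:.4f}")
```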

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.