Single-cell multi-omics map of human fetal blood in Down syndrome

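　　Human fetal bone and liver samples were obtained from 15 fetuses with Ts21 aged 12–20 post-conception weeks (PCW) and 5 disomic fetuses aged 11–19 PCW, after termination of pregnancy and with informed written consent. The human fetal material was provided by the MRC/Wellcome Trust (grant MR/R006237/1) Human Developmental Biology Resource (HDBR; http://www.hdbr.org), with maternal informed consent and in accordance with ethical approval by the National Health Service (NHS) Research Health Authority (REC reference 18/LO/0822). The HDBR is regulated by the UK Human Tissue Authority (HTA; www.hta.gov.uk) and operates in accordance with the relevant HTA codes of practice. Sample sizes were not predetermined but were based on sample availability, which was limited at some time points. As samples were collected on the basis of their karyotype, randomization and blinding were not applicable, and data were analysed through automated computational pipelines.　　Fetal livers and femurs were received in L15 medium and processed within 3 h of dissection. Livers were cut into small pieces with a scalpel and placed in a tube containing pre-warmed digestion medium: RPMI (Gibco) supplemented with 10% FBS (Gibco), penicillin–streptomycin (10 U ml−1 penicillin and 100 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM L-glutamine (Thermo Scientific), 1× MEM NEAA (Gibco), 1 mM sodium pyruvate (Gibco) and 1.6 mg ml−1 collagenase IV (Sigma-Aldrich). Tubes were vortexed for 10 s and then incubated at 37 °C for 30 min, with 10 s of vortexing every 15 min. The digested tissue was filtered through a 100-µm filter and diluted in cold D-PBS (Gibco). Cells were centrifuged at 300g for 5 min, then aliquoted and cryopreserved in KnockOut Serum Replacement (Gibco) + 5% DMSO (Sigma-Aldrich). For femurs, adherent material was removed, the epiphyses were cut off with a scalpel and the bone marrow was flushed out with D-PBS. The remaining bone was cut into small pieces, ground in digestion medium with a mortar and pestle, and incubated at 37 °C for 30 min, with vortexing every 15 min. The digested material and the bone marrow flush were pooled and filtered through a 100-µm filter. Cells were centrifuged at 300g for 5 min, then aliquoted and cryopreserved in KnockOut Serum Replacement (Gibco) + 5% DMSO (Sigma-Aldrich). Cells were stored in liquid nitrogen until further analysis.　　The karyotype of each sample in this study was determined by quantitative fluorescence PCR (QF-PCR). The analysis was carried out by the tissue bank from which the samples were obtained. QF-PCR was performed using chromosome-specific microsatellite markers. The analysis showed an apparently normal diploid complement of chromosomes 13, 15, 16, 18, 21 and 22 and of the sex chromosomes in disomic samples, and trisomy of chromosome 21 in Ts21 samples. No mosaicism was detected.　　On the day of FACS sorting, cells were rapidly thawed at 37 °C and transferred into complete RPMI medium (RPMI (Gibco) supplemented with 10% FBS (Gibco), penicillin–streptomycin (10 U ml−1 penicillin and 100 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM L-glutamine (Thermo Scientific), 1× MEM NEAA (Gibco) and 1 mM sodium pyruvate (Gibco)). Live cells were enriched using the MACS Dead Cell Removal kit (130-090-101, Miltenyi Biotec) according to the manufacturer's instructions. When CD235a+ cells were depleted, magnetic negative selection was performed using CD235a MicroBeads (130-050-501, Miltenyi Biotec) and MACS LS columns (130-042-401, Miltenyi Biotec).　　For FACS sorting, cells were stained for 30 min at 4 °C with Zombie Aqua (Thermo Fisher), to exclude dead cells, and a cocktail of antibodies (Supplementary Table 21, SC sorting panel). Cells were centrifuged at 300g for 5 min at 4 °C, resuspended in a final volume of 500 µl of 5% FBS in PBS, filtered into polypropylene FACS tubes (352063, Thermo Fisher) and then sorted on a BD FACSAria Fusion.　　Each cell suspension was submitted for 3′ scRNA-seq using the Single Cell G Chip Kit, chemistry v3.1 (10x Genomics), according to the manufacturer's instructions. Libraries were sequenced on an Illumina NovaSeq S4, targeting 50,000 reads per cell, and mapped to the GRCh38 human reference genome using the Cell Ranger toolkit (v3.0.0).　　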
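Nuclei were prepared according to 10x Genomics recommendations. Live CD45+ sorted cells were centrifuged at 300g for 10 min at 4 °C. Pellets were resuspended in 45 µl chilled lysis buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 0.1% Nonidet P40 substitute, 0.01% digitonin, 1% BSA, 1 mM dithiothreitol and 1 U ml−1 RNase inhibitor in nuclease-free water) and incubated on ice for 5 min. Chilled wash buffer (50 µl; 10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 1% BSA, 0.1% Tween-20, 1 mM dithiothreitol and 1 U ml−1 RNase inhibitor in nuclease-free water) was added, and nuclei were centrifuged at 500g for 7 min at 4 °C. After removal of the supernatant, nuclei were washed with 45 µl chilled diluted nuclei buffer (1× nuclei buffer, 1 mM dithiothreitol and 1 U ml−1 RNase inhibitor in nuclease-free water). Nuclei were centrifuged at 500g for 10 min at 4 °C, resuspended in 7 µl chilled diluted nuclei buffer and counted. One thousand nuclei were targeted for library preparation. Each nuclei suspension was submitted for library preparation using the Chromium Next GEM J Single Cell kit (10x Genomics) according to the manufacturer's instructions. Libraries were sequenced on an Illumina NovaSeq S4, targeting 50,000 reads per nucleus, and mapped to the GRCh38 human reference genome using the Cell Ranger ARC toolkit (v1.0.1).　　Tissues were frozen in dry-ice-cooled isopentane and stored at −80 °C under airtight conditions. Before any spatial transcriptomics protocol was run, tissues were embedded in optimal cutting temperature (OCT) compound and RNA quality was assessed using the RNA integrity number (RIN). Tissues with RIN values > 7 were cryosectioned in a pre-cooled cryostat at a thickness of 10 µm. Two consecutive 10-µm sections were cut and transferred onto the four 6.5 mm × 6.5 mm capture areas of the gene expression slide. Slides were fixed in methanol for 30 min, stained with haematoxylin and eosin, and imaged using a NanoZoomer slide scanner. Tissue was permeabilized for 6 min. Reverse transcription and second-strand synthesis were performed on the slide, with quantification by quantitative PCR with reverse transcription (qRT–PCR) using the KAPA SYBR FAST qPCR kit (KAPA Biosystems), analysed on a QuantStudio (Thermo Fisher). After library construction, libraries were quantified and pooled at a concentration of 2.25 nM. Pooled libraries from each slide were sequenced on a NovaSeq SP (Illumina) using a 150-bp paired-end dual-indexed setup to obtain a sequencing depth of approximately 50,000 reads, in line with 10x Genomics recommendations.　　Single-cell colony-forming unit (SC-CFU) assays were performed on fetal HSCs as previously described51. Single, live, Lin−, CD34+, CD38−, CD62L+, CD52+ cells were isolated from the fetal livers of three different fetuses with Ts21 (median of 13 PCW) and with disomy (median of 12 PCW) and sorted into 96-well plates (Supplementary Table 21, HSC SC-CFU panel) containing StemSpan SFEM (Stemcell Technologies) supplemented with penicillin–streptomycin (10 U ml−1 penicillin and 100 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM L-glutamine (Thermo Scientific), 20 ng ml−1 G-CSF (Peprotech), 20 ng ml−1 SCF (Peprotech), 50 ng ml−1 TPO (Peprotech), 20 ng ml−1 IL-3 (Peprotech), 20 ng ml−1 IL-6 (Peprotech), 20 ng ml−1 IL-5 (Peprotech), 20 ng ml−1 […]-CSF (Peprotech) and 20 U ml−1 EPO (R&D). Cells were cultured for 15 days at 37 °C in 5% CO2. At the end of culture, using a BD LSR-Fortessa 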
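analyser (Supplementary Table 21, SC-CFU lineage panel), the lineage output of colonies was assessed by the expression of CD41a (megakaryocytic), CD235a (erythroid), […] (lymphoid) and CD11b (myeloid). A colony was considered lineage-positive if 30 or more cells were detected in the relative gate. The total number of cells in a colony was determined using a Countess II cell counter (Thermo Fisher). To assess differences in colony output between Ts21 and disomic samples, we performed a chi-squared test using Ts21 as the observed distribution and disomic as the expected distribution.　　To compare the Ts21 erythroid lineage output with the disomic erythroid lineage output in the largest colonies, we first selected the Ts21 colonies whose cell counts exceeded those of 95% of all disomic colonies (equivalent to 20% of all Ts21 colonies). The output of multilineage (mixed) colonies was binarized to the lineage with the highest number of cells in the relative gate. We then performed a binomial test with n = 17 observed Ts21 erythroid lineages, k = 23 total Ts21 lineages and p = 0.5 (the proportion of disomic lineages that were erythroid).　　Cells were rapidly thawed in complete RPMI medium at 37 °C, centrifuged at 300g for 5 min and washed again with D-PBS. Cells were then resuspended in D-PBS, and LIVE/DEAD Blue was added at a final dilution of 1:800. Cells were incubated for 15 min at room temperature in the dark and then washed with D-PBS. Cells were then stained for 30 min at room temperature in the dark with an antibody cocktail (Supplementary Table 21, phenotyping panel) in BD Horizon Brilliant Stain Buffer (final dilution 1:4) and Miltenyi FcR blocking reagent (final dilution 1:5), in a final volume of 200 µl. Cells were washed with D-PBS and immediately acquired on a Cytek Aurora (five-laser configuration). Data were analysed in FlowJo (v10.8.2).　　MitoSOX is a fluorogenic dye that specifically targets mitochondria in live cells and produces bright green fluorescence after oxidation by mitochondrial superoxide. MitoTracker Green FM accumulates in mitochondria independently of membrane potential and oxidative stress, and is a reliable tool for measuring mitochondrial mass38. Considering the dye efflux bias in HSCs and progenitors39, we used verapamil treatment to block xenobiotic efflux pumps and mitigate preferential dye efflux (Supplementary Fig. 16c,e), to ensure a more accurate representation of mitochondrial mass and mtROS.　　Cells were rapidly thawed in complete RPMI medium at 37 °C, centrifuged at 300g for 5 min and washed again with D-PBS. Before incubation with the dyes, MitoTracker Green FM reagent was dissolved in DMSO and MitoSOX Green was dissolved in anhydrous N,N-dimethylformamide at a concentration of 1 mM. Cells were resuspended in 1 ml D-PBS and incubated for 30 min at 37 °C with MitoTracker Green FM (final dilution 1:1,000) or 2 µM MitoSOX Green in the presence of verapamil (diluted from a 10 mM aqueous stock solution). Cells were then washed with D-PBS and stained for 30 min on ice in the dark with an antibody cocktail (Supplementary Table 21, mito panel) in BD Horizon Brilliant Stain Buffer (final dilution 1:4), Miltenyi FcR blocking reagent (final dilution 1:5) and 50 µM verapamil, in a final volume of 100 µl. Cells were washed again in D-PBS and immediately acquired on a Cytek Aurora (five-laser configuration). Data were analysed in FlowJo (v10.8.2). Populations of interest (Lin+, CD38+, CD38−, HSC) were exported as FCS files and imported into R using flowCore 2.2 to obtain the fluorescence data of each mitochondrial probe for each cell.　　To test whether Ts21 cells differ significantly from disomic cells in mitochondrial mass and mtROS, we fitted Gaussian generalized linear mixed models. At single-cell resolution, we transformed the mitochondrial mass values using a rank-based normal transformation and used the transformed values as the response variable. We included age as a fixed effect and sample as a random intercept. We tested the effect of disease status (fixed effect) in the model and determined significance using FDR across the eight fitted models (four cell types by two response variables).　　To produce lentivirus, HEK293T cells were co-transfected with a lentiviral TFR2 expression plasmid or an empty vector (purchased from VectorBuilder), together with the psPAX2 packaging plasmid and pMD2.G, using Lipofectamine 3000 (Thermo Fisher). Viral supernatant was collected at 48 h and 72 h after transfection, pooled and concentrated by ultracentrifugation (90,000g for 90 min). The pellet was resuspended in StemSpan SFEM II medium (Stemcell Technologies) and stored at −80 °C. The function and titre of the lentivirus were determined by serial dilution on 293T cells, with GFP/TFR2 expression assessed 48 h after transduction. HUDEP-2 cells were cultured and differentiated as previously described52. In 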
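brief, the cultivation medium of HUDEP-2 is based on StemSpan SFEM II medium (Stemcell Technologies) supplemented with 2% penicillin–streptomycin (20 U ml−1 penicillin and 200 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM L-glutamine (Thermo Scientific), 50 ng ml−1 SCF (Peprotech), 3 U ml−1 EPO (R&D), 1 µM dexamethasone (Sigma) and 1 µg ml−1 doxycycline (Sigma). HUDEP-2 cells were incubated with concentrated virus (multiplicity of infection of 1) for 6 h, spun down and resuspended in fresh cultivation medium. After 2 days, cells were spun down and resuspended in differentiation medium, which is composed of IMDM (Thermo Fisher) supplemented with 2% penicillin–streptomycin (20 U ml−1 penicillin and 200 ng ml−1 streptomycin, Sigma-Aldrich), 2 mM L-glutamine (Thermo Scientific), 50 ng ml−1 SCF (Peprotech), 3 U ml−1 EPO (R&D), 5% human AB serum (Sigma), 10 µg ml−1 insulin (Sigma), 330 µg ml−1 holo-transferrin (Sigma), 2 U ml−1 heparin (Sigma) and 1 µg ml−1 doxycycline (Sigma). Cells were cultured in differentiation medium for 4 days at 37 °C in 5% CO2. TFR2 and erythroid lineage marker expression (CD235a, CD71 and CD36) was assessed by flow cytometry using a BD LSR-Fortessa analyser (Supplementary Table 21, HUDEP panel). HUDEP-2 and HEK293T cells were gifts from the Grønbæk and Issazadeh-Navikas laboratories (University of Copenhagen), respectively, and were routinely tested for mycoplasma contamination.　　For each CellRanger output corresponding to a specific technical and biological replicate, we identified low-quality cells or empty droplets by applying the barcodeRanks and emptyDrops functions of the R package DropletUtils53. We then merged all CellRanger outputs into a single Scanpy object54. After per-sample droplet removal, quality control was performed on three parameters: total unique molecular identifier counts (lower and upper thresholds of 750 and 110,000), number of detected genes (lower and upper thresholds of 250 and 8,500) and the proportion of mitochondrial gene counts per cell (less than 20%). We further applied Scrublet55 to remove potential doublets.　　Next, we subset the data to samples of the same organ and Ts21 status and merged these into a single dataset (for example, a dataset containing only Ts21 liver samples). We reasoned that the difference between Ts21 and disomic cell transcriptomes would be dominated by the extra copy of chromosome 21 (for example, expression 50% higher in Ts21 cells than in disomic cells). As a result, highly variable genes computed on combined data would be driven by disomic versus Ts21 differences and would be insufficient to capture genes relevant to liver-resident or femur-resident cells. We therefore chose to create separate disomic liver, Ts21 liver, disomic femur and Ts21 femur datasets to annotate the populations of individual cells in the data as accurately as possible. In each of the four merged datasets, we applied log-normalization with a scaling factor of 10,000 to correct for between-sample differences in library size, and computed highly variable genes using the Seurat (v5.0.3) implementation56. We performed principal component analysis (PCA) on the highly variable genes to reduce dimensionality and retained the first 15 components on the basis of the scree plot elbow rule. Data were batch corrected using Harmony16 to account for additional technical variation of non-biological origin between samples.　　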
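The per-cell quality-control thresholds and the log-normalization step described above can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline, which operated on a Scanpy object; the `Cell` record and the helpers `passes_qc` and `log_normalize` are hypothetical names, the toy counts are invented, and treating the UMI and gene bounds as inclusive is an assumption.

```python
import math
from dataclasses import dataclass

# Thresholds quoted in the text; inclusive bounds are an assumption.
UMI_RANGE = (750, 110_000)
GENE_RANGE = (250, 8_500)
MAX_MITO_FRACTION = 0.20


@dataclass
class Cell:
    """Hypothetical per-cell summary used only for this illustration."""
    barcode: str
    total_umis: int
    n_genes: int
    mito_umis: int


def passes_qc(cell: Cell) -> bool:
    """Apply the three filters: UMI count, detected genes, mitochondrial fraction."""
    mito_fraction = cell.mito_umis / cell.total_umis if cell.total_umis else 1.0
    return (
        UMI_RANGE[0] <= cell.total_umis <= UMI_RANGE[1]
        and GENE_RANGE[0] <= cell.n_genes <= GENE_RANGE[1]
        and mito_fraction < MAX_MITO_FRACTION
    )


def log_normalize(counts, scale=10_000):
    """Scale one cell's counts to `scale` total counts, then take log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c / total * scale) for c in counts]


cells = [
    Cell("AAAC", 5_000, 1_800, 250),    # passes all three filters
    Cell("AAAG", 600, 300, 10),         # too few UMIs
    Cell("AAAT", 9_000, 2_500, 2_700),  # 30% mitochondrial counts
    Cell("AACA", 120_000, 9_000, 500),  # too many UMIs and genes
]
kept = [c.barcode for c in cells if passes_qc(c)]

# Toy count vector for one retained cell (four genes).
normalized = log_normalize([1, 1, 2, 0])
```

Only the first toy cell survives the filters, and a zero count stays at zero after normalization because log(1 + 0) = 0.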
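We then performed an iterative clustering procedure to identify clusters in the single-cell data. Broadly, the procedure first found initial clusters using the Leiden algorithm, then merged clusters that appeared to come from the same cell population, and finally refined populations within clusters using k-means clustering. Iterative clustering therefore allowed us to refine the initial clustering, so that initial clusters containing multiple cell types could be split into lower-level cell types. This was particularly useful for broad cell types such as erythroid cells, which were further divided into early and late erythroid cells. After the sample batch correction described above, we computed a neighbourhood graph using the uniform manifold approximation and projection (UMAP) method implemented in Scanpy, followed by clustering with the Leiden algorithm. For visualization, we used UMAP manifold embeddings in two and three dimensions to capture global features. We identified marker genes for each cluster using a Wilcoxon signed-rank test with FDR correction, and annotated clusters using these marker genes together with canonical marker genes. We performed further clustering by manually selecting clusters for subclustering, applying k-means clustering, merging clusters of the same cell type and repeating marker gene detection. With this approach, we generated four separate annotated scRNA-seq datasets, for Ts21 (liver and femur) and disomic (liver and femur) samples, along with the associated marker genes.　　To compare cell-type abundances, we computed the proportion of each major cell-type group within each sample. We compared cell-type proportions between developmental-stage-matched Ts21 and disomic samples with the same sorting strategy using a Mann–Whitney U-test. Finally, we corrected for multiple testing using FDR and assessed significance at FDR < 0.1. 　　We inferred statistically significant ligand–receptors and their corresponding cell types using CellPhoneDB on a subsampled Ts21 liver dataset, such that the proportion of cells in the reduced sample recapitulated the proportion in the full Ts21 dataset and corresponded to the number of cells in the disomic dataset. We repeated the same analysis on the full disomic dataset (which now has an identical number of cells). We kept any pairs that did not involve HLA or a protein complex, and kept only those that involved a single receptor. Among the significant ligand–receptors (P < 0.001), we selected ligands or receptors identified in HSC/MPPs and used to communicate with vascular endothelial cells, and performed gene set enrichment analysis (GSEA) on those using EnrichR. 　　For the disomic and Ts21 femur, we computed partition-based graph abstraction (PAGA) using all annotated stromal cells (with a PAGA threshold of 0.05). We also computed a force-directed diffusion graph using Pegasus57, and overlaid the Pegasus and PAGA outputs. 　　Next, we focused on two different cases of osteo-lineage transitioning: (1) within CAR cells, LepR+ CAR cells, osteoprogenitors and osteoblasts, and (2) within arterial endothelial cells, transitioning endothelial cells and osteoblasts. 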
We computed pseudotime using Scanpy, and used the pseudotimes as input into PseudotimeKernel in CellRank58 (without usage of RNA velocity information) to obtain generalized Perron cluster cluster analysis (GPCCA) estimators for identifying macrostates and computing transition probabilities among them. We set the terminal state number according to the shape of the force-directed graph. For case (1), we chose two states in disomic cells based on the observation that there are two clear branches splitting between CAR cells and osteoblasts, and we chose one state in Ts21 because we observed a single branch leading to osteoprogenitors. For case (2), we chose two states considering osteoblasts as one end and some transitioning endothelial cells as the other. Next, we plotted the STREAM plot using the scVelo package59 to visualize the cell-type transition matrix. Finally, we correlated gene expression with estimated absorption probabilities (Pearson correlation, as implemented in the CellRank package). We identified the positively or negatively correlated genes at a significance level of FDR < 0.05 separately in Ts21 and the disomic femur. We checked the Gene Ontology terms of the top 500 genes that were most positively or negatively correlated to absorption probabilities using the clusterProfiler R package60. 　　Within each cell type, four distinct differential expression analyses were performed to identify differentially expressed genes (DEGs) due to disease status (Ts21 or disomic) or the microenvironment (liver or femur). 　　Previous literature has shown that pseudobulk differential expression methods have improved FDRs compared with single-cell differential expression methods61. As a result, our analyses were performed by first computing cell-type-specific pseudobulk profiles for each sample and then analysing pseudobulk RNA-seq profiles using limma62. 　　To calculate sample-level pseudobulk profiles, we aggregated the read counts across cells of the same type. 
We kept samples for analysis that contained at least ten cells, and we used the filterByExpr() function in the edgeR package with default settings to retain genes for differential expression analysis and reduce the burden of multiple test correction, by removing genes with low expression across samples63. 　　Next, limma-voom was used to perform a statistical analysis for differential expression. In brief, sample-level weights were calculated by computing normalization factors for transforming count data into log2 counts per million and deriving weights based on a mean-variance relationship (using the calcNormFactors() function in edgeR and the voom() function in limma in R). Log fold changes for each gene were estimated using a linear model with sorting strategy as a covariate. P values were estimated after empirical Bayes shrinkage (lmFit() and eBayes() functions in limma). A Benjamini–Hochberg FDR correction was applied across all gene P values, and significance was assessed at FDR < 0.05. 　　As we observed an exponential cross-dependency between the proportion of DEGs on chromosome 21 and other chromosomes, we investigated additional factors that could be relevant. We first tested whether cell-type-specific overexpression of a particular gene on chromosome 21 can lead to greater dysregulation on either chromosome 21 or on other chromosomes. As the probability of a gene being differentially expressed is linked to the number of cells tested in the differential expression analysis, we tested this using log fold change values. For each chromosome 21 gene and across cell types, we tested the Pearson correlation between log2 fold change of gene expression (from Ts21 versus disomic samples) and (1) chromosome 21 DEG (%) or (2) non-chromosome 21 DEG (%), and determined the significance of the correlation using FDR < 0.1. 
Second, we reasoned that overall expression of an important chromosome 21 gene could lead to greater dysregulation, as a highly expressed chromatin modifier or transcription factor might have a consistent 50% overexpression in Ts21 across cell types, but the overexpression might matter more in the cell types where the gene is expressed. We tested this for each chromosome 21 gene. Across cell types, we tested the Pearson correlation between average cell-type-specific gene expression (from Ts21 and disomic samples) and (1) chromosome 21 DEG (%) or (2) non-chromosome 21 DEG (%). 　　It is difficult to ascertain whether a gene is commonly or uniquely upregulated in single-cell data (for example, a gene upregulated in Ts21 liver HSCs compared with Ts21 femur HSCs, but not in disomic liver HSCs compared with disomic femur HSCs). The presence of a DEG in one cell type and its absence in another may be a result of differences in population size, and thus purely statistical. 　　As there are sample and cell count differences between datasets, we could not directly take the Ts21 liver versus femur DEGs, because the Ts21 datasets were much larger than the disomic datasets. Instead, we identified DEGs specific to disease status (in liver versus femur analyses) and the microenvironment (in Ts21 versus disomic analyses) in HSC/MPPs by using a subsampling procedure. Downsampling makes it possible to compare two analyses from distinct datasets that are confounded by differences in size. We did not repeat the same procedure across additional cell populations to conclude whether genes are differentially expressed in specific cell populations, as this would require downsampling to the smallest population sizes, which would erode statistical power and be computationally expensive. 　　
Within liver versus femur analyses, we downsampled the Ts21 liver and Ts21 femur dataset to have the same number of fetuses contributing the same number of samples with the same number of HSC/MPPs as the disomic liver and disomic femur data. As a result, the Ts21 data matched the disomic data in terms of fetus sample cell counts. Similarly, within Ts21 versus disomic analyses, we downsampled the Ts21 and disomic liver data based on fetus sample cell counts in the Ts21 and disomic femur data. As an additional restriction in our downsampling, we ensured that fetuses present in both liver and femur data, with equal or greater number of cells and samples in the liver than in the femur, would still be selected in the downsample. The downsampling routine was repeated 100 times, such that 100 new datasets were created that match the smaller dataset. Differential expression analysis was performed identically to the full data using sample-level pseudobulks and limma-voom. The median nominal P value for each DEG was calculated across 100 iterations. We verified the robustness of this choice of 100 iterations by visualizing the variability of the median P value across iterations to assess its stability. 　　Next, we used differential expression analyses in the full data and in the downsampled data to categorize the context dependence of DEGs. In the liver versus femur analysis, we implicate environment-driven DEGs, Ts21-induced DEGs and Ts21-reverted DEGs. 　　To visualize gene–environment interactions, we examined the expression of environment-driven DEGs and Ts21-induced DEGs across Ts21 and disomic liver and femur HSC/MPPs. We scaled expression across cells for each DEG to mean = 0 and variance = 1. We averaged the scaled expression across genes within the environment-driven and Ts21-induced gene sets, such that each cell has its own value for each gene set. We visualized the mean and standard error of these values across cells in Fig. 2c. 　　
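The per-cell gene-set values described above (scale each DEG across cells to mean 0 and variance 1, average the scaled values across the genes of a set, then summarize across cells) can be sketched as follows. This is an illustrative reconstruction and not the authors' code: the helper names, the toy gene names and values, and the use of the population standard deviation for scaling are all assumptions.

```python
import statistics


def zscore(values):
    """Scale one gene's expression across cells to mean 0 and variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def gene_set_scores(expression, gene_set):
    """Average scaled expression across the genes of a set, one value per cell.

    `expression` maps a gene name to its per-cell expression values,
    with the cells in the same order for every gene.
    """
    scaled = [zscore(expression[gene]) for gene in gene_set]
    n_cells = len(scaled[0])
    return [
        statistics.fmean([gene_values[i] for gene_values in scaled])
        for i in range(n_cells)
    ]


def mean_and_sem(values):
    """Mean and standard error of the mean across cells."""
    sem = statistics.stdev(values) / len(values) ** 0.5
    return statistics.fmean(values), sem


# Toy data: two hypothetical genes measured in four cells.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
}
scores = gene_set_scores(expression, ["GENE_A", "GENE_B"])
group_mean, group_sem = mean_and_sem(scores)
```

Because each gene is scaled before averaging, a highly expressed gene (here the hypothetical `GENE_B`) does not dominate the gene-set value.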
GSEA of upregulated Ts21-induced genes was performed by inputting the list of genes into EnrichR. Scatterplots show the top Gene Ontology terms (molecular function and biological process) or ENCODE and ChEA transcription factors. 　　We assigned a cell cycle phase using the score_genes_cell_cycle() function in Scanpy with the standard list of cycling genes from Tirosh et al.64, as applied to all cells from samples of the same environment and disease type. We classified a cell as cycling if its predicted phase was ‘G2M’ or ‘S’, and as non-cycling if it was ‘G1’. We compared the proportions of cycling cells using a Mann–Whitney U-test. 　　Using publicly available data, we evaluated the hypothesis that the regulatory landscape is affected in Down syndrome to influence where somatic mutations occur. First, we downloaded somatic mutation data from Hasaart et al. in fetal Ts21 and disomic HSCs4, and converted the mutation positions from hg19 to hg38 using liftOver. Second, we identified genes expressed in Ts21 HSCs. We used the filtered set from the Ts21 cycling versus less-cycling HSC differential expression analyses, which were identified by the filterByExpr() function. Next, we narrowed down the Ts21 and disomic sets of mutations within the HSC-expressed genes to the set of non-coding intronic mutations. Finally, we downloaded candidate cis-regulatory elements (cCREs) in hg38 from ENCODE; for Ts21 and disomic mutation sets, we calculated the proportion of intronic somatic mutations in Ts21 HSC-expressed genes that overlap with ENCODE cCREs. To determine significant differences, we bootstrapped the disomic intronic mutation set 1,000 times and compared the observed Ts21 proportion to the disomic distribution. We calculated P values as the proportion of bootstrapped disomic mutation sets with larger values than the Ts21 value. 　　
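The bootstrap comparison described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' implementation: each mutation is reduced to a Boolean flag (True if it overlaps a cCRE), the function names are hypothetical, the toy data are invented, and resampling with replacement at the original disomic set size is assumed.

```python
import random


def overlap_proportion(flags):
    """Fraction of mutations that overlap a cCRE."""
    return sum(flags) / len(flags)


def bootstrap_p_value(ts21_flags, disomic_flags, n_boot=1000, seed=0):
    """Proportion of bootstrapped disomic sets with a larger overlap than Ts21."""
    rng = random.Random(seed)
    observed = overlap_proportion(ts21_flags)
    n = len(disomic_flags)
    larger = 0
    for _ in range(n_boot):
        # Resample the disomic mutation set with replacement.
        resample = rng.choices(disomic_flags, k=n)
        if overlap_proportion(resample) > observed:
            larger += 1
    return larger / n_boot


# Invented toy data: 30% of Ts21 and 10% of disomic mutations overlap a cCRE.
ts21 = [True] * 30 + [False] * 70
disomic = [True] * 10 + [False] * 90

p_enriched = bootstrap_p_value(ts21, disomic)
p_reverse = bootstrap_p_value(disomic, ts21)
```

With these toy inputs, almost no resampled disomic set reaches the Ts21 proportion, so `p_enriched` is small, whereas swapping the roles gives a value close to 1.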
For disomic and Ts21 liver datasets, we used our annotated cell types from the disomic and Ts21 large scRNA-seq datasets as our input reference data for Cell2location65. Next, we merged all SpaceRanger outputs of tissue sections for the disomic liver and then separately for the Ts21 liver to create two Scanpy objects56. We removed mitochondrial genes and spots with the total expressed gene count of less than 800 (the remaining spots were of sufficient good quality for downstream analysis). 　　We then estimated cell-type abundances for each spatial spot. Using Cell2location, we trained a negative binomial regression model on the input reference data. We applied our model to the Scanpy formatted data, considering tissue section as a covariate to account for distinct batches. We used the estimated posterior mean value of each cell type (from Cell2location) as the local abundances. For each section, we computed the section-level relative abundance of each cell type as the proportion of its estimated abundance across all spots over the total estimated abundance of all cell types across all spots. We compared relative abundances between disomic and Ts21 using a Wilcoxon rank-sum test, and corrected P values by using Benjamini–Hochberg. 　　To evaluate cell-type colocalization, we computed spot-level relative abundance of each cell type, dividing the abundance of each cell type on an individual spot by the total abundance of all cell types on the same spot. We then computed a Pearson distance matrix among cell types, based on these spot-level relative abundances across sufficient-quality spots of tissue sections in the same disease status, respectively, for the Ts21 liver and disomic liver. We next performed hierarchical clustering, with inter-cluster distance estimated by the Ward variance minimization algorithm. 　　We performed the initial processing of multiome data using Seurat (v5.0.3) and Signac (v1.13)66. 
After Cellranger processing, we identified high-quality multiome cells for downstream analysis if they satisfied the following criteria: more than 750 RNA unique molecular identifiers, more than 250 expressed genes, less than 40% mitochondrial read fraction, transcription start site (TSS) enrichment score of more than 3 and more than 1,000 ATAC fragments in peaks. We next identified and annotated transcriptionally distinct clusters within Ts21 and disomic samples using Seurat. We created Ts21-specific and disomic-specific expression matrices by merging the matrices across Ts21 or disomic samples, respectively. Within the separate Ts21 and disomic expression datasets, we log-normalized with a scaling factor of 10,000, identified 2,000 highly variable genes, scaled and centred the data, performed PCA, and used Harmony with lambda = 1 to batch correct for sample-specific variation. We then constructed a k-nearest neighbours graph based on the Euclidean distance in PCA space (using the first 30 components), and identified transcriptionally distinct clusters using the Leiden algorithm. We nominated marker genes for each cluster by performing a Wilcoxon signed-rank test that compares cells within one cluster to all other cells. We performed further clustering by performing K-means clustering and we merged clusters of the same cell type. With this approach, we annotated each cell in two separate multiome datasets (Ts21 and disomic). The Harmony-corrected datasets were visualized in two dimensions using UMAP. 　　Our overarching process for creating the chromatin accessibility matrix was to call peaks within each sample before calling a final set of peaks from cell-type-specific ATAC profiles. We first called peaks within each sample using macs2, as implemented with default parameters in the callPeaks() function in Signac. 
We created a unified set of peaks across all samples by combining any intersecting peaks into a single peak, and removing the combined peaks that were less than 20 bp or more than 10 kb wide. This set of peaks was used to compute a cell × peaks matrix for each sample from the ATAC fragments file. Using all peaks present in at least ten cells, we ran latent semantic indexing (a two-step procedure of first using term frequency-inverse document frequency normalization and then singular-value decomposition) to project the ATAC matrix into a reduced dimension representation. We performed batch correction over all samples using Harmony before constructing a k-nearest neighbours graph across the first 30 components (except omitting the first component, which correlates with sequencing depth). This graph was used to perform clustering using the Leiden algorithm with resolution = 1; within each cluster, a new set of peaks were called using macs2. This set of peaks was combined into a unified set of peaks, which was used to form the final cell × peaks matrix, which contained all final Ts21 and disomic cells. 　　For downstream analyses involving a single combined dataset, the two datasets were merged into a single matrix. Log-normalization, scaling, PCA, Harmony batch correction and UMAP were applied to the combined dataset. 　　Processed snRNA data from multiome was subset to cells of the myeloid lineage. Both disomic and Ts21 cohorts were downsampled to 15,000 cells. A trajectory graph was calculated using the force-directed layout (FLE) function in Pegasus. Instead of the UMAP space, the cell trajectories were plotted in the FLE space. Trajectory analysis of the RNA expression data closely followed the CellRank tutorial ‘CellRank beyond RNA velocity’. Moments of connectivity were calculated using scVelo with 30 principal components and 15 neighbours. 
The root cell was manually selected according to the diffusion map of HSCs, selecting the cell with the greatest Euclidean distance in FLE space from the centre of the cluster, indicating a cell with a divergent transcriptome. The pseudotimeKernel was used to calculate pseudotime with default parameters. The CytoTRACEKernel was used to compute the transition matrix with the parameters used in the tutorial (threshold_scheme = ‘soft’, nu = 0.5). To compute terminal states and the probability of each cell differentiating towards each terminal state, the GPCCA estimator was utilized with default parameters. Schur decomposition was performed, and five terminal states were automatically selected according to an eigengap in the real part of the eigenvalues. Terminal states were labelled according to the cell type with the closest association (late erythroid, monocytes, mast cells, pDCs and megakaryocytes) and absorption probability was calculated. Significant differences in predicted terminal states between Ts21 and disomic HSCs were calculated using a binomial test that used disomic proportions as the background probabilities. 　　We used MIRA to perform topic modelling of HSCs in the 10X multiome dataset42. Only HSCs were included for all downstream MIRA analysis. Out of the 6,215 HSCs, 3,784 were Ts21 and 2,431 were disomic. All MIRA analysis closely followed the online tutorials. 　　To generate the latent topics for RNA, the variational autoencoder (VAE) framework uses raw expression counts as input. Rare genes were removed by filtering genes only expressed in 15 or fewer cells. Exogenous genes (n = 7,905) were selected using the highly variable gene function in Scanpy, selecting for all genes with a minimum mean dispersion of 0.1. Exogenous genes are genes that will be captured in topics but will not be used as VAE features. Endogenous genes (n = 4,359) were selected by filtering the exogenous genes for those with a normalized dispersion greater than 0.5. 
Endogenous genes were used as features for the VAE network. The ExpressionTopicModel was instantiated with the default parameters. The learning rate bounds were manually tuned to cover the portion of the learning rate versus loss curve with the steepest slope. The model was then tuned using TopicModelTuner with default iterations, a minimum number of topics set to 2, a maximum number of topics set to 15, a batch size of 32, threefold cross-validation and a training size of 0.8. 　　To generate the latent topics for ATAC, the VAE framework used binarized peak counts. Peaks were filtered according to the epiScanpy tutorial. Using the same process as for RNA, 72,541 exogenous peaks and 45,095 endogenous peaks were selected according to a minimum mean dispersion of 0.05 and a normalized dispersion greater than 0.5. The AccessibilityTopicModel was instantiated with the default parameters and ‘dataset_loader_workers’ set to 3. The learning rate bounds were manually tuned to cover the portion of the learning rate versus loss curve with the steepest slope. The model was then tuned using TopicModelTuner with default iterations, a minimum number of topics set to 2, a maximum number of topics set to 15, a batch size of 8, onefold cross-validation and a training size of 0.8. 　　GSEA was performed on the expression topics using the Enrichr wrapper in MIRA and the top 200 genes associated with each topic. For the accessibility topics, transcription factor-binding sites were annotated in peaks using the motif scanning in MIRA and the hg38 reference from the UCSC repository. Each peak was scanned against the JASPAR 2020 vertebrate collection of transcription factor-binding motifs. Transcription factors that were not expressed in the RNA data were removed. For comparing RNA and ATAC topics, the joint data were split into a disomic and a Ts21 dataset and the ‘get_topic_cross_correlation’ function in MIRA was applied to each. 　　
First, a joint embedding space was calculated using the ‘make_joint_representation’ function in MIRA to combine both modalities. A neighbourhood graph and UMAP embedding were performed on the joint representation (15 neighbours and a minimum distance of 0.1). The datasets were then batch corrected using Harmony on the joint UMAP features. The neighbourhood graph and UMAP embedding were re-calculated on the Harmony-adjusted feature space. 　　We next calculated several different HSC branches. First, the diffusion map was calculated using Scanpy ‘diffmap’ with default parameters and then normalized by MIRA to regularize distortions in magnitude of the eigenvectors. Schur decomposition was performed and the eigengap heuristic was used to automatically select the proper number of diffusion components within the data (3). The data were subsetted to three diffusion components and the neighbourhood graph was calculated on the diffusion map embedding and components were connected. Pseudotime was calculated using the ‘get_transport_map’ function in MIRA, which defines a transport map using a Markov chain model of forward differentiation. A root cell was selected as the maximum value of the third diffusion component based on the suggestion in the tutorial. This root cell was located in the centre of the highest density of HSCs. Terminal cells were identified using the ‘find_terminal_cells’ function in MIRA with 8 iterations and a threshold of 0.01. There were three distinct clusters of terminal cells, the cell in each cluster farthest away from the root cell was selected, and the probability of each cell differentiating towards those three branches was calculated. The lineage probabilities were parsed into a bifurcating tree structure using ‘get_tree_structure’ with the threshold set to 1. Expression and accessibility dynamics across time were plotted using a streamgraph that depicts this tree structure. 　　
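The eigengap heuristic mentioned above picks the number of components at the largest drop between consecutive eigenvalues. The sketch below shows the generic form of that rule only; it is an assumption that MIRA's implementation behaves equivalently, and the function name and eigenvalues are invented for illustration.

```python
def eigengap_components(eigenvalues):
    """Return k such that the gap between the k-th and (k+1)-th largest
    eigenvalues is the largest gap in the spectrum."""
    ordered = sorted(eigenvalues, reverse=True)
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    return gaps.index(max(gaps)) + 1


# Invented spectrum: three large eigenvalues followed by a clear drop.
toy_eigenvalues = [1.00, 0.97, 0.95, 0.60, 0.55, 0.50]
n_components = eigengap_components(toy_eigenvalues)
```

For this toy spectrum the largest gap follows the third eigenvalue, so three components are kept, which matches the count reported in the text but does not reproduce the actual eigenvalues.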
LITE modelling in MIRA was performed to link gene expression to nearby cis-regulatory elements. For every gene, MIRA learns a regulatory window describing a range in which changes in local accessibility appear to influence gene expression. The regulatory window decays exponentially both upstream and downstream according to the TSS of each gene. Consequently, each gene is associated with a unique TSS using non-redundant human TSS annotations (hg38 GENCODE VM39). The model was trained on the union of genes that were highly variable and were the top 5% most activated in each of the 10 expression topics (n = 5,367). The genes that did not have an annotated TSS were removed leaving 4,454 genes. The LITE model was instantiated with default parameters and the raw expression and accessibility data were then fitted (4 out of 4,454 genes failed to fit). Now that each gene contained a trained regulatory potential model, the expression of each gene was estimated by calculating the maximum a posteriori prediction given the accessibility state of each gene in each cell. 　　Next, we identified transcription factors that regulate the expression of genes specific to each of the three HSC branches using the probabilistic in silico depletion method in MIRA. MIRA simulates ‘computational knockouts’ of each transcription factor. MIRA uses regulatory potential modelling to predict gene expression based on local chromatin accessibility, and then masks cis-regulatory elements with specific motifs to define transcription factors where motif accessibility is important to gene expression prediction. This is measured by the changes in performance of the regulatory potential model of the gene to predict expression after computationally masking the binding sites of every transcription factor. 
In this manner, transcription factors that strongly regulate the expression of a gene will be prioritized because masking of their binding site will significantly decrease the accuracy of the LITE model prediction. The function ‘probabilistic_isd’ was run with default parameters across all modelled genes. To counteract the inefficiency of noisy transcription factor-binding site predictions, co-varying genes associated with individual topics were queried for a shared association across many transcription factors. The ‘driver_TF_test’ function was utilized to identify potential transcription factors regulating the expression of branch-specific topics. The top 150 topic-specific genes were included, and a Wilcoxon rank-sum test was performed over the association scores. After identifying transcription factors regulating topic expression, the top genes being regulated by these transcription factors were queried using ‘fetch_ISD_matrix’ and selected according to the ranked association score. We additionally performed this procedure across all RNA topics, which we reported in Supplementary Table 16. 　　To identify genes in which expression cannot be accurately predicted by local chromatin accessibility alone, MIRA NITE modelling was performed. In addition to cis-regulation, the NITE model expands on the scope of the LITE model by incorporating accessibility topics, which are genome wide. The NITE model was initialized using the same parameters, topics and genes as the LITE model using ‘spawn_NITE_model’. The model was fit and expression was predicted using default parameters. The difference between LITE and NITE model performance for every gene was calculated using ‘get_chromatin_differential’. The ‘get_NITE_score_genes’ function was used to calculate a cumulative metric per gene describing the divergence of local accessibility and expression across all cells. Genes with a high cumulative NITE score indicate genes that are regulated in part by non-local mechanisms. 
The 500 genes with the highest NITE scores were used for GSEA in MIRA. To ascertain whether Ts21 HSCs expressed genes enriched for non-local regulation, pseudobulk differential expression was performed (DESeq2) between disomic and Ts21 HSCs. The distribution of cumulative gene NITE scores for the top 300 condition-specific genes (log2FC) was compared using a Wilcoxon rank-sum test with Benjamini–Hochberg correction. 　　We downloaded a set of curated promoter annotations from the EPD (ref. 43), available through EPDnew for build hg38 (https://epd.expasy.org/epd/human/human_database.php?db=human). We then intersected these annotations with peaks that were differentially accessible in Ts21 HSCs (as compared with disomy HSCs) using the GenomicRanges R package. Differentially accessible promoters (FDR < 0.1) were then tested to determine whether they were more likely to be linked to a significantly DEG (FDR < 0.1) using a Fisher’s exact test. 　　We defined functional enhancers as those with a high potential to regulate gene expression. Following this definition, we used the ABC model to construct genome-wide maps of enhancer–gene connections (ref. 45). The ABC scores generated for each enhancer reflect the chromatin activity of the enhancer and its chromosomal contact with surrounding genes. We utilized merged ATAC-seq mapping results from disomy HSCs and Ts21 HSCs to identify potential enhancer locations and quantify their chromatin activity. We applied the averaged Hi-C contact data provided by the ABC authors to map contact interactions between chromosomal regions at 5-kb resolution. The ABC model pipeline available on GitHub was then run with default parameters and thresholds to obtain a list of relevant enhancers. For motif analyses, non-promoter enhancers were kept (those ABC enhancers that did not overlap with an EPD promoter). 　　
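As a rough illustration of how an ABC score combines activity and contact, the sketch below computes, for a single gene, each candidate element's activity multiplied by its contact frequency with the gene's promoter, normalized by the sum of that product over all candidate elements considered for that gene. This is a simplified reading of the published ABC formula, not the GitHub pipeline used here; the function name `abc_scores`, the element names and all numbers are hypothetical.

```python
def abc_scores(activity, contact):
    """Activity-by-contact scores for the candidate elements of ONE gene.

    activity: dict mapping element name -> chromatin activity
              (for example, ATAC-seq signal at the element)
    contact:  dict mapping element name -> contact frequency between
              the element and the gene's promoter (for example, Hi-C)
    Returns a dict mapping element name -> ABC score; scores sum to 1
    unless every product is zero.
    """
    products = {e: activity[e] * contact[e] for e in activity}
    total = sum(products.values())
    if total == 0:
        return {e: 0.0 for e in products}
    return {e: p / total for e, p in products.items()}


# Illustrative (made-up) values for three candidate elements of one gene
toy_activity = {"enh1": 4.0, "enh2": 2.0, "enh3": 1.0}
toy_contact = {"enh1": 0.5, "enh2": 0.5, "enh3": 1.0}
toy_scores = abc_scores(toy_activity, toy_contact)
```

In this toy case `enh1` receives half of the total score because its activity-by-contact product is as large as those of the other two elements combined. In practice a score threshold is then applied to call enhancer–gene connections.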
To identify transcription factor motifs that are responsible for differential accessibility, we utilized the JASPAR2022 (v0.99.7; ref. 67) and monaLisa (v0.63.0; ref. 68) R packages. First, liver HSC peaks with FDR < 0.1 in a Ts21 versus disomy HSC analysis were identified and stratified into separate bins based on the direction of differential accessibility. Next, to account for any size differences between groups, we resized all peaks to the median peak length across all groups. After accounting for differences in peak length, we used the ‘calcBinnedMotifEnrR’ function from monaLisa, which internally corrects for GC content and identifies motifs that are significantly enriched in either Ts21-biased or disomy-biased peaks. The motif enrichment analysis above was repeated using all differentially accessible peaks, only promoters and only enhancers. 　　Next, to assess differential accessibility of promoters, we downloaded a set of curated promoter annotations from the EPDnew database for hg38 (https://epd.expasy.org/epd/human/human_database.php?db=human). We then intersected these annotations with peaks that were differentially accessible using the GenomicRanges R package. Differentially accessible promoters were then tested to determine whether they were more likely to be linked to a significantly DEG (FDR < 0.1) using a Fisher’s exact test. The motif enrichment analysis above was repeated using the EPD promoters. 　　To assess utilization of the AP-1 motif between Ts21 and disomy, we specifically used motifs for AP-1 identified in Isakova et al. (ref. 44). We first identified which motifs belonged to the TPA-response element (TRE) or CRE family by calculating the similarity between motifs as the Pearson correlation between the position frequency matrices using the ‘motifSimilarity’ function from monaLisa. We then identified two distinct clusters of motifs using the hclust() and cutree() R functions with k = 2. These clusters corresponded to the TRE and the CRE motif families. 
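The motif-family assignment described above (Pearson similarity between position frequency matrices, followed by hierarchical clustering cut into two groups) can be sketched in Python as follows. This is an illustration only, not the monaLisa or R code used in the study: it assumes motifs of equal length with no offset search, uses complete linkage on a distance of 1 minus the correlation, and the helper names and toy sequences are hypothetical.

```python
from math import sqrt


def one_hot(seq):
    """Flatten a DNA sequence into a toy position frequency matrix."""
    return [1.0 if base == b else 0.0 for base in seq for b in "ACGT"]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_motifs(pfms, k=2):
    """Agglomerative clustering (complete linkage) cut at k clusters.

    pfms: dict mapping motif name -> flattened PFM of equal length.
    Returns a dict mapping motif name -> cluster index.
    """
    names = list(pfms)
    dist = {(a, b): 1.0 - pearson(pfms[a], pfms[b])
            for a in names for b in names}
    clusters = [[name] for name in names]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: distance of the farthest pair
                d = max(dist[(a, b)]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {name: idx
            for idx, members in enumerate(clusters) for name in members}


# Toy TRE-like and CRE-like sequences (made up, padded to equal length)
toy_pfms = {
    "tre_a": one_hot("ATGACTCA"),
    "tre_b": one_hot("GTGACTCA"),
    "cre_a": one_hot("TGACGTCA"),
    "cre_b": one_hot("TGACGTCC"),
}
families = cluster_motifs(toy_pfms, k=2)
```

The two TRE-like toy motifs end up in one cluster and the two CRE-like motifs in the other, mirroring the two families described in the text.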
All TRE and CRE motifs were then used in the pipeline described above to identify differential usage of AP-1 motifs. Gene Ontology terms were then assessed using enrichR. 　　Finally, we quantified the contribution of each motif to differential accessibility in Ts21 HSCs. We utilized motifmatchr (v1.20.0) to identify ABC enhancers or EPD promoters that contain a significant match to each motif. Then, for each motif, we fit a linear regression model logFC ~ match + peak_type + match:peak_type, where match indicates whether the peak contains a significant match of the motif, peak_type indicates whether the peak is an enhancer or promoter, and logFC represents the log fold change value of the peak. This model allows each motif to have a different effect on differential accessibility in enhancers and in promoters. To quantify the contribution of the motif to overall differential accessibility, we computed the multiple R² from the regression model. 　　Trait-relevant individual cells were calculated using SCAVENGE (ref. 69), which combines network propagation and SNP enrichment analysis to map causal variants to their cellular context. First, variants with posterior inclusion probability (PIP) > 0.001 were downloaded in the format used by Yu et al. (ref. 69); these variants were originally processed and described in Vuckovic et al. (ref. 70). Second, the full multiome ATAC dataset (including all cell populations) was used as input to downstream tasks. A mutual k-nearest neighbour graph was computed to represent the relationship between neighbouring single cells, using k = 30. Next, g-chromVAR (ref. 71) was used to calculate bias-corrected z scores for each tested trait and each single cell to estimate cell-trait relevance. The top 5% ranked cells served as seed cells for the SCAVENGE network propagation, and the propagated scores were further scaled and normalized to calculate the final SCAVENGE trait relevance score (TRS). 　　
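The seed selection and network propagation steps can be sketched as below. This is not SCAVENGE's code: it assumes a generic random walk with restart over an undirected neighbour graph, followed by min-max scaling, and the function names, the restart probability and the toy graph are placeholders chosen for illustration.

```python
from math import ceil


def select_seed_cells(z_scores, top_fraction=0.05):
    """Return the cells whose z score ranks in the top fraction."""
    n_seeds = max(1, ceil(top_fraction * len(z_scores)))
    ranked = sorted(z_scores, key=z_scores.get, reverse=True)
    return set(ranked[:n_seeds])


def propagate(neighbors, seeds, restart=0.05, max_iter=1000, tol=1e-12):
    """Random walk with restart from the seed cells.

    neighbors: dict mapping cell -> list of neighbouring cells
    seeds:     set of seed cells (restart distribution is uniform on them)
    """
    cells = list(neighbors)
    start = {c: (1.0 / len(seeds) if c in seeds else 0.0) for c in cells}
    prob = dict(start)
    for _ in range(max_iter):
        nxt = {c: restart * start[c] for c in cells}
        for c in cells:
            if neighbors[c]:
                # spread the walking mass evenly over the neighbours
                share = (1.0 - restart) * prob[c] / len(neighbors[c])
                for m in neighbors[c]:
                    nxt[m] += share
        delta = sum(abs(nxt[c] - prob[c]) for c in cells)
        prob = nxt
        if delta < tol:
            break
    return prob


def scale_01(scores):
    """Min-max scale scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {c: 0.0 for c in scores}
    return {c: (v - lo) / (hi - lo) for c, v in scores.items()}


# Toy graph: a triangle (a, b, c) plus a separate pair (d, e)
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
             "d": ["e"], "e": ["d"]}
toy_z = {"a": 3.1, "b": 0.4, "c": 0.2, "d": -0.5, "e": -1.0}
toy_seeds = select_seed_cells(toy_z)
toy_trs = scale_01(propagate(toy_graph, toy_seeds))
```

In the toy graph, cells connected to the seed receive intermediate scores, whereas cells in the disconnected pair stay at zero, which is the intended effect of propagating enrichment through the neighbour graph.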
SCAVENGE TRS was contrasted between different lineage branches using a pseudobulk approach similar to the approach that we used for differential expression analysis. On a per-sample-and-branch level, we pseudobulked the SCAVENGE scores by summing SCAVENGE scores across all cells of the same branch and of the same sample. We then applied three linear models to assess the effect of each of the three branches on the SCAVENGE TRS. In each model, we accounted for the number of cells within the pseudobulk as a covariate. We determined significance by multiple test correction at FDR < 0.1 across the 9 analyses (3 branches (1, 2, 3) and 3 traits (RBC, white blood cell and lymphocyte counts)). 　　SCENT (ref. 72) was used to identify peak–gene links in disomic HSCs and Ts21 HSCs, by correlating peak accessibility (binarized) and gene expression (raw) counts. SCENT is a Poisson regression model that recomputes the standard errors of the model coefficients by bootstrapping, which helps to control false-positive rates. We first identified all peak–gene combinations within 500 kb of each other. Separately within disomic and Ts21 HSCs, we retained all peak–gene combinations where the peak was accessible and the gene was expressed in more than 5% of cells, retaining a total of 38,478 peak–gene combinations. Owing to biological differences between disomy and Ts21, the peak–gene sets tested in disomic and Ts21 HSCs were overlapping but not identical. 　　Next, we applied a SCENT model to each peak–gene combination using binarized peak accessibility, percent mitochondrial reads, log(number of UMIs) and sample as the covariates, and expression counts as the dependent variable. On the basis of the SCENT paper, we set up an iterative bootstrapping scheme to balance runtime and P value accuracy based on the Poisson regression model P values, where P > 0.1 used 100 bootstraps, P < 0.1 used 1,000 bootstraps and P < 0.01 used 10,000 bootstraps. 
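The tiered bootstrap budget described above amounts to a simple lookup on the Poisson regression P value. The sketch below uses the thresholds stated in the text; the function name `bootstraps_for_pvalue` is hypothetical, and treating a P value of exactly 0.1 as belonging to the lowest tier is an assumption, since the text does not specify the boundary.

```python
def bootstraps_for_pvalue(p_value):
    """Number of bootstrap replicates to run for one peak-gene pair,
    given the P value from the initial Poisson regression."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < 0.01:
        return 10_000  # most promising pairs get the most replicates
    if p_value < 0.1:
        return 1_000
    return 100  # clearly non-significant pairs get a cheap estimate


# Illustrative (made-up) P values for four peak-gene pairs
toy_pvalues = [0.5, 0.05, 0.001, 0.1]
toy_budget = [bootstraps_for_pvalue(p) for p in toy_pvalues]
```

Spending more replicates only where the initial P value is small keeps runtime manageable while giving finer P value resolution for the pairs most likely to be called significant.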
SCENT was run on a computing cluster in chunks of 100–500 peak–gene sets. We corrected for multiple testing using FDR and called peak–gene links significant at FDR < 0.2. Finally, we repeated the Ts21 analysis after downsampling the 3,784 Ts21 HSCs to match the sample size of 2,431 disomic HSCs to assess the impact of cell population size. 　　Furthermore, we used SCENT to test whether the effect of peak accessibility on gene expression is modified by trisomy. To do so, we included an interaction term between ATAC peak accessibility and Ts21 status in the SCENT model. We applied this interaction model to a combined Ts21 and disomic HSC dataset. The analysis was performed on all significant peak–gene links in the disomic-only or Ts21-only analyses, and significant interaction terms were assessed at FDR < 0.2. 　　One reason why a peak–gene link might be identified only in Ts21 yet have no significant interaction term is a lack of gene expression or peak accessibility in disomic cells. To test whether the peaks and genes of Ts21-only peak–gene links without a significant interaction term were more accessible and more highly expressed in Ts21 HSCs than in disomic HSCs, we used differential accessibility results (from the 10X multiome ATAC) and differential expression results (from the large scRNA-seq analysis). To calculate differential accessibility in the 10X multiome ATAC, we computed pseudobulk profiles and performed limma-voom with trisomy status as the covariate of interest, as similarly described in the large scRNA-seq analysis. We filtered peaks for differential analysis using the filterByExpr() function in edgeR. We performed two separate binomial tests to assess whether the list of (1) Ts21-only peaks or (2) Ts21-only genes were upregulated in Ts21 compared with all other peaks or genes tested in differential analyses. 
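We defined upregulated in Ts21 as nominal P < 0.05 and logFC > 0 in the limma-voom results. 　　We performed two enrichment tests with regard to our Ts21 peak–gene links and RBC GWAS. We defined SCENT peaks and SCENT genes according to three significance thresholds from the SCENT peak–gene analysis: FDR < 0.1, FDR < 0.2 and nominal P < 0.05, and identified GWAS-related peaks and genes using fine-mapped RBC GWAS SNPs at PIP > 0.2. In the first analysis, we assessed the enrichment of fine-mapped RBC GWAS SNPs (PIP > 0.2) in Ts21 SCENT peaks (also referred to as active cis-regulatory regions) using a Fisher’s exact test. Here, we compared whether peaks containing a GWAS variant are more likely to be linked to gene expression than a background set of peaks without GWAS variants, where the background comprises peaks that are accessible (in more than 5% of cells) and near a gene (less than 500 kb away) that is expressed (in more than 5% of cells). In the second analysis, we assessed the enrichment of differential expression among target GWAS genes, as defined by Ts21 SCENT peak–gene links. We defined differential expression using the results of the HSC differential expression analysis in the large scRNA-seq dataset. We subset to significant peak–gene links and computed a 2 × 2 contingency table reflecting (1) whether the peak contains a fine-mapped variant and (2) whether the target gene is differentially expressed. The Fisher’s exact test evaluated the role of important enhancers (those containing fine-mapped GWAS SNPs) in differential expression (through effects on their target genes) in HSCs. 　　In the Results, we report in the text the enrichment P values obtained when using peak–gene links at nominal P < 0.05 (from SCENT). In addition, when reporting peak–gene examples in the Results, we use nominal P values from the SCENT peak–gene tests, differential expression tests or differential accessibility tests. Full summary statistics are included in the Supplementary Tables. 　　Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.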
