成年小鼠大脑中基因调节元件的地图集

　　所有使用活动物的实验程序均由SALK Institute动物护理和使用委员会批准，该协议编号为18-00006。成人C57BL/6J雄性小鼠购自杰克逊实验室。从56-63天大的小鼠中提取大脑，并通过使用来自CNS Muslice LLC的脑切片机，将沿冰冷的解剖培养基中的前后轴沿前后轴的600-μm冠状切片。根据艾伦脑的参考ATLAS28（扩展数据1）和核分离的核区域，如前所述63。从3-31个性别中汇总区域，以获得每个生物复制品的SNATAC-SEQ的足够的核，并为每个解剖区域处理两个生物复制品。　　按照针对自动化的步骤进行了描述，进行了SNATAC-SEQ25,26。可用于库准备的逐步合作：https：//www.protocols.io/view/sequencing-open-open-chromatin-open-single-cell-cell-nuclei-nuclei-nuclei-sn-pjudknw/abstract。　　将脑核用摆式铲斗离心机（500G，5分钟，4°C； 5920r，Eppendorf）固定。将核沉淀重悬于1 ml核核中通透缓冲液中（5％BSA，0.2％igepal-Ca630 ，1 mM DTT和完整分和完整，无EDTA无蛋白酶抑制剂鸡尾酒（Roche）在PBS中，并在PBS中再次进行（500g ，500g，500g，500g ，500 grest）（500g，4 mil，4°c; 5920r ，epp），5920年。将细胞核重悬于500μl高盐标记缓冲液（36.3 mM三乙酸乙酸酯（pH 7.8），72.6 mM钾，乙酸钾，11 mm毫克 - 乙酸乙酸乙酸氢变含量，17.6％二甲基甲酰胺）中，并使用血囊计计数。将浓度调节至每9μl的1,000–4,500核，并将1,000–4,500个核分配到96孔板的每个孔中。为了进行标记，使用96台（Mettler Toledo，RRID：SCR_018093）（补充表26）加入1μl条形码TN5转座体26（补充表26），混合了五次，并在37°C下在37°C下孵化60分钟（500 rpm）。为了抑制TN5反应，将10μl的40 mM EDTA与台式96（Mettler Toledo ，RRID：SCR_018093）一起添加到每个井中，并在37°C下以摇动（500 rpm）在37°C下孵育15分钟。接下来，使用基准96（Mettler Toledo ，RRID：SCR_018093）添加20μL2×排序缓冲液（2％BSA，PBS中2 mM EDTA）。将所有井组合成FACS管中，并用3μMDRAQ7（细胞信号传导）染色（扩展数据图23）。使用SH800（Sony），将20个核分类为八个96孔板（总计768孔，总计30,720个核总计，每样品15,360个核），其中包含10.5μlEB（25 pmol Primer i7 ，25 pmol i7，25 pmol Primer i5，200 ng bsa（sigma），如果要处理两个样品，则使用了两个样品。单独的板核与单个板的标记核合并在一起。。然后，将1μl12.5％Triton X-100添加到每个孔中以淬灭SDS。接下来，添加12.5μl的NEBNEXT高效率2×PCR主混合物（NEB），并通过PCR（72°C 5分钟5分钟，98°C 30 S（98°C 10 S ，63°C 30 S，63°C 30 S，72°C 60 S）×12个赛车放大样品，在12°C下。PCR之后，将所有井合并。根据Minelute PCR纯化套件手册（Qiagen）使用真空歧管（Qiavac 24 Plus，Qiagen）纯化库，并使用Spri珠（Beckmann Coulter ，0.55X和1.5X）进行尺寸选择。图书馆再用spri珠子纯化了一次（贝克曼·库尔特（Beckmann Coulter），1.5倍）。使用量子荧光计（Life Technologies，RRID：SCR_018095）对库进行定量，并使用挂接（高灵敏度D1000，Agilent）验证核小体模式。使用自定义测序引物，25％Spike-In库和以下读取长度：50 + 43 + 37 + 50（Read1 + Index1 + Index2 + Index2 + Read2），在HISEQ2500 Sequencer（RRID：SCR_016383，Illumina）上测序了用索引版本1（补充表1）生成的库（补充表1）。使用具有以下读取长度的自定义测序引物（READ1+INDEX1+INDEX2+READ2），在HISEQ4000（RRID：SCR_016386 ，Illumina）上测序了用索引版本2（补充表1）生成的库（补充表1）。补充表26中的索引引物和测序引物。补充表26中指示了每个文库使用的核索引版本（V1或V2）。　　为了生成SNATAC-SEQ库，我们最初使用了先前描述的索引方案（版本1）22,25。在这里，合并了16个P5和24个P7索引，以生成384个索引用于标记的索引，16 i5以及48 i7索引合并为768 PCR索引的阵列。由于此库设计，需要对所有四个索引进行对所有四个索引进行测序，以将读取为具有长读数的特定核和两个指数读取I和P条形码之间的恒定碱基序列。因此，将所得的库在HISEQ2500（RRID：SCR_016383）上用25％的Spike-In库进行测序，并将这些读取长度进行测序：50 + 43 + 37 + 50（参考文献25）。　　为了生成与其他测序仪兼容的库，而不需要峰值库或自定义测序配方，我们修改了库方案（版本2）。为此，我们使用了384个单独的T7索引，并与一个具有通用索引序列的T5结合使用进行标记（总计384个标记索引）。对于PCR ，我们使用了768个不同的i5索引，并与通用i7引物索引序列结合使用。标记指数为10 bp，PCR指数长12 bp。我们确保每两个条形码之间的锤距≥4 ，GC含量在37.5–62.5％之间，重复≤3的数量。所得的库是在带有自定义引物的Hiseq4000上测序的，这些长度和这些读取长度：50 + 10 + 12 + 50（补充表26）。　　将脑核用摆式铲斗离心机（500G，5分钟，4°C； 5920r，Eppendorf）固定。将核沉淀重悬于1 ml核核化通透缓冲液中（5％BSA，0.2％igepal-Ca630 ，1 mM DTT和完整网，PBS中的无EDTA无EDTA蛋白酶抑制剂鸡尾酒（Roche））。Nuclei were pelleted again (500g, 5 min, 4 °C; 5920R, Eppendorf) and washed with wash buffer (10 mM Tris-HCl (pH 7.5), 10mM NaCl, 3 mM MgCl2, 0.1% Tween-20, and 1% BSA (Proliant 7500804) in molecular biology-grade water).将核颗粒（500G，5分钟，4°C； 5920R，Eppendorf）重悬于30μL的1×核缓冲液（10x基因组学）中。使用血细胞计数计对细胞核进行计数，并使用15,360个核进行标记。使用Chromium单细胞ATAC和凝胶珠套件（10X Genomics ，1000110），铬芯片E单细胞ATAC试剂盒（10x Genomics，1000155）和铬I7多路复用套件N（10X Gentomics ，set A（10X Genomics），STER A STER ATE ATAC，使用制造商指令。使用量子荧光计（Life Technologies）对最终文库进行定量，并使用贴纸（高灵敏度D1000 ，Agilent）验证核小体模式。在NextSeq 500和Novaseq 6000序列（Illumina）上对库进行了测序，其中读取长度：50 + 8 + 16 + 50（read1 + index1 + index1 + index2 + read2）。反复使用后，将index2（单元格索引）转移到读取名称中，以保留相同的FastQ格式以进行下游处理。　　对配对的测序读取被解复，并将单元格索引转移到读取名称中。使用BWA64将测序读取与MM10参考基因组对齐。对齐后，我们使用R软件包ATACSEQQC（1.10.2）65检查片段长度的贡献，这是ATAC-SEQ库的特征。接下来，我们将测序读取与片段相结合，对于每个片段，我们执行以下质量控制：（1）仅保留片段质量分数mapq> 30;（2）仅保留长度正确的配对片段 < 1,000 bp; (3) PCR duplicates were further removed with SnapTools (https://github.com/r3fang/SnapTools, RRID:SCR_018097)26. Reads were sorted on the basis of the cell barcode in the read name. 　　Enrichment of ATAC-seq accessibility at TSSs was used to quantify data quality without the need for a defined peak set. The method for calculating enrichment at TSS was adapted from previously described. TSS positions were obtained from the GENCODE database v16 (RRID:SCR_014966)38. In brief, Tn5 corrected insertions (reads aligned to the positive strand were shifted +4 bp and reads aligned to the negative strand were shifted –5 bp) were aggregated ± 2,000 bp relative (TSS strand-corrected) to each unique TSS genome-wide. Then this profile was normalized to the mean accessibility ± 1,900–2,000 bp from the TSS and smoothed every 11 bp. The max of the smoothed profile was taken as the TSS enrichment. 　　We used a modified version of Scrublet (RRID:SCR_018098)66 to remove potential doublets for every dataset independently. Peaks were called using MACS2 scores for aggregate accessibility profiles on each sample. Next, cell-by-peak count matrices were calculated and used as input, with default parameters. Doublet scores were calculated for both observed nuclei {xi} and simulated doublets {yi} using Scrublet (RRID:SCR_018098)66. Next, a threshold θ is selected based on the distribution of {yi}, and observed nuclei with doublet score larger than θ are predicted as doublets. To determine θ, we fit a two-component mixture distribution by using function normalmixEM from R package mixtools67. The lower component contained most embedded doublet types, and the other component contained majority of neo-typic doublets (collision between nuclei from different clusters. We selected the threshold θ where the in which p denotes probability; pdf denotes probability density function. This value suggested that the nuclei have same chance of belonging to both classes. 　　We used an iterative clustering strategy using the snapATAC26 package (RRID:SCR_018097) with slight modifications as detailed below. For round 1 clustering, we clustered and finally merged single nuclei to three main cell classes: non-neurons, GABAergic neurons, and glutamatergic neurons. For each main cell class, we performed another round of clustering to identify major cell subclasses. Last, for each subclass, we performed a third round of clustering to find cell types. 　　Detailed description for every step is as follows. (1) Nuclei filtering. Nuclei with ≥1,000 uniquely mapped fragments and TSS enrichment > 将10个用于单个数据集过滤。其次，还删除了单个数据集的潜在条形码碰撞。（2）特征垃圾箱选择。首先，我们独立并随后合并了矩阵的每个数据集以5 kb分辨率计算一个单元格矩阵。其次，我们将逐键计数矩阵转换为二进制矩阵。第三，我们滤除了与编码黑列表重叠的所有垃圾箱（MM10，http://mitra.stanford.edu/kundaje/kundaje/akundaje/release/release/blacklists/mms10-mouse/mm10-mouse/mm10.blacklist.bed.gz）。第四，我们专注于1-19 ，X和Y染色体上的垃圾箱。最后，我们删除了Count Matrix的最高读取覆盖范围的前5％bin 。（3）减少维度。Snapatac应用了称为扩散图的非线性尺寸还原方法，该方法对噪声和扰动非常强大26。但是，扩散算法的计算时间随着细胞数量的增加而成倍地缩放。为了克服这一限制，我们将NyStröm方法（一种采样技术）和扩散图组合在一起，以呈现NyStröm地标扩散图，以生成大型数据集的低维嵌入。　　Nyström地标扩散地图算法包括三个主要步骤：　　（1）采样。我们以“地标 ”为“ n个总细胞”中的K（K n）细胞的子集。　　（2）嵌入。我们计算了K地标的扩散图嵌入。　　（3）扩展。我们将剩余的（N - K）细胞投射到低维嵌入到从地标中学到的低维嵌入，以为所有细胞创建关节嵌入空间。我们一开始就拥有超过800,000个单核，我们决定在第1轮和第2轮聚类上应用此策略。将共有10,000个细胞作为地标采样，其余的查询单元被投影到嵌入地标的扩散图上。后来，对于第3轮聚类，从所有核直接计算了扩散图嵌入。　　（4）特征向量选择。为了确定扩散操作员的特征向量的数量，包括用于下游分析，我们生成了一个“肘图”，以根据每个差异的百分比对所有特征向量进行排名。对于每一轮聚类，我们选择了捕获大多数方差的前10-20个特征向量。　　（5）基于图的聚类。使用选定的重要特征向量，我们接下来构建一个k-neartient邻居图。每个细胞是一个节点，每个单元的k-neart邻居根据欧几里得距离识别，并在图中的邻居之间绘制边缘。接下来，我们使用Python软件包Leidenalg（https://github.com/vtraag/leidenalg)69应用了K-Nearest邻居图上的Leiden算法。　　（6）对群集分辨率的优化。我们测试了不同的“ Resolution_Parameter ”参数（在0到1 x 0.1之间的步骤），以确定不同细胞群体的最佳分辨率。对于每个分辨率值，我们测试了核之间是否存在明显的分离。为此，我们生成了一个逐个细胞共有矩阵，其中每个元素代表观测值的比例两个核是同一群集的一部分。一个完美稳定的矩阵将完全由零和一个组成，这意味着两个核要么在每次迭代中都聚集在一起，要么不聚集在一起。共有矩阵的相对稳定性可用于推断最佳分辨率。为此，我们基于300发莱顿聚类和随机起始种子s生成了共有矩阵。令MS表示由使用不同种子应用于数据集DS的N×N连接矩阵。MS的条目定义如下：　　如果是Nucleus I和J在相同的扰动数据集DS中，则（i ，j）Then的（i，j）Then等于1，而为0。然后，将共有矩阵C定义为所有扰动DS的所有连接矩阵的归一量总和。　　共有矩阵中的条目（i ，j）是将单核I和j聚集在一起除以它们一起选择的总数的次数。矩阵是对称的，每个元素均在范围内定义。我们检查了累积分布函数（CDF）曲线，并计算出模棱两可的聚类（PAC）得分的比例，以量化每种分辨率的稳定性。局部最小PAC分数的分辨率表示最佳簇的参数。在这些情况下，这些是多个局部最小PAC，我们选择了具有较高分辨率的pac 。另一个测量值是分散系数，它反映了0.5值的共有矩阵M的色散（范围为0到1）。接近1的是分散系数，更完美的是共识矩阵，因此群是聚类的越稳定。在完美的共识矩阵中，所有条目均为0或1，这意味着所有连接矩阵都是相同的。色散系数定义为：　　最后，对于每个群集，我们测试了使用“ Finddar”函数与所有其他核（背景）和最近的核（本地背景）相比，我们是否可以识别差异特征。　　（7）可视化。为了可视化，我们应用了UMAP58 。使用细胞嵌入，我们同时应用了k-nearthign邻居批处理效应测试（KBET）和局部辛普森局部索引（LISI）分析，以测试浮标结果的鲁棒性，以使测序深度，信噪比的变化，信噪比和批次的变化。　　我们进行了分离的细胞聚类，以遵循上述两个谱系的策略：　　1。SGZ中的成年神经发生：我们从SGZ或周围的8个脑解剖学中提取了83,583个核，包括Ca-1 ，Ca-2，Ca-3，Ca-3 ，Ca-4，Ca-4，DG-1 ，DG-1，DG-2，DG-3，DG-3 ，DG-3，DG-4（补充表1）。然后，我们对6种细胞类型进行了83,583个核上的细胞聚类：星形胶质细胞（ASC）；齿状回径向神经胶质样细胞；nipcs;颗粒神经细胞（DGNBL1/2）;和颗粒神经元。　　2。在SVZ中的成年神经发生：我们从SVZ或周围的8个脑部解剖中提取了25,923个核，包括MOB，AON，ACB-1 ，ACB-1，ACB-2 CP-1，CP-1 ，CP-2，CP-2，LSX-1和LSX-2（补充表1）。然后，我们对5种细胞类型进行了25,923个核上的细胞聚类：星形胶质细胞；脑室下radial胶质样细胞；神经元中间祖细胞；神经细胞（obnbl）；和嗅觉中的抑制神经元（OBGA1）。　　首先，我们为CCRE计算了每个集群中位可访问性，并将此值用作聚类质心。接下来，我们计算了主要细胞类型的每个元素的簇质心的变体系数。最后，我们仅保留了可变元素，其变体系数大于1.5，用于树状图构造。　　我们使用上面定义的一组变量特征来计算基于相关的距离矩阵。接下来，我们使用r packClust（v.2.0）70使用参数method.dist =“ cor”和method.hclust =“ ward.d2”执行了链接层次聚类。通过Bootstrap重新采样方法估计树的每个分支的信心，并以1,000发子弹估算。　　特异性得分定义为Jensen -Shannon差异，该分差衡量了两个概率分布之间的相似性。对于每种细胞类型，首先计算出不同大脑区域的贡献。然后，我们将这种分布与所有采样细胞计算出的大脑区域的贡献进行了比较。我们将R pafferentropy的功能“ JSD ”用于此分析71。　　我们根据ENCODE ATAC-SEQ管道（https://www.encodeproject.org/atac-seq/）进行了峰值调用。对于每个细胞群，我们将所有正确配对的读数组合在一起，以生成一个伪膨化的ATAC-SEQ数据集，以进行单个生物学重复。此外，我们产生了两个伪复制物，其中构成了每种生物学重复的一半读数。我们为四个数据集中的每个数据集和两个池命名了峰值。使用这些参数的Macs2 Score30在TN5校正的单基础插入上执行峰值调用：–Shift -75- extsize 150 – Nomodel – call-call-summits-summits – spmr -q 0.01。最后，我们将两侧的峰顶峰值扩大到250 bp，以合并和下游分析的最终宽度为501 bp 。为了生成可再现峰的列表，我们保持了峰值的峰值，即在合并的数据集中检测到（1），并且峰值长度的≥50％，两个单个重复的峰值；或（2）在汇总的数据集中检测到峰值长度的≥50％，两种伪恢复峰的峰值。　　我们发现，当细胞群在读取深度或核的数量上变化时，由于MACS2 Scores30中泊松分布测试的性质，MACS2评分差异成正比。理想情况下，我们将进行峰值的读数归一化，但是实际上，这种类型的归一化是不可能的，因为我们不知道会得到多少峰。为了根据单个群集中的读取深度和/或核数量来解决MacS2分数的性能差异，我们将MACS2峰得分（-LOG10（Q-VALUE））转换为“百万分数” 72。我们通过选择2分的分数为2分，以滤除可重复的峰，从而过滤了可再现的峰。　　我们仅在染色体1-19和两个性染色体上保持可再现的峰值，并过滤编码的MM10黑名单区域（MM10，http://mitra.stanford.edu/kundaje/akundaje/akundaje/akundaje/release/release/release/release/blacklists/mmblacklists/mmms10-10-mouse/mmmouse/mmm glmm10.blacklist.bed..bed.bed）。通过使用BedTool（RRID：SCR_006646）73合并所有单元格的峰值集获得了整个数据集的联合峰列表。　　最后，由于SNATAC-SEQ数据非常稀疏，因此我们仅选择了在每个群集中很大一部分细胞中被识别为开放染色质的元素。为此，我们首先从基因组中随机选择相同数量的非DHS区域（约670,000个元素）作为背景，并计算了每种细胞类型的核的分数，这些细胞类型在这些位点显示了信号。接下来，我们拟合了零充气的beta模型，并从经验上确定了FDR的显着性阈值 < 0.01 to filter potential false positive peaks. Peak regions with FDR < 0.01 in at least one of the clusters were included in downstream analysis. 　　Accessibility of cCREs in individual clusters was quantified by counting the fragments in individual clusters normalized by read depth (CPM). For each gene, we summed up the counts within the gene body +2 kb upstream to calculate the gene activity score. The gene activity score were used for integrative analysis with scRNA-seq. For better visualization, we smoothed the gene activity score to 50 nearest neighbour nuclei using Markov affinity-based graph imputation of cells (MAGIC)74. 　　For integrative analysis, we downloaded level 5 clustering data from the Mouse Brain Atlas website (http://mousebrain.org)2. First, we filtered brain regions that matched samples profiled in this study using these attributes for ‘region’: ‘CNS’, ‘cortex’, ‘hippocampus’, ‘hippocampus,cortex’, ‘olfactory bulb’, ‘striatum dorsal’, ‘striatum ventral’, ‘dentate gyrus’, ‘striatum dorsal, striatum ventral’, ‘striatum dorsal, striatum ventral, dentate gyrus’, ‘pallidum’, ‘striatum dorsal, striatum ventral, amygdala’, ‘striatum dorsal, striatum ventral’, ‘telencephalon’, ‘brain’ and ‘sub ventricular zone, dentate gyrus’. 　　Second, we manually subset cell types into three groups by checking the attribute in ‘taxonomy_group’: non-neurons: ‘vascular and leptomeningeal cells’, ‘astrocytes’, ‘oligodendrocytes’, ‘ependymal cells’, ‘microglia’, ‘oligodendrocyte precursor cells’, ‘olfactory ensheathing cells’, ‘pericytes’, ‘vascular smooth muscle cells’, ‘perivascular macrophages’, ‘dentate gyrus radial glia-like cells’, ‘subventricular zone radial glia-like cells’, ‘vascular smooth muscle cells’, ‘vascular endothelial cells’, ‘vascular and leptomeningeal cells’; gabaergic neurons: ‘non-glutamatergic neuroblasts’, ‘telencephalon projecting inhibitory neurons’, ‘olfactory inhibitory neurons’, ‘glutamatergic neuroblasts’, ‘cholinergic and monoaminergic neurons’, ‘di- and mesencephalon inhibitory neurons’, ‘telencephalon inhibitory interneurons’, ‘peptidergic neurons’; glutamatergic neurons: ‘dentate gyrus granule neurons’, ‘di- and mesencephalon excitatory neurons’, ‘telencephalon projecting excitatory neurons’. 　　To directly compare our single-nucleus chromatin accessibility-derived cell clusters with the single-cell transcriptomics defined taxonomy of the mouse brain2, we first used the snATAC-seq data to impute RNA expression levels (gene activity scores) according to the chromatin accessibility of gene promoter and gene body as previously described75. We performed integrative analysis with scRNA-seq using Seurat 3.0 (RRID:SCR_016341) to compare cell annotation between different modalities75. We randomly selected 200 nuclei (and used all nuclei for cell cluster with fewer than 200 nuclei) from each cell cluster for integrative analysis. We first generated a Seurat object in R by using previously calculated gene activity scores, diffusion map embeddings and cell cluster labels from snATAC-seq. Then, variable genes were identified from scRNA-seq and used for identifying anchors between these two modalities. Finally, to visualize all the cells together, we co-embedded the scRNA-seq and snATAC-seq profiles in the same low dimensional space. 　　To quantify the similarity between cell clusters from two modalities, we calculated an overlapping score as the sum of the minimum proportion of cells or nuclei in each cluster that overlapped within each co-embedding cluster5. Cluster overlaps varied from 0 to 1 and were visualized as a heat map with snATAC-seq clusters in rows and scRNA-seq clusters in columns. We found strong correspondence between the two modalities, which was evidenced by co-embedding of both transcriptomic (T-type) and chromatin accessibility (A-type) cells in the same RNA–ATAC joint clusters (Extended Data Fig. 10a, Supplementary Table 5). For this analysis, we examined GABAergic neurons, glutamatergic neurons and non-neuronal cell classes separately (Extended Data Fig. 10, Supplementary Table 5). 　　We used non-negative matrix factorization76 to group cCREs into cis-regulatory modules on the basis of their relative accessibility across major clusters. We adapted non-negative matrix factorization (Python package: sklearn77) to decompose the cell-by-cCRE matrix V (N × M, N rows: cCRE, M columns: cell clusters) into a coefficient matrix H (R × M, R rows: number of modules) and a basis matrix W (N × R), with a given rank R: 　　The basis matrix defines module related accessible cCREs, and the coefficient matrix defines the cell cluster components and their weights in each module. The key issue to decompose the occupancy profile matrix was to find a reasonable value for the rank R (that is, the number of modules). Several criteria have been proposed to decide whether a given rank R decomposes the occupancy profile matrix into meaningful clusters. Here we applied two measurements ‘sparseness’78 and ‘entropy’79 to evaluate the clustering result. Average values were calculated from 100 times for non-negative matrix factorization runs at each given rank with random seed, which will ensure the measurements are stable. 　　Next, we used the coefficient matrix to associate modules with distinct cell clusters. In the coefficient matrix, each row represents a module and each column represents a cell cluster. The values in the matrix indicate the weights of clusters in their corresponding module. The coefficient matrix was then scaled by column (cluster) from 0 to 1. Subsequently, we used a coefficient > 0.1 (approximately the 95th percentile of the whole matrix) as threshold to associate a cluster with a module. 　　In addition, we associated each module with accessible elements using the basis matrix. For each element and each module, we derived a basis coefficient score, which represents the accessible signal contributed by all cluster in the defined module. In addition, we also implemented and calculated a basis-specificity score called ‘feature score’ for each accessible element using the ‘kim’ method79. The feature score ranges from 0 to 1. A high feature score means that a distinct element is specifically associated with a specific module. Only features that fulfil both of the following criteria were retained as module specific elements: (1) feature score greater than median + 3 standard deviations; (2) the maximum contribution to a basis component is greater than the median of all contributions (that is, of all elements of W). 　　To identify cCREs that were differentially accessible either in subtypes or in brain regions, we constructed a logistic regression model predicting cluster/region membership based on each cCRE individually and compares this to a null model with a likelihood ratio test. We used two functions ‘fit_models’ and ‘compare_models’ in R package Monocle3 (v.0.2.2)80 to perform the differential test. We designed the full model as 　　and a reduced mode as 　　in which Pij represents the probability of ith site is accessible in the jth cell, a is the log10-transformed total number of sites observed as accessible for the jth cell, m is membership of the jth cell in either cluster or region being tested, r is the replicate label for jth cell and ε is an error term. 　　For each set of testing, between subtypes or between regions in cell type, we kept only cCREs that overlapped with peaks identified in corresponding cell types. A likelihood ratio test was then used to determine whether the full model (including cell cluster or region membership) provided a significantly better fit of the data than the reduced model. After correcting P values using Benjamini–Hochberg method, we set an FDR cut-off as 0.001 to filter out significant differential cCREs. 　　The log2-transfomed fold change is used for two-group comparison, for multiple groups, we calculated a Jensen–Shannon divergence-based specificity score previously described22 to better assign differential cCREs to cell type or brain region. The fraction of accessibility of each cluster f was first calculated for each ith site. We normalized these scores by multiplying by corresponding scaling factors, which are considering different overall complexity across groups. To do so, median number of sites accessible c in individual cells for each cluster was calculated and followed with log10-transforming. Then, we took the ratio of the average of c (across all clusters) over the median accessibility in each cluster as scaling factor. These corrected fraction of accessibility for each cCRE was then converted to probability by scaling by groups. Then, we calculated Jensen–Shannon divergence between two probability distributions. For example, the probability distribution for the first cCRE as d1, to test whether this cCRE is specific in group 1, we assumed another probability distribution: 　　Function ‘JSD’ in R package ‘philentropy’ was used to calculate Jensen–Shannon divergence between these two probability distributions, and Jensen–Shannon-based specificity (JSS) scores was defined as: 　　For each group, we calculated the JSS score for every cCRE. To find a reasonable cut-off to determine restricted or general cCREs, we consider JSS scores from all cCREs that are not identified as differential accessible (from likelihood ratio test) as a background distribution, and JSS scores from cCREs that passed our likelihood ratio test threshold and had positive values to be true positives. We set an empirical FDR cut-off where the type I error was no more than 5%. 　　Finally, the differential cCREs could be aligned to several cell types or brain regions based on the JSS score, we named the one can be assigned to only one type or region as region-specific or cell-type-specific cCREs. 　　To validate the regional specificity of cell types, we took advantage of the spatially mapped quantified ISH expression from ABA44 in five matched major brain structures, isocortex, olfactory areas, hippocampal formation (HPF), striatum (STR), pallidum. We used the ‘differential search’ function to identify 10,269 genes with increased expression in these five brain regions compared to all brain regions ‘grey matter’ (expression level >1和折叠更改> 1）。我们还使用SEURAT75鉴定了细胞类型的基因，该基因从集成分析（折叠变化> 1和FDR <0.05）中为每个关节群集默认参数鉴定了（扩展数据）。505个细胞型特异性基因（从1到53，平均为15个）与五个大脑区域表达增加的基因列表重叠。对于每种细胞类型，我们根据SNATAC-SEQ数据集估计的五个大脑区域的相对贡献，以及基于平均归一化的ISH信号的细胞类型特异性标记基因基于的平均归一化ISH信号，计算了区域特异性评分（请参见上一节：细胞类型的区域特异性）。对于每个细胞类型特异性基因，我们计算了五个大脑结构中细胞组成与从ISH得出的五个脑结构中的空间表达水平之间的PCC。　　由于我们研究中鉴定的星形胶质细胞亚型在SCRNA-SEQ研究中未解决，因此我们使用snatac-Seq使用染色质可及性，使用了可能的比率检验，确定了体形胶质细胞亚型的亚型特异性基因。通过FDR检验的可能性比测试和经验FDR临界值不超过5％，通过FDR过滤细胞型特异性基因。然后，我们计算了来自不同大脑结构的空间映射ISH基因和具有星形胶质细胞亚型特异性可及性的基因之间的重叠的比例。　　首先，使用CICERO37分别鉴定出每个细胞簇中的所有开放区域（随机选择的200个核）分别鉴定出所有细胞簇中的所有开放区域（随机选择的200个核），使用CICERO37与以下参数分别鉴定出所有核作为细胞簇）：聚集k = 10，窗口尺寸= 500 kb，距离约束= 250 kb。为了找到每个群集的最佳共同访问阈值，我们生成了一个随机洗牌的ccre by-by-cel-becl矩阵作为背景，并从此洗牌矩阵中鉴定出可共访问区域。我们通过使用R Package FitDistrplus81将共同访问性得分的分布拟合到正态分布模型中。接下来，我们测试了所有共同访问对，并将截止性设置为共同访问得分，其经验定义的显着性阈值为FDR <0.01。　　CCRE在Gencode MM10中的TSS±1 kb之外（v.16，rrid：scr_014966）38 。被认为是远端。接下来，我们分配了三个组的共接触性对：近端到螺旋，远端到距离和远端到远端。在这项研究中，我们仅关注远端到良好的对。我们进一步使用了来自匹配的T型的RNA表达来过滤与非表达基因相关的对（标准化的UMI <5）。　　我们计算了跨关节RNA-ATAC簇的基因表达和CCRE可及性之间的PCC，以检查共访问对之间的关系。为此，我们首先为每个关节群集从SCRNA-SEQ和SNATAC-SEQ中汇总了所有核或细胞，以计算可访问性得分（LOG2（CPM））和相对表达水平（Log2（Log2（归一化UMI）））。然后，为每个基因以TSS为中心的1-MB窗口内的每个CCRE – GEN对计算PCC。我们还通过随机从不同染色体和群集标签改组的区域来生成一组背景对。最后，我们拟合了正态分布模型，并以PCC分数定义了一个截止模型，其经验定义为FDR <0.01的显着性阈值，以选择显着的正相关的CCRE – Gene对。　　我们使用Taiji Pipeline45来识别细胞簇中的候选驱动器转录因子。简而言之，对于每个细胞类型簇，我们通过在可访问的染色质区域扫描转录因子基序并将其链接到最近的基因来构建转录因子调节网络。该网络针对从转录因子到靶基因的边缘。网络中基因的权重根据RNA表达水平（仅SGZ NIPC的基因活性评分，因为没有相应的T型）的相应T型型确定。边缘的权重通过源转录因子的启动子的相对可访问性计算。然后，我们使用个性化的Pagerank算法对网络中的转录因子进行排名。太极管道的输出是带有pagerank分数的逐个转录型矩阵。从输出矩阵中，我们计算了跨单元类型的变异系数。为了识别驾驶员转录因子，我们使用了以下标准：（1）转录因子的FDR小于0.001；（2）转录因子具有大于变体系数的变体系数；（3）Pagerank评分应在每种细胞类型的所有转录因子的前25％中排名；（4）RNA表达水平（CPM）大于20，我们认为这是相应细胞类型中表达的转录因子。　　我们使用荷马（v.4.11，rrid：scr_010881）61进行了从头开始和已知的基元富集分析。对于共识列表中的CCR，我们扫描了元素中心周围±250 bp的区域。对于近端或启动子区域，我们在TSS周围扫描了±1,000 bp的区域。随机选择的背景区域用于图案发现。为了识别富含不同细胞类型或大脑区域的基序，我们将可变CCR作为输入和不变的CCR作为背景。　　使用Great（版本4.0.4，RRID：SCR_005807）82具有默认参数的基因本体论注释。基因本体论生物学过程用于注释。　　我们使用R套件富集（RRID：SCR_001575）83进行基因本体学富集分析。基因集库“ go_biological_process_2018”与默认参数一起使用。合并分数定义为使用Fisher的精确测试计算的P值乘以与预期等级的偏差的Z分数。　　为了与人类表型的GWASS进行比较，我们将举重与设置“ -minmatch = 0.5 ”一起使用，将可访问元件从MM10转换为HG19基因组坐标51 。接下来，我们将元素恢复到MM10 ，只保留映射到原始基因座的区域。我们进一步删除了长度大于1 kb的转换区域。　　We obtained GWAS summary statistics for quantitative traits related to neurological disease and control traits (Supplementary Table 25): age first birth and number of children born84, tiredness85, Crohns disease86, attention deficit hyperactivity disorder87, allergy88, birth weight89, bipolar disorder90, insomnia91, sleep duration92, neuroticism93, coronary artery disease94,类风湿关节炎95，教育疗法96，精神分裂症97 ，初潮年龄98，烟草使用障碍（ftp://share.sph.umich.edu/ukbb_saige_hrc/，表型：318）99 ，Intelligence100，amyotrophics sclerssipy1111111111111111111111年。　　我们为连锁不平衡得分回归的标准格式准备了摘要统计。我们使用每种主要细胞类型的同源序列作为二进制注释，所有候选调节峰的超集作为背景控制。　　对于每个性状，我们都使用细胞类型的特定连锁不平衡得分回归（https://github.com/bulik/ldsc）来估计每个注释与背景对照共同的注释的富集系数52 。　　我们从精神病基因组协会网站（https://www.med.unc.edu/pgc/）获得了精神分裂症的99％可信度。与CCR重叠的关联分数后验概率的潜在因果变异。　　相交分析区域所用的数据集如下：HG19和MM10的代表性DNase高敏位点区域均从屏幕数据库（https://screen.encodeproject.org）104,105获得。从GitHub下载了Chromhmm 31,33个状态的鼠标脑状态（https://github.com/gireeshkbogu/gireeshkbogu/chromatin_states_states_chromhmm_mm9），并使用升降机绘制Chromhmm状态的坐标（https:///genome.ucsc.ucsc.edu/cgi-bgi/cgi-binfiftimparife tomaintiaptional painsim.ucsc.eedu/cgi-binfiftairptairpartimentiaptiar toomaint 5从UCSC基因组浏览器（http://hgdownload.cse.ucc.edu/golden/goldenpath/mm10/phastcons60way/）下载了PHASTCONS59保守元素。从鼠标Encode Project31（http://chromosome.sdsc.edu/mouse/）下载了CTCF结合网站。这项研究使用了来自皮质和嗅球的CTCF结合位点。将峰从峰顶的基因座扩展到±500 bp，并使用升降机映射到MM1051 。　　没有使用统计方法来预先确定样本量。没有样品随机分配，研究人员并未对正在研究的标本视而不见。但是，根据染色质的可及性，以公正的方式进行单核的聚类，并在聚类后分配细胞类型。如上所述，将低质量的核和潜在条形码碰撞排除在下游分析中。　　有关研究设计的更多信息可在与本文有关的自然研究报告摘要中获得。

本文来自作者[admin]投稿，不代表象功馆立场，如若转载，请注明出处：https://wap.xianggongguan.cn/kexue/202506-467.html