没有使用统计方法来预先确定样本量。实验是完全随机的,研究人员在进行实验或测序时不知道基因型标识符 。
为了形成多样性面板 ,从2010年到2018年收集了来自天然和普通花园来源的种子,根茎和克隆繁殖体。从种子生长的植物遵循标准生长程序16。简而言之,在9厘米的正方形盆中播种了10-15种种子 ,其中包含Promix Bx盆栽土壤(主要技术园艺)和Turface MVP钙化粘土(Turface Athletics)的混合物,并在4°C下春化7天 。然后将花盆放在14小时长的温室和30-°C/22°C白天/夜间温度中。在3叶阶段将幼苗变细至每锅1植物,并允许生长直到5-Tiller阶段。将根茎的繁殖体和5矿幼苗转移到含有细磨碎的松树皮覆盖物(孤星覆盖物)和时间释放肥料(渗透量14-14-14 ,Scottsmiraclegro)的5加仑锅中 。从2016年到2018年,克隆分公司在奥斯汀繁殖了所有单独的植物,每个独特的登录目标> 10个克隆。Cleary 3336F系统性杀菌剂(Cleary Chemicals)根据需要控制真菌病原体的必要条件。将植物放入1加仑的锅中进行最终传播 。
在现场地点种植发生在2018年5月15日至7月10日 ,随后发布了先前发表的方法16。简而言之,将植物乘卡车运到每个地点,每个田地都覆盖着一层Dewitt杂草布。将植物置于被切成杂草布的孔中 ,进入蜂窝设计,每种植物都有四个最近的邻居,它们彼此彼此1.56 m 。为了防止边缘效应,在每个边缘位置都种植了低地Blackwell品种。移植后 ,手工饮用植物。所有植物的地上部分都在2018 - 2019年冬季停留,并于2019年春季在春季耕种之前的出现之前拆除。在2019赛季结束时,植物被直立地绑定为一堆 ,并用镰刀杆割草机收获 。
我们为2019年生长季节产生了两种适应性措施:对数转化的生物质(kg)和冬季生存的比例(补充数据8)。生物量数据是从2019年10月和11月收获期间从所有活着的人那里获得的。估计质量的植物 <750 g were placed in paper bags and dried whole at 60 °C until no additional moisture loss occurred, then weighed for total dry biomass. Plants with an estimated mass >750 g在田间称重,以悬挂量表,分辨率为±5克 。为了确定这些植物的生物量 ,从每种植物中取样了大约500 g的全耕作品,称重,干燥 ,并重新升高。然后将整个植物样品的湿生物质乘以子样本中的%水分,以近似总干生物量。在2018 - 2019年冬季,植物被认为经历了冬季死亡率 ,当时在2019年6月1日之前没有从植物冠中看到新的生长 。从实验中切除了死植物冠,并于2019年7月或9月被Blackwell品种的植物取代。
我们使用整个基因组shot弹枪测序策略和标准测序方案在能源联合基因组研究所和Hudsonalpha生物技术研究所对Alamo Switchgrass基因型AP13进行了测序。该基因组是从4,520,785个PACBIO读取(121.66×原始序列覆盖范围,从总计59个P6C4 2.0和2.1化学细胞的原始序列覆盖范围,具有10-H电影时间的化学细胞 ,P-Read收益率为91.76 GB)(使用Mecat Assopsbler52和Arrow Polilyer 53 。最终的基因组抛光和误差校正是用一个400 bp插入物2×150 bp Illumina Hiseq片段(177.1×)进行的。读取> 95%的简单序列重复和读取 <50 bp after trimming for adaptor and quality (q < 20, 5-bp window average) were removed. The final read set consisted of 1,259,053,614 reads for a total of 168× coverage of high-quality Illumina bases. This produced an initial diploid assembly of 6,600 scaffolds (6,600 contigs), with a contig N50 of 1.1 Mb, 3,489 scaffolds larger than 100 kb and a total 2C (diploid) genome size of 2,013.4 Mb.
Assembling a haploid genome in an outbred individual, such as AP13, will generally yield both haploid copies in heterozygous regions, necessitating computational steps to represent each chromosome as a single-copy haplotype without duplicate copies being unnecessarily repeated. Our initial assembly was approximately double the expected haploid (1C) genome size of 1.2 Gb. Therefore, to detect putative meiotically homologous haplotypes, we identified and counted shared 24-mers that occurred exactly twice in the assembly and binned contigs accordingly. A total of 3,152 shorter and redundant alternative haplotypes and 2,387 overlapping contig ends were identified, comprising a total sequence of 871.2 Mb. The remaining 1,142.2 Mb of sequence was ordered and oriented into 18 chromosomes by aligning genetic markers from 2 available maps (Supplementary Data 1) to the MECAT assembly; 563 joins and 57 breaks were made, with 10,000 Ns representing the unsized gap sequence. Overall, 97.2% of the assembled sequence was contained in the chromosomes. Telomeric sequence was identified using the (TTTAGGG)n repeat and properly oriented. The remaining scaffolds were screened against GenBank bacterial proteins and organelle sequences and removed if found to match these sequences. To resolve minor overlapping regions on contig ends, adjacent contig ends were aligned to one another using BLAT54; a total of 47 adjacent duplicate contig pairs were collapsed.
We conducted two rounds of error correction. First, we corrected homozygous SNPs and insertions and/or deletions (indels) by aligning the Illumina 2 × 150 bp library to the release consensus sequence using bwa mem55 and identifying homozygous SNPs and indels with the UnifiedGenotyper tool of GATK56. A total of 690 homozygous SNPs and 80,199 homozygous indels were corrected in the release. Second, we computationally finished 11,343 assembled contigs sequenced from BAC clones with a combination of ABI 3730XL capillary sequencers57 and single index Illumina clone pools and aligned this set of switchgrass clones to the SNP-fixed genome to find heterozygous SNPs that were out of phase with their neighbours. To resolve these phase-switched alleles, the full set of the raw PacBio reads was aligned to the assembly. For each read, the phase of each heterozygous site was determined and 62,732 out-of-phase heterozygous sites were corrected.
To distinguish the N and K subgenomes, we used a de novo repeat-clustering method and validated this with phylogenetic distances to a related species. We searched for ‘diagnostic’ 15-mers via Jellyfish58 in LTR regions of Gypsy, Copia and Pao insertions (identified by RepeatMasker59 and LTRHarvest60) that distinguished each set of homologous chromosomes (≤1 hit in one homologue and ≥100 in the other). The LTR sequences that shared common 15-mers were grouped as superfamilies and were aligned within each superfamily by BLAST. Superfamily members with significant BLAST hits (e < 0.01, ≥90% length) were assigned into families and aligned by Mafft61. Jukes–Cantor distances between LTR families were computed by the R ape package62, and clustered into two distinct sets of subgenomes. Clustering was identical between LTRs and alignments to P. rudgei (K.M.D. and E. Kellogg, unpublished data), which is an ancient relative of the K subgenome17, giving high confidence that we have effectively assigned all chromosomes to the correct subgenomes. Finally, we assigned chromosome identifiers and oriented each chromosome pseudomolecule via synteny with Setaria italica63. The final haploid version 5.0 release contained 1,125.2 Mb of sequence, consisting of 626 contigs with a contig N50 of 5.5 Mb and a total of 97.2% of assembled bases in chromosomes.
Transcript assemblies were made from about 2 billion pairs of 2 × 150-bp stranded paired-end Illumina RNA-seq reads, about 1 billion pairs of 2 × 100-bp paired-end Illumina RNA-seq reads and 454 reads (Supplementary Data 3) using PERTRAN (details of which have previously been published64). In brief, PERTRAN conducts genome-guided transcriptome short-read assembly via GSNAP65 and builds splice alignment graphs after alignment validation, realignment and correction. In total, around 4.5 million PacBio Iso-Seq circular consensus sequences66 were corrected and collapsed, resulting in approximately 677,000 putative full-length transcript assemblies. Subsequently, 668,176 transcript assemblies were constructed using PASA67 from RNA-seq reads, full-length cDNA, Sanger expressed sequence tags, and corrected and collapsed PacBio circular consensus sequence reads. Loci were determined by EXONERATE68 alignments of switchgrass transcript assemblies and proteins from Arabidopsis thaliana69, soybean70, Kitaake rice71, Setaria viridis72, P. hallii var. hallii64, Sorghum bicolor73, Brachypodium distachyon74, grape and Swiss-Prot75 proteomes. These alignments were accomplished against a repeat-soft-masked switchgrass genome using RepeatMasker59 (repeat library from RepeatModeler76 and RepBase77) with up to 2,000-bp extension on both ends unless extending into another locus on the same strand. Incomplete gene models, which had low homology support without full transcriptome support, or short single exon genes (<300-bp coding DNA sequences (CDS)) without protein domain or good expression were removed.
Syntenic orthologues and paralogues were inferred for the two switchgrass subgenomes via the GENESPACE pipeline64, using default parameters and two outgroups: P. hallii var. hallii64 and S. bicolor73. In brief, GENESPACE parses protein similarity scores into syntenic blocks and runs orthofinder78 on synteny-constrained blast results. The resulting block coordinates and syntenic orthology networks give high-confidence anchors for evolutionary inference.
To calculate the ancestral states of CDS regions, we first determined sequences that share common ancestry using genomes from Phytozome79. The final number of hits to the switchgrass genome were 38,960 and 33,772 for P. hallii, and S. bicolor, respectively. For any given orthology network, we built two multiple sequence alignments in mafft61, one excluding the focal switchgrass sequence (msa0) and one forcing msa0 to align to the coordinate system of the focal sequence via the --keeplength parameter. We then extracted marginal character states with the maximum likelihood algorithm in Phangorn80. For each reconstruction, only the internal node closest to the switchgrass branch was used as the ancestral state. Overall, we analysed 40,943 switchgrass gene models (216,157 exons) covering 54.95 Mb (Supplementary Data 9).
To infer the ages of the subgenomes and tetraploid switchgrass, we took a conservative set of orthologues with simple 2:1:1 networks between P. virgatum, P. hallii and S. italica. This yielded 45,045 switchgrass proteins aligning to 24,549 P. hallii proteins, resulting in 20,496 homologue pairs and 4,053 singletons (2,396 for K subgenome and 1,660 for N subgenome) from the cross-species analysis. We aligned the translated CDS of these sequences using Dialign-TX81. The aligned CDS sequences were concatenated and fed to Gblocks82 using default parameters. Gblocks filtered the alignment of 18,044,244 CDS nucleotides to 16,321,302 positions, in 50,334 blocks. The resulting alignment was then used in PhyML83 to build a maximum-likelihood tree using the general-time reversible model. This tree was used as an input to r8s84, to compute a time tree and calibrate the Panicum–Setaria node of the tree to 13.1 Ma63. To date subgenome divergence and therefore the timing of polyploid switchgrass speciation, we leveraged burst distances, which refer to all distances within an LTR family (whereas pairwise distances refer to the distance between the 5′ and 3′ LTRs of the same insertion). The 5′ versus 3′ distances of the N- or K-subgenome-specific retrotransposons were used to date the insertion times of those elements. This method cannot be used for the P. virgatum-specific or Panicum-specific families because the more recent expansions of those elements dominate the distributions. Instead, we relied on comparing the best cross-species alignments to estimate the LTR distances of the P. virgatum–P. hallii and Panicum–Setaria nodes. This way, we have calibration points to compare the LTR distances to the more confident protein-coding gene divergences between species.
To assess whether the subgenome evolution biases observed at the protein-coding sequence scale were manifest in phenotypes, we explored gene expression biases between homologues from biologically replicated AP13 leaf tissue (n ≥ 5) collected at two sites (TX2 and MI). Illumina paired-end RNA-seq 150-bp reads were quality trimmed (Q ≥ 25) and reads shorter than 50 bp after trimming were discarded. High-quality sequences were aligned to P. virgatum v5.1 reference genome using GSNAP65 and counts of reads uniquely mapping to annotated genes were obtained using HTSeq v.0.11.285. The test for differential expression was conducted through a likelihood ratio test in DESeq286. Library sizes were calculated before splitting the reads by subgenome; these sizes were used as the size factors in the analysis of differential expression. Subfunctionalization was defined as a significant subgenome-by-environment interaction from the likelihood ratio test. Subgenome expression bias was tested for both the field gardens and annotation libraries using post hoc Wald-test contrasts between subgenomes within conditions. Significant bias was defined as differential expression false-discovery-rate-adjusted P < 0.05. Weighted gene coexpression clustering of AP13 gene annotation RNA-seq libraries was conducted with WGCNA87 with a power of 6. Raw counts can be found in Supplementary Data 10.
We used a LSRFortessa SORP Flow Cytometer (BD Biosciences) to determine ploidy levels of the resequenced accessions. For each plant, 200–300 mg of young leaf tissue was macerated in a Petri dish with a razor blade and treated for 15 min with 1 ml Cystain PI Absolute P nuclei extraction buffer (Sysmex Flow Cytometry) mixed with 1 μl 2-mercaptoethanol. Samples were filtered to isolate free nuclei with a CellTrics 30-μm filter (Sysmex) and treated for 20 min on wet ice with 2 ml of Cystain PI Absolute P staining buffer (Sysmex), 12 μl of propidium iodide and 6 μl of RNase A. Samples were run on the flow cytometer to determine nuclei size with a minimum of 10,000 nuclei analysed per sample. Output from the flow cytometer was analysed with FlowJo software (BD Biosciences) and samples were binned into three categories on the basis of the average units of fluorescence per nuclei (Supplementary Fig. 1). Ploidy level of the sample was considered 4× if the cell population had 40,000–80,000 units of fluorescence, 6× for 80,000–100,000 units and 8× for 100,000–140,000 units. The binning parameters were established with flow cytometry data from several P. virgatum accessions of known ploidy.
We also assessed ploidy of the samples via the distribution of variant allele frequency at biallelic SNPs (as described in ‘Variant calling’). This method assumes that tetraploids and octoploids follow different allele frequency distribution patterns, with tetraploids having 0.5/0.5 (reference and variant depths) and octoploids having a mixture of 0.75/0.25 and 0.5/0.5. If the proportion of hits with 0.48 ≤ x ≤ 0.52 was <0.035, the library was considered octoploid and if it was ≥0.035, tetraploid; 837 out of 870 samples (96.2%) that had flow cytometry data matched with these results.
A total of 789 tetraploid diversity samples were resequenced at a median depth of 59× (range 20×–140×). Of these, 732 were used for further analysis after filtering for missing data, outlier elevated heterozygosity and collection site discrepancies. The samples were sequenced using Illumina HiSeq X10 and Illumina NovaSeq 6000 paired-end sequencing (2 × 150 bp) at HudsonAlpha Institute for Biotechnology and the Joint Genome Institute. To account for different library sizes, reads were pruned to ≤50× coverage, then mapped to the v5 assembly using bwa-mem55.
SNPs were called by aligning Illumina reads to the AP13 reference with BWA-mem. The resulting .bam file was filtered for duplicates using Picard (http://broadinstitute.github.io/picard) and realigned around indels using GATK 3.056. Multi-sample SNP calling was done using SAMtools mpileup88 and Varscan V2.4.089 with a minimum coverage of eight and a minimum alternate allele count of four. Genotypes were called via a binomial test. SNPs within 25 bp of a 24-mer repeat were removed from further analyses. Only SNPs with ≤20% missing data and minor allele frequencies >0.005 were retained, resulting in 33,905,042 SNPs across 75% of the genome at a coverage depth between 8× and 500×. Phasing was performed using SHAPEIT390. FST calculations were accomplished via vcftools91. We tested for subgenome read-mapping bias by generating mean coverage per Mb for each of the 732 libraries and 18 chromosomes. We then fit a mixed effects linear model to these data in lme492 in which the chromosome number (1–9) was a random effect, to test the main effect of subgenome. Models with and without the main effect term were compared via a likelihood ratio test.
Individual de novo assemblies for the 732 short read libraries were constructed using HipMer93 with a k-mer size of 101 to maximize haplotype splitting among contigs. As the assemblies varied in quality and contiguity, the sample set considered for gene presence–absence and structural variant detection was narrowed to 251 samples (pan-genome set) based on total assembly size, contig N50 length and total gene alignments per library.
To assess presence–absence variation of genes across the pan-genome, we aligned all AP13 proteins and a unique set of 6,161 proteins from Oropetium thomaeum (nproteins = 1,476)94, S. italica (n = 1,085)63, Setaria viridis (n = 891)72, P. hallii var. filipes (n = 1,048)64, S. bicolor (n = 878)95 and P. hallii var. hallii (n = 772)64. These unique genes were extracted from single-copy orthology networks inferred via orthofinder78 and selection owing to a lack of orthology to switchgrass. All proteins (≥100 amino acids) were aligned to all de novo assemblies using BLAT54. Gene alignments from AP13 proteins were considered present if they aligned with greater than or equal to 80% identity and 75% coverage, whereas other grass proteins were considered present with alignments greater than 70% identity and 75% coverage (to allow greater divergence among species). Variable (pan-genome shell) genes (considered present across 40–60% of the population; n = 5,432) were extracted from the presence–absence variation matrix and used to visualize differences among non-admixed individuals from the Atlantic, Gulf and Midwest subpopulations. Testing genes that were significantly over- or under-represented within each subpopulation was conducted with a χ2 test with a Benjamini–Hochberg multiple testing correction (P ≤ 0.05).
To detect structural variants across the pan-genome, contigs (≥2 kb) from each library were aligned to the AP13 reference genome using ngmlr96 with default settings for PacBio reads. The resulting .bam file was sorted using samtools88 and used for calling structural variants with sniffles96. Individual structural variant calls were merged across samples using SURVIVOR97, with a maximum allowed distance of 1 kb. The resulting .vcf file was filtered using bcftools88 using a minimum minor allele frequency of 0.1, and considering only insertions and deletions between 100 and 1,500 bp in length.
To assess the genetic population structure of the 732 tetraploid libraries (Supplementary Data 4), we extracted all fourfold degenerate sites (putatively neutral) with ancestral state calls (Supplementary Data 9) from the ancestral state alignments. This list of sites, which represents our highest confidence neutral loci, was then linkage-disequilibrium-pruned using a threshold of |r| ≤ 0.6, resulting in 59,789 sites for downstream analyses in the R package SNPRelate98.
The extent of linkage disequilibrium for the population was determined from SNPs99 in PLINK100. Linkage disequilibrium (r2) was calculated using plink (--ld-window 500--ld-window-kb 2000). The r2 value was averaged every 500 bp. A nonlinear model was fit for this data in R using the nls function, and the extent was determined as to when the linkage disequilibrium (r2) nonlinear curve stabilized.
Population genetic structure was assessed hierarchically. Given the presence of highly divergent ecotypes across the study range, we first analysed the broadest genetic population structure using discriminant analysis of principal components (DAPC)101 in adegenet v.2.0.1102. This method does not rely on common assumptions (for example, Hardy–Weinberg equilibrium and linkage disequilibrium) that underlie many population clustering approaches and therefore provides a valuable tool to look at broad structural divisions. DAPC demonstrated a strong set of gene pools and separated Midwest genotypes from all others. We then evaluated the genetic population structure and potential admixture of the remaining non-Midwest individuals using a Bayesian clustering algorithm implemented in STRUCTURE v.2.3.4103 via the admixture model with correlated allele frequencies. The analysis consisted of 20,000 burn‐in steps and 30,000 replicates of 1–6 genotypic groups, each of which was run 10 times. Ancestry coefficients across all subpopulations were assigned post hoc through eigenvector decomposition in SNPRelate.
We inferred the demographic history of the switchgrass samples using Multiple Sequentially Markovian Coalescent (MSMCv.2.0104), which is a population genetic method used to infer demographic history and population structure through time from sequence data. This method models an approximate version of the coalescent under recombination, and produces tests of both population size and divergence time. MSMC was run using four haplotypes for each subpopulation, skipping ambiguous sites, an estimated rhoOverMu of 0.25 and a time segment pattern of 10 × 2 + 20 × 5 + 10 × 2. We estimated rhoOverMu as 0.25 as the mean value from 100 iterations without the fixed recombination parameter for 5 sets of 4 haplotypes in each subpopulation and averaged them. To estimate scaled divergence time in generations, we assumed a mutation rate of 6.5 × 10−8. To make estimates of initial divergence time, we compared adjacent relative cross-coalescence rate (RCCR) values (past to present) (Supplementary Data 11). If there was a decline, either at a single time segment or within contiguous segments or within two interleaved time segments (>0.01;观察到的范围为0.01–0.28),以下邻居接近零(≤0.009;观察到的范围:-0.1-0.009),我们认为这是人口分离的起点。但是 ,如果在五次段之内又下降了,我们将后者视为人口分离的开始 。我们使用16组不同个体的分析来复制分析,以进行每个亚种群对比。
通过距离矩阵的特征向量分解 ,在SNP,结构变异和存在的差异之间可视化了种群结构。首先,在0/1/2之间计算了欧几里得距离矩阵(参考纯合子 ,杂合,替代纯合子)库×标记矩阵,用于三种变体调用类型中的每一种。然后对欧几里得矩阵进行缩放并中心缩放 ,以通过Gower的中心相似性矩阵删除图形的覆盖率方差,该矩阵在R package MDMR105中实现 。
调查了16个植物特征(叶:长度,宽度,长度/宽度比 ,面积,层厚度和层层/中米/中米的厚度比),调查了16个植物特征(叶子:长度 ,宽度,长度/宽度比,整个植物:分erer的数量 ,耕种者的高度,tiller高度×数量,pANILE的高度 ,panicter feel速率,叶子的高度,play leep/fert比)阵容出现的日期)在2019年夏天在德克萨斯大学J. J. Pickle研究校园(PKLE;或TX2(TX2(TX2)(美国德克萨斯州奥斯汀)和密歇根州立大学Kellogg Biological Station(KBSM;或MI;或MI(MI MI)(密歇根州Hickory Corners ,Michigan,USA)普通花园5详细量表,这些变体数据,包括绿色的绿色植物结构(从根茎冠中出现)和圆锥花序的出现(当第一个生殖结构从耕种中出现时) ,每天都会通过测量每种植物的代表性叶子来评估详细的叶子形态。(在MM2中)(持牌3100C叶面积),除了这些定量性状外,我们还为2019年夏末在奥斯汀(Austin)收集了叶子和全植物的定性索引(TX2) ,从而在2019年结束时(TX2)供应了1-5个尺度。评估的低地和高地角色的基线测量包括:从最厚,最低地到最薄,最高地的外观 ,从最宽,最长,最短 ,最短,最稀薄,最稀薄的 ,最高地的叶子 。 顶篷颜色从最蓝,通常是低地到最黑的绿色,最常见的是高地。这种视觉方法类似于SwitchGrass育种者经常使用的基本选择标准。
为了评估这些数据中的表型结构,我们使用了DAPC101 。通过首先使用主成分分析(PCA)转换表型数据来确定先前的组 ,然后将前10个主要成分用于K-均值算法中,将个体分类为3个可能的分组,以最大程度地提高组之间的变化。接下来 ,在10个保留的主组件上实施了DAPC,以使用两个合成变量对生态类簇进行有效描述,这些变量是原始表型变量的线性组合 ,这些变量具有群体间方差和最小的内部内部方差(即歧视功能)。
我们将651个四倍体基因型中的每一个分类为通过MI和TX2花园的16个特征(34个总特征,32个定量和2个定性序序性状),通过1个隐藏层和5个单元(补充数据5) ,将3个生态型中的1种 。The neural network was implemented in caret106 and was trained on seven cultivars with known ecotypes (lowland: Kanlow and Alamo; coastal: High Tide and Stuart; upland: Summer, Dacotah and Sunburst) and 78 additional genotypes that were in the same SNP-based genetic cluster (Extended Data Fig. 3), collected in the same states and clustered most closely in phenotypic PCA space with the示例品种。这些高亲和力的示例基因型印刷在补充数据5中。针对生态型分类性状表型的其余582个基因型的生态型预测了Caret106 。通过使用代表北部和南部开关草系的花园收集的性状,我们希望避免植物表型和随后的生态型分类的局部气候偏见。此外,神经网络分类方法比DAPC和专家的资格提供了一个显着的优势:由于神经网络锚定为已知和已发表的基因型 ,因此包括这些常见品种的实验将能够更有效地概括我们的作业。
我们通过两步管道构建了混合式SNP的数据库。首先,将祖先系数计算出与祖先呼叫相关的四倍变性位点的“人口基因组学”中的计算 。每个亚群的数据最少丢失的30个样品和全基因组混合物的比例≤0.001≤0.001用于定义亚群特异性等位基因频率。这些库用于查找至少一个成对FST值> 0.4的SNP,如Snprelate函数SNPGDSFST中的“ W&C84”方法计算。其次,这些全局祖先信息地点在每个亚种群中被解析为次要等位基因频率> 0.05且缺失的人群 < 0.05. These sites were further pruned within subpopulations first to sites with |r| < 0.9 (10 SNPs or 1,000-bp windows), then to |r| < 0.95 (1,000 SNPs or 10,000-bp windows) in snpRelate. This process resulted in the following SNP and library counts for each subpopulation: Atlantic, 579,468 SNPs and 284 libraries; GULF, 641,975 SNPs and 215 libraries; and Midwest, 481,563 SNPs and 196 libraries.
To test for the physical locations of admixture blocks between each pair of subpopulations, we used Ancestry_HMM36,107. This approach leverages allele frequencies in putative parental populations to determine regions of likely introgressions in a test population. For each of the three subpopulations, we sought to determine the timing, extent and current positions of admixture block introgressions. In each case, we permitted two pulses from each of the other two subpopulations. Ancestry_HMM can optimize the number of generations before present when an ancestry pulse occurred and the proportion of individuals involved in the admixture pulse. However, 8-parameter optimization with >480,000 sites and >150个库在计算上不可行 。因此 ,我们使用40个随机采样的库在0.2-0.8分位数中使用40个带有混合系数的随机抽样库进行了优化,仅在N子基因组的染色体4上仅在掺合式分布的0.2-0.8分位数。由于缺乏明显的大型高频渗入,我们选择了该染色体作为他人的代表。最初在目前的10,000代人的10,000代人中建立了所得的祖先脉冲参数优化 ,另外两个亚群的后续混合脉冲 。the optimized pulses are as follows (source–reference): Midwest–Atlantic (ngenerations = 8,658 and Padmixed = 0.001%; 67 and 0.7%), Gulf–Atlantic (85 and 1.1%; 17 and 0.25%), Atlantic–Gulf (79 and 1.9%; 11 and 0.38%), Midwest–Gulf (79 and 0.86%; 11 and0.14%),大西洋 - 米德斯特(66和0.27%; 14和0.036%)以及海湾 - 米德斯特(71和0.15%; 14和0.033%)。这些脉冲与所有个体和染色体一起提供给完整的模型,以及0.001的错误概率 ,在现场之前的最大世代数量,有效人口大小为100,000。将祖先的概率解码为单倍型块,并将块纳入类似位置块的群集中 。
地理地图是由从天然地球(https://www.naturalarearthdata.com/)下载的公开图层制作的。各种绘图例程依赖于统计计算的R环境中的SF108和Raster109软件包。110 。从WorldClim22(19个生物气候变量 ,0.5-Arcmin分辨率1960–2000)和Climatena21下载了气候数据。通过动态聚类111探索了跨集合位点的气候变量的分布,然后围绕k = 7的Medoids clustering112进行分区。最具代表性的气候变量定义为与每个群集中第一个变异的第一个特征向量相互依赖的气候变量。七个群集中有六个包括世界电获变量 。
从NOAA门户下载了天气数据,以至于每个花园的每日温度(最低至最低)以及2018年9月1日至2019年10月31日的降水量数据。(USC00255362) ,OK(USW00053926),SD(USC00391076),TX1(USC00414810),TX2(USC00410433) ,TX3(USC00418862)和TX4(USC00418862)和TX4(USW00003901)。
在原始数据和估算的数据上进行了跨花园的气候 - 表型关联 。纬度 - 星期关联(图2B)是在R.中通过GLM和二项式家族进行逻辑回归的原始数据完成的。弹药是在所有可用表型中使用最近的邻居(k = 5)在基础R中完成的,仅用于花园等级的测试(图2C,d)。气候相似性 - 生物量关联是通过LME92在混合线性模型中完成的 ,将完整模型(固定模型=气候距离 +截距,随机=基因型标识符)与降低的模型进行了比较,而没有可能使用可能性比率测试的气候距离固定效应 。
物种分布建模(SDM)用于模拟Virgatum的所有生态型(高地 ,低地和沿海)的现代潜在范围。用于构建SDM的最终数据集由277(山地),199(沿海)和121(低地)发生记录组成。在我们的最终SDM建模中使用了六个环境预测因子(Bio1 =年平均温度,Bio2 =平均昼夜范围 ,Bio4 =温度季节性,Bio5 =最高温度最高温度,Bio16 = wettest季度的降水量和Bio17的降水量= Driest季度的降水量) 。然后 ,使用Biomod2 v.3.3113生成SDM,并具有七种建模算法:广义线性模型,增强的回归树,人工神经网络 ,灵活的判别分析,随机森林,分类树分析和多元自适应回归闪光灯。对于每个模型 ,将发生数据与在建模的研究区域内随机生成的500个伪少量数据耦合,并具有相等的维度和伪依据114。用80%的耦合事件和伪散布数据训练模型,并以剩余20%的方式进行了测试 。每种建模算法的总计700型模型运行100次 ,通过真实技能统计(TSS)115进行了评估。TSS值范围从0.2到0.5较差,从0.6到0.8有用,> 0.8良好至优异116。根据TSS阈值值(高地TSS阈值= 0.96 ,Lowland TSS阈值= 0.93和沿海TSS TSS阈值= 0.965),从700型模型中的大约50个最佳SDM计算出唯一的合奏SDM。将最终的合奏SDM投影到当前的气候层上,以可视化现代潜在范围(补充数据12) 。
我们研究了大西洋亚种群中中西部渗入的存在与气候 ,地理和亲属关系的独立和关节影响如何通过在素食117,118,119,120中实施冗余分析。为了划分可归因于气候,亲属关系和地理位置的渗透渗透存在的可解释差异,我们运行了四个模型:一个具有浸入式存在的完整模型(潜在的渗入块(潜在的渗入块)为0的大西洋继承或1个中西部渗入的模型,或1个由气候所解释的(即 ,七个具有代表性的气候变量)的构图(即七个代表性的构图)。(纬度和经度),以及这三个因素中的每个模型,在其他两个因素上 。比较每个模型的约束基质的惯性(即方差)值 ,以确定气候,亲属关系,地理及其关节效应的相对重要性。此外 ,为了找到与气候和生存校正生物量密切相关的浸润区域,我们从其他两个模型中提取了冗余分析轴的负载:(1)仅通过气候预测的一个,并且仅通过生存验证的生物量预测。根据置换测试 ,这两种模型都很重要(n = 999; p < 0.001 for both), and all axes were approximately normally distributed. SNPs loading at the tails of each axis were more likely to indicate selection related to the predictors (that is, climate or survival-corrected biomass), so we identified all markers that were at least 2.5 s.d. (two-tailed P = 0.012) from the centre as introgressions putatively under selection119.
Owing to the large sizes of our common garden datasets, we developed a pipeline—the switchgrassGWAS R package (https://github.com/Alice-MacQueen/switchgrassGWAS)—to allow fast, less-memory-intensive GWAS on the diversity panel, and to analyse the extent to which SNP effects were similar or different for phenotypes measured at different sites. This package leverages bigsnpr121 to perform fast (>比流苏快的300×)对矩阵编码的大规模SNP阵列的统计分析 。它还在人类遗传学文献中纳入了当前的黄金标准,用于SNP质量控制,修剪和插定以及GWAS中的种群结构校正。为了在许多情况下测试许多效应的重要性(例如 ,多个站点,气候变量等),我们使用了MASHR32,一种灵活的 ,数据驱动的方法,可以在任何数据集中共享有关效应大小和符号的信息,以估算许多条件和SNP的条件基础和SNP的效果。我们确定了哪些SNP使用局部错误的符号率具有明显的表型效应的证据 ,该迹象率类似于错误的发现率,但更保守(因为它们也反映了效应符号的不确定性)122 。我们使用这些值来找到具有Log10变换的贝叶斯因子> 2的SNP。在这里,贝叶斯因子是SNP的一种或多种显着表型效应的可能性与SNP仅具有无效效应的可能性的比率。在先前的工作33之后 ,贝叶斯因子> 102被认为是决定性证据,支持SNP具有一种或多种显着的表型效应的假设 。
为了计算与气候和适应性相关的SNP的区域遗传力,我们遵循了先前描述的两步方法123。使用使用Van Raden Method124计算的基因组关系矩阵 ,使用ASREML(VSN International)完成了方差分析。基因组关系矩阵在每个亚群中和全部多样性面板中计算。计算基于单变量GWAS中使用的所有SNP的亲属矩阵(G),以及基于与该亚种群中气候显着相关的SNP的亲属矩阵(log10 transformed Bayes因子> 2; QClimate> 2; QClimate)和基于SNP的SNP与SNP显着相关的SNP与Biomass或Winter Ass Yornformant> Goid 10 snps显着相关(> 1.385用于海湾亚群;这些亲属矩阵用于区域遗传力绘图123,就像以前的出版物中一样 ,使用形式的混合模型:
矢量y表示生物量值,z是随机效应的设计矩阵,u是整个基因组添加剂遗传效应,v是区域基因组加性遗传效应 ,e是残留的 。矩阵G是使用所有SNP的整个基因组添加剂效应的整个基因组关系矩阵。矩阵Q是如上所述获得的区域基因组关系:QCLIMATE或QFITNESS之一。i是排名Y的身份矩阵,其中Y等于生物质值的数量 。整个基因组,区域基因组和残留方差分别是并且分别是。表型方差()为 + +。整个基因组遗传力 ,区域遗传力和总遗传力分别为=(/),=(/)和=( +/) 。
这些模型是针对进行GWAS的三个位置运行的:密苏里州哥伦比亚;密歇根州山核桃角;和德克萨斯州奥斯汀。这导致了80个模型:4种种群(全部多样性面板和3个亚群),2种模型类型(一种仅具有G的模型和G + Q模型) ,用于10种表型(3个地点的生物量和7个环境变量)。
方差分析也用于分配K-和N-伴侣群之间的方差 。在此分析中,仅使用具有祖先调用的SNP(补充数据9),导致每个人群子集使用460,429个SNP。计算了基于特定染色体上所有SNP的亲属矩阵(QCHR01K至QCHR09K ,QCHR01N到QCHR09N),从而产生18个亲属矩阵。这些亲属矩阵用于区域遗传力映射,使用形式的混合模型:
在其中矢量y表示生物量值 ,z是随机效应的设计矩阵,V1K(TO V9K)或V1N(TO V9N)(统称为VI)是染色体特异性基因组添加剂遗传效应,E是残基 。矩阵Qi是N和K亚基因组的九个染色体的染色体特异性基因组关系矩阵。染色体特异性和残留方差分别是并且。染色体特异性的遗传力为=(/),亚基因组特异性的遗传力是每个亚基因组中九个染色体的这些方差的总和。
我们集成了多个数据结构 ,以在渗入间隔内对候选基因的排列和提供有意义的淘汰标准,并与定量性状基因座峰的物理接近 。在GWAS峰的情况下,将候选基因定义为围绕MASHR峰的20 kb间隔内的基因座。基因组渗入的候选基因必须至少部分重叠introgression间隔。由于在遗传亚群中进行了GWAS和介入的推论 ,因此除了基因共表达分析外,补充数据7(候选基因列表)中报告的所有统计数据也是亚群特异性的(仅在AP13 RNA-sequera-sequeration库中进行,用于注释Pips(补充数据3)) 。对于给定的间隔 ,我们提供一组统计信息。首先,将与峰位置的物理接近度计算为基因与间隔(渗入)或GWAS峰位置的中点的中点。其次,由于亚种群内的GWAS峰的基因座必须在该亚种群中有所变化 ,因此我们在其中提取了所有SNP,并符合候选基因模型 。这些变体用SNPEFF126注释,并在https://pcingola.github.io/snpeff/snpeff/se_inputout/#effect-pridection-details- snpefe for Snpeff coneporcore coneper 5 + snperporcore中 ,可以在https://pcingola.github.io/snpeft/snpeft/snpect-/#effect-pridection-details)上找到三个主要类别的加权总和;第三,对于每个基因,我们计算了结构和存在 - 缺少变体的次要等位基因频率。第四,我们包括每个基因WGCNA簇的身份的矢量。最后 ,如果候选者是以前出版物的开花时间GWAS候选基因的同源127,则包括重叠间隔或基因的身份 。
有关研究设计的更多信息可在与本文有关的自然研究报告摘要中获得。
本文来自作者[admin]投稿,不代表象功馆立场,如若转载,请注明出处:https://wap.xianggongguan.cn/kexue/202506-1453.html
评论列表(3条)
我是象功馆的签约作者“admin”
本文概览: 没有使用统计方法来预先确定样本量。实验是完全随机的,研究人员在进行实验或测序时不知道基因型标识符。 为了形成多样性面板,从2010年到2018年收集了来自天然和普...
文章不错《多倍体生物能型草丛中气候适应的基因组机制》内容很有帮助