Genotyping by Sequencing for Crop Improvement. Group of authors.
by plus sign. The red arrow indicates reduced representation libraries, produced by digesting the starting DNA sample with a restriction enzyme. The orange arrow indicates amplicon sequencing, and the blue arrow represents sequence capture. The purple arrow indicates whole‐genome resequencing, in which random fragments of digested DNA are sequenced. The green arrow represents plastid genome skimming, which recovers plastid genome sequences.
The figure is reproduced from Loera‐Sánchez et al. (2019), which is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) License that permits reproduction.
Recent advances in NGS include next‐generation RNA sequencing (RNA‐seq) (Sharma et al. 2011; Sonah et al. 2016). This technique sequences mRNA and therefore covers only those genes that are expressed in the transcriptome (Chaudhary et al. 2019b). It helps identify novel genes through de novo assembly, without mapping to a reference genome. NGS also aids the development of molecular markers: it has enabled the discovery of thousands of markers across the entire genome, which facilitates genome‐wide association studies and association mapping (Sonah et al. 2015; Zargar et al. 2015). These technologies have helped us understand the processes underlying gene expression and have supported the development of resources for marker‐assisted selection (MAS) and diversity analysis (Unamba et al. 2015).
3.2 Basic Steps Involved in Whole‐Genome Sequencing and Resequencing
Whole‐genome sequencing (WGS) can be divided into two groups: de novo WGS and whole‐genome resequencing (WGR) (Bhat et al. 2020). De novo WGS assembles the complete genome sequence of a species for the first time (Sevanthi et al. 2018), whereas WGR compares genomic variability within individuals or populations (Patil et al. 2019) and requires an existing reference genome for mapping and variant detection. In de novo WGS, library preparation begins with the fragmentation of high‐quality genomic DNA, followed by the addition of adaptors to the DNA fragments. Short reads from standard libraries (350–550 bp insert size) are used to detect small structural variants such as indels or CNVs (copy number variations), whereas long‐read data or mate‐pair libraries with an insert size of around 2 to 20 kb are required to detect large structural variants. Illumina platforms are often used for high‐throughput sequencing. The reads are aligned to one another on the basis of sequence similarity and assembled into local contigs. During assembly, repetitive regions are difficult to resolve with short reads. In such cases, mate‐pair reads help link and orient the contigs into larger sequences, referred to as scaffolds or supercontigs. Gaps of unknown sequence are denoted by Ns. The final result of a genome assembly is a series of scaffolds made up of contiguous sequences separated by gaps.
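The relationship between scaffolds, contigs, and N‐filled gaps described above can be illustrated with a minimal Python sketch. The function name split_scaffold and the toy scaffold string are hypothetical and serve only as an illustration; real assemblies are stored in FASTA files and handled with dedicated tools.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold at runs of N into its contigs and gap lengths.

    Illustrative helper (not from any assembler): a run of one or more
    'N' characters is treated as a single gap of unknown sequence.
    """
    seq = scaffold.upper()
    # Contigs are the stretches of known sequence between gaps.
    contigs = [piece for piece in re.split(r"N+", seq) if piece]
    # Each run of Ns is one gap; its length is the number of Ns.
    gaps = [len(run) for run in re.findall(r"N+", seq)]
    return contigs, gaps


# Toy scaffold: three contigs joined by gaps of 5 and 3 unknown bases.
scaffold = "ACGTACGTAC" + "N" * 5 + "GGCCTT" + "N" * 3 + "TTAA"
contigs, gaps = split_scaffold(scaffold)
print(contigs)
print(gaps)
```

In this toy case the scaffold is 28 bases long, of which 20 bases are known sequence in three contigs and 8 bases are gap.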
In contrast to de novo WGS, WGR compares the variable sequences between the genomes of individuals or populations. The genome sequence of the species is a prerequisite for read mapping. For an individual, high‐quality genomic DNA is fragmented for library preparation, and adaptors are added to fragments with an average insert size of 350–500 bp. High‐throughput paired‐end sequencing then yields short reads of about 100 bp, which are mapped to the reference genome on the basis of sequence similarity. Single‐nucleotide polymorphisms (SNPs) are detected where a nucleotide differs from the corresponding reference base. Some SNPs may be missed because the region is not present in the reference genome, because they are heterozygous, or because coverage is low. For a population, the aim is to obtain genomic data from a wide range of individuals, which are sequenced and analyzed as a whole. These techniques have a wide range of applications in conservation and management (Fuentes‐Pardo and Ruzzante 2017).
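The read‐mapping logic above can be made concrete with a small Python sketch of SNP detection from reads that have already been placed on a reference. It is a toy illustration, not the method of any particular variant caller: the function name call_snps, the ungapped (start, sequence) read format, and the thresholds min_depth, min_alt_fraction, and hom_fraction are all assumptions chosen for the example.

```python
from collections import Counter


def call_snps(reference, reads, min_depth=4, min_alt_fraction=0.2,
              hom_fraction=0.8):
    """Call SNPs from ungapped reads already mapped to a reference.

    reads: list of (start, sequence) pairs with 0-based start positions.
    All thresholds are illustrative assumptions, not recommended values.
    """
    # Count the bases observed at every reference position (a "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # low coverage: a true SNP here would be missed
        ref_base = reference[pos]
        alts = {b: n for b, n in counts.items()
                if b != ref_base and n / depth >= min_alt_fraction}
        if not alts:
            continue
        alt_base = max(alts, key=alts.get)
        alt_fraction = alts[alt_base] / depth
        genotype = "hom" if alt_fraction >= hom_fraction else "het"
        calls.append((pos, ref_base, alt_base, depth, genotype))
    return calls


reference = "ACGTACGTAC"
reads = [
    (0, "ACTTACGT"),
    (0, "ACTTACGA"),
    (0, "ACTTACGA"),
    (0, "ACTTACGT"),
    (0, "ACTTACGT"),
    (8, "AG"),  # single read: the mismatch at position 9 has depth 1
]
snps = call_snps(reference, reads)
for call in snps:
    print(call)
```

With these toy reads, position 2 is called as a homozygous G to T difference and position 7 as a heterozygous T to A site, while the mismatch at position 9 is skipped because only one read covers it. The skipped site shows how low coverage can hide a variant.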
3.3 Whole‐Genome Resequencing Mega Projects in Different Crops
3.3.1 1K Arabidopsis Genomes Resequencing Project
Arabidopsis thaliana belongs to the family Brassicaceae. It has a 125–150 Mb diploid genome with around 30,000 protein‐coding genes distributed over five chromosomes. Weigel and Mott (2009) initiated a project to report whole‐genome sequence variation in 1001 accessions. To sequence the species‐wide genome of Arabidopsis, they proposed an approach with two aspects. In the first, technologies such as Roche’s 454 platform were used to generate a small number of genome sequences approaching the quality of the original A. thaliana Col‐0 (Columbia) reference, while a less expensive technology, for example Applied Biosystems’ SOLiD or Illumina’s Genome Analyzer, was used for the larger number of remaining genomes. Local haplotype similarity was exploited, using information from the reference genome, to infer complete genome sequences. In the second aspect, sampling covered ten individuals from each of ten populations in each of ten geographical regions across Eurasia, plus at least one accession from North Africa (10 × 10 × 10 + 1). The aim of this 1K Arabidopsis genome project was to produce a generalized genome sequence that encompasses every Arabidopsis accession and against which every A. thaliana genome can be completely aligned (Weigel and Mott 2009).
3.3.2 3K Rice Genomes Resequencing Project
Rice is a principal staple food and is grown worldwide. It has high genetic diversity within the genus and species, and the wise use of this diversity is a major factor for improved production; hence, exploring diversity at the genome level is required. A program to sequence the known diversity across the species was launched through a collaboration of CAAS (Chinese Academy of Agricultural Sciences), BGI (Beijing Genomics Institute), and IRRI (International Rice Research Institute) and named “The 3K rice genome project.” The project produced a publicly available giga‐dataset of genome sequences (averaging 14× depth of coverage) derived from 3K rice accessions, a genetically and functionally diverse set collected from 89 countries. The project aimed to reveal population structure and to enable new population‐specific genotyping arrays for genetic and breeding applications (Li et al. 2014a).
3.3.3 Soybean Whole‐Genome Resequencing
The WGR approach has been used in soybean to identify the quantitative trait loci (QTL) that determine colonization by arbuscular mycorrhizal fungi (AMF). AMF can associate with 80% of terrestrial plants; they help host plants take up more nutrients and provide tolerance against stresses. The degree of colonization and the extent of the benefits provided by AMF depend on the host genotype, and QTL are responsible for mycorrhizal responsiveness in different plants. Pawlowski and coworkers investigated the genetic components involved in the AMF association. The aim of their genome‐wide association analysis was to characterize differences in AMF colonization among soybean genotypes and to identify the genomic regions responsible for colonization. They used a genetically diverse set of 350 soybean genotypes inoculated with AMF (i.e. Rhizophagus intraradices). Using a SNP dataset derived from whole‐genome resequencing, they identified six QTL involved in AMF colonization. The candidate genes identified in these QTL regions include homologs of the nodulin protein family and other genes responsible for symbiosis (Pawlowski et al. 2020).
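The core idea of a genome‐wide association scan, testing each SNP for a relationship with the trait, can be sketched as a single‐marker regression. This is only a toy illustration and not the analysis pipeline of Pawlowski et al. (2020): the function name single_marker_scan, the marker names, and all genotype and colonization values below are invented for the example. A real analysis would also correct for population structure and multiple testing, which this sketch omits.

```python
def single_marker_scan(genotypes, phenotype):
    """Regress a trait on each marker's allele dosage, one marker at a time.

    genotypes: dict mapping marker name -> list of dosages (0, 1, or 2),
               one value per plant line (hypothetical data layout).
    phenotype: list of trait values in the same line order.
    Returns (marker, slope, r_squared) tuples, strongest association first.
    """
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    ss_y = sum((y - mean_y) ** 2 for y in phenotype)
    results = []
    for marker, dosage in genotypes.items():
        mean_x = sum(dosage) / n
        ss_x = sum((x - mean_x) ** 2 for x in dosage)
        if ss_x == 0 or ss_y == 0:
            continue  # monomorphic marker or constant trait: nothing to test
        s_xy = sum((x - mean_x) * (y - mean_y)
                   for x, y in zip(dosage, phenotype))
        slope = s_xy / ss_x                # trait change per allele copy
        r_squared = s_xy ** 2 / (ss_x * ss_y)  # variance explained
        results.append((marker, slope, r_squared))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Invented data: root colonization (%) for six lines and three markers.
colonization = [10.0, 12.0, 20.0, 22.0, 30.0, 32.0]
markers = {
    "snp_a": [0, 0, 1, 1, 2, 2],  # tracks the trait closely
    "snp_b": [0, 2, 0, 2, 0, 2],  # nearly unrelated to the trait
    "snp_c": [0, 0, 0, 0, 0, 0],  # monomorphic, skipped
}
scan = single_marker_scan(markers, colonization)
for marker, slope, r_squared in scan:
    print(marker, round(slope, 2), round(r_squared, 3))
```

In this invented dataset, snp_a ranks first because each additional allele copy is associated with a ten‐point increase in colonization, whereas snp_b explains almost none of the variation and snp_c is skipped because it does not vary among the lines.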
3.3.4 Chickpea
Chickpea (Cicer arietinum) is a rich source of protein, β‐carotene, and minerals such as iron, calcium, phosphorus, manganese, and zinc. Major abiotic stresses such as heat and drought can cause up to 70% loss in yield. Varshney and coworkers used NGS technology to explore the germplasm wealth held in gene banks and to provide information on the genetic variation, domestication, and population structure of 429 chickpea lines. They analyzed 2.7 Tbp (terabase pairs) of raw data comprising 28.36 billion reads, or around 6 Gbp (gigabase pairs) of raw data per sample. Using the mapped resequencing data, they reported a map of 4.97 million SNPs, 596 100 indels, 4931 CNVs, and 60 742 PAVs (presence–absence variations) across the set of 429 reference genotypes. Of the 4.97 million SNPs, most (i.e. 85%) were found in intergenic