5 Essential Concepts In Genome Assembly From NGS Data
For researchers, clinicians, and students who perform Next Generation Sequencing (NGS) on human samples, the main goal is to find biomarkers or variants that support a diagnosis and to deduce the genetic anomalies that could be responsible for the disease under study. Most projects, academic or otherwise, are built on the same underlying idea: deciphering the “unknown.” Labs across the world have established computational protocols and pipelines for determining what the “unknown” variants are, and the availability of the “reference” human genome – thanks to the Human Genome Project – plays a vital role in discerning them.
The Human Genome Project started in 1990 as an international scientific research project to comprehend the genetic make-up and base pairs that compose the human genome from both a physical and a functional standpoint. The final sequence map of the human genome was published on April 14th, 2003. Since then, tremendous progress has been made in the field of sequencing, with multiple methodologies (such as Illumina sequencing and pyrosequencing) available on the market. The cost per human genome has dropped even faster than Moore’s law would predict – roughly 100,000-fold, to around $1,000 per genome – leading to the massive production of sequenced genomic data.
This tremendous amount of sequenced data has increased the demand for competent computational analysis to extract the underlying information that could be valuable for research and medical diagnosis. However exhaustive the effort to reach a diagnosis might be, manually and computationally, the availability of the reference human genome makes identifying genetic anomalies in sequence data tractable. Having a reference genome is a great convenience for performing efficient variant calling.
Imagine if the human genome had never been sequenced; where would we stand today? Having that reference sequence available is indeed a convenience. However, not all species have their genomes sequenced. This is especially true for prokaryotes, mainly because of the high genomic diversity among strains: individual strains of a bacterium or virus can differ substantially from one another. Therefore, research on prokaryotes often starts with genome assembly. This step allows researchers to understand the genetic constitution of the strain under study and eventually helps in determining its deleteriousness – for example, studying the pathogenic and non-pathogenic characteristics of bacterial communities in water samples and assessing their influence on human health.
An important point here is that assembling prokaryotic genomes is relatively easy compared to the human genome because of their small size. Bacterial genome sizes range from roughly 30 kbp to 14 Mbp, and viral genomes from about 2 kb to 1 Mb. Genomes of this size require far fewer manual and computational resources than the roughly 3 billion bases of a human genome.
Producing a high-quality genome assembly for prokaryotes starts with the basics of assessing the quality of the NGS sequence data. Before getting to the quality check, it is important to understand which sequencing platforms can generate quality sequence data. Illumina is at the forefront of this race because of its ability to produce paired-end reads: the two reads of a pair come from the opposite ends (the 5-prime and the 3-prime) of the same DNA fragment. In other words, during sequencing the DNA fragment is read from one end, and once the complementary strand is generated, the process starts again from the other direction (Figure 1).
With paired-end sequencing, researchers can sequence both ends of a DNA fragment and generate high-quality, alignable sequence data. Moreover, paired-end sequencing facilitates the detection of genomic structural rearrangements and repetitive sequence elements (repeats). Hence, it is recommended to use Illumina to generate paired-end sequence data for genome assembly. As shown in Figure 1, you can generate 2×150 bp data, meaning two reads (forward “Read1” and reverse “Read2”) of 150 bases each.
Once the sequence read data are generated, you can execute the downstream steps of the genome assembly. In this article, we will review some basic but essential assembly steps and concepts.
1. Quality Control
FastQC is a widely used and robust program for performing quality control and spotting potential problems such as low-quality bases in the sequence reads. The program generates a set of graphs for examining data quality; the FastQC documentation includes examples of both good and bad Illumina data. Among these graphs, one of the most important for estimating data quality is the “Per base sequence quality” plot. If the quality scores of the bases are uniformly high across the graph (a Phred score of 30 or above is generally considered good, and 20 is a common minimum threshold), the NGS library preparation was done well. A divergence in the scores, however, could indicate a poor-quality library. Depending on the available data and their quality statistics, you will have to choose a quality-score threshold that removes bad-quality data and trim the reads at that threshold. Alternatively, you can re-sequence to generate better-quality data.
After this refinement, contaminating sequences are removed from the data. This step is often neglected, but it can play a significant role in minimizing genomic inaccuracy in assemblies. Contaminants are eliminated by mapping the reads against the genomes of likely contaminants, such as the human genome, and discarding the reads that map. Common tools are BWA (for short reads) and BLAST (for long reads).
Further, non-biological elements are discarded from the sequence data by trimming the reads. You can trim the sequence reads using Trimmomatic, which offers a good range of options for appropriate read trimming, such as removing non-biological elements like adapters and primers. A minimal sketch of these preprocessing steps is shown below.
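The following Python sketch chains the three preprocessing steps just described. It is a minimal example under stated assumptions, not a definitive pipeline: it assumes FastQC, BWA, samtools, and a Trimmomatic wrapper script are installed and on the PATH, and every file name (reads, contaminant reference, adapter file) is a hypothetical placeholder.

```python
# Minimal preprocessing sketch; all file names are hypothetical placeholders.
import os
import subprocess

R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"  # raw paired-end reads

# 1. Quality-control reports for both read files.
os.makedirs("qc_reports", exist_ok=True)
subprocess.run(["fastqc", R1, R2, "-o", "qc_reports"], check=True)

# 2. Contaminant removal: map reads to the contaminant reference (here, human;
#    the reference must first be indexed with `bwa index human_ref.fa`), then
#    keep only pairs in which BOTH mates are unmapped (SAM flag 12).
subprocess.run(
    f"bwa mem human_ref.fa {R1} {R2} | samtools sort -n -o mapped.bam -",
    shell=True, check=True)
subprocess.run(["samtools", "fastq", "-f", "12",
                "-1", "clean_R1.fastq", "-2", "clean_R2.fastq", "mapped.bam"],
               check=True)

# 3. Adapter removal and quality trimming with Trimmomatic in paired-end mode:
#    clip adapters, trim with a sliding window at Q20, drop reads under 50 bp.
subprocess.run(["trimmomatic", "PE",
                "clean_R1.fastq", "clean_R2.fastq",
                "trim_R1.fastq", "unpaired_R1.fastq",
                "trim_R2.fastq", "unpaired_R2.fastq",
                "ILLUMINACLIP:adapters.fa:2:30:10",
                "SLIDINGWINDOW:4:20", "MINLEN:50"],
               check=True)
```

Depending on your installation, Trimmomatic may need to be invoked as java -jar trimmomatic.jar instead of the wrapper shown here.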
2. Genome Assembly Intricacies
Following the trimming step, genome assembly is performed, commonly using the De Bruijn Graph (DBG) algorithm. Robust assemblers implementing the DBG algorithm are widely used for efficient genome assembly; one such program is SPAdes.
DBG uses a k-mer approach to generate the consensus sequence. In this approach, the reads are broken into k-mer substrings such that consecutive k-mers overlap by k − 1 bases (Figure 2). The algorithm then looks for overlaps in the pool of k-mers generated from the sequence reads, and the consensus sequence, or assembly, is formed as shown in Figure 3. Note that the 4-base k-mers depicted in Figure 3 are for illustration purposes only; in practice k-mers are larger – for example, SPAdes uses a range of k-mer sizes up to 127 bp.
You can calculate the number of k-mers in a read from the read length and the k-mer size using the formula: N_kmers = L_read − Kmer_size + 1. For instance, for the read shown in Figure 2, the read length (L_read) is 25 bases and the k-mer size (Kmer_size) is 20, so the number of k-mers (N_kmers) is 25 − 20 + 1 = 6.
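As a quick sanity check, here is a small Python sketch that decomposes a read into k-mers and confirms both the count and the k − 1 overlap (the read sequence itself is made up for illustration):

```python
def kmers(read: str, k: int) -> list[str]:
    """Break a read into its overlapping k-mers; consecutive k-mers share k - 1 bases."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read = "AGCTTAGGCATCGGATCCATAGCAT"   # a made-up 25-base read
ks = kmers(read, 20)
print(len(ks))                       # 25 - 20 + 1 = 6
print(ks[0][1:] == ks[1][:-1])       # True: adjacent k-mers overlap by k - 1 = 19 bases
```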
The k-mer approach of the DBG algorithm helps resolve the alignment issues encountered with repetitive regions of the genome, which makes it a strong choice for the assembly process. Since the distance between the two reads of an Illumina paired-end pair is known, DBG assemblers use this information to place reads over repetitive regions more precisely than is possible with single-end reads, where the DNA fragment is read from only one direction (Figure 1). Furthermore, with its k-mer approach, the DBG algorithm is a natural fit for the short reads generated by NGS platforms like Illumina, finding overlaps among reads and generating consensus sequences, or assemblies.
Here you might be asking yourself: how do de Bruijn graphs take long reads into account? The simple answer is that de Bruijn graphs are agnostic about read length. The construction process chooses a k-mer size and parses all reads (long or short) into k-mers of that size, so once the graph is constructed, the long-range information available from the long reads is lost. Such long-range information is most useful for resolving repetitive regions of the genome; intuitively, this is also why choosing larger k-mers helps resolve repeat regions during assembly.
When using DBG on short-read data, you should always choose a k-mer size below the read length: if the k-mer size exceeds the read length of a short-read library, those reads cannot be included in the construction of the de Bruijn graph at all. For example, when running a DBG assembler on an Illumina paired-end library of 2×150 bp, you should use k-mers shorter than 150 bases. In contrast to the DBG approach, a simple alignment-based (overlap) approach cannot reliably align and overlap voluminous sequence read data to generate a consensus sequence, because of its computational complexity and time-consuming execution.
“Coverage” (or “depth”), in the context of mapping sequence reads to a reference genome (if one is available), is a salient feature in assembly assessment. Coverage is defined as the average number of times a nucleotide position is covered by sequencing reads. For instance, in Figure 4, the coverage of the base “T” at the indicated position is 4.
The coverage of an entire genome can be calculated with the formula C = (N × L)/G, where N is the number of reads, L is the average read length, and G is the genome size. Similarly, since k-mers are short stretches of DNA taken from the sequence reads, their coverage can be calculated as Ck = C × (R − K + 1)/R, where R is the read length, K is the k-mer length, and C is the base coverage. The k-mer coverage of an assembly is the number of k-mers that map with perfect identity to that assembly; higher k-mer coverage corresponds to a higher-quality assembly.
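These two formulas translate directly into code. A small sketch with made-up numbers:

```python
def base_coverage(n_reads: int, read_len: float, genome_size: float) -> float:
    """C = (N x L) / G: average number of reads covering each base."""
    return n_reads * read_len / genome_size

def kmer_coverage(c: float, read_len: int, k: int) -> float:
    """Ck = C x (R - K + 1) / R: expected coverage of each k-mer."""
    return c * (read_len - k + 1) / read_len

# Hypothetical numbers: one million 150 bp reads on a 5 Mbp bacterial genome.
C = base_coverage(1_000_000, 150, 5_000_000)   # 30x base coverage
Ck = kmer_coverage(C, 150, 31)                 # 30 * 120/150 = 24x for 31-mers
print(f"C = {C:.0f}x, Ck = {Ck:.0f}x")
```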
3. Subsampling And Assembling
Now that we have seen how the DBG algorithm in SPAdes works, it is the right time to introduce the concept of “subsampling” in genome assembly, or consensus-sequence generation.
DBG is a robust algorithm; however, with millions of reads to assemble at once, it is likely to choke on excessively high coverage and produce inaccurate assemblies. It is therefore recommended to subsample the sequenced data. Subsampling means dividing the sequenced reads in the FASTQ file into separate datasets. For example, a FASTQ file containing 1 million reads can be split into 20 datasets of 50,000 reads each. Datasets of 50,000 to 100,000 reads (especially for prokaryotes and phages) usually give better assembly results, but it is up to you to experiment with your data and determine which dataset size works best. Once the datasets are ready, SPAdes is run on each one, producing a sequence assembly per dataset. It is important to note that the assemblies generated from the datasets are candidate results from which the best assembly has to be chosen: when assessing the assembled sequences for accuracy, all assemblies should be considered, and the best one picked based on the parameters described in the “Assembly Quality Assessment” section. A sketch of this split-and-assemble loop is shown below.
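Here is a minimal Python sketch of the subsampling step: it splits a paired-end FASTQ library into fixed-size datasets (keeping the two mate files in sync) and runs SPAdes on each. It assumes uncompressed FASTQ files with standard four-line records and spades.py on the PATH; all file names are placeholders carried over from the earlier sketch.

```python
# Split a paired-end FASTQ library into datasets and assemble each with SPAdes.
import itertools
import subprocess

def fastq_records(path):
    """Yield one four-line FASTQ record at a time."""
    with open(path) as fh:
        while True:
            record = list(itertools.islice(fh, 4))
            if not record:
                break
            yield record

def subsample(r1_path, r2_path, reads_per_dataset=50_000):
    """Write paired datasets of reads_per_dataset read pairs each; return their names."""
    datasets = []
    pairs = zip(fastq_records(r1_path), fastq_records(r2_path))
    chunks = iter(lambda: list(itertools.islice(pairs, reads_per_dataset)), [])
    for i, chunk in enumerate(chunks):
        out1, out2 = f"ds{i}_R1.fastq", f"ds{i}_R2.fastq"
        with open(out1, "w") as f1, open(out2, "w") as f2:
            for rec1, rec2 in chunk:  # mates stay in sync because we zip the files
                f1.writelines(rec1)
                f2.writelines(rec2)
        datasets.append((out1, out2))
    return datasets

for i, (r1, r2) in enumerate(subsample("trim_R1.fastq", "trim_R2.fastq")):
    subprocess.run(["spades.py", "-1", r1, "-2", r2, "-o", f"assembly_ds{i}"],
                   check=True)
```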
The ideal result is a single assembly per dataset, with significant coverage (50x to 100x) and an assembly size consistent with a similar strain of the species. A single assembly means all the sequenced reads in the dataset were used to produce one consensus sequence, which is the best outcome. Be careful, though, about assembling all reads at once: the assembler will still run and produce assemblies (possibly even a single consensus sequence), which could wrongly suggest a perfect assembly. Hence, as mentioned above, it is advised to run assemblers on subsampled datasets to get better results.
4. Assembly Quality Assessment
Like any experimental or simulation result, de novo assembled genomes need to be rigorously validated. QUAST is a widely used tool for assessing the quality and accuracy of an assembled genome. QUAST takes the assembled sequences in FASTA format, along with the FASTQ reads, as input, and computes coverage, assembly size, the N50 value, and the percentage of reads that map back to the assembled genome. The number of sequenced reads that map back to the assembly is considered an important validation of its accuracy: generally, if 95% of the reads map back to the consensus sequence, you have a good genome assembly.
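A hedged example invocation, assuming QUAST v5 or later (which can map reads back via its -1/-2 options) and reusing the placeholder file names from the earlier sketches; SPAdes writes its result to contigs.fasta inside the output directory:

```python
# Hypothetical QUAST run on the first dataset's assembly. Add
# "-r reference.fasta" if a related reference genome is available.
import subprocess

subprocess.run(["quast.py", "assembly_ds0/contigs.fasta",
                "-1", "trim_R1.fastq", "-2", "trim_R2.fastq",
                "-o", "quast_ds0"], check=True)
# quast_ds0/report.txt then lists the assembly size, N50, number of contigs,
# and the fraction of reads mapped back to the assembly.
```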
Coverage
Genome coverage here means the percentage of the genome that is contained in the assembly, relative to the estimated genome size. Likewise, gene coverage refers to the percentage of genes covered by the assembled genome. Ideally, genome coverage should be about 90 – 95% for the best assemblies in the QUAST assessment output. However, repetitive sequences may lower this number, because they are much more challenging to cover in a genome assembly.
N50
N50 is a measure that describes the “completeness” of a genome assembly. To understand the concept, assume your assembly consists of a set of 7 contigs. Sort the contigs by sequence length, so the longest comes first, followed by the second-longest, with the shortest at the end. Now add up the lengths, starting from the longest contig, until the running total reaches 50% of the total assembly length. The length of the contig at which you stopped counting is the N50 value. In other words, N50 is the length of the shortest contig in the sorted set that, together with all longer contigs, covers 50% of the total assembly length. Generally, the larger the N50 value, the better the assembly. If your N50 is too short, you should perform additional sequencing to improve the assembly before moving on to gene prediction and annotation. The definition translates directly into code, as shown below.
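A short Python sketch of the N50 calculation, using seven made-up contig lengths to mirror the example above:

```python
def n50(contig_lengths: list[int]) -> int:
    """Shortest contig length at which the sorted contigs reach 50% of the total."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

# Seven made-up contig lengths: total = 310, half = 155, and 90 + 70 = 160 >= 155.
print(n50([90, 70, 50, 40, 30, 20, 10]))  # -> 70
```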
Assembly size and number
Ideally, a single assembly from a dataset of sequenced reads is the best result; however, the other quality-assessment factors should be taken into consideration rather than relying only on the number of assemblies the assembler generates. Additionally, if the assembled sequence is similar in size to that of a strain of the same species, that is a good indication of an efficient assembly. Nevertheless, you should check the reliability of the result by mapping the reads back to the assembled sequence using the QUAST tool; if about 90 – 95% of the reads map, the result is viable.
Gaps
Gaps are uncovered regions within the assembly. Generally, it is better to have fewer gaps: a higher percentage of gaps in the genome, or simply more gaps, reduces assembly quality.
5. Cross-validation
Although SPAdes does a significant job of generating quality assemblies, it is highly recommended to cross-validate the result with other robust DBG assemblers such as ABySS, SOAPdenovo, and Velvet. A strong consensus among the assemblies obtained from different assemblers – in assembly size, coverage, and N50 value – lends legitimacy to the result and allows its confident use in further downstream analyses such as gene prediction and annotation (let’s keep that topic for the next blog!).
In this article, we have covered basic yet crucial concepts of genome assembly, although the assembly steps may vary to some extent depending on the organism under study. The stated steps and concepts are not specific to prokaryotic genome assemblies and can also be applied to eukaryotic ones. However, genomic complexities such as larger genome sizes, a higher percentage of repeats, and increased heterozygosity make eukaryotic genome assembly significantly more challenging. Obtaining an accurate assembly requires diligence in the approach, as described here. Nonetheless, with the right resources, concepts, and validation, high-quality assemblies are certainly possible.
To learn more about genome assembly from NGS data, and to get access to all of our advanced materials including 20 training videos, presentations, workbooks, and private group membership, get on the Expert Sequencing wait list.
ABOUT DEEPAK KUMAR, PHD
GENOMICS SOFTWARE APPLICATION ENGINEER
Deepak Kumar is a Genomics Software Application Engineer (Bioinformatics) at Agilent Technologies. He is the founder of the Expert Sequencing Program (ExSeq) at Cheeky Scientist. The ExSeq program provides a holistic understanding of the Next Generation Sequencing (NGS) field – its intricate concepts and insights into the computational analysis of sequenced data. He has diverse professional experience in bioinformatics and computational biology and is always keen on formulating computational solutions to biological problems.