How To Do Variant Calling From RNASeq NGS Data
Developing variant calling and analysis pipelines for NGS data has become the norm in clinical labs. These pipelines strategically integrate several tools and techniques to identify molecular and structural variants, which in turn supports accurate variant annotation and interpretation. This blog will delve into the concepts and intricacies of developing a “variant calling” pipeline using GATK. Variant calling can also be performed with tools other than GATK, such as FreeBayes and SAMtools.
In this blog, I will walk you through variant calling on Illumina germline RNASeq data. Wherever the handling of DNASeq and RNASeq data differs, I will point out the difference in the tool or method used. The sequenced data used in this example is a random paired-end (R1, R2) germline RNASeq dataset.
Data Preparation
First, assign the input values (the fastq files, the reference fasta file, and the SNP and INDEL known-sites files) to variables in the bash script (pipeline). The fastq files contain the sequenced reads of the sample/patient under study, obtained as a result of the sequencing run. It is imperative to ascertain that the quality of the extracted and fragmented DNA is high, and that the prepared DNA library is uniform and contaminant-free. You can find the detailed dos and don’ts of NGS sequencing and DNA library preparation in one of my blogs.
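As a sketch, with hypothetical file names that should be replaced with the paths to your own data, the variable assignments at the top of the pipeline script could look like this:

```bash
#!/bin/bash
# Hypothetical file names; substitute the paths to your own data.
R1="ABB_40_R1.fastq.gz"                 # forward reads
R2="ABB_40_R2.fastq.gz"                 # reverse reads
REF="GRCh38.fa"                         # reference genome fasta
KNOWN_SNPS="dbsnp_known_snps.vcf.gz"    # known SNP sites
KNOWN_INDELS="known_indels.vcf.gz"      # known INDEL sites
```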
Quality Control
In this step, a quality check of the sequenced data in the Fastq files is performed. “FastQC” reads a set of Fastq files and produces quality control reports consisting of different modules – representing the sequence reads’ quality. These reports are useful in determining and eliminating the low-quality data.
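A minimal FastQC invocation for the two read files could look like this (the output directory name is an assumption):

```bash
# Run FastQC on both read files; an HTML and a zip report are written per fastq file
mkdir -p qc_reports
fastqc -o qc_reports "$R1" "$R2"
```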
Trimmomatic can be used to trim low-quality reads and non-biological sequences from the sequenced read data. This is an important step as the quality of the sequence data used for downstream processing significantly affects the variant calling process.
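A paired-end Trimmomatic sketch is shown below; the jar name, adapter file, and trimming thresholds are the generic example values from the Trimmomatic manual and should be adapted to your own library:

```bash
# Paired-end trimming: remove adapters, clip low-quality ends, drop short reads
java -jar trimmomatic.jar PE -phred33 \
    "$R1" "$R2" \
    ABB_40_R1_trimmed.fastq.gz ABB_40_R1_unpaired.fastq.gz \
    ABB_40_R2_trimmed.fastq.gz ABB_40_R2_unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```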
Alignment
For RNASeq reads, it is recommended to use the “STAR” aligner. An important consideration during this step is the use of the 2-pass mapping method. In this example, I have used 1-pass mapping alone and did not continue with the 2-pass. The 2-pass mapping method provides the most sensitive detection of novel junctions: it does not increase the number of detected novel junctions, but it allows the detection of more spliced reads mapping to those novel junctions. When using the 2-pass method, the junctions obtained from the 1-pass run are used as “annotated” junctions for the 2-pass mapping run. To get a thorough understanding of 2-pass mapping, it is advised to go through the STAR manual. Moreover, you may go through this discussion forum to formulate an appropriate strategy for the 1-pass and 2-pass mapping methods.
From the STAR alignment, an “SJ.out.tab” file for each sample (in this case paired-end R1 and R2 sample “ABB_40”) will be generated. For instance, in the provided example, after the STAR alignment steps, an “ABB_40SJ.out.tab” will be generated. This tab file consists of all the junctions obtained from the 1-pass mapping step.
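As a sketch, the index building and 1-pass alignment for sample “ABB_40” could look like the following; the index directory, annotation GTF, --sjdbOverhang value (read length minus 1), and thread count are assumptions:

```bash
# Build the STAR genome index (done once per reference/annotation combination)
STAR --runMode genomeGenerate \
     --genomeDir star_index \
     --genomeFastaFiles "$REF" \
     --sjdbGTFfile annotation.gtf \
     --sjdbOverhang 100 \
     --runThreadN 8

# 1-pass alignment; the prefix "ABB_40" yields ABB_40Aligned.out.sam and ABB_40SJ.out.tab
STAR --genomeDir star_index \
     --readFilesIn ABB_40_R1_trimmed.fastq.gz ABB_40_R2_trimmed.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix ABB_40 \
     --runThreadN 8
```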
For DNASeq data, the alignment step can be performed using the “BWA” tool; the indexing and alignment steps would be:
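For example (the thread count and output file name are assumptions):

```bash
bwa index "$REF"                                      # build the BWA index for the reference (done once)
bwa mem -t 8 "$REF" "$R1" "$R2" > aligned_reads.sam   # align the paired-end DNA reads
```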
Sorting
Sort the Sequence Alignment/Map format (SAM) file obtained as output from the STAR alignment step by genomic coordinates. In this step, we can also convert the SAM file to a Binary Alignment Map format (BAM) file. BAM files are binary and require less storage than the equivalent SAM files.
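One way to sort and convert in a single step is with samtools (Picard/GATK SortSam would work equally well); the output name “sorted_reads.bam” is an assumption reused in later sketches:

```bash
# Coordinate-sort the STAR output and write it directly as BAM
samtools sort -@ 8 -o sorted_reads.bam ABB_40Aligned.out.sam
```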
An important note here is that SAM/BAM files can be very large, depending on the number of sequenced reads being aligned; hence, “sorting” and “indexing” of the alignment and reference sequence files are vital steps, since they allow computationally complex operations on these files to run much faster.
Indexing a genome or an alignment can be explained in the same way as indexing a book. A book index is an alphabetical list of names, subjects, etc., together with the pages on which they appear. Thus, if you want to know on which page a word appears or a chapter begins, you can look it up in the pre-built index rather than going through the whole book. Similarly, in alignments, indices allow the aligners to identify the potential origin of query sequences or reads within the reference genome, which saves time and computational memory.
Alignment Summary
This step is useful in detailing the quality of the read alignments and the proportion of reads that passed the machine’s signal-to-noise threshold quality filters. It is to be noted that these quality filters are specific to Illumina data.
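The description above corresponds to Picard’s CollectAlignmentSummaryMetrics, which can be run through GATK; a minimal sketch with an assumed output file name:

```bash
# Summarize alignment quality for the sorted BAM
gatk CollectAlignmentSummaryMetrics \
    -R "$REF" \
    -I sorted_reads.bam \
    -O alignment_metrics.txt
```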
In the next step, the “CollectInsertSizeMetrics” tool of GATK provides useful metrics for validating library construction, including the insert size distribution and read orientation of paired-end libraries. It produces the percentages of read pairs in three orientations (Forward-Reverse, Reverse-Forward, and TANDEM) as a histogram.
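A minimal sketch (the output file names are assumptions):

```bash
# Insert-size distribution and read-orientation histogram for the paired-end library
gatk CollectInsertSizeMetrics \
    -I sorted_reads.bam \
    -O insert_size_metrics.txt \
    -H insert_size_histogram.pdf
```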
Mark Duplicate Reads
Duplicate reads are the result of Polymerase Chain Reaction (PCR) amplification during sequencing library preparation, and duplication events arise at different rates as the sequencing experiment proceeds. Because duplicate reads come from the same input DNA template, it can be assumed that they share the same start position on the reference in an alignment.
Also, it should be noted that any sequencing error introduced during the PCR reaction will be carried over to all of the duplicate reads. Hence, even if we eliminate the duplicates by marking or flagging them, the remaining representative read will still carry that error. Therefore, it is imperative to ensure that little to no error is introduced during DNA library preparation, as such errors are propagated to all PCR duplicates.
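Duplicate marking is typically performed with Picard/GATK MarkDuplicates; a minimal sketch, with the input and output BAM names as assumptions:

```bash
# Flag PCR duplicates; the metrics file reports the duplication rate
gatk MarkDuplicates \
    -I sorted_reads.bam \
    -O dedup_reads.bam \
    -M dup_metrics.txt
```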
Adding Read Group (RG) tags
In this step, all the reads in the BAM file are assigned to a single new Read Group (RG). Picard and GATK require the presence of at least one RG tag – to which each read can be assigned. In this example, after assigning the reads to a Read Group with arguments “RGID=1 RGLB=lib2 RGPL=illumina RGPU=unit1 RGSM=3”, the “dedup_reads_RG1.bam” will have an entry stating:
@RG ID:1 LB:lib2 PL:illumina SM:3 PU:unit1
Where “RGID” refers to the “Read Group ID”, “RGLB” refers to the “Read-Group library”, “RGPL” refers to the “Read-Group platform” (in this example, the platform is Illumina), “RGPU” refers to the “Read-Group platform unit” (e.g., run barcode), and “RGSM” refers to the “Read-Group sample name”.
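The command that produces this entry, using the Picard-style arguments quoted above, could look like the following sketch; the input BAM name (“dedup_reads.bam”) is an assumption carried over from the duplicate-marking step:

```bash
# Assign all reads to one read group, matching the @RG entry shown above
java -jar picard.jar AddOrReplaceReadGroups \
    I=dedup_reads.bam \
    O=dedup_reads_RG1.bam \
    RGID=1 RGLB=lib2 RGPL=illumina RGPU=unit1 RGSM=3
```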
Building BAM Index
With reference to the importance of “indexing” stated in the “Sorting” section – in this step – the “dedup” or “duplicate-read-marked” BAM file is indexed to improve the efficiency of downstream variant identification processes.
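For instance, using the Picard BuildBamIndex tool through GATK (samtools index would also work):

```bash
# Index the duplicate-marked, read-grouped BAM (creates dedup_reads_RG1.bai)
gatk BuildBamIndex -I dedup_reads_RG1.bam
```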
Create Reference Sequence Dictionary
In this step, a reference genome sequence dictionary is created with a “.dict” extension – required for the downstream processing and analysis tools. The output “.dict” file contains only a header with sequence records but no SAM records.
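A minimal sketch (the tool is part of Picard and can be run through GATK):

```bash
# Writes the header-only .dict file next to the reference fasta
gatk CreateSequenceDictionary -R "$REF"
```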
Reference Sequence Index
For efficient pattern matching and downstream variant call processing, the reference genome sequence is indexed.
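A minimal sketch using samtools:

```bash
# Creates the .fai index alongside the reference fasta for fast random access
samtools faidx "$REF"
```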
Knownsites Indexing
“Knownsites” are the already annotated or known variants present in public or commercial variant repositories. Indexing these variant files using “tabix” improves the computational speed and performance of downstream tools and analyses.
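For example, for one known-sites file (repeat for every known SNP and INDEL VCF; the file names are assumptions matching the variables defined earlier):

```bash
bgzip known_indels.vcf             # tabix requires bgzip compression; produces known_indels.vcf.gz
tabix -p vcf known_indels.vcf.gz   # produces known_indels.vcf.gz.tbi
```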
Handling Splicing Events in RNASeq Data
Since RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns for HaplotypeCaller. This method splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default, this step also reassigns mapping qualities for good alignments to match DNA conventions.
Furthermore, the algorithm identifies all N cigar elements and creates k+1 new reads (where k is the number of N cigar elements). The first read includes the bases to the left of the first N element, and the portion to the right of the N (including the Ns) is hard-clipped.
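The behaviour described in this section matches GATK’s SplitNCigarReads tool; a minimal sketch, with the output BAM name as an assumption:

```bash
# Split intron-spanning (N-CIGAR) reads and clip mismatching overhangs for HaplotypeCaller
gatk SplitNCigarReads \
    -R "$REF" \
    -I dedup_reads_RG1.bam \
    -O split_reads.bam
```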
Base Quality Recalibration
This is an important step: as part of data pre-processing, it detects systematic errors made by the sequencing machine when estimating the accuracy of each base call.
The two key steps are:
- The BaseRecalibrator is used to build a recalibration model (“recal_data.table” in the provided example) by using all the reads in the input BAM file and the set of known variants (“knownsites”).
- In the next step, the ApplyBQSR adjusts the base quality scores in the data based on the recalibration model and produces a new BAM file (“recal_reads.bam” in the provided example).
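A sketch of these two steps, assuming the input BAM is the “split_reads.bam” produced after splitting the N-CIGAR reads and that the known-sites files match the variables defined earlier:

```bash
# Step 1: build the recalibration model from the reads and the known variant sites
gatk BaseRecalibrator \
    -R "$REF" \
    -I split_reads.bam \
    --known-sites "$KNOWN_SNPS" \
    --known-sites "$KNOWN_INDELS" \
    -O recal_data.table

# Step 2: apply the model to produce a recalibrated BAM
gatk ApplyBQSR \
    -R "$REF" \
    -I split_reads.bam \
    --bqsr-recal-file recal_data.table \
    -O recal_reads.bam
```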
It is recommended to refer here to get further details on the re-calibration step.
Variant Calling
In this example, we are performing variant calling on germline RNASeq data; hence, the “HaplotypeCaller” tool is used. Please note that for somatic (cancer) variant discovery, “Mutect2” can be used.
In this step, the program first determines the active regions based on the presence of evidence for variation. Next, for each active region, it builds a De Bruijn-like graph to reassemble the region and identify the haplotypes (a group of alleles in an organism that is inherited together from a single parent) in the data. Afterwards, realignment of each haplotype is performed against the reference haplotype to determine potentially variant sites. Further, for each active region, a pairwise alignment of each read against each haplotype is performed using the PairHMM algorithm – to produce a matrix of likelihoods of haplotypes. These likelihoods are then marginalized to acquire likelihoods of alleles for each potential variant site given the read data. In the end, Bayes’ rule is applied for each potential variant site using the likelihoods of alleles to calculate the likelihoods of each genotype per sample given the read data observed for that sample; eventually, the most likely genotype is assigned to the sample. Finally, the called variants with their genotype are output as a Variant Call Format (VCF) file (“raw_variants.vcf” in this example).
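A minimal sketch of the germline call on the recalibrated BAM, producing the “raw_variants.vcf” mentioned above:

```bash
# Call germline SNPs and indels via local re-assembly of haplotypes
gatk HaplotypeCaller \
    -R "$REF" \
    -I recal_reads.bam \
    -O raw_variants.vcf
```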
Variant Selection
Selection of called variants is performed using the “SelectVariants” tool of GATK. Different options can be used for selecting subsets of variants from a larger variant callset. To get a complete understanding of these options, it is recommended to refer here.
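For example, one common use is to split the raw callset into separate SNP and INDEL subsets (the output file names are assumptions):

```bash
# Split the raw callset by variant type
gatk SelectVariants -R "$REF" -V raw_variants.vcf --select-type-to-include SNP -O raw_snps.vcf
gatk SelectVariants -R "$REF" -V raw_variants.vcf --select-type-to-include INDEL -O raw_indels.vcf
```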
Variant Filtration
Every VCF file has eight fixed, mandatory columns (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); when genotype data are present, these are followed by a FORMAT column and one column per patient/sample. It is important to understand the VCF format specification before using VCF filtering tools.
The variants called in the previous step are hard-filtered based on the INFO and/or FORMAT annotations of the VCF file. In the example, the values chosen for filters are case-specific. The filter values may vary depending on the type of experiment and the sequence data.
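As an illustration only (the thresholds below are GATK’s generic hard-filtering examples for SNPs; as noted above, suitable values are case-specific), filtering the SNP subset could look like this:

```bash
# Hard-filter the SNP subset; variants failing an expression are flagged in the FILTER column
gatk VariantFiltration \
    -R "$REF" \
    -V raw_snps.vcf \
    --filter-name "QD2"  --filter-expression "QD < 2.0" \
    --filter-name "FS60" --filter-expression "FS > 60.0" \
    --filter-name "MQ40" --filter-expression "MQ < 40.0" \
    -O filtered_snps.vcf
```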
Building Database For Variant Annotation
Once the variants are filtered, the next step is to annotate them. For this step, a local database is created from the UCSC’s RefSeq table. More information on database creation for snpEFF can be found here: https://pcingola.github.io/SnpEff/se_buildingdb/
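Purely as an illustration (the genome name “hg38_refseq_local” and the directory layout are assumptions; follow the linked snpEff documentation for the exact configuration and format flag), the build step could look roughly like this:

```bash
# Assumes snpEff.config contains an entry for the custom genome "hg38_refseq_local"
# and that data/hg38_refseq_local/ holds the UCSC RefSeq table plus the genome sequence,
# laid out as described in the snpEff build documentation linked above.
java -jar snpEff.jar build -refSeq -v hg38_refseq_local
```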
Variant Annotation
Once the local database is created, we annotate the variants using the “snpEff” tool.
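A minimal annotation sketch (the database name is the hypothetical one from the previous step, and the input/output file names are assumptions):

```bash
# Annotate the filtered variants against the local database; an HTML summary report is also written
java -jar snpEff.jar ann -v hg38_refseq_local filtered_snps.vcf > annotated_snps.vcf
```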
Variant Callset Quality Control
Usually, after the variants are called and annotated, we move to downstream analyses; however, it is recommended to assess the quality of the called variants using the evaluation criteria of Number of Indels & SNPs, Indel Ratio, and TiTv Ratio.
The typical methods for variant evaluation are Sanger sequencing of the regions surrounding putative variants and evaluating concordance against results obtained from a genotyping chip run on the same samples. However, both of these methods have drawbacks. Sanger sequencing is the least scalable, as it would be costly and time-consuming to apply to an entire variant callset; and although concordance evaluation is much more scalable than Sanger sequencing, it only covers the subset of known variants that the chip was designed for.
Therefore, the third method mentioned in this step could be used to evaluate how the variants stack up against the gold standards or the already annotated truth set of variants in repositories such as dbSNP. To get a better and comprehensive understanding of how these evaluation methods work, I recommend you refer here.
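One way to compute these metrics against a dbSNP truth set is Picard’s CollectVariantCallingMetrics, run here through GATK (GATK’s VariantEval is an alternative); the input and output names below are assumptions:

```bash
# Compare the callset with a dbSNP truth set; the report includes SNP/indel counts,
# the insertion/deletion ratio, dbSNP concordance, and the TiTv ratio.
gatk CollectVariantCallingMetrics \
    -I filtered_snps.vcf \
    --DBSNP "$KNOWN_SNPS" \
    -O callset_metrics
```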
Extracting variants of Interest
The final step after variant quality evaluation could be to filter large genomic datasets in order to find the most significant variants for your experiment. This can be performed using “snpSift”. In the example below, we are extracting all the “missense variants” from the annotated VCF file.
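A SnpSift sketch for extracting missense variants from the snpEff-annotated VCF (the input and output file names are assumptions):

```bash
# Keep only records whose snpEff ANN field contains a missense effect
java -jar SnpSift.jar filter "ANN[*].EFFECT has 'missense_variant'" \
    annotated_snps.vcf > missense_variants.vcf
```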
And, if you want to extract a set of chromosomal regions from the VCF file to be analyzed further, you can use “tabix”. In this example, I am extracting all SNPs belonging to a region between genomic coordinates 30000000 and 40000000 of chromosome 1. Please note that to perform this step, the VCF file should be “bgzipped” and “tabixed” as shown in “Step 20”.
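A sketch of the compression, indexing, and region extraction (use “chr1” or “1” depending on how chromosomes are named in your reference; file names are assumptions):

```bash
bgzip annotated_snps.vcf              # produces annotated_snps.vcf.gz
tabix -p vcf annotated_snps.vcf.gz    # produces the .tbi index
# Extract variants on chromosome 1 between positions 30,000,000 and 40,000,000 (-h keeps the VCF header)
tabix -h annotated_snps.vcf.gz chr1:30000000-40000000 > chr1_region_variants.vcf
```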
Concluding Remarks
In this blog, I have covered SNP and Indel variant calling and its related concepts. For Copy Number Variant (CNV) calling, I recommend you refer to one of my previous blogs. Furthermore, I would like to emphasize the importance of manual curation of identified and annotated variants using genomics viewers such as IGV – to detect and eliminate false positives in the germline and somatic data.
Although I have covered the vital steps in variant calling, it is to be noted that there may be additional steps required depending on the experimental data used and the ultimate objective of the sequencing project. The end goal of these sequencing projects is mostly apt and accurate variant identification and interpretation that could help in clinical diagnoses.
To learn more about gene prediction and how NGS can assist you, and to get access to all of our advanced materials including 20 training videos, presentations, workbooks, and private group membership, get on the Expert Sequencing wait list.
ABOUT DEEPAK KUMAR, PHD
GENOMICS SOFTWARE APPLICATION ENGINEER
Deepak Kumar is a Genomics Software Application Engineer (Bioinformatics) at Agilent Technologies. He is the founder of the Expert Sequencing Program (ExSeq) at Cheeky Scientist. The ExSeq program provides a holistic understanding of the Next Generation Sequencing (NGS) field - its intricate concepts, and insights on sequenced data computational analyses. He holds diverse professional experience in Bioinformatics and computational biology and is always keen on formulating computational solutions to biological problems.