BJ-RNA Variant Calling#

Pipeline Introduction#

The RNA Variant Calling pipeline takes STAR alignment output files and, as in the WGS pipeline, uses the DNAscope and Haplotyper algorithms to call variants. Variant annotation and effect prediction are performed with SnpEff. RNA Variant Calling is a tertiary pipeline that depends on the secondary Bj-Expression-Pipeline: once the Bj-Expression-Pipeline completes successfully, the prerequisites are met to launch the RNA Variant Calling pipeline.

Key components of Variant Calling pipeline#

Variant Calling - This step calls germline, joint, and copy number variants. The inputs are deduplicated (or deduplicated and compressed) .bam files, various variant databases, a machine learning model, and a targeted .bed file. The outputs are variants described in .vcf files, copy number information in a .tsv file, and various plots.

Haplotyper - Variant calling is performed with the Haplotyper tool, which receives a deduplicated (or deduplicated and compressed) .bam file, a recalibration table, and various variant databases, and outputs variants in a .vcf file.

DNAscope - Germline calling is performed with the DNAscope tool, which receives a deduplicated (or deduplicated and compressed) .bam file and an ML model, and outputs variants in a .vcf file. This step is **optional**.

SnpEff - A variant annotation and effect prediction tool used to understand the genetic variant changes between each sample and the reference genome.¹ The input files for this tool are .vcf files. More information can be found in the SnpEff documentation.

Pipeline steps:#

Map reads to reference: This step aligns the reads in the FASTQ files to the reference genome contained in the FASTA file, so that each read is placed in its genomic context.
Remove duplicates: This step detects reads indicating that the same RNA molecule was sequenced several times. These duplicates are not informative and should not be counted as additional evidence.
Split reads at junction: This step splits the RNA reads into exon segments by removing the Ns (the skipped regions that mark splice junctions in the alignment) while maintaining grouping information, and hard-clips any sequences overhanging into the intron regions.
Base quality score recalibration (BQSR): This step adjusts the quality scores assigned to the individual bases of the sequence reads, which reduces systematic biases introduced by the sequencing methodology.
Variant calling: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at each of those sites.
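
The "Split reads at junction" step can be illustrated with a minimal sketch. It assumes a spliced alignment described by a 1-based start position and a CIGAR string, and shows only where the split points fall. The function name `split_at_junctions` is hypothetical, and the sketch does not reproduce the hard-clipping of overhangs or the grouping information that the pipeline's tool maintains.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_junctions(pos: int, cigar: str) -> List[Tuple[int, str]]:
    """Split one spliced alignment into exon segments at each N operation.

    pos is the 1-based reference start of the alignment. Returns a list of
    (reference_start, cigar) pairs, one per exon segment.
    Illustrative only: not the pipeline's actual implementation.
    """
    segments = []
    current = []      # CIGAR operations of the segment being built
    seg_start = pos   # reference start of the segment being built
    ref_pos = pos     # running reference coordinate
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            # A skipped region (intron): close the current exon segment.
            if current:
                segments.append((seg_start, "".join(current)))
            current = []
            ref_pos += n
            seg_start = ref_pos
        else:
            current.append(f"{n}{op}")
            if op in "MD=X":
                # Only these operations consume reference bases.
                ref_pos += n
    if current:
        segments.append((seg_start, "".join(current)))
    return segments


# A read with 50 aligned bases, a 1000-base intron, then 25 aligned bases.
print(split_at_junctions(100, "50M1000N25M"))
# prints [(100, '50M'), (1150, '25M')]
```

Each returned segment covers a single exon, so downstream steps see no intron-spanning alignments.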

Pipeline Overview#

Parameters#

Parameter Name	Options	Description
Genome	`GRCh38` (default)	Reference genome to use for alignment

Output files#

Note

* = placeholder for the remaining part of an output file name

Output Directory/File	Notes
`tertiary_analyses/rna_variant_calling/`: `*_snpEff.ann.vcf`, `*_snpEff.csv`, `rnavariants_dnascope.vcf.gz`, `rnavariants_haplotyper.vcf.gz`	Output files of the RNA Variant Calling pipeline
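
As a rough sketch, the expected outputs can be located programmatically after a run. The sketch assumes the files sit directly under `tertiary_analyses/rna_variant_calling/` in a results directory, and that the `_snpEff` file names carry a leading placeholder (`*`) as described in the note above. The function name `find_outputs` and the `results_root` argument are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# File name patterns taken from the output table; "*" stands for the
# placeholder part of the name (assumed to be a prefix).
EXPECTED_PATTERNS = [
    "*_snpEff.ann.vcf",
    "*_snpEff.csv",
    "rnavariants_dnascope.vcf.gz",
    "rnavariants_haplotyper.vcf.gz",
]


def find_outputs(results_root: str) -> Dict[str, List[Path]]:
    """Map each expected pattern to the matching files (possibly none)."""
    out_dir = Path(results_root) / "tertiary_analyses" / "rna_variant_calling"
    return {pattern: sorted(out_dir.glob(pattern)) for pattern in EXPECTED_PATTERNS}
```

Because the DNAscope step is optional, an empty list for `rnavariants_dnascope.vcf.gz` does not necessarily indicate a failed run.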

Output file examples#

Example .vcf file after annotation with SnpEff¹

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	889455	.	G	A	100.0	PASS	`AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|)`
1	897062	.	C	T	100.0	PASS	`AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|)`
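
To show how the annotation in the INFO column can be read, here is a minimal sketch that parses the `EFF` entry of the first example record. The field names in `EFF_FIELDS` follow the classic SnpEff `EFF` layout and are an assumption based on the example lines. Recent SnpEff releases write an `ANN` field with a different layout by default, so check the header of your own files before reusing this. The helper names `parse_info` and `parse_eff` are hypothetical.

```python
from typing import Dict, List

# Sub-fields inside EFF=Effect(...), in order (classic EFF layout, assumed).
EFF_FIELDS = [
    "impact", "functional_class", "codon_change", "amino_acid_change",
    "amino_acid_length", "gene_name", "transcript_biotype", "gene_coding",
    "transcript_id", "exon_rank",
]


def parse_info(info: str) -> Dict[str, str]:
    """Turn 'KEY=VALUE;KEY=VALUE' into a dictionary."""
    result = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def parse_eff(eff_value: str) -> List[Dict[str, str]]:
    """Parse one EFF value, which may list several comma-separated effects."""
    effects = []
    for entry in eff_value.split(","):
        effect, _, rest = entry.partition("(")
        values = rest.rstrip(")").split("|")
        record = {"effect": effect}
        record.update(zip(EFF_FIELDS, values))
        effects.append(record)
    return effects


# First example record from the table above, tab-separated.
line = ("1\t889455\t.\tG\tA\t100.0\tPASS\t"
        "AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|)")
info = parse_info(line.split("\t")[7])
for eff in parse_eff(info["EFF"]):
    print(eff["effect"], eff["impact"], eff["gene_name"], eff["amino_acid_change"])
# prints: STOP_GAINED HIGH NOC2L Q236*
```

For production use, a dedicated VCF library is a safer choice than string splitting, since INFO values can contain characters this sketch does not handle.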

Reference#

¹ Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672