BJ-RNA Variant Calling
Pipeline Introduction
The RNA Variant Calling pipeline takes STAR alignment output files and, as in the WGS pipeline, calls variants with the DNAscope and Haplotyper algorithms. Variant effect prediction and annotation are then performed with SnpEff. This tertiary pipeline depends on the secondary Bj-Expression-Pipeline: once the Bj-Expression-Pipeline completes successfully, the prerequisites for launching the RNA Variant Calling pipeline are met.
Key components of the Variant Calling pipeline
Variant Calling
This step calls germline, joint, and copy number variants. The inputs are deduplicated (or deduplicated and compressed) .bam files, various variant databases, a machine learning model, and a targeted .bed file. The outputs are variants described in .vcf files, copy number information in a .tsv file, and various plots.
HAPLOTYPER - Variant calling is performed with the Haplotyper tool, which receives a deduplicated (or deduplicated and compressed) .bam file, a recalibration table, and various variant databases, and outputs variants in a .vcf file.
DNAscope - Germline calling is performed with the DNAscope tool, which receives a deduplicated (or deduplicated and compressed) .bam file and an ML model, and outputs variants in a .vcf file. This step is **optional**.
SnpEff - A variant annotation and effect prediction tool used to understand the genetic changes between each sample and the reference genome [1]. The input files for this tool are .vcf files. More information is available in the SnpEff documentation.
Pipeline steps:
- Map reads to reference: This step aligns the reads in the FASTQ files to the reference genome in the FASTA file, so that each read can be placed in its genomic context.
- Remove duplicates: This step detects reads that indicate the same RNA molecule was sequenced several times. These duplicates are not informative and should not be counted as additional evidence.
- Split reads at junction: This step splits the RNA reads into exon segments by removing the Ns while maintaining grouping information, and hard-clips any sequences overhanging into the intron regions.
- Base quality score recalibration (BQSR): This step adjusts the quality scores assigned to the individual bases of each sequence read, to correct for systematic biases introduced by the sequencing methodology.
- Variant calling: This step identifies the sites where your data displays variation relative to the reference genome, and calculates genotypes for each sample at that site.
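To make the "Split reads at junction" step more concrete, here is a minimal Python sketch of the underlying idea: an alignment whose CIGAR string contains `N` (skipped region) operations is cut into one segment per exon. The function name `split_at_junctions` is hypothetical, and the sketch is an illustration only. It is not the pipeline's implementation, and it omits the hard-clipping of intronic overhangs and the bookkeeping of grouping information described above.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_junctions(pos, cigar):
    """Split an alignment into exon segments at every N operation.

    pos   -- 1-based reference start of the alignment
    cigar -- CIGAR string, e.g. "50M1000N26M"

    Returns a list of (segment_start, segment_cigar) tuples.
    Hypothetical helper for illustration; not part of the pipeline.
    """
    segments = []
    current = []      # CIGAR operations of the segment being built
    seg_start = pos   # reference start of the current segment
    ref_pos = pos     # running reference coordinate

    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # A splice junction: close the current exon segment and
            # jump over the skipped (intronic) reference bases.
            if current:
                segments.append((seg_start, "".join(current)))
            current = []
            ref_pos += length
            seg_start = ref_pos
        else:
            current.append(f"{length}{op}")
            # Only these operations consume reference bases.
            if op in "MD=X":
                ref_pos += length

    if current:
        segments.append((seg_start, "".join(current)))
    return segments


if __name__ == "__main__":
    # A 76-base read spanning one 1000-base intron.
    for start, seg in split_at_junctions(100, "50M1000N26M"):
        print(start, seg)
```

A read without any `N` operation comes back as a single unchanged segment, so the same function can be applied to every alignment.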
Pipeline Overview
Parameters
Parameter Name | Options | Description |
---|---|---|
Genome | GRCh38 (default) | Reference genome to use for alignment |
Output files
Note: * is a placeholder for the remaining text in an output file name.
Output Directory/File | Notes |
---|---|
tertiary_analyses / rna_variant_calling / *_snpEff.ann.vcf / *_snpEff.csv / *rnavariants_dnascope.vcf.gz / *rnavariants_haplotyper.vcf.gz | This section includes the output files |
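As a rough illustration of how the file name patterns above can be used, the following Python sketch groups a list of file names by the pattern they match. The labels in `OUTPUT_PATTERNS`, the function name `classify_outputs`, and the sample file names are all hypothetical; only the four glob patterns come from the table.

```python
from fnmatch import fnmatchcase

# Glob patterns taken from the output table; the dictionary keys are
# hypothetical labels chosen for this example.
OUTPUT_PATTERNS = {
    "snpeff_annotated_vcf": "*_snpEff.ann.vcf",
    "snpeff_csv": "*_snpEff.csv",
    "dnascope_vcf": "*rnavariants_dnascope.vcf.gz",
    "haplotyper_vcf": "*rnavariants_haplotyper.vcf.gz",
}


def classify_outputs(filenames):
    """Group file names by the output pattern they match.

    Returns a dict mapping each label to a list of matching names.
    Names that match no pattern are ignored.
    """
    found = {label: [] for label in OUTPUT_PATTERNS}
    for name in filenames:
        for label, pattern in OUTPUT_PATTERNS.items():
            # fnmatchcase is case-sensitive on every platform.
            if fnmatchcase(name, pattern):
                found[label].append(name)
    return found


if __name__ == "__main__":
    # Hypothetical listing of tertiary_analyses/rna_variant_calling/
    listing = [
        "sampleA_snpEff.ann.vcf",
        "sampleA_snpEff.csv",
        "sampleA_rnavariants_haplotyper.vcf.gz",
        "notes.txt",
    ]
    for label, names in classify_outputs(listing).items():
        print(label, names)
```

Because the DNAscope step is optional, an empty `dnascope_vcf` list is not necessarily an error.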
Output file examples
Example .vcf file after annotation with SnpEff [1]
#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
---|---|---|---|---|---|---|---|
1 | 889455 | . | G | A | 100.0 | PASS | AF=0.0005;EFF=STOP_GAINED(HIGH\|NONSENSE\|Cag/Tag\|Q236*\|749\|NOC2L\|\|CODING\|NM_015658\|) |
1 | 897062 | . | C | T | 100.0 | PASS | AF=0.0005;EFF=STOP_GAINED(HIGH\|NONSENSE\|Cag/Tag\|Q141*\|642\|KLHL17\|\|CODING\|NM_198317\|) |
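The `EFF` entry in the INFO column packs several sub-fields between pipe characters. The Python sketch below shows one way to unpack it, using the first example record as input. The field names in `EFF_FIELDS` are an assumption based on the legacy SnpEff `EFF` layout (impact, functional class, codon change, amino acid change, amino acid length, gene, biotype, coding, transcript, exon); check them against the SnpEff documentation for the version in use. The function names are hypothetical.

```python
import re

# Assumed sub-field order of the legacy SnpEff EFF annotation.
EFF_FIELDS = (
    "impact", "functional_class", "codon_change", "aa_change",
    "aa_length", "gene", "biotype", "coding", "transcript", "exon",
)

# Matches "EFFECT(sub|fields|...)".
EFF_RE = re.compile(r"^(?P<effect>[^(]+)\((?P<body>.*)\)$")


def parse_info(info):
    """Turn 'K1=V1;K2=V2;FLAG' into a dict; bare flags map to True."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_eff(info):
    """Return one dict per effect listed in the EFF entry of an INFO string."""
    eff = parse_info(info).get("EFF")
    if not isinstance(eff, str):
        return []
    records = []
    for entry in eff.split(","):  # multiple effects are comma-separated
        match = EFF_RE.match(entry)
        if not match:
            continue
        record = {"effect": match.group("effect")}
        record.update(zip(EFF_FIELDS, match.group("body").split("|")))
        records.append(record)
    return records


if __name__ == "__main__":
    info = ("AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|"
            "NOC2L||CODING|NM_015658|)")
    for rec in parse_eff(info):
        print(rec["effect"], rec["impact"], rec["gene"], rec["aa_change"])
```

For real files, a dedicated VCF parsing library is likely a better choice than hand-rolled string splitting; this sketch is only meant to show how the annotation is structured.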
Reference
1. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672