BJ-Germline-Variantcalling

Background

BJ-Germline-Variantcalling-Parabricks is a scalable and reproducible bioinformatics pipeline for processing single-cell sequencing data from BioSkryb's Whole Genome Amplification. The pipeline takes raw sequencing data as FASTQ (Illumina/Element) or CRAM (Ultima), aligns the reads, and removes duplicate reads (Illumina/Element). It then calls variants with Google DeepVariant, using a custom model trained on BioSkryb single-cell data, and uses population allele frequencies to improve variant-calling sensitivity.

Pipeline Overview

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Alignment[Parabricks Align Reads <br/> <br/> Remove Duplicates]
    subgraph Map
        Alignment:::green
    end
    subgraph Variant[Variant Calling]
        Alignment --> Deepvariant[Deepvariant]:::yellow
    end
    subgraph Evaluate
        Alignment --> Metrics[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> Bam Metrics <br/><br/> HS Metrics - Exome mode ]:::pink
        Deepvariant --> vcfeval[Variant Evaluation]:::pink
        Alignment --> ado[Allelic balance - ADO -Benchmarking  <br/><br/> Bam-Lorenz Coverage]:::pink
    end
    subgraph Report
        vcfeval --> Mqc[MultiQC Report]:::orange
        Metrics --> Mqc
        ado --> Mqc
    end
    Mqc --> End((End)):::black
    Map:::panel
    Variant:::panel
    Evaluate:::panel
    Report:::panel

The pipeline uses the following steps and tools to perform the analyses:

  • Map reads to reference genome and remove duplicate reads using PARABRICKS FQ2BAM

  • Perform variant calling with GOOGLE DEEPVARIANT caller

  • Collect alignment, GC bias, insert size, and coverage metrics using PARABRICKS COLLECTMETRICS

  • Evaluate variant calls with VCFEval to assess analytical performance, and evaluate allelic balance (only supported for Genome in a Bottle samples)

  • Evaluate coverage uniformity across genomic regions with BAM-LORENZ-COVERAGE

  • Aggregate the metrics across biosamples and tools into an overall pipeline statistics summary using MULTIQC
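
To make the coverage-uniformity step more concrete, here is a minimal sketch of how a Lorenz curve and its Gini coefficient can be derived from per-position read depths. It illustrates the general technique only and is not the BAM-LORENZ-COVERAGE implementation; the function names `lorenz_curve` and `gini` and the example depths are hypothetical.

```python
def lorenz_curve(depths):
    """Return Lorenz curve points for a list of per-position read depths.

    Each point is (cumulative fraction of positions,
                   cumulative fraction of total coverage),
    with positions ordered from lowest to highest depth.
    """
    ordered = sorted(depths)
    total = sum(ordered)
    n = len(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one position with non-zero coverage")

    points = [(0.0, 0.0)]
    running = 0
    for i, depth in enumerate(ordered, start=1):
        running += depth
        points.append((i / n, running / total))
    return points


def gini(points):
    """Gini coefficient from Lorenz points (0 = perfectly uniform coverage)."""
    # Trapezoidal area under the Lorenz curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area


if __name__ == "__main__":
    # Hypothetical depths, for illustration only.
    uniform = [10, 10, 10, 10]
    skewed = [0, 0, 0, 40]
    print("uniform:", gini(lorenz_curve(uniform)))
    print("skewed: ", gini(lorenz_curve(skewed)))
```

A curve that follows the diagonal (Gini near 0) means every position receives a similar share of the reads. The further the curve bows below the diagonal, the more the coverage is concentrated in a small part of the genome.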

Pipeline Parameters

| Parameter Name | Options | Description |
|---|---|---|
| Genome | GRCh38 (default) | Reference genome to use for alignment |

Module Parameters

| Module Parameter Name | Options | Description |
|---|---|---|
| Subsampling | (default) | Enables downsampling of input reads to a specified read count using SEQTK. |
| Evaluate Variant Calling | (default) | Performs benchmarking of variant calls against ground truth variants. Only for GIAB samples. |
| Allelic balance (ADO) Benchmarking | (default) | Evaluates allele coverage at known heterozygous sites. |
| BAM Lorenz coverage | (default) | Generates a Lorenz curve from BAM files to assess the uniformity of sequencing coverage across the genome. |
| Mutation Signature Profile | (default) | Performs mutational signature profiling. |
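
As an illustration of what the allelic balance (ADO) module measures, the sketch below classifies known heterozygous sites by the fraction of reads supporting the alternate allele and reports an allelic dropout rate. This is a sketch under stated assumptions, not the pipeline's implementation: the function names, the `min_depth` and `dropout_cutoff` thresholds, and the example read counts are all hypothetical.

```python
def classify_site(ref_count, alt_count, min_depth=10, dropout_cutoff=0.05):
    """Classify one known heterozygous site from its allele read counts.

    min_depth and dropout_cutoff are hypothetical thresholds chosen for
    illustration; they are not taken from the pipeline.
    """
    total = ref_count + alt_count
    if total < min_depth:
        return "low_coverage"
    alt_fraction = alt_count / total
    # At a true heterozygous site both alleles should be observed; a
    # fraction near 0 or 1 means one allele was (almost) lost.
    if alt_fraction <= dropout_cutoff or alt_fraction >= 1 - dropout_cutoff:
        return "dropout"
    return "both_alleles"


def ado_rate(sites, **thresholds):
    """Fraction of sufficiently covered sites that show allelic dropout."""
    labels = [classify_site(ref, alt, **thresholds) for ref, alt in sites]
    evaluable = [label for label in labels if label != "low_coverage"]
    if not evaluable:
        return None
    return evaluable.count("dropout") / len(evaluable)


if __name__ == "__main__":
    # Hypothetical (ref_count, alt_count) pairs at known heterozygous sites.
    sites = [(12, 11), (30, 0), (0, 25), (3, 2), (18, 9)]
    print("ADO rate:", ado_rate(sites))
```

Sites with too few reads are excluded from the denominator, so that low coverage is not mistaken for dropout.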

Output files

| Output Directory/File | Notes |
|---|---|
| multiqc/ | Output files containing the metrics from the various tools that are used to create the MultiQC report. The all_metrics_mqc.txt file contains the metrics from the All Metrics section of the MultiQC report found in BaseJumper. |
| PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/ | Biosample-level output containing the aligned reads and index file. |
| GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/ | Biosample-level output containing the variant calls in VCF format and index file. |
| execution_info/ | Detailed execution information for all tasks in the pipeline run. |