BJ-Germline-Variantcalling
Background
BJ-Germline-Variantcalling-Parabricks is a scalable and reproducible bioinformatics pipeline for processing single-cell sequencing data generated with BioSkryb's Whole Genome Amplification. The pipeline takes raw sequencing data as FASTQ (Illumina/Element) or CRAM (Ultima) files, performs alignment, and removes duplicate reads (Illumina/Element). It calls variants with Google DeepVariant, using a custom model trained on BioSkryb single-cell data, and uses population allele frequency to improve variant-calling sensitivity.
Pipeline Overview
flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Alignment[Parabricks Align Reads <br/> <br/> Remove Duplicates]
    subgraph Map
        Alignment:::green
    end
    subgraph Variant[Variant Calling]
        Alignment --> Deepvariant[Deepvariant]:::yellow
    end
    subgraph Evaluate
        Alignment --> Metrics[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> Bam Metrics <br/><br/> HS Metrics - Exome mode ]:::pink
        Deepvariant --> vcfeval[Variant Evaluation]:::pink
        Alignment --> ado[Allelic balance - ADO - Benchmarking <br/><br/> Bam-Lorenz Coverage]:::pink
    end
    subgraph Report
        vcfeval --> Mqc[MultiQC Report]:::orange
        Metrics --> Mqc
        ado --> Mqc
    end
    Mqc --> End((End)):::black
    Map:::panel
    Variant:::panel
    Evaluate:::panel
    Report:::panel
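The flowchart above can also be read as a dependency graph. The following is a minimal Python sketch, not part of the pipeline itself, that encodes the edges of the diagram and derives one valid execution order with the standard library's `graphlib`. The stage names (`alignment`, `ado_lorenz`, and so on) are shorthand chosen for this illustration and are not pipeline identifiers.

```python
from graphlib import TopologicalSorter

# Each key is a stage; its value is the set of stages it depends on.
# Names are illustrative shorthand for the boxes in the flowchart.
DEPENDENCIES = {
    "alignment": set(),                 # align reads, remove duplicates
    "deepvariant": {"alignment"},       # variant calling
    "metrics": {"alignment"},           # alignment, GC, insert size, BAM, HS metrics
    "ado_lorenz": {"alignment"},        # allelic balance and BAM-Lorenz coverage
    "vcfeval": {"deepvariant"},         # variant evaluation
    "multiqc": {"vcfeval", "metrics", "ado_lorenz"},  # final report
}


def execution_order(dependencies):
    """Return one valid order in which the stages could run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

Alignment always comes first and the MultiQC report last; the three evaluation branches in between do not depend on each other, so a workflow engine is free to run them in parallel.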
The pipeline uses the following steps and tools to perform the analyses:
- Map reads to the reference genome and remove duplicate reads using PARABRICKS FQ2BAM
- Perform variant calling with the GOOGLE DEEPVARIANT caller
- Evaluate metrics using PARABRICKS COLLECTMETRICS, which includes Alignment, GC Bias, Insert Size, and Coverage metrics
- Evaluate variants with VCFEval to assess analytical performance, and evaluate allelic balance (only supported for Genome in a Bottle samples)
- Evaluate coverage uniformity across genomic regions with BAM-LORENZ-COVERAGE
- Aggregate the metrics across biosamples and tools to create an overall pipeline statistics summary using MULTIQC
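To make the coverage-uniformity step more concrete, here is a minimal Python sketch of how a Lorenz curve can be computed from per-position read depths. It is an illustration of the general technique, not the BAM-LORENZ-COVERAGE implementation. The input list of depths is assumed to have been extracted from a BAM file beforehand, and the Gini coefficient is included only as a common scalar summary of the curve; the source does not state that the pipeline reports it.

```python
def lorenz_curve(depths):
    """Return (x, y) points of the Lorenz curve for per-position depths.

    x is the cumulative fraction of positions, sorted from lowest to
    highest depth; y is the cumulative fraction of all sequenced bases.
    """
    ordered = sorted(depths)
    total = sum(ordered)
    n = len(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one position with non-zero depth")
    points = [(0.0, 0.0)]
    running = 0
    for i, depth in enumerate(ordered, start=1):
        running += depth
        points.append((i / n, running / total))
    return points


def gini(points):
    """Gini coefficient: 1 minus twice the area under the curve (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area


# Hypothetical depths: perfectly even coverage vs. all reads on one position.
uniform = lorenz_curve([10, 10, 10, 10])
skewed = lorenz_curve([0, 0, 0, 40])
print(gini(uniform), gini(skewed))
```

Perfectly uniform coverage follows the diagonal and gives a Gini coefficient of 0; the more the curve sags below the diagonal, the less evenly the reads are spread across the genome.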
Pipeline Parameters
| Parameter Name | Options | Description | 
|---|---|---|
| Genome | GRCh38 (default) | Reference genome to use for alignment |
Module Parameters
| Module | Parameter Name | Options | Description |
|---|---|---|---|
| Subsampling | | (default) | Enables downsampling of input reads to a specified read count using SEQTK. |
| Evaluate Variant Calling | | (default) | Performs benchmarking of variant calling against ground-truth variants. Only for GIAB samples. |
| Allelic balance (ADO) Benchmarking | | (default) | Evaluates allele coverage at known heterozygous sites. |
| BAM Lorenz coverage | | (default) | Generates a Lorenz curve from BAM files to assess the uniformity of sequencing coverage across the genome. |
| Mutation Signature Profile | | (default) | Performs mutational signature profiling. |
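As an illustration of what evaluating allele coverage at known heterozygous sites can involve, the sketch below computes the allelic balance at each site and an allelic dropout (ADO) rate. It is a simplified example under stated assumptions, not the pipeline's benchmarking code: the `HetSite` record, the `min_reads` threshold, and the labels are all hypothetical, and the read counts are assumed to have been tallied from the aligned reads already.

```python
from dataclasses import dataclass


@dataclass
class HetSite:
    """Read support at a site known to be heterozygous (hypothetical record)."""
    chrom: str
    pos: int
    ref_reads: int
    alt_reads: int


def allelic_balance(site):
    """Fraction of reads supporting the alt allele, or None without coverage."""
    total = site.ref_reads + site.alt_reads
    if total == 0:
        return None
    return site.alt_reads / total


def classify(site, min_reads=1):
    """Label a site by how many of its two alleles are observed.

    min_reads is an illustrative threshold, not a pipeline setting.
    """
    ref_seen = site.ref_reads >= min_reads
    alt_seen = site.alt_reads >= min_reads
    if ref_seen and alt_seen:
        return "both_alleles"
    if ref_seen or alt_seen:
        return "allelic_dropout"
    return "locus_dropout"


def ado_rate(sites, min_reads=1):
    """Share of covered sites where exactly one allele is observed."""
    labels = [classify(site, min_reads) for site in sites]
    covered = [label for label in labels if label != "locus_dropout"]
    if not covered:
        return None
    return covered.count("allelic_dropout") / len(covered)


# Made-up example counts.
sites = [
    HetSite("chr1", 100, 12, 9),
    HetSite("chr1", 200, 20, 0),
    HetSite("chr2", 300, 0, 0),
    HetSite("chr2", 400, 5, 15),
]
print(ado_rate(sites))
```

A balance near 0.5 indicates both alleles were amplified evenly; values near 0 or 1 indicate that one allele dominates or has dropped out entirely.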
Output files
| Output Directory/File | Notes |
|---|---|
| multiqc/ | Output files containing the metrics from various tools that are used to create the MultiQC report. The all_metrics_mqc.txt file contains the metrics from the All Metrics section of the MultiQC report found in BaseJumper. |
| PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/ | Biosample-level output containing the aligned reads and index file. |
| GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/ | Biosample-level output containing the variant calls in VCF format and index file. |
| execution_info/ | Detailed execution information for all tasks in the pipeline run. |
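After a run, a quick sanity check is to confirm that the output directories listed above are present. The following Python sketch shows one way to do that. It assumes, for illustration only, that all four directories sit directly under a single results directory; the actual layout of a given run may differ.

```python
import tempfile
from pathlib import Path

# Directory names taken from the output table; their location directly
# under one results directory is an assumption of this sketch.
EXPECTED_DIRS = [
    "multiqc",
    "PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM",
    "GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS",
    "execution_info",
]


def missing_outputs(results_dir):
    """Return the expected output directories that are not present."""
    root = Path(results_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demonstration on a temporary directory that holds only two of the outputs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "multiqc").mkdir()
    (Path(tmp) / "execution_info").mkdir()
    demo_missing = missing_outputs(tmp)

print(demo_missing)
```

An empty list means every expected directory was found; any names returned point to steps whose output is worth checking in `execution_info/`.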