BJ-WGS

Background

The BJ-WGS pipeline is a scalable, reproducible bioinformatics pipeline for processing single-cell sequencing data from ResolveDNA Whole Genome Amplification, as well as other single-cell or bulk sequencing data. It currently supports human sequencing data only, but it could be extended to other model systems. The pipeline takes raw sequencing data as fastq files, aligns the reads, removes duplicate reads, and performs base quality score recalibration (BQSR), all before haplotype calling. The pipeline also runs the DNAScope variant caller; by default, the DNAScope vcf files are used for variant annotation and all other downstream analyses.

Pipeline Overview

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Alignment[Align Reads <br/> <br/> Remove Duplicates <br/> <br/> BQSR]
    subgraph Map
        Alignment:::green
    end
    subgraph Variant[Variant Calling]
        Alignment --> Haplotyper[Haplotyper]:::yellow
        Alignment --> DNAScope[DNAScope]:::yellow
    end
    subgraph Evaluate
        Alignment --> M_Sentieon[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> Coverage Metrics]:::pink
        DNAScope --> vcfeval[Variant Evaluation]:::pink
        DNAScope --> Annotation[Variant Annotation]:::pink
    end
    subgraph Report
        vcfeval --> Mqc[MultiQC Report]:::orange
        Annotation --> Mqc
        M_Sentieon --> Mqc
    end
    Mqc --> End((End)):::black
    Map:::panel
    Variant:::panel
    Evaluate:::panel
    Report:::panel
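
The dependencies drawn in the flowchart can also be written down as a small graph and sorted into a valid run order. The following is a minimal sketch, not the pipeline's own scheduler: the step names and the helper functions `execution_order` and `downstream_of` are hypothetical, and only the edges are taken from the diagram.

```python
from graphlib import TopologicalSorter

# Each key depends on the steps in its value set (edges read from the flowchart).
DEPENDENCIES = {
    "alignment": set(),  # align reads, remove duplicates, BQSR
    "haplotyper": {"alignment"},
    "dnascope": {"alignment"},
    "metrics": {"alignment"},
    "variant_evaluation": {"dnascope"},
    "variant_annotation": {"dnascope"},
    "multiqc": {"variant_evaluation", "variant_annotation", "metrics"},
}


def execution_order(dependencies):
    """Return one valid order in which the steps could run."""
    return list(TopologicalSorter(dependencies).static_order())


def downstream_of(step, dependencies):
    """Return every step that directly or indirectly depends on `step`."""
    found = set()
    frontier = [step]
    while frontier:
        current = frontier.pop()
        for name, deps in dependencies.items():
            if current in deps and name not in found:
                found.add(name)
                frontier.append(name)
    return found


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    print(sorted(downstream_of("dnascope", DEPENDENCIES)))
```

Reading the graph this way makes one property of the diagram explicit: Haplotyper output has no downstream consumers, while everything in the Evaluate and Report panels that concerns variants hangs off DNAScope.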

The pipeline uses the following steps and tools to perform the analyses:

  • Map reads to reference genome using SENTIEON BWA MEM

  • Remove duplicate reads using SENTIEON DRIVER LOCUSCOLLECTOR and SENTIEON DRIVER DEDUP

  • Perform base quality score recalibration (BQSR) using SENTIEON DRIVER BQSR

  • Perform variant calling with HAPLOTYPER caller

  • Perform variant calling with DNAScope caller

  • Perform variant annotation with SNPEFF and annotate variants using the COSMIC, ClinVar, and dbSNP databases

  • Evaluate metrics using SENTIEON DRIVER METRICS, which include Alignment, GC Bias, Insert Size, and Coverage metrics

  • Evaluate variants with VCFEval to assess analytical performance (Only supported for HG001 samples)

  • Aggregate the metrics across biosamples and tools to create an overall pipeline statistics summary using MULTIQC
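
The Sentieon steps listed above might translate into `sentieon driver` invocations along the following lines. This is a sketch based on Sentieon's general command-line usage, not the pipeline's actual commands, which this page does not show. The algorithm names (`LocusCollector`, `Dedup`, `QualCal` for BQSR, `Haplotyper`, `DNAscope`), the helper functions, the file names, and the thread count are all assumptions made for illustration. Alignment with `sentieon bwa mem` is omitted, and nothing is executed; the code only builds argument lists.

```python
def driver_command(threads, bam, algo, algo_args, reference=None, recal_table=None):
    """Assemble one `sentieon driver` argument list (nothing is executed)."""
    cmd = ["sentieon", "driver", "-t", str(threads)]
    if reference is not None:
        cmd += ["-r", reference]
    cmd += ["-i", bam]
    if recal_table is not None:
        cmd += ["-q", recal_table]  # apply a recalibration table on the fly
    return cmd + ["--algo", algo, *algo_args]


def build_commands(reference, sorted_bam, dbsnp, prefix, threads=8):
    """Return hypothetical commands for dedup, BQSR, and both variant callers."""
    score = f"{prefix}.score.txt"
    deduped = f"{prefix}.deduped.bam"
    recal = f"{prefix}.recal.table"
    return [
        # Duplicate removal is a two-pass process: collect scores, then dedup.
        driver_command(threads, sorted_bam, "LocusCollector",
                       ["--fun", "score_info", score]),
        driver_command(threads, sorted_bam, "Dedup",
                       ["--rmdup", "--score_info", score,
                        "--metrics", f"{prefix}.dedup_metrics.txt", deduped]),
        # Base quality score recalibration against known sites.
        driver_command(threads, deduped, "QualCal", ["-k", dbsnp, recal],
                       reference=reference),
        # Both callers read the deduplicated BAM plus the recalibration table.
        driver_command(threads, deduped, "Haplotyper",
                       ["-d", dbsnp, f"{prefix}.haplotyper.vcf.gz"],
                       reference=reference, recal_table=recal),
        driver_command(threads, deduped, "DNAscope",
                       ["-d", dbsnp, f"{prefix}.dnascope.vcf.gz"],
                       reference=reference, recal_table=recal),
    ]


if __name__ == "__main__":
    for command in build_commands("ref.fa", "sample.sorted.bam", "dbsnp.vcf.gz", "sample"):
        print(" ".join(command))
```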

Info

HG001 is a Genome in a Bottle (GIAB) reference sample. More information is available from Genome in a Bottle.

Pipeline Parameters

Parameter Name | Options | Description
Genome | GRCh38 (default) | Reference genome to use for alignment

Module Parameters

Module Parameter Name Options Description
Variant Annotation (default) Perform annotation of genic variants with dbSNP, ClinVar, and COSMIC databases.
Evaluate Variant Calling (default) Perform benchmarking on variant calling based on ground truth variants. Only for HG001/NA12878 samples.
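
How the parameters above shape a run can be sketched as a small options object. This is an illustration only: the class `PipelineOptions`, its field names, and the function `planned_steps` are hypothetical, and matching the sample by name is an assumption, since this page does not say how the pipeline recognizes an HG001/NA12878 sample.

```python
from dataclasses import dataclass

# Sample names for which variant evaluation is supported, per the table above.
GIAB_SAMPLES = {"HG001", "NA12878"}


@dataclass(frozen=True)
class PipelineOptions:
    """Hypothetical mirror of the documented parameters and their defaults."""
    genome: str = "GRCh38"
    variant_annotation: bool = True
    evaluate_variant_calling: bool = True


def planned_steps(sample_name, options=PipelineOptions()):
    """List the steps that would run for one biosample under these options."""
    if options.genome != "GRCh38":
        raise ValueError(f"unsupported genome: {options.genome}")
    steps = ["alignment", "haplotyper", "dnascope", "metrics"]
    if options.variant_annotation:
        steps.append("variant_annotation")
    # Assumption: the benchmark sample is identified by its name.
    if options.evaluate_variant_calling and sample_name.upper() in GIAB_SAMPLES:
        steps.append("variant_evaluation")
    steps.append("multiqc")
    return steps


if __name__ == "__main__":
    print(planned_steps("HG001"))
    print(planned_steps("tumor_cell_17"))
```

With the defaults, an HG001 sample gets the extra evaluation step, while any other sample skips it even though the module is enabled.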

Output files

Output Directory/File | Notes
multiqc/ | Output files containing metrics from the various tools, used to create the MultiQC report. (MultiQC Report Example)
secondary_analyses/alignment/ | Biosample-level output containing the aligned reads and index file.
secondary_analyses/metrics/ | Metrics output from secondary analyses: alignment, GC bias, insert size, coverage, and library complexity metrics. The *-pipeline_all_metrics_mqc.txt file contains metrics from the All Metrics section of the MultiQC report found in BaseJumper.
secondary_analyses/variant_calls*/ | Biosample-level output containing the variant calls in vcf format and the index file for the Haplotyper and DNAScope variant callers.
tertiary_analyses/variant_annotation/ | Annotated vcfs for individual biosamples, plus a multisample variants table in txt and hdf5 file formats. (MultiSample Variants Table Example)
execution_info/ | Execution information for the pipeline run.
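
A quick way to check a finished run against the layout above is to look for each documented directory. The sketch below assumes the directories sit directly under a single run directory; the function name `summarize_outputs` and the returned keys are hypothetical. Because this page does not state exactly where the `*-pipeline_all_metrics_mqc.txt` file is written, the search for it covers the whole run directory.

```python
from pathlib import Path

# Directories named in the output description; variant_calls* is matched by glob.
EXPECTED_DIRS = [
    "multiqc",
    "secondary_analyses/alignment",
    "secondary_analyses/metrics",
    "tertiary_analyses/variant_annotation",
    "execution_info",
]


def summarize_outputs(run_dir):
    """Report which documented outputs exist under `run_dir`."""
    root = Path(run_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    variant_dirs = sorted(
        p.name for p in root.glob("secondary_analyses/variant_calls*") if p.is_dir()
    )
    metrics_tables = sorted(
        p.name for p in root.rglob("*-pipeline_all_metrics_mqc.txt")
    )
    return {
        "missing": missing,
        "variant_call_dirs": variant_dirs,
        "all_metrics_tables": metrics_tables,
    }


if __name__ == "__main__":
    import sys

    print(summarize_outputs(sys.argv[1] if len(sys.argv) > 1 else "."))
```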