BJ-DNA-QC

Background

One way to confirm that a single-cell library is uniformly amplified with low allelic dropout is to sequence it first at low depth ("low-pass" sequencing, around 2M reads per sample). The low-pass data can be used to estimate the genome coverage the library would achieve under high-depth sequencing. Users can then select only the high-quality libraries for high-depth sequencing.

The BJ-DNA-QC pipeline uses low-pass sequencing data and generates several QC metrics that help assess whether the single-cell libraries are ready for high-depth sequencing.
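
To give a sense of how shallow a low-pass run is, the following sketch works through the usual depth arithmetic (reads × read length ÷ genome size). It is a back-of-the-envelope illustration, not the method the pipeline uses to estimate coverage. The genome size, the 30x target, and the function names are assumptions chosen for the example.

```python
import math

# Approximate human genome size in base pairs; an assumption for illustration,
# not a constant taken from the pipeline.
HUMAN_GENOME_BP = 3_100_000_000


def expected_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage if every sequenced base maps to the genome."""
    return num_reads * read_length / genome_size


def reads_for_depth(depth: float, read_length: int, genome_size: int) -> int:
    """Number of individual reads needed to reach a target mean depth."""
    return math.ceil(depth * genome_size / read_length)


# 2M individual reads of 75 bp (hypothetical low-pass run).
low_pass = expected_depth(2_000_000, 75, HUMAN_GENOME_BP)
print(f"low-pass depth: {low_pass:.3f}x")

# Reads needed for a hypothetical 30x run with 150 bp reads.
print(f"reads for 30x: {reads_for_depth(30, 150, HUMAN_GENOME_BP):,}")
```

Under these assumptions a 2M-read run covers the genome at roughly 0.05x, which is why it is inexpensive enough to screen many libraries before committing to deep sequencing.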

Pipeline Overview

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Trimming[Subsample Reads <br/> <br/> Trim Reads]:::blue
    subgraph Preprocess
        Trimming
    end
    subgraph Map
        direction TB
        Trimming --> Alignment[Align Reads <br/> <br/> Remove Duplicates]:::green
    end
    subgraph Evaluate
        Trimming --> M_FastQC[Read Metrics]:::pink
        Alignment --> M_Sentieon[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> Coverage Metrics]:::pink
        Alignment --> M_Preseq[Library Complexity Metrics]:::pink
        Alignment --> M_Kraken[Kraken]:::pink
        Alignment --> M_CNV[CNV]:::pink

    end
    subgraph Report
        M_FastQC --> Mqc[MultiQC Report]:::orange
        M_Sentieon --> Mqc
        M_Preseq --> Mqc
        M_Kraken --> Mqc
    end
    Mqc --> End((End)):::black
    Preprocess:::panel
    Map:::panel
    Evaluate:::panel
    Report:::panel

The pipeline performs the analyses with the following steps and tools:

Subsample the reads to 2 million using SEQTK SAMPLE so that metrics are comparable across samples
Assess sequencing quality and trim/clip reads using FASTP
Map reads to the reference genome using SENTIEON BWA MEM
Remove duplicate reads using SENTIEON DRIVER LOCUSCOLLECTOR and SENTIEON DRIVER DEDUP
Evaluate metrics using SENTIEON DRIVER METRICS, which include Alignment, GC Bias, Insert Size, and Coverage metrics
Assess BAM quality using QUALIMAP BAMQC
Evaluate library complexity using PRESEQ BAM2MR and PRESEQ GC EXTRAP
Evaluate CNV using a custom Ginkgo implementation
Evaluate taxonomic classification using Kraken
Aggregate the metrics across biosamples and tools into an overall pipeline statistics summary using MULTIQC
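
The subsampling step can be pictured with a small reservoir-sampling sketch. This is an illustration of the general technique of drawing a fixed number of reads in one pass; it is not the pipeline's code and makes no claim about SEQTK's internals. The helper names (`parse_fastq`, `reservoir_sample`, `make_fastq`) and the toy data are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records; a trailing incomplete record is dropped."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups consecutive lines.
    yield from zip(stripped, stripped, stripped, stripped)


def reservoir_sample(records: Iterable[Record], k: int, seed: int) -> List[Record]:
    """Keep k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def make_fastq(n: int, mate: int) -> List[str]:
    """Build toy FASTQ lines for testing (hypothetical data)."""
    lines: List[str] = []
    for i in range(n):
        lines += [f"@read{i}/{mate}", "ACGT", "+", "IIII"]
    return lines


# Using the same seed for both mates selects the same read positions,
# so the pairs stay in sync.
sub1 = reservoir_sample(parse_fastq(make_fastq(100, 1)), k=10, seed=100)
sub2 = reservoir_sample(parse_fastq(make_fastq(100, 2)), k=10, seed=100)
print([r[0] for r in sub1])
```

Sampling every library down to the same read count is what makes the downstream metrics comparable: differences then reflect library quality rather than sequencing yield.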

Pipeline Parameters

| Parameter Name | Options | Description |
| --- | --- | --- |
| Read Length | `50`, `75` (default), `100`, `150` | Read length to use for each sample's sequencing data. |
| Read Sampling | `1000`, `1000000`, `2000000` (default) | Number of reads to sample. 1M paired reads is equivalent to 2M individual reads. |
| Genome | `GRCh38` (default), `GRCm38`, `GRCm39` | Reference genome to use for alignment. |

Optional Modules

| Module | Options | Description |
| --- | --- | --- |
| FastQC | (default) | FastQC performs QC checks on your raw sequencing data. |
| Qualimap | (default) | Qualimap evaluates the quality of the alignment data. |
| CNV | (default) | CNV evaluates copy number variation. |
| Kraken | (default) | Kraken evaluates taxonomic classification of reads. |

Output Directories

| Output Directory | Notes |
| --- | --- |
| `multiqc/` | Output files containing the metrics from the various tools that are used to create the MultiQC report. MultiQC Report Example |
| `primary_analyses/metrics/` | Metrics output from Fastp, Kraken, and/or FastQC for each biosample, if those modules were selected. |
| `secondary_analyses/alignment/`, `secondary_analyses/metrics/` | `alignment/`: biosample-level output containing the aligned reads and index file for the subsampled reads. `metrics/`: metrics output from the secondary analyses (Alignment, GC Bias, Insert Size, Coverage, and library complexity metrics). This directory also includes output from the Bam Lorenz coverage tool, which uses Lorenz curve estimation to describe how uniform coverage is across the genome. The `*-pipeline_all_metrics_mqc.txt` file contains the metrics from the `All Metrics` section of the MultiQC report found in BaseJumper. Bam Lorenz curve Example |
| `tertiary_analyses/cnv_ginkgo/` | Biosample-level output from Ginkgo. CNV profile Example |
| `execution_info/` | Execution information for the pipeline run. |
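
As a rough illustration of what a Lorenz curve says about coverage uniformity, the sketch below builds the curve from per-bin coverage values and summarizes it with a Gini coefficient. It is a generic textbook calculation under assumed inputs; it is not the Bam Lorenz coverage tool's implementation, and the function names and toy coverage values are hypothetical.

```python
from typing import List, Sequence, Tuple


def lorenz_curve(coverage: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (cumulative fraction of bins, cumulative fraction of coverage)."""
    ordered = sorted(coverage)
    total = sum(ordered)
    if not ordered or total <= 0:
        raise ValueError("coverage must contain at least one positive value")
    points = [(0.0, 0.0)]
    running = 0.0
    for i, depth in enumerate(ordered, start=1):
        running += depth
        points.append((i / len(ordered), running / total))
    return points


def gini(coverage: Sequence[float]) -> float:
    """Gini coefficient: 0 for perfectly even coverage, near 1 for very uneven."""
    points = lorenz_curve(coverage)
    area = 0.0
    # Trapezoidal area under the Lorenz curve.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area


uniform = [5, 5, 5, 5]   # every bin covered equally
skewed = [0, 0, 0, 20]   # all coverage concentrated in one bin

print(f"uniform Gini: {gini(uniform):.2f}")
print(f"skewed Gini:  {gini(skewed):.2f}")
```

A curve that hugs the diagonal (Gini close to 0) indicates evenly amplified libraries, while a curve that bows far below it indicates coverage concentrated in a small part of the genome.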