BJ-DNA-QC
Background#
One way that users can ensure that a single-cell library is uniformly amplified with low allelic dropouts, is by first sequencing using “low-pass” or low throughput sequencing of around 2M reads per sample. Data from the “low-pass” can used to estimate the genome coverage if the single-cell libraries were to be used for high-depth sequencing. Users can then select only quality libraries for high-depth sequencing.
The BJ-DNA-QC pipeline uses low-pass sequencing data and generates several QC metrics that help assess whether the single-cell libraries are ready for high-depth sequencing.
Pipeline Overview#
flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
Start((Start)):::black --fastq--> Trimming[Subsample Reads <br/> <br/> Trim Reads]:::blue
subgraph Preprocess
Trimming
end
subgraph Map
direction TB
Trimming --> Alignment[Align Reads <br/> <br/> Remove Duplicates]:::green
end
subgraph Evaluate
Trimming --> M_FastQC[Read Metrics]:::pink
Alignment --> M_Sentieon[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> Coverage Metrics]:::pink
Alignment --> M_Preseq[Library Complexity Metrics]:::pink
Alignment --> M_Kraken[Kraken]:::pink
Alignment --> M_CNV[CNV]:::pink
end
subgraph Report
M_FastQC --> Mqc[MultiQC Report]:::orange
M_Sentieon --> Mqc
M_Preseq --> Mqc
M_Kraken --> Mqc
end
Mqc --> End((End)):::black
Preprocess:::panel
Map:::panel
Evaluate:::panel
Report:::panel
Following are the steps and tools that pipeline uses to perform the analyses:
-
Subsample the reads to 2 million using
SEQTK SAMPLE
to compare metrics across samples -
Evaluate sequencing quality control using
FASTP
and trim/clip reads -
Map reads to reference genome using
SENTIEON BWA MEM
-
Remove duplicate reads using
SENTIEON DRIVER LOCUSCOLLECTOR
andSENTIEON DRIVER DEDUP
-
Evaluate metrics using
SENTIEON DRIVER METRICS
which includes Alignment, GC Bias, Insert Size, and Coverage metrics -
Evaluate the BAM quality control using
QUALIMAP BAMQC
-
Evaluate the library complexity using
PRESEQ BAM2MR
andPRESEQ GC EXTRAP
-
Evaluate the CNV using a custom
Ginkgo
impelmentation -
Evaluate taxonomic classification with
Kraken
-
Aggregate the metrics across biosamples and tools to create overall pipeline statistics summary using
MULTIQC
Pipeline Parameters#
Parameter Name | Options |
Description |
---|---|---|
Read Length | 50 75 (default) 100 150 |
Read length preference for each sample can be made here and used for sequencing |
Read Sampling | 1000 1000000 2000000 (default) |
Number of reads to sample. 1M paired reads is equivalent to 2M individual reads |
Genome | GRCh38 (default) GRCm38 GRCm39 |
Reference genome to use for alignment |
Optional Modules#
Module | Options | Description |
---|---|---|
FastQC | (default) | FastQC performs QC checks on your raw sequencing data. |
Qualimap | (default) | Qualimap evaluates the quality of the alignment data. |
CNV | (default) | CNV evaluates the Copy Number Variation. |
Kraken | (default) | Kraken evaluates taxonomic classification of reads. |
Output Directories#
Output Directory |
Notes |
---|---|
multiqc / |
This section includes output files containing metrics from various tools to create a MultiQC report. MultiQC Report Example |
primary_analyses / metrics / |
Metrics output from Fastp, Kraken, and/or FASTQC if those modules were selected to run the analyses for each biosample. |
secondary_analyses / alignment / metrics / |
alignment/ Biosample level output containing aligned reads and index file on subsample reads. metrics/ Metrics output from secondary analyses - Alignment, GC bias, Insert Size, Coverage, and library complexity metrics. The section includes outputs from the Bam Lorenz coverage tool containing information about coverage using lorenz curve estimation in order to look at uniformity across the genome. The *-pipeline_all_metrics_mqc.txt contains metrics from the All Metrics section of the MultiQC report found in BaseJumper. Bam lorenz curve Example |
tertiary_analyses / cnv_ginkgo / |
Biosample level output from Ginkgo. CNV profile Example |
execution_info / |
This section includes execution information regarding the pipeline run. |