BJ-DNA-QC Pipeline User Manual#

Customer Support#

If you need assistance with the BJ-DNA-QC pipeline, several support channels are available:

BaseJumper Online Manual#

For comprehensive documentation and guides, visit the BaseJumper Documentation Portal.

Email Support#

Contact our support team at: basejumper@bioskryb.com

Online Portal Ticketing System#

For bug reports and feature requests, submit a ticket through our online portal. Your account representative can provide access details.


Before You Begin#

Account Registration#

  1. Go to https://basejumper.bioskryb.com, select ‘Create Account’, and review the Terms and Conditions to register for a BaseJumper account.
  2. Wait for account approval from your organization administrator.
  3. Log in with your credentials to access the platform.

Data Transfer Setup#

Before running the BJ-DNA-QC pipeline, ensure your sequencing data is accessible to you:

Globus Data Transfer#

  • Recommended method for large-scale data transfers
  • Set up a Globus endpoint for your data storage.
  • Globus setup instructions are available in the BaseJumper Online Manual.
  • The BaseJumper support team will add you to your assigned workspace. A confirmation email will arrive from groups.globus.org; check your Junk/Spam folder if you do not see it.
  • Contact BaseJumper support for additional Globus configuration assistance

Alternative Transfer Methods#

  • AWS S3 bucket access (requires IAM credentials and role ARNs to be submitted)
  • Direct upload through BaseJumper interface (for smaller datasets)
  • Contact support to discuss custom data transfer solutions

Sample Metadata Requirements#

The BJ-DNA-QC pipeline requires a properly formatted input CSV. To get a starting template, open the BaseJumper project page and click the ‘Export’ button at the top of the biosample table. The exported file has the following structure; you will need to add the ‘groups’ column yourself:

Input CSV Format#

| Column Name | Description | Required | Example |
|---|---|---|---|
| biosampleName | Unique identifier for each sample | Yes | chr22_testsample1 |
| read1 | Path to forward reads (R1) FASTQ file | Yes | s3://bucket/sample_R1_001.fastq.gz |
| read2 | Path to reverse reads (R2) FASTQ file | Yes | s3://bucket/sample_R2_001.fastq.gz |
| groups | Optional grouping for batch analysis | No | Group1 or Control |

Example Input File#

biosampleName,read1,read2,groups
chr22_testsample1,s3://bioskryb-data/sample1_R1_001.fastq.gz,s3://bioskryb-data/sample1_R2_001.fastq.gz,Group1
chr22_testsample2,s3://bioskryb-data/sample2_R1_001.fastq.gz,s3://bioskryb-data/sample2_R2_001.fastq.gz,Group1
chr22_testsample3,s3://bioskryb-data/sample3_R1_001.fastq.gz,s3://bioskryb-data/sample3_R2_001.fastq.gz,Group2

FASTQ Naming Conventions#

BJ-DNA-QC accepts standard Illumina FASTQ naming formats:

  • {SampleName}_S{SampleNumber}_L{Lane}_R{ReadNumber}_001.fastq.gz
  • Example: SampleA_S1_L001_R1_001.fastq.gz

Important Notes:

  • Files must be gzipped (.fastq.gz extension)
  • Read pairs must have matching sample names
  • S3 paths should be accessible from your BaseJumper account
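The input CSV rules above can be checked before upload. The following is a minimal sketch, not part of BaseJumper: the function name `validate_samplesheet` and the exact checks are assumptions based on the column table and notes in this section (required columns present, unique `biosampleName`, read paths ending in `.fastq.gz`).

```python
import csv
import io

REQUIRED_COLUMNS = ("biosampleName", "read1", "read2")


def validate_samplesheet(csv_text):
    """Return a list of human-readable problems found in the input CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["biosampleName"] or "").strip()
        if not name:
            problems.append(f"line {line_no}: empty biosampleName")
        elif name in seen:
            problems.append(f"line {line_no}: duplicate biosampleName {name!r}")
        seen.add(name)
        for col in ("read1", "read2"):
            path = (row[col] or "").strip()
            if not path.endswith(".fastq.gz"):
                problems.append(f"line {line_no}: {col} is not a .fastq.gz path")
    return problems


example = """biosampleName,read1,read2,groups
chr22_testsample1,s3://bioskryb-data/sample1_R1_001.fastq.gz,s3://bioskryb-data/sample1_R2_001.fastq.gz,Group1
chr22_testsample2,s3://bioskryb-data/sample2_R1_001.fastq.gz,s3://bioskryb-data/sample2_R2_001.fastq.gz,Group1
"""
# An empty list means no problems were found.
print(validate_samplesheet(example))
```

The sketch only inspects the text of the CSV; it does not confirm that the S3 paths exist or are accessible from your account.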


Project Design#

Creating a New Project#

  1. Navigate to Projects: From the BaseJumper dashboard, click the "Create Project" button in the upper right corner.

  2. Project Configuration:

     • Project Name: Choose a descriptive, unique name for your project
     • Project Description: Add details about your experiment (optional but recommended)
     • Select Pipeline: Choose "BJ-DNA-QC" from the pipeline dropdown menu

  3. Sample Selection:

     • Select samples from your Shared Data directory. If you have an Illumina BaseSpace Sequence Hub token, follow the instructions in the BaseJumper Manual to send the token to the support email address.

Shared Data Requirements#

If using samples from Shared Data:

  • Samples must follow Illumina naming conventions
  • Files must be in .fastq.gz format
  • CRAM files are not currently supported; contact ResolveServices for CRAM processing

Sample Organization#

Before submitting your project, consider:

  • Which samples to include? Select only the samples you want to analyze together. You can add biosamples to a project later: return to the workspace-level view and open the dots menu next to your project to find ‘Add Biosamples’. You will then need to reanalyze the entire project to aggregate metrics into one summary report.
  • Metadata grouping: Use the groups column to organize samples by experimental conditions
  • Quality control: Ensure all samples have sufficient read depth. Each biosample should have more than 100,000 reads, and stable quality metrics require at least 1M reads per biosample.

⚠️ Warning: Samples with zero reads will cause the pipeline to fail. Verify data quality before submission. If you auto-queue analyses after project loading, this is the primary reason for analysis to fail.
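One way to catch empty or shallow samples before submission is to count FASTQ records locally. The sketch below is an illustration under stated assumptions: the files are local gzipped FASTQ files with four lines per record (it does not read from S3), and the helper names `count_fastq_reads` and `flag_low_depth` are hypothetical. The 100,000-read threshold is the figure given in this section.

```python
import gzip

MIN_READS = 100_000  # minimum read depth suggested in this section


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def flag_low_depth(read_counts, min_reads=MIN_READS):
    """Split {sample: read_count} into (empty, low) lists of sample names."""
    empty = sorted(name for name, n in read_counts.items() if n == 0)
    low = sorted(name for name, n in read_counts.items() if 0 < n < min_reads)
    return empty, low
```

Samples returned in the first list have zero reads and should be removed before the project is queued; samples in the second list will run but may give unstable metrics.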

Automatic Pipeline Queuing#

By default, projects are automatically queued for execution after creation. To disable this:

  1. Uncheck the "Auto-queue project" option during project setup
  2. Manually start the pipeline from the project dashboard when ready

Pipeline Parameters#

Selecting Pipeline Version#

  1. From the pipeline configuration screen, select your preferred pipeline version
  2. Default: The most recent stable version (recommended)
  3. For reproducibility, you can select specific previous versions

Core Parameters#

Configure the following parameters based on your experimental design:

Read Length#

| Option | Description | Use Case |
|---|---|---|
| 50 | 50 base pair reads | Older sequencing protocols |
| 75 | 75 base pair reads (Default) | Standard low-pass sequencing |
| 100 | 100 base pair reads | Higher resolution QC |
| 150 | 150 base pair reads | Deep sequencing applications |

Default: 75

Read Sampling#

Controls the number of reads subsampled per biosample for QC analysis:

| Option | Description | Notes |
|---|---|---|
| 1000 | 1K reads | Quick testing only |
| 1000000 | 1M reads | Faster QC, lower accuracy |
| 2000000 | 2M reads (Default) | Recommended for comprehensive QC |

Default: 2000000 (2M reads)

ℹ️ Note: 1M paired-end reads equals 2M individual reads

Reference Genome#

Select the reference genome that matches your sample organism:

| Genome | Description |
|---|---|
| GRCh38 (Default) | Human reference genome, build 38 |
| GRCm39 | Mouse reference genome, build 39 |

Default: GRCh38

Optional Modules#

Enable or disable optional analysis modules based on your needs:

| Module | Default State | Description |
|---|---|---|
| FastQC | ✅ Enabled | Performs quality checks on raw sequencing data including per-base quality, GC content, and adapter contamination |
| CNV | ✅ Enabled | Detects copy number variations across the genome using Ginkgo segmentation |
| Qualimap | ❌ Disabled | Evaluates alignment quality with detailed coverage statistics and quality distributions |
| Kraken | ❌ Disabled | Performs taxonomic classification to detect contamination |
| Mutational Signature Profile | ❌ Disabled | Identifies mutational signatures in grouped samples through pseudobulk analysis and variant calling |

💡 Recommendation: Enable CNV module for single-cell DNA samples to assess amplification quality and generate ClusterQC plots for quality categorization

Mutational Signature Profile Requirements#

The Mutational Signature module requires:

  • Groups column in input CSV: Samples with the same group name are combined through pseudobulk analysis
  • GRCh38 genome: the only reference currently supported

Samples are combined by group, variants are called on the pseudobulk BAMs, and mutational patterns are profiled using SigProfilerMatrixGenerator.

Samples with blank or "None" in the groups column will be excluded from mutational signature analysis.
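To preview which samples would be pooled, the grouping rule can be sketched as below. This is an illustration only, not the pipeline's code: the function name `pseudobulk_groups` is hypothetical, and treating "None" as a case-sensitive literal is an assumption.

```python
import csv
import io
from collections import defaultdict


def pseudobulk_groups(csv_text):
    """Map each group label to its biosample names, skipping blank/'None' groups."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = (row.get("groups") or "").strip()
        if label == "" or label == "None":
            continue  # excluded from mutational signature analysis
        groups[label].append(row["biosampleName"])
    return dict(groups)
```

Each key in the returned dictionary corresponds to one pseudobulk group; samples missing from every list would not contribute to mutational signature analysis.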


Workflow Overview#

Pipeline Execution Steps#

The BJ-DNA-QC pipeline performs the following analyses:

flowchart TD
%% Colors %%
classDef black fill:#12294C,stroke:#12294C,stroke-width:3px,color:#fff,font-size:16px
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:3px,color:#fff,font-size:16px
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:3px,color:#fff,font-size:16px
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:3px,color:#fff,font-size:16px
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:3px,color:#fff,font-size:16px
classDef purple fill:#9b5de5,stroke:#9b5de5,stroke-width:3px,color:#fff,font-size:16px

    Start([START]):::black
    Subsample[Read Subsampling<br/>SEQTK - 2M reads]:::blue
    Trim[Quality Trimming<br/>FASTP]:::blue
    Align[Read Alignment<br/>Sentieon BWA-MEM]:::green
    Dedup[Duplicate Removal<br/>Sentieon Dedup]:::green
    Metrics[Metrics Collection<br/>Sentieon Driver]:::pink
    Preseq[Library Complexity<br/>PreSeq]:::pink
    CNV[CNV Analysis<br/>Ginkgo]:::pink
    Kraken[Taxonomic Classification<br/>Kraken - Optional]:::pink
    Pseudobulk{Mutational Signature<br/>Enabled?}:::purple
    PseudoBAM[Pseudobulk BAMs<br/>by Group]:::purple
    VariantCall[Variant Calling<br/>Sentieon DNAscope]:::purple
    Filter[Filter Variants<br/>BCFtools]:::purple
    SigProfile[Mutational Signatures<br/>SigProfilerMatrixGenerator]:::purple
    ClusterQC[ClusterQC Plots<br/>Quality Categories]:::orange
    MultiQC[MultiQC Report<br/>Aggregate Results]:::orange
    End([END]):::black

    Start --> Subsample
    Subsample --> Trim
    Trim --> Align
    Align --> Dedup
    Dedup --> Metrics
    Dedup --> Pseudobulk
    Metrics --> Preseq
    Metrics --> CNV
    Metrics --> Kraken
    Metrics --> MultiQC
    CNV --> ClusterQC
    Preseq --> MultiQC
    Kraken --> MultiQC
    ClusterQC --> MultiQC

    Pseudobulk -->|Yes| PseudoBAM
    Pseudobulk -->|No| MultiQC
    PseudoBAM --> VariantCall
    VariantCall --> Filter
    Filter --> SigProfile
    SigProfile --> MultiQC

    MultiQC --> End

Processing Steps Detail#

1. Read Subsampling (SEQTK)#

  • Randomly selects 2M reads per sample (default)
  • Ensures uniform comparison across samples
  • Output: Subsampled FASTQ files
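The idea of random subsampling that keeps read pairs together can be sketched with reservoir sampling. This is not seqtk's implementation; it is a swapped-in textbook technique, and the function names `fastq_records` and `subsample_pairs` are hypothetical.

```python
import random
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as 4-line tuples from an iterable of lines."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_lines, r2_lines, n_pairs, seed=0):
    """Reservoir-sample n_pairs read pairs, keeping R1 and R2 in sync."""
    rng = random.Random(seed)
    reservoir = []
    pairs = zip(fastq_records(r1_lines), fastq_records(r2_lines))
    for i, pair in enumerate(pairs):
        if i < n_pairs:
            reservoir.append(pair)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n_pairs:
                reservoir[j] = pair  # replace a kept pair at random
    return reservoir
```

Because R1 and R2 records are drawn as a pair, the mate of every selected forward read is always selected too. If a sample has fewer pairs than requested, all of them are returned.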

2. Quality Trimming (FASTP)#

  • Removes adapter sequences
  • Trims low-quality bases
  • Filters reads below quality threshold
  • Output: Trimmed FASTQ files, QC JSON reports

3. Read Alignment (Sentieon BWA-MEM)#

  • Maps reads to reference genome
  • Uses BWA-MEM algorithm optimized by Sentieon
  • Output: BAM alignment files

4. Duplicate Removal (Sentieon Dedup)#

  • Identifies PCR and optical duplicates
  • Marks or removes duplicates from analysis
  • Output: Deduplicated BAM files with indexes

5. Metrics Collection (Sentieon Driver)#

  • Alignment metrics: Mapping rates, paired-end statistics
  • GC bias metrics: GC content distribution
  • Insert size metrics: Fragment size distribution
  • Coverage metrics: Genome-wide coverage statistics
  • Output: Multiple .txt metrics files per sample

6. Library Complexity (Preseq)#

  • Estimates library complexity
  • Projects sequencing saturation
  • Output: Complexity curves and extrapolation data

7. CNV Analysis (Ginkgo) (Optional)#

  • Segments genome into bins
  • Calculates copy number variation
  • Generates CNV profiles
  • Output: CNV plots (JPEG), segmentation files (TSV)

8. Taxonomic Classification (Kraken) (Optional)#

  • Identifies organism composition
  • Detects contamination
  • Output: Taxonomic classification reports

9. Mutational Signature Profiling (Optional)#

  • Combines BAMs from samples with the same group label (pseudobulk)
  • Calls variants using Sentieon DNAscope
  • Filters variants against GIAB reference
  • Generates mutational signature profiles using SigProfilerMatrixGenerator
  • Analyzes single base substitution (SBS) patterns
  • Output: VCF files, mutational catalogs, SBS96 plots
  • Requirements: Groups column in input CSV, GRCh38 genome only

10. Report Generation (MultiQC)#

  • Aggregates metrics across all samples
  • Creates interactive HTML report
  • Generates summary tables
  • Output: multiqc_report.html

File Naming Conventions#

Input Files#

  • Forward reads: {biosampleName}_R1_001.fastq.gz
  • Reverse reads: {biosampleName}_R2_001.fastq.gz

Output Files#

  • Aligned BAM: {biosampleName}.bam
  • BAM index: {biosampleName}.bam.bai
  • CNV calls: {biosampleName}_CNV1.tsv (gains), {biosampleName}_CNV2.tsv (losses)
  • Metrics: {biosampleName}.dedup.{metric_type}.sentieonmetrics.txt

Estimated Processing Time#

| Project Size | Estimated Time | Notes |
|---|---|---|
| 10 samples | <1 hour | With 2M reads per sample |
| 50 samples | 1-3 hours | Standard QC modules enabled |
| 100 samples | 2-4 hours | All optional modules enabled |
| 384 samples | 3-5 hours | High-throughput plate |

Factors affecting runtime:

  • Read depth per sample
  • Reference genome size
  • Optional modules enabled
  • Current cluster load


Summarizing Output#

MultiQC Report#

The MultiQC report aggregates all QC metrics into a single interactive HTML file located at:

multiqc/multiqc_report.html

MultiQC Report Sections#

1. General Statistics#

Overview table showing key metrics for all samples:

  • Total reads
  • Alignment rate
  • Duplication rate
  • Insert size
  • Coverage metrics

2. Selected Metrics#

  • Curated subset of key quality metrics for quick sample assessment
  • Interactive table showing critical QC parameters across all samples
  • Includes metrics such as:
      • Library complexity (PreSeq count)
      • Mitochondrial read percentage
      • Chimeric rate
      • Error rates
      • Alignment percentages
      • Insert size statistics
      • Coverage uniformity (Gini coefficient)
      • CNV quality metrics (MAPD, Segment MAD)
      • Quality category assignments

3. All Metrics#

  • Comprehensive table containing all metrics generated by the pipeline
  • Complete QC dataset for advanced analysis
  • Exportable for downstream processing
  • Includes all metrics from individual pipeline modules

4. FastQC Section#

If enabled, displays:

  • Per-base sequence quality: Quality scores across read positions
  • Per-sequence quality scores: Distribution of mean quality scores
  • Adapter content: Detection of adapter contamination
  • GC content distribution: Expected vs. observed GC content

5. Fastp Section#

Trimming and filtering statistics:

  • Reads passed/filtered
  • Adapter trimming rates
  • Quality filtering results
  • Before/after comparison plots

6. Sentieon Alignment Metrics#

  • Total reads mapped: Percentage of reads successfully aligned
  • Properly paired reads: Percentage of read pairs aligned correctly
  • Chimeric rate: Percentage of chimeric read pairs
  • Mean coverage: Average depth across the genome

7. Sentieon GC Bias Metrics#

  • GC content vs. coverage plot
  • Normalized coverage by GC content
  • AT/GC dropout assessment

8. Insert Size Distribution#

  • Histogram of fragment sizes
  • Mean and median insert sizes
  • Standard deviation

9. Preseq Library Complexity#

  • Complexity curves showing library diversity
  • Extrapolated saturation points
  • Expected unique reads at higher depths

10. Kraken Taxonomic Classification (if enabled)#

  • Species composition
  • Contamination detection
  • Read assignment percentages

11. Mutational Signature Profile (if enabled)#

  • Single base substitution (SBS) patterns
  • Mutational catalog by trinucleotide context
  • SBS96 barplots showing mutation types
  • Group-level mutational profiles

View Example MultiQC Report

ClusterQC Analysis#

ClusterQC provides visual assessment of sample quality and consistency:

CNV Quadrants Plot#

Location: tertiary_analyses/qc_plots/CNV-Quadrants.pdf

This plot classifies samples into quality categories based on CNV metrics:

Axes:

  • X-axis: MAPD Score (Median Absolute Pairwise Difference)
  • Y-axis: Segment MAD (Median Absolute Deviation)

Quality Categories (Quadrants):

| Category | MAPD Score | Segment MAD | Quality Level |
|---|---|---|---|
| Category 5 | < 0.4 | < 0.4 | Excellent - Ready for deep sequencing |
| Category 4 | 0.4-0.6 | < 0.4 | Good - Suitable for sequencing |
| Category 3 | < 0.4 | 0.4-0.6 | Moderate - Review individual profiles |
| Category 2 | 0.4-0.6 | 0.4-0.6 | Fair - Consider re-amplification |
| Category 1 | > 0.6 | > 0.6 | Poor - Not recommended for deep sequencing |

Interpretation:

  • Samples in Category 5 (bottom-left quadrant) show uniform amplification with low noise
  • Samples in Category 1 (top-right quadrant) show high noise and poor coverage uniformity
  • Use this plot to select high-quality samples for downstream deep sequencing

QC Composition Plot#

Location: tertiary_analyses/qc_plots/QC_composition.pdf

Displays the distribution of samples across quality categories:

  • Bar chart showing number and percentage of samples in each category
  • Helps assess overall library quality
  • Useful for batch effect detection

Consensus Scores#

Individual Sample Scores:

Location: tertiary_analyses/qc_plots/DNA-QC_ConsensusScores.txt

Detailed per-sample quality scores showing:

  • SampleId: Biosample identifier
  • CompositeScore: Overall quality category (1-5)
  • Group: Sample group assignment (from input CSV)
  • Verdict columns: Individual pass/fail scores for each QC metric:
      • Verdict_preseq_count: Library complexity score
      • Verdict_PCT_CHIMERAS: Chimeric rate score
      • Verdict_chrM: Mitochondrial contamination score
      • Verdict_MAPD_CNV_Log2: Coverage uniformity score
      • Verdict_SKEW_CNV: CNV distribution symmetry score

Each verdict is scored as 1 (pass) or 0 (fail). The CompositeScore represents the overall quality based on all individual verdicts.

Example:

SampleId                        CompositeScore  Group    Verdict_preseq  Verdict_CHIMERAS  Verdict_chrM  Verdict_MAPD  Verdict_SKEW
ResolveOMEv2.0-384-DNA-QC-B4   5               group1   1               1                 1             1             1
ResolveOMEv2.0-384-DNA-QC-D3   5               group1   1               1                 1             1             1
ResolveOMEv2.0-384-DNA-QC-H5   5               group2   1               1                 1             1             1
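A table like the one above can be filtered programmatically to shortlist samples. The sketch below assumes the file is whitespace-delimited with no spaces inside sample names, and that the header contains `SampleId` and `CompositeScore` as shown in the example; the function names are hypothetical.

```python
def parse_consensus_scores(text):
    """Parse a whitespace-delimited consensus score table into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def samples_in_categories(rows, keep=(5,)):
    """Return SampleId values whose CompositeScore is in `keep`."""
    return [row["SampleId"] for row in rows if int(row["CompositeScore"]) in keep]
```

For example, `samples_in_categories(rows, keep=(4, 5))` would return the samples in Categories 4 and 5.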

Summary Table:

Location: tertiary_analyses/qc_plots/DNA-QC_ConsensusScores_SummaryTable.txt

Aggregate summary showing:

  • Number of cells in each quality category
  • Proportion of cells in each category
  • Overall batch quality assessment

Example:

Category    NumberCells    ProportionCells
5           10             100%
4           0              0%
3           0              0%
2           0              0%
1           0              0%

In this example, all 10 samples are Category 5 (excellent quality).
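The counts and proportions in a summary like this follow directly from the per-sample composite scores. A minimal sketch, with a hypothetical function name and not taken from the pipeline:

```python
from collections import Counter


def summarize_categories(composite_scores):
    """Return (category, count, proportion) rows for categories 5 down to 1."""
    counts = Counter(composite_scores)
    total = len(composite_scores)
    rows = []
    for category in range(5, 0, -1):
        n = counts.get(category, 0)
        proportion = n / total if total else 0.0
        rows.append((category, n, proportion))
    return rows


# Ten samples, all scored 5, as in the example above.
for category, n, proportion in summarize_categories([5] * 10):
    print(f"{category}\t{n}\t{proportion:.0%}")
```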


Output Files#

All pipeline results are organized in a structured directory hierarchy:

Directory Structure#

project_results/
├── multiqc/
├── primary_analyses/
│   └── metrics/
├── read_counts/
├── secondary_analyses/
│   ├── alignment/
│   └── metrics/
├── tertiary_analyses/
│   ├── cnv_ginkgo/
│   ├── qc_plots/
│   └── mutational_signatures/
└── execution_info/

Detailed Output Descriptions#

1. multiqc/ Directory#

| File | Description |
|---|---|
| multiqc_report.html | Main QC report - Interactive HTML report with all QC metrics. Open in web browser. |
| multiqc_data/ | Raw data used to generate MultiQC report (JSON, TSV files) |
| multiqc_version.yml | Version information for reproducibility |

Example: multiqc_report.html

2. primary_analyses/metrics/ Directory#

Per-biosample metrics from raw read analysis:

primary_analyses/metrics/{biosampleName}/
└── fastp/
    └── {biosampleName}_no_qc_fastp.json

| File Pattern | Description |
|---|---|
| *_no_qc_fastp.json | JSON file with detailed fastp metrics: read counts, quality scores, adapter content, filtering statistics |

Use Case: Review trimming effectiveness and raw read quality

Example: sample_no_qc_fastp.json

3. secondary_analyses/alignment/ Directory#

Aligned reads for each biosample:

secondary_analyses/alignment/output/
├── {biosampleName}.bam
└── {biosampleName}.bam.bai

| File Type | Description |
|---|---|
| .bam | Binary Alignment Map - Aligned reads in binary format. Can be viewed with IGV or samtools. |
| .bam.bai | BAM Index - Required for efficient random access to BAM file. |

Size: Typically 50-500 MB per sample (for 2M reads)

Use Case: Visual inspection of alignments, custom downstream analysis

4. secondary_analyses/metrics/ Directory#

Comprehensive alignment metrics per biosample:

secondary_analyses/metrics/{biosampleName}/
├── alignment_stats/
│   ├── {biosampleName}.dedup.alignmentstat_sentieonmetrics.txt
│   ├── {biosampleName}.dedup.cov_sentieonmetrics
│   ├── {biosampleName}.dedup.gcbias.sentieonmetrics.txt
│   ├── {biosampleName}.dedup.insertsizemetricalgo.sentieonmetrics.txt
│   ├── {biosampleName}.dedup.wgsmetricsalgo.sentieonmetrics.txt
│   └── [nondedup versions of above files]
└── bam_lorenz_coverage/
    └── [lorenz curve files]

| File Pattern | Description |
|---|---|
| *.alignmentstat_sentieonmetrics.txt | Alignment statistics: total reads, mapped reads, properly paired, chimeric rate |
| *.cov_sentieonmetrics* | Coverage metrics: mean coverage, coverage distribution, genome fraction covered |
| *.gcbias.sentieonmetrics.txt | GC bias metrics: normalized coverage by GC content, AT/GC dropout |
| *.insertsizemetricalgo.sentieonmetrics.txt | Insert size distribution: mean, median, standard deviation of fragment sizes |
| *.wgsmetricsalgo.sentieonmetrics.txt | Whole genome sequencing metrics: coverage uniformity, genome territory |
| bam_lorenz_coverage/ | Lorenz curve data for coverage uniformity assessment |

Files with .dedup. in the name: Metrics calculated after duplicate removal
Files with .nondedup. in the name: Metrics calculated before duplicate removal

Example Files:

  • sample_alignmentstat_sentieonmetrics.txt
  • sample_insertsize_sentieonmetrics.txt
  • sample_gcbias_sentieonmetrics.txt
  • sample_wgsmetrics_sentieonmetrics.txt

Summary Files:

secondary_analyses/metrics/
├── nf-preseq-pipeline_all_metrics_mqc.txt
└── nf-preseq-pipeline_selected_metrics_mqc.txt

| File | Description |
|---|---|
| *_all_metrics_mqc.txt | All metrics table - Comprehensive TSV with all QC metrics for all samples |
| *_selected_metrics_mqc.txt | Key metrics table - Curated subset of most important QC metrics |

Example Files:

  • nf-preseq-pipeline_selected_metrics_mqc.txt
  • nf-preseq-pipeline_all_metrics_mqc.txt

Example Lorenz Curve: lorenz_example.png

5. tertiary_analyses/cnv_ginkgo/ Directory#

Copy number variation analysis outputs (if CNV module enabled):

tertiary_analyses/cnv_ginkgo/
├── {biosampleName}_CNV1.tsv
├── {biosampleName}_CNV2.tsv
├── {biosampleName}_dots.tsv
├── AllSample-GinkgoSegmentSummary.txt
├── SegCopy.binsize_1000000.tsv
├── cnv_plots_binsize_1000000.tar.gz
└── ginkgo_res.binsize_1000000.RDS

| File Pattern | Description |
|---|---|
| *_CNV1.tsv | Copy number calls per genomic bin (method 1) |
| *_CNV2.tsv | Copy number calls per genomic bin (method 2) |
| *_dots.tsv | Individual bin values for plotting |
| AllSample-GinkgoSegmentSummary.txt | Summary of CNV segments across all samples: number of segments, MAD scores |
| SegCopy.binsize_*.tsv | Segment-level copy number matrix (all samples) |
| cnv_plots_*.tar.gz | Compressed archive containing JPEG CNV profile plots for all samples |
| ginkgo_res.*.RDS | R data file with complete Ginkgo analysis results |

CNV Profile Example: cnv_number_profile_example.jpeg

Example Files:

  • sample_CNV1.tsv
  • AllSample-GinkgoSegmentSummary.txt
  • SegCopy.binsize_1000000.tsv

To extract CNV plots:

tar -xzf cnv_plots_binsize_1000000.tar.gz

6. tertiary_analyses/qc_plots/ Directory#

Aggregate quality control plots and scores:

tertiary_analyses/qc_plots/
├── CNV-Quadrants.pdf
├── CNV-Quadrants_mqc.jpg
├── QC_composition.pdf
├── QC_composition_mqc.jpg
├── DNA-QC_ConsensusScores.txt
├── DNA-QC_ConsensusScores_SummaryTable.txt
├── ConsensusScores_SummaryTableByGroup.txt
├── AllSample-GinkgoSegmentSummary.txt
├── nf-preseq-pipeline_all_metrics_mqc.txt
└── nf-preseq-pipeline_selected_metrics_mqc.txt

| File | Description |
|---|---|
| CNV-Quadrants.pdf | ClusterQC plot - Scatter plot showing MAPD vs. Segment MAD with quality categories |
| QC_composition.pdf | Category distribution - Bar chart showing proportion of samples in each quality category |
| *_mqc.jpg | JPEG versions of plots for inclusion in reports |
| DNA-QC_ConsensusScores.txt | Individual sample consensus scores with quality category assignments |
| DNA-QC_ConsensusScores_SummaryTable.txt | Overall summary: number and proportion of samples per category |
| ConsensusScores_SummaryTableByGroup.txt | Group-stratified summary (if groups specified in input CSV) |
| AllSample-GinkgoSegmentSummary.txt | Segmentation statistics for all samples |

Example Files:

  • CNV-Quadrants.pdf
  • CNV-Quadrants_mqc.jpg
  • QC_composition.pdf
  • QC_composition_mqc.jpg
  • DNA-QC_ConsensusScores.txt
  • DNA-QC_ConsensusScores_SummaryTable.txt
  • ConsensusScores_SummaryTableByGroup.txt

7. tertiary_analyses/mutational_signatures/ Directory (if enabled)#

Mutational signature profiling outputs for grouped samples:

tertiary_analyses/mutational_signatures/
├── pseudobulk/
│   ├── {GroupName}.pseudobulk.bam
│   ├── {GroupName}.pseudobulk.bam.bai
│   └── {GroupName}.pseudobulk.dnascope.vcf.gz
├── filtered_vcf/
│   └── {GroupName}.filtered.vcf.gz
└── sigprofiler/
    ├── SBS96/
    │   ├── {GroupName}_SBS96.all
    │   └── {GroupName}_SBS96_barplot.pdf
    ├── mutational_catalog_mqc.txt
    └── merged_mutational_catalog.tsv

| File Pattern | Description |
|---|---|
| *.pseudobulk.bam | Pseudobulk alignment - Combined BAM file from all samples in the group |
| *.pseudobulk.bam.bai | BAM index file for pseudobulk alignment |
| *.dnascope.vcf.gz | Variant calls - Called variants from pseudobulk BAM using Sentieon DNAscope |
| *.filtered.vcf.gz | Filtered variants - High-confidence variants after filtering against GIAB reference |
| *_SBS96.all | SBS96 matrix - Single base substitution counts in 96 trinucleotide contexts |
| *_SBS96_barplot.pdf | Visualization - Bar plot showing mutational signature profile |
| mutational_catalog_mqc.txt | Summary table for MultiQC integration |
| merged_mutational_catalog.tsv | Combined mutational catalog across all groups |

Trinucleotide Context: Each mutation is classified by the mutated base and its flanking bases (e.g., C>A in ACA context)

SBS96 Classification: Mutations are categorized into 6 substitution types (C>A, C>G, C>T, T>A, T>C, T>G) × 16 trinucleotide contexts = 96 categories
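The 6 × 16 = 96 arithmetic can be made concrete by enumerating the channels. The sketch below uses the common bracket notation 5'[REF>ALT]3', so the text's example of C>A in an ACA context becomes `A[C>A]A`. The ordering of channels here is arbitrary and may not match the row order of the pipeline's SBS96 files.

```python
from itertools import product

SUBSTITUTIONS = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
BASES = "ACGT"


def sbs96_channels():
    """List the 96 SBS channels as 5'[REF>ALT]3' labels, e.g. 'A[C>A]A'."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


channels = sbs96_channels()
print(len(channels), channels[0])
```

Each substitution type has 4 possible 5' bases and 4 possible 3' bases, which gives the 16 contexts per type.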

Use Case: Identify mutational processes, compare signatures between experimental groups, detect exposure-related mutation patterns

8. read_counts/ Directory#

Read count summaries at each processing step:

read_counts/
└── [read count files]

Tracks the number of reads remaining after each filtering/processing step.

9. execution_info/ Directory#

Pipeline execution metadata:

execution_info/
├── pipeline_info/
│   ├── execution_report.html
│   ├── execution_timeline.html
│   └── execution_trace.txt
└── versions.yml

| File | Description |
|---|---|
| execution_report.html | Nextflow execution report with resource usage and runtime statistics |
| execution_timeline.html | Visual timeline of task execution |
| execution_trace.txt | Detailed trace of all executed tasks |
| versions.yml | Software versions for all tools used in the pipeline |

Use Case: Troubleshooting, reproducibility, performance optimization


Appendix#

A. FASTQ Naming Conventions#

The BJ-DNA-QC pipeline accepts Illumina FASTQ files following these naming patterns:

Standard Illumina Format#

{SampleName}_S{SampleNumber}_L{LaneNumber}_R{ReadNumber}_001.fastq.gz

Components:

  • {SampleName}: User-defined sample identifier (e.g., SampleA, Patient01_Tumor)
  • S{SampleNumber}: Sample number from sample sheet (e.g., S1, S12)
  • L{LaneNumber}: Sequencing lane (e.g., L001, L002)
  • R{ReadNumber}: Read direction (R1 for forward, R2 for reverse)
  • 001: File segment (always 001 for single files)

Examples:

chr22_testsample1_S1_L001_R1_001.fastq.gz
chr22_testsample1_S1_L001_R2_001.fastq.gz
Patient01-Tumor_S12_L003_R1_001.fastq.gz
Patient01-Tumor_S12_L003_R2_001.fastq.gz

Simplified Format (Also Accepted)#

{SampleName}_R1_001.fastq.gz
{SampleName}_R2_001.fastq.gz

Requirements:

  • Files must be gzip compressed (.fastq.gz or .fq.gz)
  • Forward and reverse reads must have matching sample names
  • Read pairs must be indicated by _R1_ and _R2_ (or _R1. and _R2.)
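File names can be checked against the two patterns in this appendix with regular expressions. This is a sketch under stated assumptions: the expressions are written from the patterns shown here, not taken from BaseJumper; the function name `parse_fastq_name` is hypothetical; and the `_R1.`/`_R2.` variant is not covered.

```python
import re

# Standard Illumina pattern: {SampleName}_S{n}_L{lane}_R{1|2}_001.fastq.gz
ILLUMINA_RE = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d+)_R(?P<read>[12])_001\.f(?:ast)?q\.gz$"
)
# Simplified pattern: {SampleName}_R{1|2}_001.fastq.gz
SIMPLE_RE = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])_001\.f(?:ast)?q\.gz$")


def parse_fastq_name(filename):
    """Return (sample, read) for a recognised FASTQ name, else None."""
    match = ILLUMINA_RE.match(filename) or SIMPLE_RE.match(filename)
    if match is None:
        return None
    return match.group("sample"), "R" + match.group("read")
```

Parsing both files of a pair and comparing the returned sample names is one way to confirm that forward and reverse reads match.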

B. Selected Metrics Descriptions#

The nf-preseq-pipeline_selected_metrics_mqc.txt file contains the following key metrics:

| Metric | Description | Typical Range | Quality Indicator |
|---|---|---|---|
| sample_name | Unique biosample identifier | - | - |
| preseq_count | Estimated library complexity | > 1B | Higher is better - indicates diverse library |
| chrM | Percentage of mitochondrial reads | 0.1 - 0.5% | Lower is better - high values indicate poor library |
| PCT_CHIMERAS | Percentage of chimeric read pairs | 5 - 10% | Lower is better - high values indicate amplification artifacts |
| PF_HQ_ERROR_RATE | Error rate in high-quality reads | 0.2 - 0.5% | Lower is better |
| pct_trimmed_aligned | Percentage of trimmed reads that aligned | > 98% | Higher is better - indicates good quality and correct reference |
| MEDIAN_INSERT_SIZE | Median fragment size (bp) | 300 - 500 bp | Should match expected library prep |
| total_reads | Total number of raw reads | 2M (after subsampling) | Should match subsampling parameter |
| adapter_trimmed_reads | Number of reads with adapters removed | Varies | Lower is better - indicates clean library |
| adapter_trimmed_bases | Total bases trimmed from adapters | Varies | Lower is better |
| gini_coefficient_index | Coverage uniformity (Gini index) | 0.02 - 0.05 | Lower is better - indicates uniform coverage |
| cnv_genome_ploidy | Estimated genome ploidy | 2 (diploid) | Should match expected sample ploidy |
| cnv_number_of_segments | Number of CNV segments detected | 5 - 20 | Fewer is better for normal samples |
| cnv_segment_MAD | Segment Median Absolute Deviation | < 0.4 | Lower is better - measures CNV noise |
| SegmentMAD_Aware | Adjusted Segment MAD | < 0.4 | Lower is better - quality metric |
| MAPD.Score | Median Absolute Pairwise Difference | < 0.4 | Lower is better - coverage uniformity |
| MAPD_CNV_Log2 | Log2-transformed MAPD for CNV | < 0.2 | Lower is better |
| SKEW_CNV | Skewness of CNV distribution | < 0.15 | Lower is better - symmetric distribution |

C. ClusterQC Metric Cutoffs#

CNV Quality Category Definitions#

The ClusterQC analysis uses the following thresholds to classify sample quality:

  • MAPD Score (Median Absolute Pairwise Difference)
  • Segment MAD (Median Absolute Deviation)

Category 5 (Excellent):

  • MAPD Score < 0.4
  • Segment MAD < 0.4
  • Recommendation: Proceed with high-depth sequencing

Category 4 (Good):

  • MAPD Score < 0.4
  • Segment MAD 0.4-0.6
  • Recommendation: Suitable for sequencing, acceptable quality

Category 3 (Moderate):

  • MAPD Score 0.4-0.6
  • Segment MAD < 0.4
  • Recommendation: Review individual CNV profiles before sequencing

Category 2 (Fair):

  • MAPD Score 0.4-0.6
  • Segment MAD 0.4-0.6
  • Recommendation: Consider re-amplification or exclude from study

Category 1 (Poor):

  • MAPD Score > 0.6 OR Segment MAD > 0.6
  • Recommendation: Do not proceed with deep sequencing
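The category definitions in this appendix can be written as a small decision function. This is a sketch of the thresholds exactly as listed in this appendix, not the pipeline's scoring code: the function name is hypothetical, and treating values of exactly 0.4 or 0.6 as belonging to the 0.4-0.6 band is an assumption, since the text does not say which side the boundaries fall on.

```python
def cnv_quality_category(mapd, segment_mad):
    """Assign a category (1-5) from the MAPD and Segment MAD thresholds above."""
    if mapd > 0.6 or segment_mad > 0.6:
        return 1  # Poor: either metric above 0.6
    mapd_low = mapd < 0.4
    mad_low = segment_mad < 0.4
    if mapd_low and mad_low:
        return 5  # Excellent
    if mapd_low:
        return 4  # Good: low MAPD, Segment MAD in the 0.4-0.6 band
    if mad_low:
        return 3  # Moderate: MAPD in the 0.4-0.6 band, low Segment MAD
    return 2  # Fair: both metrics in the 0.4-0.6 band
```

The CompositeScore reported by the pipeline may take additional verdicts into account, so treat the output of this function as a reading aid for the thresholds rather than a replacement for the reported category.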

Metric Interpretations#

MAPD (Median Absolute Pairwise Difference):

  • Measures coverage uniformity across adjacent genomic windows
  • Calculated as the median of absolute differences between adjacent bins
  • Low MAPD (< 0.4): Smooth, uniform coverage
  • High MAPD (> 0.6): Noisy, uneven coverage

Segment MAD (Median Absolute Deviation):

  • Measures the variability within CNV segments
  • Reflects the noise level in copy number calls
  • Low MAD (< 0.4): Clean segmentation, low noise
  • High MAD (> 0.6): Noisy segmentation, poor quality
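The two statistics can be sketched from their definitions. This is an illustration of the general formulas only: which bin values the pipeline feeds in (for example, log2 ratios versus normalized counts) and how Segment MAD is aggregated across segments are not specified here, so the inputs below are assumptions.

```python
from statistics import median


def mapd(bin_values):
    """Median of absolute differences between adjacent bin values."""
    diffs = [abs(b - a) for a, b in zip(bin_values, bin_values[1:])]
    return median(diffs)


def mad(values):
    """Median absolute deviation of values around their median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)
```

Both functions need at least two values for `mapd` and one for `mad`; a perfectly flat profile gives 0 for both, and noisier profiles give larger values.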

D. PreSeq Library Complexity#

PreSeq estimates library complexity and predicts sequencing saturation:

Key Metrics:

  • Extrapolated complexity: Predicted number of unique molecules at higher sequencing depths
  • Saturation point: Sequencing depth where few new unique molecules are detected

Interpretation:

  • High complexity (> 1B unique molecules): Library can support deep sequencing
  • Low complexity (< 100M unique molecules): Library is saturated, deep sequencing will yield diminishing returns

Use Case: Decide whether to proceed with high-depth sequencing based on library diversity

E. Mitochondrial Read Percentage (ChrM)#

Percentage of reads mapping to the mitochondrial genome:

| ChrM % | Interpretation |
|---|---|
| < 0.5% | Normal - high-quality nuclear DNA library |
| 0.5 - 2% | Elevated - acceptable for most applications |
| 2 - 10% | High - may indicate cell stress or poor lysis |
| > 10% | Very high - poor nuclear DNA recovery, consider re-amplification |

Causes of High ChrM:

  • Incomplete cell lysis
  • Poor amplification of nuclear DNA
  • Cell death or apoptosis
  • High metabolic activity (tissue-specific)

F. Chimeric Rate#

Percentage of read pairs mapping to different chromosomes or distant locations:

| Chimeric Rate | Interpretation |
|---|---|
| < 5% | Excellent |
| 5 - 10% | Normal for whole genome amplification |
| 10 - 15% | Elevated - review amplification protocol |
| > 15% | High - indicates excessive artifacts |

Causes of High Chimeric Rate:

  • Over-amplification
  • DNA damage
  • Non-specific priming during amplification

G. SKEW (CNV Distribution Skewness)#

Measures the asymmetry of the CNV log2 ratio distribution:

| SKEW Value | Interpretation |
|---|---|
| < 0.1 | Symmetric CNV distribution - high quality |
| 0.1 - 0.2 | Slight skew - acceptable |
| > 0.2 | Significant skew - indicates bias or poor coverage |

Ideal: SKEW close to 0 indicates balanced amplification without systematic bias

H. Log2 Ratio (CNV)#

Log2-transformed copy number ratio relative to expected ploidy:

| Log2 Value | Copy Number | Interpretation |
|---|---|---|
| -∞ | 0 | Homozygous deletion |
| -1 | 1 | Heterozygous deletion |
| 0 | 2 | Normal diploid |
| +0.58 | 3 | Single copy gain |
| +1 | 4 | Double copy gain |

Quality Assessment:

  • Tight distribution around 0: High-quality, uniform coverage
  • Broad distribution: Noisy coverage, poor quality
  • Systematic shift: Ploidy estimation error or contamination
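The values in the Log2 table follow from log2(copy number / ploidy) with an expected ploidy of 2. A short sketch, with a hypothetical function name:

```python
import math


def log2_ratio(copy_number, ploidy=2):
    """Log2 of copy number relative to the expected ploidy."""
    if copy_number == 0:
        return -math.inf  # homozygous deletion: log2(0) is negative infinity
    return math.log2(copy_number / ploidy)


for cn in range(5):
    print(cn, round(log2_ratio(cn), 2))
```

For a diploid reference, three copies give log2(3/2), which rounds to 0.58, and four copies give log2(4/2) = 1.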


Frequently Asked Questions#

Q: Can I use CRAM files as input?#

A: No, BJ-DNA-QC currently does not support CRAM format. Contact ResolveServices to convert CRAM to FASTQ files.

Q: What if some samples have no reads?#

A: Samples with zero reads will cause the pipeline to fail. Always verify that all input files contain data before submission.

Q: How do I select samples for high-depth sequencing?#

A: Use the ClusterQC CNV-Quadrants plot. Select samples in Category 5 (and optionally Category 4) for optimal results.

Q: Can I re-run the pipeline with different parameters?#

A: Yes, create a new project with the same input data but different parameter settings. Previous results are preserved.

Q: Where can I view example outputs?#

A: Example files are linked throughout this manual in the output file descriptions.