BJ-DNA-QC Pipeline User Manual#
Customer Support#
If you need assistance with the BJ-DNA-QC pipeline, several support channels are available:
BaseJumper Online Manual#
For comprehensive documentation and guides, visit the BaseJumper Documentation Portal.
Email Support#
Contact our support team at: basejumper@bioskryb.com
Online Portal Ticketing System#
For bug reports and feature requests, submit a ticket through our online portal. Your account representative can provide access details.
Before You Begin#
Account Registration#
- Register for a BaseJumper account at https://basejumper.bioskryb.com: select ‘Create Account’ and review the Terms and Conditions
- Wait for account approval from your organization administrator
- Log in with your credentials to access the platform
Data Transfer Setup#
Before running the BJ-DNA-QC pipeline, ensure your sequencing data is accessible to you:
Globus Data Transfer#
- Recommended method for large-scale data transfers
- Set up a Globus endpoint for your data storage.
- Globus setup instructions are available in the BaseJumper Online Manual
- The BaseJumper support team will add you to your assigned workspace. A confirmation email will arrive from groups.globus.org (check your Junk/Spam folder if you do not see it).
- Contact BaseJumper support for additional Globus configuration assistance
Alternative Transfer Methods#
- AWS S3 bucket access (requires submitting IAM credentials and role ARNs)
- Direct upload through BaseJumper interface (for smaller datasets)
- Contact support to discuss custom data transfer solutions
Sample Metadata Requirements#
The BJ-DNA-QC pipeline requires a properly formatted input CSV. To get the initial format of this CSV, open the BaseJumper project page and click the ‘Export’ button at the top of the biosample table. The exported file has the following structure (you will need to add the ‘groups’ column):
Input CSV Format#
| Column Name | Description | Required | Example |
|---|---|---|---|
| `biosampleName` | Unique identifier for each sample | Yes | `chr22_testsample1` |
| `read1` | Path to forward reads (R1) FASTQ file | Yes | `s3://bucket/sample_R1_001.fastq.gz` |
| `read2` | Path to reverse reads (R2) FASTQ file | Yes | `s3://bucket/sample_R2_001.fastq.gz` |
| `groups` | Optional grouping for batch analysis | No | `Group1` or `Control` |
Example Input File#
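As a rough illustration of the format described in the table above, the sketch below checks an input CSV before upload. The sample names, bucket paths, and the `validate_input_csv` helper are hypothetical and not part of BaseJumper; the column order and the checks themselves are assumptions based on the table and the naming notes in this manual.

```python
import csv
import io

REQUIRED_COLUMNS = ("biosampleName", "read1", "read2")  # 'groups' is optional


def validate_input_csv(text):
    """Return a list of problems found in the input CSV text (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f"missing required column: {column}")
    if problems:
        return problems
    seen = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        name = (row["biosampleName"] or "").strip()
        if not name:
            problems.append(f"row {row_number}: empty biosampleName")
        elif name in seen:
            problems.append(f"row {row_number}: duplicate biosampleName {name}")
        seen.add(name)
        for column in ("read1", "read2"):
            path = (row[column] or "").strip()
            # Assumption: only gzipped FASTQ paths are acceptable.
            if not path.endswith((".fastq.gz", ".fq.gz")):
                problems.append(f"row {row_number}: {column} is not a gzipped FASTQ path")
    return problems


# Hypothetical two-sample file; bucket and sample names are placeholders.
EXAMPLE_CSV = """biosampleName,read1,read2,groups
chr22_testsample1,s3://bucket/chr22_testsample1_R1_001.fastq.gz,s3://bucket/chr22_testsample1_R2_001.fastq.gz,Group1
chr22_testsample2,s3://bucket/chr22_testsample2_R1_001.fastq.gz,s3://bucket/chr22_testsample2_R2_001.fastq.gz,Control
"""

print(validate_input_csv(EXAMPLE_CSV))
```

Running a check like this locally can catch missing columns or mistyped paths before a project is queued.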
FASTQ Naming Conventions#
BJ-DNA-QC accepts standard Illumina FASTQ naming formats:
- Pattern: `{SampleName}_S{SampleNumber}_L{Lane}_R{ReadNumber}_001.fastq.gz`
- Example: `SampleA_S1_L001_R1_001.fastq.gz`
Important Notes:
- Files must be gzipped (.fastq.gz extension)
- Read pairs must have matching sample names
- S3 paths should be accessible from your BaseJumper account
Project Design#
Creating a New Project#
- Navigate to Projects: From the BaseJumper dashboard, click the "Create Project" button in the upper right corner
- Project Configuration:
  - Project Name: Choose a descriptive, unique name for your project
  - Project Description: Add details about your experiment (optional but recommended)
- Select Pipeline: Choose "BJ-DNA-QC" from the pipeline dropdown menu
- Sample Selection:
  - Select samples from your Shared Data directory. If you have an Illumina BaseSpace Sequence Hub token, follow the instructions in the BaseJumper Manual to provide it to the support email address.
Shared Data Requirements#
If using samples from Shared Data:
- Samples must follow Illumina naming conventions
- Files must be in .fastq.gz format
- CRAM files are not currently supported - contact ResolveServices for CRAM processing
Sample Organization#
Before submitting your project, consider:
- Which samples to include? Select only samples you want to analyze together. You can add biosamples to a project later (return to the workspace-level view and open the dots menu for your project to find ‘Add Biosamples’), but you will need to reanalyze the entire project to aggregate metrics into one summary report.
- Metadata grouping: Use the `groups` column to organize samples by experimental conditions
- Quality control: Ensure all samples have sufficient read depth. Each biosample should have more than 100,000 reads, and stable quality metrics require at least 1M reads per biosample.
⚠️ Warning: Samples with zero reads will cause the pipeline to fail. Verify data quality before submission. If analyses are auto-queued after project loading, zero-read samples are the primary reason for an analysis to fail.
Automatic Pipeline Queuing#
By default, projects are automatically queued for execution after creation. To disable this:
- Uncheck the "Auto-queue project" option during project setup
- Manually start the pipeline from the project dashboard when ready
Pipeline Parameters#
Selecting Pipeline Version#
- From the pipeline configuration screen, select your preferred pipeline version
- Default: The most recent stable version (recommended)
- For reproducibility, you can select specific previous versions
Core Parameters#
Configure the following parameters based on your experimental design:
Read Length#
| Option | Description | Use Case |
|---|---|---|
| `50` | 50 base pair reads | Older sequencing protocols |
| `75` | 75 base pair reads (Default) | Standard low-pass sequencing |
| `100` | 100 base pair reads | Higher resolution QC |
| `150` | 150 base pair reads | Deep sequencing applications |
Default: 75
Read Sampling#
Controls the number of reads subsampled per biosample for QC analysis:
| Option | Description | Notes |
|---|---|---|
| `1000` | 1K reads | Quick testing only |
| `1000000` | 1M reads | Faster QC, lower accuracy |
| `2000000` | 2M reads (Default) | Recommended for comprehensive QC |
Default: 2000000 (2M reads)
ℹ️ Note: 1M paired-end reads equals 2M individual reads
Reference Genome#
Select the reference genome that matches your sample organism:
| Genome | Description |
|---|---|
| `GRCh38` (Default) | Human reference genome, build 38 |
| `GRCm39` | Mouse reference genome, build 39 |
Default: GRCh38
Optional Modules#
Enable or disable optional analysis modules based on your needs:
| Module | Default State | Description |
|---|---|---|
| FastQC | ✅ Enabled | Performs quality checks on raw sequencing data including per-base quality, GC content, and adapter contamination |
| CNV | ✅ Enabled | Detects copy number variations across the genome using Ginkgo segmentation |
| Qualimap | ❌ Disabled | Evaluates alignment quality with detailed coverage statistics and quality distributions |
| Kraken | ❌ Disabled | Performs taxonomic classification to detect contamination |
| Mutational Signature Profile | ❌ Disabled | Identifies mutational signatures in grouped samples through pseudobulk analysis and variant calling |
💡 Recommendation: Enable CNV module for single-cell DNA samples to assess amplification quality and generate ClusterQC plots for quality categorization
Mutational Signature Profile Requirements#
The Mutational Signature module requires:
- Groups column in input CSV: Samples with the same group name will be combined through pseudobulk analysis
- GRCh38 genome only (currently supported)
- Samples are combined by group, variants are called on pseudobulk BAMs, and mutational patterns are profiled using SigProfilerMatrixGenerator
Samples with blank or "None" in the groups column will be excluded from mutational signature analysis.
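The grouping rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the pipeline's own code: the `pseudobulk_groups` helper is hypothetical, and treating `None` as a case-sensitive exact match is an assumption.

```python
from collections import defaultdict


def pseudobulk_groups(samples):
    """Map each usable group label to the biosample names it contains.

    `samples` is an iterable of (biosampleName, groups) pairs. Rows whose
    group is blank, missing, or the literal string "None" are skipped.
    """
    grouped = defaultdict(list)
    for name, group in samples:
        label = (group or "").strip()
        if label == "" or label == "None":  # assumption: exact, case-sensitive match
            continue
        grouped[label].append(name)
    return dict(grouped)


# Hypothetical sample sheet rows.
rows = [
    ("s1", "Group1"),
    ("s2", "Group1"),
    ("s3", "Control"),
    ("s4", ""),
    ("s5", "None"),
]
print(pseudobulk_groups(rows))
```

Checking the grouping this way before submission shows which samples would be pooled together and which would be left out of the mutational signature analysis.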
Workflow Overview#
Pipeline Execution Steps#
The BJ-DNA-QC pipeline performs the following analyses:
flowchart TD
%% Colors %%
classDef black fill:#12294C,stroke:#12294C,stroke-width:3px,color:#fff,font-size:16px
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:3px,color:#fff,font-size:16px
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:3px,color:#fff,font-size:16px
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:3px,color:#fff,font-size:16px
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:3px,color:#fff,font-size:16px
classDef purple fill:#9b5de5,stroke:#9b5de5,stroke-width:3px,color:#fff,font-size:16px
Start([START]):::black
Subsample[Read Subsampling<br/>SEQTK - 2M reads]:::blue
Trim[Quality Trimming<br/>FASTP]:::blue
Align[Read Alignment<br/>Sentieon BWA-MEM]:::green
Dedup[Duplicate Removal<br/>Sentieon Dedup]:::green
Metrics[Metrics Collection<br/>Sentieon Driver]:::pink
Preseq[Library Complexity<br/>PreSeq]:::pink
CNV[CNV Analysis<br/>Ginkgo]:::pink
Kraken[Taxonomic Classification<br/>Kraken - Optional]:::pink
Pseudobulk{Mutational Signature<br/>Enabled?}:::purple
PseudoBAM[Pseudobulk BAMs<br/>by Group]:::purple
VariantCall[Variant Calling<br/>Sentieon DNAscope]:::purple
Filter[Filter Variants<br/>BCFtools]:::purple
SigProfile[Mutational Signatures<br/>SigProfilerMatrixGenerator]:::purple
ClusterQC[ClusterQC Plots<br/>Quality Categories]:::orange
MultiQC[MultiQC Report<br/>Aggregate Results]:::orange
End([END]):::black
Start --> Subsample
Subsample --> Trim
Trim --> Align
Align --> Dedup
Dedup --> Metrics
Dedup --> Pseudobulk
Metrics --> Preseq
Metrics --> CNV
Metrics --> Kraken
Metrics --> MultiQC
CNV --> ClusterQC
Preseq --> MultiQC
Kraken --> MultiQC
ClusterQC --> MultiQC
Pseudobulk -->|Yes| PseudoBAM
Pseudobulk -->|No| MultiQC
PseudoBAM --> VariantCall
VariantCall --> Filter
Filter --> SigProfile
SigProfile --> MultiQC
MultiQC --> End
Processing Steps Detail#
1. Read Subsampling (SEQTK)#
- Randomly selects 2M reads per sample (default)
- Ensures uniform comparison across samples
- Output: Subsampled FASTQ files
2. Quality Trimming (FASTP)#
- Removes adapter sequences
- Trims low-quality bases
- Filters reads below quality threshold
- Output: Trimmed FASTQ files, QC JSON reports
3. Read Alignment (Sentieon BWA-MEM)#
- Maps reads to reference genome
- Uses BWA-MEM algorithm optimized by Sentieon
- Output: BAM alignment files
4. Duplicate Removal (Sentieon Dedup)#
- Identifies PCR and optical duplicates
- Marks or removes duplicates from analysis
- Output: Deduplicated BAM files with indexes
5. Metrics Collection (Sentieon Driver)#
- Alignment metrics: Mapping rates, paired-end statistics
- GC bias metrics: GC content distribution
- Insert size metrics: Fragment size distribution
- Coverage metrics: Genome-wide coverage statistics
- Output: Multiple `.txt` metrics files per sample
6. Library Complexity (Preseq)#
- Estimates library complexity
- Projects sequencing saturation
- Output: Complexity curves and extrapolation data
7. CNV Analysis (Ginkgo) (Optional)#
- Segments genome into bins
- Calculates copy number variation
- Generates CNV profiles
- Output: CNV plots (JPEG), segmentation files (TSV)
8. Taxonomic Classification (Kraken) (Optional)#
- Identifies organism composition
- Detects contamination
- Output: Taxonomic classification reports
9. Mutational Signature Profiling (Optional)#
- Combines BAMs from samples with the same group label (pseudobulk)
- Calls variants using Sentieon DNAscope
- Filters variants against GIAB reference
- Generates mutational signature profiles using SigProfilerMatrixGenerator
- Analyzes single base substitution (SBS) patterns
- Output: VCF files, mutational catalogs, SBS96 plots
- Requirements: Groups column in input CSV, GRCh38 genome only
10. Report Generation (MultiQC)#
- Aggregates metrics across all samples
- Creates interactive HTML report
- Generates summary tables
- Output: `multiqc_report.html`
File Naming Conventions#
Input Files#
- Forward reads: `{biosampleName}_R1_001.fastq.gz`
- Reverse reads: `{biosampleName}_R2_001.fastq.gz`
Output Files#
- Aligned BAM: `{biosampleName}.bam`
- BAM index: `{biosampleName}.bam.bai`
- CNV calls: `{biosampleName}_CNV1.tsv` (gains), `{biosampleName}_CNV2.tsv` (losses)
- Metrics: `{biosampleName}.dedup.{metric_type}.sentieonmetrics.txt`
Estimated Processing Time#
| Project Size | Estimated Time | Notes |
|---|---|---|
| 10 samples | <1 hour | With 2M reads per sample |
| 50 samples | 1-3 hours | Standard QC modules enabled |
| 100 samples | 2-4 hours | All optional modules enabled |
| 384 samples | 3-5 hours | High-throughput plate |
Factors affecting runtime:
- Read depth per sample
- Reference genome size
- Optional modules enabled
- Current cluster load
Summarizing Output#
MultiQC Report#
The MultiQC report aggregates all QC metrics into a single interactive HTML file located at:
`multiqc/multiqc_report.html`
MultiQC Report Sections#
1. General Statistics#
Overview table showing key metrics for all samples:
- Total reads
- Alignment rate
- Duplication rate
- Insert size
- Coverage metrics
2. Selected Metrics#
- Curated subset of key quality metrics for quick sample assessment
- Interactive table showing critical QC parameters across all samples
- Includes metrics such as:
- Library complexity (PreSeq count)
- Mitochondrial read percentage
- Chimeric rate
- Error rates
- Alignment percentages
- Insert size statistics
- Coverage uniformity (Gini coefficient)
- CNV quality metrics (MAPD, Segment MAD)
- Quality category assignments
3. All Metrics#
- Comprehensive table containing all metrics generated by the pipeline
- Complete QC dataset for advanced analysis
- Exportable for downstream processing
- Includes all metrics from individual pipeline modules
4. FastQC Section#
If enabled, displays:
- Per-base sequence quality: Quality scores across read positions
- Per-sequence quality scores: Distribution of mean quality scores
- Adapter content: Detection of adapter contamination
- GC content distribution: Expected vs. observed GC content
5. Fastp Section#
Trimming and filtering statistics:
- Reads passed/filtered
- Adapter trimming rates
- Quality filtering results
- Before/after comparison plots
6. Sentieon Alignment Metrics#
- Total reads mapped: Percentage of reads successfully aligned
- Properly paired reads: Percentage of read pairs aligned correctly
- Chimeric rate: Percentage of chimeric read pairs
- Mean coverage: Average depth across the genome
7. Sentieon GC Bias Metrics#
- GC content vs. coverage plot
- Normalized coverage by GC content
- AT/GC dropout assessment
8. Insert Size Distribution#
- Histogram of fragment sizes
- Mean and median insert sizes
- Standard deviation
9. Preseq Library Complexity#
- Complexity curves showing library diversity
- Extrapolated saturation points
- Expected unique reads at higher depths
10. Kraken Taxonomic Classification (if enabled)#
- Species composition
- Contamination detection
- Read assignment percentages
11. Mutational Signature Profile (if enabled)#
- Single base substitution (SBS) patterns
- Mutational catalog by trinucleotide context
- SBS96 barplots showing mutation types
- Group-level mutational profiles
ClusterQC Analysis#
ClusterQC provides visual assessment of sample quality and consistency:
CNV Quadrants Plot#
Location: tertiary_analyses/qc_plots/CNV-Quadrants.pdf
This plot classifies samples into quality categories based on CNV metrics:
Axes:
- X-axis: MAPD Score (Median Absolute Pairwise Difference)
- Y-axis: Segment MAD (Median Absolute Deviation)
Quality Categories (Quadrants):
| Category | MAPD Score | Segment MAD | Quality Level |
|---|---|---|---|
| Category 5 | < 0.4 | < 0.4 | Excellent - Ready for deep sequencing |
| Category 4 | 0.4-0.6 | < 0.4 | Good - Suitable for sequencing |
| Category 3 | < 0.4 | 0.4-0.6 | Moderate - Review individual profiles |
| Category 2 | 0.4-0.6 | 0.4-0.6 | Fair - Consider re-amplification |
| Category 1 | > 0.6 | > 0.6 | Poor - Not recommended for deep sequencing |
Interpretation:
- Samples in Category 5 (bottom-left quadrant) show uniform amplification with low noise
- Samples in Category 1 (top-right quadrant) show high noise and poor coverage uniformity
- Use this plot to select high-quality samples for downstream deep sequencing
QC Composition Plot#
Location: tertiary_analyses/qc_plots/QC_composition.pdf
Displays the distribution of samples across quality categories:
- Bar chart showing number and percentage of samples in each category
- Helps assess overall library quality
- Useful for batch effect detection
Consensus Scores#
Individual Sample Scores:
Location: tertiary_analyses/qc_plots/DNA-QC_ConsensusScores.txt
Detailed per-sample quality scores showing:
- SampleId: Biosample identifier
- CompositeScore: Overall quality category (1-5)
- Group: Sample group assignment (from input CSV)
- Verdict columns: Individual pass/fail scores for each QC metric:
- Verdict_preseq_count: Library complexity score
- Verdict_PCT_CHIMERAS: Chimeric rate score
- Verdict_chrM: Mitochondrial contamination score
- Verdict_MAPD_CNV_Log2: Coverage uniformity score
- Verdict_SKEW_CNV: CNV distribution symmetry score
Each verdict is scored as 1 (pass) or 0 (fail). The CompositeScore represents the overall quality based on all individual verdicts.
Summary Table:
Location: tertiary_analyses/qc_plots/DNA-QC_ConsensusScores_SummaryTable.txt
Aggregate summary showing:
- Number of cells in each quality category
- Proportion of cells in each category
- Overall batch quality assessment
For example, if all 10 samples in a run are Category 5, the summary reports the whole batch as excellent quality.
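The counts and proportions in the summary table can be reproduced from the per-sample CompositeScore values. The sketch below is a minimal illustration under stated assumptions: the `summarize_categories` helper is hypothetical, the scores are passed in as a plain list rather than read from the pipeline's file, and the ten-sample input simply mirrors the all-Category-5 case mentioned above.

```python
from collections import Counter


def summarize_categories(composite_scores):
    """Return {category: (count, proportion)} for quality categories 1 through 5."""
    counts = Counter(composite_scores)
    total = len(composite_scores)
    summary = {}
    for category in range(1, 6):
        count = counts.get(category, 0)
        proportion = count / total if total else 0.0
        summary[category] = (count, proportion)
    return summary


# Ten samples, all Category 5.
summary = summarize_categories([5] * 10)
for category in sorted(summary, reverse=True):
    count, proportion = summary[category]
    print(f"Category {category}: {count} samples ({proportion:.0%})")
```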
Output Files#
All pipeline results are organized in a structured directory hierarchy:
Directory Structure#
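- `multiqc/`
- `primary_analyses/`
  - `metrics/`
- `secondary_analyses/`
  - `alignment/`
  - `metrics/`
- `tertiary_analyses/`
  - `cnv_ginkgo/`
  - `qc_plots/`
  - `mutational_signatures/`
- `read_counts/`
- `execution_info/`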
Detailed Output Descriptions#
1. multiqc/ Directory#
| File | Description |
|---|---|
| `multiqc_report.html` | Main QC report - Interactive HTML report with all QC metrics. Open in web browser. |
| `multiqc_data/` | Raw data used to generate MultiQC report (JSON, TSV files) |
| `multiqc_version.yml` | Version information for reproducibility |
Example: multiqc_report.html
2. primary_analyses/metrics/ Directory#
Per-biosample metrics from raw read analysis:
| File Pattern | Description |
|---|---|
| `*_no_qc_fastp.json` | JSON file with detailed fastp metrics: read counts, quality scores, adapter content, filtering statistics |
Use Case: Review trimming effectiveness and raw read quality
Example: sample_no_qc_fastp.json
3. secondary_analyses/alignment/ Directory#
Aligned reads for each biosample:
| File Type | Description |
|---|---|
| `.bam` | Binary Alignment Map - Aligned reads in binary format. Can be viewed with IGV or samtools. |
| `.bam.bai` | BAM Index - Required for efficient random access to BAM file. |
Size: Typically 50-500 MB per sample (for 2M reads)
Use Case: Visual inspection of alignments, custom downstream analysis
4. secondary_analyses/metrics/ Directory#
Comprehensive alignment metrics per biosample:
| File Pattern | Description |
|---|---|
| `*.alignmentstat_sentieonmetrics.txt` | Alignment statistics: total reads, mapped reads, properly paired, chimeric rate |
| `*.cov_sentieonmetrics*` | Coverage metrics: mean coverage, coverage distribution, genome fraction covered |
| `*.gcbias.sentieonmetrics.txt` | GC bias metrics: normalized coverage by GC content, AT/GC dropout |
| `*.insertsizemetricalgo.sentieonmetrics.txt` | Insert size distribution: mean, median, standard deviation of fragment sizes |
| `*.wgsmetricsalgo.sentieonmetrics.txt` | Whole genome sequencing metrics: coverage uniformity, genome territory |
| `bam_lorenz_coverage/` | Lorenz curve data for coverage uniformity assessment |
Files with `.dedup.` in the name: Metrics calculated after duplicate removal
Files with `.nondedup.` in the name: Metrics calculated before duplicate removal
Example Files:
- sample_alignmentstat_sentieonmetrics.txt
- sample_insertsize_sentieonmetrics.txt
- sample_gcbias_sentieonmetrics.txt
- sample_wgsmetrics_sentieonmetrics.txt
Summary Files:
| File | Description |
|---|---|
| `*_all_metrics_mqc.txt` | All metrics table - Comprehensive TSV with all QC metrics for all samples |
| `*_selected_metrics_mqc.txt` | Key metrics table - Curated subset of most important QC metrics |
Example Files:
- nf-preseq-pipeline_selected_metrics_mqc.txt
- nf-preseq-pipeline_all_metrics_mqc.txt
Example Lorenz Curve: lorenz_example.png
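For readers who want to relate the Lorenz curve data to the `gini_coefficient_index` metric, the sketch below computes a Gini coefficient from a list of coverage depths using the standard sorted-rank formula. This manual does not state how the pipeline computes its value, so this is an illustration of the general technique only; the `gini_coefficient` helper and the input values are hypothetical.

```python
def gini_coefficient(depths):
    """Gini coefficient of non-negative coverage values (0 = perfectly uniform)."""
    values = sorted(depths)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero coverage value")
    # Sorted-rank form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i from 1.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Perfectly even coverage versus coverage concentrated in a single bin.
print(gini_coefficient([10, 10, 10, 10]))
print(gini_coefficient([0, 0, 0, 1]))
```

Values near 0 correspond to a Lorenz curve close to the diagonal, which is the pattern expected for uniform coverage.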
5. tertiary_analyses/cnv_ginkgo/ Directory#
Copy number variation analysis outputs (if CNV module enabled):
| File Pattern | Description |
|---|---|
| `*_CNV1.tsv` | Copy number calls per genomic bin (method 1) |
| `*_CNV2.tsv` | Copy number calls per genomic bin (method 2) |
| `*_dots.tsv` | Individual bin values for plotting |
| `AllSample-GinkgoSegmentSummary.txt` | Summary of CNV segments across all samples: number of segments, MAD scores |
| `SegCopy.binsize_*.tsv` | Segment-level copy number matrix (all samples) |
| `cnv_plots_*.tar.gz` | Compressed archive containing JPEG CNV profile plots for all samples |
| `ginkgo_res.*.RDS` | R data file with complete Ginkgo analysis results |
CNV Profile Example: cnv_number_profile_example.jpeg
Example Files:
- sample_CNV1.tsv
- AllSample-GinkgoSegmentSummary.txt
- SegCopy.binsize_1000000.tsv
To extract CNV plots, unpack the `cnv_plots_*.tar.gz` archive with any tool that reads gzip-compressed tar files.
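One way to unpack the CNV plot archive is with Python's standard `tarfile` module. The sketch below is an assumption-laden example rather than the manual's original command: the `extract_cnv_plots` helper and the directory arguments are hypothetical, and only the `cnv_plots_*.tar.gz` file pattern comes from the table above.

```python
import glob
import tarfile
from pathlib import Path


def extract_cnv_plots(cnv_dir, dest_dir):
    """Unpack every cnv_plots_*.tar.gz found in cnv_dir into dest_dir.

    Returns the member names that were extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(glob.glob(str(Path(cnv_dir) / "cnv_plots_*.tar.gz"))):
        with tarfile.open(archive, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python version supports it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(dest, filter="data")
            else:
                tar.extractall(dest)
            extracted.extend(tar.getnames())
    return extracted
```

A call such as `extract_cnv_plots("tertiary_analyses/cnv_ginkgo", "cnv_plots")` would place the JPEG profiles in a local `cnv_plots` folder, assuming the results were downloaded with that directory layout.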
6. tertiary_analyses/qc_plots/ Directory#
Aggregate quality control plots and scores:
| File | Description |
|---|---|
| `CNV-Quadrants.pdf` | ClusterQC plot - Scatter plot showing MAPD vs. Segment MAD with quality categories |
| `QC_composition.pdf` | Category distribution - Bar chart showing proportion of samples in each quality category |
| `*_mqc.jpg` | JPEG versions of plots for inclusion in reports |
| `DNA-QC_ConsensusScores.txt` | Individual sample consensus scores with quality category assignments |
| `DNA-QC_ConsensusScores_SummaryTable.txt` | Overall summary: number and proportion of samples per category |
| `ConsensusScores_SummaryTableByGroup.txt` | Group-stratified summary (if groups specified in input CSV) |
| `AllSample-GinkgoSegmentSummary.txt` | Segmentation statistics for all samples |
Example Files:
- CNV-Quadrants.pdf
- CNV-Quadrants_mqc.jpg
- QC_composition.pdf
- QC_composition_mqc.jpg
- DNA-QC_ConsensusScores.txt
- DNA-QC_ConsensusScores_SummaryTable.txt
- ConsensusScores_SummaryTableByGroup.txt
7. tertiary_analyses/mutational_signatures/ Directory (if enabled)#
Mutational signature profiling outputs for grouped samples:
| File Pattern | Description |
|---|---|
| `*.pseudobulk.bam` | Pseudobulk alignment - Combined BAM file from all samples in the group |
| `*.pseudobulk.bam.bai` | BAM index file for pseudobulk alignment |
| `*.dnascope.vcf.gz` | Variant calls - Called variants from pseudobulk BAM using Sentieon DNAscope |
| `*.filtered.vcf.gz` | Filtered variants - High-confidence variants after filtering against GIAB reference |
| `*_SBS96.all` | SBS96 matrix - Single base substitution counts in 96 trinucleotide contexts |
| `*_SBS96_barplot.pdf` | Visualization - Bar plot showing mutational signature profile |
| `mutational_catalog_mqc.txt` | Summary table for MultiQC integration |
| `merged_mutational_catalog.tsv` | Combined mutational catalog across all groups |
Trinucleotide Context: Each mutation is classified by the mutated base and its flanking bases (e.g., C>A in ACA context)
SBS96 Classification: Mutations are categorized into 6 substitution types (C>A, C>G, C>T, T>A, T>C, T>G) × 16 trinucleotide contexts = 96 categories
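To make the 6 × 16 = 96 arithmetic concrete, the sketch below enumerates the categories in the bracket notation commonly used for SBS96 catalogs, for example `A[C>A]A` for a C>A change in an ACA context. The `sbs96_categories` helper is hypothetical, and the ordering it produces is an assumption that may differ from the row order in the pipeline's `*_SBS96.all` files.

```python
from itertools import product

SUBSTITUTIONS = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
BASES = "ACGT"


def sbs96_categories():
    """List the 96 SBS categories as 5'-base[ref>alt]3'-base strings."""
    return [
        f"{five_prime}[{substitution}]{three_prime}"
        for substitution in SUBSTITUTIONS
        for five_prime, three_prime in product(BASES, repeat=2)
    ]


categories = sbs96_categories()
print(len(categories))   # 6 substitution types x 16 flanking-base pairs
print(categories[:4])
```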
Use Case: Identify mutational processes, compare signatures between experimental groups, detect exposure-related mutation patterns
8. read_counts/ Directory#
Read count summaries at each processing step:
Tracks the number of reads remaining after each filtering/processing step.
9. execution_info/ Directory#
Pipeline execution metadata:
| File | Description |
|---|---|
| `execution_report.html` | Nextflow execution report with resource usage and runtime statistics |
| `execution_timeline.html` | Visual timeline of task execution |
| `execution_trace.txt` | Detailed trace of all executed tasks |
| `versions.yml` | Software versions for all tools used in the pipeline |
Use Case: Troubleshooting, reproducibility, performance optimization
Appendix#
A. FASTQ Naming Conventions#
The BJ-DNA-QC pipeline accepts Illumina FASTQ files following these naming patterns:
Standard Illumina Format#
`{SampleName}_S{SampleNumber}_L{LaneNumber}_R{ReadNumber}_001.fastq.gz`
Components:
- {SampleName}: User-defined sample identifier (e.g., SampleA, Patient01_Tumor)
- S{SampleNumber}: Sample number from sample sheet (e.g., S1, S12)
- L{LaneNumber}: Sequencing lane (e.g., L001, L002)
- R{ReadNumber}: Read direction (R1 for forward, R2 for reverse)
- 001: File segment (always 001 for single files)
Example: `SampleA_S1_L001_R1_001.fastq.gz`
Simplified Format (Also Accepted)#
For example, `sample_R1_001.fastq.gz` and `sample_R2_001.fastq.gz`.
Requirements:
- Files must be gzip compressed (.fastq.gz or .fq.gz)
- Forward and reverse reads must have matching sample names
- Read pairs must be indicated by _R1_ and _R2_ (or _R1. and _R2.)
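The standard Illumina pattern described in this appendix can be checked with a regular expression. The sketch below is a minimal example under stated assumptions: the `parse_illumina_fastq` helper is hypothetical, a three-digit lane number is assumed from the `L001` and `L002` examples, and the simplified naming format is not covered.

```python
import re

# {SampleName}_S{SampleNumber}_L{LaneNumber}_R{ReadNumber}_001.fastq.gz (or .fq.gz)
ILLUMINA_PATTERN = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_001\.(?:fastq|fq)\.gz$"
)


def parse_illumina_fastq(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = ILLUMINA_PATTERN.match(filename)
    if match is None:
        return None
    return match.groupdict()


print(parse_illumina_fastq("SampleA_S1_L001_R1_001.fastq.gz"))
print(parse_illumina_fastq("SampleA.bam"))
```

Comparing the `sample` field of the R1 and R2 files is one simple way to confirm that read pairs have matching sample names.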
B. Selected Metrics Descriptions#
The nf-preseq-pipeline_selected_metrics_mqc.txt file contains the following key metrics:
| Metric | Description | Typical Range | Quality Indicator |
|---|---|---|---|
| `sample_name` | Unique biosample identifier | - | - |
| `preseq_count` | Estimated library complexity | > 1B | Higher is better - indicates diverse library |
| `chrM` | Percentage of mitochondrial reads | 0.1 - 0.5% | Lower is better - high values indicate poor library |
| `PCT_CHIMERAS` | Percentage of chimeric read pairs | 5 - 10% | Lower is better - high values indicate amplification artifacts |
| `PF_HQ_ERROR_RATE` | Error rate in high-quality reads | 0.2 - 0.5% | Lower is better |
| `pct_trimmed_aligned` | Percentage of trimmed reads that aligned | > 98% | Higher is better - indicates good quality and correct reference |
| `MEDIAN_INSERT_SIZE` | Median fragment size (bp) | 300 - 500 bp | Should match expected library prep |
| `total_reads` | Total number of raw reads | 2M (after subsampling) | Should match subsampling parameter |
| `adapter_trimmed_reads` | Number of reads with adapters removed | Varies | Lower is better - indicates clean library |
| `adapter_trimmed_bases` | Total bases trimmed from adapters | Varies | Lower is better |
| `gini_coefficient_index` | Coverage uniformity (Gini index) | 0.02 - 0.05 | Lower is better - indicates uniform coverage |
| `cnv_genome_ploidy` | Estimated genome ploidy | 2 (diploid) | Should match expected sample ploidy |
| `cnv_number_of_segments` | Number of CNV segments detected | 5 - 20 | Fewer is better for normal samples |
| `cnv_segment_MAD` | Segment Median Absolute Deviation | < 0.4 | Lower is better - measures CNV noise |
| `SegmentMAD_Aware` | Adjusted Segment MAD | < 0.4 | Lower is better - quality metric |
| `MAPD.Score` | Median Absolute Pairwise Difference | < 0.4 | Lower is better - coverage uniformity |
| `MAPD_CNV_Log2` | Log2-transformed MAPD for CNV | < 0.2 | Lower is better |
| `SKEW_CNV` | Skewness of CNV distribution | < 0.15 | Lower is better - symmetric distribution |
C. ClusterQC Metric Cutoffs#
CNV Quality Category Definitions#
The ClusterQC analysis classifies sample quality using thresholds on two metrics:
- MAPD Score (Median Absolute Pairwise Difference)
- Segment MAD (Median Absolute Deviation)

Category 5 (Excellent):
- MAPD Score < 0.4
- Segment MAD < 0.4
- Recommendation: Proceed with high-depth sequencing

Category 4 (Good):
- MAPD Score < 0.4
- Segment MAD 0.4-0.6
- Recommendation: Suitable for sequencing, acceptable quality

Category 3 (Moderate):
- MAPD Score 0.4-0.6
- Segment MAD < 0.4
- Recommendation: Review individual CNV profiles before sequencing

Category 2 (Fair):
- MAPD Score 0.4-0.6
- Segment MAD 0.4-0.6
- Recommendation: Consider re-amplification or exclude from study

Category 1 (Poor):
- MAPD Score > 0.6 OR Segment MAD > 0.6
- Recommendation: Do not proceed with deep sequencing
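The category definitions listed in this appendix can be expressed as a small lookup. The sketch below is an illustration, not the ClusterQC implementation: the `cnv_quality_category` helper is hypothetical, and because the definitions do not say which band a value of exactly 0.4 or 0.6 belongs to, placing both in the 0.4-0.6 band is an assumption.

```python
def _band(value, low=0.4, high=0.6):
    """Place a metric in the 'low' (< 0.4), 'mid' (0.4-0.6), or 'high' (> 0.6) band."""
    if value < low:
        return "low"
    if value <= high:  # assumption: boundary values fall in the middle band
        return "mid"
    return "high"


def cnv_quality_category(mapd_score, segment_mad):
    """Return the quality category (1-5) following the Appendix C definitions."""
    mapd_band = _band(mapd_score)
    mad_band = _band(segment_mad)
    if "high" in (mapd_band, mad_band):
        return 1  # MAPD Score > 0.6 OR Segment MAD > 0.6
    lookup = {
        ("low", "low"): 5,
        ("low", "mid"): 4,
        ("mid", "low"): 3,
        ("mid", "mid"): 2,
    }
    return lookup[(mapd_band, mad_band)]


print(cnv_quality_category(0.2, 0.2))
print(cnv_quality_category(0.7, 0.1))
```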
Metric Interpretations#
MAPD (Median Absolute Pairwise Difference):
- Measures coverage uniformity across adjacent genomic windows
- Calculated as the median of absolute differences between adjacent bins
- Low MAPD (< 0.4): Smooth, uniform coverage
- High MAPD (> 0.6): Noisy, uneven coverage

Segment MAD (Median Absolute Deviation):
- Measures the variability within CNV segments
- Reflects the noise level in copy number calls
- Low MAD (< 0.4): Clean segmentation, low noise
- High MAD (> 0.6): Noisy segmentation, poor quality
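The MAPD definition given above (the median of absolute differences between adjacent bins) can be written directly in Python. This is a sketch of the general calculation only: the `mapd` helper is hypothetical, and this manual does not say which per-bin values the pipeline uses, so the log2-ratio-style inputs below are an assumption.

```python
from statistics import median


def mapd(bin_values):
    """Median of absolute differences between consecutive per-bin values."""
    if len(bin_values) < 2:
        raise ValueError("need at least two bins")
    return median(abs(b - a) for a, b in zip(bin_values, bin_values[1:]))


# Hypothetical per-bin values in genomic order.
smooth = [0.00, 0.02, -0.01, 0.01, 0.00]
noisy = [0.0, 0.8, -0.7, 0.9, -0.6]
print(mapd(smooth))
print(mapd(noisy))
```

A profile that jumps between neighboring bins yields a larger MAPD than a smooth one, which is why the metric serves as a noise indicator.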
D. PreSeq Library Complexity#
PreSeq estimates library complexity and predicts sequencing saturation:
Key Metrics:
- Extrapolated complexity: Predicted number of unique molecules at higher sequencing depths
- Saturation point: Sequencing depth where few new unique molecules are detected

Interpretation:
- High complexity (> 1B unique molecules): Library can support deep sequencing
- Low complexity (< 100M unique molecules): Library is saturated, deep sequencing will yield diminishing returns
Use Case: Decide whether to proceed with high-depth sequencing based on library diversity
E. Mitochondrial Read Percentage (ChrM)#
Percentage of reads mapping to the mitochondrial genome:
| ChrM % | Interpretation |
|---|---|
| < 0.5% | Normal - high-quality nuclear DNA library |
| 0.5 - 2% | Elevated - acceptable for most applications |
| 2 - 10% | High - may indicate cell stress or poor lysis |
| > 10% | Very high - poor nuclear DNA recovery, consider re-amplification |
Causes of High ChrM:
- Incomplete cell lysis
- Poor amplification of nuclear DNA
- Cell death or apoptosis
- High metabolic activity (tissue-specific)
F. Chimeric Rate#
Percentage of read pairs mapping to different chromosomes or distant locations:
| Chimeric Rate | Interpretation |
|---|---|
| < 5% | Excellent |
| 5 - 10% | Normal for whole genome amplification |
| 10 - 15% | Elevated - review amplification protocol |
| > 15% | High - indicates excessive artifacts |
Causes of High Chimeric Rate:
- Over-amplification
- DNA damage
- Non-specific priming during amplification
G. SKEW (CNV Distribution Skewness)#
Measures the asymmetry of the CNV log2 ratio distribution:
| SKEW Value | Interpretation |
|---|---|
| < 0.1 | Symmetric CNV distribution - high quality |
| 0.1 - 0.2 | Slight skew - acceptable |
| > 0.2 | Significant skew - indicates bias or poor coverage |
Ideal: SKEW close to 0 indicates balanced amplification without systematic bias
H. Log2 Ratio (CNV)#
Log2-transformed copy number ratio relative to expected ploidy:
| Log2 Value | Copy Number | Interpretation |
|---|---|---|
| -∞ | 0 | Homozygous deletion |
| -1 | 1 | Heterozygous deletion |
| 0 | 2 | Normal diploid |
| +0.58 | 3 | Single copy gain |
| +1 | 4 | Double copy gain |
Quality Assessment:
- Tight distribution around 0: High-quality, uniform coverage
- Broad distribution: Noisy coverage, poor quality
- Systematic shift: Ploidy estimation error or contamination
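The values in the log2 table follow from log2(copy number / ploidy) with a diploid baseline. The short sketch below reproduces them; the `log2_ratio` helper is hypothetical and assumes an expected ploidy of 2 unless another value is passed.

```python
import math


def log2_ratio(copy_number, ploidy=2):
    """Log2 of the copy number relative to the expected ploidy."""
    if copy_number == 0:
        return float("-inf")  # homozygous deletion
    return math.log2(copy_number / ploidy)


for copies in range(5):
    print(copies, round(log2_ratio(copies), 2))
```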
Frequently Asked Questions#
Q: Can I use CRAM files as input?#
A: No, BJ-DNA-QC currently does not support CRAM format. Contact ResolveServices to convert CRAM to FASTQ files.
Q: What if some samples have no reads?#
A: Samples with zero reads will cause the pipeline to fail. Always verify that all input files contain data before submission.
Q: How do I select samples for high-depth sequencing?#
A: Use the ClusterQC CNV-Quadrants plot. Select samples in Category 5 (and optionally Category 4) for optimal results.
Q: Can I re-run the pipeline with different parameters?#
A: Yes, create a new project with the same input data but different parameter settings. Previous results are preserved.
Q: Where can I view example outputs?#
A: Example files are linked throughout this manual in the output file descriptions.