BJ-Expression Pipeline User Manual#
Customer Support#
If you need assistance with the BJ-Expression pipeline, several support channels are available:
BaseJumper Online Manual#
For comprehensive documentation and guides, visit the BaseJumper Documentation Portal.
Email Support#
Contact our support team at: basejumper@bioskryb.com
Online Portal Ticketing System#
For bug reports and feature requests, submit a ticket through our online portal. Your account representative can provide access details.
Before You Begin#
Account Registration#
- Register for a BaseJumper account at https://basejumper.bioskryb.com: select ‘Create Account’ and review the Terms and Conditions.
- Wait for account approval from your organization administrator
- Log in with your credentials to access the platform
Data Transfer Setup#
Before running the BJ-Expression pipeline, ensure your sequencing data is accessible to you:
Globus Data Transfer#
- Recommended method for large-scale data transfers
- Set up a Globus endpoint for your data storage
- Review the Globus setup instructions in the BaseJumper Online Manual
- The BaseJumper support team will add you to your assigned workspace and an email confirming membership will arrive from groups.globus.org (you may need to check your Junk/Spam filter).
- Contact BaseJumper support for additional Globus configuration assistance.
Alternative Transfer Methods#
- AWS S3 bucket access (requires IAM credentials)
- Direct upload through BaseJumper interface (for smaller datasets)
- Contact support to discuss custom data transfer solutions
Sample Metadata Requirements#
The BJ-Expression pipeline requires a properly formatted input CSV file. To get a starting template, open the BaseJumper project page and click the ‘Export’ button at the top of the biosample table. The exported file has the following structure (you will need to add the ‘groups’ column yourself):
Input CSV Format#
| Column Name | Description | Required | Example |
|---|---|---|---|
| biosampleName | Unique identifier for each sample | Yes | Expression-test1 |
| read1 | Path to forward reads (R1) FASTQ file | Yes | s3://bucket/sample_R1_001.fastq.gz |
| read2 | Path to reverse reads (R2) FASTQ file | Yes | s3://bucket/sample_R2_001.fastq.gz |
| groups | Optional grouping for batch analysis | No | Group1 or Control |
Example Input File#
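A minimal file has a header row followed by one row per sample. For example, using the example values from the table above:
biosampleName,read1,read2,groups
Expression-test1,s3://bucket/sample_R1_001.fastq.gz,s3://bucket/sample_R2_001.fastq.gz,Group1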
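The column requirements above can be checked locally before a project is created. The following is a minimal sketch, not part of BaseJumper: the function name `validate_input_csv` and the specific checks (required columns present, unique `biosampleName`, gzipped FASTQ extension) are assumptions drawn from the table and notes in this manual.

```python
import csv
import io

# Required columns per the Input CSV Format table; "groups" is optional.
REQUIRED_COLUMNS = ("biosampleName", "read1", "read2")
# Assumed acceptable extensions (gzipped FASTQ only).
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def validate_input_csv(text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        name = (row["biosampleName"] or "").strip()
        if not name:
            problems.append(f"row {row_number}: empty biosampleName")
        elif name in seen:
            problems.append(f"row {row_number}: duplicate biosampleName {name!r}")
        seen.add(name)
        for column in ("read1", "read2"):
            path = (row[column] or "").strip()
            if not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"row {row_number}: {column} is not a gzipped FASTQ path")
    return problems
```

An empty list means none of these checks failed; it does not confirm that the S3 paths are reachable from your account.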
FASTQ Naming Conventions#
BJ-Expression accepts standard Illumina FASTQ naming formats:
- Pattern: {SampleName}_S{SampleNumber}_L{Lane}_R{ReadNumber}_001.fastq.gz
- Example: Expression-test1_S1_L001_R1_001.fastq.gz
Important Notes:
- Files must be gzipped (.fastq.gz extension)
- Read pairs must have matching sample names
- S3 paths should be accessible from your BaseJumper account
- CRAM files are not currently supported - contact ResolveServices for CRAM processing
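As an illustration of the naming pattern above, the sketch below splits a file name into its components with a regular expression. The function name `parse_illumina_fastq_name` and the returned field names are hypothetical; this is not how BaseJumper itself parses names.

```python
import re

# Mirrors {SampleName}_S{SampleNumber}_L{Lane}_R{ReadNumber}_001.fastq.gz
ILLUMINA_FASTQ = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d+)"
    r"_R(?P<read>[12])_001\.fastq\.gz$"
)


def parse_illumina_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = ILLUMINA_FASTQ.match(filename)
    if match is None:
        return None
    return {
        "sample": match.group("sample"),
        "sample_number": int(match.group("sample_number")),
        "lane": match.group("lane"),  # kept as text to preserve zero padding
        "read": int(match.group("read")),
    }


print(parse_illumina_fastq_name("Expression-test1_S1_L001_R1_001.fastq.gz"))
```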
Project Design#
Creating a New Project#
- Navigate to Projects: From the BaseJumper dashboard, click the "Create Project" button in the upper right corner
- Project Configuration:
  - Project Name: Choose a descriptive, unique name for your project
  - Description: Add optional project details
- Select Pipeline: Choose BJ-Expression from the pipeline dropdown menu
- Sample Selection:
  - Select samples from your Shared Data directory. If you have an Illumina BaseSpace Sequence Hub token, follow the instructions in the BaseJumper Manual to provide it to the Support email.
Shared Data Requirements#
If using samples from Shared Data:
- Samples must follow Illumina naming conventions
- Files must be in .fastq.gz format
- Both paired-end reads (R1 and R2) must be present
Sample Organization#
Before submitting your project, consider:
- Which samples to include? Select only samples you want to analyze together
- Metadata grouping: Use the groups column to organize samples by experimental conditions
- Quality control: Ensure all samples have sufficient read depth (minimum 5,000 reads)
⚠️ Warning: Samples with fewer than 5,000 reads may cause the pipeline to fail. Verify data quality before submission.
Automatic Pipeline Queuing#
By default, projects are automatically queued for execution after creation. To disable this:
- Uncheck the "Auto-queue project" option during project setup
- Manually start the pipeline from the project dashboard when ready
Pipeline Parameters#
Selecting Pipeline Version#
- From the pipeline configuration screen, select your preferred pipeline version
- Default: The most recent stable version (recommended)
- For reproducibility, you can select specific previous versions
Core Parameters#
Configure the following parameters based on your experimental design:
Reference Genome#
Select the reference genome that matches your sample organism:
| Genome | Description |
|---|---|
| GRCh38 (Default) | Human reference genome, build 38 |
| GRCh37 | Human reference genome, build 37 |
| GRCm39 | Mouse reference genome, build 39 |
Default: GRCh38
Read Length#
| Option | Description | Use Case |
|---|---|---|
| 50 (Default) | 50 base pair reads | Standard single-cell RNA-seq |
| 75 | 75 base pair reads | Higher resolution analysis |
| 100 | 100 base pair reads | Deep sequencing applications |
| 150 | 150 base pair reads | Longer-read RNA-seq protocols |
Default: 50
Adapter Sequences#
Specify adapter sequences for trimming:
| Parameter | Default Value | Description |
|---|---|---|
| Adapter Sequence for first read | AAGCAGTGGTATCAACGCAGAGTACA | Adapter sequence trimmed from R1 reads |
| Adapter Sequence for second read | AAGCAGTGGTATCAACGCAGAGTACAT | Adapter sequence trimmed from R2 reads |
💡 Note: These defaults are optimized for standard single-cell RNA-seq library prep protocols. Your sequencing analysis workflow should provide these adapter sequences, which you can use here.
Optional Modules#
Enable or disable optional analysis modules based on your needs:
| Module | Default State | Description |
|---|---|---|
| FastQ Reads Subsampling | ✅ Enabled | Randomly subsamples reads to 100,000 reads per sample for faster QC analysis. For comprehensive analysis, disable this option to use all reads. |
| CUTADAPT Adapter Removal | ✅ Enabled | Performs targeted adapter removal using user-specified sequences before FASTP trimming. Removes adapters from both 5' and 3' ends of reads. |
| 10x Support | ❌ Disabled | Enables analysis of 10x Genomics Chromium data with specialized processing |
Subsampling Details:
- Default: 100,000 reads per sample.
- Purpose: Enables faster computation while providing sufficient data for QC metrics comparison.
- Recommendation: Keep enabled for initial QC; this keeps data comparable from project to project. Disable it for final full-depth analysis, or when you want to get the most out of phenotype prediction, isoform detection, or variant calling.
CUTADAPT Details:
- Default: Enabled
- Purpose: Removes known adapter sequences efficiently before comprehensive FASTP analysis
- Recommendation: Keep enabled when using standard library prep protocols with known adapter sequences; can be disabled if adapters are minimal or if FASTP alone is sufficient
Workflow Overview#
Pipeline Execution Steps#
The BJ-Expression pipeline performs the following analyses:
flowchart LR
%% Colors %%
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
Start((Start)):::black --> Subsample[Subsample Reads<br/>SEQTK]:::blue
Subsample --> Trim[Trim Adapters<br/>FASTP]:::blue
Trim --> Salmon[Transcript Quantification<br/>SALMON]:::green
Trim --> STAR[Splice-aware Alignment<br/>STAR]:::green
STAR --> HTSeq[Gene Quantification<br/>HTSeq]:::green
STAR --> Qualimap[Alignment QC<br/>Qualimap]:::pink
STAR --> CellType[Cell Typing<br/>SingleR]:::pink
STAR --> PCA[PCA Analysis]:::pink
Salmon --> MultiQC[Aggregate Report<br/>MultiQC]:::orange
HTSeq --> MultiQC
Qualimap --> MultiQC
CellType --> MultiQC
PCA --> MultiQC
MultiQC --> End((End)):::black
Processing Steps Detail#
1. Read Subsampling (SEQTK)#
- Randomly selects 100,000 reads per sample (default)
- Enables rapid QC and consistent cross-sample comparison
- Can be disabled for full-depth analysis
- Output: Subsampled FASTQ files
💡 Why subsample? For initial QC or comparison between runs, subsampling enables faster computation while providing sufficient data for quality assessment.
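The idea behind paired subsampling can be sketched as follows. For paired-end data, the same read indices must be kept in R1 and R2 so that mates stay together (with seqtk this is commonly done by using the same seed for both files). The function below is an illustrative, in-memory sketch; `subsample_pairs` and its default seed are assumptions, not the pipeline's implementation.

```python
import random


def subsample_pairs(r1_records, r2_records, target=100_000, seed=42):
    """Pick the same random subset of positions from both mate lists."""
    if len(r1_records) != len(r2_records):
        raise ValueError("R1 and R2 must contain the same number of reads")
    if len(r1_records) <= target:
        # Fewer reads than the target: keep everything.
        return list(r1_records), list(r2_records)
    # One seeded draw of indices, applied to both files, keeps mates in sync.
    chosen = sorted(random.Random(seed).sample(range(len(r1_records)), target))
    return [r1_records[i] for i in chosen], [r2_records[i] for i in chosen]
```

Fixing the seed makes the subset reproducible between runs, which supports the cross-project comparability mentioned above.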
2. Adapter Removal (CUTADAPT)#
- Removes adapter sequences from both 5' and 3' ends
- Uses user-specified adapter sequences for targeted removal
- Trims adapters from both forward and reverse reads
- Output: Adapter-trimmed FASTQ files
💡 Note: CUTADAPT runs before FASTP to remove known adapter sequences, then FASTP performs additional quality trimming and filtering.
3. Quality Trimming and QC (FASTP)#
- Performs additional adapter detection and removal
- Trims low-quality bases from read ends
- Performs poly-G tail removal (for two-color chemistry)
- Filters reads below quality threshold
- Generates comprehensive QC metrics
- Output: Trimmed FASTQ files, QC JSON reports
4. Transcript-Level Quantification (SALMON)#
- Uses pseudo-alignment for rapid transcript (isoform) quantification
- Maps reads to transcript sequences without full alignment
- Generates TPM (Transcripts Per Million) values and normalizes for isoform length
- Provides both transcript and gene-level counts
- Output: Transcript counts, TPM values, gene counts
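For reference, TPM divides each transcript's read count by its length, then rescales so that the values sum to one million. The sketch below shows that arithmetic on plain lists; it is a simplified illustration (Salmon works with estimated counts and effective lengths), and the function name `tpm` is hypothetical.

```python
def tpm(counts, lengths):
    """Transcripts Per Million from per-transcript counts and lengths (in bases)."""
    # Reads per base for each transcript removes the length bias.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that all values in the sample sum to 1,000,000.
    return [rate / total * 1_000_000 for rate in rates]


# Two transcripts with the same reads-per-base share the total equally.
print(tpm([10, 20], [1000, 2000]))
```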
5. Splice-Aware Alignment (STAR)#
- Aligns reads to reference genome with splice junction detection
- Handles intron-spanning reads for RNA-seq data
- Generates chimeric junction output for fusion detection
- Output: BAM alignment files, junction files
6. Primary Read Extraction (SAMTOOLS)#
- Extracts primary alignments from STAR output
- Filters secondary and supplementary alignments
- Creates indexed BAM files
- Output: Filtered BAM files with indexes
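In the SAM specification, secondary alignments carry flag bit 0x100 and supplementary alignments carry 0x800, so a primary alignment is one with neither bit set (the equivalent of excluding 0x900 with samtools). A minimal sketch of that test follows; the helper name `is_primary` is hypothetical and the exact samtools options used by the pipeline are not documented here.

```python
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def is_primary(flag):
    """True when neither the secondary nor the supplementary bit is set."""
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0


flags = [99, 147, 355, 2147]
print([f for f in flags if is_primary(f)])
```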
7. Gene-Level Quantification (HTSeq)#
- Counts reads overlapping gene features
- Uses STAR alignments and GTF annotations
- Provides raw read counts per gene
- Output: Gene count matrices
8. Alignment Quality Assessment (QUALIMAP)#
- Evaluates alignment quality metrics
- Calculates exonic/intronic/intergenic read proportions
- Assesses 5'-3' bias
- Generates coverage profiles
- Output: Quality metrics, coverage statistics
9. Cell Typing Classification (SingleR)#
- Predicts cell type based on gene expression patterns
- Uses reference databases: HPCA, GTEx, TCGA
- Classifies by cell phase, progenitor type, tissue type
- Output: Cell type predictions with confidence scores
10. Principal Component Analysis (PCA)#
- Identifies sample relationships and batch effects
- Visualizes expression patterns across samples
- Detects outliers or technical artifacts
- Output: PCA plots, variance explained
11. Report Generation (MultiQC)#
- Aggregates metrics across all samples and tools
- Creates interactive HTML report
- Generates summary tables and visualizations
- Output: multiqc_report.html
File Naming Conventions#
Input Files#
- Forward reads: {biosampleName}_R1_001.fastq.gz
- Reverse reads: {biosampleName}_R2_001.fastq.gz
Output Files#
- Aligned BAM: {biosampleName}.bam
- BAM index: {biosampleName}.bam.bai
- Gene counts: {biosampleName}.htseq_counts.tsv
- Chimeric junctions: {biosampleName}_Chimeric.out.junction
Estimated Processing Time#
| Project Size | Estimated Time | Notes |
|---|---|---|
| 10 samples | <1 hour | With 100K reads subsampling |
| 50 samples | 1-3 hours | Standard QC analysis |
| 100 samples | 2-6 hours | Large batch processing |
| 384 samples | 4-12 hours | High-throughput plate |
Factors affecting runtime:
- Read depth per sample
- Subsampling enabled/disabled
- Reference genome size
- Current cluster load
Summarizing Output#
MultiQC Report#
The MultiQC report aggregates all QC metrics into a single interactive HTML file, multiqc_report.html, located in the multiqc/ output directory.
MultiQC Report Sections#
1. General Statistics#
Overview table showing key metrics for all samples:
- Total reads
- Alignment rate
- Exonic read percentage
- Number of genes detected
- Mitochondrial read percentage
2. Selected Metrics#
Curated selection of the most important QC metrics across all analysis tools. This section consolidates key quality indicators from read processing, alignment, quantification, and cell typing analyses into a single comprehensive table for easy comparison across samples. Metrics include read counts at various pipeline stages, quality filtering statistics, alignment proportions (exonic, intronic, intergenic, mitochondrial), gene and transcript detection counts, insert size measurements, coverage metrics, expression dynamics, and automated cell type classifications. See Appendix B for detailed descriptions of all metrics, including typical ranges and quality indicators.
3. Fastp Section#
Trimming and filtering statistics:
- Reads passed/filtered
- Adapter trimming rates
- Quality filtering results
- Poly-G tail removal (if applicable)
- Before/after comparison plots
4. Salmon Section#
Transcript quantification metrics:
- Mapping rate
- Number of transcripts detected
- Library type detection
- Fragment length distribution
5. STAR Alignment Section#
Splice-aware alignment statistics:
- Uniquely mapped reads percentage
- Multi-mapped reads
- Unmapped reads
- Chimeric reads
- Reads mapped to multiple loci
6. Qualimap Section#
Alignment quality metrics:
- Exonic reads: Percentage of reads mapping to exons
- Intronic reads: Percentage of reads mapping to introns
- Intergenic reads: Percentage of reads in intergenic regions
- 5'-3' bias: Coverage uniformity along transcript length
- Gene body coverage: Distribution of reads across gene features
7. Gene Detection Section#
Gene-level quantification summary:
- Total genes detected
- Protein-coding genes
- lncRNAs, pseudogenes, miRNAs
- Gene type distribution
- Mitochondrial gene proportion
8. Cell Typing Section#
Automated cell type classification:
- Predicted cell type
- Cell cycle phase
- Tissue type prediction
- Confidence scores
9. Expression Dynamics#
Dynamic range and expression metrics:
- Housekeeping gene expression consistency
- Coefficient of variation (CV) for housekeeping genes
- Expression range (min/max/average)
- Top and bottom 10% expression levels
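The coefficient of variation is the standard deviation divided by the mean, so a housekeeping gene with stable expression across samples has a CV near zero. The sketch below shows the calculation for one gene; whether the pipeline uses the sample or population standard deviation is not stated in this manual, so the use of `statistics.stdev` (sample standard deviation) is an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """CV = standard deviation / mean, for one gene across samples."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("nan")  # CV is undefined when the mean is zero
    # Assumption: sample standard deviation (needs at least two values).
    return statistics.stdev(values) / mean


# Counts for one hypothetical housekeeping gene in three samples.
print(coefficient_of_variation([90, 100, 110]))
```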
10. PCA Analysis#
Sample clustering and relationships:
- Principal component plots
- Variance explained by each PC
- Sample outlier detection
- Group separation visualization
11. ClusterQC Composition#
Quality category distribution:
- Number of samples per quality category
- Overall batch quality assessment
- Category definitions (1-5 scale)
ClusterQC Analysis#
ClusterQC provides visual assessment of RNA-seq sample quality:
RNA-QC Composition Plot#
Location: tertiary_analyses/qc_plots/composition_rnaqc.pdf
This plot displays the distribution of samples across quality categories.
Quality Categories:
| Category | Quality Level | Characteristics |
|---|---|---|
| Category 5 | Excellent | High alignment rate (>70%), high exonic proportion (>75%), low intergenic reads (<5%), >10K genes detected |
| Category 4 | Good | Alignment rate 50-70%, exonic proportion 60-75%, suitable for analysis |
| Category 3 | Moderate | Alignment rate 30-50%, may require additional QC review |
| Category 2 | Fair | Alignment rate 10-30%, consider re-sequencing |
| Category 1 | Poor | Alignment rate <10%, not recommended for analysis |
| Category 0 | Failed | Unable to process or extreme technical failure |
Interpretation:
- Samples in Category 4-5 are suitable for downstream RNA-seq analysis
- Samples in Category 1-2 may indicate library preparation or sequencing issues
- Use this plot to identify problematic samples before proceeding to differential expression analysis
Summary Verdict Table#
Location: tertiary_analyses/qc_plots/summary_verdict.txt
Summary table showing:
- Number of samples in each quality category
- Proportion of samples in each category
- Overall project quality assessment
Example:
1 2 3 4 5 6 7 | |
In this example, all 5 samples are Category 4 (good quality).
Group-Stratified Summary#
Location: tertiary_analyses/qc_plots/summary_verdict_group.txt
If groups are specified in the input CSV, this file shows quality category distribution per group, useful for detecting batch effects.
Output Files#
All pipeline results are organized in a structured directory hierarchy:
Directory Structure#
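- multiqc/
- secondary_analyses/
  - alignment_htseq/
  - quantification_htseq/
  - quantification_salmon/
  - secondary_metrics/
  - insert_size/
- tertiary_analyses/
  - classification_cell_typing/
  - qc_plots/
- read_counts/
- execution_info/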
Detailed Output Descriptions#
1. multiqc/ Directory#
| File | Description |
|---|---|
| multiqc_report.html | Main QC report - Interactive HTML report with all QC metrics. Open in web browser. |
| multiqc_data/ | Raw data used to generate MultiQC report (JSON, TSV files) |
| multiqc_report_plots/ | Plot files - PDF, PNG, and SVG versions of all plots featured in MultiQC |
| multiqc_version.yml | Version information for reproducibility |
Example: multiqc_report.html
2. secondary_analyses/alignment_htseq/ Directory#
Aligned reads for each biosample:
| File Type | Description |
|---|---|
| .bam | Binary Alignment Map - STAR-aligned reads in binary format. Can be viewed with IGV or samtools. Contains primary alignments only. |
| .bam.bai | BAM Index - Required for efficient random access to BAM file. |
| _Chimeric.out.junction | Chimeric junctions - Detected fusion/chimeric reads that span distant genomic locations. Useful for fusion gene detection. |
Size: Typically 50-500 MB per sample (for 100K reads)
Use Case: Visual inspection of alignments, splice junction analysis, fusion detection
3. secondary_analyses/quantification_htseq/ Directory#
Gene-level quantification from STAR and HTSeq:
| File | Description |
|---|---|
| df_gene_counts_starhtseq.tsv | Main gene counts table - Contains ENSEMBL gene IDs, gene symbols, gene biotypes, and HTSeq counts for each sample and detected gene |
| df_mt_gene_counts_starhtseq.tsv | Mitochondrial metrics - Contains MT_counts (mitochondrial gene counts), Total_counts (total detected gene counts), and PropMT (proportion of MT to total counts) per sample |
| df_gene_types_detected_summary_starhtseq.tsv | Gene biotype summary - Details the number and proportion of various gene types detected: protein-coding genes, lncRNAs, pseudogenes, miRNAs, etc. |
| matrix_gene_counts_starhtseq.txt | Count matrix - Project-level matrix with all samples (columns) and read counts for all genes (rows). Ready for differential expression analysis. |
| HouseKeepingGenes_Expression.pdf | Housekeeping gene heatmap - Clustergram showing expression consistency of housekeeping genes across samples |
| HouseKeepingGenes_CV_mqc.tsv | Coefficient of variation - CV rates for housekeeping genes, indicating technical variability |
| HouseKeepingGenes_Counts_mqc.tsv | Housekeeping counts - Raw counts for housekeeping genes used for QC assessment |
Example Files:
- df_gene_counts_starhtseq.tsv
- df_mt_gene_counts_starhtseq.tsv
- df_gene_types_detected_summary_starhtseq.tsv
- matrix_gene_counts_starhtseq.txt
- HKGenes_Expression__mqc.png (housekeeping gene heatmap)
Use Case:
- Use matrix_gene_counts_starhtseq.txt for differential expression analysis in R (DESeq2, edgeR)
- Check df_mt_gene_counts_starhtseq.tsv for contamination or poor cell quality (high PropMT indicates dying cells)
- Review df_gene_types_detected_summary_starhtseq.tsv to assess library complexity
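PropMT is the ratio of MT_counts to Total_counts, so samples with a high mitochondrial fraction can be flagged with a few lines of code. This is an illustrative sketch: the function names are hypothetical, the input is a plain dictionary rather than the TSV file itself, and the 0.2 cutoff is taken from the Prop_mitochondrion row in Appendix B.

```python
def prop_mt(mt_counts, total_counts):
    """Proportion of mitochondrial counts among all detected gene counts."""
    if total_counts == 0:
        return 0.0
    return mt_counts / total_counts


def flag_high_mt(samples, threshold=0.2):
    """Return sample names whose PropMT exceeds the threshold.

    `samples` maps a sample name to a (MT_counts, Total_counts) pair.
    """
    return sorted(
        name
        for name, (mt_counts, total_counts) in samples.items()
        if prop_mt(mt_counts, total_counts) > threshold
    )


# Hypothetical counts for two samples.
print(flag_high_mt({"SampleA": (50, 1000), "SampleB": (300, 1000)}))
```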
4. secondary_analyses/quantification_salmon/ Directory#
Transcript-level quantification from Salmon:
Transcript-Level Files:
| File | Description |
|---|---|
| df_transcript_counts_salmon.tsv | Main transcript table - Contains ENSEMBL transcript IDs, transcript lengths, TPM values, and both scaled and unscaled transcript counts |
| matrix_transcript_raw_salmon.tsv | Raw transcript counts - Unscaled read counts per transcript across all samples |
| matrix_transcript_tpm_salmon.tsv | TPM-scaled transcripts - Transcripts Per Million scaled by library size |
| matrix_transcript_length_tpm_salmon.tsv | Length-scaled TPM - Scaled first by average transcript length, then by library size (recommended for differential expression) |
| df_transcript_types_detected_summary_salmon.tsv | Transcript biotype summary - Number and types of transcripts detected per sample |
Example Files:
- df_transcript_counts_salmon.tsv
Gene-Level Files (Salmon-based):
| File | Description |
|---|---|
| df_gene_counts_salmon.tsv | Salmon gene counts - Gene-level counts generated by collapsing transcript counts from the same gene |
| df_mt_gene_counts_salmon.tsv | Mitochondrial metrics - Similar to HTSeq version but derived from Salmon quantification |
| df_gene_types_detected_summary_salmon.tsv | Gene biotype summary - Gene types detected using Salmon-based quantification |
| matrix_gene_counts_salmon.tsv | Gene count matrix - Unscaled gene counts across all samples |
| matrix_gene_tpm_salmon.tsv | Gene TPM matrix - TPM-scaled gene expression values |
| matrix_gene_length_tpm_salmon.tsv | Length-scaled gene TPM - Recommended for differential expression analysis using tximport-compatible tools |
Example Files:
- df_gene_counts_salmon.tsv
- matrix_gene_length_tpm_salmon.tsv
Use Case:
- Use matrix_transcript_length_tpm_salmon.tsv for transcript-level differential expression
- Use matrix_gene_length_tpm_salmon.tsv with tximport in R for gene-level DE analysis
- Salmon quantification is faster and doesn't require full alignment
- Compare Salmon and HTSeq gene counts to assess quantification concordance
5. secondary_analyses/secondary_metrics/ Directory#
Comprehensive alignment and expression metrics:
| File | Description |
|---|---|
| pipeline_metrics_summary_percents.csv | Percentage-based metrics - Contains metrics from the "QualiMap percent stats" section: percentage of exonic reads, intronic reads, intergenic reads, and total reads aligned |
| qualimap_stats_mqc.csv | Qualimap statistics - Contains alignment stats including total aligned reads, alignments to genes, 5'-3' bias metrics |
| df_dynamicrange_expression.tsv | Expression dynamics - Contains min/max/average expression, bottom 10% and top 10% expression levels, and dynamic range for each sample |
Example Files:
- pipeline_metrics_summary_percents.csv
- qualimap_stats_mqc.csv
- df_dynamicrange_expression.tsv
Key Metrics Explained:
| Metric | Description | Good Quality Range |
|---|---|---|
| reads.aligned_P_Total | Percentage of reads successfully aligned | > 50% |
| exonic_P_gen | Percentage of reads mapping to exons | > 60% |
| intergenic_P_gen | Percentage of reads in intergenic regions | < 10% |
| dynamic_range_expr | Spread of expression levels (top10% - bottom10%) | Higher indicates more diverse expression |
6. secondary_analyses/insert_size/ Directory#
Insert size distribution metrics (for paired-end sequencing):
| File | Description |
|---|---|
| aggregated_insert_size_summary.tsv | Aggregated insert size metrics - Contains mean and median insert sizes calculated separately for different RNA biotypes across all samples |
This file provides insert size statistics stratified by gene biotype, useful for assessing library quality and fragment size distribution:
Biotype Columns:
- Protein_Coding_Mean / Protein_Coding_Median: Insert sizes from reads mapping to protein-coding genes (most informative for library prep QC)
- Ribosomal_RNA_Mean / Ribosomal_RNA_Median: Insert sizes from ribosomal RNA genes (indicates rRNA depletion efficiency)
- Mitochondrial_RNA_Mean / Mitochondrial_RNA_Median: Insert sizes from mitochondrial genes
- Small_RNAs_Mean / Small_RNAs_Median: Insert sizes from small nuclear/nucleolar RNAs
- MicroRNAs_Mean / MicroRNAs_Median: Insert sizes from microRNA genes
Example Files:
- aggregated_insert_size_summary.tsv
7. tertiary_analyses/classification_cell_typing/ Directory#
Cell type prediction results:
| File | Description |
|---|---|
| df_cell_typing_summary_singler_hpca_gtex_tcga.tsv | Cell type predictions - Contains assigned Cell Phase (G1, S, G2M), Progenitor Type, Tissue Type, TGCA Tissue Type, and TGCA Tumor Type for each sample |
| df_cell_typing_scores_singler_hpca_gtex_tcga.tsv | Prediction scores - Confidence scores for each sample against all possible phases, progenitors, tissues, etc. Used to assess classification confidence |
Example Files:
- df_cell_typing_summary_singler_hpca_gtex_tcga.tsv
- df_cell_typing_scores_singler_hpca_gtex_tcga.tsv
Cell Classification Categories:
- Phase: Cell cycle stage (G1, S, G2M)
- Progenitor: Cell lineage type (e.g., B_cell, T_cell, Monocyte)
- Tissue: Predicted tissue of origin (e.g., Blood, Spleen, Brain)
- TGCA Tissue: TCGA reference tissue type
- TGCA Tumor: TCGA tumor classification
Use Case: Validate expected cell types, detect contamination, identify unexpected cell populations
8. tertiary_analyses/qc_plots/ Directory#
Aggregate quality control visualizations:
| File | Description |
|---|---|
| composition_rnaqc.pdf | QC category distribution - Bar chart showing number and percentage of samples in each quality category (1-5) |
| RNAQC_composition_mqc.jpg | JPEG version for inclusion in reports |
| summary_verdict.txt | Overall summary - Table with number and proportion of samples per quality category |
| summary_verdict_group.txt | Group-stratified summary - Quality metrics broken down by experimental groups (if specified in input CSV) |
Example Files:
- composition_rnaqc.pdf
- RNAQC_composition_mqc.jpg
- summary_verdict.txt
- summary_verdict_group.txt
9. read_counts/ Directory#
Read count tracking at each processing step:
Tracks the number of reads remaining after subsampling, trimming, filtering, and alignment. Useful for identifying where read loss occurs.
10. execution_info/ Directory#
Pipeline execution metadata:
| File | Description |
|---|---|
| input_dataset.csv | Copy of the input CSV used to run the pipeline |
| execution_report.html | Nextflow execution report with resource usage and runtime statistics |
| execution_timeline.html | Visual timeline of task execution |
| execution_trace.txt | Detailed trace of all executed tasks |
| params_file.json | Complete parameter settings used for the pipeline run |
| tool_versions.yml | Software versions for all tools used in the pipeline (STAR, Salmon, HTSeq, etc.) |
Use Case: Troubleshooting, reproducibility, performance optimization, documenting analysis parameters
Appendix#
A. FASTQ Naming Conventions#
The BJ-Expression pipeline accepts Illumina FASTQ files following these naming patterns:
Standard Illumina Format#
{SampleName}_S{SampleNumber}_L{LaneNumber}_R{ReadNumber}_001.fastq.gz
Components:
- {SampleName}: User-defined sample identifier (e.g., Expression-test1, Sample_A)
- S{SampleNumber}: Sample number from sample sheet (e.g., S1, S12)
- L{LaneNumber}: Sequencing lane (e.g., L001, L002)
- R{ReadNumber}: Read direction (R1 for forward, R2 for reverse)
- 001: File segment (always 001 for single files)
Example (a matching R1/R2 pair):
Expression-test1_S1_L001_R1_001.fastq.gz
Expression-test1_S1_L001_R2_001.fastq.gz
Simplified Format (Also Accepted)#
1 2 | |
Requirements:
- Files must be gzip compressed (.fastq.gz or .fq.gz)
- Forward and reverse reads must have matching sample names
- Read pairs must be indicated by _R1_ and _R2_ (or _R1. and _R2.)
- Both reads in a pair are required (the pipeline does not accept a single read file from paired-end data)
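The pairing requirements above can be checked on a list of file names before upload. The sketch below groups files by everything except the _R1/_R2 token and reports any file without a mate; `find_unpaired` is a hypothetical helper, and the rule of using the last _R1/_R2 token in the name is an assumption made to tolerate sample names that themselves contain such a token.

```python
import re
from collections import defaultdict

# _R1 or _R2 followed by "_" or ".", as described in the requirements.
READ_TOKEN = re.compile(r"_R([12])(?=[_.])")


def find_unpaired(filenames):
    """Return the sorted file names that lack a matching R1 or R2 mate."""
    groups = defaultdict(dict)
    for name in filenames:
        matches = list(READ_TOKEN.finditer(name))
        if not matches:
            groups[name]["?"] = name  # no read token at all
            continue
        token = matches[-1]  # assumption: the last token marks the read
        key = name[:token.start()] + name[token.end():]
        groups[key][token.group(1)] = name
    unpaired = []
    for reads in groups.values():
        if not set(reads) >= {"1", "2"}:
            unpaired.extend(reads.values())
    return sorted(unpaired)
```

For example, a list containing Expression-test1_S1_L001_R1_001.fastq.gz and its R2 mate plus a lone b_R1.fastq.gz would report only b_R1.fastq.gz.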
B. Selected Metrics Descriptions#
The following table describes key metrics found in the MultiQC report and output files:
| Metric | Description | Typical Range | Quality Indicator |
|---|---|---|---|
| SampleId | Unique sample identifier | N/A | Must match biosampleName from input CSV |
| BJ_calculated_raw_read_pairs | Total raw read pairs calculated by BaseJumper | Varies by design | Project-dependent |
| Nonsubsampled_reads | Total reads before subsampling step | Varies by design | Project-dependent |
| Subsampled_reads | Reads after subsampling (100,000 default) | 100,000 if enabled | Should match target if enabled |
| Pass_filtered_reads | Reads passing FASTP quality filters | > 80% of input | Higher indicates good quality |
| Low_quality_reads | Reads removed due to low quality scores | < 10% of input | Lower is better |
| Many_Ns_reads | Reads with excessive N bases removed | < 5% of input | Lower is better |
| Too_short_reads | Reads too short after trimming | < 10% of input | Lower is better |
| Prop_pass_filtered_reads | Proportion of reads passing all filters | > 0.8 (80%) | Higher indicates good quality |
| Prop_low_quality_reads | Proportion failing quality threshold | < 0.1 (10%) | Lower is better |
| Prop_of_many_Ns_reads | Proportion with too many N bases | < 0.05 (5%) | Lower is better |
| Prop_of_too_short_reads | Proportion too short after trimming | < 0.1 (10%) | Lower is better |
| Prop_mappability | Percentage of reads successfully aligned | 40-80% | Higher indicates good library quality |
| Prop_exonic | Percentage of reads mapping to exons | 60-80% | Higher indicates good mRNA enrichment |
| Prop_intronic | Percentage of reads mapping to introns | 10-30% | Too high may indicate pre-mRNA contamination |
| Prop_intergenic | Percentage of reads in intergenic regions | < 10% | Lower indicates less genomic DNA contamination |
| Prop_mitochondrion | Proportion of mitochondrial reads | < 0.2 (20%) | Higher may indicate cell stress or poor quality |
| Protein_coding_genes | Number of protein-coding genes detected | > 5,000 | Higher indicates good library complexity |
| Protein_coding_transcripts | Number of protein-coding transcripts detected | > 10,000 | Higher indicates isoform diversity |
| Protein_coding_Mean_insertsize | Mean insert size from protein-coding genes | 150-300 bp (varies by protocol) | Should match expected library prep size |
| Protein_coding_Median_insertsize | Median insert size from protein-coding genes | 150-300 bp (varies by protocol) | More robust to outliers than mean |
| Mean_number_of_txs_per_gene | Average transcripts detected per gene | 1.5-3.0 | Higher indicates isoform detection |
| Coverage_ratio_gene_body | Ratio of 5' to 3' coverage across genes | 0.8-1.2 | ~1.0 indicates no bias |
| Median_coverage_gene_body | Median coverage depth across gene bodies | Varies by depth | Higher indicates better coverage |
| Dynamic_range | Expression spread (top 10% - bottom 10%) | 3,000-6,000 | Higher indicates diverse expression |
| Cell_type_Phase | Predicted cell cycle phase | G1, S, or G2M | Validation of expected phase |
| Cell_type_Progenitor | Predicted progenitor cell type | B_cell, T_cell, etc. | Validation of expected lineage |
| Cell_type_Tissue | Predicted tissue of origin | Blood, Brain, etc. | Validation of expected tissue |
| Cell_type_TGCA_tissue | TCGA reference tissue classification | Various tissue types | Reference-based classification |
| Cell_type_TGCA_tumor | TCGA tumor type classification | Various tumor types | Tumor-specific classification |
C. ClusterQC Cutoffs and Interpretation#
Quality Metrics Explained#
1. Mappability (Alignment Rate)
   - What it measures: Percentage of reads successfully aligned to the reference genome
   - Interpretation:
     - High (>70%): Excellent library quality, minimal contamination
     - Medium (50-70%): Good quality, acceptable for analysis
     - Low (<50%): Possible contamination, degraded RNA, or wrong reference genome
2. Proportion Exonic
   - What it measures: Percentage of aligned reads mapping to exon regions
   - Interpretation:
     - High (>75%): Excellent mRNA enrichment, successful library prep
     - Medium (60-75%): Acceptable mRNA content
     - Low (<60%): Poor mRNA enrichment, possible genomic DNA contamination or pre-mRNA
3. Proportion Intergenic
   - What it measures: Percentage of aligned reads mapping to intergenic regions
   - Interpretation:
     - Low (<5%): Excellent specificity to genic regions
     - Medium (5-10%): Acceptable background
     - High (>10%): Genomic DNA contamination or poor library enrichment
4. Proportion Mitochondria
   - What it measures: Percentage of reads mapping to mitochondrial genes
   - Interpretation:
     - Low (<10%): Healthy cells with intact cytoplasmic RNA
     - Medium (10-20%): Acceptable for some cell types
     - High (>20%): Cell stress, apoptosis, or poor sample quality
5. Number of Protein Coding Genes
   - What it measures: Total number of protein-coding genes with detected expression
   - Interpretation:
     - High (>10,000): Excellent library complexity
     - Medium (5,000-10,000): Good complexity, suitable for analysis
     - Low (<5,000): Poor library complexity or insufficient sequencing depth
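The cutoffs above can be applied programmatically when screening many samples. The following is a minimal sketch, not the pipeline's own ClusterQC code: the metric keys (`mappability`, `prop_exonic`, and so on) are hypothetical names chosen for illustration and will likely differ from the column names in your exported QC table. Values that fall exactly on a cutoff are assigned to the middle band.

```python
# Thresholds copied from the interpretation bands listed above.
# Each entry: (lower cutoff, upper cutoff, True if higher values are better).
# The dictionary keys are hypothetical labels, not pipeline column names.
CUTOFFS = {
    "mappability": (50.0, 70.0, True),        # percent of reads aligned
    "prop_exonic": (60.0, 75.0, True),        # percent of aligned reads in exons
    "prop_intergenic": (5.0, 10.0, False),    # percent in intergenic regions
    "prop_mito": (10.0, 20.0, False),         # percent mitochondrial
    "n_protein_coding": (5000, 10000, True),  # detected protein-coding genes
}


def classify(metric, value):
    """Return 'good', 'acceptable', or 'poor' for a single metric value."""
    lower, upper, higher_is_better = CUTOFFS[metric]
    if value < lower:
        return "poor" if higher_is_better else "good"
    if value > upper:
        return "good" if higher_is_better else "poor"
    return "acceptable"


def qc_report(sample_metrics):
    """Classify every known metric in a {metric: value} mapping."""
    return {
        metric: classify(metric, value)
        for metric, value in sample_metrics.items()
        if metric in CUTOFFS
    }


# Illustrative values only, not real pipeline output.
report = qc_report({"mappability": 82.5, "prop_mito": 25.0, "n_protein_coding": 7200})
print(report)
```

A sample flagged "poor" on one metric is not necessarily unusable; treat the labels as prompts for closer inspection.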
D. How Adapter Trimming Works#
The BJ-Expression pipeline uses a two-step adapter trimming approach for optimal read cleanup:
CUTADAPT (Step 1: Targeted Adapter Removal)#
CUTADAPT performs the initial adapter removal using known adapter sequences:
Adapter Removal Strategy:
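- Removes adapters from the 3' end of reads (-a flag) and from either end (-b flag; in cutadapt, -b matches an adapter at the 5' or the 3' end, while -g is the 5'-only option)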
- Processes both R1 and R2 reads independently with specified adapters
- Matches reads against the specified adapter sequences for precise removal (cutadapt tolerates a small fraction of mismatches by default)
- Particularly effective for removing known library prep adapters
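For readers who want to reproduce this step outside BaseJumper, the sketch below assembles a paired-end cutadapt command line. It is an illustration under stated assumptions, not the pipeline's actual invocation: the file names are placeholders, and reusing the manual's default adapter sequence for both reads is an assumption. The flags themselves are standard cutadapt options (`-a`/`-A` for a 3' adapter on R1/R2, `-b`/`-B` for an adapter that may occur at either end, `-o`/`-p` for the R1/R2 outputs).

```python
import shlex

# Default adapter quoted in this manual; using it for cutadapt is an assumption.
DEFAULT_ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACA"


def build_cutadapt_command(r1_in, r2_in, r1_out, r2_out,
                           adapter=DEFAULT_ADAPTER, anywhere=False):
    """Return a cutadapt argument list for paired-end adapter trimming."""
    if anywhere:
        # -b / -B: adapter may occur at the 5' or the 3' end of R1 / R2.
        adapter_args = ["-b", adapter, "-B", adapter]
    else:
        # -a / -A: adapter expected at the 3' end of R1 / R2.
        adapter_args = ["-a", adapter, "-A", adapter]
    return ["cutadapt", *adapter_args, "-o", r1_out, "-p", r2_out, r1_in, r2_in]


# Placeholder file names for illustration.
command = build_cutadapt_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
)
print(shlex.join(command))
```

The argument list can be passed to `subprocess.run(command, check=True)` on a machine where cutadapt is installed.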
FASTP (Step 2: Comprehensive QC and Trimming)#
FASTP performs comprehensive adapter and quality trimming:
Adapter Detection:
- Automatically detects adapter sequences by analyzing read ends
- Can use user-specified adapter sequences (recommended for single-cell RNA-seq)
- Detects and removes both 3' and 5' adapters
Quality Trimming:
- Removes low-quality bases from read ends (default Q score < 15)
- Performs sliding window quality trimming
- Removes reads shorter than minimum length after trimming
Poly-G Tail Removal:
- For NovaSeq/NextSeq (two-color chemistry), removes artificial poly-G tails
- Poly-G tails occur when no signal is detected in two-color chemistry
- Automatically detected based on instrument parameter
Read Filtering:
- Removes reads with too many N bases
- Filters reads below complexity threshold
- Removes extremely short reads after trimming
Output:
- Trimmed FASTQ files with improved quality
- JSON report with detailed trimming statistics
- HTML report with visual QC plots
Technical Note: The default adapter sequence (AAGCAGTGGTATCAACGCAGAGTACA) is optimized for single-cell RNA-seq protocols that use template switching. If you use a different library prep protocol, specify the adapter sequences appropriate for it.
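The JSON report produced by fastp can be summarized with a few lines of Python to see how many reads each filter removed. This is a minimal sketch: the key names (`summary`, `before_filtering`, `after_filtering`, `filtering_result`) follow the layout of fastp's JSON report as commonly documented, but verify them against your own report, since the layout may vary between fastp versions. The numbers in the inline example are made up for illustration.

```python
import json


def summarize_fastp(report):
    """Extract read counts before/after filtering from a parsed fastp report."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    filtering = report.get("filtering_result", {})
    return {
        "reads_before": before,
        "reads_after": after,
        "fraction_kept": after / before if before else 0.0,
        "low_quality": filtering.get("low_quality_reads", 0),
        "too_many_N": filtering.get("too_many_N_reads", 0),
        "too_short": filtering.get("too_short_reads", 0),
    }


# Inline stand-in for json.load(open("fastp.json")); values are illustrative only.
example_report = json.loads("""
{
  "summary": {
    "before_filtering": {"total_reads": 200000},
    "after_filtering": {"total_reads": 180000}
  },
  "filtering_result": {
    "passed_filter_reads": 180000,
    "low_quality_reads": 12000,
    "too_many_N_reads": 1000,
    "too_short_reads": 7000
  }
}
""")

summary = summarize_fastp(example_report)
print(summary)
```

A large `too_short` count relative to `reads_before` often points to heavy adapter content, since reads that are mostly adapter fall below the minimum length after trimming.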
E. Frequently Asked Questions#
Q: How do I download gene or transcript count tables?
A: Follow these steps:
1. Export your project using BaseJumper (see Data Export manual)
2. In the export_data/ folder, navigate to:
- secondary_analyses/quantification_htseq/ for STAR/HTSeq-based gene counts
- secondary_analyses/quantification_salmon/ for Salmon-based transcript or gene counts
3. Download the matrix files for use in downstream analysis tools
Q: Are the gene or transcript count tables normalized?
A:
- STAR/HTSeq gene counts (matrix_gene_counts_starhtseq.txt): Not normalized - Raw counts suitable for DESeq2 or edgeR
- Salmon transcript counts: Available in three formats:
- matrix_transcript_raw_salmon.tsv: Raw, unnormalized counts
- matrix_transcript_tpm_salmon.tsv: TPM-scaled (normalized by library size)
- matrix_transcript_length_tpm_salmon.tsv: Length-scaled TPM (recommended for DE analysis)
- Salmon gene counts: Same normalization options as transcripts
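To make the difference between raw counts and TPM concrete, the sketch below applies the textbook TPM calculation to a toy example. It is not the pipeline's code: Salmon computes TPM internally using effective transcript lengths, and the transcript names, counts, and lengths here are invented for illustration.

```python
def tpm(counts, lengths_bp):
    """Convert raw counts to transcripts per million (TPM).

    counts and lengths_bp are dicts keyed by transcript ID.
    """
    # Step 1: divide each count by transcript length (reads per base).
    rates = {tx: counts[tx] / lengths_bp[tx] for tx in counts}
    total = sum(rates.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    # Step 2: rescale so that the values in one sample sum to one million.
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


# Two hypothetical transcripts with equal raw counts but different lengths.
raw_counts = {"tx_a": 100, "tx_b": 100}
lengths = {"tx_a": 1000, "tx_b": 4000}

values = tpm(raw_counts, lengths)
print(values)
```

Although both transcripts have the same raw count, the shorter one receives the higher TPM because the same number of reads over fewer bases implies more transcript molecules. This is why raw counts and TPM values are not interchangeable in downstream tools.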
Q: Should I disable subsampling for my analysis?
A:
- Keep enabled for initial QC and rapid sample quality assessment
- Disable for:
  - Final differential expression analysis
  - Isoform-level analysis requiring full depth
  - Low-input samples where every read matters
  - Publication-quality results
Q: Which gene counts should I use - STAR/HTSeq or Salmon?
A: Both are valid, but:
- STAR/HTSeq: Traditional approach, full alignment, good for splice junction analysis
- Salmon: Faster, includes length scaling (better for isoform differences), modern best practice
- Recommendation: Use Salmon gene counts (matrix_gene_length_tpm_salmon.tsv) with tximport in R for most analyses
Q: What does high mitochondrial percentage indicate?
A: High PropMT (>20%) can indicate:
- Cell stress or apoptosis
- Poor sample preservation
- Cell lysis during library prep

Note that some cell types (e.g., neurons, cardiomyocytes) normally show 10-15% mitochondrial reads, so interpret this metric in the context of your sample type.
Q: How many genes should I detect?
A: Typical ranges:
- Bulk RNA-seq: 15,000-20,000 genes
- Single-cell RNA-seq: 2,000-8,000 genes per cell
- With 100K subsampling: Expect lower numbers; disable subsampling for accurate gene detection
Q: Can I analyze 10x Genomics data with this pipeline?
A: Yes, enable the "10x Support" module. Note:
- Specialized processing for 10x barcode structure
- Set library protocol to "chromium"
- Different adapter sequences may be needed
Q: What if my samples show poor alignment rates?
A: Check:
1. Correct reference genome selected (human vs. mouse)
2. FASTQ quality (look at fastp report)
3. Adapter contamination (check adapter content)
4. Possible contamination (enable Kraken if available)
5. Library prep issues (consult with lab)
Q: How do I interpret the cell typing results?
A: The SingleR classifier compares your expression profile to reference databases:
- High confidence: Scores >0.8 indicate strong matches
- Low confidence: Scores <0.5 suggest ambiguous or novel cell types
- Use case: Validation of expected cell types, not definitive classification
- Always confirm with marker gene expression
Q: What files do I need for differential expression analysis?
A: You need:
1. Count matrix: matrix_gene_counts_starhtseq.txt OR matrix_gene_length_tpm_salmon.tsv
2. Sample metadata: Your original input CSV with group information
3. For R/Bioconductor:
- DESeq2: Use raw counts from STAR/HTSeq
- edgeR: Use raw counts from STAR/HTSeq
- limma-voom: Can use either, but include library size factors
- tximport + DESeq2: Use Salmon output with matrix_gene_length_tpm_salmon.tsv
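Before starting a differential expression analysis, it helps to confirm that every sample column in the count matrix has a group assignment in the metadata. The sketch below does this with the Python standard library. It makes several assumptions that you should check against your own files: the count matrix is tab-delimited with gene IDs in the first column and sample names in the header row, and the metadata CSV has a sample identifier column (called `sample_name` here, a hypothetical name) alongside the `groups` column. The inline data is illustrative only.

```python
import csv
import io


def read_matrix_samples(handle):
    """Return sample names from the header row of a tab-delimited count matrix."""
    header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]  # first column is assumed to hold gene IDs


def read_groups(handle, sample_col, group_col="groups"):
    """Return {sample: group} from the metadata CSV."""
    return {row[sample_col]: row[group_col] for row in csv.DictReader(handle)}


def check_design(samples, groups):
    """Report samples without a group, metadata rows without counts, and group sizes."""
    missing = [s for s in samples if s not in groups]
    extra = [s for s in groups if s not in samples]
    sizes = {}
    for s in samples:
        if s in groups:
            sizes[groups[s]] = sizes.get(groups[s], 0) + 1
    return missing, extra, sizes


# Inline stand-ins for the exported files.
matrix_text = "gene_id\tS1\tS2\tS3\nGENE_A\t10\t0\t5\n"
metadata_text = "sample_name,groups\nS1,control\nS2,treated\nS4,treated\n"

samples = read_matrix_samples(io.StringIO(matrix_text))
groups = read_groups(io.StringIO(metadata_text), sample_col="sample_name")
missing, extra, sizes = check_design(samples, groups)
print("No group assigned:", missing)
print("In metadata but not in matrix:", extra)
print("Samples per group:", sizes)
```

To run it on real files, replace the `io.StringIO(...)` objects with `open(path, newline="")` handles. Resolve any samples listed as missing or extra before passing the data to DESeq2, edgeR, or limma-voom.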