BJ-Expression Pipeline User Manual#

Customer Support#

If you need assistance with the BJ-Expression pipeline, several support channels are available:

BaseJumper Online Manual#

For comprehensive documentation and guides, visit the BaseJumper Documentation Portal.

Email Support#

Contact our support team at: basejumper@bioskryb.com

Online Portal Ticketing System#

For bug reports and feature requests, submit a ticket through our online portal. Your account representative can provide access details.


Before You Begin#

Account Registration#

  1. Register for a BaseJumper account: go to https://basejumper.bioskryb.com, select ‘Create Account’, and review the Terms and Conditions.
  2. Wait for account approval from your organization administrator.
  3. Log in with your credentials to access the platform.

Data Transfer Setup#

Before running the BJ-Expression pipeline, ensure your sequencing data is accessible to you:

Globus Data Transfer#

  • Recommended method for large-scale data transfers
  • Set up a Globus endpoint for your data storage
  • Globus setup instructions are available in the BaseJumper Online Manual
  • The BaseJumper support team will add you to your assigned workspace. An email confirming membership will arrive from groups.globus.org (check your Junk/Spam folder if it does not appear).
  • Contact BaseJumper support for additional Globus configuration assistance.

Alternative Transfer Methods#

  • AWS S3 bucket access (requires IAM credentials)
  • Direct upload through BaseJumper interface (for smaller datasets)
  • Contact support to discuss custom data transfer solutions

Sample Metadata Requirements#

The BJ-Expression pipeline requires a properly formatted input CSV file. To get the initial format of this CSV, open the BaseJumper project page and click the ‘Export’ button at the top of the biosample table. The exported file has the following structure (you will need to add the ‘groups’ column yourself):

Input CSV Format#

Column Name | Description | Required | Example
biosampleName | Unique identifier for each sample | Yes | Expression-test1
read1 | Path to forward reads (R1) FASTQ file | Yes | s3://bucket/sample_R1_001.fastq.gz
read2 | Path to reverse reads (R2) FASTQ file | Yes | s3://bucket/sample_R2_001.fastq.gz
groups | Optional grouping for batch analysis | No | Group1 or Control

Example Input File#

biosampleName,read1,read2,groups
Expression-test1,s3://bioskryb-data/sample1_R1_001.fastq.gz,s3://bioskryb-data/sample1_R2_001.fastq.gz,Group1
Expression-test2,s3://bioskryb-data/sample2_R1_001.fastq.gz,s3://bioskryb-data/sample2_R2_001.fastq.gz,Group1
Expression-test3,s3://bioskryb-data/sample3_R1_001.fastq.gz,s3://bioskryb-data/sample3_R2_001.fastq.gz,Group2
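
Before uploading, it can help to check the CSV against the column table above. The following is a minimal sketch, not part of BaseJumper: the function name `validate_input_csv` is hypothetical, and it only checks what this manual states (required columns present, unique `biosampleName`, read paths ending in `.fastq.gz`).

```python
import csv
import io

# Required columns as listed in the Input CSV Format table.
REQUIRED_COLUMNS = ("biosampleName", "read1", "read2")


def validate_input_csv(text):
    """Return a list of human-readable problems found in the input CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name = (row["biosampleName"] or "").strip()
        if not name:
            problems.append(f"line {line_no}: empty biosampleName")
        elif name in seen:
            problems.append(f"line {line_no}: duplicate biosampleName {name!r}")
        seen.add(name)
        for col in ("read1", "read2"):
            path = (row[col] or "").strip()
            if not path.endswith(".fastq.gz"):
                problems.append(f"line {line_no}: {col} is not a .fastq.gz path")
    return problems


example = """biosampleName,read1,read2,groups
Expression-test1,s3://bioskryb-data/sample1_R1_001.fastq.gz,s3://bioskryb-data/sample1_R2_001.fastq.gz,Group1
Expression-test2,s3://bioskryb-data/sample2_R1_001.fastq.gz,s3://bioskryb-data/sample2_R2_001.fastq.gz,Group1
"""
print(validate_input_csv(example))  # an empty list means no problems were found
```

The check works on the CSV text only; it does not verify that the S3 paths exist or are accessible from your account.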

FASTQ Naming Conventions#

BJ-Expression accepts standard Illumina FASTQ naming formats:

  • {SampleName}_S{SampleNumber}_L{Lane}_R{ReadNumber}_001.fastq.gz
  • Example: Expression-test1_S1_L001_R1_001.fastq.gz

Important Notes:

  • Files must be gzipped (.fastq.gz extension)
  • Read pairs must have matching sample names
  • S3 paths should be accessible from your BaseJumper account
  • CRAM files are not currently supported; contact ResolveServices for CRAM processing


Project Design#

Creating a New Project#

  1. Navigate to Projects: From the BaseJumper dashboard, click the "Create Project" button in the upper right corner

  2. Project Configuration:

     • Project Name: Choose a descriptive, unique name for your project
     • Description: Add optional project details
     • Select Pipeline: Choose BJ-Expression from the pipeline dropdown menu

  3. Sample Selection:

     • Select samples from your Shared Data directory. If you have an Illumina BaseSpace Sequence Hub token, you can follow the instructions in the BaseJumper Manual to provide this token to the Support email.

Shared Data Requirements#

If using samples from Shared Data:

  • Samples must follow Illumina naming conventions
  • Files must be in .fastq.gz format
  • Both paired-end reads (R1 and R2) must be present

Sample Organization#

Before submitting your project, consider:

  • Which samples to include? Select only samples you want to analyze together
  • Metadata grouping: Use the groups column to organize samples by experimental conditions
  • Quality control: Ensure all samples have sufficient read depth (minimum 5,000 reads)

⚠️ Warning: Samples with fewer than 5,000 reads may cause the pipeline to fail. Verify data quality before submission.
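
If the FASTQ files are available locally, the read-depth check can be scripted before submission. This is a minimal sketch under stated assumptions: it reads local gzipped files only (not S3 paths), assumes the standard four-lines-per-record FASTQ layout, and the function names are hypothetical.

```python
import gzip

MIN_READS = 5_000  # minimum read depth stated in this manual


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def has_enough_reads(path, minimum=MIN_READS):
    """Return True if the file holds at least `minimum` reads."""
    return count_fastq_reads(path) >= minimum
```

For paired-end data, running the check on the R1 file is usually enough, since R1 and R2 should contain the same number of records.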

Automatic Pipeline Queuing#

By default, projects are automatically queued for execution after creation. To disable this:

  1. Uncheck the "Auto-queue project" option during project setup
  2. Manually start the pipeline from the project dashboard when ready

Pipeline Parameters#

Selecting Pipeline Version#

  1. From the pipeline configuration screen, select your preferred pipeline version
  2. Default: The most recent stable version (recommended)
  3. For reproducibility, you can select specific previous versions

Core Parameters#

Configure the following parameters based on your experimental design:

Reference Genome#

Select the reference genome that matches your sample organism:

Genome | Description
GRCh38 (Default) | Human reference genome, build 38
GRCh37 | Human reference genome, build 37
GRCm39 | Mouse reference genome, build 39

Default: GRCh38

Read Length#

Option | Description | Use Case
50 (Default) | 50 base pair reads | Standard single-cell RNA-seq
75 | 75 base pair reads | Higher resolution analysis
100 | 100 base pair reads | Deep sequencing applications
150 | 150 base pair reads | Long-read RNA-seq protocols

Default: 50

Adapter Sequences#

Specify adapter sequences for trimming:

Parameter | Default Value | Description
Adapter Sequence for first read | AAGCAGTGGTATCAACGCAGAGTACA | Adapter sequence trimmed from R1 reads
Adapter Sequence for second read | AAGCAGTGGTATCAACGCAGAGTACAT | Adapter sequence trimmed from R2 reads

💡 Note: These defaults are optimized for standard single-cell RNA-seq library prep protocols. Your sequencing analysis workflow should provide these adapter sequences, which you can use here.

Optional Modules#

Enable or disable optional analysis modules based on your needs:

Module | Default State | Description
FastQ Reads Subsampling | ✅ Enabled | Randomly subsamples reads to 100,000 reads per sample for faster QC analysis. For comprehensive analysis, disable this option to use all reads.
CUTADAPT Adapter Removal | ✅ Enabled | Performs targeted adapter removal using user-specified sequences before FASTP trimming. Removes adapters from both 5' and 3' ends of reads.
10x Support | ❌ Disabled | Enables analysis of 10x Genomics Chromium data with specialized processing

Subsampling Details:

  • Default: 100,000 reads per sample
  • Purpose: Enables faster computation time while providing sufficient data for QC metrics comparison
  • Recommendation: Keep enabled for initial QC, disable for final full-depth analysis. Keeping subsampling enabled keeps data comparable from project to project; to get the most out of phenotype prediction, isoform detection, or variant calling, deselect it.

CUTADAPT Details:

  • Default: Enabled
  • Purpose: Removes known adapter sequences efficiently before comprehensive FASTP analysis
  • Recommendation: Keep enabled when using standard library prep protocols with known adapter sequences; can be disabled if adapters are minimal or if FASTP alone is sufficient


Workflow Overview#

Pipeline Execution Steps#

The BJ-Expression pipeline performs the following analyses:

flowchart LR
%% Colors %%
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff

    Start((Start)):::black --> Subsample[Subsample Reads<br/>SEQTK]:::blue
    Subsample --> Trim[Trim Adapters<br/>FASTP]:::blue
    Trim --> Salmon[Transcript Quantification<br/>SALMON]:::green
    Trim --> STAR[Splice-aware Alignment<br/>STAR]:::green
    STAR --> HTSeq[Gene Quantification<br/>HTSeq]:::green
    STAR --> Qualimap[Alignment QC<br/>Qualimap]:::pink
    STAR --> CellType[Cell Typing<br/>SingleR]:::pink
    STAR --> PCA[PCA Analysis]:::pink
    Salmon --> MultiQC[Aggregate Report<br/>MultiQC]:::orange
    HTSeq --> MultiQC
    Qualimap --> MultiQC
    CellType --> MultiQC
    PCA --> MultiQC
    MultiQC --> End((End)):::black

Processing Steps Detail#

1. Read Subsampling (SEQTK)#

  • Randomly selects 100,000 reads per sample (default)
  • Enables rapid QC and consistent cross-sample comparison
  • Can be disabled for full-depth analysis
  • Output: Subsampled FASTQ files

💡 Why subsample? For initial QC or comparison between runs, subsampling enables faster computation while providing sufficient data for quality assessment.
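
To illustrate the idea of fixed-size random subsampling, here is a minimal reservoir-sampling sketch. It is not the SEQTK implementation used by the pipeline; the function name and the `seed` default are hypothetical. It samples R1/R2 records together so that pairs stay in sync, and a fixed seed makes the selection repeatable.

```python
import random


def subsample_pairs(pairs, n=100_000, seed=100):
    """Reservoir-sample up to n items from an iterable of (read1, read2) pairs."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < n:
            # Fill the reservoir with the first n pairs.
            reservoir.append(pair)
        else:
            # Replace an existing entry with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = pair
    return reservoir


# Toy input: 1,000 read pairs identified by name only.
toy_pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
sample = subsample_pairs(toy_pairs, n=10, seed=1)
print(len(sample))
```

If a sample has fewer than `n` pairs, every pair is kept, which mirrors the intuition that subsampling cannot add reads.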

2. Adapter Removal (CUTADAPT)#

  • Removes adapter sequences from both 5' and 3' ends
  • Uses user-specified adapter sequences for targeted removal
  • Trims adapters from both forward and reverse reads
  • Output: Adapter-trimmed FASTQ files

💡 Note: CUTADAPT runs before FASTP to remove known adapter sequences, then FASTP performs additional quality trimming and filtering.

3. Quality Trimming and QC (FASTP)#

  • Performs additional adapter detection and removal
  • Trims low-quality bases from read ends
  • Performs poly-G tail removal (for two-color chemistry)
  • Filters reads below quality threshold
  • Generates comprehensive QC metrics
  • Output: Trimmed FASTQ files, QC JSON reports

4. Transcript-Level Quantification (SALMON)#

  • Uses pseudo-alignment for rapid transcript (isoform) quantification
  • Maps reads to transcript sequences without full alignment
  • Generates TPM (Transcripts Per Million) values and normalizes for isoform length
  • Provides both transcript and gene-level counts
  • Output: Transcript counts, TPM values, gene counts
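
As a worked illustration of the TPM normalization mentioned above, the sketch below applies the standard TPM definition: divide each transcript's count by its length, then rescale so the values sum to one million. The function name and the toy numbers are hypothetical, and Salmon's own calculation uses effective lengths estimated from the data.

```python
def tpm(counts, lengths):
    """Convert per-transcript read counts to TPM using transcript lengths."""
    # Reads per base: corrects for longer transcripts attracting more reads.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Rescale so that all values in the sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Toy example: three transcripts with different lengths.
values = tpm(counts=[100, 200, 300], lengths=[1000, 2000, 1000])
print([round(v) for v in values])
```

In this toy case the first two transcripts end up with the same TPM even though the second has twice the reads, because it is also twice as long.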

5. Splice-Aware Alignment (STAR)#

  • Aligns reads to reference genome with splice junction detection
  • Handles intron-spanning reads for RNA-seq data
  • Generates chimeric junction output for fusion detection
  • Output: BAM alignment files, junction files

6. Primary Read Extraction (SAMTOOLS)#

  • Extracts primary alignments from STAR output
  • Filters secondary and supplementary alignments
  • Creates indexed BAM files
  • Output: Filtered BAM files with indexes
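
Primary, secondary, and supplementary alignments are distinguished by bits in the SAM FLAG field. The sketch below shows that test in isolation; it illustrates the flag logic only and is not the pipeline's SAMTOOLS command, which this manual does not specify.

```python
# FLAG bits defined by the SAM specification.
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment (e.g. part of a chimeric read)


def is_primary(flag):
    """Return True if a SAM FLAG has neither the secondary nor supplementary bit set."""
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0


# 99 = paired, proper pair, mate reverse, first in pair.
for flag in (99, 99 + SECONDARY, 99 + SUPPLEMENTARY):
    print(flag, is_primary(flag))
```

The combined mask is 0x900 (decimal 2304), which is the value commonly passed to exclusion filters when only primary alignments are wanted.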

7. Gene-Level Quantification (HTSeq)#

  • Counts reads overlapping gene features
  • Uses STAR alignments and GTF annotations
  • Provides raw read counts per gene
  • Output: Gene count matrices

8. Alignment Quality Assessment (QUALIMAP)#

  • Evaluates alignment quality metrics
  • Calculates exonic/intronic/intergenic read proportions
  • Assesses 5'-3' bias
  • Generates coverage profiles
  • Output: Quality metrics, coverage statistics

9. Cell Typing Classification (SingleR)#

  • Predicts cell type based on gene expression patterns
  • Uses reference databases: HPCA, GTEx, TCGA
  • Classifies by cell phase, progenitor type, tissue type
  • Output: Cell type predictions with confidence scores

10. Principal Component Analysis (PCA)#

  • Identifies sample relationships and batch effects
  • Visualizes expression patterns across samples
  • Detects outliers or technical artifacts
  • Output: PCA plots, variance explained

11. Report Generation (MultiQC)#

  • Aggregates metrics across all samples and tools
  • Creates interactive HTML report
  • Generates summary tables and visualizations
  • Output: multiqc_report.html

File Naming Conventions#

Input Files#

  • Forward reads: {biosampleName}_R1_001.fastq.gz
  • Reverse reads: {biosampleName}_R2_001.fastq.gz

Output Files#

  • Aligned BAM: {biosampleName}.bam
  • BAM index: {biosampleName}.bam.bai
  • Gene counts: {biosampleName}.htseq_counts.tsv
  • Chimeric junctions: {biosampleName}_Chimeric.out.junction

Estimated Processing Time#

Project Size | Estimated Time | Notes
10 samples | <1 hour | With 100K reads subsampling
50 samples | 1-3 hours | Standard QC analysis
100 samples | 2-6 hours | Large batch processing
384 samples | 4-12 hours | High-throughput plate

Factors affecting runtime:

  • Read depth per sample
  • Subsampling enabled/disabled
  • Reference genome size
  • Current cluster load


Summarizing Output#

MultiQC Report#

The MultiQC report aggregates all QC metrics into a single interactive HTML file located at:

multiqc/multiqc_report.html

MultiQC Report Sections#

1. General Statistics#

Overview table showing key metrics for all samples:

  • Total reads
  • Alignment rate
  • Exonic read percentage
  • Number of genes detected
  • Mitochondrial read percentage

2. Selected Metrics#

Curated selection of the most important QC metrics across all analysis tools. This section consolidates key quality indicators from read processing, alignment, quantification, and cell typing analyses into a single comprehensive table for easy comparison across samples. Metrics include read counts at various pipeline stages, quality filtering statistics, alignment proportions (exonic, intronic, intergenic, mitochondrial), gene and transcript detection counts, insert size measurements, coverage metrics, expression dynamics, and automated cell type classifications. See Appendix B for detailed descriptions of all metrics, including typical ranges and quality indicators.

3. Fastp Section#

Trimming and filtering statistics:

  • Reads passed/filtered
  • Adapter trimming rates
  • Quality filtering results
  • Poly-G tail removal (if applicable)
  • Before/after comparison plots

4. Salmon Section#

Transcript quantification metrics:

  • Mapping rate
  • Number of transcripts detected
  • Library type detection
  • Fragment length distribution

5. STAR Alignment Section#

Splice-aware alignment statistics:

  • Uniquely mapped reads percentage
  • Multi-mapped reads
  • Unmapped reads
  • Chimeric reads
  • Reads mapped to multiple loci

6. Qualimap Section#

Alignment quality metrics:

  • Exonic reads: Percentage of reads mapping to exons
  • Intronic reads: Percentage of reads mapping to introns
  • Intergenic reads: Percentage of reads in intergenic regions
  • 5'-3' bias: Coverage uniformity along transcript length
  • Gene body coverage: Distribution of reads across gene features

7. Gene Detection Section#

Gene-level quantification summary:

  • Total genes detected
  • Protein-coding genes
  • lncRNAs, pseudogenes, miRNAs
  • Gene type distribution
  • Mitochondrial gene proportion

8. Cell Typing Section#

Automated cell type classification:

  • Predicted cell type
  • Cell cycle phase
  • Tissue type prediction
  • Confidence scores

9. Expression Dynamics#

Dynamic range and expression metrics:

  • Housekeeping gene expression consistency
  • Coefficient of variation (CV) for housekeeping genes
  • Expression range (min/max/average)
  • Top and bottom 10% expression levels

10. PCA Analysis#

Sample clustering and relationships:

  • Principal component plots
  • Variance explained by each PC
  • Sample outlier detection
  • Group separation visualization

11. ClusterQC Composition#

Quality category distribution:

  • Number of samples per quality category
  • Overall batch quality assessment
  • Category definitions (0-5 scale)

View Example MultiQC Report

ClusterQC Analysis#

ClusterQC provides visual assessment of RNA-seq sample quality:

RNA-QC Composition Plot#

Location: tertiary_analyses/qc_plots/composition_rnaqc.pdf

This plot displays the distribution of samples across quality categories.

Quality Categories:

Category | Quality Level | Characteristics
Category 5 | Excellent | High alignment rate (>70%), high exonic proportion (>75%), low intergenic reads (<5%), >10K genes detected
Category 4 | Good | Alignment rate 50-70%, exonic proportion 60-75%, suitable for analysis
Category 3 | Moderate | Alignment rate 30-50%, may require additional QC review
Category 2 | Fair | Alignment rate 10-30%, consider re-sequencing
Category 1 | Poor | Alignment rate <10%, not recommended for analysis
Category 0 | Failed | Unable to process or extreme technical failure

Interpretation:

  • Samples in Category 4-5 are suitable for downstream RNA-seq analysis
  • Samples in Category 1-2 may indicate library preparation or sequencing issues
  • Use this plot to identify problematic samples before proceeding to differential expression analysis

Summary Verdict Table#

Location: tertiary_analyses/qc_plots/summary_verdict.txt

Summary table showing:

  • Number of samples in each quality category
  • Proportion of samples in each category
  • Overall project quality assessment

Example:

Category    NumberCells    ProportionCells
5           0              0
4           5              100
3           0              0
2           0              0
1           0              0
0           0              0

In this example, all 5 samples are Category 4 (good quality).
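
The verdict table is easy to read programmatically. The sketch below assumes the file is whitespace-delimited with a single header row, exactly as in the example above; the function names are hypothetical, and the "Category 4 or higher" cutoff follows the interpretation guidance in this manual.

```python
def parse_summary_verdict(text):
    """Parse the verdict table into {category: (number_of_cells, proportion)}."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = {}
    for line in lines[1:]:  # skip the header row
        category, n_cells, proportion = line.split()
        rows[int(category)] = (int(n_cells), float(proportion))
    return rows


def usable_fraction(rows, min_category=4):
    """Fraction of samples in categories at or above min_category."""
    total = sum(n for n, _ in rows.values())
    if total == 0:
        return 0.0
    usable = sum(n for category, (n, _) in rows.items() if category >= min_category)
    return usable / total


verdict_text = """Category    NumberCells    ProportionCells
5           0              0
4           5              100
3           0              0
2           0              0
1           0              0
0           0              0
"""
rows = parse_summary_verdict(verdict_text)
print(usable_fraction(rows))  # → 1.0
```

In practice the text would come from tertiary_analyses/qc_plots/summary_verdict.txt rather than an inline string.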

Group-Stratified Summary#

Location: tertiary_analyses/qc_plots/summary_verdict_group.txt

If groups are specified in the input CSV, this file shows quality category distribution per group, useful for detecting batch effects.


Output Files#

All pipeline results are organized in a structured directory hierarchy:

Directory Structure#

project_results/
├── multiqc/
├── read_counts/
├── secondary_analyses/
│   ├── alignment_htseq/
│   ├── insert_size/
│   ├── quantification_htseq/
│   ├── quantification_salmon/
│   └── secondary_metrics/
├── tertiary_analyses/
│   ├── classification_cell_typing/
│   └── qc_plots/
└── execution_info/

Detailed Output Descriptions#

1. multiqc/ Directory#

File | Description
multiqc_report.html | Main QC report - Interactive HTML report with all QC metrics. Open in web browser.
multiqc_data/ | Raw data used to generate MultiQC report (JSON, TSV files)
multiqc_report_plots/ | Plot files - PDF, PNG, and SVG versions of all plots featured in MultiQC
multiqc_version.yml | Version information for reproducibility

Example: multiqc_report.html

2. secondary_analyses/alignment_htseq/ Directory#

Aligned reads for each biosample:

secondary_analyses/alignment_htseq/
├── {biosampleName}.bam
├── {biosampleName}.bam.bai
└── {biosampleName}_Chimeric.out.junction

File Type | Description
.bam | Binary Alignment Map - STAR-aligned reads in binary format. Can be viewed with IGV or samtools. Contains primary alignments only.
.bam.bai | BAM Index - Required for efficient random access to BAM file.
_Chimeric.out.junction | Chimeric junctions - Detected fusion/chimeric reads that span distant genomic locations. Useful for fusion gene detection.

Size: Typically 50-500 MB per sample (for 100K reads)

Use Case: Visual inspection of alignments, splice junction analysis, fusion detection

3. secondary_analyses/quantification_htseq/ Directory#

Gene-level quantification from STAR and HTSeq:

secondary_analyses/quantification_htseq/
├── df_gene_counts_starhtseq.tsv
├── df_mt_gene_counts_starhtseq.tsv
├── df_gene_types_detected_summary_starhtseq.tsv
├── matrix_gene_counts_starhtseq.txt
├── HouseKeepingGenes_Expression.pdf
├── HKGenes_Expression__mqc.png
├── HouseKeepingGenes_Counts_mqc.tsv
└── HouseKeepingGenes_CV_mqc.tsv

File | Description
df_gene_counts_starhtseq.tsv | Main gene counts table - Contains ENSEMBL gene IDs, gene symbols, gene biotypes, and HTSeq counts for each sample and detected gene
df_mt_gene_counts_starhtseq.tsv | Mitochondrial metrics - Contains MT_counts (mitochondrial gene counts), Total_counts (total detected gene counts), and PropMT (proportion of MT to total counts) per sample
df_gene_types_detected_summary_starhtseq.tsv | Gene biotype summary - Details the number and proportion of various gene types detected: protein-coding genes, lncRNAs, pseudogenes, miRNAs, etc.
matrix_gene_counts_starhtseq.txt | Count matrix - Project-level matrix with all samples (columns) and read counts for all genes (rows). Ready for differential expression analysis.
HouseKeepingGenes_Expression.pdf | Housekeeping gene heatmap - Clustergram showing expression consistency of housekeeping genes across samples
HouseKeepingGenes_CV_mqc.tsv | Coefficient of variation - CV rates for housekeeping genes, indicating technical variability
HouseKeepingGenes_Counts_mqc.tsv | Housekeeping counts - Raw counts for housekeeping genes used for QC assessment

Example Files: - df_gene_counts_starhtseq.tsv - df_mt_gene_counts_starhtseq.tsv - df_gene_types_detected_summary_starhtseq.tsv - matrix_gene_counts_starhtseq.txt - HKGenes_Expression__mqc.png - Housekeeping gene heatmap

Use Case:

  • Use matrix_gene_counts_starhtseq.txt for differential expression analysis in R (DESeq2, edgeR)
  • Check df_mt_gene_counts_starhtseq.tsv for contamination or poor cell quality (a high PropMT can indicate stressed or dying cells)
  • Review df_gene_types_detected_summary_starhtseq.tsv to assess library complexity
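
As an example of the mitochondrial check, the sketch below recomputes the proportion from the MT_counts and Total_counts columns and flags samples above 0.2, the upper bound given for Prop_mitochondrion in Appendix B. It assumes a tab-separated file whose first column holds the sample identifier; that column's name (`SampleId` in the toy data) and the function name are hypothetical.

```python
import csv
import io


def flag_high_mt(tsv_text, threshold=0.2):
    """Return (sample, proportion) for samples whose MT proportion exceeds threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    sample_col = reader.fieldnames[0]  # assumption: first column is the sample ID
    flagged = []
    for row in reader:
        total = float(row["Total_counts"])
        proportion = float(row["MT_counts"]) / total if total else 0.0
        if proportion > threshold:
            flagged.append((row[sample_col], round(proportion, 3)))
    return flagged


# Toy data in the assumed layout.
toy_tsv = (
    "SampleId\tMT_counts\tTotal_counts\tPropMT\n"
    "Expression-test1\t50\t1000\t0.05\n"
    "Expression-test2\t300\t1000\t0.3\n"
)
print(flag_high_mt(toy_tsv))  # → [('Expression-test2', 0.3)]
```

Recomputing the proportion from the raw counts, rather than trusting the PropMT column, doubles as a sanity check on the file.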

4. secondary_analyses/quantification_salmon/ Directory#

Transcript-level quantification from Salmon:

secondary_analyses/quantification_salmon/
├── df_transcript_counts_salmon.tsv
├── df_transcript_types_detected_summary_salmon.tsv
├── matrix_transcript_raw_salmon.tsv
├── matrix_transcript_tpm_salmon.tsv
├── matrix_transcript_length_tpm_salmon.tsv
├── df_gene_counts_salmon.tsv
├── df_mt_gene_counts_salmon.tsv
├── df_gene_types_detected_summary_salmon.tsv
├── matrix_gene_counts_salmon.tsv
├── matrix_gene_tpm_salmon.tsv
└── matrix_gene_length_tpm_salmon.tsv

Transcript-Level Files:

File | Description
df_transcript_counts_salmon.tsv | Main transcript table - Contains ENSEMBL transcript IDs, transcript lengths, TPM values, and both scaled and unscaled transcript counts
matrix_transcript_raw_salmon.tsv | Raw transcript counts - Unscaled read counts per transcript across all samples
matrix_transcript_tpm_salmon.tsv | TPM-scaled transcripts - Transcripts Per Million scaled by library size
matrix_transcript_length_tpm_salmon.tsv | Length-scaled TPM - Scaled first by average transcript length, then by library size (recommended for differential expression)
df_transcript_types_detected_summary_salmon.tsv | Transcript biotype summary - Number and types of transcripts detected per sample

Example Files: - df_transcript_counts_salmon.tsv

Gene-Level Files (Salmon-based):

File | Description
df_gene_counts_salmon.tsv | Salmon gene counts - Gene-level counts generated by collapsing transcript counts from the same gene
df_mt_gene_counts_salmon.tsv | Mitochondrial metrics - Similar to HTSeq version but derived from Salmon quantification
df_gene_types_detected_summary_salmon.tsv | Gene biotype summary - Gene types detected using Salmon-based quantification
matrix_gene_counts_salmon.tsv | Gene count matrix - Unscaled gene counts across all samples
matrix_gene_tpm_salmon.tsv | Gene TPM matrix - TPM-scaled gene expression values
matrix_gene_length_tpm_salmon.tsv | Length-scaled gene TPM - Recommended for differential expression analysis using tximport-compatible tools

Example Files: - df_gene_counts_salmon.tsv - matrix_gene_length_tpm_salmon.tsv

Use Case:

  • Use matrix_transcript_length_tpm_salmon.tsv for transcript-level differential expression
  • Use matrix_gene_length_tpm_salmon.tsv with tximport in R for gene-level DE analysis
  • Salmon quantification is faster and doesn't require full alignment
  • Compare Salmon and HTSeq gene counts to assess quantification concordance

5. secondary_analyses/secondary_metrics/ Directory#

Comprehensive alignment and expression metrics:

secondary_analyses/secondary_metrics/
├── pipeline_metrics_summary_percents.csv
├── qualimap_stats_mqc.csv
└── df_dynamicrange_expression.tsv

File | Description
pipeline_metrics_summary_percents.csv | Percentage-based metrics - Contains metrics from the "QualiMap percent stats" section: percentage of exonic reads, intronic reads, intergenic reads, and total reads aligned
qualimap_stats_mqc.csv | Qualimap statistics - Contains alignment stats including total aligned reads, alignments to genes, 5'-3' bias metrics
df_dynamicrange_expression.tsv | Expression dynamics - Contains min/max/average expression, bottom 10% and top 10% expression levels, and dynamic range for each sample

Example Files: - pipeline_metrics_summary_percents.csv - qualimap_stats_mqc.csv - df_dynamicrange_expression.tsv

Key Metrics Explained:

Metric | Description | Good Quality Range
reads.aligned_P_Total | Percentage of reads successfully aligned | > 50%
exonic_P_gen | Percentage of reads mapping to exons | > 60%
intergenic_P_gen | Percentage of reads in intergenic regions | < 10%
dynamic_range_expr | Spread of expression levels (top10% - bottom10%) | Higher indicates more diverse expression

6. secondary_analyses/insert_size/ Directory#

Insert size distribution metrics (for paired-end sequencing):

secondary_analyses/insert_size/
└── aggregated_insert_size_summary.tsv

File | Description
aggregated_insert_size_summary.tsv | Aggregated insert size metrics - Contains mean and median insert sizes calculated separately for different RNA biotypes across all samples

This file provides insert size statistics stratified by gene biotype, useful for assessing library quality and fragment size distribution:

Biotype Columns:

  • Protein_Coding_Mean / Protein_Coding_Median: Insert sizes from reads mapping to protein-coding genes (most informative for library prep QC)
  • Ribosomal_RNA_Mean / Ribosomal_RNA_Median: Insert sizes from ribosomal RNA genes (indicates rRNA depletion efficiency)
  • Mitochondrial_RNA_Mean / Mitochondrial_RNA_Median: Insert sizes from mitochondrial genes
  • Small_RNAs_Mean / Small_RNAs_Median: Insert sizes from small nuclear/nucleolar RNAs
  • MicroRNAs_Mean / MicroRNAs_Median: Insert sizes from microRNA genes

Example Files: - aggregated_insert_size_summary.tsv

7. tertiary_analyses/classification_cell_typing/ Directory#

Cell type prediction results:

tertiary_analyses/classification_cell_typing/
├── df_cell_typing_summary_singler_hpca_gtex_tcga.tsv
└── df_cell_typing_scores_singler_hpca_gtex_tcga.tsv

File | Description
df_cell_typing_summary_singler_hpca_gtex_tcga.tsv | Cell type predictions - Contains assigned Cell Phase (G1, S, G2M), Progenitor Type, Tissue Type, TGCA Tissue Type, and TGCA Tumor Type for each sample
df_cell_typing_scores_singler_hpca_gtex_tcga.tsv | Prediction scores - Confidence scores for each sample against all possible phases, progenitors, tissues, etc. Used to assess classification confidence

Example Files: - df_cell_typing_summary_singler_hpca_gtex_tcga.tsv - df_cell_typing_scores_singler_hpca_gtex_tcga.tsv

Cell Classification Categories:

  • Phase: Cell cycle stage (G1, S, G2M)
  • Progenitor: Cell lineage type (e.g., B_cell, T_cell, Monocyte)
  • Tissue: Predicted tissue of origin (e.g., Blood, Spleen, Brain)
  • TGCA Tissue: TCGA reference tissue type
  • TGCA Tumor: TCGA tumor classification

Use Case: Validate expected cell types, detect contamination, identify unexpected cell populations

8. tertiary_analyses/qc_plots/ Directory#

Aggregate quality control visualizations:

tertiary_analyses/qc_plots/
├── composition_rnaqc.pdf
├── RNAQC_composition_mqc.jpg
├── summary_verdict.txt
└── summary_verdict_group.txt

File | Description
composition_rnaqc.pdf | QC category distribution - Bar chart showing number and percentage of samples in each quality category (0-5)
RNAQC_composition_mqc.jpg | JPEG version for inclusion in reports
summary_verdict.txt | Overall summary - Table with number and proportion of samples per quality category
summary_verdict_group.txt | Group-stratified summary - Quality metrics broken down by experimental groups (if specified in input CSV)

Example Files: - composition_rnaqc.pdf - RNAQC_composition_mqc.jpg - summary_verdict.txt - summary_verdict_group.txt

9. read_counts/ Directory#

Read count tracking at each processing step:

Tracks the number of reads remaining after subsampling, trimming, filtering, and alignment. Useful for identifying where read loss occurs.

10. execution_info/ Directory#

Pipeline execution metadata:

execution_info/
├── input_dataset.csv
├── execution_report.html
├── execution_timeline.html
├── execution_trace.txt
├── params_file.json
├── tool_versions.yml
├── input.csv
├── output.json
└── eb_event.json

File | Description
input_dataset.csv | Copy of the input CSV used to run the pipeline
execution_report.html | Nextflow execution report with resource usage and runtime statistics
execution_timeline.html | Visual timeline of task execution
execution_trace.txt | Detailed trace of all executed tasks
params_file.json | Complete parameter settings used for the pipeline run
tool_versions.yml | Software versions for all tools used in the pipeline (STAR, Salmon, HTSeq, etc.)

Use Case: Troubleshooting, reproducibility, performance optimization, documenting analysis parameters


Appendix#

A. FASTQ Naming Conventions#

The BJ-Expression pipeline accepts Illumina FASTQ files following these naming patterns:

Standard Illumina Format#

{SampleName}_S{SampleNumber}_L{LaneNumber}_R{ReadNumber}_001.fastq.gz

Components:

  • {SampleName}: User-defined sample identifier (e.g., Expression-test1, Sample_A)
  • S{SampleNumber}: Sample number from sample sheet (e.g., S1, S12)
  • L{LaneNumber}: Sequencing lane (e.g., L001, L002)
  • R{ReadNumber}: Read direction (R1 for forward, R2 for reverse)
  • 001: File segment (always 001 for single files)

Examples:

Expression-test1_S1_L001_R1_001.fastq.gz
Expression-test1_S1_L001_R2_001.fastq.gz
ResolveOMEv2.1-RNA-04C_S4_L003_R1_001.fastq.gz
ResolveOMEv2.1-RNA-04C_S4_L003_R2_001.fastq.gz

Simplified Format (Also Accepted)#

{SampleName}_R1_001.fastq.gz
{SampleName}_R2_001.fastq.gz

Requirements:

  • Files must be gzip compressed (.fastq.gz or .fq.gz)
  • Forward and reverse reads must have matching sample names
  • Read pairs must be indicated by _R1_ and _R2_ (or _R1. and _R2.)
  • Both reads of a pair are required (the pipeline does not accept a single read file for paired-end data)
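
The naming rules above can be checked with a regular expression before upload. This is a minimal sketch, not the pipeline's own parser: the pattern and the function name `find_unpaired` are hypothetical, and for brevity it covers only names ending in `_001.fastq.gz` (the standard and simplified formats shown above), not the `.fq.gz` or `_R1.` variants.

```python
import re
from collections import defaultdict

# Sample name, optional _S{n}_L{lane} block, read number, fixed 001 segment.
FASTQ_RE = re.compile(
    r"^(?P<sample>.+?)"
    r"(?:_S(?P<snum>\d+)_L(?P<lane>\d+))?"
    r"_R(?P<read>[12])_001\.fastq\.gz$"
)


def find_unpaired(filenames):
    """Return (samples missing R1 or R2, filenames that do not match the pattern)."""
    reads_seen = defaultdict(set)
    unparsed = []
    for name in filenames:
        match = FASTQ_RE.match(name)
        if match is None:
            unparsed.append(name)
            continue
        key = (match["sample"], match["snum"], match["lane"])
        reads_seen[key].add(match["read"])
    unpaired = sorted(key[0] for key, reads in reads_seen.items() if reads != {"1", "2"})
    return unpaired, unparsed


files = [
    "Expression-test1_S1_L001_R1_001.fastq.gz",
    "Expression-test1_S1_L001_R2_001.fastq.gz",
    "ResolveOMEv2.1-RNA-04C_S4_L003_R1_001.fastq.gz",
    "ResolveOMEv2.1-RNA-04C_S4_L003_R2_001.fastq.gz",
]
print(find_unpaired(files))  # → ([], [])
```

Files are grouped by sample name, sample number, and lane, so the same sample sequenced on two lanes is checked lane by lane.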

B. Selected Metrics Descriptions#

The following table describes key metrics found in the MultiQC report and output files:

Metric | Description | Typical Range | Quality Indicator
SampleId | Unique sample identifier | N/A | Must match biosampleName from input CSV
BJ_calculated_raw_read_pairs | Total raw read pairs calculated by BaseJumper | Varies by design | Project-dependent
Nonsubsampled_reads | Total reads before subsampling step | Varies by design | Project-dependent
Subsampled_reads | Reads after subsampling (100,000 default) | 100,000 if enabled | Should match target if enabled
Pass_filtered_reads | Reads passing FASTP quality filters | > 80% of input | Higher indicates good quality
Low_quality_reads | Reads removed due to low quality scores | < 10% of input | Lower is better
Many_Ns_reads | Reads with excessive N bases removed | < 5% of input | Lower is better
Too_short_reads | Reads too short after trimming | < 10% of input | Lower is better
Prop_pass_filtered_reads | Proportion of reads passing all filters | > 0.8 (80%) | Higher indicates good quality
Prop_low_quality_reads | Proportion failing quality threshold | < 0.1 (10%) | Lower is better
Prop_of_many_Ns_reads | Proportion with too many N bases | < 0.05 (5%) | Lower is better
Prop_of_too_short_reads | Proportion too short after trimming | < 0.1 (10%) | Lower is better
Prop_mappability | Percentage of reads successfully aligned | 40-80% | Higher indicates good library quality
Prop_exonic | Percentage of reads mapping to exons | 60-80% | Higher indicates good mRNA enrichment
Prop_intronic | Percentage of reads mapping to introns | 10-30% | Too high may indicate pre-mRNA contamination
Prop_intergenic | Percentage of reads in intergenic regions | < 10% | Lower indicates less genomic DNA contamination
Prop_mitochondrion | Proportion of mitochondrial reads | < 0.2 (20%) | Higher may indicate cell stress or poor quality
Protein_coding_genes | Number of protein-coding genes detected | > 5,000 | Higher indicates good library complexity
Protein_coding_transcripts | Number of protein-coding transcripts detected | > 10,000 | Higher indicates isoform diversity
Protein_coding_Mean_insertsize | Mean insert size from protein-coding genes | 150-300 bp (varies by protocol) | Should match expected library prep size
Protein_coding_Median_insertsize | Median insert size from protein-coding genes | 150-300 bp (varies by protocol) | More robust to outliers than mean
Mean_number_of_txs_per_gene | Average transcripts detected per gene | 1.5-3.0 | Higher indicates isoform detection
Coverage_ratio_gene_body | Ratio of 5' to 3' coverage across genes | 0.8-1.2 | ~1.0 indicates no bias
Median_coverage_gene_body | Median coverage depth across gene bodies | Varies by depth | Higher indicates better coverage
Dynamic_range | Expression spread (top 10% - bottom 10%) | 3,000-6,000 | Higher indicates diverse expression
Cell_type_Phase | Predicted cell cycle phase | G1, S, or G2M | Validation of expected phase
Cell_type_Progenitor | Predicted progenitor cell type | B_cell, T_cell, etc. | Validation of expected lineage
Cell_type_Tissue | Predicted tissue of origin | Blood, Brain, etc. | Validation of expected tissue
Cell_type_TGCA_tissue | TCGA reference tissue classification | Various tissue types | Reference-based classification
Cell_type_TGCA_tumor | TCGA tumor type classification | Various tumor types | Tumor-specific classification

C. ClusterQC Cutoffs and Interpretation#

Quality Metrics Explained#

  1. Mappability (Alignment Rate)
     • What it measures: Percentage of reads successfully aligned to the reference genome
     • Interpretation:
       • High (>70%): Excellent library quality, minimal contamination
       • Medium (50-70%): Good quality, acceptable for analysis
       • Low (<50%): Possible contamination, degraded RNA, or wrong reference genome

  2. Proportion Exonic
     • What it measures: Percentage of aligned reads mapping to exon regions
     • Interpretation:
       • High (>75%): Excellent mRNA enrichment, successful library prep
       • Medium (60-75%): Acceptable mRNA content
       • Low (<60%): Poor mRNA enrichment, possible genomic DNA contamination or pre-mRNA

  3. Proportion Intergenic
     • What it measures: Percentage of aligned reads mapping to intergenic regions
     • Interpretation:
       • Low (<5%): Excellent specificity to genic regions
       • Medium (5-10%): Acceptable background
       • High (>10%): Genomic DNA contamination or poor library enrichment

  4. Proportion Mitochondria
     • What it measures: Percentage of reads mapping to mitochondrial genes
     • Interpretation:
       • Low (<10%): Healthy cells with intact cytoplasmic RNA
       • Medium (10-20%): Acceptable for some cell types
       • High (>20%): Cell stress, apoptosis, or poor sample quality

  5. Number of Protein Coding Genes
     • What it measures: Total number of protein-coding genes with detected expression
     • Interpretation:
       • High (>10,000): Excellent library complexity
       • Medium (5,000-10,000): Good complexity, suitable for analysis
       • Low (<5,000): Poor library complexity or insufficient sequencing depth
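The cutoffs above can be applied programmatically when screening many samples. The following is a minimal sketch, not part of the pipeline: the dictionary keys are hypothetical names chosen for illustration, and the values are assumed to be expressed as percentages (multiply by 100 first if your summary file reports proportions between 0 and 1).

```python
# Cutoffs copied from the "Quality Metrics Explained" list.
# Each entry: (low_cutoff, high_cutoff, higher_is_better).
# The dictionary keys are hypothetical names chosen for this sketch.
CUTOFFS = {
    "mappability_pct": (50.0, 70.0, True),
    "exonic_pct": (60.0, 75.0, True),
    "intergenic_pct": (5.0, 10.0, False),
    "mitochondrial_pct": (10.0, 20.0, False),
    "protein_coding_genes": (5000, 10000, True),
}


def tier(metric, value):
    """Return 'High', 'Medium', or 'Low' for one metric value."""
    low_cutoff, high_cutoff, _ = CUTOFFS[metric]
    if value > high_cutoff:
        return "High"
    if value < low_cutoff:
        return "Low"
    return "Medium"


def is_concern(metric, value):
    """True when the value falls in the unfavourable tier for this metric."""
    higher_is_better = CUTOFFS[metric][2]
    unfavourable = "Low" if higher_is_better else "High"
    return tier(metric, value) == unfavourable


# Illustrative values only; these are not real pipeline output.
sample = {
    "mappability_pct": 82.0,
    "exonic_pct": 68.0,
    "intergenic_pct": 12.0,
    "mitochondrial_pct": 8.0,
    "protein_coding_genes": 4200,
}
for metric, value in sample.items():
    flag = "  <- review" if is_concern(metric, value) else ""
    print(f"{metric}: {tier(metric, value)}{flag}")
```

Note that "High" is favourable for mappability, exonic proportion, and gene count, but unfavourable for intergenic and mitochondrial proportions, which is why the sketch tracks the direction separately from the tier.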

D. How Adapter Trimming Works#

The BJ-Expression pipeline uses a two-step adapter trimming approach for optimal read cleanup:

CUTADAPT (Step 1: Targeted Adapter Removal)#

CUTADAPT performs the initial adapter removal using known adapter sequences:

Adapter Removal Strategy:

  • Removes adapters from the 3' end of reads (-a flag) and adapters that may occur at either the 5' or 3' end (-b flag)
  • Processes both R1 and R2 reads independently with specified adapters
  • Matches the specified adapter sequences for precise removal (note that cutadapt tolerates a small fraction of mismatches unless its error rate is set to 0)
  • Particularly effective for removing known library prep adapters
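For orientation, a paired-end cutadapt call that uses these flags could be assembled as in the sketch below. This is an illustration under stated assumptions, not the pipeline's actual invocation: the file names are hypothetical, and the adapter is the default sequence quoted in the Technical Note of this section. In cutadapt, -a/-A name a 3' adapter for R1/R2, and -b/-B name an adapter that may occur at either end of R1/R2.

```python
import shlex


def build_cutadapt_cmd(r1_in, r2_in, r1_out, r2_out, adapter_3p, adapter_any=None):
    """Build a paired-end cutadapt argument list (the command is not executed here).

    -a / -A : 3' adapter for R1 / R2
    -b / -B : adapter that may occur at either end, for R1 / R2
    -o / -p : trimmed output files for R1 / R2
    """
    cmd = ["cutadapt", "-a", adapter_3p, "-A", adapter_3p]
    if adapter_any is not None:
        cmd += ["-b", adapter_any, "-B", adapter_any]
    cmd += ["-o", r1_out, "-p", r2_out, r1_in, r2_in]
    return cmd


# Hypothetical file names; the adapter is the default quoted in this manual.
ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACA"
cmd = build_cutadapt_cmd(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    adapter_3p=ADAPTER, adapter_any=ADAPTER,
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.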

FASTP (Step 2: Comprehensive QC and Trimming)#

FASTP performs comprehensive adapter and quality trimming:

Adapter Detection:

  • Automatically detects adapter sequences by analyzing read ends
  • Can use user-specified adapter sequences (recommended for single-cell RNA-seq)
  • Detects and removes both 3' and 5' adapters

Quality Trimming:

  • Removes low-quality bases from read ends (default Q score < 15)
  • Performs sliding window quality trimming
  • Removes reads shorter than minimum length after trimming

Poly-G Tail Removal:

  • For NovaSeq/NextSeq (two-color chemistry), removes artificial poly-G tails
  • Poly-G tails occur when no signal is detected in two-color chemistry
  • Automatically detected based on instrument parameter
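The idea behind poly-G tail removal can be illustrated with a simplified sketch. This is not fastp's implementation: fastp tolerates occasional mismatches inside the tail, whereas the function below only removes an uninterrupted run of G bases. The minimum length of 10 mirrors fastp's documented default for poly-G detection; the function name is hypothetical.

```python
def trim_poly_g(seq, qual, min_len=10):
    """Remove a trailing run of G bases if it is at least min_len long.

    seq and qual are the sequence and quality strings of one FASTQ record;
    both are shortened together so they stay the same length.
    """
    run = len(seq) - len(seq.rstrip("G"))  # length of the trailing G run
    if run >= min_len:
        keep = len(seq) - run
        return seq[:keep], qual[:keep]
    return seq, qual


# A read whose last 12 cycles produced no signal and were called as G.
read = "ACGT" * 5 + "G" * 12
quality = "I" * len(read)
trimmed_seq, trimmed_qual = trim_poly_g(read, quality)
print(len(read), "->", len(trimmed_seq))
```

A short trailing run (for example three G bases) is left untouched, because it is more likely to be genuine sequence than a no-signal artifact.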

Read Filtering:

  • Removes reads with too many N bases
  • Filters reads below complexity threshold
  • Removes extremely short reads after trimming

Output:

  • Trimmed FASTQ files with improved quality
  • JSON report with detailed trimming statistics
  • HTML report with visual QC plots
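The JSON report can be inspected directly to see how many reads fell into each filtering category. The sketch below reads the `filtering_result` block that fastp writes to its JSON report and converts the counts to proportions. The counts are made-up illustrative values, and using the sum of all categories as the denominator is an assumption; the pipeline's own Prop_* metrics may be computed differently.

```python
import json


def filtering_proportions(report):
    """Return the share of reads in each fastp filtering category."""
    counts = report["filtering_result"]
    total = sum(counts.values())
    if total == 0:
        raise ValueError("report contains no reads")
    return {name: n / total for name, n in counts.items()}


# Illustrative excerpt of a fastp JSON report (counts are invented).
example = json.loads("""
{
  "filtering_result": {
    "passed_filter_reads": 180000,
    "low_quality_reads": 12000,
    "too_many_N_reads": 2000,
    "too_short_reads": 6000,
    "too_long_reads": 0
  }
}
""")

props = filtering_proportions(example)
for name, value in props.items():
    print(f"{name}: {value:.3f}")
```

For a real report, replace the inline string with `json.load(open(path))`, where `path` points to the fastp JSON file for the sample.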

Technical Note: The default adapter sequence (AAGCAGTGGTATCAACGCAGAGTACA) is optimized for single-cell RNA-seq protocols that use template switching. If you use a different library prep protocol, specify the appropriate adapter sequences.

E. Frequently Asked Questions#

Q: How do I download gene or transcript count tables?

A: Follow these steps:

  1. Export your project using BaseJumper (see Data Export manual)
  2. In the export_data/ folder, navigate to:
     • secondary_analyses/quantification_htseq/ for STAR/HTSeq-based gene counts
     • secondary_analyses/quantification_salmon/ for Salmon-based transcript or gene counts
  3. Download the matrix files for use in downstream analysis tools

Q: Are the gene or transcript count tables normalized?

A:

  • STAR/HTSeq gene counts (matrix_gene_counts_starhtseq.txt): Not normalized; raw counts suitable for DESeq2 or edgeR
  • Salmon transcript counts: Available in three formats:
    • matrix_transcript_raw_salmon.tsv: Raw, unnormalized counts
    • matrix_transcript_tpm_salmon.tsv: TPM-scaled (normalized by library size)
    • matrix_transcript_length_tpm_salmon.tsv: Length-scaled TPM (recommended for DE analysis)
  • Salmon gene counts: Same normalization options as transcripts
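To make the difference between the three Salmon formats concrete, the sketch below works through the arithmetic for a single sample with three transcripts. It assumes the "TPM-scaled" and "length-scaled TPM" files follow the scaledTPM and lengthScaledTPM conventions of the tximport R package; this manual does not state how the files are actually calculated, so treat it as an illustration of the concepts only. All function names and numbers are hypothetical.

```python
def tpm(counts, effective_lengths):
    """Transcripts per million: divide by length, then scale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    return [r / total * 1_000_000 for r in rates]


def scale_to_library_size(values, library_size):
    """Rescale values so they sum to the sample's total read count."""
    total = sum(values)
    return [v / total * library_size for v in values]


def scaled_tpm(counts, effective_lengths):
    """TPM rescaled to the library size (assumed 'TPM-scaled' convention)."""
    return scale_to_library_size(tpm(counts, effective_lengths), sum(counts))


def length_scaled_tpm(counts, effective_lengths, average_lengths):
    """TPM weighted by each transcript's average length across samples,
    then rescaled to the library size (assumed 'length-scaled' convention)."""
    weighted = [t * a for t, a in zip(tpm(counts, effective_lengths), average_lengths)]
    return scale_to_library_size(weighted, sum(counts))


# Invented values for one sample with three transcripts.
counts = [100.0, 300.0, 600.0]
lengths = [1000.0, 3000.0, 2000.0]          # effective lengths in this sample
average_lengths = [1200.0, 2500.0, 2000.0]  # mean lengths across all samples

print("TPM:            ", [round(v) for v in tpm(counts, lengths)])
print("scaled TPM:     ", [round(v, 1) for v in scaled_tpm(counts, lengths)])
print("length-scaled:  ", [round(v, 1) for v in length_scaled_tpm(counts, lengths, average_lengths)])
```

Both scaled variants sum to the original library size, so they behave like counts in downstream tools while removing (scaled TPM) or averaging out (length-scaled TPM) per-sample transcript length differences.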

Q: Should I disable subsampling for my analysis?

A:

  • Keep enabled for initial QC and rapid sample quality assessment
  • Disable for:
    • Final differential expression analysis
    • Isoform-level analysis requiring full depth
    • Low-input samples where every read matters
    • Publication-quality results
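Conceptually, subsampling draws a fixed number of read pairs at random from the full set, which is why it is fast but discards depth. The sketch below illustrates this with the 100,000 default mentioned in this manual. The function name, the seed, and the in-memory list of read identifiers are assumptions for illustration; the manual does not describe which tool or random seed the pipeline uses.

```python
import random


def subsample_pairs(read_pairs, target=100_000, seed=0):
    """Return up to `target` read pairs drawn uniformly without replacement.

    If the sample has no more than `target` pairs, all of them are kept.
    A fixed seed makes the draw reproducible.
    """
    pairs = list(read_pairs)
    if len(pairs) <= target:
        return pairs
    rng = random.Random(seed)
    return rng.sample(pairs, target)


# Hypothetical sample with 250,000 read pairs, represented by their IDs.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(250_000)]
subset = subsample_pairs(pairs)
print(len(pairs), "->", len(subset))
```

Because R1 and R2 are kept together as a pair, the subsample never separates mates, which downstream paired-end alignment requires.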

Q: Which gene counts should I use - STAR/HTSeq or Salmon?

A: Both are valid, but:

  • STAR/HTSeq: Traditional approach, full alignment, good for splice junction analysis
  • Salmon: Faster, includes length scaling (better for isoform differences), modern best practice
  • Recommendation: Use Salmon gene counts (matrix_gene_length_tpm_salmon.tsv) with tximport in R for most analyses

Q: What does high mitochondrial percentage indicate?

A: High PropMT (>20%) can indicate:

  • Cell stress or apoptosis
  • Poor sample preservation
  • Cell lysis during library prep

Note that for some cell types (e.g., neurons, cardiomyocytes), 10-15% is normal.

Q: How many genes should I detect?

A: Typical ranges:

  • Bulk RNA-seq: 15,000-20,000 genes
  • Single-cell RNA-seq: 2,000-8,000 genes per cell
  • With 100K subsampling: Expect lower numbers; disable subsampling for accurate gene detection

Q: Can I analyze 10x Genomics data with this pipeline?

A: Yes, enable the "10x Support" module. Note:

  • Specialized processing for 10x barcode structure
  • Set library protocol to "chromium"
  • Different adapter sequences may be needed

Q: What if my samples show poor alignment rates?

A: Check:

  1. Correct reference genome selected (human vs. mouse)
  2. FASTQ quality (look at fastp report)
  3. Adapter contamination (check adapter content)
  4. Possible contamination (enable Kraken if available)
  5. Library prep issues (consult with lab)

Q: How do I interpret the cell typing results?

A: The SingleR classifier compares your expression profile to reference databases:

  • High confidence: Scores >0.8 indicate strong matches
  • Low confidence: Scores <0.5 suggest ambiguous or novel cell types
  • Use case: Validation of expected cell types, not definitive classification
  • Always confirm with marker gene expression

Q: What files do I need for differential expression analysis?

A: You need:

  1. Count matrix: matrix_gene_counts_starhtseq.txt OR matrix_gene_length_tpm_salmon.tsv
  2. Sample metadata: Your original input CSV with group information
  3. For R/Bioconductor:
     • DESeq2: Use raw counts from STAR/HTSeq
     • edgeR: Use raw counts from STAR/HTSeq
     • limma-voom: Can use either, but include library size factors
     • tximport + DESeq2: Use Salmon output with matrix_gene_length_tpm_salmon.tsv
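Before starting a differential expression analysis, it can help to confirm that every sample in the count matrix has a group assignment in the metadata. The sketch below shows one way to do that check. It assumes the count matrix is tab-separated with gene identifiers in the first column and one column per sample, which this manual does not confirm; the `biosampleName` and `groups` column names come from the input CSV described earlier. The inline data are invented.

```python
import csv
import io


def check_samples(matrix_text, metadata_text):
    """Compare sample columns in a count matrix with the metadata CSV.

    Returns (missing, ungrouped):
      missing   - samples in the matrix that are absent from the metadata
      ungrouped - samples present in the metadata with an empty 'groups' value
    """
    header = next(csv.reader(io.StringIO(matrix_text), delimiter="\t"))
    matrix_samples = header[1:]  # assumes the first column holds gene IDs

    metadata = {
        row["biosampleName"]: row["groups"]
        for row in csv.DictReader(io.StringIO(metadata_text))
    }
    missing = [s for s in matrix_samples if s not in metadata]
    ungrouped = [s for s in matrix_samples if s in metadata and not metadata[s].strip()]
    return missing, ungrouped


# Invented example data.
matrix_text = "gene_id\tS1\tS2\tS3\nGENE_A\t10\t0\t5\n"
metadata_text = "biosampleName,groups\nS1,control\nS2,\n"

missing, ungrouped = check_samples(matrix_text, metadata_text)
print("Missing from metadata:", missing)
print("No group assigned:", ungrouped)
```

For real files, read the contents with `open(path).read()` and pass them to `check_samples`; resolve any reported samples before building the design matrix.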