BJ-Expression Pipeline User Manual#

Customer Support#

If you need assistance with the BJ-Expression pipeline, several support channels are available:

BaseJumper Online Manual#

For comprehensive documentation and guides, visit the BaseJumper Documentation Portal.

Email Support#

Contact our support team at: basejumper@bioskryb.com

Online Portal Ticketing System#

For bug reports and feature requests, submit a ticket through our online portal. Your account representative can provide access details.

Before You Begin#

Account Registration#

Register for a BaseJumper account at https://basejumper.bioskryb.com: select ‘Create Account’ and review the Terms and Conditions.
Wait for account approval from your organization administrator
Log in with your credentials to access the platform

Data Transfer Setup#

Before running the BJ-Expression pipeline, ensure your sequencing data is accessible to you:

Globus Data Transfer#

Recommended method for large-scale data transfers
Set up a Globus endpoint for your data storage
Globus setup instructions are available in the BaseJumper Online Manual.
The BaseJumper support team will add you to your assigned workspace. A confirmation email will arrive from groups.globus.org; check your Junk/Spam folder if you do not see it.
Contact BaseJumper support for additional Globus configuration assistance.

Alternative Transfer Methods#

AWS S3 bucket access (requires IAM credentials)
Direct upload through BaseJumper interface (for smaller datasets)
Contact support to discuss custom data transfer solutions

Sample Metadata Requirements#

The BJ-Expression pipeline requires a properly formatted input CSV file. To get a starting template, open the BaseJumper project page and click the ‘Export’ button at the top of the biosample table. The exported file has the following structure; the ‘groups’ column is not part of the export, so you will need to add it yourself:

Input CSV Format#

Column Name	Description	Required	Example
`biosampleName`	Unique identifier for each sample	Yes	`Expression-test1`
`read1`	Path to forward reads (R1) FASTQ file	Yes	`s3://bucket/sample_R1_001.fastq.gz`
`read2`	Path to reverse reads (R2) FASTQ file	Yes	`s3://bucket/sample_R2_001.fastq.gz`
`groups`	Optional grouping for batch analysis	No	`Group1` or `Control`

Example Input File#

biosampleName,read1,read2,groups
Expression-test1,s3://bioskryb-data/sample1_R1_001.fastq.gz,s3://bioskryb-data/sample1_R2_001.fastq.gz,Group1
Expression-test2,s3://bioskryb-data/sample2_R1_001.fastq.gz,s3://bioskryb-data/sample2_R2_001.fastq.gz,Group1
Expression-test3,s3://bioskryb-data/sample3_R1_001.fastq.gz,s3://bioskryb-data/sample3_R2_001.fastq.gz,Group2
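
Before uploading, it can help to check the CSV against the table above. The following is a minimal sketch, not part of BaseJumper: the function name `validate_input_csv` is hypothetical, and the checks (required columns, unique `biosampleName`, gzipped FASTQ extension) are only the ones stated in this manual.

```python
import csv
import io

REQUIRED_COLUMNS = ("biosampleName", "read1", "read2")
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")  # extensions listed in this manual


def validate_input_csv(text):
    """Return a list of problems found in the input CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        name = (row["biosampleName"] or "").strip()
        if not name:
            problems.append(f"line {line_number}: empty biosampleName")
        elif name in seen:
            problems.append(f"line {line_number}: duplicate biosampleName {name!r}")
        seen.add(name)
        for column in ("read1", "read2"):
            path = (row[column] or "").strip()
            if not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_number}: {column} is not a gzipped FASTQ path")
    return problems
```

An empty list means the file passed these checks; it does not confirm that the S3 paths exist or are accessible from your account.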

FASTQ Naming Conventions#

BJ-Expression accepts standard Illumina FASTQ naming formats:

{SampleName}_S{SampleNumber}_L{Lane}_R{ReadNumber}_001.fastq.gz
Example: Expression-test1_S1_L001_R1_001.fastq.gz

Important Notes:

- Files must be gzipped (.fastq.gz extension)
- Read pairs must have matching sample names
- S3 paths should be accessible from your BaseJumper account
- CRAM files are not currently supported; contact ResolveServices for CRAM processing

Project Design#

Creating a New Project#

Navigate to Projects: From the BaseJumper dashboard, click the "Create Project" button in the upper right corner
Project Configuration:
Project Name: Choose a descriptive, unique name for your project
Description: Add optional project details
Select Pipeline: Choose BJ-Expression from the pipeline dropdown menu
Sample Selection:
Select samples from your Shared Data directory. If you have an Illumina BaseSpace Sequence Hub token, follow the instructions in the BaseJumper Manual to provide it to the support email address.

Shared Data Requirements#

If using samples from Shared Data:

- Samples must follow Illumina naming conventions
- Files must be in .fastq.gz format
- Both paired-end reads (R1 and R2) must be present

Sample Organization#

Before submitting your project, consider:

Which samples to include? Select only samples you want to analyze together
Metadata grouping: Use the groups column to organize samples by experimental conditions
Quality control: Ensure all samples have sufficient read depth (minimum 5,000 reads)

⚠️ Warning: Samples with fewer than 5,000 reads may cause the pipeline to fail. Verify data quality before submission.
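
For FASTQ files available on a local disk, read depth can be checked before upload. This is a rough sketch under two assumptions: each record spans exactly four lines (the usual Illumina layout), and the file is readable locally rather than on S3. The function names are hypothetical.

```python
import gzip

MIN_READS = 5_000  # minimum read depth stated in this manual


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def has_enough_reads(path, minimum=MIN_READS):
    """True if the file holds at least `minimum` reads."""
    return count_fastq_reads(path) >= minimum
```

Running `has_enough_reads` on the R1 file of each sample is usually sufficient, since R1 and R2 of a pair should contain the same number of records.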

Automatic Pipeline Queuing#

By default, projects are automatically queued for execution after creation. To disable this:

Uncheck the "Auto-queue project" option during project setup
Manually start the pipeline from the project dashboard when ready

Pipeline Parameters#

Selecting Pipeline Version#

From the pipeline configuration screen, select your preferred pipeline version
Default: The most recent stable version (recommended)
For reproducibility, you can select specific previous versions

Core Parameters#

Configure the following parameters based on your experimental design:

Reference Genome#

Select the reference genome that matches your sample organism:

Genome	Description
`GRCh38` (Default)	Human reference genome, build 38
`GRCh37`	Human reference genome, build 37
`GRCm39`	Mouse reference genome, build 39

Default: GRCh38

Read Length#

Option	Description	Use Case
`50` (Default)	50 base pair reads	Standard single-cell RNA-seq
`75`	75 base pair reads	Higher resolution analysis
`100`	100 base pair reads	Deep sequencing applications
`150`	150 base pair reads	Long-read RNA-seq protocols

Default: 50

Adapter Sequences#

Specify adapter sequences for trimming:

Parameter	Default Value	Description
Adapter Sequence for first read	`AAGCAGTGGTATCAACGCAGAGTACA`	Adapter sequence trimmed from R1 reads
Adapter Sequence for second read	`AAGCAGTGGTATCAACGCAGAGTACAT`	Adapter sequence trimmed from R2 reads

💡 Note: These defaults are optimized for standard single-cell RNA-seq library prep protocols. Your sequencing analysis workflow should provide these adapter sequences, which you can use here.

Optional Modules#

Enable or disable optional analysis modules based on your needs:

Module	Default State	Description
FastQ Reads Subsampling	✅ Enabled	Randomly subsamples reads to 100,000 reads per sample for faster QC analysis. For comprehensive analysis, disable this option to use all reads.
CUTADAPT Adapter Removal	✅ Enabled	Performs targeted adapter removal using user-specified sequences before FASTP trimming. Removes adapters from both 5' and 3' ends of reads.
10x Support	❌ Disabled	Enables analysis of 10x Genomics Chromium data with specialized processing

Subsampling Details:

- Default: 100,000 reads per sample.
- Purpose: Enables faster computation time while providing sufficient data for QC metrics comparison.
- Recommendation: Keep enabled for initial QC and disable for final full-depth analysis. Subsampling keeps data comparable from project to project; to get the most out of phenotype prediction, isoform detection, or variant calling, deselect this option.

CUTADAPT Details:

- Default: Enabled
- Purpose: Removes known adapter sequences efficiently before comprehensive FASTP analysis
- Recommendation: Keep enabled when using standard library prep protocols with known adapter sequences; can be disabled if adapters are minimal or if FASTP alone is sufficient

Workflow Overview#

Pipeline Execution Steps#

The BJ-Expression pipeline performs the following analyses:

flowchart LR
%% Colors %%
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff

    Start((Start)):::black --> Subsample[Subsample Reads<br/>SEQTK]:::blue
    Subsample --> Trim[Trim Adapters<br/>FASTP]:::blue
    Trim --> Salmon[Transcript Quantification<br/>SALMON]:::green
    Trim --> STAR[Splice-aware Alignment<br/>STAR]:::green
    STAR --> HTSeq[Gene Quantification<br/>HTSeq]:::green
    STAR --> Qualimap[Alignment QC<br/>Qualimap]:::pink
    STAR --> CellType[Cell Typing<br/>SingleR]:::pink
    STAR --> PCA[PCA Analysis]:::pink
    Salmon --> MultiQC[Aggregate Report<br/>MultiQC]:::orange
    HTSeq --> MultiQC
    Qualimap --> MultiQC
    CellType --> MultiQC
    PCA --> MultiQC
    MultiQC --> End((End)):::black

Processing Steps Detail#

1. Read Subsampling (SEQTK)#

Randomly selects 100,000 reads per sample (default)
Enables rapid QC and consistent cross-sample comparison
Can be disabled for full-depth analysis
Output: Subsampled FASTQ files

💡 Why subsample? For initial QC or comparison between runs, subsampling enables faster computation while providing sufficient data for quality assessment.
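
To illustrate what "randomly selects N reads" means, the sketch below implements reservoir sampling (Algorithm R) in plain Python. It is an illustration of the idea only: the pipeline uses SEQTK for this step, and nothing here is claimed to match SEQTK's internals. The function name and the `seed` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    sample = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

For paired-end data, sampling over `zip(r1_records, r2_records)` keeps each R1 read together with its R2 mate. If a sample has fewer than `k` reads, all of them are returned.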

2. Adapter Removal (CUTADAPT)#

Removes adapter sequences from both 5' and 3' ends
Uses user-specified adapter sequences for targeted removal
Trims adapters from both forward and reverse reads
Output: Adapter-trimmed FASTQ files

💡 Note: CUTADAPT runs before FASTP to remove known adapter sequences, then FASTP performs additional quality trimming and filtering.

3. Quality Trimming and QC (FASTP)#

Performs additional adapter detection and removal
Trims low-quality bases from read ends
Performs poly-G tail removal (for two-color chemistry)
Filters reads below quality threshold
Generates comprehensive QC metrics
Output: Trimmed FASTQ files, QC JSON reports

4. Transcript-Level Quantification (SALMON)#

Uses pseudo-alignment for rapid transcript (isoform) quantification
Maps reads to transcript sequences without full alignment
Generates TPM (Transcripts Per Million) values and normalizes for isoform length
Provides both transcript and gene-level counts
Output: Transcript counts, TPM values, gene counts
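
The TPM normalization mentioned above can be written as TPM_i = 10^6 × (c_i / l_i) / Σ_j (c_j / l_j), where c_i is the read count and l_i the length of transcript i. The sketch below applies that textbook formula to plain lists; note that Salmon itself works with effective transcript lengths, so its reported values will not match a calculation based on annotated lengths exactly.

```python
def tpm(counts, lengths):
    """Convert per-transcript counts to TPM using the given transcript lengths."""
    # Reads per base for each transcript.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Rescale so that the values sum to one million.
    return [rate / total * 1_000_000 for rate in rates]
```

For example, three transcripts with counts 10, 20, and 30 and lengths 1,000, 2,000, and 3,000 bp all have the same read density, so each receives one third of a million TPM.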

5. Splice-Aware Alignment (STAR)#

Aligns reads to reference genome with splice junction detection
Handles intron-spanning reads for RNA-seq data
Generates chimeric junction output for fusion detection
Output: BAM alignment files, junction files

6. Primary Read Extraction (SAMTOOLS)#

Extracts primary alignments from STAR output
Filters secondary and supplementary alignments
Creates indexed BAM files
Output: Filtered BAM files with indexes
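
In the SAM format, secondary and supplementary alignments are marked by FLAG bits 0x100 and 0x800. The sketch below shows the test that "primary alignments only" implies; it illustrates the flag logic and is not the pipeline's actual SAMTOOLS invocation, which this manual does not specify.

```python
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def is_primary(flag):
    """True if a SAM FLAG value has neither the secondary nor supplementary bit set."""
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0
```

The combined mask is 0x900 (decimal 2304), which is the value typically passed to an exclude-flags filter when only primary alignments should be kept.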

7. Gene-Level Quantification (HTSeq)#

Counts reads overlapping gene features
Uses STAR alignments and GTF annotations
Provides raw read counts per gene
Output: Gene count matrices

8. Alignment Quality Assessment (QUALIMAP)#

Evaluates alignment quality metrics
Calculates exonic/intronic/intergenic read proportions
Assesses 5'-3' bias
Generates coverage profiles
Output: Quality metrics, coverage statistics

9. Cell Typing Classification (SingleR)#

Predicts cell type based on gene expression patterns
Uses reference databases: HPCA, GTEx, TCGA
Classifies by cell phase, progenitor type, tissue type
Output: Cell type predictions with confidence scores

10. Principal Component Analysis (PCA)#

Identifies sample relationships and batch effects
Visualizes expression patterns across samples
Detects outliers or technical artifacts
Output: PCA plots, variance explained

11. Report Generation (MultiQC)#

Aggregates metrics across all samples and tools
Creates interactive HTML report
Generates summary tables and visualizations
Output: multiqc_report.html

File Naming Conventions#

Input Files#

Forward reads: {biosampleName}_R1_001.fastq.gz
Reverse reads: {biosampleName}_R2_001.fastq.gz

Output Files#

Aligned BAM: {biosampleName}.bam
BAM index: {biosampleName}.bam.bai
Gene counts: {biosampleName}.htseq_counts.tsv
Chimeric junctions: {biosampleName}_Chimeric.out.junction

Estimated Processing Time#

Project Size	Estimated Time	Notes
10 samples	<1 hour	With 100K reads subsampling
50 samples	1-3 hours	Standard QC analysis
100 samples	2-6 hours	Large batch processing
384 samples	4-12 hours	High-throughput plate

Factors affecting runtime:

- Read depth per sample
- Subsampling enabled/disabled
- Reference genome size
- Current cluster load

Summarizing Output#

MultiQC Report#

The MultiQC report aggregates all QC metrics into a single interactive HTML file located at:

`multiqc/multiqc_report.html`

MultiQC Report Sections#

1. General Statistics#

Overview table showing key metrics for all samples:

- Total reads
- Alignment rate
- Exonic read percentage
- Number of genes detected
- Mitochondrial read percentage

2. Selected Metrics#

Curated selection of the most important QC metrics across all analysis tools. This section consolidates key quality indicators from read processing, alignment, quantification, and cell typing analyses into a single comprehensive table for easy comparison across samples. Metrics include read counts at various pipeline stages, quality filtering statistics, alignment proportions (exonic, intronic, intergenic, mitochondrial), gene and transcript detection counts, insert size measurements, coverage metrics, expression dynamics, and automated cell type classifications. See Appendix B for detailed descriptions of all metrics, including typical ranges and quality indicators.

3. Fastp Section#

Trimming and filtering statistics:

- Reads passed/filtered
- Adapter trimming rates
- Quality filtering results
- Poly-G tail removal (if applicable)
- Before/after comparison plots

4. Salmon Section#

Transcript quantification metrics:

- Mapping rate
- Number of transcripts detected
- Library type detection
- Fragment length distribution

5. STAR Alignment Section#

Splice-aware alignment statistics:

- Uniquely mapped reads percentage
- Multi-mapped reads
- Unmapped reads
- Chimeric reads
- Reads mapped to multiple loci

6. Qualimap Section#

Alignment quality metrics:

- Exonic reads: Percentage of reads mapping to exons
- Intronic reads: Percentage of reads mapping to introns
- Intergenic reads: Percentage of reads in intergenic regions
- 5'-3' bias: Coverage uniformity along transcript length
- Gene body coverage: Distribution of reads across gene features

7. Gene Detection Section#

Gene-level quantification summary:

- Total genes detected
- Protein-coding genes
- lncRNAs, pseudogenes, miRNAs
- Gene type distribution
- Mitochondrial gene proportion

8. Cell Typing Section#

Automated cell type classification:

- Predicted cell type
- Cell cycle phase
- Tissue type prediction
- Confidence scores

9. Expression Dynamics#

Dynamic range and expression metrics:

- Housekeeping gene expression consistency
- Coefficient of variation (CV) for housekeeping genes
- Expression range (min/max/average)
- Top and bottom 10% expression levels
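
The coefficient of variation reported for housekeeping genes is, in its standard form, the standard deviation divided by the mean. The sketch below uses the sample standard deviation; whether the pipeline uses the sample or population form, or reports the value as a percentage, is not stated in this manual, so treat those choices as assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (needs at least two values)."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("nan")  # CV is undefined when the mean is zero
    return statistics.stdev(values) / mean
```

A housekeeping gene with identical counts in every sample has a CV of 0; larger values point to more variability between samples.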

10. PCA Analysis#

Sample clustering and relationships:

- Principal component plots
- Variance explained by each PC
- Sample outlier detection
- Group separation visualization

11. ClusterQC Composition#

Quality category distribution:

- Number of samples per quality category
- Overall batch quality assessment
- Category definitions (1-5 scale)

ClusterQC Analysis#

ClusterQC provides visual assessment of RNA-seq sample quality:

RNA-QC Composition Plot#

Location: tertiary_analyses/qc_plots/composition_rnaqc.pdf

This plot displays the distribution of samples across quality categories.

Quality Categories:

Category	Quality Level	Characteristics
Category 5	Excellent	High alignment rate (>70%), high exonic proportion (>75%), low intergenic reads (<5%), >10K genes detected
Category 4	Good	Alignment rate 50-70%, exonic proportion 60-75%, suitable for analysis
Category 3	Moderate	Alignment rate 30-50%, may require additional QC review
Category 2	Fair	Alignment rate 10-30%, consider re-sequencing
Category 1	Poor	Alignment rate <10%, not recommended for analysis
Category 0	Failed	Unable to process or extreme technical failure

Interpretation:

- Samples in Category 4-5 are suitable for downstream RNA-seq analysis
- Samples in Category 1-2 may indicate library preparation or sequencing issues
- Use this plot to identify problematic samples before proceeding to differential expression analysis
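
As a reading aid for the category table, the sketch below maps an alignment rate onto the bands listed there. It is a deliberate simplification: it looks at alignment rate only, whereas the table also cites exonic proportion, intergenic reads, and gene counts for the top categories, and the actual ClusterQC logic is not documented here. Boundary handling (for example, whether exactly 70% counts as Category 4 or 5) is an assumption, and Category 0 cannot be derived from a rate at all.

```python
def category_from_alignment_rate(rate_percent):
    """Map an alignment rate (0-100) to the alignment-rate band from the table."""
    if rate_percent > 70:
        return 5
    if rate_percent >= 50:
        return 4
    if rate_percent >= 30:
        return 3
    if rate_percent >= 10:
        return 2
    return 1
```

Use the result as a rough first guess and rely on the pipeline's own category assignment for any decision about excluding samples.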

Summary Verdict Table#

Location: tertiary_analyses/qc_plots/summary_verdict.txt

Summary table showing:

- Number of samples in each quality category
- Proportion of samples in each category
- Overall project quality assessment

Example:

Category    NumberCells    ProportionCells
         0              0
         5              100
         0              0
         0              0
         0              0
         0              0

In this example, all 5 samples are Category 4 (good quality).

Group-Stratified Summary#

Location: tertiary_analyses/qc_plots/summary_verdict_group.txt

If groups are specified in the input CSV, this file shows quality category distribution per group, useful for detecting batch effects.

Output Files#

All pipeline results are organized in a structured directory hierarchy:

Directory Structure#

project_results/
├── multiqc/
├── read_counts/
├── secondary_analyses/
│   ├── alignment_htseq/
│   ├── insert_size/
│   ├── quantification_htseq/
│   ├── quantification_salmon/
│   └── secondary_metrics/
├── tertiary_analyses/
│   ├── classification_cell_typing/
│   └── qc_plots/
└── execution_info/

Detailed Output Descriptions#

1. `multiqc/` Directory#

File	Description
`multiqc_report.html`	Main QC report - Interactive HTML report with all QC metrics. Open in web browser.
`multiqc_data/`	Raw data used to generate MultiQC report (JSON, TSV files)
`multiqc_report_plots/`	Plot files - PDF, PNG, and SVG versions of all plots featured in MultiQC
`multiqc_version.yml`	Version information for reproducibility

Example: multiqc_report.html

2. `secondary_analyses/alignment_htseq/` Directory#

Aligned reads for each biosample:

secondary_analyses/alignment_htseq/
├── {biosampleName}.bam
├── {biosampleName}.bam.bai
└── {biosampleName}_Chimeric.out.junction

File Type	Description
`.bam`	Binary Alignment Map - STAR-aligned reads in binary format. Can be viewed with IGV or samtools. Contains primary alignments only.
`.bam.bai`	BAM Index - Required for efficient random access to BAM file.
`_Chimeric.out.junction`	Chimeric junctions - Detected fusion/chimeric reads that span distant genomic locations. Useful for fusion gene detection.

Size: Typically 50-500 MB per sample (for 100K reads)

Use Case: Visual inspection of alignments, splice junction analysis, fusion detection

3. `secondary_analyses/quantification_htseq/` Directory#

Gene-level quantification from STAR and HTSeq:

secondary_analyses/quantification_htseq/
├── df_gene_counts_starhtseq.tsv
├── df_mt_gene_counts_starhtseq.tsv
├── df_gene_types_detected_summary_starhtseq.tsv
├── matrix_gene_counts_starhtseq.txt
├── HouseKeepingGenes_Expression.pdf
├── HKGenes_Expression__mqc.png
├── HouseKeepingGenes_Counts_mqc.tsv
└── HouseKeepingGenes_CV_mqc.tsv

File	Description
`df_gene_counts_starhtseq.tsv`	Main gene counts table - Contains ENSEMBL gene IDs, gene symbols, gene biotypes, and HTSeq counts for each sample and detected gene
`df_mt_gene_counts_starhtseq.tsv`	Mitochondrial metrics - Contains MT_counts (mitochondrial gene counts), Total_counts (total detected gene counts), and PropMT (proportion of MT to total counts) per sample
`df_gene_types_detected_summary_starhtseq.tsv`	Gene biotype summary - Details the number and proportion of various gene types detected: protein-coding genes, lncRNAs, pseudogenes, miRNAs, etc.
`matrix_gene_counts_starhtseq.txt`	Count matrix - Project-level matrix with all samples (columns) and read counts for all genes (rows). Ready for differential expression analysis.
`HouseKeepingGenes_Expression.pdf`	Housekeeping gene heatmap - Clustergram showing expression consistency of housekeeping genes across samples
`HouseKeepingGenes_CV_mqc.tsv`	Coefficient of variation - CV rates for housekeeping genes, indicating technical variability
`HouseKeepingGenes_Counts_mqc.tsv`	Housekeeping counts - Raw counts for housekeeping genes used for QC assessment

Example Files: - df_gene_counts_starhtseq.tsv - df_mt_gene_counts_starhtseq.tsv - df_gene_types_detected_summary_starhtseq.tsv - matrix_gene_counts_starhtseq.txt - HKGenes_Expression__mqc.png - Housekeeping gene heatmap

Use Case:

- Use matrix_gene_counts_starhtseq.txt for differential expression analysis in R (DESeq2, edgeR)
- Check df_mt_gene_counts_starhtseq.tsv for contamination or poor cell quality (high PropMT indicates dying cells)
- Review df_gene_types_detected_summary_starhtseq.tsv to assess library complexity
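
The mitochondrial check can be scripted. The sketch below recomputes the proportion from the `MT_counts` and `Total_counts` columns described above and lists samples over a threshold. Two assumptions are made: the first column of the file identifies the sample (its exact name is not given in this manual), and the 0.2 default mirrors the 20% guideline used elsewhere in this manual. The function name is hypothetical.

```python
import csv
import io


def high_mt_samples(tsv_text, threshold=0.2):
    """Return (sample, proportion) pairs whose mitochondrial fraction exceeds threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    sample_column = reader.fieldnames[0]  # assumption: first column names the sample
    flagged = []
    for row in reader:
        total = float(row["Total_counts"])
        proportion = float(row["MT_counts"]) / total if total else 0.0
        if proportion > threshold:
            flagged.append((row[sample_column], round(proportion, 3)))
    return flagged
```

To use it on a real run, pass the contents of df_mt_gene_counts_starhtseq.tsv, for example `high_mt_samples(open(path).read())`.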

4. `secondary_analyses/quantification_salmon/` Directory#

Transcript-level quantification from Salmon:

secondary_analyses/quantification_salmon/
├── df_transcript_counts_salmon.tsv
├── df_transcript_types_detected_summary_salmon.tsv
├── matrix_transcript_raw_salmon.tsv
├── matrix_transcript_tpm_salmon.tsv
├── matrix_transcript_length_tpm_salmon.tsv
├── df_gene_counts_salmon.tsv
├── df_mt_gene_counts_salmon.tsv
├── df_gene_types_detected_summary_salmon.tsv
├── matrix_gene_counts_salmon.tsv
├── matrix_gene_tpm_salmon.tsv
└── matrix_gene_length_tpm_salmon.tsv

Transcript-Level Files:

File	Description
`df_transcript_counts_salmon.tsv`	Main transcript table - Contains ENSEMBL transcript IDs, transcript lengths, TPM values, and both scaled and unscaled transcript counts
`matrix_transcript_raw_salmon.tsv`	Raw transcript counts - Unscaled read counts per transcript across all samples
`matrix_transcript_tpm_salmon.tsv`	TPM-scaled transcripts - Transcripts Per Million scaled by library size
`matrix_transcript_length_tpm_salmon.tsv`	Length-scaled TPM - Scaled first by average transcript length, then by library size (recommended for differential expression)
`df_transcript_types_detected_summary_salmon.tsv`	Transcript biotype summary - Number and types of transcripts detected per sample

Example Files: - df_transcript_counts_salmon.tsv

Gene-Level Files (Salmon-based):

File	Description
`df_gene_counts_salmon.tsv`	Salmon gene counts - Gene-level counts generated by collapsing transcript counts from the same gene
`df_mt_gene_counts_salmon.tsv`	Mitochondrial metrics - Similar to HTSeq version but derived from Salmon quantification
`df_gene_types_detected_summary_salmon.tsv`	Gene biotype summary - Gene types detected using Salmon-based quantification
`matrix_gene_counts_salmon.tsv`	Gene count matrix - Unscaled gene counts across all samples
`matrix_gene_tpm_salmon.tsv`	Gene TPM matrix - TPM-scaled gene expression values
`matrix_gene_length_tpm_salmon.tsv`	Length-scaled gene TPM - Recommended for differential expression analysis using tximport-compatible tools

Example Files: - df_gene_counts_salmon.tsv - matrix_gene_length_tpm_salmon.tsv

Use Case:

- Use matrix_transcript_length_tpm_salmon.tsv for transcript-level differential expression
- Use matrix_gene_length_tpm_salmon.tsv with tximport in R for gene-level DE analysis
- Salmon quantification is faster and doesn't require full alignment
- Compare Salmon and HTSeq gene counts to assess quantification concordance

5. `secondary_analyses/secondary_metrics/` Directory#

Comprehensive alignment and expression metrics:

secondary_analyses/secondary_metrics/
├── pipeline_metrics_summary_percents.csv
├── qualimap_stats_mqc.csv
└── df_dynamicrange_expression.tsv

File	Description
`pipeline_metrics_summary_percents.csv`	Percentage-based metrics - Contains metrics from the "QualiMap percent stats" section: percentage of exonic reads, intronic reads, intergenic reads, and total reads aligned
`qualimap_stats_mqc.csv`	Qualimap statistics - Contains alignment stats including total aligned reads, alignments to genes, 5'-3' bias metrics
`df_dynamicrange_expression.tsv`	Expression dynamics - Contains min/max/average expression, bottom 10% and top 10% expression levels, and dynamic range for each sample

Example Files: - pipeline_metrics_summary_percents.csv - qualimap_stats_mqc.csv - df_dynamicrange_expression.tsv

Key Metrics Explained:

Metric	Description	Good Quality Range
`reads.aligned_P_Total`	Percentage of reads successfully aligned	> 50%
`exonic_P_gen`	Percentage of reads mapping to exons	> 60%
`intergenic_P_gen`	Percentage of reads in intergenic regions	< 10%
`dynamic_range_expr`	Spread of expression levels (top10% - bottom10%)	Higher indicates more diverse expression

6. `secondary_analyses/insert_size/` Directory#

Insert size distribution metrics (for paired-end sequencing):

secondary_analyses/insert_size/
└── aggregated_insert_size_summary.tsv

File	Description
`aggregated_insert_size_summary.tsv`	Aggregated insert size metrics - Contains mean and median insert sizes calculated separately for different RNA biotypes across all samples

This file provides insert size statistics stratified by gene biotype, useful for assessing library quality and fragment size distribution:

Biotype Columns:

- Protein_Coding_Mean / Protein_Coding_Median: Insert sizes from reads mapping to protein-coding genes (most informative for library prep QC)
- Ribosomal_RNA_Mean / Ribosomal_RNA_Median: Insert sizes from ribosomal RNA genes (indicates rRNA depletion efficiency)
- Mitochondrial_RNA_Mean / Mitochondrial_RNA_Median: Insert sizes from mitochondrial genes
- Small_RNAs_Mean / Small_RNAs_Median: Insert sizes from small nuclear/nucleolar RNAs
- MicroRNAs_Mean / MicroRNAs_Median: Insert sizes from microRNA genes

Example Files: - aggregated_insert_size_summary.tsv

7. `tertiary_analyses/classification_cell_typing/` Directory#

Cell type prediction results:

tertiary_analyses/classification_cell_typing/
├── df_cell_typing_summary_singler_hpca_gtex_tcga.tsv
└── df_cell_typing_scores_singler_hpca_gtex_tcga.tsv

File	Description
`df_cell_typing_summary_singler_hpca_gtex_tcga.tsv`	Cell type predictions - Contains assigned Cell Phase (G1, S, G2M), Progenitor Type, Tissue Type, TGCA Tissue Type, and TGCA Tumor Type for each sample
`df_cell_typing_scores_singler_hpca_gtex_tcga.tsv`	Prediction scores - Confidence scores for each sample against all possible phases, progenitors, tissues, etc. Used to assess classification confidence

Example Files: - df_cell_typing_summary_singler_hpca_gtex_tcga.tsv - df_cell_typing_scores_singler_hpca_gtex_tcga.tsv

Cell Classification Categories:

Phase: Cell cycle stage (G1, S, G2M)
Progenitor: Cell lineage type (e.g., B_cell, T_cell, Monocyte)
Tissue: Predicted tissue of origin (e.g., Blood, Spleen, Brain)
TGCA Tissue: TCGA reference tissue type
TGCA Tumor: TCGA tumor classification

Use Case: Validate expected cell types, detect contamination, identify unexpected cell populations

8. `tertiary_analyses/qc_plots/` Directory#

Aggregate quality control visualizations:

tertiary_analyses/qc_plots/
├── composition_rnaqc.pdf
├── RNAQC_composition_mqc.jpg
├── summary_verdict.txt
└── summary_verdict_group.txt

File	Description
`composition_rnaqc.pdf`	QC category distribution - Bar chart showing number and percentage of samples in each quality category (1-5)
`RNAQC_composition_mqc.jpg`	JPEG version for inclusion in reports
`summary_verdict.txt`	Overall summary - Table with number and proportion of samples per quality category
`summary_verdict_group.txt`	Group-stratified summary - Quality metrics broken down by experimental groups (if specified in input CSV)

Example Files: - composition_rnaqc.pdf - RNAQC_composition_mqc.jpg - summary_verdict.txt - summary_verdict_group.txt

9. `read_counts/` Directory#

Read count tracking at each processing step:

Tracks the number of reads remaining after subsampling, trimming, filtering, and alignment. Useful for identifying where read loss occurs.

10. `execution_info/` Directory#

Pipeline execution metadata:

execution_info/
├── input_dataset.csv
├── execution_report.html
├── execution_timeline.html
├── execution_trace.txt
├── params_file.json
├── tool_versions.yml
├── input.csv
├── output.json
└── eb_event.json

File	Description
`input_dataset.csv`	Copy of the input CSV used to run the pipeline
`execution_report.html`	Nextflow execution report with resource usage and runtime statistics
`execution_timeline.html`	Visual timeline of task execution
`execution_trace.txt`	Detailed trace of all executed tasks
`params_file.json`	Complete parameter settings used for the pipeline run
`tool_versions.yml`	Software versions for all tools used in the pipeline (STAR, Salmon, HTSeq, etc.)

Use Case: Troubleshooting, reproducibility, performance optimization, documenting analysis parameters

Appendix#

A. FASTQ Naming Conventions#

The BJ-Expression pipeline accepts Illumina FASTQ files following these naming patterns:

Standard Illumina Format#

`{SampleName}_S{SampleNumber}_L{LaneNumber}_R{ReadNumber}_001.fastq.gz`

Components:

- {SampleName}: User-defined sample identifier (e.g., Expression-test1, Sample_A)
- S{SampleNumber}: Sample number from sample sheet (e.g., S1, S12)
- L{LaneNumber}: Sequencing lane (e.g., L001, L002)
- R{ReadNumber}: Read direction (R1 for forward, R2 for reverse)
- 001: File segment (always 001 for single files)

Examples:

Expression-test1_S1_L001_R1_001.fastq.gz
Expression-test1_S1_L001_R2_001.fastq.gz
ResolveOMEv2.1-RNA-04C_S4_L003_R1_001.fastq.gz
ResolveOMEv2.1-RNA-04C_S4_L003_R2_001.fastq.gz

Simplified Format (Also Accepted)#

{SampleName}_R1_001.fastq.gz
{SampleName}_R2_001.fastq.gz

Requirements:

- Files must be gzip compressed (.fastq.gz or .fq.gz)
- Forward and reverse reads must have matching sample names
- Read pairs must be indicated by _R1_ and _R2_ (or _R1. and _R2.)
- Both reads in a pair are required (the pipeline does not support single-end input for paired data)
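
The naming rules above can be checked with a regular expression. The pattern below is a sketch written from the formats described in this appendix, not the pipeline's own validation; the function names are hypothetical, and three-digit lane numbers are assumed from the examples.

```python
import re

FASTQ_NAME = re.compile(
    r"^(?P<sample>.+?)"                                  # sample name (shortest match)
    r"(?:_S(?P<sample_number>\d+)_L(?P<lane>\d{3}))?"    # optional _S{n}_L{lane}
    r"_R(?P<read>[12])(?:_001)?\.f(?:ast)?q\.gz$"        # _R1/_R2, optional _001, extension
)


def parse_fastq_name(path):
    """Return the name components as a dict, or None if the name does not match."""
    filename = path.rsplit("/", 1)[-1]  # also handles s3:// paths
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


def is_read_pair(read1_path, read2_path):
    """True if the two files look like the R1 and R2 of the same sample and lane."""
    first, second = parse_fastq_name(read1_path), parse_fastq_name(read2_path)
    if first is None or second is None:
        return False
    same_prefix = all(first[k] == second[k] for k in ("sample", "sample_number", "lane"))
    return same_prefix and first["read"] == "1" and second["read"] == "2"
```

For files in the simplified format, `sample_number` and `lane` come back as `None`, and pairing then depends on the sample name alone.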

B. Selected Metrics Descriptions#

The following table describes key metrics found in the MultiQC report and output files:

Metric	Description	Typical Range	Quality Indicator
`SampleId`	Unique sample identifier	N/A	Must match biosampleName from input CSV
`BJ_calculated_raw_read_pairs`	Total raw read pairs calculated by BaseJumper	Varies by design	Project-dependent
`Nonsubsampled_reads`	Total reads before subsampling step	Varies by design	Project-dependent
`Subsampled_reads`	Reads after subsampling (100,000 default)	100,000 if enabled	Should match target if enabled
`Pass_filtered_reads`	Reads passing FASTP quality filters	> 80% of input	Higher indicates good quality
`Low_quality_reads`	Reads removed due to low quality scores	< 10% of input	Lower is better
`Many_Ns_reads`	Reads with excessive N bases removed	< 5% of input	Lower is better
`Too_short_reads`	Reads too short after trimming	< 10% of input	Lower is better
`Prop_pass_filtered_reads`	Proportion of reads passing all filters	> 0.8 (80%)	Higher indicates good quality
`Prop_low_quality_reads`	Proportion failing quality threshold	< 0.1 (10%)	Lower is better
`Prop_of_many_Ns_reads`	Proportion with too many N bases	< 0.05 (5%)	Lower is better
`Prop_of_too_short_reads`	Proportion too short after trimming	< 0.1 (10%)	Lower is better
`Prop_mappability`	Percentage of reads successfully aligned	40-80%	Higher indicates good library quality
`Prop_exonic`	Percentage of reads mapping to exons	60-80%	Higher indicates good mRNA enrichment
`Prop_intronic`	Percentage of reads mapping to introns	10-30%	Too high may indicate pre-mRNA contamination
`Prop_intergenic`	Percentage of reads in intergenic regions	< 10%	Lower indicates less genomic DNA contamination
`Prop_mitochondrion`	Proportion of mitochondrial reads	< 0.2 (20%)	Higher may indicate cell stress or poor quality
`Protein_coding_genes`	Number of protein-coding genes detected	> 5,000	Higher indicates good library complexity
`Protein_coding_transcripts`	Number of protein-coding transcripts detected	> 10,000	Higher indicates isoform diversity
`Protein_coding_Mean_insertsize`	Mean insert size from protein-coding genes	150-300 bp (varies by protocol)	Should match expected library prep size
`Protein_coding_Median_insertsize`	Median insert size from protein-coding genes	150-300 bp (varies by protocol)	More robust to outliers than mean
`Mean_number_of_txs_per_gene`	Average transcripts detected per gene	1.5-3.0	Higher indicates isoform detection
`Coverage_ratio_gene_body`	Ratio of 5' to 3' coverage across genes	0.8-1.2	~1.0 indicates no bias
`Median_coverage_gene_body`	Median coverage depth across gene bodies	Varies by depth	Higher indicates better coverage
`Dynamic_range`	Expression spread (top 10% - bottom 10%)	3,000-6,000	Higher indicates diverse expression
`Cell_type_Phase`	Predicted cell cycle phase	G1, S, or G2M	Validation of expected phase
`Cell_type_Progenitor`	Predicted progenitor cell type	B_cell, T_cell, etc.	Validation of expected lineage
`Cell_type_Tissue`	Predicted tissue of origin	Blood, Brain, etc.	Validation of expected tissue
`Cell_type_TGCA_tissue`	TCGA reference tissue classification	Various tissue types	Reference-based classification
`Cell_type_TGCA_tumor`	TCGA tumor type classification	Various tumor types	Tumor-specific classification
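
As a rough illustration of how the typical ranges in this table might be applied, the sketch below flags metrics in one exported row that fall outside their range. It covers only a subset of the columns, reads values as strings (as they would arrive from a CSV export), and treats boundary values as in range; the helper name `flag_out_of_range` and the example row are hypothetical and not part of the pipeline.

```python
# Typical ranges copied from the table above, as (low, high); None means unbounded.
TYPICAL_RANGES = {
    "Prop_pass_filtered_reads": (0.8, None),
    "Prop_low_quality_reads": (None, 0.1),
    "Prop_of_many_Ns_reads": (None, 0.05),
    "Prop_of_too_short_reads": (None, 0.1),
    "Prop_mitochondrion": (None, 0.2),
    "Protein_coding_genes": (5000, None),
    "Coverage_ratio_gene_body": (0.8, 1.2),
}


def flag_out_of_range(row):
    """Return the names of metrics in `row` that fall outside their typical range."""
    flagged = []
    for name, (low, high) in TYPICAL_RANGES.items():
        if name not in row:
            continue  # metric absent from this export; nothing to check
        value = float(row[name])
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(name)
    return flagged


# Hypothetical sample row, for illustration only.
example_row = {
    "Prop_pass_filtered_reads": "0.91",
    "Prop_mitochondrion": "0.35",
    "Protein_coding_genes": "4200",
}
print(flag_out_of_range(example_row))
```

A flagged metric is a prompt to look at the sample more closely, not an automatic failure; the table notes that several ranges vary by protocol and depth.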

C. ClusterQC Cutoffs and Interpretation#

Quality Metrics Explained#

1. Mappability (Alignment Rate)

- What it measures: Percentage of reads successfully aligned to the reference genome
- Interpretation:
  - High (>70%): Excellent library quality, minimal contamination
  - Medium (50-70%): Good quality, acceptable for analysis
  - Low (<50%): Possible contamination, degraded RNA, or wrong reference genome

2. Proportion Exonic

- What it measures: Percentage of aligned reads mapping to exon regions
- Interpretation:
  - High (>75%): Excellent mRNA enrichment, successful library prep
  - Medium (60-75%): Acceptable mRNA content
  - Low (<60%): Poor mRNA enrichment, possible genomic DNA contamination or pre-mRNA

3. Proportion Intergenic

- What it measures: Percentage of aligned reads mapping to intergenic regions
- Interpretation:
  - Low (<5%): Excellent specificity to genic regions
  - Medium (5-10%): Acceptable background
  - High (>10%): Genomic DNA contamination or poor library enrichment

4. Proportion Mitochondria

- What it measures: Percentage of reads mapping to mitochondrial genes
- Interpretation:
  - Low (<10%): Healthy cells with intact cytoplasmic RNA
  - Medium (10-20%): Acceptable for some cell types
  - High (>20%): Cell stress, apoptosis, or poor sample quality

5. Number of Protein Coding Genes

- What it measures: Total number of protein-coding genes with detected expression
- Interpretation:
  - High (>10,000): Excellent library complexity
  - Medium (5,000-10,000): Good complexity, suitable for analysis
  - Low (<5,000): Poor library complexity or insufficient sequencing depth
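
The three-tier cutoffs above can be written as a small lookup. The sketch below is a minimal illustration, not the ClusterQC implementation: the metric keys are hypothetical names chosen for the example, percentages are expected as numbers such as `72.5`, and values exactly on a cutoff are assigned to the Medium tier.

```python
# (lower cutoff, upper cutoff) for each metric, taken from the tiers listed above.
# The key names are illustrative and are not the pipeline's column names.
CUTOFFS = {
    "mappability_pct": (50, 70),
    "exonic_pct": (60, 75),
    "intergenic_pct": (5, 10),
    "mitochondrial_pct": (10, 20),
    "protein_coding_genes": (5000, 10000),
}


def tier(metric, value):
    """Label a value as 'Low', 'Medium', or 'High' using the cutoffs for `metric`."""
    lower, upper = CUTOFFS[metric]
    if value < lower:
        return "Low"
    if value > upper:
        return "High"
    return "Medium"


for metric, value in [("mappability_pct", 82), ("mitochondrial_pct", 25)]:
    print(metric, value, tier(metric, value))
```

The label only describes magnitude. Whether "High" is good depends on the metric: it is desirable for mappability, exonic proportion, and gene counts, and a warning sign for intergenic and mitochondrial proportions.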

D. How Adapter Trimming Works#

The BJ-Expression pipeline uses a two-step adapter trimming approach for optimal read cleanup:

CUTADAPT (Step 1: Targeted Adapter Removal)#

CUTADAPT performs the initial adapter removal using known adapter sequences:

Adapter Removal Strategy:

- Removes adapters from the 3' end of reads (`-a` flag) and from either the 5' or 3' end (`-b` flag, which cutadapt treats as an adapter that may occur at either end)
- Processes both R1 and R2 reads independently with specified adapters
- Uses exact adapter sequence matching for precise removal
- Particularly effective for removing known library prep adapters
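
To make the flags concrete, the sketch below assembles a paired-end cutadapt argument list without running it. It is an assumption about what such a call could look like, not the pipeline's actual command: the file names are placeholders, and the adapter is the default sequence quoted in the technical note later in this appendix. In cutadapt, `-a` names a 3' adapter and `-b` an adapter that may sit at either end; `-A` and `-B` are the corresponding options for R2, and `-o`/`-p` name the R1 and R2 outputs.

```python
import shlex

# Default adapter sequence quoted in this manual's technical note.
ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACA"


def build_cutadapt_command(r1_in, r2_in, r1_out, r2_out, adapter=ADAPTER):
    """Assemble a paired-end cutadapt argument list (the command is not executed)."""
    return [
        "cutadapt",
        "-a", adapter,  # R1: adapter at the 3' end
        "-b", adapter,  # R1: adapter at either the 5' or 3' end
        "-A", adapter,  # R2 counterpart of -a
        "-B", adapter,  # R2 counterpart of -b
        "-o", r1_out,   # trimmed R1 output
        "-p", r2_out,   # trimmed R2 output
        r1_in,
        r2_in,
    ]


# Placeholder file names, for illustration only.
cmd = build_cutadapt_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_R1.trimmed.fastq.gz", "sample_R2.trimmed.fastq.gz",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, which avoids quoting problems if it is later passed to `subprocess.run`.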

FASTP (Step 2: Comprehensive QC and Trimming)#

FASTP performs comprehensive adapter and quality trimming:

Adapter Detection:

- Automatically detects adapter sequences by analyzing read ends
- Can use user-specified adapter sequences (recommended for single-cell RNA-seq)
- Detects and removes both 3' and 5' adapters

Quality Trimming:

- Removes low-quality bases from read ends (default Q score < 15)
- Performs sliding window quality trimming
- Removes reads shorter than minimum length after trimming

Poly-G Tail Removal:

- For NovaSeq/NextSeq (two-color chemistry), removes artificial poly-G tails
- Poly-G tails occur when no signal is detected in two-color chemistry
- Automatically detected based on instrument parameter

Read Filtering:

- Removes reads with too many N bases
- Filters reads below complexity threshold
- Removes extremely short reads after trimming

Output:

- Trimmed FASTQ files with improved quality
- JSON report with detailed trimming statistics
- HTML report with visual QC plots
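
The JSON report can be inspected directly. The sketch below converts the read counts in a fastp report into proportions of the input reads. It assumes the usual fastp layout, with the input total under `summary.before_filtering.total_reads` and per-category counts under `filtering_result`; check the key names against your own report. How these values map onto the `Prop_*` columns of the metrics table is also an assumption, and the embedded report is made-up data.

```python
import json


def filtering_proportions(report):
    """Return each filtering_result count as a fraction of the input reads."""
    total = report["summary"]["before_filtering"]["total_reads"]
    if total == 0:
        return {}  # empty input: no proportions to report
    return {name: count / total for name, count in report["filtering_result"].items()}


# Made-up report fragment, for illustration only.
example_report = json.loads("""
{
  "summary": {"before_filtering": {"total_reads": 100000}},
  "filtering_result": {
    "passed_filter_reads": 90000,
    "low_quality_reads": 6000,
    "too_many_N_reads": 1000,
    "too_short_reads": 3000
  }
}
""")

props = filtering_proportions(example_report)
for name, value in props.items():
    print(f"{name}: {value:.3f}")
```

For a real report, replace the embedded fragment with `json.load(open(path))`, where `path` points at the sample's fastp JSON file.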

Technical Note: The default adapter sequence (AAGCAGTGGTATCAACGCAGAGTACA) is optimized for single-cell RNA-seq protocols using template-switching. If you use a different library prep protocol, specify the appropriate adapter sequences.

E. Frequently Asked Questions#

Q: How do I download gene or transcript count tables?

A: Follow these steps:

1. Export your project using BaseJumper (see Data Export manual)
2. In the `export_data/` folder, navigate to:
   - `secondary_analyses/quantification_htseq/` for STAR/HTSeq-based gene counts
   - `secondary_analyses/quantification_salmon/` for Salmon-based transcript or gene counts
3. Download the matrix files for use in downstream analysis tools

Q: Are the gene or transcript count tables normalized?

A:

- STAR/HTSeq gene counts (`matrix_gene_counts_starhtseq.txt`): Not normalized. These are raw counts suitable for DESeq2 or edgeR.
- Salmon transcript counts: Available in three formats:
  - `matrix_transcript_raw_salmon.tsv`: Raw, unnormalized counts
  - `matrix_transcript_tpm_salmon.tsv`: TPM (transcripts per million; normalized for transcript length and sequencing depth)
  - `matrix_transcript_length_tpm_salmon.tsv`: Length-scaled TPM (recommended for DE analysis)
- Salmon gene counts: Same normalization options as transcripts
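
To show what TPM normalization does to raw counts, the sketch below applies the standard TPM formula to three made-up transcripts. Salmon computes TPM itself from its own effective-length estimates, so this only illustrates the arithmetic and is not how the matrices above are produced.

```python
def tpm(counts, effective_lengths):
    """Convert read counts to TPM: scale by length, then normalize to one million."""
    # Reads per base of transcript, which removes the advantage of long transcripts.
    rates = [count / length for count, length in zip(counts, effective_lengths)]
    total = sum(rates)
    # Rescale so the values for one sample sum to 1,000,000.
    return [rate / total * 1_000_000 for rate in rates]


# Made-up counts and effective lengths (in bases) for three transcripts.
counts = [100, 200, 300]
lengths = [1000, 2000, 1000]

for count, length, value in zip(counts, lengths, tpm(counts, lengths)):
    print(f"count={count} length={length} TPM={value:.0f}")
```

The first two transcripts end up with the same TPM even though the second has twice the raw count, because it is also twice as long. This is why raw counts, rather than TPM, are the expected input for count-based tools such as DESeq2 and edgeR.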

Q: Should I disable subsampling for my analysis?

A:

- Keep enabled for initial QC and rapid sample quality assessment
- Disable for:
  - Final differential expression analysis
  - Isoform-level analysis requiring full depth
  - Low-input samples where every read matters
  - Publication-quality results

Q: Which gene counts should I use - STAR/HTSeq or Salmon?

A: Both are valid, but:

- STAR/HTSeq: Traditional approach, full alignment, good for splice junction analysis
- Salmon: Faster, includes length scaling (better for isoform differences), modern best practice
- Recommendation: Use Salmon gene counts (`matrix_gene_length_tpm_salmon.tsv`) with tximport in R for most analyses

Q: What does high mitochondrial percentage indicate?

A: A high mitochondrial proportion (PropMT >20%) can indicate:

- Cell stress or apoptosis
- Poor sample preservation
- Cell lysis during library prep

For some cell types (e.g., neurons, cardiomyocytes), 10-15% is normal.

Q: How many genes should I detect?

A: Typical ranges:

- Bulk RNA-seq: 15,000-20,000 genes
- Single-cell RNA-seq: 2,000-8,000 genes per cell
- With 100K subsampling: Expect lower numbers; disable subsampling for accurate gene detection

Q: Can I analyze 10x Genomics data with this pipeline?

A: Yes, enable the "10x Support" module. Note:

- Specialized processing for 10x barcode structure
- Set library protocol to "chromium"
- Different adapter sequences may be needed

Q: What if my samples show poor alignment rates?

A: Check:

1. Correct reference genome selected (human vs. mouse)
2. FASTQ quality (look at fastp report)
3. Adapter contamination (check adapter content)
4. Possible contamination (enable Kraken if available)
5. Library prep issues (consult with lab)

Q: How do I interpret the cell typing results?

A: The SingleR classifier compares your expression profile to reference databases:

- High confidence: Scores >0.8 indicate strong matches
- Low confidence: Scores <0.5 suggest ambiguous or novel cell types
- Use case: Validation of expected cell types, not definitive classification
- Always confirm with marker gene expression

Q: What files do I need for differential expression analysis?

A: You need:

1. Count matrix: `matrix_gene_counts_starhtseq.txt` OR `matrix_gene_length_tpm_salmon.tsv`
2. Sample metadata: Your original input CSV with group information
3. For R/Bioconductor:
   - DESeq2: Use raw counts from STAR/HTSeq
   - edgeR: Use raw counts from STAR/HTSeq
   - limma-voom: Can use either, but include library size factors
   - tximport + DESeq2: Use Salmon output with `matrix_gene_length_tpm_salmon.tsv`
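
Before starting a differential expression analysis, it helps to confirm that every sample column in the count matrix has a group assignment in the metadata. The sketch below does this with the standard library. It assumes the matrix is tab-delimited with gene identifiers in the first column and one column per sample; that layout is an assumption to verify against your own export. The `biosampleName` and `groups` columns are those of the input CSV, while the helper names and inline data are hypothetical.

```python
import csv
import io


def load_groups(metadata_file):
    """Map each biosampleName to its group from the input CSV."""
    return {row["biosampleName"]: row["groups"] for row in csv.DictReader(metadata_file)}


def check_samples(matrix_file, groups):
    """Return (sample columns in the matrix, samples with no group assigned)."""
    header = next(csv.reader(matrix_file, delimiter="\t"))
    samples = header[1:]  # assumed layout: first column holds gene identifiers
    missing = [sample for sample in samples if sample not in groups]
    return samples, missing


# Inline stand-ins for the real files, for illustration only.
metadata = io.StringIO("biosampleName,groups\nS1,control\nS2,treated\n")
matrix = io.StringIO("gene_id\tS1\tS2\tS3\nGENE_A\t10\t12\t9\n")

groups = load_groups(metadata)
samples, missing = check_samples(matrix, groups)
print("samples:", samples)
print("missing group assignment:", missing)
```

With real files, pass open file handles in place of the `io.StringIO` objects. Samples reported as missing should be added to the metadata or dropped from the matrix before the design is built.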
