BJ-Expression

Background

The BJ-Expression pipeline is a scalable and reproducible bioinformatics pipeline that processes RNA-seq data and produces transcript-level and gene-level quantification. It supports both single-end and paired-end data. Starting from raw sequencing data in the form of FASTQ files, the pipeline performs down-sampling (randomly selecting a fixed, smaller number of reads from the full set of reads) and adapter trimming. It then performs transcript-level quantification using the pseudo-alignment method implemented in Salmon, as well as gene-level quantification using STAR (Spliced Transcripts Alignment to a Reference) and HTSeq.

Pipeline Overview

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Trimming[Subsample Reads <br/> <br/> Trim Reads]:::blue
    subgraph Preprocess
        Trimming
    end
    subgraph Transcript-Quant
        direction TB
        Trimming --> Salmon:::green
    end
    subgraph Gene-Quant
        direction TB
        Trimming --> Alignment[STAR <br/> <br/> Samtools <br/> <br/> HTSeq]:::green
    end
    subgraph Evaluate
        Alignment --> M_Metrics[Qualimap <br/><br/> Custom Metrics <br/><br/> Coverage Metrics]:::pink
        Alignment --> M_CellTyping[Cell Typing]:::pink
        Alignment --> M_PCA[PCA]:::pink
    end
    subgraph Report
        Trimming --> Mqc[MultiQC Report]:::orange
        Alignment --> Mqc
        Salmon --> Mqc
        M_Metrics --> Mqc
        M_CellTyping --> Mqc
        M_PCA --> Mqc
    end
    Mqc --> End((End)):::black
    Preprocess:::panel
    Transcript-Quant:::panel
    Gene-Quant:::panel
    Evaluate:::panel
    Report:::panel

The pipeline uses the following steps and tools to perform the analyses:

Subsample the paired-end reads to 200,000 reads using SEQTK SAMPLE so that metrics can be compared across samples

Info

Why subsample? For a first-pass analysis, or for a comparison between sequencing runs, subsampling shortens computation time while still allowing sufficient QC analysis. However, to take full advantage of your sequencing data, disable subsampling for a thorough analysis.

Evaluate sequencing quality using FASTP and trim/clip reads
Perform transcript-level quantification using the pseudo-alignment method implemented in SALMON
Perform splice-aware alignment using STAR
Extract primary aligned reads from the STAR-based BAM using SAMTOOLS
Perform gene-level quantification from the STAR alignment using HTSEQ
Evaluate STAR alignment (BAM) quality control using QUALIMAP
Evaluate cell typing, custom metrics, and perform PCA using custom tools
Aggregate the metrics across biosamples and tools to create overall pipeline statistics summary using MULTIQC
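
The subsampling step above can be illustrated with a short sketch. The pipeline itself uses SEQTK SAMPLE; the Python below is not that tool's implementation, only a minimal illustration of drawing a fixed number of reads uniformly at random (reservoir sampling). The function names `read_fastq` and `subsample`, the seed value, and the in-memory example reads are all assumptions made for this example.

```python
import io
import random
from typing import Iterator, List, TextIO, Tuple

# One FASTQ record: (header, sequence, separator, quality)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def subsample(records: Iterator[FastqRecord], n: int, seed: int) -> List[FastqRecord]:
    """Keep n records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < n:
                reservoir[j] = rec     # replace an earlier pick
    return reservoir


# Hypothetical paired-end input: five read pairs held in memory.
r1_text = "".join(f"@read{i}/1\nACGT\n+\nIIII\n" for i in range(5))
r2_text = "".join(f"@read{i}/2\nTGCA\n+\nIIII\n" for i in range(5))

# Using the same seed for both mate files selects the same record positions,
# so the read pairs stay matched after subsampling.
sample_r1 = subsample(read_fastq(io.StringIO(r1_text)), n=2, seed=11)
sample_r2 = subsample(read_fastq(io.StringIO(r2_text)), n=2, seed=11)
```

Because every record has the same chance of ending up in the reservoir, samples subsampled to the same depth can be compared on QC metrics without sequencing depth confounding the comparison.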

Pipeline Parameters

Parameter Name	Options	Description
Instrument	`NovaSeq` `NextSeq` `MiSeq` `MiniSeq` (default) `ISeq` `Other`	Instrument used to perform sequencing
Genome	`GRCh38` (default) `GRCm39` `GRCh37`	Reference genome to use for alignment
Read Length	`50` `75` (default) `100` `150`	Read length (number of cycles) used for sequencing
Adapter Sequence for first read	`AAGCAGTGGTATCAACGCAGAGTACA` (default)	Adapter sequence to be trimmed from the first read
Adapter Sequence for second read	`AAGCAGTGGTATCAACGCAGAGTACAT` (default)	Adapter sequence to be trimmed from the second read

Module Parameters

Module	Parameter Name	Options	Description
FastQ Reads Subsampling		(default)	Default is set to 100000 reads
10x		(default)	Support module for 10x-derived data.

Output Directories and Contents

Output Directory and Contents	Notes
`secondary_analyses`/ `alignment_htseq`/ `{.bam,.bai}` `_Chimeric.out.junction`	Biosample level output containing .bam alignment files.
`secondary_analyses`/ `secondary_metrics`/ `pipeline_metrics_summary.csv` `pipeline_metrics_summary_percents.csv`	The `pipeline_metrics_summary.csv` file contains the metrics found in the "QualiMap Stats" section of the MultiQC output. It includes alignment statistics such as the total number of aligned reads, exonic and intronic reads, alignments to genes, and 5'-3' bias. The `pipeline_metrics_summary_percents.csv` file contains the metrics found in the "QualiMap percent stats" section of the MultiQC output, such as the percentage of exonic and intronic reads and the percentage of total reads aligned.
`secondary_analyses`/ `quantification_htseq`/ `df_gene_counts_starhtseq.tsv` `df_mt_gene_counts_starhtseq.tsv` `matrix_gene_counts_starhtseq.txt` `df_gene_types_detected_summary_starhtseq.tsv`	This directory includes the output files for gene-level quantification from STAR and HTSeq. The main output files are `df_gene_counts_starhtseq.tsv`, which includes ENSEMBL gene IDs, gene symbols, and HTSeq counts for each detected gene, and `df_mt_gene_counts_starhtseq.tsv`, which includes mitochondrial gene counts (MT_counts), total detected gene counts (Total_counts), and the proportion of mitochondrial gene counts to the total number of gene counts (PropMT). The `matrix_gene_counts_starhtseq.txt` file contains the project-level count matrix with read counts for all genes across all samples. Importantly, these gene counts are used to generate plots in the MultiQC report on BaseJumper. The `df_gene_types_detected_summary_starhtseq.tsv` file details the number of various gene biotypes detected in each sample (e.g. protein coding genes, lncRNAs, pseudogenes, miRNAs, etc.).
`secondary_analyses`/ `quantification_salmon`/ `df_transcript_counts_salmon.tsv` `matrix_transcript_raw_salmon.tsv` `matrix_transcript_tpm_salmon.tsv` `matrix_transcript_length_tpm_salmon.tsv` `df_transcript_types_detected_summary_salmon.tsv` `df_gene_counts_salmon.tsv` `df_mt_gene_counts_salmon.tsv` `df_gene_types_detected_summary_salmon.tsv` `matrix_gene_counts_salmon.tsv` `matrix_gene_tpm_salmon.tsv` `matrix_gene_length_tpm_salmon.tsv`	This directory includes the output files for transcript-level quantification from Salmon. The main output file is `df_transcript_counts_salmon.tsv`, which includes ENSEMBL transcript IDs, transcript lengths, TPM (Transcripts per Million) values, and both scaled and unscaled transcript counts. The `matrix_` files contain matrices of transcript counts that are either unscaled (`matrix_transcript_raw_salmon.tsv`), scaled up to library size (`matrix_transcript_tpm_salmon.tsv`), or scaled first using the average transcript length and then library size (`matrix_transcript_length_tpm_salmon.tsv`). The `df_transcript_types_detected_summary_salmon.tsv` file details the number of various transcript biotypes detected in each sample (e.g. protein coding transcripts, lncRNAs, pseudogenes, miRNAs, etc.). This directory also includes the equivalent output files for Salmon-based gene counts, which are generated by collapsing the counts of all detected transcripts from the same gene.
`tertiary_analyses`/ `classification_cell_typing`/ `df_cell_typing_scores_singler_hpca_gtex_tcga.tsv` `df_cell_typing_summary_singler_hpca_gtex_tcga.tsv`	This directory includes the output files for the cell-typing analysis. The main output file is `df_cell_typing_summary_singler_hpca_gtex_tcga.tsv`, which includes metrics for each sample such as its assigned Cell Phase, Progenitor Type, Tissue Type, TCGA Tissue Type, and TCGA Tumor Type. The scores for each sample against each possible phase, progenitor, tissue, etc. are found in the `df_cell_typing_scores_singler_hpca_gtex_tcga.tsv` file.
`multiqc`/ `multiqc_report.html` `multiqc_version.yml` `multiqc_report_data`/ `multiqc_report_plots`/	This directory includes the output files containing metrics from the various tools used to create the MultiQC report. The main file is `multiqc_report.html`, which contains the overall QC metrics and certain plots such as the PCA and heatmap. This report also appears as the MultiQC tab in BaseJumper. The `multiqc_version.yml` file records the version of MultiQC that was run and can be opened in a plain-text editor such as Notepad. The subdirectory `multiqc_report_plots`/ contains PDF, PNG, and SVG versions of all QC plots featured in MultiQC, while the `multiqc_report_data`/ subdirectory contains the data tables used to generate those plots.
`execution_info`/ `input_dataset.csv` `execution_report.html` `execution_timeline.html` `eb_event.json` `execution_trace.txt` `input.csv` `output.json` `params_file.json` `recalculate_size_eb_event.json` `tool_versions.yml`	This directory includes execution information from the pipeline run.
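
As an illustration of how the PropMT value in `df_mt_gene_counts_starhtseq.tsv` relates to the other two columns, the sketch below recomputes it as MT_counts divided by Total_counts. The column names `MT_counts` and `Total_counts` come from the description above; the `SampleId` column name, the helper `prop_mt`, and the example rows are assumptions made for this example, so check the header of your own file before reusing it.

```python
import csv
import io
from typing import Dict, TextIO

# Hypothetical example rows; the real file may use a different sample column name.
example_tsv = (
    "SampleId\tMT_counts\tTotal_counts\n"
    "sample_A\t150\t3000\n"
    "sample_B\t40\t2000\n"
)


def prop_mt(handle: TextIO, sample_column: str = "SampleId") -> Dict[str, float]:
    """Return the mitochondrial proportion (MT_counts / Total_counts) per sample."""
    proportions: Dict[str, float] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        total = int(row["Total_counts"])
        mt = int(row["MT_counts"])
        # Guard against samples with no detected counts.
        proportions[row[sample_column]] = mt / total if total else float("nan")
    return proportions


result = prop_mt(io.StringIO(example_tsv))
for sample, value in result.items():
    print(f"{sample}\t{value:.3f}")
```

To run this on an exported file, pass an open file handle instead of the in-memory example, for instance `prop_mt(open(path, newline=""))`.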

Frequently Asked Questions

How do I download a table of gene or transcript counts?
Step 1: Export your project using BaseJumper. Not sure how? Visit our Data Export manual page: https://docs.basejumper.bioskryb.com/getting-started/data-export/data-export/.
Step 2: In the export_data/ folder, navigate to the directory secondary_analyses/ then quantification_htseq/ for STAR/HTSeq-based gene counts or quantification_salmon/ for Salmon-based transcript or gene counts.
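
For scripted access, the navigation in Step 2 can be expressed as a small path helper. This is a sketch that assumes the exported folder mirrors the output directories listed earlier on this page; the helper name `count_table_path` and the keys of `COUNT_TABLES` are hypothetical, while the directory and file names are taken from the output table.

```python
from pathlib import Path

# Maps an illustrative key to (subdirectory, file name) from the output table.
COUNT_TABLES = {
    "htseq_gene": ("quantification_htseq", "df_gene_counts_starhtseq.tsv"),
    "salmon_transcript": ("quantification_salmon", "df_transcript_counts_salmon.tsv"),
    "salmon_gene": ("quantification_salmon", "df_gene_counts_salmon.tsv"),
}


def count_table_path(export_root: str, kind: str) -> Path:
    """Build the expected path of a count table inside an exported project."""
    subdir, filename = COUNT_TABLES[kind]
    return Path(export_root) / "secondary_analyses" / subdir / filename


htseq_path = count_table_path("export_data", "htseq_gene")
salmon_path = count_table_path("export_data", "salmon_transcript")
print(htseq_path.as_posix())
print(salmon_path.as_posix())
```

The helper only builds the path; call `.exists()` on the result to confirm that the file is present in your export.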
Are the gene or transcript count tables normalized?
STAR/HTSeq-based gene counts are not normalized. Salmon-based transcript and gene counts are available as raw counts, as TPM-scaled counts, and as length-TPM-scaled counts (calculated using tximport).
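
To make the difference between raw counts and TPM concrete, the sketch below applies the textbook TPM formula: divide each count by the transcript length, then rescale so the values sum to one million. It is a simplified illustration, not the pipeline's code. Salmon derives TPM from effective transcript lengths, and the length-scaled variant produced by tximport involves a further scaling step that is not shown here. The function name `tpm` and the example numbers are assumptions made for this example.

```python
from typing import List, Sequence


def tpm(counts: Sequence[float], lengths: Sequence[float]) -> List[float]:
    """Convert raw counts to TPM using per-transcript lengths."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same length")
    # Reads per base: corrects for longer transcripts collecting more reads.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Rescale so that the values in one sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Hypothetical sample with three transcripts.
raw_counts = [100, 200, 300]
transcript_lengths = [1000, 2000, 1000]
tpm_values = tpm(raw_counts, transcript_lengths)
print([round(value) for value in tpm_values])
```

In this example the first two transcripts have different raw counts but the same reads-per-base rate, so they receive the same TPM value, which is the kind of length effect that unnormalized STAR/HTSeq counts do not account for.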