BJ-Expression
Background
The BJ-Expression pipeline is a scalable and reproducible bioinformatics pipeline that processes RNA-seq data and produces transcript-level and gene-level quantification. It supports both single-end and paired-end data. Starting from raw sequencing data in FASTQ format, the pipeline down-samples the reads (randomly selecting a fixed, smaller number of reads from the full set) and trims adapters. It then performs transcript-level quantification using the pseudo-alignment method implemented in Salmon, and gene-level quantification using STAR (Spliced Transcripts Alignment to a Reference) and HTSeq.
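To make the down-sampling idea concrete, here is a minimal Python sketch of single-pass reservoir sampling over FASTQ records. It only illustrates the concept of picking a fixed number of reads uniformly at random; it is not the pipeline's implementation, and the function names `read_fastq` and `subsample` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, '+' line, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines (assumes well-formed input)."""
    it = iter(lines)
    for header in it:
        seq, plus, qual = next(it), next(it), next(it)
        yield (header.rstrip("\n"), seq.rstrip("\n"), plus.rstrip("\n"), qual.rstrip("\n"))


def subsample(records: Iterable[FastqRecord], n: int, seed: int = 100) -> List[FastqRecord]:
    """Reservoir sampling: keep n records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)       # fill the reservoir with the first n records
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < n:
                reservoir[j] = rec      # replace a kept record with probability n / (i + 1)
    return reservoir


# Toy input: ten identical-length reads, named read0 .. read9.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
picked = subsample(read_fastq(fastq_text.splitlines()), n=3)
print([rec[0] for rec in picked])
```

Because the random draws in this sketch depend only on the record index and the seed, running it with the same seed on both mate files of a paired-end sample selects the same record positions in each file, which keeps mates together.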
Pipeline Overview
flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
Start((Start)):::black --fastq--> Trimming[Subsample Reads <br/> <br/> Trim Reads]:::blue
subgraph Preprocess
Trimming
end
subgraph Transcript-Quant
direction TB
Trimming --> Salmon:::green
end
subgraph Gene-Quant
direction TB
Trimming --> Alignment[STAR <br/> <br/> Samtools <br/> <br/> HTSeq]:::green
end
subgraph Evaluate
Alignment --> M_Metrics[Qualimap <br/><br/> Custom Metrics <br/><br/> Coverage Metrics]:::pink
Alignment --> M_CellTyping[Cell Typing]:::pink
Alignment --> M_PCA[PCA]:::pink
end
subgraph Report
Trimming --> Mqc[MultiQC Report]:::orange
Alignment --> Mqc
Salmon --> Mqc
M_Metrics --> Mqc
M_CellTyping --> Mqc
M_PCA --> Mqc
end
Mqc --> End((End)):::black
Preprocess:::panel
Transcript-Quant:::panel
Gene-Quant:::panel
Evaluate:::panel
Report:::panel
The pipeline uses the following steps and tools to perform the analyses:
- Subsample the paired-end reads to 200,000 reads using SEQTK SAMPLE to compare metrics across samples
Info
Why subsample? For a first-pass analysis, or for a comparison between sequencing runs, subsampling shortens computation time while still providing enough data for QC analysis. To take full advantage of your sequencing data, however, disable subsampling for a thorough analysis.
- Evaluate sequencing quality using FASTP and trim/clip reads
- Perform transcript-level quantification using the pseudo-alignment method implemented in SALMON
- Perform splice-aware alignment using STAR
- Extract primary aligned reads from the STAR-based BAM using SAMTOOLS
- Perform gene-level quantification from the STAR alignment using HTSEQ
- Evaluate STAR alignment (BAM) quality control using QUALIMAP
- Perform cell typing, compute custom metrics, and run PCA using custom tools
- Aggregate the metrics across biosamples and tools to create an overall pipeline statistics summary using MULTIQC
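The core steps above can be sketched as a sequence of command lines. The Python snippet below only builds and prints the commands; it runs nothing. The tool options shown are standard options of each tool, but which options the pipeline actually passes is not documented here, so every value (file names, index paths, the seed, the unstranded `-s no` setting, the `0x900` flag filter used to drop secondary and supplementary alignments) is an assumption for illustration. The helper name `build_steps` is hypothetical.

```python
import shlex
from typing import Dict, List, Optional

Step = Dict[str, object]


def build_steps(sample: str, r1: str, r2: str, n_reads: int,
                adapter_r1: str, adapter_r2: str,
                salmon_index: str, star_index: str, gtf: str,
                seed: int = 100) -> List[Step]:
    """Return one command line per step for a paired-end sample, without running anything."""
    sub1, sub2 = f"{sample}.sub_R1.fastq", f"{sample}.sub_R2.fastq"
    trim1, trim2 = f"{sample}.trim_R1.fastq", f"{sample}.trim_R2.fastq"
    star_prefix = f"{sample}."
    # STAR names its coordinate-sorted BAM after the output prefix.
    star_bam = f"{star_prefix}Aligned.sortedByCoord.out.bam"
    primary_bam = f"{sample}.primary.bam"

    def step(tool: str, argv: List[str], stdout: Optional[str] = None) -> Step:
        return {"tool": tool, "argv": argv, "stdout": stdout}

    return [
        # Same seed for both mates so the same read pairs are kept.
        step("seqtk", ["seqtk", "sample", f"-s{seed}", r1, str(n_reads)], stdout=sub1),
        step("seqtk", ["seqtk", "sample", f"-s{seed}", r2, str(n_reads)], stdout=sub2),
        step("fastp", ["fastp", "-i", sub1, "-I", sub2, "-o", trim1, "-O", trim2,
                       "--adapter_sequence", adapter_r1,
                       "--adapter_sequence_r2", adapter_r2]),
        step("salmon", ["salmon", "quant", "-i", salmon_index, "-l", "A",
                        "-1", trim1, "-2", trim2, "-o", f"{sample}.salmon"]),
        step("STAR", ["STAR", "--genomeDir", star_index,
                      "--readFilesIn", trim1, trim2,
                      "--outSAMtype", "BAM", "SortedByCoordinate",
                      "--outFileNamePrefix", star_prefix]),
        # -F 0x900 excludes secondary (0x100) and supplementary (0x800) alignments.
        step("samtools", ["samtools", "view", "-b", "-F", "0x900",
                          "-o", primary_bam, star_bam]),
        step("htseq-count", ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no",
                             primary_bam, gtf], stdout=f"{sample}.htseq_counts.txt"),
    ]


steps = build_steps("sampleA", "sampleA_R1.fastq", "sampleA_R2.fastq", 200_000,
                    "AAGCAGTGGTATCAACGCAGAGTACA", "AAGCAGTGGTATCAACGCAGAGTACAT",
                    "salmon_index", "star_index", "genes.gtf")
for s in steps:
    line = shlex.join(s["argv"])
    if s["stdout"]:
        line += f" > {s['stdout']}"
    print(line)
```

Keeping the commands as data rather than executing them makes the order of the steps and the hand-off of files between tools easy to inspect. The QC, cell typing, PCA, and MultiQC steps are omitted from this sketch.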
Pipeline Parameters
Parameter Name | Options | Description |
---|---|---|
Instrument | NovaSeq, NextSeq, MiSeq, MiniSeq (default), ISeq, Other | Instrument used to perform sequencing |
Genome | GRCh38 (default), GRCm39, GRCh37 | Reference genome to use for alignment |
Read Length | 50, 75 (default), 100, 150 | Read length (number of sequencing cycles) |
Adapter Sequence for first read | AAGCAGTGGTATCAACGCAGAGTACA (default) | Adapter sequence to be trimmed from the first read |
Adapter Sequence for second read | AAGCAGTGGTATCAACGCAGAGTACAT (default) | Adapter sequence to be trimmed from the second read |
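As an illustration of how these options constrain a run, the following Python sketch merges user-supplied values with the defaults listed in the table and rejects anything outside the allowed options. The key names (`instrument`, `genome`, `read_length`, `adapter_r1`, `adapter_r2`) and the function `resolve_params` are hypothetical; BaseJumper's actual parameter names and validation are not described here.

```python
from typing import Any, Dict, List

# Allowed options and defaults as read from the parameter table (key names are assumptions).
ALLOWED: Dict[str, List[Any]] = {
    "instrument": ["NovaSeq", "NextSeq", "MiSeq", "MiniSeq", "ISeq", "Other"],
    "genome": ["GRCh38", "GRCm39", "GRCh37"],
    "read_length": [50, 75, 100, 150],
}
DEFAULTS: Dict[str, Any] = {
    "instrument": "MiniSeq",
    "genome": "GRCh38",
    "read_length": 75,
    "adapter_r1": "AAGCAGTGGTATCAACGCAGAGTACA",
    "adapter_r2": "AAGCAGTGGTATCAACGCAGAGTACAT",
}


def resolve_params(user: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults, then check each value against the allowed options."""
    unknown = set(user) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **user}
    for name, options in ALLOWED.items():
        if params[name] not in options:
            raise ValueError(f"{name}={params[name]!r} is not one of {options}")
    for name in ("adapter_r1", "adapter_r2"):
        # Adapters must be non-empty nucleotide strings.
        if not params[name] or set(params[name]) - set("ACGTN"):
            raise ValueError(f"{name} must be a non-empty string of A/C/G/T/N")
    return params


params = resolve_params({"genome": "GRCm39", "read_length": 150})
print(params)
```

For example, passing `{"genome": "hg19"}` to this sketch raises a `ValueError`, because `hg19` is not among the listed genome options.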
Module Parameters
Module | Parameter Name | Options | Description |
---|---|---|---|
FastQ Reads Subsampling | (default) | Default is set to 100000 reads | |
10x | (default) | Support module for 10x-derived data. |
Output Directories and Contents
Output Directory and Contents | Notes |
---|---|
secondary_analyses / alignment_htseq / *{.bam,.bai}, *_Chimeric.out.junction | Biosample-level output containing .bam alignment files. |
secondary_analyses / secondary_metrics / pipeline_metrics_summary.csv, pipeline_metrics_summary_percents.csv | The pipeline_metrics_summary.csv file contains the metrics found in the "QualiMap Stats" section of the MultiQC output: alignment statistics such as the number of total aligned reads, exonic and intronic reads, and alignments to genes, as well as 5'-3' bias. The pipeline_metrics_summary_percents.csv file contains the metrics found in the "QualiMap percent stats" section of the MultiQC output, such as the percentage of exonic and intronic reads and the percentage of total reads aligned. |
secondary_analyses / quantification_htseq / df_gene_counts_starhtseq.tsv, df_mt_gene_counts_starhtseq.tsv, matrix_gene_counts_starhtseq.txt, df_gene_types_detected_summary_starhtseq.tsv | This directory includes the output files for gene-level quantification from STAR and HTSeq. The main output files are df_gene_counts_starhtseq.tsv, which includes ENSEMBL gene IDs, gene symbols, and HTSeq counts for each detected gene, and df_mt_gene_counts_starhtseq.tsv, which includes mitochondrial gene counts (MT_counts), total detected gene counts (Total_counts), and the proportion of mitochondrial gene counts to total gene counts (PropMT). The matrix_gene_counts_starhtseq.txt file contains the project-level count matrix with read counts for all genes across all samples. These gene counts are used to generate the plots in the MultiQC report on BaseJumper. The df_gene_types_detected_summary_starhtseq.tsv file details the number of genes of each biotype detected in each sample (e.g. protein-coding genes, lncRNAs, pseudogenes, miRNAs). |
secondary_analyses / quantification_salmon / df_transcript_counts_salmon.tsv, matrix_transcript_raw_salmon.tsv, matrix_transcript_tpm_salmon.tsv, matrix_transcript_length_tpm_salmon.tsv, df_transcript_types_detected_summary_salmon.tsv, df_gene_counts_salmon.tsv, df_mt_gene_counts_salmon.tsv, df_gene_types_detected_summary_salmon.tsv, matrix_gene_counts_salmon.tsv, matrix_gene_tpm_salmon.tsv, matrix_gene_length_tpm_salmon.tsv | This directory includes the output files for transcript-level quantification from Salmon. The main output file is df_transcript_counts_salmon.tsv, which includes ENSEMBL transcript IDs, transcript lengths, TPM (Transcripts per Million) values, and both scaled and unscaled transcript counts. The matrix_ files contain matrices of transcript counts that are either unscaled (matrix_transcript_raw_salmon.tsv), scaled up to library size (matrix_transcript_tpm_salmon.tsv), or scaled first using the average transcript length and then library size (matrix_transcript_length_tpm_salmon.tsv). The df_transcript_types_detected_summary_salmon.tsv file details the number of transcripts of each biotype detected in each sample (e.g. protein-coding transcripts, lncRNAs, pseudogenes, miRNAs). This directory also includes the equivalent output files for Salmon-based gene counts, generated by collapsing the counts of all detected transcripts from the same gene. |
tertiary_analyses / classification_cell_typing / df_cell_typing_scores_singler_hpca_gtex_tcga.tsv, df_cell_typing_summary_singler_hpca_gtex_tcga.tsv | This directory includes the output files for the cell-typing analysis. The main output file is df_cell_typing_summary_singler_hpca_gtex_tcga.tsv, which includes metrics for each sample such as the sample's assigned Cell Phase, Progenitor Type, Tissue Type, TCGA Tissue Type, and TCGA Tumor Type. The scores for each sample against each possible phase, progenitor, tissue, etc. are found in the df_cell_typing_scores_singler_hpca_gtex_tcga.tsv file. |
multiqc / multiqc_report.html, multiqc_version.yml, multiqc_report_data /, multiqc_report_plots / | This directory includes output files containing the metrics from the various tools used to create the MultiQC report. The main file is multiqc_report.html, which contains the overall QC metrics and certain plots such as the PCA and heatmap. It also appears as the MultiQC tab in BaseJumper. The multiqc_version.yml file is a configuration file containing the version of MultiQC that was run; it can be viewed in a text editor such as Notepad. The subdirectory multiqc_report_plots / contains PDF, PNG, and SVG versions of all QC plots featured in MultiQC, while the multiqc_report_data / subdirectory contains the data tables used to generate those plots. |
execution_info / input_dataset.csv, execution_report.html, execution_timeline.html, eb_event.json, execution_trace.txt, input.csv, output.json, params_file.json, recalculate_size_eb_event.json, tool_versions.yml | This directory includes execution information from the pipeline run. |
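The three columns described for df_mt_gene_counts_starhtseq.tsv (MT_counts, Total_counts, PropMT) are related by a simple ratio. The Python sketch below shows that relationship on made-up counts. It assumes that Total_counts is the sum of counts over all genes and that mitochondrial genes can be recognized by an `MT-` gene-symbol prefix (the human convention); how the pipeline actually identifies mitochondrial genes is not stated here, and the function name `mt_summary` is hypothetical.

```python
from typing import Dict


def mt_summary(gene_counts: Dict[str, int], mt_prefix: str = "MT-") -> Dict[str, float]:
    """Summarize mitochondrial read counts for one sample.

    gene_counts maps gene symbol -> read count. Genes whose symbol starts
    with mt_prefix are treated as mitochondrial (an assumption of this sketch).
    """
    total = sum(gene_counts.values())
    mt = sum(count for gene, count in gene_counts.items() if gene.startswith(mt_prefix))
    return {
        "MT_counts": mt,
        "Total_counts": total,
        "PropMT": mt / total if total else 0.0,  # avoid dividing by zero for empty samples
    }


# Made-up counts for illustration only.
example = {"MT-CO1": 30, "MT-ND1": 20, "ACTB": 100, "GAPDH": 50}
summary = mt_summary(example)
print(summary)
```

A high PropMT value is commonly read as a sign of stressed or damaged cells, which is why this ratio is often inspected alongside the alignment metrics.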
Frequently Asked Questions
- How do I download a table of gene or transcript counts?
  Step 1: Export your project using BaseJumper. Not sure how? Visit our Data Export manual page: https://docs.basejumper.bioskryb.com/getting-started/data-export/data-export/.
  Step 2: In the export_data / folder, navigate to the directory secondary_analyses / then quantification_htseq / for STAR/HTSeq-based gene counts or quantification_salmon / for Salmon-based transcript or gene counts.
- Are the gene or transcript count tables normalized?
  STAR/HTSeq-based gene counts are not normalized. Salmon-based transcript and gene counts are available as raw counts, TPM, and length-scaled TPM counts (calculated using tximport).
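The two answers above can be sketched in Python: locating a count table inside an exported project, and what TPM scaling does to raw counts. The directory and file names come from the output table in this document; the lookup keys, the function names `table_path` and `tpm`, and the toy numbers are hypothetical. The TPM function uses the textbook definition with plain transcript lengths; Salmon itself works with effective lengths, so this is not how the pipeline's TPM tables are produced.

```python
from pathlib import PurePosixPath
from typing import Dict

# Lookup keys are hypothetical; directory and file names are those listed in this document.
TABLES = {
    "htseq_gene": ("quantification_htseq", "df_gene_counts_starhtseq.tsv"),
    "salmon_transcript": ("quantification_salmon", "df_transcript_counts_salmon.tsv"),
    "salmon_gene": ("quantification_salmon", "df_gene_counts_salmon.tsv"),
}


def table_path(export_root: str, kind: str) -> PurePosixPath:
    """Build the path to a count table inside an exported project folder."""
    subdir, filename = TABLES[kind]
    return PurePosixPath(export_root) / "secondary_analyses" / subdir / filename


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Textbook TPM: length-normalize each count, then rescale so values sum to one million."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    return {name: (rate / total * 1e6 if total else 0.0) for name, rate in rates.items()}


print(table_path("export_data", "htseq_gene"))

# Toy example: the longer transcript has more reads but the same per-base rate.
values = tpm({"tx_short": 100, "tx_long": 300}, {"tx_short": 1000, "tx_long": 3000})
print(values)
```

In the toy example both transcripts end up with the same TPM even though their raw counts differ by a factor of three, which is the reason length-aware scaling matters when comparing transcripts within a sample.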