BJ-Expression

Background

The BJ-Expression pipeline is a scalable and reproducible bioinformatics pipeline that processes RNA-seq data and produces transcript-level and gene-level quantification. It supports both single-end and paired-end data. The pipeline takes raw sequencing data as FASTQ files, down-samples them (randomly selecting a fixed, smaller number of reads from the full set of reads), and trims adapters. It then performs transcript-level quantification using the pseudo-alignment method Salmon, and gene-level quantification using STAR (Spliced Transcripts Alignment to a Reference) and HTSeq.

Pipeline Overview

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Trimming[Subsample Reads <br/> <br/> Trim Reads]:::blue
    subgraph Preprocess
        Trimming
    end
    subgraph Transcript-Quant
        direction TB
        Trimming --> Salmon:::green
    end
    subgraph Gene-Quant
        direction TB
        Trimming --> Alignment[STAR <br/> <br/> Samtools <br/> <br/> HTSeq]:::green
    end
    subgraph Evaluate
        Alignment --> M_Metrics[Qualimap <br/><br/> Custom Metrics <br/><br/> Coverage Metrics]:::pink
        Alignment --> M_CellTyping[Cell Typing]:::pink
        Alignment --> M_PCA[PCA]:::pink
    end
    subgraph Report
        Trimming --> Mqc[MultiQC Report]:::orange
        Alignment --> Mqc
        Salmon --> Mqc
        M_Metrics --> Mqc
        M_CellTyping --> Mqc
        M_PCA --> Mqc
    end
    Mqc --> End((End)):::black
    Preprocess:::panel
    Transcript-Quant:::panel
    Gene-Quant:::panel
    Evaluate:::panel
    Report:::panel

The pipeline uses the following steps and tools to perform the analyses:

  • Subsample the paired-end reads to 200,000 reads using SEQTK SAMPLE so that metrics can be compared across samples

Info

Why subsample? For a first-pass analysis, or for a comparison between sequencing runs, subsampling shortens computation time while still allowing sufficient QC analysis. However, to take full advantage of your sequencing data, disable subsampling for a thorough analysis.

  • Evaluate sequencing quality using FASTP and trim/clip reads
  • Perform transcript-level quantification using the pseudo-alignment method implemented in SALMON
  • Perform splice-aware alignment using STAR
  • Extract primary aligned reads from the STAR-based BAM using SAMTOOLS
  • Perform gene-level quantification from the STAR alignment using HTSEQ
  • Evaluate STAR alignment (BAM) quality control using QUALIMAP
  • Evaluate cell typing, custom metrics, and perform PCA using custom tools
  • Aggregate the metrics across biosamples and tools to create an overall pipeline statistics summary using MULTIQC
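The subsampling step in the list above can be illustrated with a short sketch. This is not the SEQTK SAMPLE implementation; it is a minimal reservoir-sampling example in Python that shows what "randomly selecting a fixed number of reads" means. The function names `read_fastq` and `subsample_reads` and the default seed are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# A FASTQ record is four lines: header, sequence, '+' separator, qualities.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group FASTQ lines into 4-line records (hypothetical helper)."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself four times yields consecutive
    # 4-line groups; an incomplete trailing record is dropped.
    yield from zip(stripped, stripped, stripped, stripped)


def subsample_reads(records: Iterable[FastqRecord], n: int,
                    seed: int = 11) -> List[FastqRecord]:
    """Keep a uniform random sample of at most n records (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Replace a stored record with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir
```

In this sketch the selection depends only on the record position and the seed, so calling `subsample_reads` with the same seed on both mate files of a paired-end sample keeps the same positions in each file, provided both files list the reads in the same order.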

Pipeline Parameters

| Parameter Name | Options | Description |
| --- | --- | --- |
| Instrument | NovaSeq, NextSeq, MiSeq, MiniSeq (default), ISeq, Other | Instrument used to perform sequencing |
| Genome | GRCh38 (default), GRCm39, GRCh37 | Reference genome to use for alignment |
| Read Length | 50, 75 (default), 100, 150 | Cycle used for sequencing |
| Adapter Sequence for first read | AAGCAGTGGTATCAACGCAGAGTACA (default) | Adapter sequence to be trimmed from first read |
| Adapter Sequence for second read | AAGCAGTGGTATCAACGCAGAGTACAT (default) | Adapter sequence to be trimmed from second read |

Module Parameters

Module Parameter Name Options Description
FastQ Reads Subsampling (default) Default is set to 100000 reads
10x (default) Support module for 10x-derived data.

Output Directories and Contents

Output Directory and Contents
Notes
secondary_analyses/
    alignment_htseq/
        *{.bam,.bai}
        *_Chimeric.out.junction
Biosample-level output containing .bam alignment files.
secondary_analyses/
    secondary_metrics/
        pipeline_metrics_summary.csv
        pipeline_metrics_summary_percents.csv
The pipeline_metrics_summary.csv file contains the metrics found in the "QualiMap Stats" section of the MultiQC output. It reports alignment statistics, including the total number of aligned reads, exonic and intronic reads, and alignments to genes, as well as 5'-3' bias.
The pipeline_metrics_summary_percents.csv file contains the metrics found in the "QualiMap percent stats" section of the MultiQC output, such as the percentage of exonic and intronic reads and the percentage of total reads aligned.
secondary_analyses/
    quantification_htseq/
        df_gene_counts_starhtseq.tsv
        df_mt_gene_counts_starhtseq.tsv
        matrix_gene_counts_starhtseq.txt
        df_gene_types_detected_summary_starhtseq.tsv
This directory contains the output files for gene-level quantification from STAR and HTSeq. The main output files are df_gene_counts_starhtseq.tsv, which includes ENSEMBL gene IDs, gene symbols, and HTSeq counts for each detected gene, and df_mt_gene_counts_starhtseq.tsv, which includes mitochondrial gene counts (MT_counts), total detected gene counts (Total_counts), and the proportion of mitochondrial gene counts to total gene counts (PropMT). The matrix_gene_counts_starhtseq.txt file is the project-level count matrix containing read counts for all genes across all samples. Importantly, these gene counts are used to generate plots in the MultiQC report on BaseJumper. The df_gene_types_detected_summary_starhtseq.tsv file details the number of genes of each biotype detected in each sample (e.g. protein-coding genes, lncRNAs, pseudogenes, miRNAs).
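As an illustration of how the MT_counts, Total_counts, and PropMT columns relate to one another, here is a minimal Python sketch. It assumes that mitochondrial genes are recognized by a gene-symbol prefix ("MT-" for human, "mt-" for mouse) and that Total_counts is the sum of read counts over all genes; the pipeline's own rules may differ, and the function name `mt_proportion` is hypothetical.

```python
from typing import Dict


def mt_proportion(gene_counts: Dict[str, int],
                  mt_prefix: str = "MT-") -> Dict[str, float]:
    """Summarize mitochondrial read counts for one sample.

    gene_counts maps gene symbol -> read count. Identifying mitochondrial
    genes by symbol prefix is an assumption made for this sketch.
    """
    total = sum(gene_counts.values())
    mt = sum(count for gene, count in gene_counts.items()
             if gene.upper().startswith(mt_prefix))
    # Guard against an empty sample to avoid dividing by zero.
    prop = mt / total if total else 0.0
    return {"MT_counts": mt, "Total_counts": total, "PropMT": prop}


summary = mt_proportion({"MT-CO1": 30, "MT-ND1": 20, "ACTB": 150})
print(summary)
```

A high PropMT value is commonly read as a sign of stressed or damaged cells, which is why the proportion is reported alongside the raw counts.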
secondary_analyses/
    quantification_salmon/
        df_transcript_counts_salmon.tsv
        matrix_transcript_raw_salmon.tsv
        matrix_transcript_tpm_salmon.tsv
        matrix_transcript_length_tpm_salmon.tsv
        df_transcript_types_detected_summary_salmon.tsv
        df_gene_counts_salmon.tsv
        df_mt_gene_counts_salmon.tsv
        df_gene_types_detected_summary_salmon.tsv
        matrix_gene_counts_salmon.tsv
        matrix_gene_tpm_salmon.tsv
        matrix_gene_length_tpm_salmon.tsv
This directory contains the output files for transcript-level quantification from Salmon. The main output file is df_transcript_counts_salmon.tsv, which includes ENSEMBL transcript IDs, transcript lengths, TPM (Transcripts Per Million) values, and both scaled and unscaled transcript counts. The matrix_ files contain matrices of transcript counts that are either unscaled (matrix_transcript_raw_salmon.tsv), scaled up to library size (matrix_transcript_tpm_salmon.tsv), or scaled first using the average transcript length and then library size (matrix_transcript_length_tpm_salmon.tsv). The df_transcript_types_detected_summary_salmon.tsv file details the number of transcripts of each biotype detected in each sample (e.g. protein-coding transcripts, lncRNAs, pseudogenes, miRNAs).

This directory also includes the corresponding output files for Salmon-based gene counts, which are generated by collapsing the counts of all detected transcripts from the same gene.
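To make the TPM values and the transcript-to-gene collapsing concrete, the following Python sketch computes TPM from counts and transcript lengths and then sums transcript values per gene. It uses the standard TPM definition and is not Salmon's or the pipeline's code; Salmon itself works with effective transcript lengths. The function names and the `tx2gene` mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Transcripts Per Million: length-normalize, then scale to sum to 1e6."""
    # Reads per base of transcript, so long transcripts are not favored.
    rates = {tx: counts[tx] / lengths[tx] for tx in counts}
    total = sum(rates.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


def collapse_to_genes(values: Dict[str, float],
                      tx2gene: Dict[str, str]) -> Dict[str, float]:
    """Sum per-transcript values over all transcripts of the same gene."""
    gene_values: Dict[str, float] = defaultdict(float)
    for tx, value in values.items():
        gene_values[tx2gene[tx]] += value
    return dict(gene_values)


# Toy example with made-up transcript and gene identifiers.
counts = {"tx1": 100, "tx2": 100, "tx3": 200}
lengths = {"tx1": 1000, "tx2": 2000, "tx3": 1000}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

transcript_tpm = tpm(counts, lengths)
gene_tpm = collapse_to_genes(transcript_tpm, tx2gene)
gene_counts = collapse_to_genes(counts, tx2gene)
```

In the toy example, tx1 and tx2 have the same raw count, but tx1 receives twice the TPM of tx2 because it is half as long.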
tertiary_analyses/
    classification_cell_typing/
        df_cell_typing_scores_singler_hpca_gtex_tcga.tsv
        df_cell_typing_summary_singler_hpca_gtex_tcga.tsv
This directory contains the output files for the cell-typing analysis. The main output file is df_cell_typing_summary_singler_hpca_gtex_tcga.tsv, which lists each sample's assigned Cell Phase, Progenitor Type, Tissue Type, TCGA Tissue Type, and TCGA Tumor Type. The scores for each sample against each possible phase, progenitor, tissue, etc. are found in the df_cell_typing_scores_singler_hpca_gtex_tcga.tsv file.
multiqc/
    multiqc_report.html
    multiqc_version.yml
    multiqc_report_data/
    multiqc_report_plots/
This directory contains the MultiQC report and the metrics from the various tools that it aggregates. The main file is multiqc_report.html, which contains the overall QC metrics and plots such as the PCA and heatmap. This report also appears as the MultiQC tab in BaseJumper. The multiqc_version.yml file records the version of MultiQC that was run and can be opened in any text editor, such as Notepad. The multiqc_report_plots/ subdirectory contains PDF, PNG, and SVG versions of all QC plots featured in MultiQC, while the multiqc_report_data/ subdirectory contains the data tables used to generate those plots.
execution_info/
    input_dataset.csv
    execution_report.html
    execution_timeline.html
    eb_event.json
    execution_trace.txt
    input.csv
    output.json
    params_file.json
    recalculate_size_eb_event.json
    tool_versions.yml
This directory includes execution information from the pipeline run.


Frequently Asked Questions

  • How do I download a table of gene or transcript counts?
      Step 1: Export your project using BaseJumper. Not sure how? Visit our Data Export manual page: https://docs.basejumper.bioskryb.com/getting-started/data-export/data-export/.
      Step 2: In the export_data/ folder, navigate to secondary_analyses/ and then to quantification_htseq/ for STAR/HTSeq-based gene counts, or to quantification_salmon/ for Salmon-based transcript or gene counts.
  • Are the gene or transcript count tables normalized?
      STAR/HTSeq-based gene counts are not normalized. Salmon-based transcript and gene counts are available as raw counts, as TPM-scaled counts, and as length-TPM-scaled counts (calculated using tximport).
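Because the STAR/HTSeq gene counts are raw, you may want to apply a simple library-size normalization before comparing samples. The Python sketch below applies counts per million (CPM), which is one common choice and not something the pipeline does for you. The function name and the nested-dictionary layout (sample, then gene, then count) are hypothetical and do not reflect the file format of the exported tables.

```python
from typing import Dict


def counts_per_million(
    matrix: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Scale each sample's raw gene counts so that they sum to one million.

    matrix[sample][gene] holds the raw read count (assumed layout).
    """
    normalized: Dict[str, Dict[str, float]] = {}
    for sample, gene_counts in matrix.items():
        library_size = sum(gene_counts.values())
        normalized[sample] = {
            # A sample with no reads is reported as all zeros.
            gene: (count / library_size * 1e6 if library_size else 0.0)
            for gene, count in gene_counts.items()
        }
    return normalized


raw = {
    "sample1": {"geneA": 1, "geneB": 3},
    "sample2": {"geneA": 50, "geneB": 50},
}
cpm = counts_per_million(raw)
```

CPM corrects only for sequencing depth, not for gene length, so it is suited to comparing the same gene across samples rather than different genes within a sample.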