Skip to content

BJ-Somatic-Variantcalling

Background#

BJ-Somatic-Variantcalling is a scalable and reproducible bioinformatics pipeline for processing single-cell sequencing data from BioSkryb's Whole Genome Amplification chemistries. Pipeline directly works with single-cells without the need for bulk control samples. It takes raw sequencing data in the form of FASTQ (Illumina/Element) or CRAM (Ultima), performs alignment and removes duplicate reads (Illumina/Element). Variant calling is performed using Google DeepVariant with a custom model trained with BioSkryb's single-cell data. Then it is followed by a somatic-heuristic-filter step to make somatic calls.

Pipeline Overview#

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Alignment[Parabricks Align Reads <br/> <br/> Remove Duplicates]
    subgraph Map
        Alignment:::green
    end
    subgraph Variant[Variant Calling]
        Alignment --> Deepvariant[Deepvariant]:::yellow
    end
    subgraph Evaluate
        Alignment --> Metrics[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> WGS Metrics <br/><br/> HS Metrics - Exome mode ]:::pink
    end
    subgraph SHF[Somatic Heuristic Filter]
        Deepvariant --> shf[Somatic Heuristic Filter]:::pink

    end
    subgraph Report
        shf --> Mqc[MultiQC Report]:::orange
        Metrics --> Mqc

    end
    Mqc --> End((End)):::black
    Map:::panel
    Variant:::panel
    Evaluate:::panel
    SHF:::panel
    Report:::panel

Following are the steps and tools that pipeline uses to perform the analyses:

  • Map reads to reference genome and remove duplicate reads using PARABRICKS FQ2BAM

  • Perform variant calling with GOOGLE DEEPVARIANT

  • Make somatic calls using SOMATIC HEURISTIC FILTER. See below for details.

  • Evaluate metrics using PARABRICKS COLLECTMETRICS which includes Alignment, GC Bias, Insert Size, and Coverage metrics

  • Aggregate the metrics across biosamples and tools to create overall pipeline statistics summary using MULTIQC

Pipeline Parameters#

Parameter Name Options Description
Genome GRCh38 (default) Reference genome to use for alignment

Module Parameters#

Module Parameter Name Options Description
Subsampling (default) Enables downsampling of input reads to a specified read count using SEQTK.
Mutation Signature Profile (default) Performs mutational signature profiling.

Output files#

Output Directory/File
Notes
multiqc/ This section includes output files containing metrics from various tools to create a MultiQC report.

MultiQC Report Example

The all_metrics_mqc.txt contains metrics from the All Metrics section of the MultiQC report found in BaseJumper.

PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/
GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/
SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/
PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/
Biosample level output containing aligned reads and index file.

GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/
Biosample level output containing the variant calls in VCF format and index file.

SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/
Biosample level output containing the somatic filtered variant calls in VCF format and index file.
execution_info/ This section includes detail execution information regarding all the tasks in pipeline run.

Somatic Heuristic Filter#

The Somatic Heuristic Filter is designed to efficiently and accurately filter somatic variants in single-cell sequencing data. There is individual cell-level filters based on alignment score, proportion of clipped reads, and positional statistics. Then there is group-level filters that requires a minimum proportion of cells with sufficicent coverage and a minimum number of high-quality supporting reads. These filters are organized into two-step process:

Step 1: Binomial and Beta-binomial Filtering

  • Aggregates counts across all single cells.
  • Uses binomial and beta-binomial statistical tests (parallelized by chromosome).
  • Applies depth filters based on mean coverage.
  • Output: Table of variants passing depth and statistical filters.

Step 2: Extensive QC Filtering

  • Applies quality control to positions with variation in at least one cell.
  • Extracts detailed statistics from BAM pileups for each variant position, including:
    • Read support by strand and quality.
    • Median alignment score.
    • Proportion of clipped reads.
    • Positional statistics (standard deviation, median absolute deviation).
  • This step is computationally intensive due to the large number of positions.

Final Output

  • Variants passing all filters are used to construct matrices for phylogenetic inference.
  • A second round of Sequoia filtering is applied for final variant selection.

Key Parameters#

Step 1: Binomial/Beta-binomial Filtering#

Parameter Value Description
aggregated_min_mean_depth 10 Minimum average depth across all cells
aggregated_max_mean_depth 500 Maximum average depth across all cells
gender female Affects X/Y chromosome treatment in depth filters

Step 2: Concatenation and Filtering#

Parameter Value Description
first_pass_binomial_cutoff -10 Binomial log10(q-value) cutoff
first_pass_betabinomial_cutoff 0.4 Beta-binomial rho cutoff

Step 3: Pileup-based QC#

Parameter Value Description
cutoff_mq_hq 30 Minimum mapping quality for HQ read
cutoff_bq_hq 30 Minimum base quality for HQ read
cutoff_bps_start 15 Max bp start position to assess read start bias
num_lines_read_pileup 250000 Number of lines read per chunk for parallel processing

Extracted features include strand-specific support, alignment score, clipping, read start positions, and statistics like SD/MAD.


Step 4: Per-cell & Group-level Variant Filtering#

Per-cell Filtering Parameters

Parameter Value Description
cutoff_as 140 Min median alignment score of reads supporting variant
cutoff_prop_clipped_reads 0.5 Max proportion of clipped reads allowed
cutoff_prop_bp_under 0.9 Max proportion of reads starting too early in alignment
cutoff_sd_indiv 4 Min SD of variant position in strand-biased reads
cutoff_mad_indiv 0 Min MAD of variant position in strand-biased reads
cutoff_sd_both 2 Min SD for both-strand support
cutoff_mad_both 2 Min MAD for both-strand support
cutoff_sd_extreme 10 Extreme SD threshold for either strand
cutoff_mad_extreme 1 Extreme MAD threshold for either strand

Group-level Filtering Parameters

Parameter Value Description
cutoff_goodcov_depth 7 Minimum HQ depth per cell for variant position
cutoff_prop_cells_goodcov_group 0.7 Minimum proportion of cells with good coverage at position
cutoff_numreads_variant_manual 4 Minimum number of HQ reads supporting variant per position

Step 5: Phylogenetic Inference#

SEQUOIA - Reapplies filters and builds phylogenetic tree from Mat_NV and Mat_NR.

Parameter Value Description
second_pass_binomial_cutoff -10 Second-pass binomial log10(q-value) cutoff
second_pass_betabinomial_cutoff_rho_snp 0.4 Beta-binomial cutoff for SNPs
second_pass_betabinomial_cutoff_rho_indel 0.4 Beta-binomial cutoff for INDELs
aggregated_hq_min_mean_depth 10 HQ min mean depth
aggregated_hq_max_mean_depth 500 HQ max mean depth