BJ-Somatic-Variantcalling

Background#

BJ-Somatic-Variantcalling is a scalable and reproducible bioinformatics pipeline for processing single-cell sequencing data from BioSkryb's Whole Genome Amplification chemistries. Pipeline directly works with single-cells without the need for bulk control samples. It takes raw sequencing data in the form of FASTQ (Illumina/Element) or CRAM (Ultima), performs alignment and removes duplicate reads (Illumina/Element). Variant calling is performed using Google DeepVariant with a custom model trained with BioSkryb's single-cell data. Then it is followed by a somatic-heuristic-filter step to make somatic calls.

Pipeline Overview#

flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Alignment[Parabricks Align Reads <br/> <br/> Remove Duplicates]
    subgraph Map
        Alignment:::green
    end
    subgraph Variant[Variant Calling]
        Alignment --> Deepvariant[Deepvariant]:::yellow
    end
    subgraph Evaluate
        Alignment --> Metrics[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> WGS Metrics <br/><br/> HS Metrics - Exome mode ]:::pink
    end
    subgraph SHF[Somatic Heuristic Filter]
        Deepvariant --> shf[Somatic Heuristic Filter]:::pink

    end
    subgraph Report
        shf --> Mqc[MultiQC Report]:::orange
        Metrics --> Mqc

    end
    Mqc --> End((End)):::black
    Map:::panel
    Variant:::panel
    Evaluate:::panel
    SHF:::panel
    Report:::panel

Following are the steps and tools that pipeline uses to perform the analyses:

Map reads to reference genome and remove duplicate reads using PARABRICKS FQ2BAM
Perform variant calling with GOOGLE DEEPVARIANT
Make somatic calls using SOMATIC HEURISTIC FILTER. See below for details.
Evaluate metrics using PARABRICKS COLLECTMETRICS which includes Alignment, GC Bias, Insert Size, and Coverage metrics
Aggregate the metrics across biosamples and tools to create overall pipeline statistics summary using MULTIQC

Pipeline Parameters#

Parameter Name	Options	Description
Genome	`GRCh38` _^(default)	Reference genome to use for alignment

Module Parameters#

Module	Parameter Name	Options	Description
Subsampling		_^(default)	Enables downsampling of input reads to a specified read count using SEQTK.
Mutation Signature Profile		_^(default)	Performs mutational signature profiling.

Output files#

Output Directory/File	Notes
`multiqc`/	This section includes output files containing metrics from various tools to create a MultiQC report. MultiQC Report Example The `all_metrics_mqc.txt` contains metrics from the `All Metrics` section of the MultiQC report found in BaseJumper.
`PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/` `GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/` `SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/`	`PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/` Biosample level output containing aligned reads and index file. `GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/` Biosample level output containing the variant calls in VCF format and index file. `SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/` Biosample level output containing the somatic filtered variant calls in VCF format and index file.
`execution_info`/	This section includes detail execution information regarding all the tasks in pipeline run.

Somatic Heuristic Filter#

The Somatic Heuristic Filter is designed to efficiently and accurately filter somatic variants in single-cell sequencing data. There is individual cell-level filters based on alignment score, proportion of clipped reads, and positional statistics. Then there is group-level filters that requires a minimum proportion of cells with sufficicent coverage and a minimum number of high-quality supporting reads. These filters are organized into two-step process:

Step 1: Binomial and Beta-binomial Filtering

Aggregates counts across all single cells.
Uses binomial and beta-binomial statistical tests (parallelized by chromosome).
Applies depth filters based on mean coverage.
Output: Table of variants passing depth and statistical filters.

Step 2: Extensive QC Filtering

Applies quality control to positions with variation in at least one cell.
Extracts detailed statistics from BAM pileups for each variant position, including:
- Read support by strand and quality.
- Median alignment score.
- Proportion of clipped reads.
- Positional statistics (standard deviation, median absolute deviation).
This step is computationally intensive due to the large number of positions.

Final Output

Variants passing all filters are used to construct matrices for phylogenetic inference.
A second round of Sequoia filtering is applied for final variant selection.

Key Parameters#

Step 1: Binomial/Beta-binomial Filtering#

Parameter	Value	Description
`aggregated_min_mean_depth`	10	Minimum average depth across all cells
`aggregated_max_mean_depth`	500	Maximum average depth across all cells
`gender`	female	Affects X/Y chromosome treatment in depth filters

Step 2: Concatenation and Filtering#

Parameter	Value	Description
`first_pass_binomial_cutoff`	-10	Binomial log10(q-value) cutoff
`first_pass_betabinomial_cutoff`	0.4	Beta-binomial rho cutoff

Step 3: Pileup-based QC#

Parameter	Value	Description
`cutoff_mq_hq`	30	Minimum mapping quality for HQ read
`cutoff_bq_hq`	30	Minimum base quality for HQ read
`cutoff_bps_start`	15	Max bp start position to assess read start bias
`num_lines_read_pileup`	250000	Number of lines read per chunk for parallel processing

Extracted features include strand-specific support, alignment score, clipping, read start positions, and statistics like SD/MAD.

Step 4: Per-cell & Group-level Variant Filtering#

Per-cell Filtering Parameters

Parameter	Value	Description
`cutoff_as`	140	Min median alignment score of reads supporting variant
`cutoff_prop_clipped_reads`	0.5	Max proportion of clipped reads allowed
`cutoff_prop_bp_under`	0.9	Max proportion of reads starting too early in alignment
`cutoff_sd_indiv`	4	Min SD of variant position in strand-biased reads
`cutoff_mad_indiv`	0	Min MAD of variant position in strand-biased reads
`cutoff_sd_both`	2	Min SD for both-strand support
`cutoff_mad_both`	2	Min MAD for both-strand support
`cutoff_sd_extreme`	10	Extreme SD threshold for either strand
`cutoff_mad_extreme`	1	Extreme MAD threshold for either strand

Group-level Filtering Parameters

Parameter	Value	Description
`cutoff_goodcov_depth`	7	Minimum HQ depth per cell for variant position
`cutoff_prop_cells_goodcov_group`	0.7	Minimum proportion of cells with good coverage at position
`cutoff_numreads_variant_manual`	4	Minimum number of HQ reads supporting variant per position

Step 5: Phylogenetic Inference#

SEQUOIA - Reapplies filters and builds phylogenetic tree from Mat_NV and Mat_NR.

Parameter	Value	Description
`second_pass_binomial_cutoff`	-10	Second-pass binomial log10(q-value) cutoff
`second_pass_betabinomial_cutoff_rho_snp`	0.4	Beta-binomial cutoff for SNPs
`second_pass_betabinomial_cutoff_rho_indel`	0.4	Beta-binomial cutoff for INDELs
`aggregated_hq_min_mean_depth`	10	HQ min mean depth
`aggregated_hq_max_mean_depth`	500	HQ max mean depth