BJ-Somatic-Variantcalling
Background#
BJ-Somatic-Variantcalling is a scalable and reproducible bioinformatics pipeline for processing single-cell sequencing data from BioSkryb's Whole Genome Amplification chemistries. Pipeline directly works with single-cells without the need for bulk control samples. It takes raw sequencing data in the form of FASTQ (Illumina/Element) or CRAM (Ultima), performs alignment and removes duplicate reads (Illumina/Element). Variant calling is performed using Google DeepVariant with a custom model trained with BioSkryb's single-cell data. Then it is followed by a somatic-heuristic-filter step to make somatic calls.
Pipeline Overview#
flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
    Start((Start)):::black --fastq--> Alignment[Parabricks Align Reads <br/> <br/> Remove Duplicates]
    subgraph Map
        Alignment:::green
    end
    subgraph Variant[Variant Calling]
        Alignment --> Deepvariant[Deepvariant]:::yellow
    end
    subgraph Evaluate
        Alignment --> Metrics[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> WGS Metrics <br/><br/> HS Metrics - Exome mode ]:::pink
    end
    subgraph SHF[Somatic Heuristic Filter]
        Deepvariant --> shf[Somatic Heuristic Filter]:::pink
    end
    subgraph Report
        shf --> Mqc[MultiQC Report]:::orange
        Metrics --> Mqc
    end
    Mqc --> End((End)):::black
    Map:::panel
    Variant:::panel
    Evaluate:::panel
    SHF:::panel
    Report:::panel
Following are the steps and tools that pipeline uses to perform the analyses:
- 
Map reads to reference genome and remove duplicate reads using
PARABRICKS FQ2BAM - 
Perform variant calling with
GOOGLE DEEPVARIANT - 
Make somatic calls using
SOMATIC HEURISTIC FILTER. See below for details. - 
Evaluate metrics using
PARABRICKS COLLECTMETRICSwhich includes Alignment, GC Bias, Insert Size, and Coverage metrics - 
Aggregate the metrics across biosamples and tools to create overall pipeline statistics summary using
MULTIQC 
Pipeline Parameters#
| Parameter Name | Options | Description | 
|---|---|---|
| Genome | GRCh38 (default) | 
Reference genome to use for alignment | 
Module Parameters#
| Module | Parameter Name | Options | Description | 
|---|---|---|---|
| Subsampling | (default) | Enables downsampling of input reads to a specified read count using SEQTK. | |
| Mutation Signature Profile | (default) | Performs mutational signature profiling. | 
Output files#
Output Directory/File  | 
Notes | 
|---|---|
multiqc/ | 
This section includes output files containing metrics from various tools to create a MultiQC report.  MultiQC Report Example The all_metrics_mqc.txt contains metrics from the All Metrics section of the MultiQC report found in BaseJumper.  | 
PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/ | 
PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/Biosample level output containing aligned reads and index file. GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/Biosample level output containing the variant calls in VCF format and index file. SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/Biosample level output containing the somatic filtered variant calls in VCF format and index file.  | 
execution_info/ | 
This section includes detail execution information regarding all the tasks in pipeline run. | 
Somatic Heuristic Filter#
The Somatic Heuristic Filter is designed to efficiently and accurately filter somatic variants in single-cell sequencing data. There is individual cell-level filters based on alignment score, proportion of clipped reads, and positional statistics. Then there is group-level filters that requires a minimum proportion of cells with sufficicent coverage and a minimum number of high-quality supporting reads. These filters are organized into two-step process:
Step 1: Binomial and Beta-binomial Filtering
- Aggregates counts across all single cells.
 - Uses binomial and beta-binomial statistical tests (parallelized by chromosome).
 - Applies depth filters based on mean coverage.
 - Output: Table of variants passing depth and statistical filters.
 
Step 2: Extensive QC Filtering
- Applies quality control to positions with variation in at least one cell.
 - Extracts detailed statistics from BAM pileups for each variant position, including:
- Read support by strand and quality.
 - Median alignment score.
 - Proportion of clipped reads.
 - Positional statistics (standard deviation, median absolute deviation).
 
 - This step is computationally intensive due to the large number of positions.
 
Final Output
- Variants passing all filters are used to construct matrices for phylogenetic inference.
 - A second round of Sequoia filtering is applied for final variant selection.
 
Key Parameters#
Step 1: Binomial/Beta-binomial Filtering#
| Parameter | Value | Description | 
|---|---|---|
aggregated_min_mean_depth | 
10 | Minimum average depth across all cells | 
aggregated_max_mean_depth | 
500 | Maximum average depth across all cells | 
gender | 
female | Affects X/Y chromosome treatment in depth filters | 
Step 2: Concatenation and Filtering#
| Parameter | Value | Description | 
|---|---|---|
first_pass_binomial_cutoff | 
-10 | Binomial log10(q-value) cutoff | 
first_pass_betabinomial_cutoff | 
0.4 | Beta-binomial rho cutoff | 
Step 3: Pileup-based QC#
| Parameter | Value | Description | 
|---|---|---|
cutoff_mq_hq | 
30 | Minimum mapping quality for HQ read | 
cutoff_bq_hq | 
30 | Minimum base quality for HQ read | 
cutoff_bps_start | 
15 | Max bp start position to assess read start bias | 
num_lines_read_pileup | 
250000 | Number of lines read per chunk for parallel processing | 
Extracted features include strand-specific support, alignment score, clipping, read start positions, and statistics like SD/MAD.
Step 4: Per-cell & Group-level Variant Filtering#
Per-cell Filtering Parameters
| Parameter | Value | Description | 
|---|---|---|
cutoff_as | 
140 | Min median alignment score of reads supporting variant | 
cutoff_prop_clipped_reads | 
0.5 | Max proportion of clipped reads allowed | 
cutoff_prop_bp_under | 
0.9 | Max proportion of reads starting too early in alignment | 
cutoff_sd_indiv | 
4 | Min SD of variant position in strand-biased reads | 
cutoff_mad_indiv | 
0 | Min MAD of variant position in strand-biased reads | 
cutoff_sd_both | 
2 | Min SD for both-strand support | 
cutoff_mad_both | 
2 | Min MAD for both-strand support | 
cutoff_sd_extreme | 
10 | Extreme SD threshold for either strand | 
cutoff_mad_extreme | 
1 | Extreme MAD threshold for either strand | 
Group-level Filtering Parameters
| Parameter | Value | Description | 
|---|---|---|
cutoff_goodcov_depth | 
7 | Minimum HQ depth per cell for variant position | 
cutoff_prop_cells_goodcov_group | 
0.7 | Minimum proportion of cells with good coverage at position | 
cutoff_numreads_variant_manual | 
4 | Minimum number of HQ reads supporting variant per position | 
Step 5: Phylogenetic Inference#
SEQUOIA - Reapplies filters and builds phylogenetic tree from Mat_NV and Mat_NR.
| Parameter | Value | Description | 
|---|---|---|
second_pass_binomial_cutoff | 
-10 | Second-pass binomial log10(q-value) cutoff | 
second_pass_betabinomial_cutoff_rho_snp | 
0.4 | Beta-binomial cutoff for SNPs | 
second_pass_betabinomial_cutoff_rho_indel | 
0.4 | Beta-binomial cutoff for INDELs | 
aggregated_hq_min_mean_depth | 
10 | HQ min mean depth | 
aggregated_hq_max_mean_depth | 
500 | HQ max mean depth |