BJ-Somatic-Variantcalling
Background#
BJ-Somatic-Variantcalling is a scalable and reproducible bioinformatics pipeline for processing single-cell sequencing data from BioSkryb's Whole Genome Amplification chemistries. Pipeline directly works with single-cells without the need for bulk control samples. It takes raw sequencing data in the form of FASTQ (Illumina/Element) or CRAM (Ultima), performs alignment and removes duplicate reads (Illumina/Element). Variant calling is performed using Google DeepVariant with a custom model trained with BioSkryb's single-cell data. Then it is followed by a somatic-heuristic-filter step to make somatic calls.
Pipeline Overview#
flowchart LR
%% Colors %%
classDef panel fill:transparent,stroke:#323232,stroke-dasharray:8
classDef panelt fill:transparent,stroke-opacity: 0
classDef black fill:#12294C,stroke:#12294C,stroke-width:2px,color:#fff
classDef blue fill:#20A4F3,stroke:#20A4F3,stroke-width:2px,color:#fff
classDef green fill:#3BCEAC,stroke:#3BCEAC,stroke-width:2px,color:#fff
classDef yellow fill:#ffd166,stroke:#ffd166,stroke-width:2px,color:#fff
classDef pink fill:#ef476f,stroke:#ef476f,stroke-width:2px,color:#fff
classDef orange fill:#f3722c,stroke:#f3722c,stroke-width:2px,color:#fff
classDef red fill:#BB4430,stroke:#BB4430,stroke-width:2px,color:#fff
classDef ming fill:#387780,stroke:#387780,stroke-width:2px,color:#fff
Start((Start)):::black --fastq--> Alignment[Parabricks Align Reads <br/> <br/> Remove Duplicates]
subgraph Map
Alignment:::green
end
subgraph Variant[Variant Calling]
Alignment --> Deepvariant[Deepvariant]:::yellow
end
subgraph Evaluate
Alignment --> Metrics[Alignment Metrics <br/><br/> GC Metrics <br/><br/> Insert Size Metrics <br/><br/> WGS Metrics <br/><br/> HS Metrics - Exome mode ]:::pink
end
subgraph SHF[Somatic Heuristic Filter]
Deepvariant --> shf[Somatic Heuristic Filter]:::pink
end
subgraph Report
shf --> Mqc[MultiQC Report]:::orange
Metrics --> Mqc
end
Mqc --> End((End)):::black
Map:::panel
Variant:::panel
Evaluate:::panel
SHF:::panel
Report:::panel
Following are the steps and tools that pipeline uses to perform the analyses:
-
Map reads to reference genome and remove duplicate reads using
PARABRICKS FQ2BAM
-
Perform variant calling with
GOOGLE DEEPVARIANT
-
Make somatic calls using
SOMATIC HEURISTIC FILTER
. See below for details. -
Evaluate metrics using
PARABRICKS COLLECTMETRICS
which includes Alignment, GC Bias, Insert Size, and Coverage metrics -
Aggregate the metrics across biosamples and tools to create overall pipeline statistics summary using
MULTIQC
Pipeline Parameters#
Parameter Name | Options | Description |
---|---|---|
Genome | GRCh38 (default) |
Reference genome to use for alignment |
Module Parameters#
Module | Parameter Name | Options | Description |
---|---|---|---|
Subsampling | (default) | Enables downsampling of input reads to a specified read count using SEQTK. | |
Mutation Signature Profile | (default) | Performs mutational signature profiling. |
Output files#
Output Directory/File |
Notes |
---|---|
multiqc / |
This section includes output files containing metrics from various tools to create a MultiQC report. MultiQC Report Example The all_metrics_mqc.txt contains metrics from the All Metrics section of the MultiQC report found in BaseJumper. |
PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/ GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/ SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/ |
PARABRICKS_PRIMARY_WORKFLOW_PARABRICKS_FQ2BAM/ Biosample level output containing aligned reads and index file. GOOGLE_DEEPVARIANT_WF_DEEPVARIANT_POSTPROCESS/ Biosample level output containing the variant calls in VCF format and index file. SOMATIC_VARIANT_WORKFLOW_Heuristic_Filter_SUBSET_VCF_VARIANTS/ Biosample level output containing the somatic filtered variant calls in VCF format and index file. |
execution_info / |
This section includes detail execution information regarding all the tasks in pipeline run. |
Somatic Heuristic Filter#
The Somatic Heuristic Filter is designed to efficiently and accurately filter somatic variants in single-cell sequencing data. There is individual cell-level filters based on alignment score, proportion of clipped reads, and positional statistics. Then there is group-level filters that requires a minimum proportion of cells with sufficicent coverage and a minimum number of high-quality supporting reads. These filters are organized into two-step process:
Step 1: Binomial and Beta-binomial Filtering
- Aggregates counts across all single cells.
- Uses binomial and beta-binomial statistical tests (parallelized by chromosome).
- Applies depth filters based on mean coverage.
- Output: Table of variants passing depth and statistical filters.
Step 2: Extensive QC Filtering
- Applies quality control to positions with variation in at least one cell.
- Extracts detailed statistics from BAM pileups for each variant position, including:
- Read support by strand and quality.
- Median alignment score.
- Proportion of clipped reads.
- Positional statistics (standard deviation, median absolute deviation).
- This step is computationally intensive due to the large number of positions.
Final Output
- Variants passing all filters are used to construct matrices for phylogenetic inference.
- A second round of Sequoia filtering is applied for final variant selection.
Key Parameters#
Step 1: Binomial/Beta-binomial Filtering#
Parameter | Value | Description |
---|---|---|
aggregated_min_mean_depth |
10 | Minimum average depth across all cells |
aggregated_max_mean_depth |
500 | Maximum average depth across all cells |
gender |
female | Affects X/Y chromosome treatment in depth filters |
Step 2: Concatenation and Filtering#
Parameter | Value | Description |
---|---|---|
first_pass_binomial_cutoff |
-10 | Binomial log10(q-value) cutoff |
first_pass_betabinomial_cutoff |
0.4 | Beta-binomial rho cutoff |
Step 3: Pileup-based QC#
Parameter | Value | Description |
---|---|---|
cutoff_mq_hq |
30 | Minimum mapping quality for HQ read |
cutoff_bq_hq |
30 | Minimum base quality for HQ read |
cutoff_bps_start |
15 | Max bp start position to assess read start bias |
num_lines_read_pileup |
250000 | Number of lines read per chunk for parallel processing |
Extracted features include strand-specific support, alignment score, clipping, read start positions, and statistics like SD/MAD.
Step 4: Per-cell & Group-level Variant Filtering#
Per-cell Filtering Parameters
Parameter | Value | Description |
---|---|---|
cutoff_as |
140 | Min median alignment score of reads supporting variant |
cutoff_prop_clipped_reads |
0.5 | Max proportion of clipped reads allowed |
cutoff_prop_bp_under |
0.9 | Max proportion of reads starting too early in alignment |
cutoff_sd_indiv |
4 | Min SD of variant position in strand-biased reads |
cutoff_mad_indiv |
0 | Min MAD of variant position in strand-biased reads |
cutoff_sd_both |
2 | Min SD for both-strand support |
cutoff_mad_both |
2 | Min MAD for both-strand support |
cutoff_sd_extreme |
10 | Extreme SD threshold for either strand |
cutoff_mad_extreme |
1 | Extreme MAD threshold for either strand |
Group-level Filtering Parameters
Parameter | Value | Description |
---|---|---|
cutoff_goodcov_depth |
7 | Minimum HQ depth per cell for variant position |
cutoff_prop_cells_goodcov_group |
0.7 | Minimum proportion of cells with good coverage at position |
cutoff_numreads_variant_manual |
4 | Minimum number of HQ reads supporting variant per position |
Step 5: Phylogenetic Inference#
SEQUOIA - Reapplies filters and builds phylogenetic tree from Mat_NV
and Mat_NR
.
Parameter | Value | Description |
---|---|---|
second_pass_binomial_cutoff |
-10 | Second-pass binomial log10(q-value) cutoff |
second_pass_betabinomial_cutoff_rho_snp |
0.4 | Beta-binomial cutoff for SNPs |
second_pass_betabinomial_cutoff_rho_indel |
0.4 | Beta-binomial cutoff for INDELs |
aggregated_hq_min_mean_depth |
10 | HQ min mean depth |
aggregated_hq_max_mean_depth |
500 | HQ max mean depth |