How to configure your run¶
snpArcher is configured through a YAML file at config/config.yaml in your project directory.
The settings below are the ones you are most likely to change.
For the full specification of every option, see the config schema reference.
Minimal configuration¶
You need to set three things: the path to your sample sheet, a reference genome name, and a reference genome source.
samples: "config/samples.csv" # <-- change this
reference:
name: "my_organism" # <-- change this (used in output filenames)
source: "/data/reference/genome.fna.gz" # <-- change this
The reference.source field accepts:
- A local file path to a FASTA (optionally gzipped)
- A URL to a remote FASTA
- A GenBank or RefSeq accession (e.g.,
GCF_000002315.6). snpArcher will download and index it automatically.
Choose a variant caller¶
Set variant_calling.tool to select the caller.
The default is gatk.
| Tool | Description | Notes |
|---|---|---|
gatk |
GATK HaplotypeCaller + GenomicsDB + GenotypeGVCFs | Default. Well-tested for population genomics. |
sentieon |
Sentieon DNAseq (GATK-compatible, faster) | Requires a valid license (variant_calling.sentieon.license). |
bcftools |
bcftools mpileup + call | No per-sample gVCF step. Cannot accept gvcf input. |
deepvariant |
Google DeepVariant + GLnexus | ML-based caller. Cannot accept gvcf input. |
parabricks |
NVIDIA Parabricks | GPU-accelerated. Requires Apptainer image and NVIDIA GPUs. Cannot accept gvcf input. |
Note
You only need to configure the sub-block for the tool you selected. Settings for other callers are ignored.
Set the heterozygosity prior (GATK)¶
If you are using gatk and working with a non-model organism, you should adjust the heterozygosity prior.
The GATK default (0.001) is calibrated for humans.
Many non-model organisms have higher diversity, and raising this value improves variant sensitivity.
0.005 is a reasonable starting point for species with moderate to high nucleotide diversity. For high-diversity species (e.g., some invertebrates), try 0.01 or higher.
For more background, see Variant calling approaches.
Set ploidy and coverage¶
variant_calling:
ploidy: 2 # <-- change if not diploid
expected_coverage: "low" # "low" for <30x, "high" for >30x
The expected_coverage setting controls internal caller tuning.
Use low for most resequencing projects (5-15x per sample) and high for datasets with consistently deep coverage (>30x).
Enable modules¶
Optional modules are off by default.
Enable them under the modules block:
modules:
qc:
enabled: true # <-- interactive QC dashboard
clusters: 3 # number of k-means clusters for PCA
min_depth: 2 # exclude samples below this depth from QC
postprocess:
enabled: true # <-- sample exclusion and site filtering
Tip
We recommend enabling the QC module for every run. It costs little extra computation and helps catch problems early.
Configure callable sites¶
Callable sites produce a BED file of regions that pass coverage and mappability filters. This is important for analyses that need to distinguish "no variant" from "no data" (e.g., estimating nucleotide diversity).
Callable sites are enabled by default. To customize the coverage thresholds:
callable_sites:
generate_bed_file: true
coverage:
enabled: true
fraction: 1.0 # fraction of samples that must be callable
min_coverage: auto # "auto" = floor(mean_depth / 2)
max_coverage: auto # "auto" = ceil(mean_depth * 2)
merge_distance: 100
mappability:
enabled: true
kmer: 150
min_score: 1 # 1 = uniquely mappable only
merge_distance: 100
To disable callable sites entirely:
Configure postprocessing filters¶
When the postprocess module is enabled, you can control site-level filtering:
modules:
postprocess:
enabled: true
filtering:
contig_size: 10000 # exclude contigs smaller than this (bp)
maf: 0.01 # minimum minor allele frequency
missingness: 0.75 # maximum fraction of missing genotypes per site
exclude_scaffolds: "mtDNA,Y" # <-- comma-separated, no spaces
Sample exclusions are controlled through a separate metadata file. See How to filter and postprocess for details.
Full realistic example¶
Here is a complete config.yaml for a bird resequencing project using GATK with the QC and postprocess modules enabled:
samples: "config/samples.csv"
sample_metadata: "config/sample_metadata.csv"
reference:
name: "bTaeGut2"
source: "GCF_003957565.2"
variant_calling:
expected_coverage: "low"
tool: "gatk"
ploidy: 2
gatk:
het_prior: 0.005
intervals:
enabled: true
min_nmer: 500
num_gvcf_intervals: 50
callable_sites:
generate_bed_file: true
coverage:
enabled: true
fraction: 1.0
min_coverage: auto
max_coverage: auto
mappability:
enabled: true
kmer: 150
min_score: 1
modules:
qc:
enabled: true
clusters: 3
min_depth: 2
postprocess:
enabled: true
filtering:
contig_size: 10000
maf: 0.01
missingness: 0.75
exclude_scaffolds: "Z,W,mtDNA"
Next steps¶
- Run locally on a workstation
- Run on HPC on a SLURM cluster
- See the config schema reference for every available option