Quickstart: run the test dataset
The snpArcher test dataset ships with five simulated samples and a small reference genome. Running it end-to-end produces a joint-called, hard-filtered VCF and confirms that your installation is working before you commit time and compute to a real analysis.
Time estimate: ~5-10 minutes on a machine with 4 cores.
Prerequisites:
- snpArcher is cloned and the snparcher conda environment is installed. If not, follow the installation guide first.
Step 1: Activate the environment and navigate to snpArcher
Open a terminal and activate the snpArcher conda environment (conda activate snparcher).
Then navigate (cd) to the root of the snpArcher repository, that is, the directory you cloned.
The test dataset lives in the example/ directory.
It contains a pre-configured sample sheet and config file that point to five simulated FASTQ pairs and a small reference genome.
Step 2: Perform a dry run
A dry run verifies that the configuration files parse correctly and that Snakemake can build the full dependency graph, without executing any jobs.
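A sketch of a dry-run invocation, to be run from the repository root. The --dry-run flag is standard Snakemake; selecting the test dataset with --directory example is an assumption based on the example/ layout described above, and dry_run_test_dataset is a hypothetical wrapper name, not a snpArcher command:

```shell
# Hypothetical wrapper around the dry-run command.
# ASSUMPTION: the test dataset is selected with --directory example.
dry_run_test_dataset() {
    # --dry-run builds the job graph and lists the jobs without executing them.
    # Any extra arguments are passed straight through to snakemake.
    snakemake --directory example --dry-run "$@"
}
```

Calling dry_run_test_dataset prints the planned jobs. If the installation guide gives a different invocation, prefer that one.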
You should see output listing all the rules that Snakemake would run, along with the total number of jobs.
Look for familiar rule names like fastp, bwa_map, gatk_haplotypecaller, and genotype_gvcfs.
If you see an error instead, double-check that you activated the snparcher conda environment and that you are in the snpArcher repository root.
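Both conditions can be checked from the shell. A minimal sketch, relying on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated; check_setup is a hypothetical helper name, not part of snpArcher:

```shell
# check_setup is a hypothetical helper, not a snpArcher command.
check_setup() {
    # conda exports CONDA_DEFAULT_ENV with the name of the active environment.
    if [ "${CONDA_DEFAULT_ENV:-}" != "snparcher" ]; then
        echo "conda environment 'snparcher' is not active" >&2
        return 1
    fi
    # The repository root is the directory that contains example/.
    if [ ! -d example ]; then
        echo "no example/ directory in $(pwd)" >&2
        return 1
    fi
    echo "setup looks OK"
}
```

Run check_setup in the terminal where you intend to launch Snakemake; a non-zero exit status tells you which of the two conditions failed.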
What does --dry-run do?
The --dry-run (or -n) flag tells Snakemake to resolve the full dependency graph and report what it would do, without executing any jobs.
This is a free sanity check. Use it before every real run.
Step 3: Run the pipeline
Run the pipeline with --cores 4:
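A sketch of such a run, to be started from the repository root. --cores and --use-conda are standard Snakemake options (newer Snakemake releases also offer --software-deployment-method conda); pointing Snakemake at the test dataset with --directory example and enabling conda explicitly on the command line are assumptions, and run_test_dataset is a hypothetical wrapper name:

```shell
# Hypothetical wrapper around the real run.
# ASSUMPTIONS: the test dataset is selected with --directory example,
# and per-rule conda environments are enabled with --use-conda.
run_test_dataset() {
    # --cores 4 lets Snakemake run up to four cores' worth of jobs at once.
    snakemake --directory example --cores 4 --use-conda "$@"
}
```

Extra arguments are forwarded to snakemake, so run_test_dataset --dry-run previews exactly the jobs this command would execute.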
First run takes longer
On the first execution, Snakemake will create conda environments for each pipeline step (downloading and installing tools like BWA, GATK, samtools, etc.). This adds several minutes of one-time setup. Subsequent runs reuse the cached environments.
Snakemake will print progress to the terminal as each rule starts and finishes. The entire run should complete in roughly five minutes once the conda environments are in place.
Step 4: Check the outputs
A successful run produces output files under example/results/.
The most important output is the hard-filtered VCF.
Verify it exists and is not empty:
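One way to do that check, sketched as a small helper. The VCF path is passed as an argument because the exact filename under example/results/ is not assumed here; check_nonempty is a hypothetical name:

```shell
# check_nonempty is a hypothetical helper; pass it the path to the filtered VCF.
check_nonempty() {
    # [ -s FILE ] is true only if FILE exists and has a size greater than zero.
    if [ -s "$1" ]; then
        ls -lh "$1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}
```

The ls -lh line shows the file size in human-readable units.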
You should see a file of at least a few kilobytes. To confirm it contains variant records:
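A minimal sketch for counting records with standard tools only. In a VCF, header lines start with # and every other line is one variant record. The helper handles both plain and gzip/bgzip-compressed files; count_variants is a hypothetical name and the path you pass is whatever your filtered VCF is called:

```shell
# count_variants is a hypothetical helper; pass it the path to a VCF.
count_variants() {
    case "$1" in
        # bgzip output is gzip-compatible, so gzip can stream it.
        *.gz) gzip -cd -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | grep -vc '^#'   # count the lines that are not header lines
}
```

If bcftools is available in your environment, it can report the same number, but it is not required for this check.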
You should see a positive number of variant records.
Other output files
The test run also produces per-sample BAMs (example/results/bams/) and per-sample gVCFs (example/results/gvcfs/).
For a complete list of output files and what they contain, see the outputs reference.
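To glance at those per-sample directories without listing every file, a small sketch; summarize_outputs is a hypothetical helper name:

```shell
# summarize_outputs is a hypothetical helper: it reports how many files
# each directory passed to it contains.
summarize_outputs() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # tr strips the padding that some wc implementations add.
            n=$(find "$dir" -type f | wc -l | tr -d ' ')
            echo "$dir: $n files"
        else
            echo "$dir: not found"
        fi
    done
}
```

For the test run this would be called as summarize_outputs example/results/bams example/results/gvcfs.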
Step 5: Verify QC metrics
The core pipeline always produces a QC metrics summary, even when the interactive QC dashboard is not enabled. Check that it was generated:
This TSV file contains per-sample summary statistics including mapping rate and mean depth.
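A quick sanity check on that file, sketched under two assumptions: the first line is a header, and each following line describes one sample (so the test dataset would give five rows). The path is left as an argument because the exact filename is not assumed here; qc_rows is a hypothetical helper name:

```shell
# qc_rows is a hypothetical helper: it prints the number of data rows
# (lines after the header) in the TSV passed to it.
qc_rows() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    # ASSUMPTION: line 1 is the header; every later line is one sample.
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}
```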
QC dashboard
The interactive QC dashboard (results/qc/qc_dashboard.html) is produced only when modules.qc.enabled is set to true in the config.
The test config has it disabled by default to keep the test fast.
In a real project, we strongly recommend enabling it. See the first project tutorial and the QC report how-to guide for details.
What just happened?
snpArcher executed the full variant calling pipeline:
- Trim (fastp) and align (BWA-MEM) reads to the reference genome.
- Mark duplicates in the aligned BAMs.
- Call variants per sample (GATK HaplotypeCaller), splitting the genome at runs of Ns for parallelism.
- Joint genotype across samples via GenomicsDB and GenotypeGVCFs.
- Hard filter the joint VCF to produce the final call set.
For a deeper explanation of the pipeline architecture and the scatter-by-Ns parallelization strategy, see Architecture and Parallelization.
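To make the scatter-by-Ns idea concrete, here is a toy sketch that lists the stretches of a FASTA file lying between runs of Ns, as 0-based half-open intervals. It is an illustration only, not snpArcher's implementation, and non_n_intervals is a hypothetical helper name:

```shell
# non_n_intervals is a hypothetical helper: for each contig in a FASTA file
# it prints "contig<TAB>start<TAB>end" for every maximal run of non-N bases.
non_n_intervals() {
    awk '
        # Scan the buffered contig and print its non-N intervals.
        function emit(    i, n, start) {
            if (name == "") return
            n = length(seq)
            start = -1
            for (i = 1; i <= n; i++) {
                if (toupper(substr(seq, i, 1)) == "N") {
                    if (start >= 0) printf "%s\t%d\t%d\n", name, start, i - 1
                    start = -1
                } else if (start < 0) {
                    start = i - 1
                }
            }
            if (start >= 0) printf "%s\t%d\t%d\n", name, start, n
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}
```

Each printed interval is a region that could be handed to a separate variant-calling job, which is the sense in which runs of Ns provide natural split points.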
Next steps
- Your first project: A real analysis from SRA data to a filtered VCF, including QC review and postprocessing.
- How to configure snpArcher: Customize variant calling settings for your organism.
- How to run on an HPC cluster: Submit snpArcher jobs to SLURM or other schedulers.