How to filter and postprocess¶
The postprocess module produces a cleaned VCF by removing flagged samples and applying site-level quality filters.
For the rationale behind each filtering step, see Filtering philosophy.
Enable postprocessing¶
In your config.yaml, set modules.postprocess.enabled to true.
Then re-run snpArcher. Snakemake detects that the postprocess rules are now required and runs only the new steps.
Exclude samples¶
Sample exclusions are managed through a sample metadata file, separate from the main sample sheet.
1. Create the metadata file¶
Create a CSV file whose header row includes at least the sample_id and exclude columns, followed by one row per listed sample.
Set exclude to true for any sample you want removed from the postprocessed VCF.
The sample_id values must match entries in your main sample sheet.
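Before re-running, it can help to check the metadata file against your sample sheet. The sketch below is a hypothetical helper, not part of snpArcher. It assumes the metadata is a CSV with sample_id and exclude columns as described above, that you supply the sample IDs from the main sample sheet yourself, and that "true" is matched case-insensitively (how snpArcher itself parses the value is not covered here). The sample names are made up for illustration.

```python
import csv
import io


def check_metadata(metadata_text, sheet_ids):
    """Return (excluded, unknown) sample IDs from metadata CSV text.

    excluded: IDs whose exclude value is "true" (case-insensitive; an assumption).
    unknown:  IDs that do not appear in sheet_ids.
    """
    reader = csv.DictReader(io.StringIO(metadata_text))
    missing_cols = {"sample_id", "exclude"} - set(reader.fieldnames or [])
    if missing_cols:
        raise ValueError(f"missing required columns: {sorted(missing_cols)}")

    excluded, unknown = [], []
    for row in reader:
        sample_id = row["sample_id"].strip()
        if sample_id not in sheet_ids:
            unknown.append(sample_id)
        if row["exclude"].strip().lower() == "true":
            excluded.append(sample_id)
    return excluded, unknown


# Hypothetical sample names, for illustration only.
example = "sample_id,exclude\nbird_01,true\nbird_02,false\nbird_99,true\n"
excluded, unknown = check_metadata(example, {"bird_01", "bird_02", "bird_03"})
print("excluded:", excluded)
print("not in sample sheet:", unknown)
```

Any ID reported as not in the sample sheet is worth correcting before the run, since the sample_id values must match entries in the main sample sheet.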
2. Point config to the metadata file¶
In your config.yaml, set sample_metadata to the path of this file.
Tip
The metadata file can also include optional columns used by other modules:
- lat, long (numeric): geographic coordinates for the QC map panels.
Configure site-level filters¶
The modules.postprocess.filtering block controls which sites are retained in the clean VCF:
    modules:
      postprocess:
        enabled: true
        filtering:
          contig_size: 10000            # <-- exclude contigs smaller than this (bp)
          maf: 0.01                     # <-- minimum minor allele frequency
          missingness: 0.75             # <-- maximum fraction of missing genotypes per site
          exclude_scaffolds: "mtDNA,Y"  # <-- comma-separated scaffold names to remove
Filter descriptions¶
| Filter | Default | Effect |
|---|---|---|
| contig_size | 10000 | SNPs on contigs/scaffolds this size or smaller are excluded. |
| maf | 0.01 | SNPs with minor allele frequency below this threshold are excluded. |
| missingness | 0.75 | SNPs where more than this fraction of genotypes are missing are excluded. |
| exclude_scaffolds | "mtDNA,Y" | SNPs on these scaffolds are excluded entirely. Comma-separated, no spaces. |
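To make the direction of the maf and missingness thresholds concrete, here is a minimal Python sketch that applies them to a single biallelic site. It is an illustration only and is not how snpArcher implements the filters. The function name, the genotype representation (tuples of allele codes, with None for a missing call), and the choice to treat a half-missing genotype as missing are all assumptions.

```python
def site_passes(genotypes, maf=0.01, missingness=0.75):
    """Apply the maf and missingness thresholds to one biallelic site.

    genotypes: one tuple of allele codes per sample, e.g. (0, 1);
    None marks a missing allele call.
    """
    if not genotypes:
        return False
    # A genotype with any missing allele counts as missing (an assumption).
    called = [gt for gt in genotypes if None not in gt]
    missing_fraction = (len(genotypes) - len(called)) / len(genotypes)
    if missing_fraction > missingness:
        return False  # more than the allowed fraction of genotypes is missing

    alleles = [allele for gt in called for allele in gt]
    if not alleles:
        return False  # nothing was called, so no frequency can be computed
    alt_freq = sum(1 for allele in alleles if allele == 1) / len(alleles)
    minor_freq = min(alt_freq, 1 - alt_freq)
    return minor_freq >= maf  # sites below the maf threshold are excluded


# Ten hypothetical samples: eight homozygous reference, one het, one missing.
site = [(0, 0)] * 8 + [(0, 1)] + [(None, None)]
print(site_passes(site))            # default thresholds
print(site_passes(site, maf=0.1))   # stricter MAF threshold

# Eight of ten genotypes missing exceeds the 0.75 default.
sparse_site = [(None, None)] * 8 + [(0, 1), (0, 0)]
print(site_passes(sparse_site))
```

With the defaults, the first site is kept (10% missing, minor allele frequency 1/18), while raising maf to 0.1 or losing 80% of the genotypes causes the site to be dropped.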
Tip
For many organisms, you will want to exclude sex chromosomes and mitochondrial scaffolds.
For example, for a ZW bird: exclude_scaffolds: "Z,W,mtDNA".
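Because exclude_scaffolds is comma-separated with no spaces, a stray space is an easy mistake to make. The following sketch is a hypothetical pre-flight check, not part of snpArcher, that rejects whitespace and splits the value into scaffold names.

```python
def parse_exclude_scaffolds(value):
    """Split an exclude_scaffolds string into a list of scaffold names."""
    if any(ch.isspace() for ch in value):
        raise ValueError("exclude_scaffolds must be comma-separated with no spaces")
    # Drop empty entries produced by a trailing or doubled comma.
    return [name for name in value.split(",") if name]


print(parse_exclude_scaffolds("Z,W,mtDNA"))
```

Comparing the returned names against the contig names in your reference helps catch typos in the scaffold list.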
Intersect with callable sites¶
If callable sites are enabled (the default), the postprocess module automatically intersects the VCF with the callable sites BED file, restricting the clean VCF to regions that pass both coverage and mappability filters.
If you disabled callable sites, the postprocess module will warn and skip this intersection step.
To configure callable sites thresholds, see How to configure your run.
Output files¶
After postprocessing, the following files are produced in results/postprocess/:
| File | Description |
|---|---|
| filtered.vcf.gz | VCF after removing excluded samples and sites failing hard filters or with ref=N or AF=0. |
| clean_snps.vcf.gz | Biallelic SNPs from the filtered VCF, after applying missingness, MAF, callable-site, and scaffold-exclusion filters. |
| clean_indels.vcf.gz | Indels from the filtered VCF, after applying the same filters. |
The clean_snps.vcf.gz file is typically what you want for downstream population genomic analyses.
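One quick sanity check is to confirm that the excluded samples no longer appear in the output header. The sketch below uses only the Python standard library and relies on the VCF convention that sample names follow the nine fixed columns of the #CHROM line. The function names are hypothetical and nothing here is part of snpArcher.

```python
import gzip


def vcf_samples(path):
    """Return the sample names from the #CHROM header line of a gzipped VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed VCF fields; samples start at column 10.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a #CHROM line
    raise ValueError(f"no #CHROM header line found in {path}")


def still_present(path, excluded_ids):
    """Return the excluded IDs that still appear in the VCF header, sorted."""
    return sorted(set(excluded_ids) & set(vcf_samples(path)))
```

For example, `still_present("results/postprocess/clean_snps.vcf.gz", ["bird_01"])` should return an empty list if the hypothetical sample bird_01 was excluded successfully.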
Worked example¶
Suppose you ran snpArcher on 50 bird samples, reviewed the QC report, and decided to:
- Exclude 3 samples with very low depth (< 5x)
- Remove the Z chromosome and mitochondrial scaffolds
- Set a stricter MAF filter

1. Create config/sample_metadata.csv with one row per excluded sample and exclude set to true. (You only need to list the samples you want to exclude. You can optionally list all samples, setting exclude to false for those you want to keep.)
2. Update config.yaml so that exclude_scaffolds names the Z chromosome and mitochondrial scaffolds and maf is above the 0.01 default.
3. Re-run snpArcher:

        snakemake \
          --snakefile /path/to/snpArcher/workflow/Snakefile \
          --directory /path/to/my_project \
          --use-conda \
          --cores 8

   Snakemake will only run the postprocessing steps, since the core pipeline outputs already exist.
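The metadata file in this example can be generated rather than typed by hand. The sketch below assumes you have already collected per-sample mean depths into a dictionary (for instance, by reading them off the QC report). The function name, the sample names in the usage note, and the 5x cutoff argument are illustrative and not part of snpArcher.

```python
import csv


def write_exclusions(mean_depth, path, min_depth=5.0):
    """Write a sample_id,exclude CSV listing samples below min_depth.

    mean_depth: dict mapping sample ID to mean sequencing depth.
    Returns the sorted list of excluded sample IDs.
    """
    low = sorted(s for s, depth in mean_depth.items() if depth < min_depth)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["sample_id", "exclude"])
        for sample_id in low:
            writer.writerow([sample_id, "true"])
    return low
```

Calling `write_exclusions({"bird_07": 3.2, "bird_08": 21.5}, "config/sample_metadata.csv")` with these made-up depths would list only bird_07, which matches the option of including just the samples you want to exclude.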
Next steps¶
- Filtering philosophy for background on why these filters matter
- Outputs reference for a complete listing of all output files