How to filter and postprocess

The postprocess module re-filters the VCF after excluding flagged samples and applying site-level quality filters.

For the rationale behind each filtering step, see Filtering philosophy.

Enable postprocessing

In your config.yaml, set the postprocess module to enabled:

modules:
  postprocess:
    enabled: true

Then re-run snpArcher. Snakemake detects that the postprocess rules are now required and runs only the new steps.

Exclude samples

Sample exclusions are managed through a sample metadata file, separate from the main sample sheet.

1. Create the metadata file

Create a CSV file with at least sample_id and exclude columns:

sample_id,exclude
bird_1,false
bird_2,false
bird_3,true
bird_4,false
bird_5,true

Set exclude to true for any sample you want removed from the postprocessed VCF. The sample_id values must match entries in your main sample sheet.
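
Before re-running, it can help to check the metadata file for typos. The following is a minimal sketch, not part of snpArcher: the function name read_exclusions is hypothetical, the known sample IDs are assumed to have been read from your main sample sheet separately, and accepting any capitalization of true/false is a leniency of this sketch rather than documented pipeline behavior.

```python
import csv
import io


def read_exclusions(metadata_text, known_sample_ids):
    """Return the set of sample IDs flagged exclude=true.

    Raises ValueError on missing columns, sample IDs that are not in
    the main sample sheet, or exclude values other than true/false.
    """
    reader = csv.DictReader(io.StringIO(metadata_text))
    missing = {"sample_id", "exclude"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    excluded = set()
    for row in reader:
        sample_id = row["sample_id"].strip()
        # Assumption: capitalization is ignored here; the pipeline may be stricter.
        flag = row["exclude"].strip().lower()
        if sample_id not in known_sample_ids:
            raise ValueError(f"{sample_id!r} is not in the main sample sheet")
        if flag not in {"true", "false"}:
            raise ValueError(f"{sample_id!r}: exclude must be true or false, got {flag!r}")
        if flag == "true":
            excluded.add(sample_id)
    return excluded


# The example metadata from above, checked against a five-sample sheet.
metadata = """sample_id,exclude
bird_1,false
bird_2,false
bird_3,true
bird_4,false
bird_5,true
"""
sample_sheet_ids = {f"bird_{i}" for i in range(1, 6)}
print(sorted(read_exclusions(metadata, sample_sheet_ids)))  # ['bird_3', 'bird_5']
```

For a file on disk, pass the result of open("config/sample_metadata.csv").read() as metadata_text.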

2. Point config to the metadata file

Add the sample_metadata path to your config.yaml:

sample_metadata: "config/sample_metadata.csv"  # <-- change this

Tip

The metadata file can also include optional columns used by other modules:

  • lat, long (numeric): geographic coordinates for the QC map panels.

Configure site-level filters

The modules.postprocess.filtering block controls which sites are retained in the clean VCF:

modules:
  postprocess:
    enabled: true
    filtering:
      contig_size: 10000  # <-- exclude contigs this size or smaller (bp)
      maf: 0.01  # <-- minimum minor allele frequency
      missingness: 0.75  # <-- maximum fraction of missing genotypes per site
      exclude_scaffolds: "mtDNA,Y"  # <-- comma-separated scaffold names to remove

Filter descriptions

| Filter | Default | Effect |
|---|---|---|
| contig_size | 10000 | SNPs on contigs/scaffolds this size or smaller are excluded. |
| maf | 0.01 | SNPs with minor allele frequency below this threshold are excluded. |
| missingness | 0.75 | SNPs where more than this fraction of genotypes are missing are excluded. |
| exclude_scaffolds | "mtDNA,Y" | SNPs on these scaffolds are excluded entirely. Comma-separated, no spaces. |
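
To make the threshold semantics concrete, here is a rough sketch of how the four settings decide whether a single biallelic site is kept. It illustrates the comparisons described above and is not snpArcher's implementation. The function names, the genotype encoding, and the choice to compute MAF over non-missing genotypes only are all assumptions made for this example.

```python
def minor_allele_frequency(genotypes):
    """MAF at a biallelic site.

    Each genotype is a tuple of 0/1 alleles, or None if missing.
    Assumption: frequency is computed over non-missing genotypes only.
    """
    alleles = [a for gt in genotypes if gt is not None for a in gt]
    if not alleles:
        return 0.0
    alt_freq = sum(alleles) / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


def missing_fraction(genotypes):
    """Fraction of samples with a missing genotype at this site."""
    return sum(gt is None for gt in genotypes) / len(genotypes)


def site_passes(contig, contig_length, genotypes, filtering):
    """Apply the four site-level filters in turn; True means the site is kept."""
    if contig in set(filtering["exclude_scaffolds"].split(",")):
        return False
    if contig_length <= filtering["contig_size"]:  # "this size or smaller"
        return False
    if missing_fraction(genotypes) > filtering["missingness"]:
        return False
    if minor_allele_frequency(genotypes) < filtering["maf"]:
        return False
    return True


# Defaults from the filtering block above.
filtering = {
    "contig_size": 10000,
    "maf": 0.01,
    "missingness": 0.75,
    "exclude_scaffolds": "mtDNA,Y",
}

# Hypothetical site: four samples, one missing; MAF = 0.5, missingness = 0.25.
genotypes = [(0, 1), (0, 0), None, (1, 1)]

print(site_passes("chr1", 50000, genotypes, filtering))        # True
print(site_passes("mtDNA", 16000, genotypes, filtering))       # False: excluded scaffold
print(site_passes("scaffold_9", 10000, genotypes, filtering))  # False: contig too small
```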

Tip

For many organisms, you will want to exclude sex chromosomes and mitochondrial scaffolds. For example, for a ZW bird: exclude_scaffolds: "Z,W,mtDNA".

Intersect with callable sites

If callable sites are enabled (the default), the postprocess module automatically intersects the VCF with the callable sites BED file, restricting the clean VCF to regions that pass both coverage and mappability filters.

If you disabled callable sites, the postprocess module will warn and skip this intersection step.
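
Conceptually, the intersection keeps a variant only if its position falls inside a callable interval. The sketch below shows that membership test and the coordinate conversion it requires: BED intervals are 0-based and half-open, while VCF positions are 1-based. The function names are hypothetical, the intervals are assumed to be non-overlapping, and the pipeline itself may perform this step with a different tool.

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into {contig: sorted list of (start, end)}."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        intervals.setdefault(contig, []).append((int(start), int(end)))
    for spans in intervals.values():
        spans.sort()
    return intervals


def is_callable(intervals, contig, vcf_pos):
    """True if a 1-based VCF position lies in a 0-based, half-open BED interval.

    Assumption: intervals on each contig do not overlap.
    """
    spans = intervals.get(contig, [])
    zero_based = vcf_pos - 1
    # Index of the last interval whose start is <= the position.
    i = bisect_right(spans, (zero_based, float("inf"))) - 1
    return i >= 0 and spans[i][0] <= zero_based < spans[i][1]


# Hypothetical callable regions: chr1 bases 101-200 and 501-600 in VCF coordinates.
callable_sites = load_bed(["chr1\t100\t200", "chr1\t500\t600"])

print(is_callable(callable_sites, "chr1", 100))  # False: one base before the interval
print(is_callable(callable_sites, "chr1", 101))  # True: first callable base
print(is_callable(callable_sites, "chr1", 200))  # True: last callable base
print(is_callable(callable_sites, "chr1", 201))  # False: BED end is exclusive
```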

To configure callable sites thresholds, see How to configure your run.

Output files

After postprocessing, the following files are produced in results/postprocess/:

| File | Description |
|---|---|
| filtered.vcf.gz | VCF after removing excluded samples and sites that fail hard filters, have an N reference allele, or have an allele frequency of 0. |
| clean_snps.vcf.gz | Biallelic SNPs from the filtered VCF, after applying missingness, MAF, callable-site, and scaffold-exclusion filters. |
| clean_indels.vcf.gz | Indels from the filtered VCF, after applying the same filters. |

The clean_snps.vcf.gz file is typically what you want for downstream population genomic analyses.
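
A quick sanity check after postprocessing is to confirm that the excluded samples are gone and to see how many sites survived. The following sketch uses only the Python standard library; summarize_vcf is a hypothetical helper name, and it relies on bgzip-compressed files being readable as ordinary gzip.

```python
import gzip


def summarize_vcf(path):
    """Return (sample_names, record_count) for a gzip- or bgzip-compressed VCF."""
    samples, records = [], 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Sample names follow the nine fixed VCF columns.
                samples = line.rstrip("\n").split("\t")[9:]
            elif line.strip():
                records += 1
    return samples, records
```

For example, samples, n_sites = summarize_vcf("results/postprocess/clean_snps.vcf.gz") returns the retained sample names and the number of SNP records; none of the samples marked exclude: true should appear in samples.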

Worked example

Suppose you ran snpArcher on 50 bird samples, reviewed the QC report, and decided to:

  • Exclude 3 samples with very low depth (< 5x)
  • Remove the Z chromosome and mitochondrial scaffolds
  • Set a stricter MAF filter

To apply these decisions:

  • Create config/sample_metadata.csv:

    sample_id,exclude
    bird_12,true
    bird_27,true
    bird_41,true
    

    (You only need to list the samples you want to exclude. You can optionally list all samples, setting exclude to false for those you want to keep.)

  • Update config.yaml:

    sample_metadata: "config/sample_metadata.csv"
    
    modules:
      postprocess:
        enabled: true
        filtering:
          contig_size: 10000
          maf: 0.05  # <-- stricter MAF filter
          missingness: 0.75
          exclude_scaffolds: "Z,mtDNA"
    
  • Re-run snpArcher:

    snakemake \
      --snakefile /path/to/snpArcher/workflow/Snakefile \
      --directory /path/to/my_project \
      --use-conda \
      --cores 8
    

    Snakemake will only run the postprocessing steps, since the core pipeline outputs already exist.

Next steps