Modules
Two optional modules extend the core pipeline, each enabled via modules.<name>.enabled: true in config.yaml.
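As a minimal sketch, enabling both modules might look like the fragment below. It assumes that the dotted keys used on this page (modules.<name>.enabled) correspond to nested YAML mappings in config.yaml; that nesting is inferred from the key names and is not stated explicitly here.

```yaml
# config.yaml (fragment)
# Assumption: dotted keys such as modules.qc.enabled map to nested YAML mappings.
modules:
  qc:
    enabled: true
  postprocess:
    enabled: true
```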
QC
The QC module produces an interactive HTML dashboard with visualizations for quality control of population genomic datasets.
Requirements
- At least 3 samples in the sample sheet.
- The core pipeline must produce results/vcfs/raw.vcf.gz and results/qc_metrics/qc_report.tsv.
Config options
| Key | Type | Default | Constraints | Description |
|---|---|---|---|---|
| modules.qc.enabled | boolean | false | | Enable the module. |
| modules.qc.clusters | integer | 3 | >= 1 | Number of clusters for PCA-based clustering in the dashboard. |
| modules.qc.min_depth | number | 2 | >= 0 | Samples with mean depth below this value are excluded from QC analyses. |
| modules.qc.google_api_key | string | "" | | Google Maps JavaScript API key for geographic map panels. If empty, map panels are omitted. |
| modules.qc.exclude_scaffolds | string | "" | | Comma-separated list of scaffold/contig names to exclude from QC analyses (e.g., "scaffold_mt,scaffold_Z"). |
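The options above could be combined as in the following sketch. The nested layout is an assumption inferred from the dotted key names. Every value shown is the documented default except enabled, which is switched on, and exclude_scaffolds, which reuses the example scaffold names from the table.

```yaml
# config.yaml (fragment): QC module
# Assumption: dotted keys map to nested YAML mappings.
modules:
  qc:
    enabled: true          # default is false
    clusters: 3            # must be >= 1
    min_depth: 2           # samples with mean depth below this are excluded
    google_api_key: ""     # empty: map panels are omitted
    exclude_scaffolds: "scaffold_mt,scaffold_Z"  # example names from the table; default is ""
```

Leaving google_api_key empty keeps the rest of the dashboard intact and only drops the geographic map panels.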
Dashboard panels
The HTML dashboard (results/qc/qc_dashboard.html) contains:
- PCA plots (PC1 vs PC2, PC1 vs PC3) with cluster assignments
- Per-sample depth vs. PC correlation plots
- Relatedness heatmap (kinship coefficients)
- Inbreeding coefficient (F) distribution
- Neighbor-joining tree
- ADMIXTURE bar plots (K=2 and K=3)
- Per-sample missingness and heterozygosity
- Geographic map of PCA clusters (requires google_api_key and lat/long in sample metadata)
- Filter summary counts
Outputs
| Path | Format | Description |
|---|---|---|
| results/qc/qc_dashboard.html | HTML | Interactive QC dashboard. |
| results/qc/pruned.vcf.gz | VCF | LD-pruned, subsampled VCF used for analyses. |
| results/qc/plink.eigenvec | Text | PCA eigenvectors. |
| results/qc/plink.eigenval | Text | PCA eigenvalues. |
| results/qc/plink.2.Q | Text | ADMIXTURE Q matrix (K=2). |
| results/qc/plink.3.Q | Text | ADMIXTURE Q matrix (K=3). |
| results/qc/individuals.idepth | Text | Per-individual depth. |
| results/qc/individuals.imiss | Text | Per-individual missingness. |
| results/qc/individuals.het | Text | Per-individual heterozygosity. |
Postprocess
The postprocess module filters the raw VCF to produce clean SNP and indel call sets by removing excluded samples, applying site-level filters, and intersecting with callable sites.
Sample exclusion is controlled via the exclude column in the sample metadata sheet.
Requirements
- The core pipeline must produce results/vcfs/raw.vcf.gz.
- Callable sites BED (results/callable_sites/callable_sites.bed) is used if available.
Config options
| Key | Type | Default | Constraints | Description |
|---|---|---|---|---|
| modules.postprocess.enabled | boolean | false | | Enable the module. |
| modules.postprocess.filtering.contig_size | integer | 10000 | >= 0 | Exclude SNPs on contigs with length equal to or smaller than this value (bp). Set to 0 to disable. |
| modules.postprocess.filtering.maf | number | 0.01 | 0 to 1 | Minimum minor allele frequency threshold. |
| modules.postprocess.filtering.missingness | number | 0.75 | 0 to 1 | Minimum genotyping rate (fraction of samples with a called genotype). |
| modules.postprocess.filtering.exclude_scaffolds | string | "mtDNA,Y" | | Comma-separated scaffold/contig names to exclude. |
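A sketch of a postprocess configuration using the documented defaults is shown below. As with the other keys on this page, the nesting (including the filtering sub-mapping) is an assumption inferred from the dotted key names.

```yaml
# config.yaml (fragment): postprocess module
# Assumption: dotted keys map to nested YAML mappings.
modules:
  postprocess:
    enabled: true                    # default is false
    filtering:
      contig_size: 10000             # bp; contigs this size or smaller are excluded, 0 disables
      maf: 0.01                      # minimum minor allele frequency
      missingness: 0.75              # minimum fraction of samples with a called genotype
      exclude_scaffolds: "mtDNA,Y"   # comma-separated names
```

Despite its name, missingness is described in the table as a minimum genotyping rate, so a higher value is the stricter filter.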
Outputs
| Path | Format | Description |
|---|---|---|
| results/postprocess/filtered.vcf.gz | VCF | VCF with excluded samples removed and default GATK hard filters applied. |
| results/postprocess/clean_snps.vcf.gz | VCF | Final clean biallelic SNP VCF after all site-level filters. |
| results/postprocess/clean_indels.vcf.gz | VCF | Final clean indel VCF after all site-level filters. |
| results/postprocess/callable_sites_filtered.bed | BED | Callable sites BED after small contig and scaffold exclusion. |
| results/postprocess/include_samples.txt | Text | List of retained sample IDs. |