Sample sheet schema¶
CSV file specifying samples and their input data.
Validated at startup against workflow/schemas/samples.schema.yaml.
Required columns¶
| Column | Type | Constraints | Description |
|---|---|---|---|
sample_id |
string | 1-80 characters, pattern ^[A-Za-z0-9._-]+$ |
Unique sample identifier. Used as the key across the entire workflow. |
input_type |
string | srr, fastq, bam, gvcf |
Type of input data for this row. |
input |
string | Non-empty. For fastq, must contain a semicolon. |
Input path, SRA accession, or semicolon-separated FASTQ pair. Interpretation depends on input_type (see below). |
Optional columns¶
| Column | Type | Default | Constraints | Description |
|---|---|---|---|---|
library_id |
string | Same as sample_id |
1-80 characters, pattern ^[A-Za-z0-9._-]+$ |
Library identifier. Used to group rows within a sample for duplicate marking and read group assignment. |
mark_duplicates |
boolean | Value of reads.mark_duplicates in config |
See boolean parsing | Whether to mark duplicates for this sample. Must be consistent across all rows with the same sample_id. |
Input type semantics¶
The meaning of the input column depends on input_type:
input_type |
input interpretation |
Multiple rows per sample | Notes |
|---|---|---|---|
srr |
SRA run accession (e.g., SRR12345678) |
yes | Each row represents one SRA run. FASTQ files are downloaded automatically. |
fastq |
Semicolon-separated pair of FASTQ paths: R1;R2 |
yes | Paths can be absolute or relative to the working directory. Files must be gzipped (.fastq.gz). |
bam |
Path to an aligned BAM file | no | Exactly one row per sample. The BAM is used directly (no alignment step). |
gvcf |
Path to a gVCF file | no | Exactly one row per sample. Skips alignment and per-sample variant calling. |
gVCF compatibility
gvcf input is only supported with the gatk and sentieon variant calling tools.
It is rejected at startup when variant_calling.tool is bcftools, deepvariant, or parabricks.
Multiple rows per sample¶
A single sample_id can appear on multiple rows to represent multiple sequencing runs or libraries.
The rules are:
srrandfastq: Multiple rows are allowed. Each row represents one run or lane.bamandgvcf: Exactly one row persample_id.input_typeconsistency: All rows for a givensample_idmust have the sameinput_type.mark_duplicatesconsistency: All rows for a givensample_idmust have the samemark_duplicatesvalue (or leave it blank on all rows).library_id: Rows with the samelibrary_idare treated as the same library (e.g., multiple lanes). Rows with differentlibrary_idvalues are treated as different libraries. Within each(sample_id, library_id)group, rows are assigned ordinal input units (u1,u2, ...) internally.
Boolean parsing¶
The mark_duplicates column accepts the following values (case-insensitive):
| True values | False values |
|---|---|
true, t, yes, y, 1 |
false, f, no, n, 0 |
Missing or NA values fall back to the global reads.mark_duplicates config setting.
Invalid values cause an error with the row number reported.
Optional metadata sheet¶
A separate CSV file can be provided via the sample_metadata config key.
This sheet is validated against workflow/schemas/sample_metadata.schema.yaml.
Additional columns beyond those listed below are permitted (the schema allows additionalProperties).
| Column | Type | Required | Default | Constraints | Used by |
|---|---|---|---|---|---|
sample_id |
string | yes | Must match a sample_id in the main samples CSV. Pattern ^[A-Za-z0-9._-]+$. |
All modules | |
exclude |
boolean | no | false |
Boolean-like (same parsing as mark_duplicates). |
Postprocess: excluded samples are removed from filtered VCFs. |
lat |
number | no | -90 to 90 | QC: latitude for geographic map panels. | |
long |
number | no | -180 to 180 | QC: longitude for geographic map panels. |
Examples¶
Minimal sample sheet (SRA accessions)¶
Local FASTQ files with library IDs¶
sample_id,input_type,input,library_id,mark_duplicates
bird_A,fastq,/data/bird_A_lane1_R1.fq.gz;/data/bird_A_lane1_R2.fq.gz,lib1,true
bird_A,fastq,/data/bird_A_lane2_R1.fq.gz;/data/bird_A_lane2_R2.fq.gz,lib1,true
bird_B,fastq,/data/bird_B_R1.fq.gz;/data/bird_B_R2.fq.gz,lib1,true
Mixed input types¶
sample_id,input_type,input
bird_A,srr,SRR12345678
bird_A,srr,SRR12345679
bird_B,fastq,/data/bird_B_R1.fq.gz;/data/bird_B_R2.fq.gz
bird_C,bam,/data/bird_C.bam
bird_D,gvcf,/data/bird_D.g.vcf.gz
Metadata sheet¶
sample_id,exclude,lat,long
bird_A,false,42.38,-71.12
bird_B,true,42.36,-71.06
bird_C,false,43.07,-70.76
bird_D,false,,
v1 to v2 migration¶
v1 sample sheets are not compatible with v2
snpArcher v2 uses a different sample sheet schema. If you are migrating from v1, you must convert your sample sheet.
| v1 column | v2 column | Notes |
|---|---|---|
BioSample |
sample_id |
Must match pattern ^[A-Za-z0-9._-]+$. |
Run (SRR/ERR/DRR) |
input |
Set input_type to srr. |
fq1, fq2 |
input |
Join as fq1;fq2 and set input_type to fastq. |
LibraryName |
library_id |
Optional; defaults to sample_id. |
refGenome, refPath |
(removed) | Reference genome is now specified in config.yaml under reference.name and reference.source. |
SampleType |
exclude (in metadata sheet) |
exclude: true in the metadata sheet replaces SampleType: exclude. |
lat, long |
lat, long (in metadata sheet) |
Moved from the sample sheet to the optional metadata sheet. |
See the Changelog for the full list of v1-to-v2 breaking changes.