Sample sheet schema

The sample sheet is a CSV file that specifies samples and their input data. It is validated at startup against workflow/schemas/samples.schema.yaml.

Required columns

| Column | Type | Constraints | Description |
|---|---|---|---|
| sample_id | string | 1-80 characters, pattern ^[A-Za-z0-9._-]+$ | Unique sample identifier. Used as the key across the entire workflow. |
| input_type | string | srr, fastq, bam, gvcf | Type of input data for this row. |
| input | string | Non-empty. For fastq, must contain a semicolon. | Input path, SRA accession, or semicolon-separated FASTQ pair. Interpretation depends on input_type (see below). |
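
As a rough illustration of the constraints in the table above, the following sketch checks one row that has already been read into a dictionary (for example with csv.DictReader). The function name check_required_columns and the wording of the messages are hypothetical; the workflow itself validates against the schema file, and this is not its implementation.

```python
import re

# Constraints copied from the table above; all names here are hypothetical.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]+")
INPUT_TYPES = {"srr", "fastq", "bam", "gvcf"}


def check_required_columns(row: dict) -> list:
    """Return human-readable problems for one row (empty list if none)."""
    problems = []

    sample_id = row.get("sample_id") or ""
    if not 1 <= len(sample_id) <= 80:
        problems.append("sample_id must be 1-80 characters")
    elif not SAMPLE_ID_PATTERN.fullmatch(sample_id):
        problems.append("sample_id contains characters outside A-Z a-z 0-9 . _ -")

    input_type = row.get("input_type") or ""
    if input_type not in INPUT_TYPES:
        problems.append(f"input_type must be one of {sorted(INPUT_TYPES)}")

    value = row.get("input") or ""
    if not value:
        problems.append("input must be non-empty")
    elif input_type == "fastq" and ";" not in value:
        problems.append("fastq input must contain a semicolon")

    return problems
```

An empty list means the row satisfies the three required-column constraints; cross-row rules are not covered by this check.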

Optional columns

| Column | Type | Default | Constraints | Description |
|---|---|---|---|---|
| library_id | string | Same as sample_id | 1-80 characters, pattern ^[A-Za-z0-9._-]+$ | Library identifier. Used to group rows within a sample for duplicate marking and read group assignment. |
| mark_duplicates | boolean | Value of reads.mark_duplicates in config | See boolean parsing | Whether to mark duplicates for this sample. Must be consistent across all rows with the same sample_id. |

Input type semantics

The meaning of the input column depends on input_type:

| input_type | input interpretation | Multiple rows per sample | Notes |
|---|---|---|---|
| srr | SRA run accession (e.g., SRR12345678) | yes | Each row represents one SRA run. FASTQ files are downloaded automatically. |
| fastq | Semicolon-separated pair of FASTQ paths: R1;R2 | yes | Paths can be absolute or relative to the working directory. Files must be gzipped (e.g., .fastq.gz). |
| bam | Path to an aligned BAM file | no | Exactly one row per sample. The BAM is used directly (no alignment step). |
| gvcf | Path to a gVCF file | no | Exactly one row per sample. Skips alignment and per-sample variant calling. |
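
To make the per-type interpretation concrete, here is a minimal sketch that turns an input value into named fields. The function interpret_input and the keys of the returned dictionary are assumptions made for illustration and do not reflect how the workflow stores these values.

```python
def interpret_input(input_type: str, value: str) -> dict:
    """Split or label the input column according to input_type (hypothetical helper)."""
    if input_type == "srr":
        # One SRA run accession per row.
        return {"accession": value}
    if input_type == "fastq":
        # Expect exactly two non-empty paths separated by a semicolon.
        parts = value.split(";")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected R1;R2, got {value!r}")
        return {"r1": parts[0], "r2": parts[1]}
    if input_type in ("bam", "gvcf"):
        # A single file path.
        return {"path": value}
    raise ValueError(f"unknown input_type {input_type!r}")
```

For instance, interpret_input("fastq", "/data/bird_B_R1.fq.gz;/data/bird_B_R2.fq.gz") yields the R1 and R2 paths as separate fields.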

gVCF compatibility

gvcf input is only supported with the gatk and sentieon variant calling tools. It is rejected at startup when variant_calling.tool is bcftools, deepvariant, or parabricks.
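
The startup check described above could look roughly like the sketch below. The function name check_gvcf_supported and the error text are hypothetical; only the tool names and the variant_calling.tool key come from this page.

```python
# Tools that accept gvcf input, per the paragraph above.
GVCF_CAPABLE_TOOLS = {"gatk", "sentieon"}


def check_gvcf_supported(input_types, tool: str) -> None:
    """Raise if any row uses gvcf input with a tool that cannot consume it."""
    if "gvcf" in set(input_types) and tool not in GVCF_CAPABLE_TOOLS:
        raise ValueError(
            f"gvcf input requires variant_calling.tool to be one of "
            f"{sorted(GVCF_CAPABLE_TOOLS)}, got {tool!r}"
        )
```

Calling check_gvcf_supported(["srr", "gvcf"], "bcftools") raises, while the same sheet with "gatk" passes silently.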

Multiple rows per sample

A single sample_id can appear on multiple rows to represent multiple sequencing runs or libraries. The rules are:

  • srr and fastq: Multiple rows are allowed. Each row represents one run or lane.
  • bam and gvcf: Exactly one row per sample_id.
  • input_type consistency: All rows for a given sample_id must have the same input_type.
  • mark_duplicates consistency: All rows for a given sample_id must have the same mark_duplicates value, or the column must be blank on all of them.
  • library_id: Rows with the same library_id are treated as the same library (e.g., multiple lanes). Rows with different library_id values are treated as different libraries. Within each (sample_id, library_id) group, rows are assigned ordinal input units (u1, u2, ...) internally.
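
The grouping rules above can be sketched as follows. This is an illustration under stated assumptions, not the workflow's code: the function assign_units is hypothetical, units are numbered in sheet order, and the mark_duplicates consistency check is left out because it depends on the boolean parsing described in the next section.

```python
from collections import defaultdict


def assign_units(rows):
    """Validate per-sample rules and return (sample_id, library_id, unit) tuples."""
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["sample_id"]].append(row)

    units = []
    for sample_id, sample_rows in by_sample.items():
        # All rows of a sample must share one input_type.
        types = {r["input_type"] for r in sample_rows}
        if len(types) != 1:
            raise ValueError(f"{sample_id}: mixed input_type values {sorted(types)}")
        (input_type,) = types

        # bam and gvcf allow exactly one row per sample.
        if input_type in ("bam", "gvcf") and len(sample_rows) != 1:
            raise ValueError(f"{sample_id}: {input_type} allows exactly one row")

        # Number the rows within each (sample_id, library_id) group: u1, u2, ...
        counters = defaultdict(int)
        for r in sample_rows:
            library_id = r.get("library_id") or sample_id  # default per the schema
            counters[library_id] += 1
            units.append((sample_id, library_id, f"u{counters[library_id]}"))
    return units
```

Two fastq rows for the same sample and library would be labelled u1 and u2 here, while a row with a different library_id starts again at u1.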

Boolean parsing

The mark_duplicates column accepts the following values (case-insensitive):

| True values | False values |
|---|---|
| true, t, yes, y, 1 | false, f, no, n, 0 |

Missing or NA values fall back to the global reads.mark_duplicates config setting. Invalid values cause an error with the row number reported.
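
A minimal sketch of these parsing rules is shown below. The function parse_mark_duplicates is hypothetical, and two details are assumptions: surrounding whitespace is stripped, and only the empty string and the literal NA count as missing.

```python
TRUE_VALUES = {"true", "t", "yes", "y", "1"}
FALSE_VALUES = {"false", "f", "no", "n", "0"}


def parse_mark_duplicates(raw, default: bool, row_number: int) -> bool:
    """Parse one mark_duplicates cell, falling back to the config default."""
    text = (raw or "").strip()
    # Assumption: blank cells and the literal "NA" are treated as missing.
    if text in ("", "NA"):
        return default
    lowered = text.lower()  # accepted values are case-insensitive
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"row {row_number}: invalid mark_duplicates value {raw!r}")
```

Here default stands in for the value of reads.mark_duplicates from the config, and row_number is passed through only so that the error message can report it.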

Optional metadata sheet

A separate CSV file can be provided via the sample_metadata config key. This sheet is validated against workflow/schemas/sample_metadata.schema.yaml. Additional columns beyond those listed below are permitted (the schema allows additionalProperties).

| Column | Type | Required | Default | Constraints | Used by |
|---|---|---|---|---|---|
| sample_id | string | yes | | Must match a sample_id in the main samples CSV. Pattern ^[A-Za-z0-9._-]+$. | All modules |
| exclude | boolean | no | false | Boolean-like (same parsing as mark_duplicates). | Postprocess: excluded samples are removed from filtered VCFs. |
| lat | number | no | | -90 to 90 | QC: latitude for geographic map panels. |
| long | number | no | | -180 to 180 | QC: longitude for geographic map panels. |
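
The sample_id and coordinate constraints above might be checked along these lines. The helper check_metadata_row is hypothetical, and treating an empty cell as "not provided" is an assumption based on the blank lat and long values in the metadata example on this page.

```python
def check_metadata_row(row: dict, known_sample_ids: set) -> list:
    """Return problems for one metadata row (empty list if none)."""
    problems = []

    # sample_id must refer to a sample in the main sample sheet.
    if row.get("sample_id") not in known_sample_ids:
        problems.append(f"unknown sample_id {row.get('sample_id')!r}")

    # lat must lie in [-90, 90] and long in [-180, 180] when provided.
    for column, limit in (("lat", 90.0), ("long", 180.0)):
        raw = (row.get(column) or "").strip()
        if raw == "":
            continue  # optional column left blank
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{column} is not a number: {raw!r}")
            continue
        if not -limit <= value <= limit:
            problems.append(f"{column} must be between {-limit:g} and {limit:g}")
    return problems
```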

Examples

Minimal sample sheet (SRA accessions)

sample_id,input_type,input
bird_A,srr,SRR12345678
bird_B,srr,SRR12345679
bird_C,srr,SRR12345680

Local FASTQ files with library IDs

sample_id,input_type,input,library_id,mark_duplicates
bird_A,fastq,/data/bird_A_lane1_R1.fq.gz;/data/bird_A_lane1_R2.fq.gz,lib1,true
bird_A,fastq,/data/bird_A_lane2_R1.fq.gz;/data/bird_A_lane2_R2.fq.gz,lib1,true
bird_B,fastq,/data/bird_B_R1.fq.gz;/data/bird_B_R2.fq.gz,lib1,true

Mixed input types

sample_id,input_type,input
bird_A,srr,SRR12345678
bird_A,srr,SRR12345679
bird_B,fastq,/data/bird_B_R1.fq.gz;/data/bird_B_R2.fq.gz
bird_C,bam,/data/bird_C.bam
bird_D,gvcf,/data/bird_D.g.vcf.gz

Metadata sheet

sample_id,exclude,lat,long
bird_A,false,42.38,-71.12
bird_B,true,42.36,-71.06
bird_C,false,43.07,-70.76
bird_D,false,,

v1 to v2 migration

v1 sample sheets are not compatible with v2

snpArcher v2 uses a different sample sheet schema. If you are migrating from v1, you must convert your sample sheet.

| v1 column | v2 column | Notes |
|---|---|---|
| BioSample | sample_id | Must match pattern ^[A-Za-z0-9._-]+$. |
| Run (SRR/ERR/DRR) | input | Set input_type to srr. |
| fq1, fq2 | input | Join as fq1;fq2 and set input_type to fastq. |
| LibraryName | library_id | Optional; defaults to sample_id. |
| refGenome, refPath | (removed) | Reference genome is now specified in config.yaml under reference.name and reference.source. |
| SampleType | exclude (in metadata sheet) | exclude: true in the metadata sheet replaces SampleType: exclude. |
| lat, long | lat, long (in metadata sheet) | Moved from the sample sheet to the optional metadata sheet. |
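
The column mapping above can be sketched as a per-row conversion. This is not an official migration tool: the function convert_v1_row is hypothetical, and it assumes that local FASTQ paths take precedence over Run when both fq1 and fq2 are filled in. Because v1 sheets may repeat a BioSample across rows, the caller would still need to collapse the metadata rows to one per sample.

```python
def convert_v1_row(v1: dict):
    """Map one v1 row to a (sample_sheet_row, metadata_row) pair."""
    sample = {"sample_id": v1["BioSample"]}

    # Assumption: prefer local FASTQ files when both paths are present.
    if v1.get("fq1") and v1.get("fq2"):
        sample["input_type"] = "fastq"
        sample["input"] = f"{v1['fq1']};{v1['fq2']}"
    else:
        sample["input_type"] = "srr"
        sample["input"] = v1["Run"]

    # library_id is optional in v2 and defaults to sample_id when omitted.
    if v1.get("LibraryName"):
        sample["library_id"] = v1["LibraryName"]

    # SampleType, lat, and long move to the metadata sheet;
    # refGenome and refPath are dropped (now set in config.yaml).
    metadata = {
        "sample_id": v1["BioSample"],
        "exclude": "true" if v1.get("SampleType") == "exclude" else "false",
        "lat": v1.get("lat", ""),
        "long": v1.get("long", ""),
    }
    return sample, metadata
```

The converted sample_id values should still be checked against the v2 pattern, since BioSample names are not guaranteed to satisfy it.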

See the Changelog for the full list of v1-to-v2 breaking changes.