Limiting wildcard matching with wildcard constraints

Wildcards are one of the most powerful features in snakemake. But sometimes they cause trouble by matching too broadly and picking up more files than intended!

See the section on wildcards for an introduction to wildcards!

By default, wildcards in snakemake match one or more characters (the regular expression .+). That is, they won't match an empty string, but they'll match anything else, including slashes. As discussed in the wildcards chapter, this can cause problems!
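As a rough illustration in plain Python (this does not call snakemake), an unconstrained wildcard behaves like the regular expression .+ applied to the whole candidate string. The helper name matches_default_wildcard below is made up for this sketch:

```python
import re

# Assumption for this sketch: an unconstrained wildcard behaves like ".+"
DEFAULT_WILDCARD = re.compile(r".+")

def matches_default_wildcard(text):
    """Return True if the whole string is accepted by '.+'."""
    return DEFAULT_WILDCARD.fullmatch(text) is not None

print(matches_default_wildcard(""))           # False: empty strings never match
print(matches_default_wildcard("abc"))        # True
print(matches_default_wildcard("a/b.c-d_1"))  # True: slashes, dots, dashes all match
```

The last case is the troublesome one: nothing stops the default pattern from swallowing directory separators or file extensions.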

snakemake supports limiting wildcard matching with a feature called wildcard constraints. Wildcard constraints are a flexible system for specifying what a particular wildcard can, and cannot, match using regular expressions.

Regular expressions

Regular expressions (commonly abbreviated "regexes" or "regexps") are a mini-language for flexible string matching.

CTB: more here; give a few useful/common examples. \d+, alpha-numeric words, ??

Python comes with a friendly introduction to regexps that is a good reference for more advanced use of regular expressions: see the Regular Expression HOWTO.
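A few patterns come up again and again when constraining wildcards. The sketch below tries them out with Python's re module, using re.fullmatch because a constraint has to account for the entire wildcard value. The sample strings are invented for illustration:

```python
import re

# Pattern -> made-up strings to test against it
examples = {
    r"\d+": ["42", "007", "4x2"],              # one or more digits
    r"[a-zA-Z]+": ["abc", "abc2", "abc-xyz"],  # ASCII letters only
    r"[a-zA-Z0-9]+": ["abc2", "abc_2"],        # alphanumeric "words"
    r"[^/]+": ["file", "dir/file"],            # anything except a slash
}

# Keep only the strings that each pattern matches in full
results = {
    pattern: [s for s in samples if re.fullmatch(pattern, s)]
    for pattern, samples in examples.items()
}

for pattern, matched in results.items():
    print(f"{pattern!r} fully matches {matched}")
```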

TODO:

  • use in wildcards in rules
  • use for glob_wildcards
  • where else?
  • named wildcards

Using wildcard constraints in glob_wildcards

Let's start by looking at using wildcard constraints with glob_wildcards. Consider a directory containing the following files:

letters-only-abc-xyz.txt
letters-only-abc.txt
letters-only-abc2.txt

We could match all three files easily enough with:

files, = glob_wildcards('letters-only-{word}.txt')

which would give us ['abc2', 'abc-xyz', 'abc'].

Now suppose we only want our wildcard pattern to match letters-only-abc.txt, but not the other files. How do we do this?

We can specify a constraint, as below, that matches only letters, and so excludes both numbers and hyphens:

letters_only, = glob_wildcards('letters-only-{name,[a-zA-Z]+}.txt')

and the letters_only list will be ['abc'].
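To see why only abc survives, here is a minimal sketch in plain Python that mimics the effect of the constraint with the re module. It does not call snakemake: the file list is hard-coded from the example above, and a named group stands in for the {name,[a-zA-Z]+} wildcard.

```python
import re

filenames = [
    "letters-only-abc-xyz.txt",
    "letters-only-abc.txt",
    "letters-only-abc2.txt",
]

# Stand-in for 'letters-only-{name,[a-zA-Z]+}.txt'
pattern = re.compile(r"letters-only-(?P<name>[a-zA-Z]+)\.txt")

# Keep the captured wildcard value for every filename that matches in full
letters_only = [
    m.group("name")
    for m in (pattern.fullmatch(f) for f in filenames)
    if m is not None
]
print(letters_only)  # ['abc']
```

The other two names fail because the hyphen in abc-xyz and the digit in abc2 are outside the allowed character set, so the rest of the pattern can never line up with .txt.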

We can also specify characters to avoid, rather than characters to allow, by starting the bracketed character set with ^, which negates the set. This will match a broader range of files than the previous example, but will still ignore words with numbers in them:

letters_only, = glob_wildcards('letters-only-{name,[^0-9]+}.txt')

Here, letters_only will be ['abc-xyz', 'abc'], because we are allowing anything but numbers.
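The meaning of ^ depends on where it appears, which is an easy thing to trip over. The short Python sketch below (using re directly, with the wildcard values from the example as plain strings) contrasts the two uses:

```python
import re

words = ["abc-xyz", "abc", "abc2"]

# Inside square brackets, a leading ^ negates the set: "any character except 0-9"
no_digits = [w for w in words if re.fullmatch(r"[^0-9]+", w)]

# Outside brackets, ^ is an anchor for the start of the string instead
starts_with_digit = [w for w in words if re.search(r"^[0-9]", w)]

print(no_digits)          # ['abc-xyz', 'abc']
print(starts_with_digit)  # []
```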

Avoiding certain characters is particularly useful when we want to avoid matching in subdirectories. By default, glob_wildcards will include files in subdirectories. For example, if there is a file data/datafile.txt, then all_txt_files below would include data/datafile (the wildcard value does not include the .txt suffix):

all_txt_files, = glob_wildcards('{filename}.txt')

However, if we constrain the wildcard matching to avoid forward slashes (/) then files in subdirectories will not be matched:

this_dir_only, = glob_wildcards('{filename,[^/]+}.txt')
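The difference between the two patterns can be sketched in plain Python without touching the filesystem. The relative paths and the helper wildcard_values are hypothetical, and the regular expression only approximates what a '{filename,...}.txt' pattern would capture:

```python
import re

# Hypothetical relative paths, as if found by walking the current directory
paths = ["notes.txt", "data/datafile.txt", "data/raw/reads.txt"]

def wildcard_values(constraint, paths):
    """Collect what a '{filename,<constraint>}.txt' style pattern would capture."""
    regex = re.compile(rf"(?P<filename>{constraint})\.txt")
    return [m.group("filename") for m in map(regex.fullmatch, paths) if m]

all_txt_files = wildcard_values(r".+", paths)      # unconstrained
this_dir_only = wildcard_values(r"[^/]+", paths)   # no slashes allowed

print(all_txt_files)  # ['notes', 'data/datafile', 'data/raw/reads']
print(this_dir_only)  # ['notes']
```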

CTB check

Using wildcard constraints in rules

  • a constraint only needs to be given in the first place the wildcard is mentioned
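One way to picture this, in plain Python rather than snakemake: the first mention of a wildcard carries the constraint, and any later mention simply has to repeat the same text. The sketch below models that with a named group followed by a backreference. The sample/sample.txt layout is made up, and this is an analogy rather than a description of snakemake's internals:

```python
import re

# The first mention carries the constraint; the second mention is a
# backreference that must repeat whatever the first one matched.
pattern = re.compile(r"(?P<sample>[a-zA-Z]+)/(?P=sample)\.txt")

candidates = ["abc/abc.txt", "abc/xyz.txt", "abc2/abc2.txt"]
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)  # ['abc/abc.txt']
```

Here abc/xyz.txt is rejected because the two mentions disagree, and abc2/abc2.txt is rejected because the constraint on the first mention forbids digits.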

Global wildcard constraints

snakemake supports global wildcard constraints like so:

wildcard_constraints:
    sample="\w+", # equivalent to {sample,\w+} - limits to letters, digits, and underscores
    num="[0-9]+"  # equivalent to {num,[0-9]+} - limits to digits

Anywhere sample or num is used as a wildcard in the Snakefile, these constraints will be applied.
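It can help to check what a constraint accepts before relying on it across a whole workflow. The Python sketch below tests the two expressions from the wildcard_constraints block with re.fullmatch. The satisfies helper and the test values are invented for illustration:

```python
import re

# The same expressions used in the wildcard_constraints block
constraints = {"sample": r"\w+", "num": r"[0-9]+"}

def satisfies(wildcard, value):
    """Return True if value fully matches the constraint for that wildcard."""
    return re.fullmatch(constraints[wildcard], value) is not None

print(satisfies("sample", "sample_01"))  # True: \w covers letters, digits, underscore
print(satisfies("sample", "a-b"))        # False: a hyphen is not a word character
print(satisfies("num", "42"))            # True
print(satisfies("num", "4.2"))           # False: the dot is not a digit
```

Note that \w is broader than letters alone, so a value such as sample_01 is accepted by the sample constraint.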