Using `expand` to generate filenames

Snakemake wildcards make it easy to apply rules to many files, but also create a new challenge: how do you generate all the filenames you want?

As an example of this challenge, consider the list of genomes needed for rule compare_genomes from Chapter 8 -

rule compare_genomes:
    input:
        "GCF_000017325.1.fna.gz.sig",
        "GCF_000020225.1.fna.gz.sig",
        "GCF_000021665.1.fna.gz.sig",
        "GCF_008423265.1.fna.gz.sig",

This list is critical because it specifies the sketches to be created by the wildcard rule. However, writing this list out is annoying and error prone, because parts of every filename are identical and repeated.

Even worse, if you needed to use this list in multiple places, or produce slightly different filenames with the same accessions, that will be error prone: you are likely to want to add, remove, or edit elements of the list, and you will need to change it in multiple places.

In Chapter 9, we showed how to chang this to a list of the accessions at the top of the Snakefile and then used a function called expand to generate the list:

ACCESSIONS = ["GCF_000017325.1",
              "GCF_000020225.1",
              "GCF_000021665.1",
              "GCF_008423265.1"]

#...

rule compare_genomes:
    input:
        expand("{acc}.fna.gz.sig", acc=ACCESSIONS),

Using expand to generate lists of filenames is a common pattern in Snakefiles, and in this chapter we'll explore it more!

Using `expand` with a single pattern and one list of values

In the example above, we provide a single pattern, {acc}.fna.gz.sig, and ask expand to resolve it into many filenames by filling in values for the field name acc from each element in ACCESSIONS. (You may recognize the keyword syntax for specifying values, acc=ACCESSIONS, from input and output blocks.)

The result of expand('{acc}.fna.gz.sig', acc=...) here is identical to writing out the four filenames in long form:

"GCF_000017325.1.fna.gz.sig",
"GCF_000020225.1.fna.gz.sig",
"GCF_000021665.1.fna.gz.sig",
"GCF_008423265.1.fna.gz.sig"

That is, expand doesn't do any special wildcard matching or pattern inference - it just fills in the values and returns the resulting list.

Here, ACCESSIONS can be any Python iterable - for example a list, a tuple, or a dictionary. See the Python appendix for details.

Using `expand` with multiple lists of values

You can also use expand with multiple field names. Consider:

expand('{acc}.fna.{extension}`, acc=ACCESSIONS, extension=['.gz.sig', .gz'])

This will produce the following eight filenames:

"GCF_000017325.1.fna.gz.sig",
"GCF_000017325.1.fna.gz",
"GCF_000020225.1.fna.gz.sig",
"GCF_000020225.1.fna.gz",
"GCF_000021665.1.fna.gz.sig",
"GCF_000021665.1.fna.gz",
"GCF_008423265.1.fna.gz.sig",
"GCF_008423265.1.fna.gz"

by building every combination of acc and extension.

Generating all combinations vs pairwise combinations

As we saw above, with multiple patterns, expand will generate all possible combinations: that is,

X = [1, 2, 3]
Y = ['a', 'b', 'c']

rule all:
   input:
      expand('{x}.by.{y}', x=X, y=Y)

will generate 9 filenames: 1.by.a, 1.by.b, 1.by.c, 2.by.a, etc. And if you added a third pattern to the expand string, expand would also add that into the combinations!

So what's going on here?

By default, expand does an all-by-all expansion containing all possible combinations. (This is sometimes called a Cartesian product, a cross-product, or an outer join.)

But you don't always want that. How can we change this behavior?

The expand function takes an optional second argument, the combinator, which tells expand how to combine the lists of values the come after. By default expand uses a Python function called itertools.product, which creates all possible combinations, but you can give it other functions.

In particular, you can tell expand to create pairwise combinations by using zip instead - something we did in one of the wildcard examples.

Here's an example:

X = [1, 2, 3]
Y = ['a', 'b', 'c']

rule all:
   input:
      expand('{x}.by.{y}', zip, x=X, y=Y)

which will now generate only three filenames: 1.by.a, 2.by.b, and 3.by.c.

CTB: mention what will happen if lists are different lengths.

For more information see the snakemake documentation on using zip instead of product.

Getting a list of identifiers to use in `expand`

The expand function provides an effective solution when you have lists of identifiers that you use multiple times in a workflow - a common pattern in bioinformatics! Writing these lists out in a Snakefile (as we do in the above examples) is not always practical, however; you may have dozens to hundreds of identifiers!

Lists of identifiers can be loaded from other files in a variety of ways, and they can also be generated from the set of actual files in a directory using glob_wildcards.

Examples of loading lists of accessions from files or directories

Loading a list of accessions from a text file

If you have a simple list of accessions in a text file accessions.txt, like so:

File accessions.txt:

GCF_000017325.1
GCF_000020225.1
GCF_000021665.1
GCF_008423265.1

then the following code will load each line in the text file in as a separate ID:

with open('accessions.txt', 'rt') as fp:
    ACCESSIONS = fp.readlines()
    ACCESSIONS = [ line.strip() for line in ACCESSIONS ]
    
print(f'ACCESSIONS is a Python list of length {len(ACCESSIONS)}')
print(ACCESSIONS)

rule all:
    input:
        expand("{acc}.sig", acc=ACCESSIONS)

rule sketch_genome:
    input:
        "genomes/{accession}.fna.gz",
    output:
        "{accession}.sig",
    shell: """
        sourmash sketch dna -p k=31 {input} --name-from-first -o {output}
    """

and build sourmash signatures for it each accession.

Loading a specific column from a CSV file

If instead of a text file you have a CSV file with multiple columns, and the IDs to load are all in one column, you can use the Python pandas library to read in the CSV. In the code below, pandas.read_csv loads the CSV into a pandas DataFrame object, and then we select the accession column and use that as an iterable.

@CTB link to pandas.

File accessions.csv:

accession,information
GCF_000017325.1,genome 1
GCF_000020225.1,genome 2
GCF_000021665.1,genome 3
GCF_008423265.1,genome 4

Snakefile to load accessions.csv:

import pandas

CSV_DATAFRAME = pandas.read_csv('accessions.csv')
ACCESSIONS = CSV_DATAFRAME['accession']

print(f'ACCESSIONS is a pandas Series of length {len(ACCESSIONS)}')
print(ACCESSIONS)

rule all:
    input:
        expand("{acc}.sig", acc=ACCESSIONS)

rule sketch_genome:
    input:
        "genomes/{accession}.fna.gz",
    output:
        "{accession}.sig",
    shell: """
        sourmash sketch dna -p k=31 {input} --name-from-first -o {output}
    """

Loading from the config file

Snakemake also supports the use of configuration files, where the snakefile supplies the name of the a default config file, which can be overridden on the command line.

A config file can also be a good place to put accessions. Consider:

accessions:
- GCF_000017325.1
- GCF_000020225.1
- GCF_000021665.1
- GCF_008423265.1

which is used by the following Snakefile:

configfile: "config.yml"

ACCESSIONS = config['accessions']
    
print(f'ACCESSIONS is a Python list of length {len(ACCESSIONS)}')
print(ACCESSIONS)

rule all:
    input:
        expand("{acc}.sig", acc=ACCESSIONS)

rule sketch_genome:
    input:
        "genomes/{accession}.fna.gz",
    output:
        "{accession}.sig",
    shell: """
        sourmash sketch dna -p k=31 {input} --name-from-first -o {output}
    """

Using `glob_wildcards` to load IDs or accessions from a set of files

We introduced the glob_wildcards command briefly in the chapter on wildcards: glob_wildcards does pattern matching on files actually present in the directory. This is a particularly convenient way to get a list of accessions, although it is dangerous to use this because Reasons.

CTB discuss use case for samples; recipes? CTB link to warning/reiterate warning CTB "We discuss glob_wildcards more ..." - constraints, wildcards, where else? Is there more to it?

GLOB_RESULTS = glob_wildcards("genomes/{acc}.fna.gz")
ACCESSIONS = GLOB_RESULTS.acc

print(f'ACCESSIONS is a Python list of length {len(ACCESSIONS)}')
print(ACCESSIONS)

rule all:
    input:
        expand("{acc}.sig", acc=ACCESSIONS)

rule sketch_genome:
    input:
        "genomes/{accession}.fna.gz",
    output:
        "{accession}.sig",
    shell: """
        sourmash sketch dna -p k=31 {input} --name-from-first -o {output}
    """

Example combining `glob_wildcards`.

link to example in wildcards, renaming recipe in recipes?

A common pattern: get list of files and

CTB note: link to Python list docs. CTB note: cover multiext too? CTB note: cover options to expand? see snakemake.io code

Wildcards and `expand` - some closing thoughts

Combined with wildcards, expand is extremely powerful and useful. Just like wildcards, however, this power comes with some complexity. Here is a brief rundown of how these features combine.

The expand function makes a list of files to create from a pattern and a list of values to fill in.

Wildcards in rules provide recipes to create files whose names match a pattern.

Typically in Snakefiles we use expand to generate a list of files that match a certain pattern, and then write a rule that uses wildcards to generate those actual files.

The list of values to use with expand can come from many places, including text files, CSV files, and config files. It can also come from glob_wildcards, which uses a pattern to extract the list of values from files that are actually present.

Links and references

snakemake reference documentation for expand
The Python itertools documentation.

An Introduction to Snakemake for Bioinformatics