Chapter 4 - running rules in parallel

Let's take a look at the sketch_genomes rule from the last Snakefile entry:

(@CTB note: Section 1 should be modified to have these explicit filenames in there!)

rule sketch_genomes:
    output:
        "GCF_000017325.1.fna.gz.sig",
        "GCF_000020225.1.fna.gz.sig",
        "GCF_000021665.1.fna.gz.sig"
    shell: """
        sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first
    """

rule compare_genomes:
    input:
        "GCF_000017325.1.fna.gz.sig",
        "GCF_000020225.1.fna.gz.sig",
        "GCF_000021665.1.fna.gz.sig"
    output:
        "compare.mat"
    shell: """
        sourmash compare {input} -o {output}
    """

rule plot_comparison:
    message: "compare all input genomes using sourmash"
    input:
        "compare.mat"
    output:
        "compare.mat.matrix.png"
    shell: """
        sourmash plot {input}
    """

This command works fine as it is, but it is slightly awkward - because, bioinformatics being bioinformatics, we are likely to want to add more genomes into the comparison at some point, and right now each additional genome is going to have to be added to both input and output. It's not a lot of work, but it's unnecessary.

Moreover, if add in a lot of genomes, then this step could quickly become a bottleneck. sourmash sketch may run quickly on 10 or 20 genomes, but it will slow down if you give it 100 or 1000! (In fact, sourmash sketch will actually take 100 times longer on 100 genomes than on 1.) Is there a way to speed that up?

Yes - we can write a rule that can be run for each genome, and then ask snakemake to run it in parallel for us!

Note: sometimes you have to have a single rule that deals with all of the genomes - for example, compare_genomes has to compare all the genomes, and there's no simple way around that. But with sketch_genomes, we do have a simple option!

Let's start by breaking this one rule into three separate rules:

rule sketch_genomes_1:
    input:
        "genomes/GCF_000017325.1.fna.gz",
    output:
        "GCF_000017325.1.fna.gz.sig",
    shell: """
        sourmash sketch dna -p k=31 {input} \
            --name-from-first
    """

rule sketch_genomes_2:
    input:
        "genomes/GCF_000020225.1.fna.gz",
    output:
        "GCF_000020225.1.fna.gz.sig",
    shell: """
        sourmash sketch dna -p k=31 {input} \
            --name-from-first
    """

rule sketch_genomes_3:
    input:
        "genomes/GCF_000021665.1.fna.gz",
    output:
        "GCF_000021665.1.fna.gz.sig"
    shell: """
        sourmash sketch dna -p k=31 {input} \
            --name-from-first
    """

rule compare_genomes:
    input:
        "GCF_000017325.1.fna.gz.sig",
        "GCF_000020225.1.fna.gz.sig",
        "GCF_000021665.1.fna.gz.sig"
    output:
        "compare.mat"
    shell: """
        sourmash compare {input} -o {output}
    """

rule plot_comparison:
    message: "compare all input genomes using sourmash"
    input:
        "compare.mat"
    output:
        "compare.mat.matrix.png"
    shell: """
        sourmash plot {input}
    """

@CTB only include certain lines

It's wordy, but it will work - run:

snakemake -j 1 --delete-all plot_comparison
snakemake -j 1 plot_comparison

Before we modify the file further, let's enjoy the fruits of our labor: we can now tell snakemake to run more than one rule at a time!

@CTB note: is there a way to ask snakemake to just rerun everything? force?

Try typing this:

snakemake -j 1 --delete-all plot_comparison
snakemake -j 3 plot_comparison

If you look closely, you should see that snakemake is running all three sourmash sketch dna commands at the same time.

This is actually pretty cool and is one of the more powerful practical features of snakemake: once you tell snakemake what you want it to do (by specifying targets) and give snakemake the set of recipes telling it how to do each step, snakemake will figure out the fastest way to run all the necessary steps with the resources you've given it.

In this case, we told snakemake that it could run up to three jobs at a time, with -j 3. We could also have told it to run more jobs at a time, but at the moment there are only three rules that can actually be run at the same time - sketch_genomes_1, sketch_genomes_2, and sketch_genomes_3. This is because the rule compare_genomes needs the output of these three rules to run, and likewise plot_genomes needs the output of compare_genomes to run. So they can't be run at the same time as any other rules!

An Introduction to Snakemake for Bioinformatics

Chapter 4 - running rules in parallel