Subsampling FASTQ files

Level: beginner+

In "Using wildcards to generalize your rules", we introduced wildcards. The Snakefile below uses a wildcard, num_lines, to write the first num_lines lines of big.fastq to a subset file:

rule all:
    input:
        "big.subset100.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_lines}.fastq"
    shell: """
        head -n {wildcards.num_lines} {input} > {output}
    """

Ref:

  • wildcards

Subsampling records rather than lines

One potential problem here is that we are producing subset files based on the number of lines, not the number of records; in FASTQ files, a record typically spans four lines. Ideally, the filename of the subset FASTQ file produced by the recipe above would state the number of records rather than the number of lines. However, this requires multiplying the number of records by 4 to get the number of lines to pass to head.

You can do this using params: functions, which let you introduce Python functions into your rules.
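As a minimal sketch of such a function, assuming the output wildcard is renamed to num_records (a name chosen here for illustration, as is calc_num_lines), the conversion from records to lines could look like the following. Wildcard values arrive as strings, so the function converts to an integer before multiplying. The SimpleNamespace object only stands in for the wildcards object that Snakemake would pass to the function.

```python
from types import SimpleNamespace


def calc_num_lines(wildcards):
    """Convert a requested number of FASTQ records into a number of lines.

    Assumes four lines per record, which is typical for FASTQ files.
    Wildcard values are strings, so convert to int first.
    """
    return int(wildcards.num_records) * 4


# Stand-in for the wildcards object, for trying the function outside Snakemake.
example = SimpleNamespace(num_records="25")
print(calc_num_lines(example))  # 100
```

Within the Snakefile, the function would be defined above the rule, passed by name in a params: block, and referenced in the shell command as {params.num_lines}. With this sketch, a target of big.subset25.fastq would contain 25 records, or 100 lines:

rule all:
    input:
        "big.subset25.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_records}.fastq"
    params:
        num_lines = calc_num_lines
    shell: """
        head -n {params.num_lines} {input} > {output}
    """

Note that the int() conversion will raise an error if the wildcard matches something other than digits.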