Question

Snakemake: How to handle dynamically created files with dynamically created file names?

0

Entering edit mode

3.8 years ago

ljmesi • 0

I have a multifasta file DE.faa containing unknown number of protein sequences and having fasta headers e.g. GB012883-PA, GB009065-PA, GB007275-PA. I have split the multifasta file (using the rule gather_proteins_with_no_BL62_hits) to single fasta files with filenames: GB012883-PA.faa, GB009065-PA.faa, GB007275-PA.faa and now I wish to BLASTp each of them remotely.

As output from the BLASTp, I wish to have tsv files with matching file names GB012883-PA.tsv, GB009065-PA.tsv, GB007275-PA.tsv and search strategy files: GB012883-PA.asn, GB009065-PA.asn, GB007275-PA.asn.

Here is a simplified version of my Snakemake script:

rule gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.faa"
    output:
        directory("no_BL62_hits/"), 
        touch("fasta_expansion.done")
    script:
        "gather_proteins_with_no_BL62_hits.R"

def aggregate_input(wildcards):
    with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
        return [seq_name.rstrip() + '.tsv' for seq_name in f]

rule aggregate:
    input:
        aggregate_input
    output:
        touch("BLAST45.done")

checkpoint BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}.asn"
    shell:
        r"""
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
        """

This returns though InputFunctionException with message: WorkflowError: Missing wildcard values for seq_name.

My question is:

How do I inform Snakemake that wildcard seq_name refers to the sequence identifiers GB012883-PA, GB009065-PA, GB007275-PA (and fix the WorkflowError)?

snakemake • 3.8k views

ADD COMMENT • link updated 3.8 years ago by Jeremy Leipzig 22k • written 3.8 years ago by ljmesi • 0

score 1 · Answer 1 · 2020-07-17

1

Entering edit mode

3.8 years ago

Jeremy Leipzig 22k

Write a function to return your targets from your DE.faa. Go ahead and evaluate it up top and assign a variable to its output, use that as the target to aggregate.

You can't expect aggregate_input to somehow know what the wildcards are from another unrelated rule.

~~All your touch commands belong in script~~ oh I didn't know about the feature

ADD COMMENT • link 3.8 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Thank you for your answer! How would the snakemake script look like in practice? I have managed to split the multifasta file into single fasta files using the gather_proteins_with_no_BL62_hits.R script in gather_proteins_with_no_BL62_hits rule but the challenge I've struggled with is running BLASTp on these dynamically created single fasta files.

ADD REPLY • link 3.8 years ago by ljmesi • 0

0

Entering edit mode

these don't really sound dynamically-created in the sense you have no idea what they'll be called - don't they simply correspond to the deflines in your fasta file? Then you can derive that target from a function that reads the deflines and transforms them into .tsv's (Correct Way To Parse A Fasta File In Python)

ADD REPLY • link 3.7 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Thank you for taking the time to answer my question! The task I try to accomplish is the following. I have a list of protein sequences which I'm interested in. I first ran BLASTp on them with certain more stringent parameters and a narrower taxonomic group as the target and got a list of matches in tsv format. However, some of the proteins didn't get any significant matches on this run and now I wish to run another BLASTp query on them with less stringent parameters and a broader taxonomic group as the target.

My gather_proteins_with_no_BL62_hits.R script successfully finds the proteins with no hits from the tsv file retained on the first run and outputs each of them as single fastafiles to the directory no_BL62_hits/. The challenge is that I don't know in advance which of the protein sequences get significant matches on the first run so after the gather_proteins_with_no_BL62_hits rule has run I have to somehow detect which singlefasta files are there and run the second BLASTp on them only.

I also tried running a BLASTp query on a multifasta file containing the set of sequences which didn't get any significant matches on the first run but I continually encountered issues with the query. As I understood, the issue was that the query was too broad to handle in one batch. Therefore, I wish to run BLASTp on each of the sequences separately instead.

ADD REPLY • link 3.7 years ago by ljmesi • 0

0

Entering edit mode

you can try the new dynamic operator in Snakemake but it might be easier to have an intermediate target (fasta_expansion.done) and a final target that uses glob.glob('no_BL62_hits/*') to see which files exist in no_BL62_hits. Then run the targets sequentially Snakemake fasta_expansion.done && Snakemake finaltarget

ADD REPLY • link 3.7 years ago by Jeremy Leipzig 22k

0

Entering edit mode

While acknowledging that dynamic-flags will be deprecated in Snakemake 6.0, I tried the following:

# Create only multifasta file with all the seqs missing in the first BLASTp run
rule gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.fa"
    output:
        "no_BL62_hits.faa"
    script:
        "gather_proteins_with_no_BL62_hits.R"

# Use exonerate's utility function fastaexplode to split multifasta file
rule split_multifasta:
    input:
        "no_BL62_hits.faa"
    output:
        dynamic("{seq_name}.faa")
    shell:
        r"""
        fastaexplode \
        --fasta {input} \
        --directory "no_BL62_hits/"
        """

rule BLAST_missing_protein_seqs:
    input:
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}_search_strategy.asn"
    shell:
        r"""
        echo "BLAST search started"
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6' > {output.res} 2> {log}
        """

But here, I'm don't know how to target the dynamically created BL45_headerless/{seq_name}.tsv and BL45_headerless/{seq_name}_search_strategy.asn files in any following rules since I don't know their names in advance.

I tried also the latter approach. Here, Snakemake seems to jump over the rule BLAST_missing_protein_seqs all together. I'm not sure whether I have misunderstood how the glob function works. Here is the second snakefile:

from glob import glob

rule all:
    input:
        "finaltarget"

rule BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}_search_strategy.asn"
    shell:
        r"""
        echo "BLAST search started"
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6' > {output.res} 2> {log}
        """

rule finished:
    input:
        glob("BL45_headerless/*.tsv")
    output: 
        "finaltarget"
    shell:
        '''
        touch {output}
        '''

ADD REPLY • link 3.7 years ago by ljmesi • 0

0

Entering edit mode

i mean run your rule finished should glob those fa's and generate the corresponding tsv's rule finished:

input:
   tsvlist = [os.path.basename(fa)+'.tsv' for f in glob.glob('no_BL62_hits/*')]

ADD REPLY • link 3.7 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Thanks for giving a try in answering my question! I asked this same question in Stackoverflow and got an answer that solved the problem. Here is link to the answer.

ADD REPLY • link 3.7 years ago by ljmesi • 0

0

Entering edit mode

oh I was totally unaware of checkpoints and must have just ignored that - that's new. very cool.

ADD REPLY • link 3.7 years ago by Jeremy Leipzig 22k