Snakemake: How to handle dynamically created files with dynamically created file names?
1
0
Entering edit mode
3.8 years ago
ljmesi • 0

I have a multifasta file DE.faa containing unknown number of protein sequences and having fasta headers e.g. GB012883-PA, GB009065-PA, GB007275-PA. I have split the multifasta file (using the rule gather_proteins_with_no_BL62_hits) to single fasta files with filenames: GB012883-PA.faa, GB009065-PA.faa, GB007275-PA.faa and now I wish to BLASTp each of them remotely.

As output from the BLASTp, I wish to have tsv files with matching file names GB012883-PA.tsv, GB009065-PA.tsv, GB007275-PA.tsv and search strategy files: GB012883-PA.asn, GB009065-PA.asn, GB007275-PA.asn.

Here is a simplified version of my Snakemake script:

rule gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.faa"
    output:
        directory("no_BL62_hits/"), 
        touch("fasta_expansion.done")
    script:
        "gather_proteins_with_no_BL62_hits.R"

def aggregate_input(wildcards):
    with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
        return [seq_name.rstrip() + '.tsv' for seq_name in f]

rule aggregate:
    input:
        aggregate_input
    output:
        touch("BLAST45.done")

checkpoint BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}.asn"
    shell:
        r"""
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
        """

This returns though InputFunctionException with message: WorkflowError: Missing wildcard values for seq_name.

My question is:

How do I inform Snakemake that wildcard seq_name refers to the sequence identifiers GB012883-PA, GB009065-PA, GB007275-PA (and fix the WorkflowError)?

snakemake • 3.8k views
ADD COMMENT
1
Entering edit mode
3.8 years ago

Write a function to return your targets from your DE.faa. Go ahead and evaluate it up top and assign a variable to its output, use that as the target to aggregate.

You can't expect aggregate_input to somehow know what the wildcards are from another unrelated rule.

All your touch commands belong in script oh I didn't know about the feature

ADD COMMENT
0
Entering edit mode

Thank you for your answer! How would the snakemake script look like in practice? I have managed to split the multifasta file into single fasta files using the gather_proteins_with_no_BL62_hits.R script in gather_proteins_with_no_BL62_hits rule but the challenge I've struggled with is running BLASTp on these dynamically created single fasta files.

ADD REPLY
0
Entering edit mode

these don't really sound dynamically-created in the sense you have no idea what they'll be called - don't they simply correspond to the deflines in your fasta file? Then you can derive that target from a function that reads the deflines and transforms them into .tsv's (Correct Way To Parse A Fasta File In Python)

ADD REPLY
0
Entering edit mode

Thank you for taking the time to answer my question! The task I try to accomplish is the following. I have a list of protein sequences which I'm interested in. I first ran BLASTp on them with certain more stringent parameters and a narrower taxonomic group as the target and got a list of matches in tsv format. However, some of the proteins didn't get any significant matches on this run and now I wish to run another BLASTp query on them with less stringent parameters and a broader taxonomic group as the target.

My gather_proteins_with_no_BL62_hits.R script successfully finds the proteins with no hits from the tsv file retained on the first run and outputs each of them as single fastafiles to the directory no_BL62_hits/. The challenge is that I don't know in advance which of the protein sequences get significant matches on the first run so after the gather_proteins_with_no_BL62_hits rule has run I have to somehow detect which singlefasta files are there and run the second BLASTp on them only.

I also tried running a BLASTp query on a multifasta file containing the set of sequences which didn't get any significant matches on the first run but I continually encountered issues with the query. As I understood, the issue was that the query was too broad to handle in one batch. Therefore, I wish to run BLASTp on each of the sequences separately instead.

ADD REPLY
0
Entering edit mode

you can try the new dynamic operator in Snakemake but it might be easier to have an intermediate target (fasta_expansion.done) and a final target that uses glob.glob('no_BL62_hits/*') to see which files exist in no_BL62_hits. Then run the targets sequentially Snakemake fasta_expansion.done && Snakemake finaltarget

ADD REPLY
0
Entering edit mode

While acknowledging that dynamic-flags will be deprecated in Snakemake 6.0, I tried the following:

# Create only multifasta file with all the seqs missing in the first BLASTp run
rule gather_proteins_with_no_BL62_hits:
    input:
        all_found_prot_seqs = "DE.fa"
    output:
        "no_BL62_hits.faa"
    script:
        "gather_proteins_with_no_BL62_hits.R"

# Use exonerate's utility function fastaexplode to split multifasta file
rule split_multifasta:
    input:
        "no_BL62_hits.faa"
    output:
        dynamic("{seq_name}.faa")
    shell:
        r"""
        fastaexplode \
        --fasta {input} \
        --directory "no_BL62_hits/"
        """

rule BLAST_missing_protein_seqs:
    input:
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}_search_strategy.asn"
    shell:
        r"""
        echo "BLAST search started"
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6' > {output.res} 2> {log}
        """

But here, I'm don't know how to target the dynamically created BL45_headerless/{seq_name}.tsv and BL45_headerless/{seq_name}_search_strategy.asn files in any following rules since I don't know their names in advance.

I tried also the latter approach. Here, Snakemake seems to jump over the rule BLAST_missing_protein_seqs all together. I'm not sure whether I have misunderstood how the glob function works. Here is the second snakefile:

from glob import glob

rule all:
    input:
        "finaltarget"

rule BLAST_missing_protein_seqs:
    input:
        flag = "fasta_expansion.done",
        seqs = "no_BL62_hits/{seq_name}.faa"
    output:
        res = "BL45_headerless/{seq_name}.tsv",
        s_t = "BL45_headerless/{seq_name}_search_strategy.asn"
    shell:
        r"""
        echo "BLAST search started"
        blastp -remote \
        -query {input.seqs} \
        -evalue 0.01 \
        -db nr \
        -export_search_strategy {output.s_t} \
        -word_size 2 \
        -matrix "BLOSUM45"
        -max_target_seqs 100 \
        -outfmt '6' > {output.res} 2> {log}
        """

rule finished:
    input:
        glob("BL45_headerless/*.tsv")
    output: 
        "finaltarget"
    shell:
        '''
        touch {output}
        '''
ADD REPLY
0
Entering edit mode

i mean run your rule finished should glob those fa's and generate the corresponding tsv's rule finished:

input:
   tsvlist = [os.path.basename(fa)+'.tsv' for f in glob.glob('no_BL62_hits/*')]
ADD REPLY
0
Entering edit mode

Thanks for giving a try in answering my question! I asked this same question in Stackoverflow and got an answer that solved the problem. Here is link to the answer.

ADD REPLY
0
Entering edit mode

oh I was totally unaware of checkpoints and must have just ignored that - that's new. very cool.

ADD REPLY

Login before adding your answer.

Traffic: 1861 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6