Snakemake: How to handle dynamically created files with dynamically created file names?
1
0
Entering edit mode
10 months ago
ljmesi • 0

I have a multifasta file DE.faa containing unknown number of protein sequences and having fasta headers e.g. GB012883-PA, GB009065-PA, GB007275-PA. I have split the multifasta file (using the rule gather_proteins_with_no_BL62_hits) to single fasta files with filenames: GB012883-PA.faa, GB009065-PA.faa, GB007275-PA.faa and now I wish to BLASTp each of them remotely.

As output from the BLASTp, I wish to have tsv files with matching file names GB012883-PA.tsv, GB009065-PA.tsv, GB007275-PA.tsv and search strategy files: GB012883-PA.asn, GB009065-PA.asn, GB007275-PA.asn.

Here is a simplified version of my Snakemake script:

rule gather_proteins_with_no_BL62_hits:
input:
all_found_prot_seqs = "DE.faa"
output:
directory("no_BL62_hits/"),
touch("fasta_expansion.done")
script:
"gather_proteins_with_no_BL62_hits.R"

def aggregate_input(wildcards):
with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
return [seq_name.rstrip() + '.tsv' for seq_name in f]

rule aggregate:
input:
aggregate_input
output:
touch("BLAST45.done")

checkpoint BLAST_missing_protein_seqs:
input:
flag = "fasta_expansion.done",
seqs = "no_BL62_hits/{seq_name}.faa"
output:
shell:
r"""
blastp -remote \
-query {input.seqs} \
-evalue 0.01 \
-db nr \
-export_search_strategy {output.s_t} \
-word_size 2 \
-matrix "BLOSUM45"
-max_target_seqs 100 \
-outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
"""


This returns though InputFunctionException with message: WorkflowError: Missing wildcard values for seq_name.

My question is:

How do I inform Snakemake that wildcard seq_name refers to the sequence identifiers GB012883-PA, GB009065-PA, GB007275-PA (and fix the WorkflowError)?

snakemake • 1.0k views
0
Entering edit mode
10 months ago

Write a function to return your targets from your DE.faa. Go ahead and evaluate it up top and assign a variable to its output, use that as the target to aggregate.

You can't expect aggregate_input to somehow know what the wildcards are from another unrelated rule.

All your touch commands belong in script oh I didn't know about the feature

0
Entering edit mode

Thank you for your answer! How would the snakemake script look like in practice? I have managed to split the multifasta file into single fasta files using the gather_proteins_with_no_BL62_hits.R script in gather_proteins_with_no_BL62_hits rule but the challenge I've struggled with is running BLASTp on these dynamically created single fasta files.

0
Entering edit mode

these don't really sound dynamically-created in the sense you have no idea what they'll be called - don't they simply correspond to the deflines in your fasta file? Then you can derive that target from a function that reads the deflines and transforms them into .tsv's (Correct Way To Parse A Fasta File In Python)

0
Entering edit mode

Thank you for taking the time to answer my question! The task I try to accomplish is the following. I have a list of protein sequences which I'm interested in. I first ran BLASTp on them with certain more stringent parameters and a narrower taxonomic group as the target and got a list of matches in tsv format. However, some of the proteins didn't get any significant matches on this run and now I wish to run another BLASTp query on them with less stringent parameters and a broader taxonomic group as the target.

My gather_proteins_with_no_BL62_hits.R script successfully finds the proteins with no hits from the tsv file retained on the first run and outputs each of them as single fastafiles to the directory no_BL62_hits/. The challenge is that I don't know in advance which of the protein sequences get significant matches on the first run so after the gather_proteins_with_no_BL62_hits rule has run I have to somehow detect which singlefasta files are there and run the second BLASTp on them only.

I also tried running a BLASTp query on a multifasta file containing the set of sequences which didn't get any significant matches on the first run but I continually encountered issues with the query. As I understood, the issue was that the query was too broad to handle in one batch. Therefore, I wish to run BLASTp on each of the sequences separately instead.

0
Entering edit mode

you can try the new dynamic operator in Snakemake but it might be easier to have an intermediate target (fasta_expansion.done) and a final target that uses glob.glob('no_BL62_hits/*') to see which files exist in no_BL62_hits. Then run the targets sequentially Snakemake fasta_expansion.done && Snakemake finaltarget

0
Entering edit mode

While acknowledging that dynamic-flags will be deprecated in Snakemake 6.0, I tried the following:

# Create only multifasta file with all the seqs missing in the first BLASTp run
rule gather_proteins_with_no_BL62_hits:
input:
all_found_prot_seqs = "DE.fa"
output:
"no_BL62_hits.faa"
script:
"gather_proteins_with_no_BL62_hits.R"

# Use exonerate's utility function fastaexplode to split multifasta file
rule split_multifasta:
input:
"no_BL62_hits.faa"
output:
dynamic("{seq_name}.faa")
shell:
r"""
fastaexplode \
--fasta {input} \
--directory "no_BL62_hits/"
"""

rule BLAST_missing_protein_seqs:
input:
seqs = "no_BL62_hits/{seq_name}.faa"
output:
shell:
r"""
echo "BLAST search started"
blastp -remote \
-query {input.seqs} \
-evalue 0.01 \
-db nr \
-export_search_strategy {output.s_t} \
-word_size 2 \
-matrix "BLOSUM45"
-max_target_seqs 100 \
-outfmt '6' > {output.res} 2> {log}
"""


But here, I'm don't know how to target the dynamically created BL45_headerless/{seq_name}.tsv and BL45_headerless/{seq_name}_search_strategy.asn files in any following rules since I don't know their names in advance.

I tried also the latter approach. Here, Snakemake seems to jump over the rule BLAST_missing_protein_seqs all together. I'm not sure whether I have misunderstood how the glob function works. Here is the second snakefile:

from glob import glob

rule all:
input:
"finaltarget"

rule BLAST_missing_protein_seqs:
input:
flag = "fasta_expansion.done",
seqs = "no_BL62_hits/{seq_name}.faa"
output:
shell:
r"""
echo "BLAST search started"
blastp -remote \
-query {input.seqs} \
-evalue 0.01 \
-db nr \
-export_search_strategy {output.s_t} \
-word_size 2 \
-matrix "BLOSUM45"
-max_target_seqs 100 \
-outfmt '6' > {output.res} 2> {log}
"""

rule finished:
input:
output:
"finaltarget"
shell:
'''
touch {output}
'''

0
Entering edit mode

i mean run your rule finished should glob those fa's and generate the corresponding tsv's rule finished:

input:
tsvlist = [os.path.basename(fa)+'.tsv' for f in glob.glob('no_BL62_hits/*')]

0
Entering edit mode

Thanks for giving a try in answering my question! I asked this same question in Stackoverflow and got an answer that solved the problem. Here is link to the answer.

0
Entering edit mode

oh I was totally unaware of checkpoints and must have just ignored that - that's new. very cool.