I have a multifasta file DE.faa containing unknown number of protein sequences and having fasta headers e.g. GB012883-PA, GB009065-PA, GB007275-PA. I have split the multifasta file (using the rule gather_proteins_with_no_BL62_hits) to single fasta files with filenames: GB012883-PA.faa, GB009065-PA.faa, GB007275-PA.faa and now I wish to BLASTp each of them remotely.
As output from the BLASTp, I wish to have tsv files with matching file names GB012883-PA.tsv, GB009065-PA.tsv, GB007275-PA.tsv and search strategy files: GB012883-PA.asn, GB009065-PA.asn, GB007275-PA.asn.
Here is a simplified version of my Snakemake script:
rule gather_proteins_with_no_BL62_hits:
input:
all_found_prot_seqs = "DE.faa"
output:
directory("no_BL62_hits/"),
touch("fasta_expansion.done")
script:
"gather_proteins_with_no_BL62_hits.R"
def aggregate_input(wildcards):
with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
return [seq_name.rstrip() + '.tsv' for seq_name in f]
rule aggregate:
input:
aggregate_input
output:
touch("BLAST45.done")
checkpoint BLAST_missing_protein_seqs:
input:
flag = "fasta_expansion.done",
seqs = "no_BL62_hits/{seq_name}.faa"
output:
res = "BL45_headerless/{seq_name}.tsv",
s_t = "BL45_headerless/{seq_name}.asn"
shell:
r"""
blastp -remote \
-query {input.seqs} \
-evalue 0.01 \
-db nr \
-export_search_strategy {output.s_t} \
-word_size 2 \
-matrix "BLOSUM45"
-max_target_seqs 100 \
-outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
"""
This returns though InputFunctionException with message: WorkflowError: Missing wildcard values for seq_name.
My question is:
How do I inform Snakemake that wildcard seq_name refers to the sequence identifiers GB012883-PA, GB009065-PA, GB007275-PA (and fix the WorkflowError)?
Thank you for your answer! How would the snakemake script look like in practice? I have managed to split the multifasta file into single fasta files using the
gather_proteins_with_no_BL62_hits.Rscript ingather_proteins_with_no_BL62_hitsrule but the challenge I've struggled with is running BLASTp on these dynamically created single fasta files.these don't really sound dynamically-created in the sense you have no idea what they'll be called - don't they simply correspond to the deflines in your fasta file? Then you can derive that target from a function that reads the deflines and transforms them into .tsv's (Correct Way To Parse A Fasta File In Python)
Thank you for taking the time to answer my question! The task I try to accomplish is the following. I have a list of protein sequences which I'm interested in. I first ran BLASTp on them with certain more stringent parameters and a narrower taxonomic group as the target and got a list of matches in tsv format. However, some of the proteins didn't get any significant matches on this run and now I wish to run another BLASTp query on them with less stringent parameters and a broader taxonomic group as the target.
My
gather_proteins_with_no_BL62_hits.Rscript successfully finds the proteins with no hits from the tsv file retained on the first run and outputs each of them as single fastafiles to the directoryno_BL62_hits/. The challenge is that I don't know in advance which of the protein sequences get significant matches on the first run so after thegather_proteins_with_no_BL62_hitsrule has run I have to somehow detect which singlefasta files are there and run the second BLASTp on them only.I also tried running a BLASTp query on a multifasta file containing the set of sequences which didn't get any significant matches on the first run but I continually encountered issues with the query. As I understood, the issue was that the query was too broad to handle in one batch. Therefore, I wish to run BLASTp on each of the sequences separately instead.
you can try the new
dynamicoperator in Snakemake but it might be easier to have an intermediate target (fasta_expansion.done) and a final target that usesglob.glob('no_BL62_hits/*')to see which files exist in no_BL62_hits. Then run the targets sequentiallySnakemake fasta_expansion.done && Snakemake finaltargetWhile acknowledging that dynamic-flags will be deprecated in Snakemake 6.0, I tried the following:
But here, I'm don't know how to target the dynamically created
BL45_headerless/{seq_name}.tsvandBL45_headerless/{seq_name}_search_strategy.asnfiles in any following rules since I don't know their names in advance.I tried also the latter approach. Here, Snakemake seems to jump over the rule
BLAST_missing_protein_seqsall together. I'm not sure whether I have misunderstood how theglobfunction works. Here is the second snakefile:i mean run your rule finished should glob those fa's and generate the corresponding tsv's rule finished:
Thanks for giving a try in answering my question! I asked this same question in Stackoverflow and got an answer that solved the problem. Here is link to the answer.
oh I was totally unaware of checkpoints and must have just ignored that - that's new. very cool.