I have a multifasta file DE.faa
containing unknown number of protein sequences and having fasta headers e.g. GB012883-PA
, GB009065-PA
, GB007275-PA
. I have split the multifasta file (using the rule gather_proteins_with_no_BL62_hits
) to single fasta files with filenames: GB012883-PA.faa
, GB009065-PA.faa
, GB007275-PA.faa
and now I wish to BLASTp each of them remotely.
As output from the BLASTp, I wish to have tsv files with matching file names GB012883-PA.tsv
, GB009065-PA.tsv
, GB007275-PA.tsv
and search strategy files: GB012883-PA.asn
, GB009065-PA.asn
, GB007275-PA.asn
.
Here is a simplified version of my Snakemake script:
rule gather_proteins_with_no_BL62_hits:
input:
all_found_prot_seqs = "DE.faa"
output:
directory("no_BL62_hits/"),
touch("fasta_expansion.done")
script:
"gather_proteins_with_no_BL62_hits.R"
def aggregate_input(wildcards):
with open(checkpoints.BLAST_missing_protein_seqs.get(**wildcards).output.res, 'r') as f:
return [seq_name.rstrip() + '.tsv' for seq_name in f]
rule aggregate:
input:
aggregate_input
output:
touch("BLAST45.done")
checkpoint BLAST_missing_protein_seqs:
input:
flag = "fasta_expansion.done",
seqs = "no_BL62_hits/{seq_name}.faa"
output:
res = "BL45_headerless/{seq_name}.tsv",
s_t = "BL45_headerless/{seq_name}.asn"
shell:
r"""
blastp -remote \
-query {input.seqs} \
-evalue 0.01 \
-db nr \
-export_search_strategy {output.s_t} \
-word_size 2 \
-matrix "BLOSUM45"
-max_target_seqs 100 \
-outfmt '6 sseqid sacc qstart' > {output.res} 2> {log}
"""
This returns though InputFunctionException
with message: WorkflowError: Missing wildcard values for seq_name
.
My question is:
How do I inform Snakemake that wildcard seq_name
refers to the sequence identifiers GB012883-PA
, GB009065-PA
, GB007275-PA
(and fix the WorkflowError)?
Thank you for your answer! How would the snakemake script look like in practice? I have managed to split the multifasta file into single fasta files using the
gather_proteins_with_no_BL62_hits.R
script ingather_proteins_with_no_BL62_hits
rule but the challenge I've struggled with is running BLASTp on these dynamically created single fasta files.these don't really sound dynamically-created in the sense you have no idea what they'll be called - don't they simply correspond to the deflines in your fasta file? Then you can derive that target from a function that reads the deflines and transforms them into .tsv's (Correct Way To Parse A Fasta File In Python)
Thank you for taking the time to answer my question! The task I try to accomplish is the following. I have a list of protein sequences which I'm interested in. I first ran BLASTp on them with certain more stringent parameters and a narrower taxonomic group as the target and got a list of matches in tsv format. However, some of the proteins didn't get any significant matches on this run and now I wish to run another BLASTp query on them with less stringent parameters and a broader taxonomic group as the target.
My
gather_proteins_with_no_BL62_hits.R
script successfully finds the proteins with no hits from the tsv file retained on the first run and outputs each of them as single fastafiles to the directoryno_BL62_hits/
. The challenge is that I don't know in advance which of the protein sequences get significant matches on the first run so after thegather_proteins_with_no_BL62_hits
rule has run I have to somehow detect which singlefasta files are there and run the second BLASTp on them only.I also tried running a BLASTp query on a multifasta file containing the set of sequences which didn't get any significant matches on the first run but I continually encountered issues with the query. As I understood, the issue was that the query was too broad to handle in one batch. Therefore, I wish to run BLASTp on each of the sequences separately instead.
you can try the new
dynamic
operator in Snakemake but it might be easier to have an intermediate target (fasta_expansion.done) and a final target that usesglob.glob('no_BL62_hits/*')
to see which files exist in no_BL62_hits. Then run the targets sequentiallySnakemake fasta_expansion.done && Snakemake finaltarget
While acknowledging that dynamic-flags will be deprecated in Snakemake 6.0, I tried the following:
But here, I'm don't know how to target the dynamically created
BL45_headerless/{seq_name}.tsv
andBL45_headerless/{seq_name}_search_strategy.asn
files in any following rules since I don't know their names in advance.I tried also the latter approach. Here, Snakemake seems to jump over the rule
BLAST_missing_protein_seqs
all together. I'm not sure whether I have misunderstood how theglob
function works. Here is the second snakefile:i mean run your rule finished should glob those fa's and generate the corresponding tsv's rule finished:
Thanks for giving a try in answering my question! I asked this same question in Stackoverflow and got an answer that solved the problem. Here is link to the answer.
oh I was totally unaware of checkpoints and must have just ignored that - that's new. very cool.