NCBI BLAST+ via Biopython fails if I use a long name for the outfile?
1
0
Entering edit mode
6.7 years ago
rmartson • 0

Here's the relevant part of my code:

query = raw_input("Enter query (fasta)\n")
database2 = raw_input("Enter organism 2 and genome name (fasta) in the format 'organism, genome'\n")
organism2 = database2[0:database1.find(",")]
genomehandle2 = database2[database1.find(",")+2:]

blast2 = NcbiblastnCommandline(query= str(query) + ".fasta", db=str(genomehandle2)+".fa", evalue=1e-10, outfmt=7, out="blast" + str(query) + str(organism2) + ".txt", num_threads=4)
stdout, stderr = blast2()

This works normally for any organism which has a name with 5 characters or fewer. However it's not that there's a limit on the total output handle length, because if I remove the "+str(query)+" it STILL tries to cut the organism name down to 5 characters, and if I keep the query which is way larger than 5 characters in there this still works with any organism name given that it is 5 characters or smaller.

"Enter organism 2 and genome name (fasta) in the format 'organism, genome'
macaq, macfas5" -> This works and starts BLAST

"Enter organism 2 and genome name (fasta) in the format 'organism, genome'
macaque, macfas5" -> This fails and returns the error:

 

"Traceback (most recent call last):
  File "pipeline.py", line 30, in <module>
    stdout, stderr = blast2()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/Application/__init__.py", line 523, in __call__
    stdout_str, stderr_str)
Bio.Application.ApplicationError: Non-zero return code 1 from 'blastn -out blastmacaq.txt -outfmt 7 -query HERV-K113.fasta -db , macfas5.fa -evalue 1e-10 -num_threads 4', message 'USAGE'"
biopython blast • 2.0k views
ADD COMMENT
1
Entering edit mode
6.7 years ago

Stop using raw_input in programs altoghether. It leads to tedious programs that stop and wait for input. Make your program take parameters from the command line.

my_program.py  param1  param2   param3

As for the current program the problem here is that you are adding a comma into the command line. See:

-db , macfas5.fa

Since you are using raw_input we don't know what you typed in there hence we cannot troubleshoot it. Suffice to say you have typed in things with commas in them. You can see why using raw_input is a bad habit to have. People can't help you troubleshoot things.

Once you rewrite it with external parameters post them both, the command line and the relevant sections of the code.

ADD COMMENT
0
Entering edit mode

Stop using raw_input in programs altoghether. It leads to tedious programs that stop and wait for input. Make your program take parameters from the command line.

I give all the input at the start and it never asks again. I don't get what your point is here.

Actually I've given my input in the post but someone went and modified my post so it's given in code format for whatever reason.

"macaq, macfas5" works as input

"macaque, macfas5" does not work

Thanks for identifying the issue. I'm gonna try to fix that bit now.

ADD REPLY
0
Entering edit mode

The issue was this specific part in my script:

organism2 = database2[0:database1.find(",")]
genomehandle2 = database2[database1.find(",")+2:]

The script begins by BLASTing two different databases. So far my first database has always been for the human genome. "human" happens to begin with 5 characters. When I find "," in "human, hg38" I get a value that is smaller than the position of the comma in "macaque, macfas5". Instead of taking "macfas5", it was beginning 2 characters further back than it should and taking ", macfas5" because it was looking at the length of "human" instead of the length of "macaque".

I still don't agree with your comments on raw_input() though. I might be new to this and it was the first way I found out how to assign terminal input to variables, but I give all the parameters at the start of the script and then it carries out the whole thing for half an hour without any further questions. The fact that I specifically used raw_input() isn't the source of the issue. Someone just reformatted my post for no reason and you somehow skipped over the part you wanted for troubleshooting. If I used parameters in the commandline I would have entered them into the terminal. When I use raw_input(), I still enter everything into the terminal. If I copy what I see in the terminal and paste it here, you get the same information regardless of what I used except when I use parameters you have no idea what they're referring to unless I tell you.

ADD REPLY
0
Entering edit mode

Ther problem with using raw_input is that it does not preserve the input that you put in. So it may not cause the problem but it hides the data from you.

Nobody else knows what you have typed in there - you may think you know but it is easy to mistype it.

It is much simpler to list it at the command line that way the whole thing is explicit. Isn't it a lot easier to rerun something as

program foo bar this that the other

then use the arrow key to recall the previous command rather than having to retype.

ADD REPLY
0
Entering edit mode

What do you mean it doesn't preserve it? It's visible in the terminal long after it's entered and it's saved into the log my script generates. I don't see how it's hidden. Here's a screenshot of me entering the first few parameters into the terminal using raw_input():

enter image description here

The benefit of running it the way you describe is that I can enter everything in one line which is just one step faster than if I put all my variables into a single raw_input() question and split the input by " ", but the whole point of having it separated out is so that it makes sense and acts more like a menu rather than a command where the parameters and their order need to be memorised. This script requires 7 parameters right now. Is it really that user-friendly to have them all required when you're calling the script? And you say "retype", but the entire thing takes 20-30min to run, I'm not exactly repeating it over and over and I only need to enter these parameters once at the start.

And the reason I'm probably defensive about this is that I don't really understand how I can get Python to recognise the parameters I enter along with the script filename. If I ran pipeline.py param1 param2 param3 in the terminal, what function would I use to retrieve those parameters in Python? I think you're assuming the parameters are only used by the commandline BLAST search. This script does more than just run two local BLAST searches, it also has to retrieve flanks from each hit, reBLAST the flanks, align two sets of flanks against each other and some other stuff. The parameters need to be used in Python. The reason I did not give this context or the entire script is because I thought it wasn't necessary to fix the issue specific to the initial BLAST searches, and it really wasn't.

ADD REPLY
0
Entering edit mode

the command line parameters will be stored in the module attribute sys.argv as a list

import sys
print (sys.argv)

then

python demo.py foo bar this and that

will print

['demo.py', 'foo', 'bar', 'this', 'and', 'that']
ADD REPLY
0
Entering edit mode

Thanks, that's kinda nice. I'll rewrite it for the next time I run the script.

ADD REPLY
0
Entering edit mode

The next step is to use the argparse library that will allow you to write nice programs that take named parameters like so:

demo.py --organism human  --genome hg38  --LTR_cutoff 95  --flanking_cutoff 95

and so on. At that point your program's usage becomes self-documenting.

https://docs.python.org/3/library/argparse.html

ADD REPLY

Login before adding your answer.

Traffic: 3088 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6