Hi,
Is there a way to peek at BLAST's temporary file? I would like to know which sequence is currently being analyzed in order to estimate how long the program will take to finish. The top command shows that it uses around 4 GB of memory, but I can't find the file in the RAM folder (/dev/shm). I could kill the job, but the temporary file would be erased. Does anybody know?
My BLAST inputs are rather basic:
blastn -query sequences.fasta -db nt.db -out output.xml -evalue 1e-4 -outfmt 5 -num_threads 14 -gapopen 5 -gapextend 2 -word_size 7
I have 2000 sequences inside the fasta file.
The nt database file is enormous, so I'm expecting blastn to run for a long time...
EDIT: Although the program is consuming more and more RAM as it runs, ls -l on my output folder gives:
-rw-rw-r-- 1 sara sara 0 Aug 16 15:57 output.xml
so I can't use grep, sed, or wc -l ... I have no idea where the information is put! That's why I was looking into the /dev/shm folder...
MORE EDITING: Although the file is empty (zero bytes, as the previous command shows), all the output information seems to be stored in RAM (blastn's RAM usage is growing every day). When the job is finished, BLAST transfers the content of the RAM to the disk, i.e. to output.xml. At least, that's what I figure. My question is how to peek into the BLAST temporary output that is held in RAM...
Hi,
Details on which program you're trying to run, the approximate input file size, and the exact commands you're using to profile the BLAST run would help you get an answer best suited to your situation.
Something to remember is that BLAST uses many heuristics to decide when to stop searching sequences. There is a relationship between the entropy of the query sequences and the runtime of BLAST. Additionally, the degree of homology for each query will impact runtime. The more seeds BLAST is able to find per hit, the longer it may take to complete the query search. In general, the length of the sequence can predict runtime, but there are many other factors that affect it.
2000 queries is a small job; you'll probably spend more time figuring out how to estimate the runtime properly than BLAST will take to finish.
If there's some reason you are concerned about the amount of time it will take (maybe you're paying for a cluster), you should look into reducing the size of the database being searched. For example, if you're working with bacterial sequences, the nt database will contain tons of eukaryotic and viral sequence data you don't need to search.
Thanks, I could look into that to lower the running time! Do you know of a quick way to remove, let's say... all eukaryotes?
It might take a bit of work, but the idea is to get a list of GIs matching the taxonomy you want, then extract the sequences with those GIs from the current db and make a new BLAST db.
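Here is a rough sketch of those two steps with the BLAST+ command-line tools. It assumes you already have a file of identifiers for the taxa you want to keep (bacteria_ids.txt is a hypothetical name, one GI or accession per line, for example exported from an NCBI Entrez search) and that your existing database carries sequence IDs that blastdbcmd can look up. The run and build_subset_db helpers and the output names are placeholders of mine, not anything BLAST provides. With DRY_RUN=1 it only prints the commands, so you can check them before committing to a long extraction:

```shell
#!/usr/bin/env bash
# Sketch: build a smaller BLAST database from a list of sequence IDs.
# ASSUMPTIONS: bacteria_ids.txt and nt_bacteria are made-up names, and the
# source database must contain IDs that blastdbcmd is able to look up.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_subset_db SOURCE_DB ID_LIST NEW_DB
build_subset_db() {
    local src_db=$1 id_list=$2 new_db=$3
    # 1. Pull the listed sequences out of the existing database as FASTA.
    run blastdbcmd -db "$src_db" -entry_batch "$id_list" -out "${new_db}.fasta"
    # 2. Turn that FASTA file into a new nucleotide database.
    run makeblastdb -in "${new_db}.fasta" -dbtype nucl -out "$new_db"
}

# Example (prints the two commands without running them):
# DRY_RUN=1 build_subset_db nt.db bacteria_ids.txt nt_bacteria
```

Afterwards you would point -db at the new database name in your blastn command. Keep in mind that e-values depend on the size of the database searched, so hits against the smaller database will not have exactly the same e-values as hits against the full nt.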
It is clear that BLAST isn't writing out the XML file as it runs, so I don't think you have much of an option. Don't forget that BLAST isn't holding the full XML file in memory waiting till it is time to write to the disk. It is only storing the values needed to generate the XML result file after the run is completed.
It makes sense that it doesn't write as it goes: an XML file for results from multiple queries contains information about the whole run. Plus, I'm not sure it is safe to assume that BLAST runs the whole process from start to finish for each query individually; again, it's a highly optimized program, so it might be doing things in some other order.
Not to be rude, but I think trying to figure out a way to watch this is a waste of time. Why spend so much time figuring this out when it is already clear that it is taking a while? Focus on trying to cut down the size of the database and on seeing whether you can run any of these jobs in parallel. Why spend two days to figure out it will take ten days when you can spend two days to save a few days every time you run this? Fine-tuning optimizations requires that you really understand runtime, but when it's clearly taking too long you don't have to worry as much.
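For the parallel part, one simple approach is to cut the query file into smaller FASTA files and run each one as its own job. The sketch below does the splitting with awk; split_fasta is a helper name I made up, and the chunk size and file prefix are whatever you choose:

```shell
#!/usr/bin/env bash
# Sketch: split a multi-FASTA file into chunks of at most PER_CHUNK records.
# split_fasta is a hypothetical helper, not part of BLAST.

# split_fasta FASTA PER_CHUNK PREFIX
# Writes PREFIX_1.fasta, PREFIX_2.fasta, ... in input order.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new output file every n header lines.
            if (count % n == 0) {
                if (out != "") close(out)
                out = prefix "_" (int(count / n) + 1) ".fasta"
            }
            count++
        }
        # Copy every line (headers and sequence) to the current chunk.
        out != "" { print > out }
    ' "$1"
}

# Example: 2000 queries in chunks of 100 gives chunk_1.fasta to chunk_20.fasta.
# split_fasta sequences.fasta 100 chunk
```

Each chunk can then be passed as -query to the same blastn command, with its own -out file, on separate cluster nodes or one after another. A side benefit is that every finished chunk leaves a complete output file behind, which gives you a rough idea of progress and means a killed job only loses the chunk that was running.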
You're right, it's just that it had been running for a long time, and I didn't want to waste all that had already been done. I just didn't know if it would take another day or 2, or 2 months! But thanks, you all pretty much answered my question.