Preview blast temporary output file
2
9.7 years ago
sara • 0

Hi,

Is there a way to peek at BLAST's temporary file? I would like to know which sequence is currently being analyzed, in order to estimate how long the program will take to finish. The top command shows that it uses around 4 GB of memory, but I can't get hold of the file in the RAM folders (/dev/shm). I could kill the job, but the temporary file would be erased. Would anybody know?

My blast inputs are rather basic:

blastn -query sequences.fasta -db nt.db -out output.xml -evalue 1e-4 -outfmt 5 -num_threads 14 -gapopen 5 -gapextend 2 -word_size 7

I have 2000 sequences inside the fasta file.

The nt database file is enormous, so I'm expecting blastn to run for a long time...

EDIT: Although the program is consuming more and more RAM as it runs, ls -l on my output folder gives:

-rw-rw-r-- 1 sara sara    0 Aug 16 15:57 output.xml

so I can't use grep or sed or wc -l... I have no idea where the information is being written! That's why I was looking into the /dev/shm folder...

MORE EDITING: Although the file is empty (zero bytes, as the previous command shows), all the output information is stored in RAM (RAM usage for blastn keeps growing every day). When the job is finished, BLAST transfers the contents of the RAM onto the disk, i.e. into output.xml. At least, that's what I figure. My question is how to peek into the BLAST temporary output that is sitting in RAM...

ram blast • 2.8k views
0

Hi,

Details on what program you're trying to run, the approximate input file size, and the exact commands you're using for the BLAST run would help you get an answer best suited to your situation.

2

Something to remember is that BLAST uses many heuristics to decide when to stop searching sequences. There is a relationship between the entropy of the query sequences and the runtime of BLAST, and the degree of homology for each query will also impact run time: the more seeds BLAST is able to find per hit, the longer it may take to complete the query search. In general, the length of the sequence is a rough predictor of runtime, but many other factors come into play.

2000 queries is a small job; you'll probably spend more time figuring out how to properly estimate the runtime than it will take for BLAST to finish.

If there's some reason you are concerned about the amount of time it will take (maybe you're paying for a cluster), you should look into reducing the size of the database being searched. For example, if you're working with bacterial sequences, the nt database will contain tons of eukaryotic and viral sequence data you don't need to search.

0

Thanks, I could look into that to lower the running time! Do you know of a quick way to remove, let's say... all eukaryotes?

1

It might take a bit of work, but the idea is to get a list of GIs matching the taxonomy you want, then extract the sequences with those GIs from the current db and make a new BLAST db.
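For illustration only, here is a rough sketch of how that could be done with the BLAST+ command-line tools, assuming you already have a plain-text GI list (one GI per line; bacteria.gi and the database names are placeholders, not something from this thread):

# Option 1: build an alias database that restricts searches to the listed GIs
blastdb_aliastool -gilist bacteria.gi -db nt -out nt_bacteria -title "nt, bacteria only"

# Option 2: physically extract those sequences and build a new, smaller database
blastdbcmd -db nt -entry_batch bacteria.gi -out bacteria.fasta
makeblastdb -in bacteria.fasta -dbtype nucl -out nt_bacteria

Either way, you would then point blastn at -db nt_bacteria instead of the full nt.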

0

It is clear that BLAST isn't writing out the XML file as it runs, so I don't think you have much of an option. Don't forget that BLAST isn't holding the full XML file in memory, waiting until it is time to write to disk; it is only storing the values needed to generate the XML result file after the run is completed.

It makes sense that it doesn't write as it goes: an XML file for results from multiple queries contains information about the whole run. Plus, I'm not sure it is safe to assume that BLAST runs the whole process from start to finish for each query individually; it's a highly optimized program, so it might be doing things in some other order.

Not to be rude, but I think trying to figure out a way to watch this is a waste of time. Why spend so much effort on that, when it is clear that it is taking a while? Focus on cutting down the size of the database and on seeing whether you can run any of these jobs in parallel. Why spend two days figuring out that it will take ten days, when you could spend two days to save a few days every time you run this? Fine-tuning optimizations requires that you really understand the runtime, but when it's clearly taking too long you don't have to worry as much.
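To make the splitting/parallel idea concrete, here is a rough sketch (the chunk size and file names are arbitrary placeholders, and the blastn options are simply copied from the question); each chunk writes its own XML file as soon as it finishes, so you get some progress feedback and don't lose everything if one job dies:

# split sequences.fasta into chunks of 100 sequences each
awk '/^>/ {if (n % 100 == 0) file = sprintf("chunk_%03d.fasta", n/100); n++} {print > file}' sequences.fasta

# run the chunks one after another (or several at once if you have the RAM for it)
for chunk in chunk_*.fasta; do
    blastn -query "$chunk" -db nt.db -out "${chunk%.fasta}.xml" -evalue 1e-4 -outfmt 5 -num_threads 14 -gapopen 5 -gapextend 2 -word_size 7
done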

0

You're right, it's just that it had been running for a long time, and I didn't want to waste all that had already been done. I just didn't know if it would take another day or two, or two months! But thanks, this pretty much answers my question.

0
9.7 years ago
5heikki 11k

With tabular blast output it's as easy as:

cut -f1 outputFile | sort -u | wc -l

I don't know how it is with the XML output, probably something like:

grep something outputFile | sort -u | wc -l

p.s. Shouldn't the temp file be in the dir where you're running blast? Why would it be in /dev/shm?

0

I believe for the XML format, you could grep for something like "Iteration_iter-num".
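If BLAST were flushing the XML to disk, something along these lines would count the finished queries (a sketch only; it relies on each query getting its own <Iteration> block in the -outfmt 5 XML):

grep -c "<Iteration_iter-num>" output.xml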

0

Thanks, but that wouldn't solve my problem. See the EDIT remarks on my post.

0

Since the output size is 0 bytes, I think not a single one of your queries has finished.

0

Have you tried setting the -out switch, and checking that file? Honestly, I'm not sure how blast handles storing results while working on more queries. For a smaller job, it might be waiting till all queries are finished before writing to disk, but I have no idea.

0

That's the main issue, I believe. BLAST is storing my results in RAM; you can tell because RAM usage keeps growing. I added the -out argument to the command line (see above). Does anybody know how to tell BLAST not to keep the results in RAM, but to write them to disk instead?

0

See "more editing"...

0
9.7 years ago
pld 5.1k

Using the tabular output format, and assuming you cap the number of hits returned for each query (e.g. with -max_target_seqs), you could use:

echo "$(wc -l < resultfile) / <hits per query>" | bc -l

This will give you (roughly) the number of queries that have completed so far.
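As a hedged example of how that might look in practice (the 10-hit cap and file names below are made up for illustration, not taken from the question):

# tabular run capped at 10 hits per query
blastn -query sequences.fasta -db nt.db -out results.tsv -outfmt 6 -max_target_seqs 10 -evalue 1e-4 -num_threads 14

# rough count of queries finished so far, assuming every query really returns 10 hits
echo "$(wc -l < results.tsv) / 10" | bc -l

Queries with fewer than 10 hits will throw the estimate off, which is why counting unique query IDs (as in the other answer) is more robust.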

