Question: Previewing BLAST's temporary output file
0
4.6 years ago by
sara0
Canada
sara0 wrote:

Hi,

Is there a way to peek at the BLAST temporary file? I would like to know which sequence is currently being analyzed, in order to estimate how long the program will take to finish. The top command shows that blastn uses around 4 GB of memory, but I can't find any file in the RAM folders (/dev/shm). I could kill the job, but then the temporary file would be erased. Does anybody know?

My blast inputs are rather basic:

blastn -query sequences.fasta -db nt.db -out output.xml -evalue 1e-4 -outfmt 5 -num_threads 14 -gapopen 5 -gapextend 2 -word_size 7

I have 2000 sequences inside the fasta file.

The nt database is enormous, so I'm expecting blastn to run for a long time...

EDITING: Although the process is running and consuming more and more RAM, ls -l on my output folder gives:

-rw-rw-r-- 1 sara sara    0 Aug 16 15:57 output.xml

so I can't use grep or sed or wc -l... I have no idea where the information goes! That's why I was looking into the /dev/shm folder...

MORE EDITING: Although the file is empty (zero bytes, as the previous listing shows), all the output information is stored in RAM (the blastn process's RAM usage is growing every day). When the job is finished, BLAST transfers the contents of RAM to disk, i.e. to output.xml. At least, that's what I figure. My question is how to peek into the BLAST temporary output that is held in RAM...
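
On Linux, one way to peek at what a running process has open is through /proc. This is a minimal sketch; the blastn PID lookup is left as a comment, and the commands are demonstrated on the current shell's own PID so they run as-is:

```shell
# For a running blastn you would first look up its PID, e.g.:
#   pid=$(pgrep -x blastn)
# Demonstrated here on the current shell's own PID so it runs anywhere.
pid=$$

# Every open file descriptor appears under /proc/<pid>/fd; a temporary
# output file (including anything under /dev/shm) would show up here.
ls -l "/proc/${pid}/fd"
```

`lsof -p "$pid"` reports the same descriptors with file sizes and offsets, which makes a growing output file easy to spot.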

blast ram • 1.3k views
modified 4.6 years ago • written 4.6 years ago by sara0

Hi,

Details on what program you're trying to run, the approximate input file size, and the exact commands you're using would help you get an answer best suited to your situation.

written 4.6 years ago by RamRS20k
2

Something to remember is that BLAST uses many heuristics to decide when to stop searching. There is a relationship between the entropy of the query sequences and BLAST's runtime, and the degree of homology of each query to the database also matters: the more seeds BLAST finds per hit, the longer that query's search may take. In general, sequence length is a rough predictor of runtime, but many other factors play a role.

2000 queries is a small job; you'll probably spend more time figuring out how to properly estimate the runtime than BLAST will take to finish.

If there's some reason you are concerned about the amount of time it will take (maybe you're paying for a cluster), you should look into reducing the size of the database being searched. For example, if you're working with bacterial sequences, the nt database will contain tons of eukaryotic and viral sequence data you don't need to search.

modified 4.6 years ago • written 4.6 years ago by pld4.8k

Thanks, I could look into that to cut down the running time! Do you know of a quick way to remove, let's say... all eukaryotes?

written 4.6 years ago by sara0
1

It might take a bit of work, but the idea is to get a list of GIs matching the taxonomy you want, extract the sequences with those GIs from the current db, and build a new BLAST db from them.
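
A minimal sketch of that idea, with toy data standing in for the real GI-to-taxid dump (all file names here are illustrative); the actual extraction and rebuild need the NCBI BLAST+ tools, so those commands are left commented out:

```shell
# Toy inputs: taxids to keep, and a "GI taxid" mapping file
printf '2\n1224\n' > bacterial_taxids.txt
printf '100 2\n200 9606\n300 1224\n' > gi_taxid.dmp

# 1. Collect the GIs whose taxid is in the keep-list
awk 'NR==FNR { keep[$1]; next } ($2 in keep) { print $1 }' \
    bacterial_taxids.txt gi_taxid.dmp > bacterial_gis.txt
cat bacterial_gis.txt   # 100 and 300

# 2. Pull those sequences out of the existing db and rebuild (needs BLAST+):
# blastdbcmd -db nt -entry_batch bacterial_gis.txt -out bacteria.fasta
# makeblastdb -in bacteria.fasta -dbtype nucl -out nt_bacteria
```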

written 4.6 years ago by pld4.8k

It is clear that BLAST isn't writing out the XML file as it runs, so I don't think you have much choice. Don't forget that BLAST isn't holding the full XML file in memory, waiting until it is time to write to disk; it is only storing the values needed to generate the XML result file after the run is completed.

It makes sense that it doesn't write as it goes: an XML file for results from multiple queries contains information about the whole run. Plus, I'm not sure it is safe to assume that BLAST runs the whole process from start to finish for each query individually; it's a highly optimized program, so it might be doing things in some other order.

Not to be rude, but I think trying to figure out a way to watch this is a waste of time. Why spend so much effort on it when it is clearly going to take a while? Focus on cutting down the size of the database and on running these jobs in parallel. Why spend two days figuring out it will take ten days, when you could spend those two days saving a few days every time you run this? Fine-tuning optimizations requires that you really understand runtime, but when it's clearly taking too long you don't have to worry as much.
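
The parallel route can be sketched like this; a toy FASTA and a chunk size of 2 stand in for the real input, and the blastn loop is commented out because it needs the actual database:

```shell
# Toy query file; the real one would be sequences.fasta with 2000 records
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n>s4\nCATG\n' > sequences.fasta

# Split into chunks of 2 records each: chunk_0.fasta, chunk_1.fasta, ...
awk '/^>/ { if (n % 2 == 0) file = sprintf("chunk_%d.fasta", n / 2); n++ }
     { print > file }' sequences.fasta
ls chunk_*.fasta

# Each chunk can then be searched independently, e.g.:
# for f in chunk_*.fasta; do
#     blastn -query "$f" -db nt.db -out "${f%.fasta}.xml" -outfmt 5 &
# done
# wait
```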

written 4.6 years ago by pld4.8k

You're right; it's just that it had been running for a long time, and I didn't want to waste all the work that had already been done. I just didn't know whether it would take another day or two, or two months! But thanks, this pretty much answers my question.

written 4.6 years ago by sara0
0
4.6 years ago by
5heikki8.2k
Finland
5heikki8.2k wrote:

With tabular blast output it's as easy as: 

cut -f1 outputFile | sort -u | wc -l 

I don't know how it is with the XML output; probably something like:

grep something outputFile | sort -u | wc -l

p.s. Shouldn't the temp file be in the dir where you're running blast? Why would it be in /dev/shm?
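
The tabular one-liner above can be demonstrated on a toy result file (file name and contents are illustrative):

```shell
# Toy tabular results: query ID in column 1, one line per hit
printf 'q1\thitA\nq1\thitB\nq2\thitA\nq3\thitA\nq3\thitB\n' > outputFile

# Unique query IDs seen so far = queries that have produced hits
cut -f1 outputFile | sort -u | wc -l   # prints 3
```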

modified 4.6 years ago • written 4.6 years ago by 5heikki8.2k

I believe for the XML format, you could grep for something like "Iteration_iter-num".
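
That suggestion can be sketched on a toy fragment of BLAST XML output: in outfmt 5, each finished query produces one Iteration block, so counting the Iteration_iter-num tags counts completed queries.

```shell
# Toy fragment of a BLAST XML result (two finished queries)
cat > output.xml <<'EOF'
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
</Iteration>
<Iteration>
  <Iteration_iter-num>2</Iteration_iter-num>
</Iteration>
EOF

grep -c '<Iteration_iter-num>' output.xml   # prints 2
```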

written 4.6 years ago by pld4.8k

Thanks, but that wouldn't solve my problem. See EDITING remark on my post.

written 4.6 years ago by sara0

Since the output size is 0 bytes, I think not a single one of your queries has finished.

written 4.6 years ago by 5heikki8.2k

Have you tried setting the -out switch and checking that file? Honestly, I'm not sure how BLAST handles storing results while it is still working on more queries. For a smaller job, it might wait until all queries are finished before writing to disk, but I have no idea.

modified 4.6 years ago • written 4.6 years ago by pld4.8k

That's the main issue, I believe. BLAST is storing my results in RAM; this is noticeable as RAM usage keeps growing. I added the -out argument to the command line (see above). Does anybody know how to tell BLAST not to store the results in RAM, but on disk instead?

written 4.6 years ago by sara0

See "more editing"...

written 4.6 years ago by sara0
0
4.6 years ago by
pld4.8k
United States
pld4.8k wrote:

Using tabular output format, and provided you cap the number of hits returned for each query (e.g. with -max_target_seqs), you could use:

 echo "$(wc -l < <resultfile>) / <hits per query>" | bc -l

This will give you the number of queries that have completed: the line count of the result file divided by the fixed number of hits per query.

written 4.6 years ago by pld4.8k
Powered by Biostar version 2.3.0