Question

Batching PSIBLAST calls and obtaining individual PSSMs

4.5 years ago

wjn0 • 0

I'm currently running PSIBLAST on a single query sequence to produce a PSSM, which is used as input to a number of ML tools (PSIPRED, DISOPRED, etc.). I'd like to make this more efficient for about 500k such sequences, but I need an individual PSSM for each query sequence.

I know that the BLAST algorithm is more efficient when it is run on multiple query sequences at once, but the psiblast CLI program creates only a single PSSM output for all query sequences when run this way. Is there a way around this?

Thanks for reading!

psiblast blast pssm • 1.9k views

ADD COMMENT • link updated 4.5 years ago by Mensur Dlakic ★ 27k • written 4.5 years ago by wjn0 • 0

score 0 · Answer 1 · 2019-10-09

4.5 years ago

Mensur Dlakic ★ 27k

There is no way around this. You have to submit sequences individually. Once you have split the input into individual sequences, you can run multiple psiblast instances at the same time, one per sequence. Note that you will need lots of memory for that unless your database is small(ish). I suggest you run a single psiblast query and find out what the peak memory usage is before attempting to run multiple sequences simultaneously. Even if you have a lot of RAM (512+ GB), I would still suggest running at most 4-5 searches simultaneously, as you will run into I/O problems beyond that.
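A minimal sketch of the approach described above: split a multi-FASTA file into one query per sequence and run a small, fixed number of psiblast processes at a time. The function names (`split_fasta`, `build_psiblast_cmd`, `run_batch`) and the output file naming are hypothetical; the psiblast flags are the ones that appear in the command posted later in this thread, and the default of 4 workers simply follows the 4-5 suggested above.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(text):
    """Split multi-FASTA text into a list of (sequence_id, sequence) pairs."""
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            seq_id = parts[0] if parts else f"seq{len(records) + 1}"
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def build_psiblast_cmd(query_path, db, pssm_path, out_path,
                       iterations=3, evalue=0.001):
    """Build one psiblast command (as an argument list) for a single query."""
    return [
        "psiblast",
        "-query", str(query_path),
        "-db", str(db),
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out", str(out_path),
        "-out_ascii_pssm", str(pssm_path),
    ]


def run_batch(fasta_text, db, workdir, max_workers=4, runner=subprocess.run):
    """Write one FASTA file per sequence and run at most max_workers searches at once."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    commands = []
    for i, (seq_id, seq) in enumerate(split_fasta(fasta_text), start=1):
        # Hypothetical naming scheme: running index plus a filename-safe ID.
        stem = f"{i:06d}_" + re.sub(r"[^A-Za-z0-9_.-]", "_", seq_id)
        query = workdir / f"{stem}.fasta"
        query.write_text(f">{seq_id}\n{seq}\n")
        commands.append(build_psiblast_cmd(
            query, db, workdir / f"{stem}.pssm", workdir / f"{stem}.out"))
    # Threads only wait on the external processes, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return commands
```

Something like `run_batch(Path("queries.fasta").read_text(), "./blast/db/nr/nr", "pssm_out")` would then leave one `.pssm` file per query in `pssm_out`, assuming psiblast is on the `PATH`; `queries.fasta` and `pssm_out` are placeholder names.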

ADD COMMENT • link 4.5 years ago by Mensur Dlakic ★ 27k

Hello, I have the same problem: I need to use PSI-BLAST to obtain PSSMs for about 20,000 protein sequences. When I search against the NR database, one sequence takes about 1 hour. Is this run time normal? My command is as follows: psiblast -query ./test.fasta -db ./blast/db/nr/nr -num_iterations 3 -out ./test_out -out_ascii_pssm ./test.pssm -evalue 0.001

ADD REPLY • link 3.6 years ago by lvguofeng • 0

Depending on your computer's speed, number of processors and memory, this could be normal. The NR database is very large. You may want to consider using a database such as UniRef90, where much of the redundancy has been removed by clustering at 90% identity:

https://www.uniprot.org/downloads

ADD REPLY • link 3.6 years ago by Mensur Dlakic ★ 27k

Thank you for your reply! I will try the UniRef90 database. I have read a lot of literature on deep learning applications for protein sequences, and the papers all use PSSMs generated against the NR database. What is the relationship between NR and UniRef90? Is there a big difference between the PSSMs obtained from the two databases?

ADD REPLY • link 3.6 years ago by lvguofeng • 0

Not sure what you mean by the relationship between the two databases. If you are talking about size, NR has 318 million sequences right now, while the latest UniRef90 has 116 million. That will translate into major memory and time savings.

I have not used NR for at least 10 years, and all my machine learning applications work just fine. In fact, if you Google "machine learning uniref90" you will find that lots of papers use this database. I seem to remember that PSI-BLAST reduces redundancy by removing sequences that are >= 94% identical, so that is not much different from using a database that is already trimmed at 90% identity. In terms of the "signal" that helps find distant homologs, which is what PSI-BLAST does, including 1 or 100 sequences that are >90% identical will make very little difference. I think the same is true in terms of signal for machine learning applications, where it is more important to capture the "breadth" (divergence) of sequences than the "depth" (number of sequences, similar or otherwise).
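To make the redundancy argument concrete, here is a toy sketch of greedy filtering at an identity threshold. It is only an illustration on pre-aligned, equal-length strings: it is not how UniRef90 is clustered, nor how PSI-BLAST handles near-identical hits internally, and both function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_nonredundant(seqs, threshold=90.0):
    """Keep a sequence only if it is below the threshold against every kept sequence."""
    kept = []
    for s in seqs:
        if all(percent_identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

With a 90% threshold, a sequence that differs from an already kept one at 1 position out of 20 (95% identity) is dropped, while one that differs at 10 positions (50% identity) is kept. The divergent sequence is the one that adds "breadth".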

ADD REPLY • link 3.6 years ago by Mensur Dlakic ★ 27k

Thank you for your helpful answers. From what you say, it seems better to use the UniRef90 database. As I understand it, where NR might contain 100 sequences that are >90% identical to one another, UniRef90 keeps only one, and the PSI-BLAST results from the two databases are almost the same. The UniRef90 database is smaller, so it will run faster! I especially agree with your point that, for the input of machine learning applications, the "breadth" of the data is more important than the "depth".

I still have a small question. I am not sure whether my psiblast running time is reasonable, because it takes only about 10 minutes to run psiblast against the NR database on the NCBI website. Why does it take so much longer to run locally?

ADD REPLY • link 3.6 years ago by lvguofeng • 0

"I still have a small question. I am not sure whether my psiblast running time is reasonable, because it takes only about 10 minutes to run psiblast against the NR database on the NCBI website. Why does it take so much longer to run locally?"

It is difficult to answer this question properly without knowing your computer configuration. I am going to make an educated guess that NCBI computers are faster, have faster disks and more memory than your computer. A computer with enough memory can hold the whole NR without having to unload it from RAM, and then it is a search within memory which is generally very fast. I am guessing that your computer doesn't have enough memory to hold the whole NR database - most computers don't - so it has to read in chunks of database at a time.
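One rough way to check the guess above is to compare the on-disk size of the database files with the machine's physical memory. The sketch below assumes that the database consists of files named `<prefix>.*` next to each other (as with the `./blast/db/nr/nr` prefix in the command earlier in the thread) and that `os.sysconf` exposes `SC_PHYS_PAGES`, which holds on Linux but may not elsewhere. The function names and the 0.8 headroom factor are hypothetical, and on-disk size is only a proxy for what BLAST actually needs in RAM.

```python
import os
from pathlib import Path


def database_size_bytes(db_prefix):
    """Sum the sizes of all files named '<prefix>.*' (e.g. nr.00.psq, nr.pal)."""
    prefix = Path(db_prefix)
    return sum(p.stat().st_size
               for p in prefix.parent.glob(prefix.name + ".*")
               if p.is_file())


def physical_memory_bytes():
    """Total physical RAM via POSIX sysconf (assumed available, as on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_memory(db_prefix, headroom=0.8):
    """True if the database files are smaller than a fraction of physical RAM."""
    return database_size_bytes(db_prefix) <= headroom * physical_memory_bytes()
```

If `fits_in_memory("./blast/db/nr/nr")` returns `False`, the chunk-by-chunk reading described above is a plausible explanation for slow local searches.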

I think you should run a search with a smaller database and see how that goes. There is no substitute for directly finding out what kind of memory and time savings will be achieved on your computer. Getting PSSMs for 20000 sequences on a single computer will take a long time no matter what, but 2 months is better than 6 months. After finding out how long an average search takes with UniRef90, you may wish to find additional resources or rethink your overall strategy.
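As a sketch of how such a trial run could be measured: time one search, read the peak memory of the child process, and extrapolate to the full set. This uses the Unix-only `resource` module; `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS. The function names are hypothetical, and the extrapolation assumes perfect scaling across parallel jobs, which the I/O limits mentioned earlier in the thread make optimistic.

```python
import resource
import subprocess
import time


def measure(cmd):
    """Run cmd once; return (wall-clock seconds, peak RSS of child processes)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # RUSAGE_CHILDREN reports the largest child waited for so far by this process,
    # so call this from a fresh interpreter for a clean reading.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


def estimate_days(n_sequences, seconds_per_sequence, parallel_jobs=1):
    """Total wall-clock days, assuming every search takes the same time."""
    return n_sequences * seconds_per_sequence / parallel_jobs / 86400
```

For example, after `elapsed, peak = measure([...])` on one psiblast command against UniRef90, `estimate_days(20000, elapsed, parallel_jobs=4)` gives a first-order figure for the whole job.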

ADD REPLY • link 3.6 years ago by Mensur Dlakic ★ 27k