I'm currently using PSIBLAST on a single query string to produce a PSSM which is used as input to a number of ML tools (PSIPRED, DISOPRED, etc.). I'd like to make this more efficient for running about 500k such sequences, but I need an individual PSSM output for each query string.
I know that the BLAST algorithm is made more efficient when running multiple query strings at the same time, but the psiblast CLI program only creates a single PSSM output for all query sequences when run this way. Is there a way around this?
Thanks for reading!
Hello, I have the same problem: I need to use PSI-BLAST to obtain PSSMs for about 20,000 protein sequences. When I search against the NR database, a single sequence takes about 1 hour. Is this run time normal? My command is as follows:
psiblast -query ./test.fasta -db ./blast/db/nr/nr -num_iterations 3 -out ./test_out -out_ascii_pssm ./test.pssm -evalue 0.001
Depending on your computer's speed, number of processors, and memory, this could be normal. The NR database is very large. You may want to consider using a database such as UniRef90, where much of the redundancy has been removed by clustering at 90% identity:
https://www.uniprot.org/downloads
Thank you for your reply! I will try the UniRef90 database. I have read a lot of literature on deep learning applications for protein sequences, and the papers all use PSSMs built against the NR database. What is the relationship between NR and UniRef90? Is there a big difference between the PSSMs obtained from the two databases?
Not sure what you mean by the relationship between the two databases. If you are talking about the size, NR is 318 million sequences right now, while the latest UniRef90 is 116 million sequences. That will translate into major memory and time savings.
I have not used NR for at least 10 years, and all my machine learning applications work just fine. In fact, if you Google
machine learning uniref90
you will find that lots of papers use this database. I seem to remember that PSI-BLAST weights sequences by removing those that are >= 94% identical, so that is not much different from using a database that is already trimmed at 90% identity. In terms of "signal" that helps find distant homologs - which is what PSI-BLAST does - including 1 or 100 sequences that are >90% identical will make very little difference. I think the same is true in terms of signal for machine learning applications, where it is more important to capture the "breadth" (divergence) of sequences rather than "depth" (number of sequences, similar or otherwise).
Thank you for your helpful answers. According to your answer, it seems better to use the UniRef90 database. As I understand it, the relationship between the two is that where UniRef90 keeps only one sequence from a group with >90% identity, NR may keep 100. For the results of a PSI-BLAST run, there is almost no difference between the two. The UniRef90 database is smaller, so it will run faster! I especially agree with your point that, for the input of machine learning applications, the "breadth" of the data is more important than the "depth".
I still have a small question. I doubt whether the running time of my local psiblast is reasonable, because it takes only about 10 minutes to run psiblast against the NR database on the NCBI website. Why does it take so much longer when run locally?
It is difficult to answer this question properly without knowing your computer configuration. I am going to make an educated guess that NCBI computers are faster, have faster disks and more memory than your computer. A computer with enough memory can hold the whole NR without having to unload it from RAM, and then it is a search within memory which is generally very fast. I am guessing that your computer doesn't have enough memory to hold the whole NR database - most computers don't - so it has to read in chunks of database at a time.
I think you should run a search with a smaller database and see how that goes. There is no substitute for directly finding out what kind of memory and time savings will be achieved on your computer. Getting PSSMs for 20000 sequences on a single computer will take a long time no matter what, but 2 months is better than 6 months. After finding out how long an average search takes with UniRef90, you may wish to find additional resources or rethink your overall strategy.
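On the original question of getting one PSSM per query: a common workaround is to split the multi-FASTA file into single-sequence files and run one psiblast process per file, several at a time. The sketch below is only an illustration under stated assumptions. The function names, the `query_000001.fasta` naming scheme, and the worker count are made up for this example, the psiblast options are the ones from the command earlier in the thread plus `-num_threads`, and `psiblast` is assumed to be on the PATH. The small `estimate_days` helper just does the arithmetic for how long a batch would take at a measured per-query time.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(text):
    """Yield one single-sequence FASTA record (as a string) per entry."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header + "\n" + "".join(seq) + "\n"
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header + "\n" + "".join(seq) + "\n"


def psiblast_command(query, db, pssm, threads=1):
    """Build the psiblast call for one query file (same options as in the thread)."""
    return [
        "psiblast",
        "-query", str(query),
        "-db", str(db),
        "-num_iterations", "3",
        "-evalue", "0.001",
        "-num_threads", str(threads),
        "-out", str(Path(query).with_suffix(".out")),
        "-out_ascii_pssm", str(pssm),
    ]


def run_all(fasta_path, db, workdir, workers=4):
    """Write one FASTA file per sequence, then run psiblast on each in parallel."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    commands = []
    records = split_fasta(Path(fasta_path).read_text())
    for i, record in enumerate(records, start=1):
        # Index-based names (hypothetical scheme) avoid unsafe characters in FASTA IDs.
        query = workdir / f"query_{i:06d}.fasta"
        query.write_text(record)
        commands.append(psiblast_command(query, db, query.with_suffix(".pssm")))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # check=False: a failed query is reported by its return code, not an exception.
        return list(pool.map(lambda c: subprocess.run(c, check=False).returncode, commands))


def estimate_days(n_sequences, minutes_per_query, workers=1):
    """Rough wall-clock estimate in days, assuming perfect scaling across workers."""
    return n_sequences * minutes_per_query / workers / 60 / 24
```

A call might look like `run_all("test.fasta", "uniref90", "pssm_work", workers=8)`, where the database name and directory are placeholders. At the one hour per query reported above, `estimate_days(20000, 60)` comes to roughly 833 days on a single worker, which is why it is worth timing a few searches against the smaller database before committing to the full batch.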