Clarification for using PSI-BLAST to generate a PSSM (am I doing it right?)
20 months ago
DNAlias ▴ 30

If I run psiblast on the command line with the following command:

psiblast -query myfasta.fasta -db mydb -num_iterations 3 -out_pssm mypssm.smp

does this make a PSSM based solely on the sequences in myfasta.fasta, or does it create it based on the BLAST hits?

psiblast • 1.4k views
0

When I run it I see this at the top:

PssmWithParameters ::= { pssm { isProtein TRUE, numRows 28, numColumns 2291, byRow FALSE, query seq { id { local str "Query_147" }, descr { title "419612_0:004b79" },

And "419612_0:004b79" is the name of my last sequence, out of 147 queries. What does this mean? It doesn't mean that the matrix is based only on the last sequence, does it?

2
20 months ago
Mensur Dlakic ★ 15k

From what I can tell, there are at least two things that you are doing wrong.

First, it seems that your query file has multiple sequences. If that's indeed the case, psiblast will search with each sequence individually (and sequentially), and each search will overwrite the results of the previous one. That is extremely wasteful because it takes a lot of time and in the end you get results only for your last sequence. If you have multiple sequences, split them into individual files and store the results in separate files. The search will take the same amount of time, but you will end up with results for all sequences rather than just the last one.
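A minimal sketch of that split-and-search approach, in case it helps. The function names and the seq_0001.fasta naming scheme are made up for illustration; the psiblast options are the ones from the question.

```shell
# Split a multi-sequence FASTA file into one file per record.
# Files are written as seq_0001.fasta, seq_0002.fasta, ... in the output
# directory (a hypothetical naming scheme chosen for this sketch).
split_fasta() {
    local fasta="$1" outdir="$2"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            # A header line starts a new record, so switch output files.
            if (out) close(out)
            out = sprintf("%s/seq_%04d.fasta", dir, ++n)
        }
        out { print > out }
    ' "$fasta"
}

# Run one PSI-BLAST search per split file, so that every query gets its
# own PSSM file instead of each search overwriting the previous result.
run_psiblast_per_sequence() {
    local outdir="$1" db="$2" f
    for f in "$outdir"/seq_*.fasta; do
        psiblast -query "$f" -db "$db" -num_iterations 3 \
                 -out_pssm "${f%.fasta}.smp"
    done
}
```

Something like `split_fasta myfasta.fasta split` followed by `run_psiblast_per_sequence split mydb` would then leave one .smp file next to each split FASTA file.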

Second, the way you formulated the command will save the PSSM after the second iteration. To answer your question: yes, the PSSM file contains the converted multiple alignment of BLAST hits rather than your starting sequences. However, there is a -save_pssm_after_last_round switch that does exactly what it sounds like. If you don't invoke it, the PSSM will be from the penultimate iteration, which is again wasteful because the results of the last iteration will not count for anything. In fact, the following command will produce exactly the same PSSM as yours while running one fewer iteration:

 psiblast -query myfasta.fasta -db mydb -num_iterations 2 -out_pssm mypssm.smp -save_pssm_after_last_round


By the way, regarding what you posted in the other thread:

Warning: [psiblast] Query_1: Composition-based score adjustment conditioned on sequence properties and unconditional composition-based score adjustment is not supported with PSSMs, resetting to default value of standard composition-based statistics


As it says, it is only a warning and you can safely ignore it. When running more than one iteration, psiblast will use the newly created PSSM and therefore can't apply composition-based statistics because those are pre-calculated only for single-iteration searches that use fixed substitution matrices. The warning will not appear if you run the same command as you did but with a single iteration.

0

Thank you for this information. I too am running PSI-BLAST and also have a question about the number of iterations: wouldn't you want to set that parameter to 0? I am under the impression that this allows PSI-BLAST to search iteratively until convergence, that is, until no new sequences are found.

If you set -num_iterations 2, wouldn't PSI-BLAST miss protein sequences? Essentially, does setting the parameter to 0 ensure that all sequences are found?

Thank you so much. Any thoughts or guidance on this is greatly appreciated.

2

There is no solution that fits all applications. Running PSI-BLAST until convergence is warranted if your goal is to squeeze out absolutely all potential homologs. Even in that case, one should always keep in mind that the longer PSI-BLAST runs, the more likely it is to pull in false positives. I would visually inspect any run that went until convergence.

When using PSI-BLAST to get PSSMs for other applications, say for machine learning of various properties, it is not advisable to go beyond 2-3 iterations because the signal-to-noise ratio goes down.

In short, the number of iterations depends on what you are trying to achieve.
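If you do run until convergence, a quick check like the sketch below can tell you how many rounds a search took before you inspect it by hand. It assumes a single-query report in the default text output format, where each round starts with a line beginning "Results from round" and a converged search prints "Search has CONVERGED!". Treat both strings as assumptions and confirm them against your own output; the function names are hypothetical.

```shell
# Count the rounds reported in a PSI-BLAST text report.
# Assumes every round is introduced by a line starting with
# "Results from round" (default text output, single query).
count_rounds() {
    grep -c '^Results from round' "$1"
}

# Succeed (exit status 0) if the report says the search converged.
# Assumes the report contains the line "Search has CONVERGED!".
has_converged() {
    grep -q 'Search has CONVERGED!' "$1"
}
```

For example, after `psiblast -query one_seq.fasta -db mydb -num_iterations 0 -out report.txt` (one_seq.fasta and report.txt being placeholder names), `count_rounds report.txt` prints the number of rounds, and `has_converged report.txt && echo converged` reports whether the search stopped on its own.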

0

Hi Mensur, could you please elaborate on the signal-to-noise ratio drop? I'm using PSI-BLAST to retrieve homologs and then calculate certain features to apply ML. Although I won't be using the PSSMs directly, I'm interested in what you say! I have a couple of proteins with >300 iterations until convergence.

0

I have a couple of proteins with >300 iterations until convergence.

Please don't take this the wrong way, but this is almost guaranteed to be wrong. I have done my share of running PSI-BLAST until convergence, and a few times I have gotten into double digits or even 20+ iterations. But hundreds of iterations are almost guaranteed to pull in non-homologous sequences, and I would inspect such a run very thoroughly.

The S/N ratio drops because it is difficult to reliably align distantly related sequences that are usually added in later iterations. Aligning distant sequences is a challenge for any program, but especially so for BLAST because it uses no post-processing step once distant homologs are identified. In short: the gain from adding diverse sequences to an alignment is offset by unreliable alignments. Most ML applications of PSI-BLAST that I am aware of use 3 iterations. I have used 4 and 5 iterations for a variety of prediction tasks, and that was either neutral or worse compared to 3 iterations.
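One rough way to see where the distant sequences start coming in is to count how many new hits each round adds. The sketch below assumes the default text report, in which rounds after the first list newly recruited hits under a header starting with "Sequences not found previously", previously seen hits under "Sequences used in model", and the alignments begin with lines starting with ">". Those layout details are assumptions, so check them against your own report. Round 1 is skipped because all of its hits are new by definition.

```shell
# Print "round<TAB>number_of_new_hits" for every round that has a
# "Sequences not found previously ..." section in a PSI-BLAST text report.
# The section headers matched here are assumptions about the report layout.
new_hits_per_round() {
    awk '
        # Print the finished round, but only if it had a new-hits section.
        function emit() { if (seen) print round "\t" n }

        /^Results from round/ {
            emit()
            round = $4; n = 0; seen = 0; in_new = 0
            next
        }
        /^Sequences not found previously/ { in_new = 1; seen = 1; next }
        # The "found again" list or the first alignment ends the new-hit list.
        /^Sequences used in model/ || /^>/ { in_new = 0; next }
        # Every non-blank line inside the new-hit list is one hit.
        in_new && NF { n++ }
        END { emit() }
    ' "$1"
}
```

A run where late rounds keep adding a handful of new hits each is the kind I would look at closely.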

0

Thank you for your reply, Mensur. I found it strange from the first moment, given that the rest of my proteins were giving 1-8 iterations. Maybe even 8 is too many; I will keep it to 3 and look carefully at the results.