I have the sequence of a GPCR (PDB code 3NY8), which I am using to create a dataset of homologs (via a homology modelling pipeline) so I can perform electrostatic calculations on them. To create this GPCR dataset I need an alignment to feed to the pipeline, which will then create one homolog per alignment member.
My problem is sequence-based: I'm running PSI-BLAST/DELTA-BLAST searches on the query and getting some odd results. The first iteration returns 196 hits, but the second returns 155. On each successive iteration the number of hits either stays the same or keeps decreasing, and I have no idea why. Does anyone know how that can happen? My understanding was that PSI-BLAST searches are supposed to return more hits on each successive iteration, since the algorithm uses a PSSM rather than a single sequence to detect distant members.
The parameters used to construct the alignment were as follows:
Database: Non-redundant protein sequence databases (includes GenBank CDS translations, PDB, SwissProt, PIR, PRF)
Organism: Homo sapiens
Exclude: Models/uncultured sample sequences (both excluded)
Maximum target sequences: 1000
Expect threshold: 10
Word size: 3
Maximum matches in a query range: 0
Gap Costs: Existence: 12, Extension: 1
Compositional adjustments: Composition-based statistics
Filter: Low complexity regions
PSI-BLAST Threshold: 0.005
Any ideas would be appreciated!
@terdon Thank you for the input! I understand your point, but aren't the sequences returned as matches the first time around used to build the profile matrix? If their information is incorporated in the form of probabilities of occurrence, isn't that information necessary to add sequences detected in subsequent searches? In other words, my understanding is that once the original PSSM is created, the set of hits would keep expanding until no further members can be found. Is that not the case? Thanks again.
@s.charonis Spyro, think of a case where the PSSM that was built specifies a very high score for a cysteine at position 3. Of the 100 sequences used to build the PSSM, all but one have a Cys at that position. The first time around, the one sequence with another residue at that position will be taken as a hit because it satisfies the blastp score/e-value thresholds. The second iteration, however, could discard that sequence because it lacks the Cys that the PSSM has shown to be important. This lack could bring its score below the threshold used to match the matrix to a hit, even though the sequence itself was used to build the PSSM. This is obviously a very simplistic example, but it illustrates the point.
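The Cys example above can be sketched with a toy position-specific scoring matrix. This is only an illustration under loud assumptions: the five-residue sequences, the uniform background frequencies, the flat pseudocount, and the `THRESHOLD` value are all made up, and real PSI-BLAST additionally weights sequences, derives pseudocounts from a substitution matrix, and filters on e-values rather than raw scores.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_pssm(aligned, pseudocount=1.0):
    """Build a toy log-odds PSSM from ungapped, equal-length sequences."""
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        }
        pssm.append(column)
    return pssm


def score(seq, pssm):
    """Sum the per-position PSSM scores for an ungapped sequence."""
    return sum(column[aa] for aa, column in zip(seq, pssm))


# Hypothetical first-iteration hits: 99 carry Cys at position 3, one carries Ser.
members = ["MACDE"] * 99 + ["MASDE"]
pssm = build_pssm(members)

consensus_score = score("MACDE", pssm)
outlier_score = score("MASDE", pssm)

THRESHOLD = 18.0  # hypothetical cut-off, chosen only for this illustration

print(f"consensus: {consensus_score:.2f}  kept: {consensus_score >= THRESHOLD}")
print(f"outlier:   {outlier_score:.2f}  kept: {outlier_score >= THRESHOLD}")
```

The outlier contributed to the matrix, yet the strongly negative score it receives at position 3 pulls its total below the cut-off, so it would not be reported in the next round.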
@terdon Thank you very much, that clears up a lot! I can now justify my PSI-BLAST search findings as biologically plausible.
@s.charonis You're welcome, be well :)
@terdon Very nice ;)
This answer helped me with my own case 9.2 years later, so thanks!