Question

Differences in hit counts between Pfam, HMMER, and a UniProt search by domain name

5.4 years ago

deleted user • 0

Hi there,

Recently I started using protein databases and alignments more seriously. It is much more interesting than I thought, and also more complicated!

But since I'm just an interested microbiologist who has to start with the basics, I have come across a problem I don't understand.

I'm looking at a protein that has a DUF (domain of unknown function) annotated in Pfam. The dataset there contains around 1000 sequences. The download options also offer an alignment from UniProt with about 4200 sequences. If I look up the DUF in UniProt, I get around 4300 sequences. If I search with the protein sequence using phmmer online, I get around 4400 sequences, and with a profile search on hmmscan I get 4600 sequences.

As far as I know, Pfam is built on HMMs from the UniProt database, and although it uses older releases, I don't understand why there are such differences in the number of sequence hits. Nearly all hits from searching UniProt with the DUF as the search term come from the TrEMBL database. So it seems that all three sites use computationally annotated data, and that all of them use HMMER?

I’m sorry if this is a stupid question, but for me as a beginner it’s not self-explanatory.

alignment protein database pfam hmmer uniprot • 1.6k views

ADD COMMENT • link 5.4 years ago by deleted user • 0

Okay, I did some more research on this, and I guess the difference between the UniProt alignment from Pfam and the hmmsearch result is just due to different versions of the UniProt database. I understand that the difference between phmmer and hmmsearch comes from the different query: phmmer takes a single sequence as the query and so acts more like BLAST, while hmmsearch uses the profile (the motif) as the query.
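To make the difference in query type concrete, here is a minimal sketch of how the two searches would be invoked with a local HMMER installation rather than the web server. The file names (`query.fasta`, `duf.hmm`, `uniprot.fasta`) are hypothetical placeholders, and the helper functions are only for illustration; the sketch relies on the `--tblout` option and on the argument order (query first, then target sequence database) that both programs share.

```python
import shutil
import subprocess


def phmmer_cmd(query_fasta, seqdb, tblout):
    """Command line for a single-sequence search (BLAST-like query)."""
    return ["phmmer", "--tblout", tblout, query_fasta, seqdb]


def hmmsearch_cmd(profile_hmm, seqdb, tblout):
    """Command line for a profile search (the HMM itself is the query)."""
    return ["hmmsearch", "--tblout", tblout, profile_hmm, seqdb]


def run_if_available(cmd):
    """Run the command only if the HMMER binary is on PATH.

    Returns True if the command was executed, False if it was skipped.
    """
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return True


if __name__ == "__main__":
    # Hypothetical file names; replace with real paths before running.
    seq_search = phmmer_cmd("query.fasta", "uniprot.fasta", "phmmer.tbl")
    profile_search = hmmsearch_cmd("duf.hmm", "uniprot.fasta", "hmmsearch.tbl")
    print(" ".join(seq_search))
    print(" ".join(profile_search))
```

The only difference between the two command lines is the first positional argument: one protein sequence for phmmer, a profile HMM for hmmsearch. Running both against the same database file would at least rule out database version as a cause of differing hit counts.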

TrEMBL seems to use several annotation methods, which makes me wonder whether it is better to use the hmmsearch result or the UniProt text search result as the starting point for phylogenetic analysis. Maybe I will do both and merge the results to get rid of duplicates.

The only question that remains for me is: why is the full alignment from Pfam so much smaller than the result from hmmsearch or the text search on UniProt (around 900 vs. 4500 hits), if it uses the UniProt database to scan with the motif from the seed alignment?
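Combining the two result sets could look roughly like the sketch below. It assumes the hmmsearch hits were saved in the tabular `--tblout` format, where comment lines start with `#` and the first column is the target name, and that the targets carry UniProt-style FASTA identifiers of the form `tr|ACCESSION|ENTRY_NAME`. It also assumes the text search result was exported as a plain list with one accession per line. The function names and the accessions in the example are made up for illustration.

```python
def accession_from_target(name):
    """Return the accession from 'sp|ACC|NAME' or 'tr|ACC|NAME'.

    Any other identifier is returned unchanged.
    """
    parts = name.split("|")
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    return name


def accessions_from_tblout(lines):
    """Collect unique target accessions from hmmsearch --tblout lines."""
    accessions = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        accessions.add(accession_from_target(line.split()[0]))
    return accessions


def accessions_from_list(lines):
    """Collect unique accessions from a one-per-line list."""
    return {line.strip() for line in lines if line.strip()}


def compare_hits(hmm_accessions, text_accessions):
    """Split two accession sets into shared, exclusive, and combined hits."""
    return {
        "both": hmm_accessions & text_accessions,
        "hmmsearch_only": hmm_accessions - text_accessions,
        "text_only": text_accessions - hmm_accessions,
        "union": hmm_accessions | text_accessions,
    }


if __name__ == "__main__":
    # Made-up example data, not real search results.
    tblout = [
        "# target name  accession  query name ...",
        "tr|A0A000001|A0A000001_9BACT - query - 1e-50 170.0 0.1",
        "sp|P00001|EXAMPLE_ECOLI - query - 1e-10 40.0 0.0",
    ]
    text_hits = ["P00001", "Q00002"]
    result = compare_hits(accessions_from_tblout(tblout),
                          accessions_from_list(text_hits))
    for key, values in result.items():
        print(key, len(values))
```

Because the accessions are held in sets, duplicates disappear automatically in the union, and the two "only" sets show which hits each method contributes on its own.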

ADD REPLY • link 5.4 years ago by deleted user • 0