I recently started using protein databases and alignments more seriously. It is much more interesting than I thought, and also more complicated!
But since I’m just an interested microbiologist who has to start with the basics, I have come across a problem I don’t understand.
I’m looking at a protein that has a DUF (domain of unknown function) annotated in Pfam. The dataset there contains around 1000 sequences. The download options also offer an alignment from UniProt with about 4200 sequences. If I look up the DUF in UniProt, I get around 4300 sequences. If I search with the protein sequence using phmmer online, I get around 4400 sequences, and with a profile search using hmmscan I get 4600 sequences.
As far as I know, Pfam is built with HMMs from the UniProt database, and although it uses older versions, I don’t understand why there are differences in the number of sequence hits. Nearly all hits from searching UniProt with the DUF as the search term are from the TrEMBL database. So it seems that all three sites use computationally annotated data, and that all of them use HMMER?
I’m sorry if this is a stupid question, but as a beginner I don’t find it self-explanatory.