Question

Pfam alignments: full vs uniprot

6.4 years ago

rayoub ▴ 110

I'm a bit confused by the alignment page of a Pfam entry. They provide the 'seed' alignment from which the initial HMM is created. They also provide the 'full' alignment that, to my understanding, is created by searching the UniProtKB and adding in those proteins significantly similar to the seed HMM. The text below is cut and pasted from their tutorial.

"Each Pfam entry is represented by a set of aligned sequences with their probabilistic representation - called a profile hidden Markov model (HMM). The profile HMM is trained on a small representative set of aligned sequences that are known to belong to the family (the 'seed' alignment). This model is then used to search exhaustively against a large sequence database (e.g. UniProtKB) to find all homologous sequences. Those sequences that are significantly similar to the model are aligned to the profile HMM in order to provide the full alignment."
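For readers who want to see the quoted workflow concretely, here is a minimal sketch that only assembles the corresponding HMMER command lines (`hmmbuild`, then `hmmsearch` with `-A` to save an alignment of the included hits). The file names, the `prefix` parameter, and the helper `pfam_style_commands` are hypothetical, and this is not necessarily how Pfam's own pipeline is run.

```python
import shlex


def pfam_style_commands(seed_msa, seq_db, prefix="family"):
    """Return HMMER command lines (as argument lists) for seed -> HMM -> full.

    All file names here are placeholders chosen for illustration.
    """
    hmm_file = f"{prefix}.hmm"
    full_alignment = f"{prefix}.full.sto"
    return [
        # 1. Train the profile HMM on the small, curated seed alignment.
        ["hmmbuild", hmm_file, seed_msa],
        # 2. Search the sequence database with that HMM; -A saves a multiple
        #    alignment of the hits that pass the inclusion thresholds.
        ["hmmsearch", "-A", full_alignment, hmm_file, seq_db],
    ]


# Print the commands instead of running them, so no HMMER install is needed.
for cmd in pfam_style_commands("seed.sto", "uniprot.fasta"):
    print(shlex.join(cmd))
```

In this sketch the `.hmm` file is produced from the seed alignment alone, and the "full" alignment is an output of the search, which is one way to keep the terms "alignment" and "HMM" apart.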

In addition, they provide alignments against representative proteomes, UniProt, and NCBI databases. Here is where I'm confused: the sequence count for UniProt is larger than that for full. But are these not the same? Perhaps the UniProt alignment contains all sequences returned by a search of UniProt, whereas full is a proper subset of those, deemed appropriate for inclusion in the actual HMM.

Am I understanding this correctly?

An additional question: it seems like the word "alignment" often means the HMM. Am I to understand that, in this case, the full alignment is the HMM created from all sequences in the full MSA?

Thanks for any guidance provided.

alignment pfam • 3.3k views

updated 6.4 years ago by Bill Pearson ★ 1.0k • written 6.4 years ago by rayoub ▴ 110

score 5 · Accepted Answer · 2018-03-02

To answer your question directly: the HMM model parameters are based on alignments of the model to the 27 million proteins in the UniProt reference proteomes.

Your confusion probably comes from the fact that the most recent versions of Pfam have two sets of tables for two different sets of sequences. The "main" Pfam annotates proteins from the UniProt Reference Proteome dataset, which has around 27 million entries. In the Pfam SQL download, these are the sequences in the "pfamseq" and "pfamA_reg_full_significant" tables. In the latest Pfam NAR paper, they describe moving to focus on the Reference Proteome proteins, because this set grows more slowly.

But they do map domains to all the UniProt proteins (about 100 million) in the release that Pfam is based on. The additional proteins and mappings are found in the "uniprot" and "uniprot_reg_full" tables, I believe. The uniprot table includes 71 million proteins. The proteins in this table are NOT used to update the Pfam model HMMs; they are only provided to map Pfam domains to non-reference-proteome sequences.
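To make the difference between the two counts concrete, here is a rough sketch that compares per-family sequence counts in the two region tables. It uses an in-memory SQLite database with toy rows as a stand-in for the real Pfam SQL download. The table names are the ones mentioned above, but the column names (`pfamA_acc`, `pfamseq_acc`, `uniprot_acc`), the family `PF_TOY`, and the helper `count_sequences` are assumptions for illustration, so check them against the actual schema before relying on them.

```python
import sqlite3

# In-memory stand-in for the Pfam SQL download (toy data only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pfamA_reg_full_significant (pfamA_acc TEXT, pfamseq_acc TEXT);
    CREATE TABLE uniprot_reg_full (pfamA_acc TEXT, uniprot_acc TEXT);
    """
)

# Reference-proteome matches: the set behind the "full" alignment.
conn.executemany(
    "INSERT INTO pfamA_reg_full_significant VALUES (?, ?)",
    [("PF_TOY", "SEQ1"), ("PF_TOY", "SEQ2")],
)
# Matches across all of UniProt: a larger set that includes the ones above.
conn.executemany(
    "INSERT INTO uniprot_reg_full VALUES (?, ?)",
    [("PF_TOY", "SEQ1"), ("PF_TOY", "SEQ2"), ("PF_TOY", "SEQ3"), ("PF_TOY", "SEQ4")],
)


def count_sequences(conn, table, acc_column, family):
    """Count distinct sequences with a match to `family` in `table`.

    Table and column names are interpolated into the SQL text, so pass only
    trusted, hard-coded identifiers; the family accession is parameterized.
    """
    query = f"SELECT COUNT(DISTINCT {acc_column}) FROM {table} WHERE pfamA_acc = ?"
    return conn.execute(query, (family,)).fetchone()[0]


full_count = count_sequences(conn, "pfamA_reg_full_significant", "pfamseq_acc", "PF_TOY")
uniprot_count = count_sequences(conn, "uniprot_reg_full", "uniprot_acc", "PF_TOY")
print(f"full: {full_count}, uniprot: {uniprot_count}")
```

With the toy rows, the UniProt count is larger than the full count simply because it covers more sequences, which is the pattern described in the question.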