Sorry if this is a really simple question. I'm just starting to teach myself how to build a profile HMM and am feeling pretty swamped by all the concepts and jargon.
I'm trying to decide whether I should use raw HMMs available from Pfam, TIGRFAMs, etc. or build one myself.
Say I'm interested in the soil microbiome. I searched for PF00246 in the Pfam 33.1 database and looked at the phylogenetic tree on the "Trees" tab. The tree included a wide range of organisms: humans, mice, cows, fruit flies, etc. But I only want to include soil yeast in my tree, so that I don't need to look at really distantly related organisms. I think I have two options:
1) Download the raw HMM from the Pfam database and make a "full alignment" by searching the soil microbiome database. In this case every sequence I end up aligning would come from the soil microbiome, so the other organisms included in the raw HMM wouldn't matter.
2) Build a new raw HMM from a new seed alignment that only includes organisms in the soil microbiome, then use that HMM to search against the soil microbiome database.
Which option would be better?
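To make the two options concrete, this is roughly how I picture them as HMMER command lines. It is only a sketch and I have not run any of it: the file names (`PF00246.hmm`, `soil_proteins.fasta`, `soil_seed.sto`) are placeholders I made up, the E-value cutoff in option 2 is an arbitrary guess, and the step that pulls the hit sequences out into a FASTA file before `hmmalign` is left out. The Python below only assembles and prints the commands; it does not execute HMMER.

```python
import shlex


def option1_commands(pfam_hmm, soil_db, prefix):
    """Option 1: search the soil database with the downloaded Pfam HMM."""
    return [
        # --cut_ga uses the gathering thresholds stored in the Pfam HMM file
        ["hmmsearch", "--cut_ga", "--tblout", f"{prefix}.tbl", pfam_hmm, soil_db],
        # assumes the hits were extracted to <prefix>.hits.fasta in between
        ["hmmalign", "-o", f"{prefix}.sto", pfam_hmm, f"{prefix}.hits.fasta"],
    ]


def option2_commands(seed_alignment, soil_db, prefix, evalue="1e-5"):
    """Option 2: build a new HMM from my own seed alignment, then search."""
    custom_hmm = f"{prefix}.hmm"
    return [
        ["hmmbuild", custom_hmm, seed_alignment],
        # a self-built HMM has no gathering thresholds, so an E-value
        # cutoff is used instead (the value here is an assumption)
        ["hmmsearch", "-E", evalue, "--tblout", f"{prefix}.tbl", custom_hmm, soil_db],
    ]


if __name__ == "__main__":
    opt1 = option1_commands("PF00246.hmm", "soil_proteins.fasta", "opt1")
    opt2 = option2_commands("soil_seed.sto", "soil_proteins.fasta", "opt2")
    for cmd in opt1 + opt2:
        print(shlex.join(cmd))
```

As far as I can tell, the only real difference between the two is which alignment the HMM is built from; the search step against the soil database is the same either way.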
- An additional question: if the second option is better, I'm not sure I understand how people decide which sequences are reliable enough to include in a seed alignment. For instance, how were the seed alignments in these Pfam entries created? Do the curators try to include as many different organisms as possible? Or is there some sort of algorithm to "score" how reliable each alignment is?