Hello,
I am using hhblits, part of hh-suite3. I am using it to query databases from the Soeding lab, such as BFD or Uniclust30.
After having identified hits, could anyone tell me what they use to retrieve the full-length sequences of the hits (an not just the hmm domains corresponding to the query)?
This question has been asked before, both here (Fetching Sequences From Hits In Hhblits) and on the hh-suite github (https://github.com/soedinglab/hh-suite/issues/284), but in both cases there hasen't been any replies.
I know for example that there are ways to retrieve full-length sequences from a hmmer search, for example using http://cryptogenomicon.org/extracting-hmmer-results-to-sequence-files-easel-miniapplications.html But as far as I can see, the same approach is not possible with hhblits and BFD/Uniclust because the orignial sequence file is not provided as part of those databases and the files are not parsable like fasta files.
Any suggestion would be welcome.
Many thanks in advance!
Perhaps a corollary question is: why retrieveing full-length sequences seems to be an unusual request (since it doesn't come as a default feature in both hmmer and hhblits and only a few questions around this can be found on bioinformatics forum)? If one is interested in building a phylogenetic tree or look at sequence conservation, etc., my understanding is that the full-length sequences are needed (for the MSA and also for aligning additional domains that compose the full length protein). Or am I wrong and sequence analysis, MSA, phylogenetic analysis, ancestral sequence reconstruction, etc. can all be done with only the hmm domains returned by hhblits?
Right after I wrote the message below it occurred to me that there may be some kind of
ffindex
command-line switch that could do what you want. In fact, it seems that inhhsuite
there is a whole commandffindex_unpack
which is not well-documented, but it could be doing what you want.EDIT: A quick Google search indicates that
ffindex_unpack
may be unpacking ALL the clusters, so that may be more than what you want as it will take a good chunk of disk space.There is nothing wrong with doing phylogenetic trees of individual domains within the sequence as long as you clearly state that's the case. It is not common that one domain of a protein evolves dramatically differently from the rest of it, so I don't think you would be in any kind of grave danger when interpolating the domain results to the whole protein.