Full length sequences from hhblits
2.5 years ago
ejl • 0

Hello,

I am using hhblits, part of hh-suite3. I am using it to query databases from the Soeding lab, such as BFD or Uniclust30.

After identifying hits, could anyone tell me what they use to retrieve the full-length sequences of the hits (and not just the hmm domains corresponding to the query)?

This question has been asked before, both here (Fetching Sequences From Hits In Hhblits) and on the hh-suite github (https://github.com/soedinglab/hh-suite/issues/284), but in both cases there haven't been any replies.

I know that there are ways to retrieve full-length sequences from a hmmer search, for example using http://cryptogenomicon.org/extracting-hmmer-results-to-sequence-files-easel-miniapplications.html. But as far as I can see, the same approach is not possible with hhblits and BFD/Uniclust, because the original sequence file is not provided as part of those databases and the database files are not parsable like fasta files.

Any suggestion would be welcome.

Many thanks in advance!

hhblits uniclust BFD • 1.6k views

Perhaps a corollary question is: why does retrieving full-length sequences seem to be an unusual request (it is not a default feature in either hmmer or hhblits, and only a few questions about it can be found on bioinformatics forums)? If one is interested in building a phylogenetic tree or looking at sequence conservation, etc., my understanding is that the full-length sequences are needed (for the MSA and also for aligning additional domains that make up the full-length protein). Or am I wrong, and can sequence analysis, MSA, phylogenetic analysis, ancestral sequence reconstruction, etc. all be done with only the hmm domains returned by hhblits?


Right after I wrote the message below it occurred to me that there may be some kind of ffindex command-line switch that could do what you want. In fact, it seems that in hhsuite there is a whole command ffindex_unpack which is not well-documented, but it could be doing what you want.

EDIT: A quick Google search indicates that ffindex_unpack may be unpacking ALL the clusters, so that may be more than what you want as it will take a good chunk of disk space.
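If only a handful of clusters are of interest, it may be possible to read them one at a time instead of unpacking everything. Below is a minimal Python sketch, assuming the usual ffindex layout: a tab-separated index file with one `name<TAB>offset<TAB>length` line per entry, and a data file in which each entry ends with a null byte that is counted in its length. The function names and any file names are hypothetical, not part of hh-suite. One caveat: the alignment files in some hh-suite databases may be stored in a compressed form, in which case the bytes returned here would still need to be decoded with the hh-suite tools.

```python
def load_ffindex(index_path):
    """Parse an ffindex index file into {entry_name: (offset, length)}.

    Assumes one tab-separated 'name<TAB>offset<TAB>length' line per entry.
    """
    entries = {}
    with open(index_path, "r") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            name, offset, length = parts
            entries[name] = (int(offset), int(length))
    return entries


def read_ffindex_entry(data_path, entries, name):
    """Return the raw bytes of one entry, without its trailing null byte."""
    offset, length = entries[name]
    with open(data_path, "rb") as handle:
        handle.seek(offset)
        payload = handle.read(length)
    # Assumption: each entry is terminated by a single null byte.
    if payload.endswith(b"\x00"):
        payload = payload[:-1]
    return payload
```

The index is loaded once and then used to seek directly to each wanted cluster, so disk usage stays at the size of the packaged database.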


There is nothing wrong with building phylogenetic trees from individual domains within the sequence, as long as you clearly state that this is what was done. It is not common for one domain of a protein to evolve dramatically differently from the rest of it, so I don't think you would be in any kind of grave danger when extrapolating the domain results to the whole protein.

2.5 years ago
Mensur Dlakic ★ 27k

It depends on exactly which version of the database you use, but the hits will usually have UniProt identifiers in them. The first hit from one of my uniclust searches is A0A0P1IPI8_9RHOB, and from bfd it is A0A1F1QC50_9PSED. Both of those can be found if you download the UniRef100 database or search through UniProt. Some hits will most likely come from the local protein prediction database, and those you may not be able to find.
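For hits that do carry UniProt identifiers, pulling the full-length sequences out of a downloaded UniRef100 FASTA file could look roughly like the sketch below. It rests on two assumptions that should be checked against your own files: that hit IDs have the form `<accession>_<species>` (as in A0A0P1IPI8_9RHOB), and that UniRef100 headers start with `UniRef100_<accession>`. All function names are made up for illustration.

```python
def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def accession_from_hit(hit_id):
    """Return the part before the first underscore.

    Assumption: the hit ID looks like <accession>_<species>; other
    identifier styles would need different handling.
    """
    return hit_id.split("_", 1)[0]


def select_full_length(handle, hit_ids):
    """Collect {accession: full sequence} for the wanted hits."""
    wanted = {accession_from_hit(hit_id) for hit_id in hit_ids}
    found = {}
    for header, sequence in iter_fasta(handle):
        tokens = header.split()
        if not tokens:
            continue  # header line with no ID
        # Assumption: first token looks like "UniRef100_<accession>".
        accession = tokens[0].split("_", 1)[-1]
        if accession in wanted:
            found[accession] = sequence
    return found
```

Calling `select_full_length(open(path), hit_ids)` streams the file once, so memory use depends on the number of hits rather than on the size of UniRef100. Hits that are missing from the returned dictionary are candidates for the sequences that do not come from UniProt.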


Thanks a lot, Mensur Dlakic. This is indeed correct: Uniclust uses UniRef100 sequences, so hits can be retrieved using UniRef100. However, BFD contains metagenomic sequences from several different databases, which makes it more complicated to retrieve the original sequences, except by downloading all the different databases and querying them. That costs both time and disk space, and partly defeats the purpose of highly clustered databases such as BFD. I am wondering if there is an easier solution that doesn't require downloading additional databases.


To the best of my knowledge, all sequences in hhblits databases are full length, so they are clustered in a way you'd like them to be. This is different from Pfam, which is more domain-centric and almost by design doesn't contain full sequences in alignments. The thing is that your query determines how much of those full sequences will be pulled into alignments, as there is no point in displaying the whole sequence if your query matches only half of it. It may help if you find the longest query as that will likely pull in the majority of sequence length for all clusters that interest you.

I don't know that there is a way to pull out whole sequences from packaged hhblits databases, and that is most likely because not enough people need that functionality. It may be useful to inquire directly with the authors.
