Understanding the relationship between hhr and a3m output of HHblits
7 hours ago
Andrew • 0

I am conducting a homology search using HHblits v3.3.0 against the UniRef30_2023_02 database. I want to use the alignment generated in a3m format (-oa3m flag) and cross-reference it with the summary file in hhr format; specifically, I want to be able to get the score and E-value from the hhr file.

However, there is no clear correspondence between the sequences reported in the two formats. For example, my current a3m file has 537 sequences and the hhr file has 15 sequences. I understand I can get more sequences in the hhr file by changing various reporting thresholds, e.g. the -E flag. However, I noticed that not all the sequences in the hhr file are present in the a3m! So I am not sure what controls which sequences get written to the a3m vs. the hhr file.

Any insight into this would be useful! I have looked at this manual but I have not seen anything that seems to address this issue. Is the hhr just reporting representative cluster members while the a3m is reporting all sufficiently diverse cluster members? If so, why are some sequences in the hhr but not the a3m?

For reference, this is the hhblits command I am currently running:

hhblits -i [input_fasta] -o [output.hhr] -oa3m [output.a3m] -n 1 -d [PATH_TO_UNIREF30_2023_02] -Z 10000 -B 10000 -E 0.001
hhblits hhsuite • 313 views
3 hours ago
Andrew • 0

Ok, after reviewing the HHblits source code I realize the issue is that the hhr file and the a3m file contain apples and oranges, so it doesn't really make sense to compare them directly. The hhr file reports Hits to HMM profiles (duh!), while the a3m file reports individual sequences. This distinction is obscured when querying the UniRef or Uniclust databases, where each Hit represents a cluster of sequences identified by the accession of the cluster's representative sequence (so it looks like the hhr file reports hits to individual sequences).
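To see this mismatch concretely, here is a minimal Python sketch that lists which hhr Hit identifiers also appear as sequence headers in the a3m, and which do not. It assumes the usual hhr layout, where each alignment block has a ">" line naming the Hit followed by a "Probab=... E-value=... Score=..." line, and it takes the first whitespace-delimited token of each header as the identifier. The function names (parse_hhr_hits, parse_a3m_ids, cross_reference) are made up for illustration and are not part of HH-suite.

```python
def parse_hhr_hits(text):
    """Map each Hit ID to its Probab, E-value and Score.

    Assumes each alignment block in the hhr contains a '>' line naming
    the Hit, followed by a line of space-separated key=value statistics.
    """
    hits = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            # First token after '>' is taken as the Hit identifier.
            current = line[1:].split()[0]
        elif line.startswith("Probab=") and current is not None:
            fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
            hits[current] = {k: float(fields[k])
                             for k in ("Probab", "E-value", "Score")}
            current = None
    return hits


def parse_a3m_ids(text):
    """Return the first token of every '>' header line in an a3m file."""
    return [line[1:].split()[0]
            for line in text.splitlines() if line.startswith(">")]


def cross_reference(hhr_text, a3m_text):
    """Split hhr Hits into those present in the a3m and those absent."""
    hits = parse_hhr_hits(hhr_text)
    a3m_ids = set(parse_a3m_ids(a3m_text))
    shared = {h: stats for h, stats in hits.items() if h in a3m_ids}
    missing = sorted(h for h in hits if h not in a3m_ids)
    return shared, missing


# Usage (file names are placeholders):
#   shared, missing = cross_reference(open("output.hhr").read(),
#                                     open("output.a3m").read())
```

Only the Hits in `shared` can be given a score/E-value by a direct ID lookup. The other a3m sequences are cluster members, and the two output files alone do not say which Hit each of them came from.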

When generating the alignment output, HHblits loops through the profile Hits, pulls out all the sequences in each cluster, and filters them in several ways before reporting them in the alignment. The filters include things like % coverage of the query sequence, overall dissimilarity, etc. So it is entirely possible that the representative sequence which identifies the Hit in the hhr file gets filtered out and never appears in the a3m alignment.
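The loop described above can be sketched roughly as follows. This is a toy illustration, not HHblits' actual filtering code: the thresholds min_cov and max_id, the greedy redundancy check, and the example sequences are all assumptions chosen to show how a representative can be dropped while other members of its cluster survive.

```python
def coverage(row):
    """Fraction of query columns in which this aligned row has a residue."""
    return sum(c != "-" for c in row) / len(row)


def identity(a, b):
    """Fraction of identical residues over columns where both rows are ungapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0


def build_alignment(hits, clusters, min_cov=0.5, max_id=0.9):
    """Expand each Hit into its cluster members and filter them.

    hits     -- Hit IDs in the order they would appear in the hhr
    clusters -- Hit ID -> list of (sequence ID, row aligned to the query)
    """
    kept = []
    for hit_id in hits:
        for seq_id, row in clusters[hit_id]:
            if coverage(row) < min_cov:
                continue  # too little of the query is covered
            if any(identity(row, other) > max_id for _, other in kept):
                continue  # nearly identical to a sequence already kept
            kept.append((seq_id, row))
    return kept


# Made-up example: REP_A names its cluster but covers only 3 of 10 columns.
clusters = {
    "REP_A": [("REP_A", "MKV-------"),
              ("MEM_A1", "MKVLAAGIVG"),
              ("MEM_A2", "MKVLAAGIVG")],
    "REP_B": [("REP_B", "MRVLSAGLVA")],
}
hits = ["REP_A", "REP_B"]
alignment_ids = [seq_id for seq_id, _ in build_alignment(hits, clusters)]
print(alignment_ids)  # ['MEM_A1', 'REP_B']
```

In this toy case REP_A would be listed as a Hit in the hhr, yet only MEM_A1 from its cluster reaches the alignment, which is the same pattern as sequences being in the hhr but not in the a3m.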

I knew that HH-suite was all about profile-profile matching, but the way UniRef clusters are indexed by their representative sequences got me confused. I thought all the profile-profile work happened behind the scenes and the results were expanded into individual sequences at the end, but that is only true of the a3m alignment output, not the hhr output.
