Varying length sequences in hhblits (hhsuite 3.3) generated MSA
5 months ago
sajid ▴ 20

Hi everyone,

I ran hhblits from HHsuite 3.3 to generate MSAs for some protein sequences. However, some of the sequences have a different length (including gaps) than the seed sequence. On closer inspection, the sequences with different lengths contain lowercase letters, and these account for the difference: if the lowercase letters are removed, all the sequences have exactly the same length. Some of the sequences are pasted here:

As you can see, the second and third sequences have an extra lowercase r, which makes them longer than the seed sequence (the first one).

Could you please let me know if this is a normal thing? I would expect all the aligned sequences to have the same length when gaps are considered.

Thank you.

hhsuite MSA hhblits • 738 views
5 months ago
Mensur Dlakic ★ 14k

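This is normal. The output is in the so-called A3M format, in which matches are shown as uppercase characters, inserts as lowercase characters, deletions as - (dashes), and gaps aligned to inserts as . (dots). In A3M the dots may be omitted, so a sequence with inserts ends up with more characters than one without, even though both cover the same match columns.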
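To make the length difference concrete, here is a minimal Python sketch that counts the four character classes in a row and shows that only matches plus deletions stay constant. The sequences are made-up toy data (not the ones from the question), and the function names are my own, not part of HHsuite.

```python
from collections import Counter


def classify_a3m_row(row):
    """Count the A3M character classes in one row (hypothetical helper)."""
    counts = Counter()
    for ch in row:
        if ch == "-":
            counts["deletion"] += 1      # gap in a match column
        elif ch == ".":
            counts["insert_gap"] += 1    # gap aligned to an insert
        elif ch.islower():
            counts["insert"] += 1        # residue in an insert column
        elif ch.isupper():
            counts["match"] += 1         # residue in a match column
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return counts


def match_column_count(row):
    """Number of match columns covered by a row: matches plus deletions."""
    counts = classify_a3m_row(row)
    return counts["match"] + counts["deletion"]


# Toy rows, invented for illustration only.
rows = {"seed": "MKVLLA", "hit1": "MKVrLLA", "hit2": "M-VrLLA"}

for name, row in rows.items():
    # Raw length differs between rows; the match-column count does not.
    print(name, len(row), match_column_count(row))
```

The raw string length varies with the number of inserts, while the match-column count is the same for every row, which is the behaviour described in the question.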


Thanks a lot for the clarification. I am trying to use these MSAs as input to a local installation of gremlin to get pairwise residue-residue contacts, which will in turn feed a downstream graph-based machine learning model. However, gremlin throws errors when the sequences in the MSA have different lengths. The only solution I can think of is to remove the insertions so that every sequence matches the length of the seed sequence. Does this seem like a logical thing to do? gremlin will use these MSAs to calculate co-evolution statistics between residue pairs in the seed sequence. What I am trying to understand is whether deleting the insertions will disrupt the MSA. Thank you for your help.


I don't think you should be deleting anything unless you are absolutely certain that you can do it without making an error.

If you installed HHsuite correctly, it includes a Perl script called reformat.pl. With that script you can convert .a3m alignments into other formats in which every sequence has the same number of columns. For example, assuming your file is called alignment.a3m, you can convert it into Clustal format like this:

reformat.pl a3m clu alignment.a3m alignment.clu


If you type just the script name, you will see other formats that are available.

I would not use PSI-BLAST alignments as those made by HHsuite are superior to them.


Thanks a lot. reformat.pl -M first -r a3m a3m did it for me. Initially reformat.pl messed things up even more, until I added -r to the command, which removed the insertions and made all sequences equal in length to the first sequence. I appreciate the help.
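For anyone who wants to check the result before handing it to a contact-prediction tool, below is a rough Python sketch that drops lowercase inserts (and dots) from A3M-style text and verifies that every row then has the same length. It is not a reimplementation of reformat.pl, only an approximation of the effect described above; the toy alignment and all function names are invented for illustration.

```python
import string

# Characters that belong to insert columns in A3M: lowercase residues and dots.
_DROP_INSERTS = str.maketrans("", "", string.ascii_lowercase + ".")


def read_records(text):
    """Parse FASTA-style text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Lines before the first '>' (e.g. a leading '#' line) are skipped.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_inserts(records):
    """Remove insert characters so only match columns remain."""
    return [(h, s.translate(_DROP_INSERTS)) for h, s in records]


def assert_equal_length(records):
    """Raise ValueError if the rows do not all have the same length."""
    lengths = {len(s) for _, s in records}
    if len(lengths) > 1:
        raise ValueError(f"rows have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# Toy alignment, invented for illustration only.
toy_a3m = """>seed
MKVLLA
>hit1
MKVrLLA
>hit2
M-VrLLA
"""

cleaned = strip_inserts(read_records(toy_a3m))
print(assert_equal_length(cleaned))
```

Note that this discards the residues in insert columns entirely, so it is only appropriate when the downstream analysis is defined on the columns of the first (seed) sequence.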


Just to add a bit more to my query: I have now generated .psi MSAs instead of .a3m MSAs. The insertions are no longer there, and all the sequences appear to have the same length as the seed sequence. I think I can go forward with these files.