Question

hmmer core dumps, does not create proper MSA Stockholm format files

0

Entering edit mode

9.0 years ago

kdiaz17 • 0

I keep attempting to run HMMER 3.0 on an HPC/PBS system. I have two protein fasta files that I keep trying to align using phmmer. It always starts up fine, but then is inexplicably either killed or core dumped, whether I run it on the HPC system or on my own computer. Even so, if I take the alignment from phmmer into hmmbuild into order to create a profile hmm, it tells me that there is no name in the Stockholm file from phmmer's output (I did check to make sure the file ended in "//", and I do understand that I am only running a subset of data). I can't find any of these issues anywhere, and Cryptogenomicon and the HMM user guide are less than helpful. Anyone had this problem before?

For reference, the error from my phmmer search is:

Fatal exception (source file p7_omx.c, line 159):
realloc for size 80364975 failed

and my command is

phmmer -o phmmer_palsa_AbR_output.csv -A phmmer_palsa_AbR_msa.sto --domtblout phmmer_palsa_AbR_table.csv -E 1e-5 assembly71_201206_P1_polypeps.faa NR_AbR_polypeptides_full.fa

hmmsearch hmmer phmmer • 4.0k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.0 years ago by kdiaz17 • 0

Ram · Answer 1 · 2015-04-23

That's an allocation error. It tried to allocate 80M and failed. Your system ran out of memory.

I'm not sure what you're trying to do when you say "I have two protein fasta files that I keep trying to align", but what phmmer is going to do is align each seq in assembly71_201206_P1_polypeps.faa to each seq in NR_AbR_polypeptides_full.fa, pairwise. For pairwise comparisons that look like they're statistically significant, H3 will need to do a comparison that currently requires about 36ML bytes of memory, for query and target seqs of lengths M and L residues. For almost all proteins that's not a problem -- 400x400aa requires <6M -- but if we're talking something the size of the largest proteins (titin; 35Kaa) we can get up to requirements of ~40-50G and make most systems fall over. (HMMER4 is going to be using memory-efficient algorithms that make this issue go away.) So my guess is that you have some large homologous proteins in those two files. Either that, or something else was consuming all the RAM on your system(s), but since you say it happens on multiple systems at different times, it's probably your seq lengths.

Running multiple threads exacerbates the problem, because each thread is doing comparisons that need memory. Sometimes reducing the number of threads can get you past a limited memory problem; try --cpu 1.

Since your files are called "...polypeps..." I'm guessing you're either dealing with viral polypeptides (which can be large enough to run into memory issues as above, depending on the virus), in which case the only solution w/ HMMER3 is a large memory machine; or possibly that you've translated a genome assembly's contigs into six polypeptides, including stop codons, in which case the solution is don't do that. Assuming you can stomach the less-than-helpful Cryptogenomicon one more time, see http://cryptogenomicon.org/?p=424 for why HMMER expects protein sequences to be protein sequences, not six-frame translations into artificial long polypeptides w/ * characters.

The reason that hmmbuild tells you that there's no alignment names in your Stockholm file is that there's no alignment names in your Stockholm file. phmmer outputs the alignments as you asked for with it's -A option but doesn't name them for you; since you're running many queries, you get a concatenated Stockholm file consisting of one alignment per query sequence. If you use hmmbuild on a file that contains many alignments, hmmbuild basically assumes that you're building something like Pfam, where all the multiple alignments have been assigned names by Pfam curators. If you build a profile from a single alignment, hmmbuild will use the name of the file as the name of the profile (and you can concat profiles later) but if there's more than one alignment, using the file name doesn't work. We hadn't run into this problem before, since Pfam alignments always have names. I suppose we could have phmmer name the output alignments by the names of the query sequences, which isn't a bad idea. For now, your workaround is either a) edit your .sto file to add name annotation lines yourself to each alignment; or b) run one query seq at a time, build profile from each alignment individually, concat profiles.