Question

Trying to use hmmer to search genes over metagenome assembled genomes

2

Entering edit mode

17 months ago

v.berriosfarias ▴ 140

Hello I have a hmm file (my genes.hmm) that I want to use to search some genes over some MAGs that I elaborated. For this purpose I installed hmmer and tried to run hmmsearch as the following:

First I converted he nucleotide sequences of my MAGs into aminoacidic sequences

transeq -sequence MAG1.fasta -outseq MAG1.faa

After that I ran the hmmsearch program

hmmsearch --cpu 9 --tblout genes_table.txt mygenes.hmm MAG1.faa > genes_results.txt

But I got an error message:

Fatal exception (source file p7_pipeline.c, line 697): Target sequence length > 100K, over comparison pipeline limit. (Did you mean to use nhmmer/nhmmscan?)

It seems hmmer cannot deal with sequences greater than 100K aminoacids. I took a look to my MAG to see if there was an aminoacidic sequence greater than 100K aminoacids and there were more than one with a length over 100k.

So hmmer cannot deal with sequences larger than 100K aminoacids ? Is there a way to handle this ?

thanks for your time.

Valentín.

MAGs genes hmmer • 2.0k views

ADD COMMENT • link 15 months ago by v.berriosfarias ▴ 140

score 1 · Answer 1 · 2023-06-10

I finally solved this with the help of the hmmer devs https://github.com/EddyRivasLab/hmmer/issues/302#issuecomment-1539968198

Firstly you have to consider there is not known aminoacidic sequence longer than 100k aminoacids. The program that I mentioned in the post transeq is not optimum for this case given the fact that it does not search for ORFs, it just translates a DNA sequence without looking for ORFs.

By recommendation of the mentioned dev, I instead used the esl-translate program implemented in hmmer3. So this is what I did for a metagenomic sequencing reads set for one sample, additionally I added a 4th step to calculate abundance of genes along metagenomic samples:

1) Assembly the reads with MEGAHIT (or with your favorite assembler):

megahit -1 sample1_1.fastq.gz -2 sample1_2.fastq.gz --min-contig-len 300  -o output_dir

2) Retrieve ORFs from the contigs and translate them using esl-translate. here I specify to use translation table 11 (dedicated for prokaryotes and Archaeal)

edit: as Mensur suggested below, prodigal is the best solution for this and not esl-translate as I specify here

esl-translate -c 11 final.contigs.fa > sample1_orfs.faa

3) Once you have your translated orfs, you can run hmmsearch with your profile file (ending with .hmm) and your translated ORFs:

 hmmsearch --tblout sample1_results.txt --cpu `int` mygenes.hmm  > sample1.log

4) Once you have your output tables from hmmsearch you can calculate abundance of those genes by converting the table to a GTF-like table that can be the input file for featureCounts program. First you can download a python file that I made for that purpose: https://github.com/Valentin-Bio/HMMER2SAF/tree/main

The scripts works in the directory where you have your hmmer output table files (sample1_results.txt) , and it will output a SAF file into the same directory

Once you have your SAF file , you can run featureCounts to calculate the abundance of your genes of interest. For this you will need a bam file of your reads aligned against the assembled contigs you made on step 1

featureCounts -p -O -T 32 -F SAF -a sample1_SAF.txt -o sample1_fcounts.txt sample1_sorted.bam

The output table will have information on the last column of the abundance of each gene on your sample, you can iterate these steps over multiple samples and parse those tables to get the abundance information column to create a unique table of samples * genes abundance table.

To get that table I used an own made R script and plotted it in an R stacked barplot.