Question

How do I extract sequences to create a fasta file from an nhmmer .out file?

0

Entering edit mode

3.5 years ago

carlee.bettler ▴ 10

I ran nhmmer to see if/where a certain gene was found in a fungi genome. The results of this query give me a .out file that contains multiple gene sequences that are part of a hit.

However, the results of this hit are very unorganized. They are interspersed with other information about the alignment sequences, they aren't linked together, and there are long headers and footers on the file.

Is there a command I can use through hmmer or esl-reformat or esl-sfetch to extract a fasta file from this .out file with just the hit sequences?

Thanks!

fasta hmmer nhmmer alignment hmm • 1.6k views

ADD COMMENT • link updated 3.5 years ago by Fatima ▴ 1000 • written 3.5 years ago by carlee.bettler ▴ 10

score 1 · Answer 1 · 2020-10-23

You can use --noali and --tblout for the commandline options of nhmmer, to get a cleaner table.

Then you can extract the columns with name of the gene, and start and end of the sequence from the output table.

Here is an example:

awk '($1!~"#"){print "$4,$7,$8"}' file.out

You need to figure out which columns you need (col4,7,8 I put is just an example).

Then, if you have a file with all the gene sequences ($geneseqfile), with grep -A 1 $genename $geneseqfile you can grep the gene sequence, and use tail -n +2 | cut -c $start-$stop to get the subsequence.

Now, you can do this for all the hits using a while loop.

A simple while loop:

awk '($1!~"#"){print "$4,$7,$8"}' file.out | while read col1 col2 col3; do echo ">$col1" ;  grep -A $col1 $geneseqfile  | tail -n +2 | cut -c $col2-$col3 ; done