How do I extract sequences to create a fasta file from an nhmmer .out file?
1
0
Entering edit mode
3.5 years ago

I ran nhmmer to see if/where a certain gene was found in a fungi genome. The results of this query give me a .out file that contains multiple gene sequences that are part of a hit.

However, the results of this hit are very unorganized. They are interspersed with other information about the alignment sequences, they aren't linked together, and there are long headers and footers on the file.

Is there a command I can use through hmmer or esl-reformat or esl-sfetch to extract a fasta file from this .out file with just the hit sequences?

Thanks!

fasta hmmer nhmmer alignment hmm • 1.6k views
ADD COMMENT
1
Entering edit mode
3.5 years ago
Fatima ▴ 1000

You can use --noali and --tblout for the commandline options of nhmmer, to get a cleaner table.

Then you can extract the columns with name of the gene, and start and end of the sequence from the output table.

Here is an example:

awk '($1!~"#"){print "$4,$7,$8"}' file.out

You need to figure out which columns you need (col4,7,8 I put is just an example).

Then, if you have a file with all the gene sequences ($geneseqfile), with grep -A 1 $genename $geneseqfile you can grep the gene sequence, and use tail -n +2 | cut -c $start-$stop to get the subsequence.

Now, you can do this for all the hits using a while loop.

A simple while loop:

awk '($1!~"#"){print "$4,$7,$8"}' file.out | while read col1 col2 col3; do echo ">$col1" ;  grep -A $col1 $geneseqfile  | tail -n +2 | cut -c $col2-$col3 ; done
ADD COMMENT

Login before adding your answer.

Traffic: 1380 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6