Question: How do I extract sequences to create a fasta file from an nhmmer .out file?
0
gravatar for carlee.bettler
4 weeks ago by
carlee.bettler10 wrote:

I ran nhmmer to see if/where a certain gene was found in a fungi genome. The results of this query give me a .out file that contains multiple gene sequences that are part of a hit.

However, the results of this hit are very unorganized. They are interspersed with other information about the alignment sequences, they aren't linked together, and there are long headers and footers on the file.

Is there a command I can use through hmmer or esl-reformat or esl-sfetch to extract a fasta file from this .out file with just the hit sequences?

Thanks!

hmmer hmm alignment nhmmer fasta • 110 views
ADD COMMENTlink modified 4 weeks ago by Fatima830 • written 4 weeks ago by carlee.bettler10
1
gravatar for Fatima
4 weeks ago by
Fatima830
United states
Fatima830 wrote:

You can use --noali and --tblout for the commandline options of nhmmer, to get a cleaner table.

Then you can extract the columns with name of the gene, and start and end of the sequence from the output table.

Here is an example:

awk '($1!~"#"){print "$4,$7,$8"}' file.out

You need to figure out which columns you need (col4,7,8 I put is just an example).

Then, if you have a file with all the gene sequences ($geneseqfile), with grep -A 1 $genename $geneseqfile you can grep the gene sequence, and use tail -n +2 | cut -c $start-$stop to get the subsequence.

Now, you can do this for all the hits using a while loop.

A simple while loop:

awk '($1!~"#"){print "$4,$7,$8"}' file.out | while read col1 col2 col3; do echo ">$col1" ;  grep -A $col1 $geneseqfile  | tail -n +2 | cut -c $col2-$col3 ; done
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Fatima830
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1191 users visited in the last hour