Question

How Can I Efficiently Parse Hmmer Results To Find Listed Domains And Their Architecture?

2

Entering edit mode

10.8 years ago

James Ashmore ▴ 100

I am running hmmscan on a fasta file containing a proteome using Pfam-A.hmm. I would like to parse the output from this scan to collect a list of domains identified in my sequences, as well as list any novel domain architectures which may have been identified. Currently I am parsing the output file using a regex to identify and collect the domains listed above the inclusion threshold. Unfortunately, I am not sure how to go about identifying novel domain architectures. In the end, I'm hoping to be able to produce a few figures listing the percentage of different Pfam-A domains in the proteome, the percentage number of sequences with a single domain, two domains, three domains etc. and the percentage of domain architectures previously seen before and those which are novel.

Any advice would be greatly appreciated!

Thanks, James

hmmer parser • 13k views

ADD COMMENT • link updated 10.8 years ago by Travis Wheeler ▴ 110 • written 10.8 years ago by James Ashmore ▴ 100

score 7 · Answer 1 · 2013-06-14

7

Entering edit mode

10.8 years ago

Travis Wheeler ▴ 110

I would also suggest that you look at the combination of --tblout and --domtblout flags:

From 'hmmcan -h':

--tblout <f> : save parseable table of per-sequence hits to file S

--domtblout <f> : save parseable table of per-domain hits to file S

Even better, most likely, is using pfam_scan.pl, provided by the Pfam group. See: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/README (about 1/2 way down)

ADD COMMENT • link 10.8 years ago by Travis Wheeler ▴ 110

1

Entering edit mode

I use --tblout instead of stdout from hmmer.

ADD REPLY • link 10.8 years ago by Leszek 4.2k

score 1 · Answer 2 · 2013-06-12

1

Entering edit mode

10.8 years ago

Neilfws 49k

The Bioperl Search::IO module provides parsing for HMMER2 and HMMER3 output. I can provide more details if required.

ADD COMMENT • link 10.8 years ago by Neilfws 49k

0

Entering edit mode

More details would be really helpful, thank you! A sample of how it works in a Python script would also be great.

ADD REPLY • link 10.8 years ago by James Ashmore ▴ 100

0

Entering edit mode

Not sure that I can provide a sample of how Bioperl works in a Python script :) However, I note that BioPython also provides a module for HMMER parsing.

ADD REPLY • link 10.8 years ago by Neilfws 49k

1

Entering edit mode

Hi neilfws,

I managed to get the Biopython Hmmer parser working. I can print out the query sequence name, and a list of domain hits for that query. However, in one example I have two domains which are both above the inclusion threshold and greatly overlap. How would I go about extracting the domain which is most probable in my sequence? I originally thought about taking the lowest e-value however when there are two genuine domains in my sequence this would only extract one of them. Thanks!

ADD REPLY • link 10.8 years ago by James Ashmore ▴ 100

0

Entering edit mode

Guess all I can suggest is that you write code to cover 2 cases. Where there is extensive overlap - use lowest e-value. Where there isn't - extract both. You'll have to make the call on what constitutes "extensive".

ADD REPLY • link 10.8 years ago by Neilfws 49k

1

Entering edit mode

Hello again, my script is working as I wanted thanks to your input. My output is now a list of sequence names with their associated Pfam domains tab delimited and written as they occur across the sequence e.g.

Seq1 PYRIN \t DEAD

I was wondering if there was any high throughput method of analysing whether the domain architectures of these sequences is novel or not? The closest I have found is to use the Pfam online tool to search for sequences with and without specified domains. Thank you.

ADD REPLY • link 10.8 years ago by James Ashmore ▴ 100

1

Entering edit mode

Can you define "novel domain architecture"? Does that mean a combination of domains (in any order) found only in one protein sequence?

ADD REPLY • link 10.8 years ago by Neilfws 49k

0

Entering edit mode

Yes exactly, it means a combination of domains (in any order or in a specific order) not identified previously. For example on the Pfam database they list the types of architectures where the DDE domain has been found (http://pfam.sanger.ac.uk//family/PF13359.1#tabview=tab1). If you scroll down the architectures on that page you can see that some of those specific architectures have only been found once (There is only 1 sequence with the following architecture).

ADD REPLY • link 10.8 years ago by James Ashmore ▴ 100

0

Entering edit mode

I thought as much, thanks a lot :)

ADD REPLY • link 10.8 years ago by James Ashmore ▴ 100

0

Entering edit mode

Yes, please! I am also interested in knowing how to parse my HMMER results. Thank you!

ADD REPLY • link 10.8 years ago by arnstrm ★ 1.8k