Question: How Can I Efficiently Parse Hmmer Results To Find Listed Domains And Their Architecture?
2
gravatar for James Ashmore
6.3 years ago by
London
James Ashmore90 wrote:

I am running hmmscan on a fasta file containing a proteome using Pfam-A.hmm. I would like to parse the output from this scan to collect a list of domains identified in my sequences, as well as list any novel domain architectures which may have been identified. Currently I am parsing the output file using a regex to identify and collect the domains listed above the inclusion threshold. Unfortunately, I am not sure how to go about identifying novel domain architectures. In the end, I'm hoping to be able to produce a few figures listing the percentage of different Pfam-A domains in the proteome, the percentage number of sequences with a single domain, two domains, three domains etc. and the percentage of domain architectures previously seen before and those which are novel.

Any advice would be greatly appreciated!

Thanks, James

hmmer parser • 7.1k views
ADD COMMENTlink modified 6.3 years ago by Travis Wheeler80 • written 6.3 years ago by James Ashmore90
5
gravatar for Travis Wheeler
6.3 years ago by
United States
Travis Wheeler80 wrote:

I would also suggest that you look at the combination of --tblout and --domtblout flags:

From 'hmmcan -h':

--tblout <f> : save parseable table of per-sequence hits to file S

--domtblout <f> : save parseable table of per-domain hits to file S

Even better, most likely, is using pfam_scan.pl, provided by the Pfam group. See: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/README (about 1/2 way down)

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Travis Wheeler80
1

I use --tblout instead of stdout from hmmer.

ADD REPLYlink written 6.3 years ago by Leszek4.0k
1
gravatar for Neilfws
6.3 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

The Bioperl Search::IO module provides parsing for HMMER2 and HMMER3 output. I can provide more details if required.

ADD COMMENTlink written 6.3 years ago by Neilfws48k

More details would be really helpful, thank you! A sample of how it works in a Python script would also be great.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by James Ashmore90

Not sure that I can provide a sample of how Bioperl works in a Python script :) However, I note that BioPython also provides a module for HMMER parsing.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Neilfws48k
1

Hi neilfws,

I managed to get the Biopython Hmmer parser working. I can print out the query sequence name, and a list of domain hits for that query. However, in one example I have two domains which are both above the inclusion threshold and greatly overlap. How would I go about extracting the domain which is most probable in my sequence? I originally thought about taking the lowest e-value however when there are two genuine domains in my sequence this would only extract one of them. Thanks!

ADD REPLYlink written 6.2 years ago by James Ashmore90

Guess all I can suggest is that you write code to cover 2 cases. Where there is extensive overlap - use lowest e-value. Where there isn't - extract both. You'll have to make the call on what constitutes "extensive".

ADD REPLYlink written 6.2 years ago by Neilfws48k
1

Hello again, my script is working as I wanted thanks to your input. My output is now a list of sequence names with their associated Pfam domains tab delimited and written as they occur across the sequence e.g.

Seq1 PYRIN \t DEAD

I was wondering if there was any high throughput method of analysing whether the domain architectures of these sequences is novel or not? The closest I have found is to use the Pfam online tool to search for sequences with and without specified domains. Thank you.

ADD REPLYlink written 6.2 years ago by James Ashmore90
1

Can you define "novel domain architecture"? Does that mean a combination of domains (in any order) found only in one protein sequence?

ADD REPLYlink written 6.2 years ago by Neilfws48k

Yes exactly, it means a combination of domains (in any order or in a specific order) not identified previously. For example on the Pfam database they list the types of architectures where the DDE domain has been found (http://pfam.sanger.ac.uk//family/PF13359.1#tabview=tab1). If you scroll down the architectures on that page you can see that some of those specific architectures have only been found once (There is only 1 sequence with the following architecture).

ADD REPLYlink written 6.2 years ago by James Ashmore90

I thought as much, thanks a lot :)

ADD REPLYlink written 6.2 years ago by James Ashmore90

Yes, please! I am also interested in knowing how to parse my HMMER results. Thank you!

ADD REPLYlink written 6.3 years ago by arnstrm1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2621 users visited in the last hour