Question: Count Hits to Each Unique Domain in Hmmscan Results w/ Python?
6.4 years ago, ethanabaker1 (United States) wrote:

I'm trying to get some info about the top domains that my data matched using hmmscan. I'm using the domain table output that hmmscan produces (I also have the Pfam file). I want a script that counts the matches to each unique domain and then sorts that list, so that I can extract the top 15-20 domains that my data matched. Thoughts?

6.3 years ago, 5heikki wrote:

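Given that the domain info is in column X (in hmmscan's --domtblout table, column 1 is the target name and column 2 is the target accession). Note that the table is space-padded rather than tab-delimited and contains '#' comment lines, so awk is safer here than cut -f:

grep -v '^#' yourOutputFile | awk '{print $X}' | sort | uniq -c | sort -k1,1nr | head -n 20

Replace X with the column number. The reverse numeric sort puts the most frequent domains first, and head keeps the top 20.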
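
Since the question asks for Python, the same count-and-sort idea can be sketched with collections.Counter from the standard library. This is only a minimal sketch: the function name count_domains and the tiny example table are made up for illustration, and the column index is an assumption you would adjust to your own file.

```python
from collections import Counter
import io


def count_domains(handle, column=0):
    """Count how often each value appears in one whitespace-delimited column.

    `column` is 0-based; blank lines and lines starting with '#' are skipped.
    """
    counts = Counter()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        counts[fields[column]] += 1
    return counts


# Made-up stand-in for a real hmmscan table (hypothetical values).
example = io.StringIO(
    "# target name  accession  tlen  query name\n"
    "ABC_tran  PF00005.25  137  seq1\n"
    "AAA_21    PF13304.4   303  seq1\n"
    "ABC_tran  PF00005.25  137  seq2\n"
)

counts = count_domains(example, column=0)

# most_common(n) returns the n most frequent entries, highest count first.
for domain, n in counts.most_common(20):
    print(n, domain)
```

For a real file, pass an open file handle instead of the StringIO object, for example inside a `with open("yourOutputFile") as fh:` block.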

2.9 years ago, John (United States) wrote:

One option is to load your hmmscan table results into a pandas.DataFrame; counting the domains is then easy with the value_counts() method:

# Here, `hmm_tbl` is a pandas.DataFrame with your hmmscan results.
# This dataframe has a column `accession_target`, which is the accession of the result in the table.
# E.g., the dataframe might look something like this (truncated for readability):
#     acc  accession_query accession_target  ali_coord_from  env_coord_from
# 0  0.91              NaN       PF00005.25              10               6
# 1  0.89              NaN        PF13304.4              33              21

# Count the rows per target accession; value_counts() sorts in descending order by default.
hmm_tbl["accession_target"].value_counts()
# PF07719.15    4668
# PF00005.25    4402
# PF13304.4     3626
# PF13432.4     3601
# PF13428.4     3513
# PF00515.26    3494
# ...

# To keep only the top 20 domains, append .head(20).

Link: pandas.Series.value_counts — pandas 0.21.0 documentation
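
The answer above assumes `hmm_tbl` already exists. One possible way to build it is sketched below, under the assumption that the input is hmmscan's --domtblout table (22 whitespace-separated fields followed by a free-text description). The helper name read_domtblout and all column names are hypothetical, chosen only to line up with the names shown above, and the example rows are invented.

```python
import io

import pandas as pd

# Assumed column names; hmmscan does not define machine-readable headers.
DOMTBL_COLUMNS = [
    "target_name", "accession_target", "tlen",
    "query_name", "accession_query", "qlen",
    "evalue_full", "score_full", "bias_full",
    "domain_number", "domain_total",
    "c_evalue", "i_evalue", "score_domain", "bias_domain",
    "hmm_coord_from", "hmm_coord_to",
    "ali_coord_from", "ali_coord_to",
    "env_coord_from", "env_coord_to",
    "acc", "description",
]


def read_domtblout(handle):
    """Parse a domain table into a DataFrame of strings, skipping '#' lines."""
    rows = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        # Limit the number of splits so the trailing description keeps its spaces.
        rows.append(line.rstrip("\n").split(maxsplit=len(DOMTBL_COLUMNS) - 1))
    return pd.DataFrame(rows, columns=DOMTBL_COLUMNS)


# Invented rows standing in for real hmmscan output.
example = io.StringIO(
    "# comment line\n"
    "ABC_tran PF00005.25 137 seq1 - 250 1e-30 100.0 0.1 1 1 1e-33 1e-30 99.5 0.1 1 137 10 150 6 155 0.91 ABC transporter\n"
    "AAA_21 PF13304.4 303 seq1 - 250 1e-10 40.0 0.0 1 1 1e-13 1e-10 39.0 0.0 5 300 33 240 21 245 0.89 AAA domain\n"
    "ABC_tran PF00005.25 137 seq2 - 250 1e-30 100.0 0.1 1 1 1e-33 1e-30 99.5 0.1 1 137 10 150 6 155 0.91 ABC transporter\n"
)

hmm_tbl = read_domtblout(example)

# Top 20 target accessions by number of domain hits.
top = hmm_tbl["accession_target"].value_counts().head(20)
print(top)
```

All columns are read as strings here, which is enough for counting; numeric columns such as the E-values would need an explicit conversion (for example with pd.to_numeric) before filtering on them.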
