Question: Biopython hmmscan parser
gravatar for nvijay.1991
5.0 years ago by
nvijay.19910 wrote:

Hello everyone,

I have a hmmscan output in hmmscan3-domtab format with more then 10000 queries.

I want to filter each profile hit based on hit_coverage (hmm profile coverage).

I was scanning  queryresult--> Hits -->Hsps --> HSPfragment this iterations and calculating hit_coverage then writing to another file (filtered file) using using SearchIO in biopython write method.

but while writing i found that all the hits were going to filtered file.

below was the code I have used:

for qresult in SearchIO.parse(file_path,'hmmscan3-domtab'):
       for each_hit in qresult:
           for each_hsp in each_hit:
               if re.match("^gfam", and (100*(each_hsp.hit_end  - each_hsp.hit_start)/each_hit.seq_len) >= 75:
               elif not re.match("^gfam", and (100*(each_hsp.hit_end - each_hsp.hit_start)/each_hit.seq_len) >= 50:

Any suggestions for sending only filtered hits to the filtered file.

PS - I have noticed that I am writing qresult..i need some suggestions for refining on this.


Vijay N

hmmscan biopython searchio • 2.0k views
ADD COMMENTlink modified 3.5 years ago by O.rka210 • written 5.0 years ago by nvijay.19910
gravatar for O.rka
3.5 years ago by
O.rka210 wrote:

This is the one I wrote. I need to do some error handling for when there are no hits b/c it will break. I put stuff like this in my Jupyter profile and it speeds stuff up... a lot.

def read_hmmscan(path, set_index="query_name", sort_values="sequence|_|e-value"):
    # Build pd.DataFrame
    df_tmp = pd.read_table(path, header=None, index_col=0)
    mask_idx = x: x.startswith("#") == False)
    tmp_list = df_tmp.loc[mask_idx,:] x: [y for y in x.split(" ") if (len(y) > 0) and (y != "-")]).tolist()
    data = list()
    for line in tmp_list:
    df_hmmscan = pd.DataFrame(data)
    # Labels
    id_cols = ["target_name", "query_name"]
    sequence_cols = ["sequence|_|e-value", "sequence|_|score", "sequence|_|bias"]
    domain_cols = ["domain|_|e-value", "domain|_|score", "domain|_|bias"]
    misc_cols = ["exp", "reg", "clu", "ov", "env", "dom", "rep", "inc"]
    df_hmmscan.columns  = id_cols + sequence_cols + domain_cols + misc_cols
    # Force types
    df_hmmscan.loc[:,sequence_cols + domain_cols] = df_hmmscan.loc[:,sequence_cols + domain_cols].astype(float)
    df_hmmscan.loc[:,"exp"] = df_hmmscan.loc[:,"exp"].astype(float)
    df_hmmscan.loc[:,misc_cols[1:]] = df_hmmscan.loc[:,misc_cols[1:]].astype(int)
    # Reformat structure
    if set_index:
        df_hmmscan = df_hmmscan.set_index(set_index)
    if sort_values:
        df_hmmscan = df_hmmscan.sort_values(sort_values, ascending=True)
    return df_hmmscan
ADD COMMENTlink written 3.5 years ago by O.rka210
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2093 users visited in the last hour