Question: Biopython hmmscan parser
0
gravatar for nvijay.1991
3.9 years ago by
India
nvijay.19910 wrote:

Hello everyone,

I have a hmmscan output in hmmscan3-domtab format with more then 10000 queries.

I want to filter each profile hit based on hit_coverage (hmm profile coverage).

I was scanning  queryresult--> Hits -->Hsps --> HSPfragment this iterations and calculating hit_coverage then writing to another file (filtered file) using using SearchIO in biopython write method.

but while writing i found that all the hits were going to filtered file.

below was the code I have used:

for qresult in SearchIO.parse(file_path,'hmmscan3-domtab'):
       pass
       for each_hit in qresult:
           pass
           for each_hsp in each_hit:
               pass
               if re.match("^gfam",each_hit.id) and (100*(each_hsp.hit_end  - each_hsp.hit_start)/each_hit.seq_len) >= 75:
                   SearchIO.write(qresult,filtered_file_handler,'hmmscan3-domtab')
               elif not re.match("^gfam",each_hit.id) and (100*(each_hsp.hit_end - each_hsp.hit_start)/each_hit.seq_len) >= 50:
                    SearchIO.write(qresult,filtered_file_handler,'hmmscan3-domtab')

Any suggestions for sending only filtered hits to the filtered file.

PS - I have noticed that I am writing qresult..i need some suggestions for refining on this.

Thanks,

Vijay N

hmmscan biopython searchio • 1.6k views
ADD COMMENTlink modified 2.4 years ago by O.rka120 • written 3.9 years ago by nvijay.19910
0
gravatar for O.rka
2.4 years ago by
O.rka120
O.rka120 wrote:

This is the one I wrote. I need to do some error handling for when there are no hits b/c it will break. I put stuff like this in my Jupyter profile and it speeds stuff up... a lot.

def read_hmmscan(path, set_index="query_name", sort_values="sequence|_|e-value"):
    # Build pd.DataFrame
    df_tmp = pd.read_table(path, header=None, index_col=0)
    mask_idx = df_tmp.index.map(lambda x: x.startswith("#") == False)
    tmp_list = df_tmp.loc[mask_idx,:].index.map(lambda x: [y for y in x.split(" ") if (len(y) > 0) and (y != "-")]).tolist()
    data = list()
    for line in tmp_list:
        data.append(pd.Series(line))
    df_hmmscan = pd.DataFrame(data)
    # Labels
    id_cols = ["target_name", "query_name"]
    sequence_cols = ["sequence|_|e-value", "sequence|_|score", "sequence|_|bias"]
    domain_cols = ["domain|_|e-value", "domain|_|score", "domain|_|bias"]
    misc_cols = ["exp", "reg", "clu", "ov", "env", "dom", "rep", "inc"]
    df_hmmscan.columns  = id_cols + sequence_cols + domain_cols + misc_cols
    # Force types
    df_hmmscan.loc[:,sequence_cols + domain_cols] = df_hmmscan.loc[:,sequence_cols + domain_cols].astype(float)
    df_hmmscan.loc[:,"exp"] = df_hmmscan.loc[:,"exp"].astype(float)
    df_hmmscan.loc[:,misc_cols[1:]] = df_hmmscan.loc[:,misc_cols[1:]].astype(int)
    # Reformat structure
    if set_index:
        df_hmmscan = df_hmmscan.set_index(set_index)
    if sort_values:
        df_hmmscan = df_hmmscan.sort_values(sort_values, ascending=True)
    return df_hmmscan
ADD COMMENTlink written 2.4 years ago by O.rka120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1102 users visited in the last hour