Question: How to separate a BLAST output file (tabular form) into files containing the hits for each protein searched.
Becca wrote:

Hi there,

I was wondering if you could help me. I have run a multi-sequence blastp search, which generated an output.tsv file. I need to split that TSV file into separate files, each containing the hits for one query protein: all the hits for protein 1 in one file, all the hits for protein 2 in another, and so on. I tried to do this by limiting the target sequences to 10 and then splitting the file by line number, so that 10 hits go in each file, but some proteins have only 3 or 4 hits, which messes up the separation. And I have to do this in Python!

I am in dire need of some help!

I know you can parse a BLAST output file, but how would I then direct all the hits for each protein into a different file?

Any help would be really appreciated!! Thank you

blast python • 429 views
written 18 months ago by Becca

Providing an outline for a non-python solution.

Cut out the first column and then uniq that list to get the sequence IDs that have hits. Then use grep with the -w option in a loop to extract the lines that contain each ID.

And I have to do this on Python !

Is this an assignment?

modified 18 months ago • written 18 months ago by genomax

No it's not an assignment. But I'm working with someone who only uses python. I could do it if I didn't have to use python but using python is confusing me slightly... Thank you though for your input.

modified 18 months ago • written 18 months ago by Becca

Is this standard -outfmt 6 tabular output? Are you python savvy or no?

It may be something like this (something I found on the web):

OR

https://www.reddit.com/r/bioinformatics/comments/4ef5p8/how_to_filter_blast_results_using_biopython/

modified 18 months ago • written 18 months ago by genomax

Yeah maybe. I will have a look into that. I usually use pandas to make a dataframe to make plots and things.

Or I was thinking something like this :

from blast import parse

# Open the tabular BLAST output; the with-block closes the file afterwards.
with open('blast.tsv') as fh:
    for blast_record in parse(fh):
        print('query id: {}'.format(blast_record.qid))
        for hit in blast_record.hits:
            for hsp in hit:
                print('****Alignment****')
                print('sequence:', hsp.sid)
                print('length:', hsp.length)
                print('e value:', hsp.evalue)

Output would look like:

query id: cgl|CAGL0A00187g
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 90)
('e value:', 0.0001)
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 44)
('e value:', 0.0007)
****Alignment****
('sequence:', u'ecy|Ecym_4273')
('length:', 84)
('e value:', 0.64)

But I don't know how to direct all of that into a file... I'll have a think...

modified 18 months ago by genomax • written 18 months ago by Becca
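
For the "direct it into a file" step, one possible approach needs no parser at all: read the tabular file line by line and append each line to a file named after its first column (the query ID). This is only a minimal sketch. The function name `split_blast_by_query`, the output directory and the `blast_<id>.tsv` naming scheme are made up for illustration, and it assumes tab-separated output with the query ID in column 1.

```python
import os
import re


def split_blast_by_query(tsv_path, out_dir):
    """Write one tab-separated file per query ID (column 1); return {id: path}."""
    os.makedirs(out_dir, exist_ok=True)
    written = {}  # query ID -> output path, in first-seen order
    with open(tsv_path) as fh:
        for line in fh:
            # Skip blank lines and '#' comment lines (as produced by -outfmt 7).
            if not line.strip() or line.startswith("#"):
                continue
            query_id = line.split("\t", 1)[0]
            if query_id not in written:
                # IDs such as 'cgl|CAGL0A00187g' contain characters that are
                # awkward in file names, so replace them with underscores.
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", query_id)
                written[query_id] = os.path.join(
                    out_dir, "blast_{}.tsv".format(safe))
                mode = "w"  # first hit for this query: start a fresh file
            else:
                mode = "a"  # later hits: append to the existing file
            with open(written[query_id], mode) as out:
                out.write(line)
    return written
```

Calling `split_blast_by_query("output.tsv", "per_query")` would then leave one file per query in `per_query/`, however many hits each query has. Two different IDs that sanitise to the same name would overwrite each other, so check for that if your IDs differ only in punctuation.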

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

written 18 months ago by genomax

And to add to this, as soon as you have it in a DataFrame you could use something like the following loop (untested):

for query in df["query_id"].unique():
    df.loc[df["query_id"] == query].to_csv("blast_{}.txt".format(query), sep="\t", index=False)
written 18 months ago by WouterDeCoster
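
To get from the TSV file to a DataFrame with a "query_id" column, something like the sketch below might work. It assumes the file is default `-outfmt 6` output (12 columns, no header). The column names are the standard outfmt 6 fields, except that the first two are renamed to "query_id" and "subject_id" to match the loop above (BLAST calls them qseqid and sseqid). The function name `split_with_pandas` and the file-name prefix are hypothetical.

```python
import re

import pandas as pd

# Default column order of BLAST+ -outfmt 6, with the first two columns
# renamed (qseqid -> query_id, sseqid -> subject_id) for this example.
COLUMNS = ["query_id", "subject_id", "pident", "length", "mismatch",
           "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def split_with_pandas(tsv_path, prefix="blast_"):
    """Write one tab-separated file (with header) per query; return the paths."""
    df = pd.read_csv(tsv_path, sep="\t", header=None, names=COLUMNS)
    paths = []
    # sort=False keeps the queries in the order they appear in the file.
    for query, hits in df.groupby("query_id", sort=False):
        # Replace characters such as '|' that are awkward in file names.
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", str(query))
        path = "{}{}.txt".format(prefix, safe)
        hits.to_csv(path, sep="\t", index=False)
        paths.append(path)
    return paths
```

`groupby` does the same job as looping over `unique()` and filtering, but only passes over the table once. Unlike the raw BLAST output, each file written this way starts with a header row; pass `header=False` to `to_csv` if that is not wanted.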