Blastp how to find and count duplicates?..
1
2
Entering edit mode
7.8 years ago
Buffo ★ 2.4k

Hi everybody, I did a blastp on linux command line (about 7,500 genes), outfmt6, I did a filter (e-value, max_hsps, etc), just one hit per query, and it works very well, but I found that some genes have hit with the same target and now I want to now how many genes have the same target hit, so what I did was; cut -f 2 result_blastp_outfmt6.txt > to_find_duplicates.txt, column 2 have the target accession number and then, this is my python script: It Find target names duplicates (2 or more times) and print them. It works, but I want to know how many times is present each duplicate on blast result, I'm a noob in python and i don't get it until now :(

some advices??!

#find duplicates and then write them in duplicates.txt file
file = open(sys.argv[1],'r')
list = {}
for elem in file:
    if elem in list:
        list[elem] += 1
    else:
        list[elem] = 1
dups = [x for x, y in list.items() if y > 1]

file_out = open('duplicates.txt', 'w')
for line in dups:
    file_out.write(line)
file_out.close() 
blast gene outfmt6 python script • 2.4k views
ADD COMMENT
2
Entering edit mode

what about just using cut | sort | uniq -d ?

ADD REPLY
2
Entering edit mode

No, uniq -d does the same that my script, anyways I didnĀ“t know that command, thank`s!

ADD REPLY
2
Entering edit mode
7.8 years ago
piet ★ 1.8k

I want to know how many times is present each duplicate on blast result

your python script needs only minor changes to accomplish that.

for line, count in list.iteritems() if count > 1:
    sys.stdout.write("%i\t%s" % (count,line,))

In this case you want to "aggregate" (collect) data across several lines of the blast output. Problems like this (sorting and aggregating) can be solved with python efficiently. I would recommend the "Python Cookbook" by Alex Martelli to learn more about how to write short and efficient python code for string processing and sorting.

Coding style: 'list' and 'file' are names build-in into python. You should never use build-in names for your own variables.

I would also recommend to read the blast output directly with python:

acc_count = {}
for line in open("result_blastp_outfmt6.txt"):
    fields = line.split("\t")
    acc = fields[1] # zero-based indexing
    acc_count[acc] = acc_count.get(acc,0) + 1 # this is the TYPICAL pythonic idiom for counting
ADD COMMENT
1
Entering edit mode

Thanks piet, I tried to add your suggestion but it doesn`t works:

for line, count in list.iteritems() if count > 1:
SyntaxError: invalid syntax

I added that after

dups = [x for x, y in list.items() if y > 1

That`s right ? Or it was instead of?

Thanks piet!

ADD REPLY
2
Entering edit mode

Yes, I see. The syntax is wrong. The if clause has to be moved into the for loop. Try out the variants, this will help you understand programming.

ADD REPLY
0
Entering edit mode
for line, count in list.iteritems():
    if count > 1:
        sys.stdout.write("%i\t%s" % (count,line,))

I love it, it works perfect! I will try also read directly the blast output! thank you so much piet!

ADD REPLY

Login before adding your answer.

Traffic: 2580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6