Question: Blastp how to find and count duplicates?..
1
gravatar for Buffo
3.9 years ago by
Buffo1.8k
Buffo1.8k wrote:

Hi everybody, I did a blastp on linux command line (about 7,500 genes), outfmt6, I did a filter (e-value, max_hsps, etc), just one hit per query, and it works very well, but I found that some genes have hit with the same target and now I want to now how many genes have the same target hit, so what I did was; cut -f 2 result_blastp_outfmt6.txt > to_find_duplicates.txt, column 2 have the target accession number and then, this is my python script: It Find target names duplicates (2 or more times) and print them. It works, but I want to know how many times is present each duplicate on blast result, I'm a noob in python and i don't get it until now :(

some advices??!

#find duplicates and then write them in duplicates.txt file
file = open(sys.argv[1],'r')
list = {}
for elem in file:
    if elem in list:
        list[elem] += 1
    else:
        list[elem] = 1
dups = [x for x, y in list.items() if y > 1]

file_out = open('duplicates.txt', 'w')
for line in dups:
    file_out.write(line)
file_out.close() 
script blast outfmt6 python gene • 1.2k views
ADD COMMENTlink modified 3.9 years ago by piet1.7k • written 3.9 years ago by Buffo1.8k
2

what about just using cut | sort | uniq -d ?

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum128k
2

No, uniq -d does the same that my script, anyways I didnĀ“t know that command, thank`s!

ADD REPLYlink written 3.9 years ago by Buffo1.8k
2
gravatar for piet
3.9 years ago by
piet1.7k
planet earth
piet1.7k wrote:

I want to know how many times is present each duplicate on blast result

your python script needs only minor changes to accomplish that.

for line, count in list.iteritems() if count > 1:
    sys.stdout.write("%i\t%s" % (count,line,))

In this case you want to "aggregate" (collect) data across several lines of the blast output. Problems like this (sorting and aggregating) can be solved with python efficiently. I would recommend the "Python Cookbook" by Alex Martelli to learn more about how to write short and efficient python code for string processing and sorting.

Coding style: 'list' and 'file' are names build-in into python. You should never use build-in names for your own variables.

I would also recommend to read the blast output directly with python:

acc_count = {}
for line in open("result_blastp_outfmt6.txt"):
    fields = line.split("\t")
    acc = fields[1] # zero-based indexing
    acc_count[acc] = acc_count.get(acc,0) + 1 # this is the TYPICAL pythonic idiom for counting
ADD COMMENTlink written 3.9 years ago by piet1.7k
1

Thanks piet, I tried to add your suggestion but it doesn`t works:

for line, count in list.iteritems() if count > 1:
SyntaxError: invalid syntax

I added that after

dups = [x for x, y in list.items() if y > 1

That`s right ? Or it was instead of?

Thanks piet!

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Buffo1.8k
2

Yes, I see. The syntax is wrong. The if clause has to be moved into the for loop. Try out the variants, this will help you understand programming.

ADD REPLYlink written 3.9 years ago by piet1.7k
for line, count in list.iteritems():
    if count > 1:
        sys.stdout.write("%i\t%s" % (count,line,))

I love it, it works perfect! I will try also read directly the blast output! thank you so much piet!

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Buffo1.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2235 users visited in the last hour