How can I find the same record in two file? Use python
2
0
Entering edit mode
3.6 years ago

I have two file, named *.gb and *.txt , I want to find the same record between the two files, here is the script I wrote,but it does not work, what 's wrong with it ,can you help me?

from Bio import SeqIO
i=0
f = open("E:/WSA2/blast_gb.txt", "r", encoding="utf-8")
record = SeqIO.parse("E://WSA2/filted_77468.gb","genbank")
for aci in record:
id = aci.id
for line in f:
if line in id:
i=i+1
SeqIO.write(id,"aa.gb","genbank")
print(i)
f.close()

sequence • 917 views
0
Entering edit mode

Hi, please can you use "code sample" when you paste code to your post? It is much more readable.

0
Entering edit mode
from Bio import SeqIO
i=0
f = open("E:/学业生涯/研究生/WSA2/blast_gb.txt", "r", encoding="utf-8")
record = SeqIO.parse("E:/学业生涯/研究生/WSA2/filted_77468.gb","genbank")
for aci in record:
id = aci.id
for line in f:
if line in id:
i=i+1
SeqIO.write(id,"aa.gb","genbank")
continue
print(i)
f.close()

0
Entering edit mode

I am sorry,I am not familar with this ,here is the source code, and the result shows 0 ,I don't know why, could you please help me，thx!

3
Entering edit mode
3.6 years ago
Shred ▴ 460

The problem is that you didn't split the record by line. You can easly do it without using SeqIO module, but easly with regular python syntax. Credits: https://stackoverflow.com/questions/9585218/python-find-common-text-in-two-files

> file1 = set(line.strip() for line in open('file1.txt'))
> file2 = set(line.strip() for line in open('file2.txt'))
>
>   for line in file1 & file2:
>      if line:
>          print line


As explained also in the linked question, why don't you use just "comm" in bash?

I'm sorry for the lack of precision while writing code, but this forum sometimes does strange things with code.

0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

0
Entering edit mode

Thank you so much, I use the SeqIO only for extra Sequence information from the genbank modul file.

1
Entering edit mode
3.6 years ago

We need more information about the .txt file. I assume that you have a list of ids in it. You don't need for each record to re-read the txt file. You can stock txt file information in an array and then for each record test the aci.id on the array. Something like this :

txt_array=[]
for line in f:
txt_array.append(line)
f.close()


Then after that you can do :

for aci in record:
if aci.id in txt_array:
i=i+1
SeqIO.write(id,"aa.gb","genbank")
##continue (why do you use "continue" ?)


Otherwise, "print" is a savior, print your aci.id to look at its shape, maybe you have the id + the description under aci.id, then the program will never find an id in the txt_array. Just do :

for aci in record:
printaci.id) ###Biostars editor just eat the "(" in the print...


And see if it fits with ids in your txt file.

0
Entering edit mode

yeah, you are right, thank you so much,I change may stategy to do this work, and I extarct the record I wanted. Here is the code,thank all of you!!

from Bio import SeqIO
i=0
f_1 = open("abc.gb","w",encoding="utf-8")
f = open("E:/blast_gb.txt", "r", encoding="utf-8")
record = SeqIO.parse("E:/filted_77468.gb","genbank")
for aci in record:
for line in f:
if aci.id in line:
f_1.write(aci.format("genbank"))
i=i+1

f.seek(0)
print(i)
f.close()
f_1.close()

0
Entering edit mode

Once again you have a double nested "for" loop. For each record in your genbank file you read all your "f" file, which is time consuming if you have a huge "f" file.

Save all lines from "f" in an array first :

for line in f:
array.append(line)


You can close your "f" file here.

Then, use the array :

for aci in record:
if aci.id in array: