Question: How can I find the same record in two file? Use python
0
gravatar for changhaiduan
9 months ago by
changhaiduan0 wrote:

I have two file, named *.gb and *.txt , I want to find the same record between the two files, here is the script I wrote,but it does not work, what 's wrong with it ,can you help me?

from Bio import SeqIO
i=0
f = open("E:/WSA2/blast_gb.txt", "r", encoding="utf-8")
record = SeqIO.parse("E://WSA2/filted_77468.gb","genbank")
for aci in record:
    id = aci.id
for line in f:
        if line in id:
            i=i+1
            SeqIO.write(id,"aa.gb","genbank")
print(i)
f.close()
sequence • 342 views
ADD COMMENTlink modified 9 months ago by genomax59k • written 9 months ago by changhaiduan0

Hi, please can you use "code sample" when you paste code to your post? It is much more readable.

ADD REPLYlink written 9 months ago by Paul1.2k
from Bio import SeqIO
i=0
f = open("E:/学业生涯/研究生/WSA2/blast_gb.txt", "r", encoding="utf-8")
record = SeqIO.parse("E:/学业生涯/研究生/WSA2/filted_77468.gb","genbank")
for aci in record:
    id = aci.id
    for line in f:
        if line in id:
            i=i+1
            SeqIO.write(id,"aa.gb","genbank")
        continue
print(i)
f.close()
ADD REPLYlink written 9 months ago by changhaiduan0

I am sorry,I am not familar with this ,here is the source code, and the result shows 0 ,I don't know why, could you please help me,thx!

ADD REPLYlink written 9 months ago by changhaiduan0
3
gravatar for Shred
9 months ago by
Shred70
Shred70 wrote:

The problem is that you didn't split the record by line. You can easly do it without using SeqIO module, but easly with regular python syntax. Credits: https://stackoverflow.com/questions/9585218/python-find-common-text-in-two-files

> file1 = set(line.strip() for line in open('file1.txt'))   
> file2 = set(line.strip() for line in open('file2.txt'))
>  
>   for line in file1 & file2:
>      if line:
>          print line

As explained also in the linked question, why don't you use just "comm" in bash?

I'm sorry for the lack of precision while writing code, but this forum sometimes does strange things with code.

ADD COMMENTlink modified 9 months ago by WouterDeCoster35k • written 9 months ago by Shred70

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 9 months ago by WouterDeCoster35k

Thank you so much, I use the SeqIO only for extra Sequence information from the genbank modul file.

ADD REPLYlink written 9 months ago by changhaiduan0
1
gravatar for Bastien Hervé
9 months ago by
Bastien Hervé2.7k
Limoges, CBRS, France
Bastien Hervé2.7k wrote:

We need more information about the .txt file. I assume that you have a list of ids in it. You don't need for each record to re-read the txt file. You can stock txt file information in an array and then for each record test the aci.id on the array. Something like this :

txt_array=[]
for line in f:
    txt_array.append(line)
f.close()

Then after that you can do :

for aci in record:
    if aci.id in txt_array:
        i=i+1
        SeqIO.write(id,"aa.gb","genbank")
        ##continue (why do you use "continue" ?)

Otherwise, "print" is a savior, print your aci.id to look at its shape, maybe you have the id + the description under aci.id, then the program will never find an id in the txt_array. Just do :

for aci in record:
    printaci.id) ###Biostars editor just eat the "(" in the print...

And see if it fits with ids in your txt file.

ADD COMMENTlink modified 9 months ago • written 9 months ago by Bastien Hervé2.7k

yeah, you are right, thank you so much,I change may stategy to do this work, and I extarct the record I wanted. Here is the code,thank all of you!!

from Bio import SeqIO
i=0
f_1 = open("abc.gb","w",encoding="utf-8")
f = open("E:/blast_gb.txt", "r", encoding="utf-8")
record = SeqIO.parse("E:/filted_77468.gb","genbank")
for aci in record:
    for line in f:
        if aci.id in line:
            f_1.write(aci.format("genbank"))
            i=i+1

    f.seek(0)
print(i)
f.close()
f_1.close()
ADD REPLYlink modified 9 months ago • written 9 months ago by changhaiduan0

Once again you have a double nested "for" loop. For each record in your genbank file you read all your "f" file, which is time consuming if you have a huge "f" file.

Save all lines from "f" in an array first :

for line in f:
    array.append(line)

You can close your "f" file here.

Then, use the array :

for aci in record:
    if aci.id in array:
ADD REPLYlink written 9 months ago by Bastien Hervé2.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1038 users visited in the last hour