Question: Searching for a sequence string in python?
1
gravatar for auryndb
3.1 years ago by
auryndb60
Canada
auryndb60 wrote:

Hey all,

So I'm using plain python (I'm not using BioPython) to search for a string in E. Coli genome. How I do it is that I read each line of a fasta sequence, and I'll do an if sequence return thing on it, a pseudocode like this:

ecoli_sequence = open('ecolik12.fasta', 'r')
a = ecoli_sequence.readlines()[1:]
for y in a:
    if "TATAAA" in y:
        print ("it's here"+y)
    else:
        print("it wasn't in the genome, stupid code albeit")
ecoli_sequence.close()

however, there is one big problem. if my sequence is at the interface of each line, it can't recognize it. What do you guys suggest?

Please help, I really will appreciate it.

Truly yours,

python genome • 3.9k views
ADD COMMENTlink modified 3.1 years ago by John12k • written 3.1 years ago by auryndb60
2

In your program you should store the whole sequence of the Ecoli chromosome as a single large string of about 5 million characters. The FASTA format was developed in the age of punch cards when line length was restricted to 80 characters. Please note, that bacterial chromosomes are circular. It is common praxis to represent them as linear stings, but there remains one corner case, if your pattern partially overlaps the start and the end of the linear sequence.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by piet1.6k
1

If you want to be able to parse a variable sequence line fasta file without biopython, there are a couple of posts on biostars with example code of how to do that. Here is a great blog post on how to use python's itertools.groupby to do that:

https://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Damian Kao15k

Umm, sorry, but what do you mean by "interface" here?

ADD REPLYlink written 3.1 years ago by RamRS20k

okay imagine the fasta sequence like (fasta sequences are divided in lines):

 atatatacggcgcgagaatatctctc**atata**
 **atat**ctctctatattatagccctctctctctctaa

with my code, it can't find the sequence if it's at the end of the first line and at the beginning of the second line.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by auryndb60
1
gravatar for geek_y
3.1 years ago by
geek_y9.3k
Barcelona/CRG/London/Imperial
geek_y9.3k wrote:

use biopython to properly handle fasta file.

from Bio import SeqIO

for rec in SeqIO.parse("ecolik12.fasta","fasta"):
    if "TATAAA" in rec.seq:
        . . . .

It looks like you were already suggested to use BioPython to read a fasta file. Importing big sequences into Python

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by geek_y9.3k
1
gravatar for dariober
3.1 years ago by
dariober9.9k
WCIP | Glasgow | UK
dariober9.9k wrote:

If you don't want to add the BioPython dependency, you just need to stitch together the lines with ''.join(...):

a = ecoli_sequence.readlines()[1:]
a= ''.join([x.strip() for x in a])

if "TATAAA" in a:
    print ("Found")
else:
    print("it wasn't in the genome, stupid code albeit")

And if you want to test whether the substring is at the start-end junction of the circular genome:

x= "TATAAA"
xl= len(x)-1
junct= a[len(a)-xl:] + a[0:xl]
if x in junct:
    ...
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by dariober9.9k
1
gravatar for John
3.1 years ago by
John12k
Germany
John12k wrote:

The fastest algorithm I can think of without reading the whole file into memory first would be:

seq = 'TGCACTGATG'
seq_len = len(seq)
with open('ecoli.fa', 'r') as ecoli_sequence:
    previous = ''
    for line,data in enumerate(ecoli_sequence):
        if seq not in previous + data:
            previous = data[-seq_len:-1]
        else:
            if seq in previous + data[:seq_len]:
                print "It's split over lines:", line, "and", line+1
            else:
                print "It's in line: ", line+1
            break

I'm not sure for small genomes it's any faster than reading the whole file into memory. Actually I'd say reading the whole file into memory is probably significantly faster than this. But then you have to deal with parsing out the \n's Speaking of which, if you have \r\n at the end of each line, not \n, the -1 has to be a -2. You can use strip put meh

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1226 users visited in the last hour