Searching for a sequence string in python?
3
1
Entering edit mode
8.5 years ago
auryndb ▴ 70

Hey all,

So I'm using plain python (I'm not using BioPython) to search for a string in E. Coli genome. How I do it is that I read each line of a fasta sequence, and I'll do an if sequence return thing on it, a pseudocode like this:

ecoli_sequence = open('ecolik12.fasta', 'r')
a = ecoli_sequence.readlines()[1:]
for y in a:
    if "TATAAA" in y:
        print ("it's here"+y)
    else:
        print("it wasn't in the genome, stupid code albeit")
ecoli_sequence.close()

however, there is one big problem. if my sequence is at the interface of each line, it can't recognize it. What do you guys suggest?

Please help, I really will appreciate it.

Truly yours,

python genome • 9.1k views
ADD COMMENT
2
Entering edit mode

In your program you should store the whole sequence of the Ecoli chromosome as a single large string of about 5 million characters. The FASTA format was developed in the age of punch cards when line length was restricted to 80 characters. Please note, that bacterial chromosomes are circular. It is common praxis to represent them as linear stings, but there remains one corner case, if your pattern partially overlaps the start and the end of the linear sequence.

ADD REPLY
1
Entering edit mode

If you want to be able to parse a variable sequence line fasta file without biopython, there are a couple of posts on biostars with example code of how to do that. Here is a great blog post on how to use python's itertools.groupby to do that:

https://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/

ADD REPLY
0
Entering edit mode

Umm, sorry, but what do you mean by "interface" here?

ADD REPLY
0
Entering edit mode

okay imagine the fasta sequence like (fasta sequences are divided in lines):

 atatatacggcgcgagaatatctctc**atata**
 **atat**ctctctatattatagccctctctctctctaa

with my code, it can't find the sequence if it's at the end of the first line and at the beginning of the second line.

ADD REPLY
1
Entering edit mode
8.5 years ago

use biopython to properly handle fasta file.

from Bio import SeqIO

for rec in SeqIO.parse("ecolik12.fasta","fasta"):
    if "TATAAA" in rec.seq:
        . . . .

It looks like you were already suggested to use BioPython to read a fasta file. Importing big sequences into Python

ADD COMMENT
1
Entering edit mode
8.5 years ago

If you don't want to add the BioPython dependency, you just need to stitch together the lines with ''.join(...):

a = ecoli_sequence.readlines()[1:]
a= ''.join([x.strip() for x in a])

if "TATAAA" in a:
    print ("Found")
else:
    print("it wasn't in the genome, stupid code albeit")

And if you want to test whether the substring is at the start-end junction of the circular genome:

x= "TATAAA"
xl= len(x)-1
junct= a[len(a)-xl:] + a[0:xl]
if x in junct:
    ...
ADD COMMENT
1
Entering edit mode
8.5 years ago
John 13k

The fastest algorithm I can think of without reading the whole file into memory first would be:

seq = 'TGCACTGATG'
seq_len = len(seq)
with open('ecoli.fa', 'r') as ecoli_sequence:
    previous = ''
    for line,data in enumerate(ecoli_sequence):
        if seq not in previous + data:
            previous = data[-seq_len:-1]
        else:
            if seq in previous + data[:seq_len]:
                print "It's split over lines:", line, "and", line+1
            else:
                print "It's in line: ", line+1
            break

I'm not sure for small genomes it's any faster than reading the whole file into memory. Actually I'd say reading the whole file into memory is probably significantly faster than this. But then you have to deal with parsing out the \n's Speaking of which, if you have \r\n at the end of each line, not \n, the -1 has to be a -2. You can use strip put meh

ADD COMMENT

Login before adding your answer.

Traffic: 1103 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6