8.5 years ago
auryndb
I wrote some Python code to read a DNA sequence and do a motif alignment on it, but I'm looking for a more efficient way to do this. See below if you can help:
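# Read the FASTA file, skip the header line, and join the sequence lines
with open("a.fas.txt", "r") as handle:
    a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
with open("Output.txt", "w") as text_file:
    text_file.write(a)

# Slide a 100 bp window along the sequence, one base at a time
f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z] + '\n'
    f += 1
    z += 1
with open("2.txt", "w") as runner_mtfs:
    runner_mtfs.write(b)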
I want to do a bunch of analysis on each line of b, but I don't know of a more efficient way to do this. The output file is more than 500 megabytes. Any suggestions? The first file is just a DNA sequence. In the first lines of code I join all the sequence lines together, and then I extract a 100 base pair window at every position so I can do analysis on it.
Python is pretty slow, particularly at tasks involving I/O and string processing. You'd get a huge speedup using C to process arrays of strings and running an analysis on the substrings within memory (if possible).
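The in-memory idea can also be tried in Python before moving to C. The following is only a minimal sketch, assuming a single-record FASTA file: it yields each 100 bp window from a generator, so neither the large string b nor the 500 MB file has to be built. The function names and the GC-content analysis are hypothetical placeholders, not part of the original code.

```python
from typing import Iterator


def read_sequence(path: str) -> str:
    """Join all sequence lines of a single-record FASTA file, skipping the header."""
    with open(path) as handle:
        next(handle, None)  # skip the '>' header line
        return "".join(line.strip() for line in handle)


def sliding_windows(seq: str, size: int = 100, step: int = 1) -> Iterator[str]:
    """Yield every full-length window; truncated windows at the end are skipped."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def gc_fraction(window: str) -> float:
    """Placeholder analysis (hypothetical): fraction of G/C bases in a window."""
    return (window.count("G") + window.count("C")) / len(window)
```

Usage would look something like `for w in sliding_windows(read_sequence("a.fas.txt")): result = gc_fraction(w)`. Note that, unlike the original loop, this sketch stops at the last full-length window instead of emitting the shorter fragments at the end of the sequence.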
So you are generating 100bp fragments from the initial string, with a sliding window of 1bp, to find motifs?
You might be interested in ACGTrie.
It's written in Python, but does most of its work in C arrays using either ctypes, NumPy, or CFFI. It's pretty fast if you can use PyPy. The end result is a table/trie that contains all possible DNA substrings and their counts in a very space-efficient format. You could also use a k-mer tool for 100 bp k-mers. That might be the more established path, since ACGTrie is only a proof of concept.
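To make the k-mer idea concrete, here is a rough sketch of k-mer counting with the standard library's collections.Counter. It is not how ACGTrie or any dedicated k-mer tool works internally, and the function name count_kmers is made up for illustration. For k = 100 on a long sequence most k-mers are likely to be unique, so a plain dictionary like this would probably use a lot of memory; it is mainly useful for small k or short inputs.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (simple sketch, not memory-optimised)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy example with a short, made-up sequence
counts = count_kmers("ACGTACGTAC", k=4)
print(counts.most_common(2))
```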
Is there a specific reason you are not using something like the MEME suite (http://meme-suite.org/) for motif finding? In my experience, both the web submit and local version were fast and easy to use.