Biopython Motif To Find Patterns In Sequences
5
11.7 years ago
dami.gupta ▴ 10

I have a sequence file, and for each read I want to check for instances of one or more patterns. I will then demultiplex the file into several files based on which pattern is found in each read. I am trying the Biopython Motif module like this:

from Bio import Motif, SeqIO
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq

motifs = Motif.Motif(alphabet=IUPAC.unambiguous_dna)
motifs.add_instance(Seq('ATAGCATAG', motifs.alphabet))
motifs.add_instance(Seq('AAGCATAAG', motifs.alphabet))
motifs.add_instance(Seq('CTAGCAGGG', motifs.alphabet))

and so on. Then I loop over the file and write each read out to the appropriate file:

for read in SeqIO.parse(r_file, "fastq"):
    # search_instances yields (position, matching_instance) tuples
    for motif in motifs.search_instances(read.seq):
        motif_seq = str(motif[1])
        out_handle = dict_out_file_handles[motif_seq]
        out_handle.write(read.format("fastq"))

This is turning out to be rather slow. Is there a faster way to do this? Should I use plain regex?
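
(For reference, the plain-regex version I have in mind is roughly the sketch below, keeping the barcodes in a list and using only the first barcode found in each read; the names are just illustrative.)

import re

barcodes = ["ATAGCATAG", "AAGCATAAG", "CTAGCAGGG"]   # same patterns as above
pattern = re.compile("|".join(barcodes))

for read in SeqIO.parse(r_file, "fastq"):
    match = pattern.search(str(read.seq))
    if match:
        dict_out_file_handles[match.group(0)].write(read.format("fastq"))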

Thanks very much.

biopython motif next-gen • 8.5k views
ADD COMMENT
0

You're doing a lot of IO during the loops! Perhaps you could read the FASTQ file into memory first, push any reads in which a motif is identified onto a list, and then write them out to disk at the end?

ADD REPLY
1
11.7 years ago
Whetting ★ 1.6k

Hi, have you checked MotifMetrics?

MotifMetrics -- Kitchen sink of routines for evaluating motifs on genomes. This module is used both as a command-line script and as a repository for motif metrics. For a summary of the command-line usage, just type

python MotifMetrics.py

It may be able to help.

ADD COMMENT
1
11.7 years ago
Peter 6.0k

The first rule of optimisation is to profile the code to find out where it is slow. However, based on experience, I would guess that most of the time is overhead from reading the FASTQ records into SeqRecord objects (including decoding the quality scores), and doing the reverse when writing them out. You could stick with parsing the FASTQ records as plain strings instead; see http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
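
For instance, something along these lines keeps the records as plain strings via FastqGeneralIterator while leaving your motif search unchanged (a rough sketch, assuming r_file is a filename and reusing the motifs object and output-handle dict from your question):

from Bio.Seq import Seq
from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open(r_file) as in_handle:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        for pos, instance in motifs.search_instances(Seq(seq, motifs.alphabet)):
            out_handle = dict_out_file_handles[str(instance)]
            # write the record back out as a raw FASTQ string
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))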

ADD COMMENT
0
11.7 years ago
dami.gupta ▴ 10

Thanks - TAMO.MotifMetrics - not free :(

ADD COMMENT
0

From: http://bioinformatics.oxfordjournals.org/content/21/14/3164.full.pdf

"Availability: TAMO is a Python/C++ package and requires Python 2.3 or higher. Source code and documentation are available at http://web.wi.mit.edu/fraenkel/TAMO/ maybe contact the corresponding author?

EDIT: http://fraenkel.mit.edu/TAMO/

ADD REPLY
0
11.7 years ago

Further to my comment, something like this would remove the disk IO during the for loops:

reads = list(SeqIO.parse(r_file, "fastq"))

outseqs = []  # collect (barcode, read index) pairs
for i in range(len(reads)):
    for motif in motifs.search_instances(reads[i].seq):
        motif_seq = str(motif[1])
        outseqs.append([motif_seq, i])

# do all of the writing in one pass at the end
for seq in outseqs:
    out_handle = dict_out_file_handles[seq[0]]
    out_handle.write(reads[seq[1]].format("fastq"))
ADD COMMENT
0

Not very memory friendly however :S

ADD REPLY
0

My read file is huge. Creating a list would be too expensive

ADD REPLY
0

How huge is huge?

ADD REPLY
0
11.7 years ago
dami.gupta ▴ 10

I think that what is taking the time is motifs.search_instances(read.seq), i.e. the motif search itself. I have taken that out and am using a plain substring test instead:

if bcode in str(rseq):

...and it is much faster. But let me look at the FastqGeneralIterator now. Thanks everyone.
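
Roughly what I have in mind, putting the substring check together with FastqGeneralIterator (an untested sketch; barcodes is a plain list of the barcode strings, dict_out_file_handles is the same output-handle dict as before, and I am assuming r_file is a filename):

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open(r_file) as in_handle:
    for title, rseq, qual in FastqGeneralIterator(in_handle):
        for bcode in barcodes:
            if bcode in rseq:
                dict_out_file_handles[bcode].write(
                    "@%s\n%s\n+\n%s\n" % (title, rseq, qual))
                break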

ADD COMMENT
