Biopython Motif To Find Patterns In Sequences
5
11.7 years ago
dami.gupta ▴ 10

I have a sequence file, and for each read I want to check for instances of one or more patterns. I will then demultiplex the file into several files based on which pattern is found in each read. I am trying the Biopython Motif module like this:

from Bio import Motif, SeqIO
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq

motifs = Motif.Motif(alphabet=IUPAC.unambiguous_dna)
motifs.add_instance(Seq('ATAGCATAG', motifs.alphabet))
motifs.add_instance(Seq('AAGCATAAG', motifs.alphabet))
motifs.add_instance(Seq('CTAGCAGGG', motifs.alphabet))

and so on. Then I loop over the file and write each read out to the appropriate file:

for read in SeqIO.parse(r_file, "fastq"):
    # search_instances yields (position, matching_instance) tuples
    for motif in motifs.search_instances(read.seq):
        motif_seq = str(motif[1])
        out_handle = dict_out_file_handles[motif_seq]
        out_handle.write(read.format("fastq"))

This is turning out to be rather slow. Is there a faster way to do this? Should I use plain regex?
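
(For reference, the plain-regex version I have in mind is roughly the sketch below, keeping the barcodes in a list and using only the first barcode found in each read; the names are just illustrative.)

import re

barcodes = ["ATAGCATAG", "AAGCATAAG", "CTAGCAGGG"]   # same patterns as above
pattern = re.compile("|".join(barcodes))

for read in SeqIO.parse(r_file, "fastq"):
    match = pattern.search(str(read.seq))
    if match:
        dict_out_file_handles[match.group(0)].write(read.format("fastq"))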

Thanks very much.

biopython motif next-gen • 8.5k views
ADD COMMENT
0

You're doing a lot of IO during the loops! Perhaps you could read the FASTQ file into memory first, push any reads in which a motif is identified onto a list, and then write them out to disk at the end?

ADD REPLY
1
11.7 years ago
Whetting ★ 1.6k

Hi, have you checked MotifMetrics?

MotifMetrics -- Kitchen sink of routines for evaluating motifs on genomes. This module is used both as a command-line script and as a repository for motif metrics. For a summary of the command-line usage, just type

python MotifMetrics.py

It may be able to help.

ADD COMMENT
1
11.7 years ago
Peter 6.0k

The first rule of optimisation is to profile the code to find out where it is slow. However, based on experience, I would guess that most of the time is overhead from reading the FASTQ records into SeqRecord objects (including decoding the quality scores), and doing the reverse when writing them out. You could stick with parsing the FASTQ records as plain strings instead; see http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
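
For instance, something along these lines keeps the records as plain strings via FastqGeneralIterator while leaving your motif search unchanged (a rough sketch, assuming r_file is a filename and reusing the motifs object and output-handle dict from your question):

from Bio.Seq import Seq
from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open(r_file) as in_handle:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        for pos, instance in motifs.search_instances(Seq(seq, motifs.alphabet)):
            out_handle = dict_out_file_handles[str(instance)]
            # write the record back out as a raw FASTQ string
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))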

ADD COMMENT
0
11.7 years ago
dami.gupta ▴ 10

Thanks - TAMO.MotifMetrics - not free :(

ADD COMMENT
0

From: http://bioinformatics.oxfordjournals.org/content/21/14/3164.full.pdf

"Availability: TAMO is a Python/C++ package and requires Python 2.3 or higher. Source code and documentation are available at http://web.wi.mit.edu/fraenkel/TAMO/ maybe contact the corresponding author?

EDIT: http://fraenkel.mit.edu/TAMO/

ADD REPLY
0
11.7 years ago

Further to my comment, something like this would remove the disk IO during the for loops:

reads = list(SeqIO.parse(r_file, "fastq"))

outseqs = []  # collect (barcode, read index) pairs
for i in range(len(reads)):
    for motif in motifs.search_instances(reads[i].seq):
        motif_seq = str(motif[1])
        outseqs.append([motif_seq, i])

# do all of the writing in one pass at the end
for seq in outseqs:
    out_handle = dict_out_file_handles[seq[0]]
    out_handle.write(reads[seq[1]].format("fastq"))
ADD COMMENT
0

Not very memory friendly however :S

ADD REPLY
0

My read file is huge. Creating a list would be too expensive

ADD REPLY
0

How huge is huge?

ADD REPLY
0
11.7 years ago
dami.gupta ▴ 10

I think that what is taking the time is motifs.search_instances(read.seq), i.e. the motif search itself. I have taken that out and am using a plain substring test instead:

if bcode in str(rseq):

...and it is much faster. But let me look at the FastqGeneralIterator now. Thanks everyone.
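
Roughly what I have in mind, putting the substring check together with FastqGeneralIterator (an untested sketch; barcodes is a plain list of the barcode strings, dict_out_file_handles is the same output-handle dict as before, and I am assuming r_file is a filename):

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open(r_file) as in_handle:
    for title, rseq, qual in FastqGeneralIterator(in_handle):
        for bcode in barcodes:
            if bcode in rseq:
                dict_out_file_handles[bcode].write(
                    "@%s\n%s\n+\n%s\n" % (title, rseq, qual))
                break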

ADD COMMENT
