Question

Regular expression matching with Python and biopython SeqIO

0

9.4 years ago

Ian 6.0k

After many years of using Perl I am starting to learn Python. As an example, I want to perform regular expression matching on sequences extracted from a FASTA file, which is parsed with Biopython's SeqIO module. In the following code, re.findall fails to find iupac in seq_record.seq; however, if the latter is replaced with a string, e.g. 'TTAATT', a match is found. The error is: TypeError: expected string or buffer.

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = "fasta.fa"

# pattern to search for
iupac = "taat"

# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    print "Sequence ID: ", seq_record.id, "; ", len(seq_record), "bp"
    print seq_record.seq

    # scan for IUPAC; re.I makes search case-insensitive
    matches = re.findall( iupac, seq_record.seq, re.I)
    if matches:
        print "Matches = ", len(matches)

Thanks for any guidance!

regular-expression python biopython • 8.7k views

updated 3.0 years ago by Ram 43k • written 9.4 years ago by Ian 6.0k

0

Hey!

How do I print the coordinates of each match?

updated 3.0 years ago by Ram 43k • written 5.4 years ago by shubhra.bhattacharya ▴ 140
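The thread does not answer the coordinates question. One possible approach, offered here as a sketch and not as something the original posters suggested, is to use re.finditer in place of re.findall, since each match object it yields carries its own start and end position. The helper name motif_coordinates is made up for illustration, and the code is written for Python 3, unlike the Python 2 print statements in the question.

```python
import re


def motif_coordinates(sequence, pattern):
    """Return (start, end, matched_text) for each case-insensitive match.

    Hypothetical helper for illustration. Coordinates are 0-based and the
    end is exclusive, following Python slicing conventions.
    """
    # str() makes this work for a Biopython Seq as well as a plain string.
    text = str(sequence)
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text, re.I)]


# Example with a plain string; pass str(seq_record.seq) or seq_record.seq
# when looping over SeqIO.parse(...) records.
for start, end, hit in motif_coordinates("TTAATT", "taat"):
    print("match", hit, "at", start, "to", end)
```

Note that re.finditer, like re.findall, reports non-overlapping matches only, so overlapping occurrences of a motif would need a different pattern (for instance a lookahead).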

Ram · Accepted Answer · 2014-11-27

4

9.4 years ago

Peter 6.0k

The Biopython Seq object is string-like, but it is not a string, and re.findall expects a real string, hence the TypeError. Convert the sequence explicitly: replace re.findall( iupac, seq_record.seq, re.I) with re.findall( iupac, str(seq_record.seq), re.I).
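A minimal sketch of that fix applied to the loop from the question, written for Python 3 (the question's code uses Python 2 print statements). The function names find_motif and report_matches are invented for this example; SeqIO.parse and the "fasta" format name come from the question itself.

```python
import re


def find_motif(sequence, pattern):
    """Return all case-insensitive matches of pattern in sequence.

    Hypothetical helper: str() converts a Biopython Seq (or any other
    string-like object) into a real str before it reaches the re module.
    """
    return re.findall(pattern, str(sequence), re.I)


def report_matches(fasta_path, pattern):
    """Print the ID, length, and match count of each record in a FASTA file."""
    # Imported here so find_motif can be used without Biopython installed.
    from Bio import SeqIO

    for seq_record in SeqIO.parse(fasta_path, "fasta"):
        print("Sequence ID:", seq_record.id, ";", len(seq_record), "bp")
        matches = find_motif(seq_record.seq, pattern)
        if matches:
            print("Matches =", len(matches))
```

With the file name and pattern from the question, this would be called as report_matches("fasta.fa", "taat").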

updated 3.0 years ago by Ram 43k • written 9.4 years ago by Peter 6.0k

0

Thank you! I thought I had already tried that, but it is now working.

updated 3.0 years ago by Ram 43k • written 9.4 years ago by Ian 6.0k