Question: Regular expression matching with Python and biopython SeqIO
Asked 4.9 years ago by Ian (University of Manchester, UK):

After many years of using Perl I am starting to learn Python. As an example, I want to perform regular expression matching on sequences extracted from a FASTA file, which is parsed with Biopython's SeqIO module. In the following code, 're.findall' fails when searching for 'iupac' in 'seq_record.seq'; however, if the latter is replaced with a string, e.g. 'TTAATT', a match is found. The error is "TypeError: expected string or buffer".

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = "fasta.fa"

# pattern to search for
iupac = "taat"

# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    print "Sequence ID: ", seq_record.id, "; ", len(seq_record), "bp"
    print seq_record.seq

    # scan for IUPAC; re.I makes search case-insensitive
    matches = re.findall( iupac, seq_record.seq, re.I)
    if matches:
        print "Matches = ", len(matches)

Thanks for any guidance!


Hey! How do I print the coordinates of each match?

Comment written 10 months ago by shubhra.bhattacharya
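One possible way to get the coordinates asked about above is 're.finditer', which yields match objects carrying 'start()' and 'end()' positions. The sketch below is written for Python 3 and works on a plain string; the helper name 'find_match_spans' and the example sequence are made up for illustration. With Biopython you would pass 'str(seq_record.seq)' as the sequence.

```python
import re


def find_match_spans(pattern, sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Coordinates are 0-based and end-exclusive, as in Python slicing.
    `find_match_spans` is a hypothetical helper, not part of Biopython.
    """
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(pattern, sequence, re.I)  # re.I: ignore case
    ]


# Made-up example sequence; in the original script this would be
# str(seq_record.seq).
example_seq = "ggTTAATTccTAATgg"
spans = find_match_spans("taat", example_seq)

for start, end, text in spans:
    print("match", text, "at", start, "-", end)
```

Note that 're.finditer' reports non-overlapping matches only, and the positions are 0-based; add 1 to 'start' if you want 1-based sequence coordinates.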
Answer written 4.9 years ago by Peter (Scotland, UK):

The Biopython Seq object is string-like, but it is not a string, and the re functions expect an actual string. Convert it explicitly: replace re.findall(iupac, seq_record.seq, re.I) with re.findall(iupac, str(seq_record.seq), re.I).
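To illustrate the point without needing a FASTA file, here is a minimal Python 3 sketch. 'FakeSeq' is a hypothetical stand-in for a string-like object that is not a 'str'; it is not Biopython's actual Seq class, and the sequence is made up. It only shows why the 're' module rejects such an object and why wrapping it in 'str()' works.

```python
import re


class FakeSeq:
    """Hypothetical stand-in for a string-like sequence object.

    It can be converted with str(), but it is not itself a str,
    so the re module refuses to search it directly.
    """

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data


seq = FakeSeq("ggTTAATTccTAATgg")  # made-up sequence
iupac = "taat"

# Passing the object directly raises TypeError, as in the question.
try:
    re.findall(iupac, seq, re.I)
    raised = False
except TypeError as err:
    raised = True
    print("re rejected the object:", err)

# Converting to a real string first makes the search work.
matches = re.findall(iupac, str(seq), re.I)
print("Matches =", len(matches))
```

In the original script the only change needed is the 'str(...)' call around 'seq_record.seq'; the rest of the loop can stay as it is.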


Thank you!  I thought I had already tried that, but it is now working.

Comment written 4.9 years ago by Ian