Question: biopython string object
bsp017 (Denmark, Copenhagen, UCPH) wrote, 4 weeks ago:

Hi all,

I'm trying to incorporate a regular expression command in a Biopython script. This produces an error:

AttributeError: 'str' object has no attribute 'id'

What I would like to do is to match a pattern within a Fasta file and replace the matching characters with other characters.

From this:

>BA_03462|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

To this:

>BA|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

Using the re module I can find and replace the pattern with this command:

matches = re.findall(r'_(.....)', str(seq_record))
for m in matches:
    change = str(seq_record), faa_filename.replace('_%s' % m, ' ')

The complete function is here:

def change_string():
    with open('outfile_padded.fasta') as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            seq_record.id = seq_record.description = matches = re.findall(r'_(.....)', str(seq_record))
            for m in matches:
                change = str(seq_record), faa_filename.replace('_%s' % m, ' ')
    SeqIO.write(change, 'string.fasta', "fasta")
change_string()

However, the attribute error arises because Biopython wants a string-like object, but re wants a string. I've tried to modify the script but cannot find a way to please both modules.

Does anyone know a solution to this?

Thanks,

James

Environment: Python 3.6.8 :: Anaconda, Inc.; biopython==1.73; Red Hat 4.8.5-36


Do you absolutely need to use python? Would it not be easier to just use sed? Also, why not use re.sub(..., count=0)?

Reply written 4 weeks ago by RamRS (modified 4 weeks ago)

Building on RamRS's comment, why even use Biopython/SeqIO? Can't you just treat your data as a standard text file and blow through it line-by-line, avoiding any overhead from SeqIO.parse() (only really matters if your fasta is large)? I would also use sed for a quick turnaround.

Reply written 4 weeks ago by Brice Sarver
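The line-by-line approach suggested above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper name `strip_tags` and the in-memory list `fasta_lines` are made up for the example, and the pattern assumes the tag is always an underscore followed by five digits, as in the header shown in the question.

```python
import re

# Underscore followed by five digits, as in ">BA_03462|gyrB ..."
TAG = re.compile(r"_\d{5}")


def strip_tags(lines):
    """Yield lines unchanged, except headers, which lose their first tag."""
    for line in lines:
        if line.startswith(">"):
            # count=1 removes only the first match in the header
            yield TAG.sub("", line, count=1)
        else:
            yield line


# Small in-memory example; a real script would iterate over an open file.
fasta_lines = [
    ">BA_03462|gyrB Brenneria alni strain NCPPB\n",
    "ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT\n",
]
cleaned = list(strip_tags(fasta_lines))
print("".join(cleaned), end="")
```

Sequence lines pass through untouched, so a stray match inside a sequence or quality string cannot be altered by accident.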

While it is probably fine to do so in this case, I would contend that the better general advice is to always use a well-trusted parser whenever possible...

Reply written 4 weeks ago by Joe

Yes, it would probably be easier to use a sed or awk command. I was trying to keep this part of my pipeline in Python so that everything stays in a single Python script, and I also want to learn more Python.

Would the re.sub command avoid using findall and replace?

Reply written 4 weeks ago by bsp017

Finding matches to a regular expression and substituting them is exactly what re.sub does, so it is the first thing that comes to my mind; the substitution here is not complex enough to warrant a find/match followed by a bunch of steps. From a cursory glance at the re documentation (I don't use Python), it seems the replacement argument can also be a function, which would address even complicated substitution problems. I see no reason not to use re.sub.

Reply written 4 weeks ago by RamRS
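As a rough illustration of the two forms of re.sub mentioned above, the sketch below applies both to the header from the question. The callback name `keep_prefix` and the second pattern are assumptions made for the example, not something taken from the thread. The default `count=0` replaces every match.

```python
import re

header = ">BA_03462|gyrB Brenneria alni strain NCPPB"

# String replacement: delete the underscore and the five digits.
simple = re.sub(r"_\d{5}", "", header)


def keep_prefix(match):
    """Hypothetical callback: keep the captured letters, drop the numeric tag."""
    return match.group(1)


# Function replacement: re.sub calls keep_prefix once for each match
# and inserts whatever string it returns.
with_function = re.sub(r"([A-Za-z]+)_\d{5}", keep_prefix, header)

print(simple)
print(with_function)
```

For this header both calls give the same result; the function form only pays off when the replacement has to be computed from the match.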
Joe (United Kingdom) wrote, 4 weeks ago:

There are a couple of problems here I think.

Firstly, the error you're getting isn't saying what you think it is. It says that somewhere you're trying to access the attribute id on an object (here a str) that has no such attribute. It is not complaining about an unexpected string.

I'm guessing this has something to do with the following line, where a lot is going on at once, which is kind of asking for trouble:

seq_record.id = seq_record.description = matches = re.findall(r'_(.....)', str(seq_record))
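To illustrate what the message means, here is a sketch that does not use Biopython at all. `write_ids` is a hypothetical stand-in for any writer that reads `.id` from each record it is given; handing it a tuple of plain strings, which is what the `change` variable in the question holds, reproduces the same message. That the question's `SeqIO.write(change, ...)` call fails for this reason is an assumption, not something confirmed in the thread.

```python
def write_ids(records):
    """Hypothetical writer: collects the .id of every record it receives."""
    return [rec.id for rec in records]


# A tuple of two plain strings, like the value built in the question
change = ("ID: BA_03462|gyrB ...", "some_file.faa")

try:
    write_ids(change)
except AttributeError as err:
    message = str(err)
    print(message)
```

The object being asked for `.id` is a str, not a record, which is why the fix is to keep working with the record objects and only change their text attributes.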

All you really need to do is the following (assuming your fasta formatting never deviates). I've also changed your regex to be a bit more stringent.

import re
import sys
from Bio import SeqIO

# Underscore followed by exactly five digits, e.g. "_03462"
regex = re.compile(r"_\d{5}")

for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    match = regex.search(rec.description)
    if match:  # leave headers without the pattern unchanged
        rec.description = rec.description.replace(match.group(), "")
    print(">" + rec.description)
    print(str(rec.seq))

Input:

>BA_03462|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

Script: python scriptname.py sequences.fasta

Output:

>BA|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

Or even more simply, using re.sub:

import re
import sys
from Bio import SeqIO

regex = re.compile(r"_\d{5}")

for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    # Remove every occurrence of the pattern from the header
    rec.description = regex.sub("", rec.description)
    print(">" + rec.description)
    print(str(rec.seq))

Works perfectly! Thanks

Reply written 4 weeks ago by bsp017

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.


Reply written 4 weeks ago by RamRS