Question: How to run a regex over SeqIO output, using str(seq_record.seq) as the string to search
3.9 years ago, zebudavid wrote:

Hi, I am new to Biopython and I've been exploring SeqIO to iterate over a FASTA file, specifically for a regex (regular expression) task. I need to find polyQ agglomerations in the human proteome. Here is what I have right now (using the latest proteome from the NCBI FTP):

import re
from Bio import SeqIO

def reader():
    for seq_record in SeqIO.parse("Gnomon_prot_micro.fsa", "fasta"):

        sequence = (str(seq_record.seq))
        print(sequence)   # just to verify
        gene_name = (seq_record.id)
        print(gene_name)   # just to verify
        compiler = re.compile('QQQ+')
        while sequence:  # do I need to start a new cycle to iterate over sequence?
            read = re.finditer(compiler, sequence)
            for m in read:
                print(m.start(), m.group())  # need to get bulk position and how many Qs

Is there a better way to iterate over the file to obtain only the sequences? (I will relate every sequence to its respective gene name later.) Sorry for being a noob; please correct me on anything.

Tags: regex, biopython, seqio

It's better to compile the pattern for re outside the for loop; you only need to do that once.

I'm not sure why you use the while loop... Which output do you aim to obtain?

written 3.9 years ago by WouterDeCoster

Hey, thanks for answering! The while loop is just there to sketch the idea. I aim to produce a document that displays, for every gene:

- gene name
- gene size (length of the sequence in the original FASTA)
- polyQ bulks: how many bulks were found in the gene, the position of each (relative to the sequence in the FASTA file), and how many Qs each bulk contains

Since I do not want a 'code service', could you just point me to what your suggestions would be?

written 3.9 years ago by zebudavid (modified 3.9 years ago by WouterDeCoster)

I think you should remove the while loop or adapt it. Since a non-empty sequence is always truthy and is never modified inside the loop, you effectively have an endless loop. What about using m.span() to get both start and end (which then also gives you the length)?
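A minimal sketch of the m.span() idea, run on a made-up string instead of a real proteome record. The helper name find_polyq and the example sequence are assumptions for illustration only; the pattern is the 'QQQ+' from the question. Note that positions are 0-based and the end index is exclusive.

```python
import re

# Pattern from the question: runs of three or more Q.
POLYQ = re.compile(r"QQQ+")

def find_polyq(sequence):
    """Return (start, end, length) for each poly-Q run in sequence."""
    runs = []
    for m in POLYQ.finditer(sequence):
        start, end = m.span()      # end is exclusive, so end - start is the run length
        runs.append((start, end, end - start))
    return runs

example = "MAQQQQLKQQPQQQ"         # hypothetical sequence, not real data
print(find_polyq(example))         # [(2, 6, 4), (11, 14, 3)]
```

The two-Q stretch in the middle is skipped because the pattern requires at least three.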

While looping over your iterator read you will have to increment a counter to track how many matches you have.
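One possible way to keep that counter is enumerate, shown here as a rough sketch. The function name count_polyq and its default pattern are hypothetical choices for this example, not part of the original code.

```python
import re

def count_polyq(sequence, pattern=re.compile(r"QQQ+")):
    """Print each run with a running number and return how many were found."""
    count = 0                      # stays 0 when there is no match at all
    for count, m in enumerate(pattern.finditer(sequence), start=1):
        # running number, 0-based start position, number of Qs in the run
        print(count, m.start(), len(m.group()))
    return count
```

Calling count_polyq("AQQQAQQQQ") prints one line per run and returns 2.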

Am I making sense?

written 3.9 years ago by WouterDeCoster

I found a working solution:

    import re
    from Bio import SeqIO
    compiler = re.compile('QQ+')

    def reader(x):
        # x is the path to a FASTA file
        for seq_record in SeqIO.parse(x, "fasta"):

            gene_name = (seq_record.id)
            print(gene_name)
            sequence = (str(seq_record.seq))

            read = re.finditer(compiler, sequence)

            for m in read:
                # start position and number of Qs in each run
                print(m.start(), len(m.group()))

Any further suggestions or comments for improvement?

written 3.9 years ago by zebudavid (modified 3.9 years ago)

I'm not sure why you put brackets around seq_record.id and str(seq_record.seq); they are harmless but unnecessary. It's not crucial, but I would move the compiler = re.compile('QQ+') line into the reader function, just above the for loop. A more precise function name would also be a good idea, but that's a detail.
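A sketch of what those suggestions could look like, assuming the function receives (gene_name, sequence) pairs so it runs without a FASTA file. The name report_polyq_runs and the toy records are made up for illustration.

```python
import re

def report_polyq_runs(records):
    """Yield (gene_name, start, run_length) for each poly-Q run.

    records is any iterable of (gene_name, sequence) string pairs.
    """
    polyq = re.compile(r"QQ+")     # compiled once, just above the loop
    for gene_name, sequence in records:
        for m in polyq.finditer(sequence):
            yield gene_name, m.start(), len(m.group())

# Hypothetical input. With Biopython the pairs could come from
# ((rec.id, str(rec.seq)) for rec in SeqIO.parse(path, "fasta")).
toy = [("geneA", "MQQA"), ("geneB", "AAQQQQ")]
for row in report_polyq_runs(toy):
    print(row)
```

Keeping the parsing outside the function is a design choice for this sketch only; passing the file path in, as reader(x) does, works just as well.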

You will now need to format your output nicely, but that should be straightforward.
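As one possible layout, here is a sketch that writes a single tab-separated line per gene with the fields asked for earlier in the thread: name, sequence length, number of runs, and a start:length pair for each run. The function name format_gene_report and the column layout are assumptions, not something the thread settled on.

```python
import re

def format_gene_report(gene_name, sequence, pattern=re.compile(r"QQ+")):
    """Build one tab-separated line: name, length, run count, start:length pairs."""
    runs = [(m.start(), len(m.group())) for m in pattern.finditer(sequence)]
    runs_text = ",".join("{}:{}".format(start, length) for start, length in runs)
    return "\t".join([gene_name, str(len(sequence)), str(len(runs)), runs_text])

# Hypothetical record, not real data.
print(format_gene_report("geneB", "AAQQQQKQQ"))
```

Start positions here are 0-based, as returned by m.start(); add 1 if the report should use 1-based protein coordinates.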

written 3.9 years ago by WouterDeCoster

Thanks! Yeah, I'll format it.

written 3.9 years ago by zebudavid