Question

How to iterate with a regex over SeqIO output: using seq_record.seq as the string to search

7.4 years ago

zebudavid ▴ 10

Hi, I am new to Biopython and I've been exploring the SeqIO module to iterate over a FASTA file, specifically for a regex (regular expression) task. I need to find polyQ agglomerations in the human proteome. Here is what I have right now (using the latest proteome from the NCBI FTP site):

import re
from Bio import SeqIO

def reader():
    for seq_record in SeqIO.parse("Gnomon_prot_micro.fsa", "fasta"):
        sequence = (str(seq_record.seq))
        print(sequence)  # just to verify
        gene_name = (seq_record.id)
        print(gene_name)  # just to verify
        compiler = re.compile('QQQ+')
        while sequence:  # do I need to start a new cycle to iterate over sequence?
            read = re.finditer(compiler, sequence)
            for m in read:
                print(m.start(), m.group())  # need to get bulk position and how many Qs

Is there a better way to iterate over the file to obtain only the sequences? (I will relate every sequence to its gene name later.) Sorry for being a noob, and please correct me if anything is wrong.

biopython SeqIO Regex regex • 2.7k views

ADD COMMENT • link 7.4 years ago by zebudavid ▴ 10

It's better to compile the pattern outside the for loop; you only need to do that once.
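A minimal sketch of that suggestion, assuming the 'QQQ+' pattern from the question. The name POLYQ, the helper find_polyq_starts, and the toy sequences are made up for illustration:

```python
import re

# Compiled once at module level and reused for every sequence.
POLYQ = re.compile("QQQ+")  # three or more consecutive Q residues


def find_polyq_starts(sequence):
    """Return the 0-based start position of every polyQ run in one sequence."""
    return [m.start() for m in POLYQ.finditer(sequence)]


# Toy sequences standing in for the records that SeqIO.parse would yield.
toy_sequences = ["MAQQQQLK", "MKQQA", "QQQAAQQQQQ"]
for seq in toy_sequences:
    print(find_polyq_starts(seq))
```

The compiled pattern object has its own finditer method, so it does not need to be passed back through re.finditer.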

I'm not sure why you use the while loop... Which output do you aim to obtain?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

Hey, thanks for answering! The while loop is just there for the idea. I aim to obtain a document that displays, for every gene:

- gene name
- gene size (length of the sequence in the original FASTA)
- polyQ bulks: how many bulks were found in the gene, the position of each (relative to the sequence in the FASTA file), and how many Qs each bulk contains

Since I do not want a 'code service', could you please just point me to what your suggestions would be?

ADD REPLY • link updated 7.4 years ago by WouterDeCoster 47k • written 7.4 years ago by zebudavid ▴ 10

I think you should remove the while loop or adapt it. A non-empty sequence string is always truthy and never changes inside the loop, so you effectively have an endless loop here. What about using m.span() to get both start and end (which then also gives you the length)?

While looping over your iterator read, you will have to increment a counter to track how many matches you have.
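To make the two hints concrete, here is a rough sketch, again assuming the 'QQQ+' pattern from the question; polyq_runs is a hypothetical helper name. m.span() returns (start, end) with end exclusive, so end - start is the run length, and the counter goes up once per match:

```python
import re

POLYQ = re.compile("QQQ+")  # assumed pattern: runs of three or more Q


def polyq_runs(sequence):
    """Return (count, runs); each run is (start, end, length), 0-based, end exclusive."""
    count = 0
    runs = []
    for m in POLYQ.finditer(sequence):
        start, end = m.span()
        runs.append((start, end, end - start))
        count += 1  # one more run found
    return count, runs


count, runs = polyq_runs("MQQQQAAQQQK")
print(count, runs)
```

Taking len(runs) would give the same number; the explicit counter just mirrors the suggestion.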

Am I making sense?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

I found a working solution:

    import re
    from Bio import SeqIO
    compiler = re.compile('QQ+')

    def reader(x):
        for seq_record in SeqIO.parse(x, "fasta"):
            gene_name = (seq_record.id)
            print(gene_name)
            sequence = (str(seq_record.seq))

            read = re.finditer(compiler, sequence)

            for m in read:
                # 0-based start position and number of Qs in the run
                print(m.start(), len(m.group()))

Any further suggestions or comments to improve it?

ADD REPLY • link 7.4 years ago by zebudavid ▴ 10

I'm not sure why you add brackets around seq_record.id and str(seq_record.seq), but it shouldn't matter. It's not crucial, but I would move the compiler = re.compile('QQ+') line to just above the for loop, inside the reader function. A more precise function name would also be a good idea, but that's a detail.

You will now need to format your output nicely, but that's straightforward, I guess.
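One possible way to lay out that output, as a sketch and not a finished tool. It uses the 'QQ+' pattern from the working solution; the function names (polyq_report, fasta_records), the tab-separated layout, and the 1-based positions are all assumptions chosen for illustration. The report function takes plain (name, sequence) pairs so it can be tried without a FASTA file, and fasta_records wraps SeqIO.parse for real input:

```python
import re

POLYQ = re.compile("QQ+")  # pattern from the working solution: two or more Q


def fasta_records(path):
    """Yield (gene_name, sequence) pairs from a FASTA file via Biopython."""
    from Bio import SeqIO  # imported here so the rest runs without Biopython

    for seq_record in SeqIO.parse(path, "fasta"):
        yield seq_record.id, str(seq_record.seq)


def polyq_report(records):
    """Build one tab-separated line per gene: name, size, run count, runs."""
    lines = []
    for gene_name, sequence in records:
        # 1-based start position and run length for each polyQ run
        runs = [(m.start() + 1, len(m.group())) for m in POLYQ.finditer(sequence)]
        runs_text = ";".join("{}:{}".format(pos, n) for pos, n in runs) or "-"
        lines.append(
            "{}\t{}\t{}\t{}".format(gene_name, len(sequence), len(runs), runs_text)
        )
    return lines


# Toy records; with a real file, pass fasta_records("Gnomon_prot_micro.fsa") instead.
toy = [("geneA", "MQQQAKQQ"), ("geneB", "MAKL")]
for line in polyq_report(toy):
    print(line)
```

Each run is written as position:length, and a gene without any run gets a "-" in the last column so every line has the same number of fields.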