Hi, I am new to Biopython and have been exploring the SeqIO module to iterate over a FASTA file, specifically for a regex (regular expression) task. I need to find PolyQ agglomerations (runs of consecutive Q residues) in the human proteome. Here is what I have right now (using the latest proteome from the NCBI FTP):
    import re
    from Bio import SeqIO

    def reader():
        # Compile the pattern once, outside the loop: three or more consecutive Qs
        poly_q = re.compile("QQQ+")
        for seq_record in SeqIO.parse("Gnomon_prot_micro.fsa", "fasta"):
            sequence = str(seq_record.seq)
            print(sequence)  # just to verify
            gene_name = seq_record.id
            print(gene_name)  # just to verify
            # finditer already scans the whole sequence, so no extra while loop
            # is needed (the original "while sequence:" never terminated)
            for m in poly_q.finditer(sequence):
                # start position of the run, the run itself, and how many Qs it has
                print(m.start(), m.group(), len(m.group()))
Is there a better way to iterate over the file to obtain only the sequences? (I will be relating every sequence to its respective gene name later.) Sorry for being a noob, and please correct me if anything is wrong.
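To make the goal concrete, here is a minimal sketch of the kind of result I am after: each gene name mapped to the start position and length of every PolyQ run. The helper names `find_polyq` and `polyq_by_gene` and the example records are made up for illustration; in practice the pairs would presumably come from `SeqIO.parse` as `(seq_record.id, str(seq_record.seq))`.

```python
import re

# Runs of three or more consecutive Qs (equivalent to 'QQQ+'), compiled once
POLY_Q = re.compile(r"Q{3,}")

def find_polyq(sequence):
    """Return (start, length) for every run of 3+ Qs; start is 0-based."""
    return [(m.start(), len(m.group())) for m in POLY_Q.finditer(sequence)]

def polyq_by_gene(records):
    """Map each record ID to its PolyQ runs, skipping records with none."""
    hits = {}
    for gene_name, sequence in records:
        runs = find_polyq(sequence)
        if runs:
            hits[gene_name] = runs
    return hits

# Made-up example data standing in for (seq_record.id, str(seq_record.seq))
example = [
    ("gene_A", "MAQQQQLPQQQ"),
    ("gene_B", "MKTAYIAK"),
]
print(polyq_by_gene(example))
```

Keeping the regex step separate from the file reading would let me test it on short strings before running it over the whole proteome.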