Question: How to extract DNA sequences for individual genes from GenBank complete genome files using biopython?
0
gravatar for cq151
4.5 years ago by
cq15110
China
cq15110 wrote:

Dear all,

I have downloaded complete mitochondrial genomes for one species from genbank and have written them to a file called myseqs.gbk (in genbank format - like a flat file). I now want to parse these genomes and extract only the sequence data relating to the cox3 gene, and save them as fasta files. In my first ever experience with biopython I tried to adapt the script kindly posted on this site last year (by microbeatic), but I am getting the error:

Traceback (most recent call last):
  File "parser2.py", line 21, in <module>
    print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
NameError: name 'start' is not defined

I am sure my mistakes are obvious to the regulars on this site and I would be very thankful for some advice. The script is below. I am guessing that I am not using the correct qualifiers and thereby filtering everything out.

Thank you very much.

 

TB

from Bio import SeqIO, SeqFeature
import sys

gbank=SeqIO.parse('myseqs.gbk','genbank')

for genome in gbank:

    for gene in genome.features:
        if(gene.type =="CDS"):
            if 'gene' in gene.qualifiers:
                if 'cox3' in gene.qualifiers['gene'][0].lower():
                    start=gene.location.start.position
                    end=gene.location.end.position
                if 'db_xref' in gene.qualifiers:
                    gi=[]
                    gi=str(gene.qualifiers['db_xref'])
                    gi=gi.split(":")[1]
                    gi=gi.split("'")[0]
                    print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
                else:
                    print ">GeneId|NoGenID|COX3|%s\n%s" % (genome.description,genome.seq[start:end])
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by cq15110
3
gravatar for Tariq Daouda
4.5 years ago by
Tariq Daouda210
IRIC | Institute for Research in Immunology and Cancer
Tariq Daouda210 wrote:

Hey,

I'm the developer of pyGeno, If you don't mind switching to another python package, you could try it.

If you're not working on humans, you will have to make you own pyGeno datawrap, but the process is easy if the data you need is available from Ensembl.

Then these few lines might do the trick:

form pyGeno.Genome import *

ref = Genome(name = "GRCh37.75")
cox = ref.get(Gene, name = "COX3")[0]

for trans in cox.get(Transcript) :

  print trans.cDNA, trans.sequence #get the sequences

pyGeno's doc: http://bioinfo.iric.ca/~daoudat/pyGeno/

Cheers,

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Tariq Daouda210
2
gravatar for Peter
4.5 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

The NameError is due to you only defining start and end if it was a cox gene. Probably you wanted to indent the last block of code to also only happen for cox genes?

Also try reading http://www.warwick.ac.uk/go/peter_cock/python/genbank/ which has examples.

In particular, your approach of getting the sequence using start and end will only work on nice genes - for anything complicated like a ribosomal slippage or an intron/exon gene model, I would suggest you use gene.extract(genome.seq)

Try:

from Bio import SeqIO

for genome in SeqIO.parse('myseqs.gbk','genbank'):
    for gene in gnome.features:
        if gene.type != "CDS":
            continue
        if 'gene' not in gene.qualifiers:
            continue
        if 'cox3' not in gene.qualifiers['gene'][0].lower():
            continue
        gene_seq = gene.extract(genome.seq)
        if 'db_xref' in gene.qualifiers:
            gi = str(gene.qualifiers['db_xref’]).split(“:”)[1].split("'")[0]
        else:
            gi = "NoGenID"
        print ">GeneId|%s|COX3|%s\n%s" % (gi, genome.description, gene_seq)
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Peter5.8k
1
gravatar for cq151
4.5 years ago by
cq15110
China
cq15110 wrote:

Thank you very much Peter and Tariq!

As Peter pointed out, I had forgotten to indent the final block of code - after indenting, the original script worked perfectly. I am dealing with "nice" genes at the moment; however, I will bear in mind the gene.extract(genome.seq) approach should I have to deal with more messy genes - thank you Peter.

Thank you Tariq for the suggestion of pyGeno - I will look into that, it looks neat, particularly when I have to deal with human sequences.

The original problem is now solved - and so quickly.

Many thanks

TB

ADD COMMENTlink written 4.5 years ago by cq15110

You're welcome. Let me know if you need it for another species I can make you a datawrap that I'll then put on the website. You can send me an email or reach me via twitter (http://bioinfo.iric.ca/~daoudat). 

Cheers,

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Tariq Daouda210
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 667 users visited in the last hour