Question: How To Extract Rrna Sequences (In Fasta Format) From Genbank (Bacterial Genome)?
gravatar for microbeatic
7.2 years ago by
hanover, NH
microbeatic80 wrote:

In the genbank files rRNA only have following features to it:

   rRNA              67149..68657
                        /product="16S ribosomal RNA (operon 1)"

unlike CDS which has,

  CDS             74072..74647
                 /product="hypothetical protein"

so i can just use


to get the amino acid sequences.

But, how do i get the nucleotide sequences for any gene or rRNA genes?

I realize the whole nucleotide is listed at the bottom of the genbank file, and probably location information can be used to extract the sequence. But, i think there should be much simpler way.

Any help would be great.

ADD COMMENTlink modified 7.2 years ago by Devon Ryan97k • written 7.2 years ago by microbeatic80

it would surprise me if biopython didn't have genbank parser. Otherwise, just parse the nuc sequence (e.g. from fasta file) and use substring, to extract the location. Mind the gap, that is substring coordinates are possibly 0 based, and genbank possibly 1 based

ADD REPLYlink written 7.2 years ago by Michael Dondrup48k

yes, maybe its because there are different variants of genbank. The one i am dealing with are downloaded from NCBI . The second option is possible but something possibly should be in the parser that i am obviously missing.

ADD REPLYlink written 7.2 years ago by microbeatic80
gravatar for Devon Ryan
7.2 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

Have a look at how to extract just 'CDS' from genbank file into another genbank file? from earlier today. The only major change you would have to make is to check for rRNA (as a feature) instead of CDS and then write the record only if it wasn't found.

Edit: This is based off of the script you wrote in the other post that you made. There are likely faster ways to do this, but this is the easiest modification that I could think of given the structure of that script. I don't know what constitutes a useful ID or description for you, so you may have to change those (I also never checked which qualifiers are always present in the Genbank format).

#!/usr/bin/env python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

gbank = SeqIO.parse(open(sys.argv[1], "r"), "genbank")

rRNAs = []
for genome in gbank :
    for feature in genome.features :
        if(feature.type == "rRNA") :
            ID = feature.qualifiers['db_xref'][0]
            desc = feature.qualifiers['locus_tag'][0]
            seq = feature.extract(genome.seq)
            record = SeqRecord(seq, id=ID, description=desc)

SeqIO.write(rRNAs, output_handle, "fasta")
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Devon Ryan97k

btw: you can paste a URL like so How To Extract Just 'Cds' From Genbank File Into Another Genbank File? and it will be autolinked by title

ADD REPLYlink written 7.2 years ago by Istvan Albert ♦♦ 85k

Thanks! I just noticed that from your revision, not sure how I didn't figure that out earlier.

ADD REPLYlink written 7.2 years ago by Devon Ryan97k

Actually, I asked that question as well :p.

I did that but it prints out the whole genome, not just one gene. Does it have to do with the sequence being not associated directly with the rrna feature?

ADD REPLYlink written 7.2 years ago by microbeatic80

I updated my answer to include a modification of the script you posted previously. This will likely work, though you may need to modify it to actually be useful.

ADD REPLYlink written 7.2 years ago by Devon Ryan97k

Great. Thanks. here is the code that i have which was modified to extract 16S specifically.

from Bio import SeqIO, SeqFeature
import sys


for genome in gbank:
for gene in genome.features:
    if gene.type=="rRNA": 
        if 'product' in gene.qualifiers:
            if '16S' in gene.qualifiers['product'][0]:
                start = gene.location.nofuzzy_start
                end = gene.location.nofuzzy_end
                if 'db_xref' in gene.qualifiers:
                    print ">GeneId|%s|16S rRNA|%s\n%s" % (gi,genome.description,genome.seq[start:end])
                    print ">GeneId|NoGenID|16S rRNA|%s\n%s" % (genome.description,genome.seq[start:end])
ADD REPLYlink modified 7.2 years ago • written 7.2 years ago by microbeatic80

Cool, glad that helped! BTW, should you ever need the fasta output from that to be formatted such that all of the lines are the same length (this is required by some programs, such as samtools), you can just convert your print ... lines to create SeqRecords instead and then have SeqIO.write deal with the formatting for you (see the last chunk of my code).

ADD REPLYlink written 7.2 years ago by Devon Ryan97k

Sounds good. I am pretty new to parsing using python and biopython, but it has been fun. Do you know a link/book/blog that explains parsing other than biopython webpage...just curious.

ADD REPLYlink written 7.2 years ago by microbeatic80

I don't, unfortunately (the only programming books I have are for Fortran, C, and C++). You might search this site since there have been a few posts regarding book recommendations.

ADD REPLYlink written 7.2 years ago by Devon Ryan97k

How to convert print line to SeqRecord in order to save fasta file?

I tried replacing print line with the following code

genome =, genome.description.split(',')[0], genome.seq[start:end])
SeqIO.write(genome, 'test.fasta', "fasta")

getting the error

AttributeError: 'str' object has no attribute 'id'
ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by youki0

Recently I am trying to learn biopython working with GenBank file but I am very confused by the relationship between GenBank format and biopython parser...For example, my genbank file has "locus_tag" info but why biopython can not get access to it? A example: gene complement(3015679..3037424) /gene=ENSDARG00000043854 /locus_tag="ppil4" /note="peptidylprolyl isomerase (cyclophilin)-like 4 [Source:ZFIN;Acc:ZDB-GENE-030131-6251]" mRNA join(complement(3037328..3037424), complement(3036319..3036386),complement(3035701..3035765), complement(3035066..3035183),complement(3031417..3031559), complement(3029838..3029934),complement(3028920..3029036), complement(3028691..3028815),complement(3025931..3025997), complement(3023570..3023681),complement(3020984..3021080), complement(3020081..3020198),complement(3015679..3016986)) /gene="ENSDARG00000043854" /note="transcript_id=ENSDART00000064390" CDS join(complement(3037328..3037397), complement(3036319..3036386),complement(3035701..3035765), complement(3035066..3035183),complement(3031417..3031559), complement(3029838..3029934),complement(3028920..3029036), complement(3028691..3028815),complement(3025931..3025997),

When I try to use feature.qualifiers["locus_tag"], there is KeyError. How can I get /locus_tag="ppil4" by using biopython? Sorry the comment format is a little mess. The GenBank file I use:

By the way, is there any website explain more about biopython genbank parser? I searched a lot but failed in finding in-depth documentation. Did you learn by reading source code?

ADD REPLYlink written 3.5 years ago by AlicePsyche30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1809 users visited in the last hour