How To Extract Rrna Sequences (In Fasta Format) From Genbank (Bacterial Genome)?
1
1
Entering edit mode
8.3 years ago
microbeatic ▴ 80

In the genbank files rRNA only have following features to it:

   rRNA              67149..68657
/product="16S ribosomal RNA (operon 1)"
/db_xref="GeneID:7371836"


unlike CDS which has,

  CDS             74072..74647
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="YP_002542787.1"
/db_xref="GI:222084261"
/db_xref="GeneID:7371837"
/translation="MTARGIARLVELRDAGVTAATMSRMERDGEVLRLARGLYQLSDA
PLDANHSLAEAAKRLPKGVVCLVSALAFHGLTDQLPKQVWLAIGRKDWAPKPDSTPIR
KATPGEIARQAERGGVATVIRPYIEALTANG"


so i can just use

seq_feature.qualifiers['translation'][0]


to get the amino acid sequences.

But, how do i get the nucleotide sequences for any gene or rRNA genes?

I realize the whole nucleotide is listed at the bottom of the genbank file, and probably location information can be used to extract the sequence. But, i think there should be much simpler way.

Any help would be great.

biopython genbank rrna nucleotide fasta • 9.7k views
1
Entering edit mode

it would surprise me if biopython didn't have genbank parser. Otherwise, just parse the nuc sequence (e.g. from fasta file) and use substring, to extract the location. Mind the gap, that is substring coordinates are possibly 0 based, and genbank possibly 1 based

0
Entering edit mode

yes, maybe its because there are different variants of genbank. The one i am dealing with are downloaded from NCBI ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ . The second option is possible but something possibly should be in the parser that i am obviously missing.

1
Entering edit mode
8.3 years ago

Have a look at how to extract just 'CDS' from genbank file into another genbank file? from earlier today. The only major change you would have to make is to check for rRNA (as a feature) instead of CDS and then write the record only if it wasn't found.

Edit: This is based off of the script you wrote in the other post that you made. There are likely faster ways to do this, but this is the easiest modification that I could think of given the structure of that script. I don't know what constitutes a useful ID or description for you, so you may have to change those (I also never checked which qualifiers are always present in the Genbank format).

#!/usr/bin/env python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

gbank = SeqIO.parse(open(sys.argv[1], "r"), "genbank")
output_handle=open("rRNAs.fa","w")

rRNAs = []
for genome in gbank :
for feature in genome.features :
if(feature.type == "rRNA") :
ID = feature.qualifiers['db_xref'][0]
desc = feature.qualifiers['locus_tag'][0]
seq = feature.extract(genome.seq)
record = SeqRecord(seq, id=ID, description=desc)
rRNAs.append(record)

SeqIO.write(rRNAs, output_handle, "fasta")
output_handle.close()

0
Entering edit mode

btw: you can paste a URL like so How To Extract Just 'Cds' From Genbank File Into Another Genbank File? and it will be autolinked by title

0
Entering edit mode

Thanks! I just noticed that from your revision, not sure how I didn't figure that out earlier.

0
Entering edit mode

Actually, I asked that question as well :p.

I did that but it prints out the whole genome, not just one gene. Does it have to do with the sequence being not associated directly with the rrna feature?

0
Entering edit mode

I updated my answer to include a modification of the script you posted previously. This will likely work, though you may need to modify it to actually be useful.

1
Entering edit mode

Great. Thanks. here is the code that i have which was modified to extract 16S specifically.

from Bio import SeqIO, SeqFeature
import sys

gbank=SeqIO.parse(open(sys.argv[1],"rU"),"genbank")

for genome in gbank:
for gene in genome.features:
if gene.type=="rRNA":
if 'product' in gene.qualifiers:
if '16S' in gene.qualifiers['product'][0]:
start = gene.location.nofuzzy_start
end = gene.location.nofuzzy_end
if 'db_xref' in gene.qualifiers:
gi=[]
gi=str(gene.qualifiers['db_xref'])
gi=gi.split(":")[1]
gi=gi.split("'")[0]
print ">GeneId|%s|16S rRNA|%s\n%s" % (gi,genome.description,genome.seq[start:end])
else:
print ">GeneId|NoGenID|16S rRNA|%s\n%s" % (genome.description,genome.seq[start:end])

0
Entering edit mode

Cool, glad that helped! BTW, should you ever need the fasta output from that to be formatted such that all of the lines are the same length (this is required by some programs, such as samtools), you can just convert your print ... lines to create SeqRecords instead and then have SeqIO.write deal with the formatting for you (see the last chunk of my code).

0
Entering edit mode

Sounds good. I am pretty new to parsing using python and biopython, but it has been fun. Do you know a link/book/blog that explains parsing other than biopython webpage...just curious.

0
Entering edit mode

I don't, unfortunately (the only programming books I have are for Fortran, C, and C++). You might search this site since there have been a few posts regarding book recommendations.

0
Entering edit mode

How to convert print line to SeqRecord in order to save fasta file?

I tried replacing print line with the following code

genome = genome.name, genome.description.split(',')[0], genome.seq[start:end])
SeqIO.write(genome, 'test.fasta', "fasta")


getting the error

AttributeError: 'str' object has no attribute 'id'

0
Entering edit mode

Recently I am trying to learn biopython working with GenBank file but I am very confused by the relationship between GenBank format and biopython parser...For example, my genbank file has "locus_tag" info but why biopython can not get access to it? A example: gene complement(3015679..3037424) /gene=ENSDARG00000043854 /locus_tag="ppil4" /note="peptidylprolyl isomerase (cyclophilin)-like 4 [Source:ZFIN;Acc:ZDB-GENE-030131-6251]" mRNA join(complement(3037328..3037424), complement(3036319..3036386),complement(3035701..3035765), complement(3035066..3035183),complement(3031417..3031559), complement(3029838..3029934),complement(3028920..3029036), complement(3028691..3028815),complement(3025931..3025997), complement(3023570..3023681),complement(3020984..3021080), complement(3020081..3020198),complement(3015679..3016986)) /gene="ENSDARG00000043854" /note="transcript_id=ENSDART00000064390" CDS join(complement(3037328..3037397), complement(3036319..3036386),complement(3035701..3035765), complement(3035066..3035183),complement(3031417..3031559), complement(3029838..3029934),complement(3028920..3029036), complement(3028691..3028815),complement(3025931..3025997),

When I try to use feature.qualifiers["locus_tag"], there is KeyError. How can I get /locus_tag="ppil4" by using biopython? Sorry the comment format is a little mess. The GenBank file I use: ftp://ftp.ensembl.org/pub/release-79/genbank/danio_rerio/Danio_rerio.Zv9.79.chromosome.20.dat.gz

By the way, is there any website explain more about biopython genbank parser? I searched a lot but failed in finding in-depth documentation. Did you learn by reading source code?