Question: More file parsing :) ***EDIT*** how do I make a fast and bed file from Genbank file - SOLVED
gravatar for skbrimer
4.9 years ago by
United States
skbrimer650 wrote:

Hi All, 

I'm trying to make a script that will allow me to take a given genbank file and convert it a fasta and a bed file in the same script so I can then use it with the mapper and viewers downstream. The file conversion itself is pretty straight forward. I'm not sure which of the developers for Biopython made SeqIO.convert but they have all my gratitude. In fact all of Biopython I have found to be really empowering and enjoyable, however as I'm still a inexperienced I'm having some trouble extracting the information I want. 

So what I want is the name of the gene, start, stop, and strand just the basics. and when I read in the file I can see this. 


[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1192), strand=1), type='source'),SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='CDS')]

which I know is a list of the SeqFeatures but I'm having trouble getting to that information so I was exploring and printed what each item was:

for i in genome.features:

and got:

type: source
location: [0:1192](+)
    Key: country, Value: ['USA']
    Key: db_xref, Value: ['taxon:38170']
    Key: mol_type, Value: ['genomic RNA']
    Key: organism, Value: ['Avian orthoreovirus']
    Key: segment, Value: ['S4']
    Key: strain, Value: ['AVS-B']
<class 'Bio.SeqFeature.SeqFeature'>
type: gene
location: [23:1127](+)
    Key: gene, Value: ['sigma-NS']
<class 'Bio.SeqFeature.SeqFeature'>
type: CDS
location: [23:1127](+)
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:315466580']
    Key: gene, Value: ['sigma-NS']
    Key: product, Value: ['sigma-NS protein']
    Key: protein_id, Value: ['CBX25032.1']
<class 'Bio.SeqFeature.SeqFeature'>

I know that the Key Value pair means its a dictionary and when I try to just print the genome.features.type I get a list error so I think each feature is a list with two lists and a dictionary inside them but I can not seem to figure out how to extract the information. 

Can anybody point me in the right direction in either an explanation or the correct documentation I need to read I would be very grateful. 

biopython • 1.5k views
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by skbrimer650

Can you change the title to something better and more meaningful? (keeping in mind about the people visiting the site, how could the get benefit from this question).

ADD REPLYlink written 4.9 years ago by Sukhdeep Singh10k

sure, no problem

ADD REPLYlink written 4.9 years ago by skbrimer650
gravatar for skbrimer
4.9 years ago by
United States
skbrimer650 wrote:

HAHAHAHA!!! after a lot of trial and error and reading I got a script that works. Please find it below is anyone could use it for themselves.

''' file conversion for from genbank to fasta
    and to make a bedfile as well. Code modified from 
    Brant Faircloth's "" for the bed file
    you can find the original code here
    https://gist [dot] github [dot] com/brantfaircloth/893580'''

from Bio import SeqIO
import sys

# Lets gets some files 
gb_file = sys.argv[1]
lables = gb_file.split(".")

# Producing the bed file, first make a new writeable file and creat header
# Added the line break \n for a cleaner look with a text editor
# sorry my OCD. 

out_bedfile = open(lables[0]+".bed", 'w')
bed_header = "Track name="+ lables[0] + " description="+lables[0]+" genes\n"

# loop will work with multi genbank files. 

for record in SeqIO.parse(open(gb_file,"rU"),"genbank"):
    for feature in record.features:
        if feature.type == 'gene':
            start = int(feature.location.start)
            stop = int(feature.location.end)
                name = feature.qualifiers['gene'][0]
                #some features only have locus tags
                name = feature.qualifiers['locus_tag'][0]
            if feature.strand < 0:
                strand = "-"
                strand = "+"
            bed_line = lables[0]+" dna\t{0}\t{1}\t{2}\t1000\t{3}\t{0}\t{1}\n".format(start, stop, name, strand)


#convert to fasta - so easy (thank you biopython)


correct a typo, code should be good now :)

ADD COMMENTlink modified 11 months ago by _r_am31k • written 4.9 years ago by skbrimer650
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2271 users visited in the last hour