Question: More file parsing :) ***EDIT*** how do I make a fast and bed file from Genbank file - SOLVED
1
gravatar for skbrimer
3.7 years ago by
skbrimer550
United States
skbrimer550 wrote:

Hi All, 

I'm trying to make a script that will allow me to take a given genbank file and convert it a fasta and a bed file in the same script so I can then use it with the mapper and viewers downstream. The file conversion itself is pretty straight forward. I'm not sure which of the developers for Biopython made SeqIO.convert but they have all my gratitude. In fact all of Biopython I have found to be really empowering and enjoyable, however as I'm still a inexperienced I'm having some trouble extracting the information I want. 

So what I want is the name of the gene, start, stop, and strand just the basics. and when I read in the file I can see this. 

genome.features

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1192), strand=1), type='source'),SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(1127), strand=1), type='CDS')]

which I know is a list of the SeqFeatures but I'm having trouble getting to that information so I was exploring and printed what each item was:

for i in genome.features:
    print(i,type(i))

and got:

type: source
location: [0:1192](+)
qualifiers:
    Key: country, Value: ['USA']
    Key: db_xref, Value: ['taxon:38170']
    Key: mol_type, Value: ['genomic RNA']
    Key: organism, Value: ['Avian orthoreovirus']
    Key: segment, Value: ['S4']
    Key: strain, Value: ['AVS-B']
<class 'Bio.SeqFeature.SeqFeature'>
type: gene
location: [23:1127](+)
qualifiers:
    Key: gene, Value: ['sigma-NS']
<class 'Bio.SeqFeature.SeqFeature'>
type: CDS
location: [23:1127](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:315466580']
    Key: gene, Value: ['sigma-NS']
    Key: product, Value: ['sigma-NS protein']
    Key: protein_id, Value: ['CBX25032.1']
    Key: translation, Value: ['MDNTVRVGVSRNTSGAAGQTVFRNYYLLRCNISADGRNATKAVQSHFPFLSRAVRCLSPLAAHCADRTLRRDNVKQILTRELPFPSDLINYAHHVNSSSLTTSQGVEAARLVAQVYGEQLSFDHIYPTGSATYCPGAIANAISRIMAGFVPHEGDNFTPDGAIDYLAADLVAYKFVLPYMLDIVDGRPQIVLPSHTVEEMLSNTSLLNSIDASFGIESKSDQRMTRDAAEMSSRSLNELEDHEQRGRMPWKIMTAMFAAQLKVELDALADERVESQANAHVTSFGSRLFNQMSAFVPIDRELMELALLIKEQGFAMNPGQVASKWSLIRRSGPTRPLSGARLEIRNGNWTIREGDQTLLSVSPARMA']
<class 'Bio.SeqFeature.SeqFeature'>

I know that the Key Value pair means its a dictionary and when I try to just print the genome.features.type I get a list error so I think each feature is a list with two lists and a dictionary inside them but I can not seem to figure out how to extract the information. 

Can anybody point me in the right direction in either an explanation or the correct documentation I need to read I would be very grateful. 

biopython • 1.3k views
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by skbrimer550

Can you change the title to something better and more meaningful? (keeping in mind about the people visiting the site, how could the get benefit from this question).

ADD REPLYlink written 3.7 years ago by Sukhdeep Singh9.9k

sure, no problem

ADD REPLYlink written 3.7 years ago by skbrimer550
2
gravatar for skbrimer
3.7 years ago by
skbrimer550
United States
skbrimer550 wrote:

HAHAHAHA!!! after a lot of trial and error and reading I got a script that works. Please find it below is anyone could use it for themselves.

''' file conversion for from genbank to fasta
    and to make a bedfile as well. Code modified from 
    Brant Faircloth's "bed_from_genbank.py" for the bed file
    you can find the original code here
    https://gist.github.com/brantfaircloth/893580 '''

from Bio import SeqIO
import sys

# Lets gets some files 
gb_file = sys.argv[1]
lables = gb_file.split(".")

# Producing the bed file, first make a new writeable file and creat header
# Added the line break \n for a cleaner look with a text editor
# sorry my OCD. 

out_bedfile = open(lables[0]+".bed", 'w')
bed_header = "Track name="+ lables[0] + " description="+lables[0]+" genes\n"
out_bedfile.write(bed_header)

# loop will work with multi genbank files. 

for record in SeqIO.parse(open(gb_file,"rU"),"genbank"):
    for feature in record.features:
        if feature.type == 'gene':
            start = int(feature.location.start)
            stop = int(feature.location.end)
            try:
                name = feature.qualifiers['gene'][0]
            except:
                #some features only have locus tags
                name = feature.qualifiers['locus_tag'][0]
            if feature.strand < 0:
                strand = "-"
            else:
                strand = "+"
            bed_line = lables[0]+" dna\t{0}\t{1}\t{2}\t1000\t{3}\t{0}\t{1}\n".format(start, stop, name, strand)
            out_bedfile.write(bed_line)

out_bedfile.close()

#convert to fasta - so easy (thank you biopython)
SeqIO.convert(gb_file,"genbank",lables[0]+".fasta","fasta")

 

##### EDIT ####

correct a typo, code should be good now :)

 

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by skbrimer550
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2061 users visited in the last hour