Splitting and Extracting Features in fasta format from Genbank Files using Biopython
1
2
Entering edit mode
7.3 years ago
swknudsen ▴ 40

With a genbank file (tobis_list_of_acc_no.gbk) that contains full mitochondrial genomes for 200 mitochondrial genomes, I would like to obtain fasta-files (one for each genome), but with the the fasta-sequences in each of the files parsed into individual fasta-sequences for each gene in the mitochondrial genome. My file : 'tobis_list_of_acc_no.gbk'-file is similar in setup to the 'ls_orchid.gbk' file described on the biopython tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html) My input file 'tobis_list_of_acc_no.gbk'-file is a little bit different from the 'ls_orchid.gbk' file, as each entry has several subsections for each gene in the genome. My 'tobis_list_of_acc_no.gbk'-file can easily be prepared by making a 'gbk'-file for a long list of accession numbers referring to complete mitochondrial genomes. I have been able to modify a bit of python-code that uses Biopython to write one file that contains the full mitochondrial genome for each accession number in my 'tobis_list_of_acc_no.gbk'-file.

Python code:

____________________________________________________________________________________________
from Bio import SeqIO
gbk_filename = "tobis_list_of_acc_no.gbk"
faa_filename = "tobis_list_of_acc_no.fna"
input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    output_handle.write(">%s %s\n%s\n" % (
           seq_record.id,
           seq_record.description,
           seq_record.seq))

output_handle.close()
input_handle.close()
____________________________________________________________________________________________

I got this code from this webpage: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/

The code above prepares a .'fna'-file, that has each complete mitochondrial genome as one long continuous fasta-sequence. Like this:

>NC_010689.1 Chionodraco myersi mitochondrion, complete genome.
GCTGGCGTGGTTTATTT...
>NC_015654.1 Chaenocephalus aceratus mitochondrion, complete genome.
ATGCTATCAGCGCTTAT...
>NC_026578.1 Parachaenichthys charcoti mitochondrion, complete genome.
ATGCTCTCAGCGCTTAT...

Notice that I have shortened the nucleotide sequences for this post, and inserted '...' in the nucleotide sequence. I saw no need to trouble you with nucleotide sequences more than 16000 characters long. Although this is a nice output, I want a file for each mitochondrial genome in my input 'tobis_list_of_acc_no.gbk' and with every gene for every entry parsed out as individual fasta-sequences. So that I get a file looking like this:

>lcl|NC_010689.1_cds_ND1_22325_1 [gene=ND1] [protein=NADH dehydrogenase subunit 1] [protein_id=YP_001905874.1] [location=2856..3830]
ATGCTATCAACGCTTATAACACA...ATTCTAA
>lcl|NC_010689.1_cds_ND2_22325_2 [gene=ND2] [protein=NADH dehydrogenase subunit 2] [protein_id=YP_001905875.1] [location=4042..5088]
ATGAGCCCATATGTCTTAGCCCTT...CCTCTAA
>lcl|NC_010689.1_cds_COX1_22325_3 [gene=COX1] [protein=cytochrome c oxidase subunit I] [protein_id=YP_001905876.1] [location=5473..7023]
GTGGCCATCACACGTTGAT...TGAAACCT
....

I have also shortened the nucleotide sequences in this example, and inserted '...' in the nucleotide sequence. I got this part above by manually downloading the 'coding sequences' under the tab 'send', after having manually looked up NC_010689 on NCBI's nucleotide database. I have also shortened the output i manually looked up, and only shown the first 3 genes I got from the full genbank entry. The idea is of course to get all genes for each of the 200 full genbank entry in my 'tobis_list_of_acc_no.gbk'-file. i.e. 200 files that each have a similar setup as the output-example I prepared manually through the NCBI's nucleotide database.

I think the answer to my question might be found on the two webpages I have mentioned above, but as I am quite new to Biopython I am not sure where to look on these 2 webpages for information on how I can adjust my code to make SeqIO write the pieces I am after. Is it a question of inserting the right 'handles' for SeqIO to recognize, and write afterwards? In that case is there a webpage (or a specific section in these 2 webpages) where I can find details on what options and handles SeqIO is able to recognize and parse out frmo my '.gbk'-file. I am sorry if I have overlooked something obvious, or a similar post.

Thanks in advance for any help and advice you might be able to provide.

gene sequence genome • 13k views
ADD COMMENT
1
Entering edit mode

Can I suggest the thread title gets renamed to something generic like "Splitting and Extracting Features from Genbank Files"? This is a pretty common thing people want to do so it would be good for people googling in the future to be able to find it easily

ADD REPLY
0
Entering edit mode

Good suggestion, I renamed the thread.

ADD REPLY
0
Entering edit mode

I couldn't quite wrap my head around this. You have a genbank file with several genbanks in it (one for each of the mitochondrial genomes)? And you just want the fasta sequences for each genome, or you want the fasta sequences of all of the features separated out in to a file that corresponds to each genome?

Are you able to obtain each of the genbanks for each mitochondrial genome as a separate genbank? If so I'd be inclined to do it with a basic loop using Peter Cock's script (or one of your own):

for file in *.gbk ; do python genbank2fasta.py -i ${file} -o ${file%.*}.fasta ; done

It's much easier to concatenate files back together afterwards than split them first!

ADD REPLY
0
Entering edit mode

Yes. I have a gbk-file with 200 full genbanks in it. Here is an example with 3 full genbanks in it. https://www.dropbox.com/s/7fu69897ua66voh/tobis_list_of_acc_no_shortlist.gbk?dl=0 This is a gbk-file for the 3 accession numbers: NC_010689, NC_015654 and NC_026578 I do not want the full fasta-sequence for each accession number. I already have this in my '.fna'-file. If you use the first piece of code on the the gbk-file I have deposited in the dropbox, you will get this '.fna'-file.

What I am after are 3 files just like these: https://www.dropbox.com/s/472qz31l7q98co6/NC_010689_sequence.txt?dl=0 https://www.dropbox.com/s/yjctq7maabaawe4/NC_015654_sequence.txt?dl=0 https://www.dropbox.com/s/yyzb4nqrko33kmr/NC_026578_sequence.txt?dl=0

I have prepared these by manually searching for each the accession numbers on the NCBI GenBank database. I guess I can just use all my 200 accession numbers in one go on the NCBI nucleotide database, and download a single file with all my fasta-sequences concatenated in to the same file, and then split them up afterwards in bash or python. But since I was trying to get more familiar with Biopython and SeqIO I thought I could do the same thing with these tools.

I do not mind having multiple '.gbk-files as input files. But I do not know how to use biopython to produce individual gbk-files for every accession number in a query. I am only able to merge all gbk-entries in to the same gbk-file - for all 200 accession numbers. I am pretty sure Peter Cock script above (the one I inserted in my first post) can be modified to this, but I do not know how to.

ADD REPLY
0
Entering edit mode

How did you download the original multi-GBK file? if it was from NCBI I think you usually get a choice of splitting each file, or downloading as one (but I don't do it often so could be wrong).

ADD REPLY
1
Entering edit mode

I prepared the multi-gbk with this:

from Bio import Entrez
Entrez.email = "myemail@email.com" 
f1 = open("/home/forbio/python_test_2016/tobis_list_of_acc_no_shortlist.txt")
f2 = open('tobis_list_of_acc_no_shortlist.gbk', 'w')
for line in iter(f1):
    handle = Entrez.efetch(db="nucleotide", id=line, rettype="gb", retmode="text")
    f2.write (handle.read())
f2.close()

My txt-input-file just contains a list of the accession numbers. Like this:

NC_010689
NC_015654
NC_026578
ADD REPLY
5
Entering edit mode
7.3 years ago
Joe 21k

Here are 2 short python scripts that will split, and then retreive gene features (protein sequences) from your genbanks:

Split a multigenbank

python splitgenbank.py multigenbank.gbk

# splitgenbank.py

from Bio import SeqIO
import sys

for rec in SeqIO.parse(sys.argv[1], "genbank"):
    SeqIO.write([rec], openrec.id + ".gbk", "w"), "genbank")

Get all the features (slight modification of Peter Cock's code)

(you'll need to edit this if you want different information in the fastas):

Invoke like so to get all the fastas names sensibly:

for file in *.gbk ; do python genbank2fasta.py ${file} ${file%.*}.faa ; done

# genbank2fasta.py

from Bio import SeqIO
import sys

with open(sys.argv[2], 'w') as ofh:
        for seq_record in SeqIO.parse(sys.argv[1], 'genbank'):
                for seq_feature in seq_record.features:
                        if seq_feature.type=="CDS":
                                assert len(seq_feature.qualifiers['translation'])==1
                                ofh.write(">%s from %s\n%s\n" % (
                                seq_feature.qualifiers['gene'][0],
                                seq_record.name,
                                seq_feature.qualifiers['translation'][0]))

All these scripts are just going to dump the files they make in the same directory I think, so my advice would be to have a backup of your genbank somewhere incase things start overwriting (shouldn't happen but you never know).


EDIT: Here's the script to extract NA features:

python extractNAfeatures.py genbank.gbk outfile.ffn

# extractNAfeatures.py

from Bio import SeqIO
import sys

with open(sys.argv[2], 'w') as nfh:
        for rec in SeqIO.parse(sys.argv[1], "genbank"):
                if rec.features:
                        for feature in rec.features:
                                if feature.type == "CDS":
                                        nfh.write(">%s from %s\n%s\n" % (
                                        feature.qualifiers['gene'][0],
                                        rec.name,
                                        feature.location.extract(rec).seq))
ADD COMMENT
0
Entering edit mode

I think the variable openrec is not declared?

ADD REPLY
0
Entering edit mode

Hmm.. thats weird. There should be a ( open parenthesis between open and rec.id but its not showing in the markup for some reason despite the fact it is correct in the raw content of my post.

It's here in correct form anyway (i've renamed it for myself however) https://github.com/jrjhealey/bioinfo-tools/blob/master/splitmultigenbank.py

ADD REPLY
1
Entering edit mode

Github Gist is also convenient for sharing code (and if you paste the link here it's also displayed in your post directly)

ADD REPLY
0
Entering edit mode

Good to know! I've been meaning to start composing some gists.

ADD REPLY
0
Entering edit mode

Brilliant this almost works. With a bash-code on the side I was able to move the resulting files forth and back. Just one thing is missing. The # genbank2fasta.py part produces the '.faa'-files with amino-acid codes in fasta-format. I would prefer to have nucleotide information in the resulting '.faa'-files. I can see 'translation' in the # genbank2fasta.py part in a couple of the lines. Which one do I have to adjust to get the resulting '.faa'-files with nucleotide information instead of amino-acid codes? Thanks

ADD REPLY
0
Entering edit mode

You cannot "retro-translate" amino acids to nucleotides due to redundancy in codon usage.

ADD REPLY
0
Entering edit mode

Oh. No. This was not what meant. I am fully aware that I cannot retro-translate. I am aware of the redundancy in the AA-translation code. Retro translation is certainly not the idea. As a full scale phylogenetic analysis is the ultimate goal further down the line, I need the original nucleotide data. Amino acid data will prevent me from getting a proper resolution of taxa at a species level. My original idea was to get the nucleotide-sequences, not the AA-sequences. Please compare with my original question posted above. The gbk-files carry the original nucleotide-sequences and with the biopython code I presented in my original question (the code I got from here: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/), I was already able to to produce nucleotide sequences in fasta-format for each accession-number. The 2 pieces of code kindly provided by jrj.healey are almost creating the right kind of output, at least the setup an arrangement inside the file appears right. Only thing that is wrong is that I wanted the original nucleotide data, not the AA-data.

ADD REPLY
0
Entering edit mode

You can take the fastas of protein sequences and blast them to get the corresponding nucleotide sequences.

If I have a little bit more time later I'll see if I can figure out a way to get the nucleotide info but the basic problem is:

Genbanks contain the FULL nucleic acid sequence at the end of the file, so you can convert GBK -> genome fasta no problem, which is what Peters script does. They don't contain nucleic acid information on a gene by gene basis. Thus the only feature fasta you can extract is the protein sequence for a given gene directly.

It should be possible to hack something together using the Base pair indices however... but will require some more thought!

ADD REPLY
0
Entering edit mode

So after a bit more thought, it turns out theres a handy 'extract' method in the later versions of biopython. See my edited answer for the code you'll need.

I advise you to sanity check a few of the genes to make sure it has output what you expect. Particularly genes you would expect to find on the reverse strand.

ADD REPLY
1
Entering edit mode

Yes. This is it. Thank you so much for the last piece of code: extractNAfeatures.py. I checked a couple of the resulting nd6-genes in the mt-genome. They appear to be on the same strand as the other ones I have from my assembled mt-genomes. The genes on the forward strand also appear to match the reading frame of my other assembled mt-genomes. I realized I also wanted to add the species name to each header for each sequence in the resulting output-file. So I modified your code a little bit:

from Bio import SeqIO
import sys

with open(sys.argv[2], 'w') as nfh:
        for rec in SeqIO.parse(sys.argv[1], "genbank"):
                if rec.features:
                        for feature in rec.features:
                                if feature.type == "CDS":
                                        nfh.write(">%s %s from %s\n%s\n" % (
                                        feature.qualifiers['gene'][0],
                                        rec.name,
                                        rec.description,
                                        feature.location.extract(rec).seq))

Thanks for your help

ADD REPLY
0
Entering edit mode

You're welcome. Parsing genbanks is a pretty detestable task, so it was good practice for me anyway ;)

ADD REPLY

Login before adding your answer.

Traffic: 2630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6