With a genbank file (tobis_list_of_acc_no.gbk) that contains full mitochondrial genomes for 200 mitochondrial genomes, I would like to obtain fasta-files (one for each genome), but with the the fasta-sequences in each of the files parsed into individual fasta-sequences for each gene in the mitochondrial genome. My file : 'tobis_list_of_acc_no.gbk'-file is similar in setup to the 'ls_orchid.gbk' file described on the biopython tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html) My input file 'tobis_list_of_acc_no.gbk'-file is a little bit different from the 'ls_orchid.gbk' file, as each entry has several subsections for each gene in the genome. My 'tobis_list_of_acc_no.gbk'-file can easily be prepared by making a 'gbk'-file for a long list of accession numbers referring to complete mitochondrial genomes. I have been able to modify a bit of python-code that uses Biopython to write one file that contains the full mitochondrial genome for each accession number in my 'tobis_list_of_acc_no.gbk'-file.
Python code:
____________________________________________________________________________________________
from Bio import SeqIO
gbk_filename = "tobis_list_of_acc_no.gbk"
faa_filename = "tobis_list_of_acc_no.fna"
input_handle = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")
for seq_record in SeqIO.parse(input_handle, "genbank") :
print "Dealing with GenBank record %s" % seq_record.id
output_handle.write(">%s %s\n%s\n" % (
seq_record.id,
seq_record.description,
seq_record.seq))
output_handle.close()
input_handle.close()
____________________________________________________________________________________________
I got this code from this webpage: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/
The code above prepares a .'fna'-file, that has each complete mitochondrial genome as one long continuous fasta-sequence. Like this:
>NC_010689.1 Chionodraco myersi mitochondrion, complete genome.
GCTGGCGTGGTTTATTT...
>NC_015654.1 Chaenocephalus aceratus mitochondrion, complete genome.
ATGCTATCAGCGCTTAT...
>NC_026578.1 Parachaenichthys charcoti mitochondrion, complete genome.
ATGCTCTCAGCGCTTAT...
Notice that I have shortened the nucleotide sequences for this post, and inserted '...' in the nucleotide sequence. I saw no need to trouble you with nucleotide sequences more than 16000 characters long. Although this is a nice output, I want a file for each mitochondrial genome in my input 'tobis_list_of_acc_no.gbk' and with every gene for every entry parsed out as individual fasta-sequences. So that I get a file looking like this:
>lcl|NC_010689.1_cds_ND1_22325_1 [gene=ND1] [protein=NADH dehydrogenase subunit 1] [protein_id=YP_001905874.1] [location=2856..3830]
ATGCTATCAACGCTTATAACACA...ATTCTAA
>lcl|NC_010689.1_cds_ND2_22325_2 [gene=ND2] [protein=NADH dehydrogenase subunit 2] [protein_id=YP_001905875.1] [location=4042..5088]
ATGAGCCCATATGTCTTAGCCCTT...CCTCTAA
>lcl|NC_010689.1_cds_COX1_22325_3 [gene=COX1] [protein=cytochrome c oxidase subunit I] [protein_id=YP_001905876.1] [location=5473..7023]
GTGGCCATCACACGTTGAT...TGAAACCT
....
I have also shortened the nucleotide sequences in this example, and inserted '...' in the nucleotide sequence. I got this part above by manually downloading the 'coding sequences' under the tab 'send', after having manually looked up NC_010689 on NCBI's nucleotide database. I have also shortened the output i manually looked up, and only shown the first 3 genes I got from the full genbank entry. The idea is of course to get all genes for each of the 200 full genbank entry in my 'tobis_list_of_acc_no.gbk'-file. i.e. 200 files that each have a similar setup as the output-example I prepared manually through the NCBI's nucleotide database.
I think the answer to my question might be found on the two webpages I have mentioned above, but as I am quite new to Biopython I am not sure where to look on these 2 webpages for information on how I can adjust my code to make SeqIO write the pieces I am after. Is it a question of inserting the right 'handles' for SeqIO to recognize, and write afterwards? In that case is there a webpage (or a specific section in these 2 webpages) where I can find details on what options and handles SeqIO is able to recognize and parse out frmo my '.gbk'-file. I am sorry if I have overlooked something obvious, or a similar post.
Thanks in advance for any help and advice you might be able to provide.
Can I suggest the thread title gets renamed to something generic like "Splitting and Extracting Features from Genbank Files"? This is a pretty common thing people want to do so it would be good for people googling in the future to be able to find it easily
Good suggestion, I renamed the thread.
I couldn't quite wrap my head around this. You have a genbank file with several genbanks in it (one for each of the mitochondrial genomes)? And you just want the fasta sequences for each genome, or you want the fasta sequences of all of the features separated out in to a file that corresponds to each genome?
Are you able to obtain each of the genbanks for each mitochondrial genome as a separate genbank? If so I'd be inclined to do it with a basic loop using Peter Cock's script (or one of your own):
It's much easier to concatenate files back together afterwards than split them first!
Yes. I have a gbk-file with 200 full genbanks in it. Here is an example with 3 full genbanks in it. https://www.dropbox.com/s/7fu69897ua66voh/tobis_list_of_acc_no_shortlist.gbk?dl=0 This is a gbk-file for the 3 accession numbers: NC_010689, NC_015654 and NC_026578 I do not want the full fasta-sequence for each accession number. I already have this in my '.fna'-file. If you use the first piece of code on the the gbk-file I have deposited in the dropbox, you will get this '.fna'-file.
What I am after are 3 files just like these: https://www.dropbox.com/s/472qz31l7q98co6/NC_010689_sequence.txt?dl=0 https://www.dropbox.com/s/yjctq7maabaawe4/NC_015654_sequence.txt?dl=0 https://www.dropbox.com/s/yyzb4nqrko33kmr/NC_026578_sequence.txt?dl=0
I have prepared these by manually searching for each the accession numbers on the NCBI GenBank database. I guess I can just use all my 200 accession numbers in one go on the NCBI nucleotide database, and download a single file with all my fasta-sequences concatenated in to the same file, and then split them up afterwards in bash or python. But since I was trying to get more familiar with Biopython and SeqIO I thought I could do the same thing with these tools.
I do not mind having multiple '.gbk-files as input files. But I do not know how to use biopython to produce individual gbk-files for every accession number in a query. I am only able to merge all gbk-entries in to the same gbk-file - for all 200 accession numbers. I am pretty sure Peter Cock script above (the one I inserted in my first post) can be modified to this, but I do not know how to.
How did you download the original multi-GBK file? if it was from NCBI I think you usually get a choice of splitting each file, or downloading as one (but I don't do it often so could be wrong).
I prepared the multi-gbk with this:
My txt-input-file just contains a list of the accession numbers. Like this: