I have a GenBank file (tobis_list_of_acc_no.gbk) that contains 200 complete mitochondrial genomes. From it I would like to obtain FASTA files, one per genome, in which the sequence is split into individual FASTA entries, one for each gene in the mitochondrial genome.

My 'tobis_list_of_acc_no.gbk' file is similar in setup to the 'ls_orchid.gbk' file described in the Biopython tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html). It differs slightly in that each entry has several subsections, one for each gene in the genome. A file like mine can easily be prepared by making a '.gbk' file from a long list of accession numbers referring to complete mitochondrial genomes.

I have been able to modify a bit of Python code that uses Biopython to write one file containing the full mitochondrial genome for each accession number in my 'tobis_list_of_acc_no.gbk' file.
____________________________________________________________________________________________
from Bio import SeqIO

gbk_filename = "tobis_list_of_acc_no.gbk"
faa_filename = "tobis_list_of_acc_no.fna"

input_handle = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

# Write each GenBank record as one FASTA entry holding the whole genome
for seq_record in SeqIO.parse(input_handle, "genbank"):
    print("Dealing with GenBank record %s" % seq_record.id)
    output_handle.write(">%s %s\n%s\n" % (
        seq_record.id,
        seq_record.description,
        seq_record.seq))

output_handle.close()
input_handle.close()
____________________________________________________________________________________________
I got this code from this webpage: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/
The code above prepares an '.fna' file that has each complete mitochondrial genome as one long continuous FASTA sequence, like this:
>NC_010689.1 Chionodraco myersi mitochondrion, complete genome.
GCTGGCGTGGTTTATTT...
>NC_015654.1 Chaenocephalus aceratus mitochondrion, complete genome.
ATGCTATCAGCGCTTAT...
>NC_026578.1 Parachaenichthys charcoti mitochondrion, complete genome.
ATGCTCTCAGCGCTTAT...
Notice that I have shortened the nucleotide sequences for this post and inserted '...' in each of them. I saw no need to trouble you with nucleotide sequences more than 16000 characters long. Although this is a nice output, what I want is one file for each mitochondrial genome in my input 'tobis_list_of_acc_no.gbk', with every gene of every entry parsed out as an individual FASTA sequence, so that each file looks like this:
>lcl|NC_010689.1_cds_ND1_22325_1 [gene=ND1] [protein=NADH dehydrogenase subunit 1] [protein_id=YP_001905874.1] [location=2856..3830]
ATGCTATCAACGCTTATAACACA...ATTCTAA
>lcl|NC_010689.1_cds_ND2_22325_2 [gene=ND2] [protein=NADH dehydrogenase subunit 2] [protein_id=YP_001905875.1] [location=4042..5088]
ATGAGCCCATATGTCTTAGCCCTT...CCTCTAA
>lcl|NC_010689.1_cds_COX1_22325_3 [gene=COX1] [protein=cytochrome c oxidase subunit I] [protein_id=YP_001905876.1] [location=5473..7023]
GTGGCCATCACACGTTGAT...TGAAACCT
....
I have also shortened the nucleotide sequences in this example and inserted '...' in each of them. I obtained the example above manually: I looked up NC_010689 in NCBI's nucleotide database and downloaded the 'coding sequences' under the 'send' tab. I have also truncated that output and only show the first 3 genes from the full GenBank entry. The idea is of course to get all genes for each of the 200 full GenBank entries in my 'tobis_list_of_acc_no.gbk' file, i.e. 200 files that each have a setup similar to the example I prepared manually through NCBI's nucleotide database.
I think the answer to my question might be found on the two webpages I have mentioned above, but as I am quite new to Biopython I am not sure where to look on them for information on how to adjust my code so that SeqIO writes the pieces I am after. Is it a question of inserting the right 'handles' for SeqIO to recognize and then write? If so, is there a webpage (or a specific section of these two webpages) where I can find details on which options and handles SeqIO is able to recognize and parse out from my '.gbk' file? I am sorry if I have overlooked something obvious, or a similar post.
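To make the goal more concrete, here is a rough sketch of the direction I imagine, not a finished solution. It assumes that each record parsed by SeqIO carries a list of features, that the protein-coding genes are the features of type "CDS", and that each feature's qualifiers hold "gene", "product" and "protein_id" as lists. The function names write_cds_fasta and split_genbank_by_gene and the "_cds.fna" file suffix are made up for illustration, and the header is a simplified layout, not NCBI's exact one (the location is printed however Biopython prints it).

```python
import os


def write_cds_fasta(record, handle):
    """Write every CDS feature of one record as its own FASTA entry.

    Returns the number of CDS entries written. Hypothetical helper.
    """
    count = 0
    for feature in record.features:
        if feature.type != "CDS":
            continue  # skip tRNA, rRNA, D-loop, source, etc.
        count += 1
        q = feature.qualifiers  # assumed: qualifier values are lists
        gene = q.get("gene", ["unknown"])[0]
        product = q.get("product", ["unknown"])[0]
        protein_id = q.get("protein_id", ["unknown"])[0]
        # extract() cuts the feature's region out of the full genome
        seq = feature.extract(record.seq)
        header = ">%s_cds_%s_%d [gene=%s] [protein=%s] [protein_id=%s] [location=%s]" % (
            record.id, gene, count, gene, product, protein_id, feature.location)
        handle.write("%s\n%s\n" % (header, str(seq)))
    return count


def split_genbank_by_gene(gbk_filename, out_dir="."):
    """Write one FASTA file per GenBank record, one entry per CDS."""
    from Bio import SeqIO  # requires Biopython

    for record in SeqIO.parse(gbk_filename, "genbank"):
        out_path = os.path.join(out_dir, record.id + "_cds.fna")  # assumed naming
        with open(out_path, "w") as handle:
            n = write_cds_fasta(record, handle)
        print("Wrote %d CDS entries for %s" % (n, record.id))
```

With my input this would be called as split_genbank_by_gene("tobis_list_of_acc_no.gbk"). Is this roughly the right way to go about it, or is there a more direct option in SeqIO?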
Thanks in advance for any help and advice you might be able to provide.