I have a GenBank file from someone I'm doing some analysis for, and somewhere along the line it was either dodgy to start with or has been borked.
The GBK is in the correct format and has all the information, except that the /translation qualifiers are missing, like so:
LOCUS       Sakai_contig000001   4952793 bp    DNA     linear   UNC 05-JAN-2016
DEFINITION  [gcode=11] [organism=Escherichia coli] [strain=Sakai].
FEATURES             Location/Qualifiers
     CDS             concatenate_genome:85..6084
                     /inference="ab initio prediction:Prodigal:2.60,protein
                     motif:CLUSTERS:PRK09751"
                     /locus_tag="PROKKA_00001"
                     /product="putative ATP-dependent helicase Lhr"
     CDS             concatenate_genome:6081..8195
                     /EC_number="3.6.4.12"
                     /gene="pcrA"
                     /inference="ab initio prediction:Prodigal:2.60,similar to
                     AA sequence:UniProtKB:P64319"
                     /locus_tag="PROKKA_00002"
                     /product="ATP-dependent DNA helicase PcrA"
     CDS             complement(concatenate_genome:9148..9393)
                     /inference="ab initio prediction:Prodigal:2.60"
                     /locus_tag="PROKKA_00003"
                     /product="hypothetical protein"
I still have the locus tags and the coordinates for each CDS in the file, as well as most of the header information, such as the inferences. Does anyone know of a way I can read this into a program or script (so far I've fiddled with CLC and Artemis, but without any luck) so that it puts the CDSs in the correct positions and lets me write a new GBK that keeps this information and adds the translations as well?
It's important that whatever method is used doesn't alter the locus tags in any way, or else it will screw up some RNA-seq analysis I did before discovering this issue.
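To make the request concrete, here is a minimal sketch in plain Python of the re-translation step I have in mind. It is not a tested solution. It assumes the nucleotide sequence is available as a single string (from the ORIGIN section or a separate FASTA), that every CDS has a simple location like the ones above (no join(...) and no partial < or > markers), and that translation table 11 applies, as the [gcode=11] in the DEFINITION line suggests. The function names parse_location and translate_cds are made up for illustration.

```python
import re
from itertools import product

# Codon table in NCBI order (T, C, A, G). Table 11 assigns the same amino
# acids as the standard code; it differs only in the allowed start codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
START_CODONS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}

# Matches "85..6084", optionally prefixed by a name such as "concatenate_genome:".
LOCATION_RE = re.compile(r"(?:\w+:)?(\d+)\.\.(\d+)")


def parse_location(text):
    """Return (start, end, strand) for a simple CDS location (assumption: no joins)."""
    match = LOCATION_RE.search(text)
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    strand = -1 if text.strip().startswith("complement(") else 1
    return int(match.group(1)), int(match.group(2)), strand


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate_cds(genome, start, end, strand):
    """Translate a CDS given 1-based inclusive coordinates, as used in GenBank."""
    dna = genome[start - 1:end].upper()
    if strand == -1:
        dna = reverse_complement(dna)
    # Ignore any trailing bases that do not form a complete codon.
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    protein = [CODON_TABLE.get(codon, "X") for codon in codons]
    # An alternative start codon is still read as methionine at position 1.
    if codons and codons[0] in START_CODONS:
        protein[0] = "M"
    # Drop the terminal stop, since /translation does not include it.
    return "".join(protein).rstrip("*")
```

The resulting string could then be written back as a /translation qualifier on the matching CDS, with the /locus_tag values copied through untouched.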
Nice find!
Funnily enough, I already follow him on GitHub and never saw this code!