Hey guys, I do phylogenetics of viruses and I'm currently working on an outbreak analysis, so I'm doing some phylogeography too. Obviously, if a sequence has no country of origin or collection date, I have to take it out of my dataset. I have >200 sequences per dataset and I really don't want to waste my time going through GenBank manually. I haven't been successful making or editing any Biopython scripts to extract the country from the GenBank file. Any help would be appreciated! Thanks!
I'm adding a Python solution that you may later modify to include more data. You just need to specify the location of a file with the accessions (one per line), where it says
The output is separated by commas, so you can later read it as a CSV. Using your IDs, I got:
KT279761,Haiti
KC692509,Argentina: Buenos Aires
KC692496,Argentina: Buenos Aires
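For anyone who wants something to start from, here is a minimal sketch of this kind of script. It is not the original code from this answer. It uses only the standard library rather than Biopython, fetches GenBank flat files through the same EFetch endpoint, and pulls the country out of the source feature. Assumptions on my side: the function names, the batch size of 100 and the file name `accessions.txt` are made up for illustration, and the qualifier is matched as either `country` or `geo_loc_name`, since newer records may use the latter name.

```python
import csv
import io
import re
import sys
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
BATCH_SIZE = 100  # assumption: keeps each request URL reasonably short


def read_accessions(path):
    """Read one accession per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_genbank(accessions):
    """Download GenBank flat-file text for the given accessions, in batches."""
    chunks = []
    for start in range(0, len(accessions), BATCH_SIZE):
        query = urllib.parse.urlencode({
            "db": "nuccore",
            "id": ",".join(accessions[start:start + BATCH_SIZE]),
            "rettype": "gb",
            "retmode": "text",
        })
        with urllib.request.urlopen(f"{EFETCH}?{query}") as response:
            chunks.append(response.read().decode("utf-8"))
    return "\n".join(chunks)


def extract_countries(genbank_text):
    """Return (accession, country) pairs; country is '' when the qualifier is absent."""
    rows = []
    # Each GenBank record ends with a line containing only "//".
    for record in genbank_text.split("\n//"):
        accession = re.search(r"^ACCESSION\s+(\S+)", record, re.MULTILINE)
        if accession is None:
            continue
        # Assumption: accept the older "country" and the newer "geo_loc_name".
        match = re.search(r'/(?:country|geo_loc_name)="([^"]*)"', record)
        # Qualifier values can wrap across lines, so collapse the whitespace.
        country = " ".join(match.group(1).split()) if match else ""
        rows.append((accession.group(1), country))
    return rows


def to_csv(rows):
    """Format rows as CSV text, quoting values that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def main(path):
    """Print accession,country lines for every accession listed in the file."""
    rows = extract_countries(fetch_genbank(read_accessions(path)))
    sys.stdout.write(to_csv(rows))
```

Calling `main("accessions.txt")` prints one `accession,country` line per record. Records with an empty second column are the ones to drop from the dataset, and the same regular-expression approach should extend to `collection_date` if you need it.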
using this simple XSLT stylesheet:
$ curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=KT279761,KC692509,KC692496&retmode=xml" |\
  xsltproc --novalid transform.xsl -
KT279761 Haiti
KC692509 Argentina: Buenos Aires
KC692496 Argentina: Buenos Aires
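If you would rather stay in Python than use xsltproc, the same XML can be walked with the standard library. This is a rough sketch and not a translation of the stylesheet. It assumes the EFetch response is in GBSeq XML (`GBSeq`, `GBQualifier_name` and `GBQualifier_value` elements), and the function names are hypothetical. As a hedge, it accepts both `country` and `geo_loc_name` as the qualifier name.

```python
import sys
import xml.etree.ElementTree as ET

# Assumption: the location may be stored under either qualifier name.
GEO_QUALIFIERS = {"country", "geo_loc_name"}


def countries_from_gbxml(xml_data):
    """Return (accession, country) pairs from GBSeq XML given as str or bytes."""
    root = ET.fromstring(xml_data)
    rows = []
    for seq in root.iter("GBSeq"):
        accession = seq.findtext("GBSeq_primary-accession", default="")
        country = ""
        # Take the first matching qualifier found in the record's features.
        for qualifier in seq.iter("GBQualifier"):
            if qualifier.findtext("GBQualifier_name") in GEO_QUALIFIERS:
                country = qualifier.findtext("GBQualifier_value", default="")
                break
        rows.append((accession, country))
    return rows


def print_rows(xml_data, out=sys.stdout):
    """Write tab-separated accession and country lines."""
    for accession, country in countries_from_gbxml(xml_data):
        out.write(f"{accession}\t{country}\n")
```

To use it in place of the xsltproc step, end the script with `print_rows(sys.stdin.buffer.read())` and pipe the curl output into it. Sequences with nothing after the tab have no country annotated.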