Question: Extracting the country from a GenBank file with >200 sequences
2.3 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

Hey guys, I do viral phylogenetics and I'm currently working on an outbreak analysis, which includes some phylogeography. If a sequence has no country of origin or collection date, I have to remove it from my dataset. I have >200 sequences per dataset and I really don't want to go through GenBank manually. I haven't had any success writing or adapting Biopython scripts to extract the country from the GenBank file. Any help would be appreciated! Thanks!

modified 2.2 years ago by Juan Manuel Berros • written 2.3 years ago by tpaisie

Can you give a couple of examples of GenBank entries (accession numbers) and the field that contains the country annotation? Also, are you looking for a Python-only solution?

written 2.3 years ago by Santosh Anand

Here are a couple accession numbers: KT279761 KC692509 KC692496

The country annotation is under FEATURES, in the source feature, for example:

FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="genomic RNA"
                     /serotype="1"
                     /isolate="HNRG14635"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country="Argentina: Buenos Aires"
                     /collection_date="05-May-2009"

I'm looking for any solution, but I thought Python was my best bet, with Biopython and all.

written 2.3 years ago by tpaisie
2.2 years ago by
Buenos Aires, Argentina
Juan Manuel Berros80 wrote:

I'm adding a Python solution that you may later modify to include more data. You just need to specify the location of a file with the accessions (one per line), where it says accessions.txt:

The output is separated by commas, so you can later read it as a CSV. Using your IDs, I got:

KT279761,Haiti
KC692509,Argentina: Buenos Aires
KC692496,Argentina: Buenos Aires
modified 2.2 years ago • written 2.2 years ago by Juan Manuel Berros

What can I add to get the date?

modified 20 months ago • written 20 months ago by l.souza60
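
On the date question: in the example entry earlier in the thread, the date sits in the same source feature as the country, as a /collection_date qualifier. The script from the answer above is not reproduced in this copy of the thread, so what follows is a separate, minimal sketch rather than that script. It uses only the Python standard library and assumes the records have already been saved locally in GenBank flat-file format. The function names, the `require_both` switch, and the `XX000001` accession in the sample are made up for illustration; the `geo_loc_name` fallback is an assumption that newer records may use that qualifier name instead of `country`.

```python
import csv
import io
import re

# Matches qualifiers such as /country="Argentina: Buenos Aires".
# [^"]* also spans line breaks, so values wrapped over two lines are caught.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def split_records(text):
    """Yield one GenBank record at a time; each record ends with a '//' line."""
    chunk = []
    for line in text.splitlines():
        if line.strip() == "//":
            if chunk:
                yield "\n".join(chunk)
            chunk = []
        else:
            chunk.append(line)
    if any(line.strip() for line in chunk):
        yield "\n".join(chunk)


def parse_record(record_text):
    """Return (accession, country, collection_date); missing values are ''."""
    acc = re.search(r"^ACCESSION\s+(\S+)", record_text, re.MULTILINE)
    # Only look at the feature table, not the header lines above it.
    features = record_text.split("\nFEATURES", 1)[-1]
    quals = {}
    for name, value in QUALIFIER.findall(features):
        # Keep the first occurrence and collapse wrapped whitespace.
        quals.setdefault(name, " ".join(value.split()))
    # Assumption: some newer records may use geo_loc_name instead of country.
    country = quals.get("country") or quals.get("geo_loc_name", "")
    return (acc.group(1) if acc else "", country, quals.get("collection_date", ""))


def to_csv(text, require_both=False):
    """Return 'accession,country,collection_date' lines for every record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in map(parse_record, split_records(text)):
        if require_both and not (row[1] and row[2]):
            continue  # skip records lacking a country or a date
        writer.writerow(row)
    return out.getvalue()


# Hypothetical record: made-up accession, qualifiers taken from the example above.
SAMPLE = '''LOCUS       XX000001               10735 bp    RNA     linear   VRL
ACCESSION   XX000001
FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /host="Homo sapiens"
                     /country="Argentina: Buenos
                     Aires"
                     /collection_date="05-May-2009"
//
'''

print(to_csv(SAMPLE), end="")
```

To run it on a real file, read the file's text and pass it to `to_csv`; with `require_both=True`, records missing either value are left out, which matches the filtering described in the question. A regex pass like this is cruder than a full parser, so spot-check a few records against the originals.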
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

using this simple XSLT stylesheet:

run:

$ curl -s  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=KT279761,KC692509,KC692496&retmode=xml" |\
xsltproc --novalid transform.xsl -

KT279761 Haiti
KC692509 Argentina: Buenos Aires
KC692496 Argentina: Buenos Aires
modified 2.3 years ago • written 2.3 years ago by Pierre Lindenbaum

Thank you! Instead of a link can I replace that with the xml file I have?

Also I am getting this error when I try to run it:

"transform.xsl:1: namespace error : xmlns:xsl: '

modified 2.3 years ago • written 2.3 years ago by tpaisie

Yes, I've replaced my text with a gist on GitHub.

written 2.3 years ago by Pierre Lindenbaum

Thank you for your help!

written 2.3 years ago by tpaisie
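
On the question above about using a local XML file instead of the URL: xsltproc reads from a file if you give its path in place of the trailing `-`. The stylesheet itself is not reproduced in this copy of the thread, so as a stand-in (not the XSLT from the answer) here is a minimal Python sketch that pulls the same two columns out of an XML file using only the standard library. It assumes the file uses NCBI's GBSeq element names (`GBSeq`, `GBFeature`, `GBQualifier`, and so on); if your XML was downloaded in a different flavor, the tag names will need adjusting. The function names and the `XX000001` accession in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def source_qualifiers(gbseq):
    """Collect the qualifiers of the 'source' feature of one <GBSeq> element."""
    quals = {}
    for feature in gbseq.iter("GBFeature"):
        if feature.findtext("GBFeature_key") != "source":
            continue
        for qualifier in feature.iter("GBQualifier"):
            name = qualifier.findtext("GBQualifier_name")
            if name:
                value = qualifier.findtext("GBQualifier_value", default="")
                quals.setdefault(name, value)
    return quals


def countries(xml_text):
    """Return a list of (accession, country) tuples; country is '' if absent."""
    root = ET.fromstring(xml_text)
    rows = []
    for gbseq in root.iter("GBSeq"):
        accession = gbseq.findtext("GBSeq_primary-accession", default="")
        rows.append((accession, source_qualifiers(gbseq).get("country", "")))
    return rows


# Hypothetical, heavily trimmed record in the assumed GBSeq layout.
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>XX000001</GBSeq_primary-accession>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>country</GBQualifier_name>
            <GBQualifier_value>Argentina: Buenos Aires</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

for accession, country in countries(SAMPLE_XML):
    print(accession, country, sep="\t")
```

For a file on disk, read its contents and pass the text to `countries`. The collection date can be picked up the same way by looking up `collection_date` in the dictionary returned by `source_qualifiers`.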