Question: Extracting the Country from a >200 sequences Genbank file
0
gravatar for tpaisie
3.9 years ago by
tpaisie70
University of Florida
tpaisie70 wrote:

Hey guys, I do phylogenetics of viruses and I'm currently working on an outbreak analysis. So I'm doing some phylogeography too. Obviously if there is no country of origin or collection date I have to take the sequence out of my dataset. I have >200 sequences per dataset and I really don't want to waste my precious time by going through genbank manually. I haven't been successful making or editing any biopython scripts to extract the country from the genbank file. Any help would be appreciated! Thanks!

ADD COMMENTlink modified 3.9 years ago by Juan Manuel Berros90 • written 3.9 years ago by tpaisie70
1

can you give a couple of examples of genbank entries (accession numbers) and the field which contains country annotation? Also are you looking for python-only solution?

ADD REPLYlink written 3.9 years ago by Santosh Anand5.1k

Here are a couple accession numbers: KT279761 KC692509 KC692496

The country annotation is in the Features, then source, for example:

FEATURES Location/Qualifiers source 1..10735 /organism="Dengue virus 1" /mol_type="genomic RNA" /serotype="1" /isolate="HNRG14635" /isolation_source="serum" /host="Homo sapiens" /db_xref="taxon:11053" /country="Argentina: Buenos Aires" /collection_date="05-May-2009"

I'm looking for any solution, but i thought python was my best bet, with Biopython and all.

ADD REPLYlink written 3.9 years ago by tpaisie70
3
gravatar for Juan Manuel Berros
3.9 years ago by
Buenos Aires, Argentina
Juan Manuel Berros90 wrote:

I'm adding a Python solution that you may later modify to include more data. You just need to specify the location of a file with the accessions (one per line), where it says accessions.txt:

The output is separated by commas, so you can later read it as a CSV. Using your IDs, I got:

KT279761,Haiti
KC692509,Argentina: Buenos Aires
KC692496,Argentina: Buenos Aires
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Juan Manuel Berros90

What can I add to get the 'date' ?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by l.souza70

Hi,

As tpaisie I also want to extract specific information ("collection_date" for my case) from a genbank file. I tried to slightly modify your code in order to get the "collection_date", but it seems to not working. Any idea? Here is my code below :

from Bio import Entrez

# Read the accessions from a file
accessions_file = 'accession_nuber_list.txt'
with open(accessions_file) as f:
    ids = f.read().split('\n')

# Fetch the entries from Entrez
Entrez.email = 'emailadress'  # Insert your email here
handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
response = Entrez.read(handle)

# Parse the entries to get the country
def extract_date(entry):
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']

    for source in sources:
        qualifiers = [qual for qual in source['GBFeature_quals']
                      if qual['GBQualifier_name'] == 'collection']

        for qualifier in qualifiers:
            yield qualifier['GBQualifier_value']

for entry in response:
    accession = entry['GBSeq_primary-accession']
    for date in extract_date(entry):
        print(accession, date, sep=',')
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by gabrieldupre280

It seems the key for the date is collection_date, not collection. Try changing that.

ADD REPLYlink written 7 weeks ago by Juan Manuel Berros90
0
gravatar for Pierre Lindenbaum
3.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

using this simple XSLT stylesheet:

run:

$ curl -s  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=KT279761,KC692509,KC692496&retmode=xml" |\
xsltproc --novalid transform.xsl -

KT279761 Haiti
KC692509 Argentina: Buenos Aires
KC692496 Argentina: Buenos Aires
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Pierre Lindenbaum130k

Thank you! Instead of a link can I replace that with the xml file I have?

Also I am getting this error when I try to run it:

"transform.xsl:1: namespace error : xmlns:xsl: '

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by tpaisie70
1

yes, I've replaced my text with a gist on github.

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum130k

Thank you for your help!

ADD REPLYlink written 3.9 years ago by tpaisie70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1946 users visited in the last hour