Question: How To Extract Title And /Isolation_Source From A List Of Genbank Accession Numbers
0
gravatar for samlambrechts299
5.6 years ago by
samlambrechts299120 wrote:

Mighty Bioinformaticians,

I have a text file (txt format) with roughly 350 GenBank accession numbers (each representing a specific 16S sequence) and I want to extract the title of the corresponding paper ("TITLE" in genbank files) and the isolation source ("/isolation_source" under "FEATURE" in gb records) for each accession number.

Input example:

HM789874
JN225528

Desired output:

TITLE, /isolation_source of HM789874
TITLE, /isolation_source of JN225528

Now I have searched previous questions and found all sorts of useful information on how I might be able to solve this problem with BioPerl or BioPython modules, but as I am a biologist and not yet a bioinformatician, I find it a bit overwhelming and fail to compile all the potentially useful tricks and tips into a script or command. FYI, I am familiar with Perl/BioPerl (beginner), but I still have to make my first steps in Python.

Any script or help would be greatly appreciated,

Sam

bioperl biopython genbank • 4.2k views
ADD COMMENTlink modified 5.6 years ago by Leszek4.0k • written 5.6 years ago by samlambrechts299120
1

A shell one liner:

wget -q -O temporaryfile "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JN225528&rettype=gb&retmode=text" ; sed 's/^ *\//%/;s/ */ /;s/TITLE/%TITLE/;s/JOURNAL/%J/' temporaryfile | tr -d "\n"| tr "%" "\n" | grep -e "isolation_source" -e TITLE

OUTPUT:

TITLE     Direct Submission 
isolation_source="granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level"
ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Bharat Iyengar260
2
gravatar for Pierre Lindenbaum
5.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

use the following xslt stylesheet:

example:

$ echo  -e "HM789874\nJN225528" | while read ACN ; do xsltproc stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=${ACN}" ; done

HM789874    Biogeography and habitat modelling of high-alpine bacteria    Green Lakes Valley Talus
HM789874    Direct Submission    Green Lakes Valley Talus
JN225528    Direct Submission    granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level
ADD COMMENTlink modified 3 months ago • written 5.6 years ago by Pierre Lindenbaum116k

This is amazing, works perfectly!

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by samlambrechts299120

Dear Pierre, Dear Bioinformaticians,

Do you know how to edit the above example use of the xslt stylesheet so that I can use an input text file with GenBank accession numbers instead of manually typing all the accession numbers like "HM789874\nJN225528" in the above example?

ADD REPLYlink written 3 months ago by samlambrechts299120
1
cat your.file | while read ...
ADD REPLYlink written 3 months ago by Pierre Lindenbaum116k

Thank you. Unfortunately, it seems the xslt stylesheet doesn't work anymore 5 years later. It produces the following error:

cannot parse stylesheet.xsl stylesheet.xsl:2: namespace error : xmlns:xsl: '

ADD REPLYlink modified 3 months ago • written 3 months ago by samlambrechts299120
1

it's because the 'new' biostars engine ate my xml code. I've updated my post with a 'gist'

ADD REPLYlink written 3 months ago by Pierre Lindenbaum116k

Thank you. Unfortunately something changed at ncbi:

warning: failed to load external entity "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=AF184218"

Edit: This problem persists after I removed the windows carriage returns

ADD REPLYlink modified 3 months ago • written 3 months ago by samlambrechts299120
1

since this post, 5 years ago, ncbi has moved from http to httpS . Try this:

$ echo  -e "HM789874\nJN225528" | awk '{printf("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=%s\n",$0);}' | while read URL ; do wget -O - -q  ${URL} | xsltproc --novalid stylesheet.xsl - ; done

HM789874    Biogeography and habitat modelling of high-alpine bacteria    Green Lakes Valley Talus
HM789874    Direct Submission    Green Lakes Valley Talus
JN225528    Direct Submission    granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level
ADD REPLYlink written 3 months ago by Pierre Lindenbaum116k
1

furthermore, the recent version of ncbi doesn't allow more than 'x' request per minutes, you might have to add a 'sleep' to the command line

ADD REPLYlink written 3 months ago by Pierre Lindenbaum116k

And it works!! Thanks!!!

ADD REPLYlink written 3 months ago by samlambrechts299120
1
gravatar for Leszek
5.6 years ago by
Leszek4.0k
IIMCB, Poland
Leszek4.0k wrote:

for python lovers (just adding a few lines to Fetching genbank entries for list of accession numbers.):

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions and report publication title and isolation source. 

USAGE:
  python acc2title.epost.py A22237 A22239 A32021 A32022 A33397 > out.tsv
or
  cat ids | python acc2title.epost.py > out.tsv

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez, SeqIO

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get GI for query accesions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#print header
sys.stdout.write("#accesion\ttitle\tisolation_source\n")
#fecth all results in batch of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #parse gb entries
  for r in SeqIO.parse(handle, 'gb'):
      title = isolation_source = ""
      #get title
      if 'references' in r.annotations:
          ref   = r.annotations['references'][0]
          title = ref.title
      #get source features
      source = r.features[0]
      #get isolation_source
      if 'isolation_source' in source.qualifiers:
          isolation_source = source.qualifiers['isolation_source'][0]
    #print output to stdout
    sys.stdout.write("%s\t%s\t%s\n" % r.name, title, isolation_source))
ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Leszek4.0k

Thank you. However, this does not seem to work for large numbers of Genbank accessions (77 000). I receive the following error message:

urllib2.HTTPError: HTTP Error 414: Request-URI Too Long

ADD REPLYlink modified 3 months ago • written 3 months ago by samlambrechts299120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1791 users visited in the last hour