Question: How To Extract Title And /Isolation_Source From A List Of Genbank Accession Numbers
samlambrechts299 wrote, 6.0 years ago:

Mighty Bioinformaticians,

I have a text file (txt format) with roughly 350 GenBank accession numbers (each representing a specific 16S sequence) and I want to extract the title of the corresponding paper ("TITLE" in genbank files) and the isolation source ("/isolation_source" under "FEATURE" in gb records) for each accession number.

Input example:

HM789874
JN225528

Desired output:

TITLE, /isolation_source of HM789874
TITLE, /isolation_source of JN225528

Now I have searched previous questions and found all sorts of useful information on how I might be able to solve this problem with BioPerl or BioPython modules, but as I am a biologist and not yet a bioinformatician, I find it a bit overwhelming and fail to compile all the potentially useful tricks and tips into a script or command. FYI, I am familiar with Perl/BioPerl (beginner), but I still have to make my first steps in Python.

Any script or help would be greatly appreciated,

Sam

Tags: bioperl biopython genbank
— written 6.0 years ago by samlambrechts299 • modified 6.0 years ago by Leszek
A shell one liner:

wget -q -O temporaryfile "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JN225528&rettype=gb&retmode=text" ; sed 's/^ *\//%/;s/ */ /;s/TITLE/%TITLE/;s/JOURNAL/%J/' temporaryfile | tr -d "\n"| tr "%" "\n" | grep -e "isolation_source" -e TITLE

OUTPUT:

TITLE     Direct Submission 
isolation_source="granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level"
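For the record, the same two fields can be pulled from an already-downloaded flat file with a few lines of Python instead of the sed/tr pipeline — a sketch that assumes the standard GenBank flat-file layout (TITLE indented two spaces, continuation lines indented twelve), so verify it against your own records:

```python
# Sketch: extract the first TITLE and the /isolation_source qualifier from a
# GenBank flat-file record, assuming the standard fixed-width layout
# (TITLE indented 2 spaces, continuation lines indented 12).
import re

def title_and_isolation_source(gb_text):
    """Return (title, isolation_source) from one GenBank flat-file record."""
    m = re.search(r'^  TITLE +(.*(?:\n {12}.*)*)', gb_text, re.M)
    title = ' '.join(m.group(1).split()) if m else ''
    # [^"]* also spans newlines, so multi-line qualifier values are handled
    m = re.search(r'/isolation_source="([^"]*)"', gb_text)
    iso = ' '.join(m.group(1).split()) if m else ''
    return title, iso
```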
— written 6.0 years ago by Bharat Iyengar
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 6.0 years ago:

Use the following XSLT stylesheet:

example:

$ echo  -e "HM789874\nJN225528" | while read ACN ; do xsltproc stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=${ACN}" ; done

HM789874    Biogeography and habitat modelling of high-alpine bacteria    Green Lakes Valley Talus
HM789874    Direct Submission    Green Lakes Valley Talus
JN225528    Direct Submission    granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level
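For anyone avoiding XSLT, the same extraction can be sketched with Python's standard library against the efetch XML (retmode=xml returns a GBSet document; the element names below follow that schema, so treat them as assumptions to check against your own output):

```python
# Sketch: pull (locus, reference title, isolation_source) triples out of the
# GBSet/GBSeq XML that efetch returns with retmode=xml, mirroring what the
# XSLT stylesheet does, using only the standard library.
import xml.etree.ElementTree as ET

def gbseq_rows(xml_text):
    rows = []
    root = ET.fromstring(xml_text)
    for seq in root.findall('GBSeq'):
        locus = seq.findtext('GBSeq_locus', '')
        iso = ''
        # /isolation_source lives among the feature qualifiers
        for q in seq.iter('GBQualifier'):
            if q.findtext('GBQualifier_name') == 'isolation_source':
                iso = q.findtext('GBQualifier_value', '')
        # one output row per REFERENCE, as in the stylesheet
        for ref in seq.iter('GBReference'):
            rows.append((locus, ref.findtext('GBReference_title', ''), iso))
    return rows
```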
— written 6.0 years ago by Pierre Lindenbaum • modified 8 months ago

This is amazing, works perfectly!

— written 6.0 years ago by samlambrechts299

Dear Pierre, Dear Bioinformaticians,

Do you know how to edit the above example use of the xslt stylesheet so that I can use an input text file with GenBank accession numbers instead of manually typing all the accession numbers like "HM789874\nJN225528" in the above example?

— written 8 months ago by samlambrechts299
cat your.file | while read ...
— written 8 months ago by Pierre Lindenbaum

Thank you. Unfortunately, it seems the xslt stylesheet doesn't work anymore 5 years later. It produces the following error:

cannot parse stylesheet.xsl stylesheet.xsl:2: namespace error : xmlns:xsl: '

— written 8 months ago by samlambrechts299

It's because the 'new' Biostars engine ate my XML code. I've updated my post with a gist.

— written 8 months ago by Pierre Lindenbaum

Thank you. Unfortunately something changed at ncbi:

warning: failed to load external entity "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=AF184218"

Edit: this problem persists after I removed the Windows carriage returns.

— written 8 months ago by samlambrechts299

Since this post, 5 years ago, NCBI has moved from HTTP to HTTPS. Try this:

$ echo  -e "HM789874\nJN225528" | awk '{printf("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=xml&id=%s\n",$0);}' | while read URL ; do wget -O - -q  ${URL} | xsltproc --novalid stylesheet.xsl - ; done

HM789874    Biogeography and habitat modelling of high-alpine bacteria    Green Lakes Valley Talus
HM789874    Direct Submission    Green Lakes Valley Talus
JN225528    Direct Submission    granite rock material in the Xinjiang Bole at an elevation of 1,015 m above sea level
— written 8 months ago by Pierre Lindenbaum

Furthermore, NCBI now limits the number of E-utilities requests per second (three per second without an API key), so you might have to add a 'sleep' to the command line.
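The same pacing can be sketched in Python with a small throttle helper — the three-requests-per-second figure is NCBI's documented limit for keyless clients, and the fetch call shown in the usage comment is a hypothetical stand-in for whatever download step you use:

```python
# Sketch of client-side rate limiting for E-utilities calls: without an NCBI
# API key the service allows about 3 requests per second, so we space them out.
import time

class Throttle:
    """Call wait() before each request to stay under max_per_sec requests/s."""
    def __init__(self, max_per_sec=3):
        self.min_interval = 1.0 / max_per_sec
        self.last = 0.0

    def wait(self):
        # sleep just long enough that consecutive calls are min_interval apart
        delta = time.monotonic() - self.last
        if delta < self.min_interval:
            time.sleep(self.min_interval - delta)
        self.last = time.monotonic()

# usage (URLs built as in the awk command above):
# throttle = Throttle()
# for url in urls:
#     throttle.wait()
#     fetch(url)   # hypothetical download helper (wget, urllib, ...)
```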

— written 8 months ago by Pierre Lindenbaum

The above command works great. I am trying to extract PubMed ID (PMID) information for ~500 accession numbers that I have downloaded from the RefSeq database. My goal is to identify which accession numbers are published and associated with PubMed. The 500 accessions include both finished (completed) and unfinished (contig) genomes. Here is the modified XSLT stylesheet:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.1">
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:apply-templates select="/GBSet/GBSeq"/>
</xsl:template>

<xsl:template match="GBSeq">
<xsl:variable name="gbseq" select="."/>
<xsl:for-each select="$gbseq/GBSeq_references/GBReference/GBReference_pubmed">
<xsl:variable name="pubmed" select="./text()"/>
<xsl:value-of select="$gbseq/GBSeq_locus"/>
<xsl:text>    </xsl:text>
<xsl:value-of select="$pubmed"/>
<xsl:text>    </xsl:text>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

When I supply the list of only FINISHED genomes, I get the PMID of all the accession numbers without any error. However, when I supply a list with both FINISHED and UNFINISHED genomes, it gives an error after working fine for a few samples.

Example file for only finished genomes, one accession per line

CP007676
CP007690
CP009361
CP009423
CP009554
CP009681
CP009828
CP010295

Example file with both finished and unfinished genomes, one accession per line

CP007676
CP007690
CP009361
CP009423
CP009554
CP009681
CP009828
CP010295
LALV00000000.1
LALA00000000.1
LALB00000000.1
ASZO00000000.1
AUPV00000000.1
AUPT00000000.1

Here is my command:

cat finishedUnfinished.txt | awk '{printf("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&retmode=xml&id=%s\n",$0);}' | while read URL ; do wget -O - -q  ${URL} | xsltproc --novalid stylesheet_PMID.xsl - ; done

Here is the output:

CP007676 26048971
CP007690 24970829
CP009361 25377701
CP009681 27152133
CP010295 25767217
LALV01000000 26048971
LALA01000000 26048971
LALB01000000 26048971
-:1: parser error : Document is empty

^ unable to parse - -:1: parser error : Document is empty

Any help will be highly appreciated.

Tauqeer

— written 11 weeks ago by tauqeer
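A likely culprit for the "Document is empty" failure is that some of the WGS master records come back with an empty body, which xsltproc then cannot parse. One way to keep the loop alive is to retry and then skip empty downloads before parsing — a sketch in which download and parse are hypothetical stand-ins for the fetch and XSLT steps:

```python
# Sketch: skip empty efetch responses instead of letting the XML parser choke.
# 'download' and 'parse' are stand-ins for the real fetch (wget/urllib) and
# XSLT/XML-parsing steps.
def process_accessions(accessions, download, parse, retries=2):
    """Yield (accession, parsed) pairs, skipping accessions whose download
    is still empty after 'retries' extra attempts."""
    for acc in accessions:
        body = ""
        for _ in range(retries + 1):
            body = download(acc)
            if body.strip():
                break
        if not body.strip():
            print("skipping %s: empty response" % acc)
            continue
        yield acc, parse(body)
```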

And it works!! Thanks!!!

— written 8 months ago by samlambrechts299
Leszek (IIMCB, Poland) wrote, 6.0 years ago:

For Python lovers (just adding a few lines to "Fetching genbank entries for list of accession numbers"):

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions and report publication title and isolation source. 

USAGE:
  python acc2title.epost.py A22237 A22239 A32021 A32022 A33397 > out.tsv
or
  cat ids | python acc2title.epost.py > out.tsv

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez, SeqIO

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if len(sys.argv) > 1:
  accs = sys.argv[1:]
else: #load accessions from stdin
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get GIs for the query accessions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#print header
sys.stdout.write("#accession\ttitle\tisolation_source\n")
#fetch all results in batches of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #parse gb entries
  for r in SeqIO.parse(handle, 'gb'):
      title = isolation_source = ""
      #get title
      if 'references' in r.annotations:
          ref   = r.annotations['references'][0]
          title = ref.title
      #get source features
      source = r.features[0]
      #get isolation_source
      if 'isolation_source' in source.qualifiers:
          isolation_source = source.qualifiers['isolation_source'][0]
      #print output to stdout
      sys.stdout.write("%s\t%s\t%s\n" % (r.name, title, isolation_source))
— written 6.0 years ago by Leszek • modified 6.0 years ago

Thank you. However, this does not seem to work for large numbers of GenBank accessions (77,000). I receive the following error message:

urllib2.HTTPError: HTTP Error 414: Request-URI Too Long
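The 414 comes from packing all 77,000 accessions into a single request URL; processing the list in chunks keeps each request small. A sketch of the chunking, where fetch_chunk is a hypothetical stand-in for the script's esearch/epost/efetch sequence:

```python
# Sketch: avoid "Request-URI Too Long" by splitting a huge accession list
# into modest batches before building each E-utilities request.
def chunked(items, size):
    """Yield successive lists of at most 'size' items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# usage:
# for batch in chunked(accs, 200):
#     fetch_chunk(batch)   # run the esearch/epost/efetch steps per batch
```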

— written 8 months ago by samlambrechts299