Question

edirect: Entrez Unix Command line

10.8 years ago

st.ph.n ★ 2.7k

Hi All. I'm trying to format the results of a PubMed search using Entrez Direct (EDirect), the E-utilities on the UNIX command line. I want to search with esearch and then use xtract to output the following tab-separated columns:

pubmedid        First Author Last Name / FA first initial       First Author Affiliation        Date Published

I can get some of the output I need using two different commands, but combining them seems trickier than it should be. The problem, I think, lies with the two different formats involved: docsum and xml.

The first command I've been playing with:

./esearch \
  -db pubmed \
  -query "search string" | \
  ./efilter -mindate 2005 | \
  ./efetch -format docsum | \
    ./xtract \
      -pattern DocumentSummary \
      -element MedlineCitation/PMID \
      -element Id SortFirstAuthor | \
      sort -t $'\t' -k 3,3n -k 2,2f

So that outputs the first two columns as desired. However, the docsum format doesn't contain information about affiliation.

Using this command:

./esearch \
  -db pubmed \
  -query "search string" | \
  ./efilter -mindate 2005 | \
  ./efetch -format xml | \
    ./xtract -pattern PubmedArticle \
    -element MedlineCitation/PMID \
    -element Id SortFirstAuthor Affiliation \
    -block PubDate \
    -sep " " \
    -element Year,Month MedlineDate | \
    sort -t $'\t' -k 3,3n -k 2,2f

I get the pubmedid, and all affiliations of every author on each publication.

Does anyone know how I might tweak either of these commands? Is it possible to ignore all authors except the first author?

All help is appreciated.

edirect efetch entrez xtract esearch • 9.4k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by st.ph.n ★ 2.7k

10.8 years ago

Pierre Lindenbaum 166k

FYI: you might be interested in my old blog post "Mapping NCBI/PUBMED" (2007) : http://plindenbaum.blogspot.fr/2007/06/mapping-ncbipubmed.html

http://code.google.com/p/lindenb/source/browse/trunk/src/java/org/lindenb/tool/oneshot/NCBIMap.java

ADD COMMENT • link 10.8 years ago by Pierre Lindenbaum 166k

Thanks, Pierre. What you did seems like what I am going to try and accomplish. I plan to parse out the city/state/country information from the affiliation of the first author for each publication found on a given topic. (hence esearch, instead of efetch). Then using that information use geopy to get longitude/latitude for those cities, and map them to a basemap using networkx, and matplotlib.

ADD REPLY • link 10.8 years ago by st.ph.n ★ 2.7k

Ram · Accepted Answer · 2014-09-02

10.8 years ago

Ashutosh Pandey 12k

You should be using "efetch" eutils for this task. You can write your own xml parser. But I would not advise you to reinvent the wheel as it can be pretty easily done using Biopython or Bioperl. See below how it can be done in Biopython:

from Bio import Entrez
from Bio import Medline

Entrez.email = 'xyz@gmail.com'  # replace it with your email id

# Search PubMed for a string; the parsed result holds the matching PMIDs in "IdList"
handle = Entrez.esearch(db="pubmed", term="Ignorome")
record = Entrez.read(handle)
handle.close()


def Pubmedsearch(pmid):
    """Return PMID, title, first author, first affiliation and date as one tab-separated line."""
    handle = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()
    for rec in records:
        # TI > Title, FAU > Full Author Name, AU > Author name, AD > Affiliation, DP > Publication date
        # (More tags can be added from here http://www.ncbi.nlm.nih.gov/books/NBK3827/)
        author = rec.get("AU", ["?"])[0]  # first element of the authors list, because you needed the first author
        affiliation = rec.get("AD", "?")
        if isinstance(affiliation, list):  # some Biopython versions return AD as a list of strings
            affiliation = affiliation[0]
        affiliation = affiliation.split(".")[0]  # split the AD string on "." to keep the first author's affiliation
        return "\t".join([str(pmid), rec.get("TI", "?"), author, affiliation, rec.get("DP", "?")])


# Test: this is my paper. I have also printed the title.
for pmid in record["IdList"]:
    info = Pubmedsearch(pmid)
    print(info)

Output

24523945        Functionally enigmatic genes: a case study of the brain ignorome.       Pandey AK       UT Center for Integrative and Translational Genomics and Department of Anatomy and Neurobiology, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America  2014

EDIT: Now it takes a string as input, retrieves all the PMIDs from PubMed related to that string, and then loops over each PMID and returns the first author information.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

Can efetch be used to search for all publications matching a string, similar to my example where esearch is used?

ADD REPLY • link 10.8 years ago by st.ph.n ★ 2.7k

I didn't read your question carefully. I thought you needed author information for a specific PubMed ID. Efetch works when you already have a PubMed ID. It doesn't work with all the Entrez databases. You will have to write an Esearch -> Efetch pipeline as explained here. Or you can use the Biopython module that wraps Esearch and returns a list of PMIDs, and then get the author information using Efetch. I am sure Biopython will have some module for Esearch.
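As a rough illustration of what an Esearch -> Efetch pipeline looks like without Biopython, here is a minimal sketch that only builds the two E-utilities URLs and parses an ESearch reply. The function names (`esearch_url`, `parse_esearch`, `efetch_url`) are made up for this example, and `SAMPLE_ESEARCH` is a hand-written, trimmed reply with a placeholder WebEnv value, not real server output. Nothing here touches the network; you would fetch the URLs yourself (and should add your email/API key as NCBI asks).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, mindate=None, retmax=100):
    """Build an ESearch URL that also stores the hits on the history server."""
    params = {"db": "pubmed", "term": term, "usehistory": "y", "retmax": retmax}
    if mindate is not None:
        # mindate and maxdate have to be supplied together; pdat = publication date
        params.update({"datetype": "pdat", "mindate": mindate, "maxdate": "3000"})
    return EUTILS + "/esearch.fcgi?" + urlencode(params)


def parse_esearch(xml_text):
    """Return (list of PMIDs, WebEnv, QueryKey) from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall("./IdList/Id")]
    return ids, root.findtext("WebEnv"), root.findtext("QueryKey")


def efetch_url(webenv, query_key, retmax=100):
    """Build an EFetch URL that pulls the stored hits back as PubMed XML."""
    params = {"db": "pubmed", "query_key": query_key, "WebEnv": webenv,
              "retmode": "xml", "retmax": retmax}
    return EUTILS + "/efetch.fcgi?" + urlencode(params)


# Hand-written, trimmed example of an ESearch reply (placeholder WebEnv value)
SAMPLE_ESEARCH = (
    "<eSearchResult><Count>1</Count><QueryKey>1</QueryKey>"
    "<WebEnv>EXAMPLE_WEBENV</WebEnv>"
    "<IdList><Id>24523945</Id></IdList></eSearchResult>"
)

ids, webenv, query_key = parse_esearch(SAMPLE_ESEARCH)
print(esearch_url("coral genomics", mindate=2005))
print(efetch_url(webenv, query_key))
```

The history server route (`usehistory=y`, then `WebEnv` and `query_key`) is what lets the second step fetch every hit without pasting a long list of PMIDs into the URL.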

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

Go through this tutorial. You can see how it uses a string to search PubMed for all the PMIDs that match that string using "esearch". Once you have the PMIDs, you can use "efetch".

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

Ok, so I will modify the first command,

./esearch -db pubmed -query "coral genomics" | ./efilter -mindate 2005 | ./efetch -format xml | ./xtract -pattern PubmedArticle -element MedlineCitation/PMID

so that it now prints only the PubMed IDs. From there, I can use Biopython, with your example, to get the information for each PubMed ID. What I really want is only the city/state/country from each first author affiliation, which I can parse out later.

ADD REPLY • link 10.8 years ago by st.ph.n ★ 2.7k

You will have to write your own code to extract information at that level of specificity. Try to understand the structure of the output and then write a parser that retrieves city, country, etc. PubMed does not expose the affiliation as separate city, state and country fields. It comes back as one long affiliation string, and you will have to parse whatever you need from that string. PS: I have updated a link in my previous comment. Please go through that tutorial; it should be very helpful. Thanks.
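To make the "parse it out of the big string" step concrete, here is a minimal heuristic sketch. `guess_location` is a made-up helper, and it assumes the affiliation ends in "City, State, Country", which is true for the example output above but certainly not for every PubMed record (many have no state, or end with a postcode or email address).

```python
import re


def guess_location(affiliation):
    """Heuristically return (city, state, country) from a free-text affiliation.

    Assumption: the string ends in "..., City, State, Country".
    Missing fields come back as None.
    """
    # If several affiliations were joined with ';', keep only the first one
    text = affiliation.split(";")[0]
    # Drop a trailing e-mail address, then any trailing period or space
    text = re.sub(r"\s*\S+@\S+\s*$", "", text).rstrip(". ")
    parts = [p.strip() for p in text.split(",") if p.strip()]
    # Pad on the left so that short strings still yield three values
    city, state, country = ([None] * 3 + parts)[-3:]
    return city, state, country


affil = ("UT Center for Integrative and Translational Genomics and Department of "
         "Anatomy and Neurobiology, University of Tennessee Health Science Center, "
         "Memphis, Tennessee, United States of America")
print(guess_location(affil))
```

For anything beyond a quick map, it is probably safer to pass the trailing part of the string to a geocoder and let it decide what is a city and what is a country.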

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

Please check the new code.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

Thanks, Ashutosh. I'm familiar with Python and know that Biopython can deal with these kinds of tasks. I was hoping to get it all done with Entrez Direct (EDirect).

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by st.ph.n ★ 2.7k

Well, I also have code that doesn't use Biopython and uses pure eutils with my own XML parser, but I think Pierre's code would be much better than mine. Good luck.
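For reference, a "pure eutils plus your own XML parser" approach might look roughly like the sketch below, using only the Python standard library. This is not the code mentioned above, just an illustration: `first_author_rows` is a made-up name, and `SAMPLE_XML` is a hand-trimmed stand-in for `efetch -format xml` output, built from the record shown earlier in this thread plus a placeholder second author. It keeps only the first `Author` element of each article, which is the part the original question asked about.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for efetch XML; the second author is a placeholder
SAMPLE_XML = """<PubmedArticleSet>
<PubmedArticle>
  <MedlineCitation>
    <PMID>24523945</PMID>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2014</Year></PubDate></JournalIssue></Journal>
      <AuthorList>
        <Author>
          <LastName>Pandey</LastName><Initials>AK</Initials>
          <AffiliationInfo><Affiliation>UT Center for Integrative and Translational Genomics and Department of Anatomy and Neurobiology, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America</Affiliation></AffiliationInfo>
        </Author>
        <Author>
          <LastName>Placeholder</LastName><Initials>X</Initials>
          <AffiliationInfo><Affiliation>Placeholder Institute</Affiliation></AffiliationInfo>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>"""


def first_author_rows(xml_text):
    """Return one (pmid, 'LastName Initials', affiliation, date) tuple per article."""
    rows = []
    root = ET.fromstring(xml_text)
    for art in root.iter("PubmedArticle"):
        pmid = art.findtext("MedlineCitation/PMID", default="?")
        # find() returns only the first matching element, i.e. the first author
        author = art.find("MedlineCitation/Article/AuthorList/Author")
        last = initials = affil = "?"
        if author is not None:
            last = author.findtext("LastName", default="?")
            initials = author.findtext("Initials", default="?")
            # first Affiliation found anywhere below this Author element
            affil = author.findtext(".//Affiliation", default="?")
        pub = art.find("MedlineCitation/Article/Journal/JournalIssue/PubDate")
        date = "?"
        if pub is not None:
            parts = [pub.findtext(tag) for tag in ("Year", "Month", "Day")]
            # fall back to the free-text MedlineDate when no Year/Month/Day is present
            date = " ".join(p for p in parts if p) or pub.findtext("MedlineDate", default="?")
        rows.append((pmid, f"{last} {initials}", affil, date))
    return rows


for row in first_author_rows(SAMPLE_XML):
    print("\t".join(row))
```

Records with a collective (group) author have no `LastName`, so they come out as "?" here; handle that case separately if it matters for your search.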

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k