Question: Extract Bibliographic Reference from GEO dataset using EFetch?
Tom wrote (16 months ago):

Hi all, I have about 100 data set accessions from GEO, and for each one I want to pull out its bibliographic reference using EFetch.

For example, for the list:

GDS5204
GDS520
GDS4925

The output should be:

GDS4925 D'Souza M, Zhu X, Frisina RD. Novel approach to select genes from RMA normalized microarray data using functional hearing tests in aging mice. J Neurosci Methods 2008 Jun 30;171(2):279-87. PMID: 18455804

GDS520 Blalock EM, Chen KC, Sharrow K, Herman JP et al. Gene microarrays in hippocampal aging: statistical profiling identifies novel processes correlated with cognitive impairment. J Neurosci 2003 May 1;23(9):3807-19. PMID: 12736351

GDS5204 Lu T, Aron L, Zullo J, Pan Y et al. REST and stress resistance in ageing and Alzheimer's disease. Nature 2014 Mar 27;507(7493):448-54. PMID: 24670762

I am trying to use an Efetch command:

esearch -db GDS -query "GDS5204[ACCN]" | efetch -format docsum | xtract -pattern DocumentSummary -element PubMedIds

The command "runs" (as in no error), but there's also no output at all. I've tried editing different parts of the above command but I still just get the same thing.

Would someone know how to edit the above command to get the output described above (i.e. full reference for a set of data sets)?

Thanks

Santosh Anand wrote (16 months ago):

Not a full solution, but ....

For some reason unknown to me, each PubMed ID inside PubMedIds in the XML document is further wrapped in an <int> tag. So your query should look like this:

esearch -db GDS -query "GDS5204[ACCN]" | efetch -format docsum | xtract -pattern DocumentSummary -element PubMedIds/int

This should give you the PubMed ID, which you then have to look up in the pubmed database with a query something like:

efetch -db pubmed -id 24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -tab "" -element "&COM" Initials,LastName -COM "(, )"

Hope you got the idea. See also https://www.ncbi.nlm.nih.gov/books/NBK179288/
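
For readers who want to see why the extra /int level matters, here is a minimal Python sketch of the same lookup. The XML fragment is hand-written for illustration (only the Accession and PubMedIds/int elements and the two IDs come from this thread; real records carry many more fields), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a docsum record from the gds database.
# Assumption: only Accession and PubMedIds/int are modelled here.
DOCSUM_XML = """<DocumentSummarySet>
  <DocumentSummary>
    <Accession>GDS4925</Accession>
    <PubMedIds>
      <int>18455804</int>
      <int>24587312</int>
    </PubMedIds>
  </DocumentSummary>
</DocumentSummarySet>"""


def pubmed_ids_by_accession(xml_text):
    """Map each DocumentSummary's Accession to its list of PubMed IDs."""
    root = ET.fromstring(xml_text)
    result = {}
    for summary in root.iter("DocumentSummary"):
        accession = summary.findtext("Accession")
        # The IDs are the text of the nested <int> children,
        # not of the <PubMedIds> element itself.
        result[accession] = [node.text for node in summary.findall("PubMedIds/int")]
    return result


print(pubmed_ids_by_accession(DOCSUM_XML))
```

The point of the sketch is only that the IDs live one level below PubMedIds, which is what the PubMedIds/int path in the xtract command addresses.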


I really appreciate the help.

Unfortunately, when I type:

efetch -db pubmed -id  24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID

or

efetch -db pubmed -id  24670762 -format xml |   xtract -pattern PubmedArticle -element MedlineCitation/PMID     -block Author -sep " " -tab ""       -element "&COM" Initials,LastName -COM "(,` )"

I get the same as before: no error, but also no output. Would you have any ideas?

When I type:

efetch -db pubmed -id  24670762 -format xml

I can see the full XML output though, so I know it's not a connection problem or similar. I did look at the link you sent, and I appreciate it. This is all new to me, and that link is what I've been following to get this far, but now I'm a bit stuck.

Reply written 16 months ago by Tom

efetch -db pubmed -id  24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID

Your first query works on my computer and gives the PMID as output (24670762)!

Reply written 16 months ago by Santosh Anand

Thanks, that's strange. On both a work server and my home Mac laptop (no server), when I type:

efetch -db pubmed -id  24670762 -format xml

I can see XML output.

When I type:

efetch -db pubmed -id  24670762 -format xml | xtract

It says (as expected) "No command-line arguments supplied to xtract"

But then when I type:

efetch -db pubmed -id  24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation

I get nothing at all.

Do you think it's strange that when I change the above command to:

efetch -db pubmed -id  24670762 -format xml | xtract -pattern BLAH -element MedlineCitation

I similarly get no output. So I think this means that it's not picking up the pattern at all, and then none of the elements matter?

I can see from the Extraction Arguments section that -pattern tells the command which part of the XML file to look at, and I can see that PubmedArticle is one of the sections in the full XML output, so I don't know why the pattern can't pick it up.

Reply written 16 months ago by Tom

You are missing /PMID from MedlineCitation/PMID. Also, there are parsing issues on Biostars that alter some of the commands, so I'm formatting them differently now.

Try these; they work on my computer.

This returns the PMID:

efetch -db pubmed -id 24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID

This returns the PMID and the authors' names:

efetch -db pubmed -id 24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -tab "" -element "&COM" Initials,LastName -COM "(, )"
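
As a rough cross-check that does not depend on xtract, the same walk over the XML can be sketched in Python. The fragment below is hand-written to mimic the PubMed layout (PubmedArticle, MedlineCitation/PMID, Author with LastName and Initials); the PMID and author names are the ones quoted in this thread for GDS4925, and the function name is hypothetical. Note that it prints names as "LastName Initials", the order used in the desired citation, rather than the Initials,LastName order of the xtract command.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment that mimics the PubMed XML layout (an assumption
# for illustration; a real efetch result has many more elements).
PUBMED_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>18455804</PMID>
      <Article>
        <AuthorList>
          <Author><LastName>D'Souza</LastName><Initials>M</Initials></Author>
          <Author><LastName>Zhu</LastName><Initials>X</Initials></Author>
          <Author><LastName>Frisina</LastName><Initials>RD</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def pmid_and_authors(xml_text):
    """Return a (PMID, [author names]) tuple for every PubmedArticle."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):           # roughly -pattern PubmedArticle
        pmid = article.findtext("MedlineCitation/PMID")  # roughly -element MedlineCitation/PMID
        authors = [
            f"{author.findtext('LastName')} {author.findtext('Initials')}"
            for author in article.iter("Author")         # roughly -block Author
        ]
        rows.append((pmid, authors))
    return rows


print(pmid_and_authors(PUBMED_XML))
```

If the input contains no PubmedArticle element at all, the loop body never runs and the result is an empty list, which mirrors the "no error, no output" behaviour described above.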

Reply written 16 months ago by Santosh Anand

I really appreciate it. I think there's something wrong with either my computer (just a standard Mac terminal) or my connection to Entrez. As you can see from the attached image here, even when I type the first command, I just don't get the PMID back. Thank you for your help though, I appreciate it.

Reply written 16 months ago by Tom

That's very strange! Let's do some checks: are you able to run these commands individually?

1. efetch -db pubmed -id 24670762 -format xml >tmp.xml
2. cat tmp.xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID

Does this produce (1) the file tmp.xml and (2) the output 24670762? If not, your xtract command is not working, and you may need to reinstall it from e-utils.

Reply written 16 months ago by Santosh Anand

Thanks so much. I officially give up; I think I'd rather do >100 references manually at this stage. Command 1 works (as in, it gives me the tmp.xml file); command 2 doesn't work (no output).

So I re-installed e-utils:

localhost:~ Aoife$ cd ~
localhost:~ Aoife$ perl -MNet::FTP -e \ '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.zip");'
localhost:~ Aoife$ unzip -u -q edirect.zip
replace edirect/setup.sh? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
localhost:~ Aoife$ rm edirect.zip
localhost:~ Aoife$ export PATH=$PATH:$HOME/edirect
localhost:~ Aoife$ ./edirect/setup.sh

Trying to establish local installations of any missing Perl modules
(as logged in /Users/Aoife/edirect/setup-deps.log).
Please be patient, as this step may take a little while.
Entrez Direct has been successfully downloaded and installed.

This says it was successfully installed. I then typed the two commands again and, again, the first one works and the second one doesn't. I don't want to waste any more of your time; I really appreciate the help you gave me.

Thank you.

Reply written 16 months ago by Tom

I am sorry; it's not your fault. e-utils is notorious for not being documented properly and for not giving proper error messages :-( But I have had a deeper look and I think I understand what the problem is. Can you do the following (I assume you are on a Mac)?

  1. Download xtract.Darwin from here: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/xtract.Darwin
  2. Give it execute permission: $ chmod +x xtract.Darwin
  3. Run it on its own on the command line: $ xtract.Darwin => This should give an error something like "ERROR: No command-line arguments supplied to xtract"

  4. If steps 1-3 pass as described, then try this command line (note xtract.Darwin in the command):

    cat tmp.xml | xtract.Darwin -pattern PubmedArticle -element MedlineCitation/PMID

  5. If step 4 passes, then add this xtract.Darwin to your edirect folder, and xtract should now work normally, i.e.

    cat tmp.xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID

In case of an error at any step, please paste the complete error here (with the step number).

Reply written 16 months ago by Santosh Anand

I appreciate it; Everything works great, until step 4, where I get no output at all, so there is no error I can paste (and I checked, tmp.xml is not empty).

I emailed the NCBI help desk this question:

I have a list of GEO data set accession (e.g. GDS4925) and I want a reference for each one; e.g. GDS4925 D'Souza M, Zhu X, Frisina RD. Novel approach to select genes from RMA normalized microarray data using functional hearing tests in aging mice. J Neurosci Methods 2008 Jun 30;171(2):279-87. PMID: 18455804

My xtract command doesn't work (these are the exact commands I used: https://ibb.co/fsv7i5). I asked this question on biostars: C: Extract Bibliographic Reference from GEO dataset using EFetch?. I re-installed e-utils, checked on both server and mac terminal, and opened a new terminal, and the command still doesn't work. Can you pinpoint what the issue is?

and they replied:

For your specific example, the information is in the GPL and GDS entries retrieved through esummary. If you install edirect, the following can be used to retrieve the publication:

$ esearch -db gds -query "GDS4925" |esummary |xtract -pattern DocumentSummary -first Accession -group PubMedIds -element int |grep "GDS"

GDS4925 18455804 24587312

My problem is that I do get the output they describe, "GDS4925 18455804 24587312", but I still need to put it into this format:

GDS4925 D'Souza M, Zhu X, Frisina RD. Novel approach to select genes from RMA normalized microarray data using functional hearing tests in aging mice. J Neurosci Methods 2008 Jun 30;171(2):279-87. PMID: 18455804
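
As a small intermediate step, output lines of the form shown above (an accession followed by one or more PubMed IDs) can be read into a lookup table before any citation formatting is attempted. This is only a sketch; the function name is hypothetical and it assumes the fields are separated by whitespace (tabs or spaces).

```python
def parse_accession_lines(text):
    """Turn lines like 'GDS4925 18455804 24587312' into {accession: [pmids]}."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()  # splits on tabs or spaces
        if not fields:
            continue  # skip blank lines
        accession, *pmids = fields
        mapping[accession] = pmids
    return mapping


print(parse_accession_lines("GDS4925 18455804 24587312\n"))
```

Each PubMed ID in the resulting lists can then be looked up separately to build the reference text.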

Reply written 16 months ago by Tom

This is one of the weirdest things I've seen in a while! For step 4, did you use the xtract.Darwin that was downloaded in step 1?

If 1-3 passes as described, then try this commandline (note xtract.Darwin in cmdline)

cat tmp.xml | xtract.Darwin -pattern PubmedArticle -element MedlineCitation/PMID

If yes, could you upload your tmp.xml and xtract.Darwin somewhere so I can check them?

Reply written 16 months ago by Santosh Anand

Yes no problem, so this is my xtract.Darwin:

http://www.filehosting.org/file/details/664811/xtract.Darwin

and this is my tmp.xml (for GDS4925, the dataset in above example):

http://www.filehosting.org/file/details/664812/tmp.xml

Thanks for your help. I've also emailed back the help desk to ask them this question again/explain that I still can't get the xtract command to work.

Reply written 16 months ago by Tom

Ok, this is what I was guessing and was afraid of: your tmp.xml file is truncated (meaning there is some issue with the download of this file from the NCBI server). See my tmp.xml for reference: https://ufile.io/nfwj8

First check whether this happens consistently with other PMID queries and not just for this PMID. If it does, you need to look into the connectivity issue more closely. Are you behind a firewall?

PS You may also use some options to debug if this is a connectivity issue:

https://www.ncbi.nlm.nih.gov/books/NBK179288/

For debugging, -silent will suppress link failure retry messages, -verbose will display the <entrez_direct> field values at each step, -debug will print the internal URL query and XML results of each step, and -base will specify a particular server for quality assurance testing.
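
One quick way to test for a truncated download, independent of xtract, is to check whether the saved file parses as well-formed XML at all. The sketch below uses Python's standard library; the function name is hypothetical and the sample strings are made up for illustration. It only detects files cut off mid-document, not other kinds of damage.

```python
import xml.etree.ElementTree as ET


def is_complete_xml(xml_data):
    """Return True if the data parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_data)
    except ET.ParseError:
        return False
    return True


complete = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    "<PMID>24670762</PMID>"
    "</MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
truncated = complete[:60]  # simulate a download that stopped part-way

print(is_complete_xml(complete))
print(is_complete_xml(truncated))
```

To apply it to a saved file, read the file in binary mode and pass the bytes in, for example `is_complete_xml(open("tmp.xml", "rb").read())`.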

Reply written 16 months ago by Santosh Anand

Thank you; I emailed back the help desk to ask them why the file was being truncated (this happened whether I was at home or in work); they replied:

Unfortunately, web displayed citation format is NOT a format supported by API.

There is no good way to regenerate that format through fetched content. JSON format returned by esummary given a pubmed id, contains all the pieces. One need to piece them back together in a human readable form. The following is a thread on this topic:

http://www.alexhadik.com/blog/2014/6/12/create-pubmed-citations-automatically-using-pubmed-api

So unfortunately, I think I may need to just get every reference manually. Thank you so so much for your time.
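
For what it's worth, the help desk's suggestion of piecing the citation back together from esummary JSON could look roughly like the sketch below. It is an illustration under assumptions, not a supported format: the field names (authors/name, title, source, pubdate, volume, issue, pages, uid) are what the PubMed esummary JSON is assumed to return, the rule of listing at most four authors before "et al" is inferred from the examples in this thread, and both function names are hypothetical. The sample record is built only from the GDS4925 reference quoted above.

```python
import json
import urllib.request

ESUMMARY_URL = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    "?db=pubmed&retmode=json&id={pmid}"
)


def fetch_summary(pmid):
    """Fetch one PubMed summary record as a dict (needs network access).

    Assumption: the record sits under result -> <pmid> in the JSON reply.
    """
    with urllib.request.urlopen(ESUMMARY_URL.format(pmid=pmid)) as response:
        return json.load(response)["result"][str(pmid)]


def format_citation(accession, record, max_authors=4):
    """Assemble a one-line reference in the style shown in this thread."""
    names = [author["name"] for author in record.get("authors", [])]
    if len(names) > max_authors:
        author_text = ", ".join(names[:max_authors]) + " et al"
    else:
        author_text = ", ".join(names)
    title = record["title"].rstrip(".")  # avoid a doubled full stop
    issue = f"({record['issue']})" if record.get("issue") else ""
    return (
        f"{accession} {author_text}. {title}. {record['source']} "
        f"{record['pubdate']};{record['volume']}{issue}:{record['pages']}. "
        f"PMID: {record['uid']}"
    )


# Sample record typed in by hand from the reference quoted in the thread.
SAMPLE_RECORD = {
    "uid": "18455804",
    "authors": [{"name": "D'Souza M"}, {"name": "Zhu X"}, {"name": "Frisina RD"}],
    "title": "Novel approach to select genes from RMA normalized microarray "
             "data using functional hearing tests in aging mice.",
    "source": "J Neurosci Methods",
    "pubdate": "2008 Jun 30",
    "volume": "171",
    "issue": "2",
    "pages": "279-87",
}

print(format_citation("GDS4925", SAMPLE_RECORD))
```

With network access, `format_citation("GDS4925", fetch_summary(18455804))` would be the live equivalent, provided the returned fields match the assumptions above.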

Reply written 16 months ago by Tom

This whole episode has become a detective movie for me! The main question I am trying to answer is: if I am able to do it at my end, why aren't you? If the format is not supported by the API for you, it must be the same for me. Also, XML is a standard format for data exchange; I don't know why you would have to go for JSON. Moreover, it is fetching the file partially. If it were a problem with the format or the interface itself, I would have expected no output altogether! If only I could get answers to those questions!

Reply written 16 months ago by Santosh Anand

I know, it seems like this would be relatively straightforward, it's very frustrating. Unfortunately, I am not an expert in this particular area, so I don't feel like I know enough (yet) to try and find a workaround or understand exactly what's happening.

Reply written 16 months ago by Tom

In the link, go directly to the section "Extraction Arguments"

Reply written 16 months ago by Santosh Anand