How To Find Best Matching Pmid Given A Blob Of Text
10.3 years ago

I've recently been looking into the text corpora of the BioCreative I Challenge for a project using gene name normalization text mining methods, and found that (unfortunately) the training and testing data consist of abstracts that are labelled with internal BioCreative IDs (e.g. fly00001training.txt, fly00001testing.txt) instead of standard PMIDs.

This problem raises the more general question of how to find the best matching PMID in PubMed given a blob of text like the following:

Dorsal-ventral patterning within the ectodermal and mesodermal germ layers of Drosophila and Xenopus embryos is specified by a system of genes that has been conserved over 500 million years of evolution. In both organisms, the activity of the TGF-beta family member DPP/BMP4 is antagonized by SOG/CHORDIN. A second Xenopus gene, noggin, has a similar biological activity to chordin. Analysis of the action of these genes indicate that Spemann's organizer promotes dorsal cell fates in Xenopus by antagonizing a ventralizing signal encoded by the Bmp4 gene.

Entering this into the PubMed search interface pulls up only one PMID (8791529), which is an exact match to the text, and clearly the correct answer. But I've had no luck with using standard eutils queries or the JANE API to do this because they are choking on common "stop" words in different ways.

A solution that uses a remote web service would be preferred, since there are only a few hundred BC I abstracts to map to PMIDs.

Many thanks, Casey


For those interested in this particular problem, @Nathan Harmston has kindly provided a look-up table between BC I IDs and PMIDs here.

10.3 years ago

But I've had no luck with using standard eutils queries

Casey, are you sure about NCBI eUtils? I also got only one result with eSearch (PMID: 8791529):

curl -L http://goo.gl/4npgC

<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>1</Count>
<RetMax>1</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>8791529</Id>
</IdList>
(...)


OK, this looks very promising. The difference between your query and mine is that yours wraps the blob in double quotes. I've tried a couple of other, longer abstracts and I'm getting a "Bad Gateway!" error, which goes away when I truncate the abstract text. The truncated text still gets the correct PMID, so it may be a matter of finding the upper limit on the length of the string that can be passed to eutils and wrapping it in double quotes. Many thanks!
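A rough sketch of the truncate-and-quote idea in Python. The helper name build_quoted_term and the default of 300 characters are made up for illustration; the real upper limit that eutils accepts is not established in this thread, so treat max_chars as something to tune.

```python
import re


def build_quoted_term(text, max_chars=300):
    """Collapse whitespace, cut at a word boundary, and wrap in double quotes.

    max_chars is a guess, not a documented eutils limit.
    """
    # Embedded double quotes would end the phrase early, so turn them into spaces.
    cleaned = re.sub(r"\s+", " ", text.replace('"', " ")).strip()
    if len(cleaned) > max_chars:
        # Cut at the last space at or before the limit so no word is split.
        cut = cleaned.rfind(" ", 0, max_chars + 1)
        cleaned = cleaned[:cut] if cut > 0 else cleaned[:max_chars]
    return '"' + cleaned + '"'


abstract = (
    "Dorsal-ventral patterning within the ectodermal and mesodermal germ "
    "layers of Drosophila and Xenopus embryos is specified by a system of genes"
)
print(build_quoted_term(abstract, max_chars=60))
```

The returned string can then be passed as the search term. If a truncated phrase returns more than one hit, raising max_chars (while staying under whatever length triggers the error) should narrow it down.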


Depending on which type of query you are using, this could be a URL encoding issue instead of a character limit as well.
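One way to rule out an encoding problem is to let a library build the query string instead of pasting the text into the URL by hand. A minimal sketch, assuming the standard ESearch endpoint with its db, term, and retmax parameters; esearch_url is a hypothetical helper name, and the code only builds the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(phrase, retmax=10):
    """Build an ESearch URL with the phrase wrapped in double quotes and percent-encoded."""
    params = {"db": "pubmed", "retmax": retmax, "term": '"' + phrase + '"'}
    # urlencode escapes spaces, quotes, slashes, and other reserved characters.
    return ESEARCH_URL + "?" + urlencode(params)


url = esearch_url("DPP/BMP4 is antagonized by SOG/CHORDIN")
print(url)
# Decoding the query string again should give back the original quoted phrase.
print(parse_qs(urlsplit(url).query)["term"][0])
```

If properly encoded long queries still fail, sending the same parameters in an HTTP POST body rather than in the URL may be worth trying.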

10.2 years ago
Yogesh Pandit ▴ 500

Biopython supports this:

from Bio import Entrez

# NCBI asks for a contact address with each request; replace this placeholder.
Entrez.email = "you@example.com"

handle = Entrez.esearch(db="pubmed", retmax=10, term="Dorsal-ventral patterning within the ectodermal and mesodermal germ layers of Drosophila and Xenopus embryos is specified by a system of genes that has been conserved over 500 million years of evolution. In both organisms, the activity of the TGF-beta family member DPP/BMP4 is antagonized by SOG/CHORDIN. A second Xenopus gene, noggin, has a similar biological activity to chordin. Analysis of the action of these genes indicate that Spemann's organizer promotes dorsal cell fates in Xenopus by antagonizing a ventralizing signal encoded by the Bmp4 gene.")
# Parse the XML response into a dictionary-like record.
record = Entrez.read(handle)
print(record["Count"])
print(record["IdList"])


The output you get is:

1
['8791529']
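Since the original question involves a few hundred abstracts, the same call could be wrapped in a loop. This is only a sketch under some assumptions: the function names, the placeholder e-mail address, and the rule of accepting a PMID only when exactly one hit comes back are illustrative choices, and each abstract is wrapped in double quotes as suggested in the comments above. The search function is passed in as a parameter so the loop can be tried offline with a stand-in.

```python
import time


def esearch_pubmed(term):
    """Return the PMIDs that ESearch reports for term (needs network access and Biopython)."""
    from Bio import Entrez  # imported lazily so the loop below also works without Biopython

    Entrez.email = "you@example.com"  # placeholder address
    handle = Entrez.esearch(db="pubmed", retmax=10, term=term)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])


def map_abstracts(abstracts, search=esearch_pubmed, delay=0.4):
    """Map each abstract ID to a PMID when the search gives exactly one hit, else None."""
    mapping = {}
    for doc_id, text in abstracts.items():
        hits = search('"' + text + '"')
        mapping[doc_id] = hits[0] if len(hits) == 1 else None
        time.sleep(delay)  # pause between requests to stay under NCBI's rate limit
    return mapping
```

Entries that come back as None (no hit, or several hits) would need a second pass, for example with a shorter or longer excerpt of the abstract.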