I've recently been looking into the text corpora of the BioCreative I Challenge for a project using text mining methods for gene name normalization, and found that (unfortunately) the training and testing data consist of abstracts that are labelled with internal BioCreative IDs (e.g. fly00001training.txt, fly00001testing.txt) instead of standard PMIDs.
This problem raises the more general question of how to find the best matching PMID in PubMed given a blob of text like the following:
Dorsal-ventral patterning within the ectodermal and mesodermal germ layers of Drosophila and Xenopus embryos is specified by a system of genes that has been conserved over 500 million years of evolution. In both organisms, the activity of the TGF-beta family member DPP/BMP4 is antagonized by SOG/CHORDIN. A second Xenopus gene, noggin, has a similar biological activity to chordin. Analysis of the action of these genes indicate that Spemann's organizer promotes dorsal cell fates in Xenopus by antagonizing a ventralizing signal encoded by the Bmp4 gene.
Entering this into the PubMed search interface pulls up only one PMID (8791529), which is an exact match to the text and clearly the correct answer. However, I've had no luck doing this with standard eutils queries or the JANE API, because each of them chokes on common "stop" words in a different way.
A solution that uses a remote web service would be preferred, since there are only a few hundred BioCreative I abstracts to map to PMIDs.
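To make the question concrete, here is a minimal sketch of one way the stop-word problem might be sidestepped: strip stop words and short tokens from the abstract before sending the remaining words to the eutils `esearch` endpoint as an AND query restricted to title/abstract. The stop-word list, the helper names (`build_term`, `esearch_url`, `find_pmids`) and the cut-offs (`max_words`, `min_len`) are assumptions chosen for illustration, not PubMed's own stop-word handling, and I have not verified that this recovers the right PMID for every abstract.

```python
import json
import re
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Small illustrative stop-word list (an assumption, not PubMed's official list).
STOP_WORDS = {
    "about", "after", "among", "being", "between", "could", "during",
    "found", "these", "those", "their", "there", "through", "using",
    "which", "while", "within", "would",
}


def build_term(text, max_words=12, min_len=5):
    """Keep the first few distinctive words and join them as a [tiab] AND query."""
    chosen = []
    for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
        if len(word) >= min_len and word not in STOP_WORDS and word not in chosen:
            chosen.append(word)
        if len(chosen) == max_words:
            break
    return " AND ".join(f"{w}[tiab]" for w in chosen)


def esearch_url(term, retmax=5):
    """Build an esearch URL that asks PubMed for matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def find_pmids(text):
    """Query PubMed (network call) and return the list of candidate PMIDs."""
    with urllib.request.urlopen(esearch_url(build_term(text))) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]


if __name__ == "__main__":
    abstract = (
        "In both organisms, the activity of the TGF-beta family member "
        "DPP/BMP4 is antagonized by SOG/CHORDIN."
    )
    print(build_term(abstract))
    # print(find_pmids(abstract))  # uncomment to query PubMed
```

For a few hundred abstracts, a short pause between calls (for example `time.sleep(0.5)`) should keep this within NCBI's limit of three requests per second without an API key. Abstracts that return zero or several PMIDs would still need a manual check.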
Many thanks, Casey