I have several paper abstracts but not their PubMed IDs. Is there a way to find the PubMed ID for a given abstract? Even a list of probable IDs might work for us.
Any thoughts, code, or ideas would be much appreciated.
You could use part of the abstract to search Europe PMC. For example, just paste in a sentence: "Simplicity has made C. elegans pharyngeal development a particularly well-studied subject."
If you have them, you could also use the first author, journal name, publication year, and volume. A help page on Europe PMC lists the available search fields, with examples: http://europepmc.org/Help. These search fields can also be used programmatically; details are available via the Europe PMC Web Services (take the 'Resources' menu option from the home page).
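As a rough illustration of the programmatic route, here is a minimal Python sketch that sends one sentence of an abstract to the Europe PMC REST search service as an exact phrase. The helper names (`build_search_url`, `extract_pmids`, `search_pmids`) are made up for this example, and the response layout (`resultList` / `result` / `pmid`) is what I would expect from the JSON output; please check it against the Web Services documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Europe PMC REST search endpoint.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(sentence, page_size=5):
    """Build a search URL that looks for `sentence` as an exact phrase."""
    # Drop embedded double quotes so they cannot break the phrase query.
    phrase = '"%s"' % sentence.replace('"', " ").strip()
    params = {"query": phrase, "format": "json", "pageSize": page_size}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def extract_pmids(response):
    """Collect PubMed IDs from a decoded JSON response (assumed layout)."""
    results = response.get("resultList", {}).get("result", [])
    # Some hits (e.g. preprints) may carry no PubMed ID, so skip those.
    return [hit["pmid"] for hit in results if "pmid" in hit]


def search_pmids(sentence):
    """Query Europe PMC over the network and return candidate PubMed IDs."""
    with urllib.request.urlopen(build_search_url(sentence)) as reply:
        return extract_pmids(json.load(reply))
```

Calling `search_pmids(...)` with a sentence from the abstract should return a short list of candidate IDs. A distinctive sentence from the middle of the abstract tends to narrow things down better than a generic opening line.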
The script below should get you started with a Python solution. Note that PubMed generally does a very good job of interpreting free-text queries, but some of your abstracts might still need some pre- or post-processing. Depending on the accuracy you are targeting, you might need to compare your abstract with the abstracts of the IDs returned by your query and decide whether the two are "close enough". The Python difflib module might help you with that. Note also that you are only allowed a limited number of queries per second. If you have a large study to perform (say 1 million records), try to get a mirror of the MEDLINE database and perform the matches locally. Let us know if you need more help with the pre/post-processing step.
Before using the script, please set the Entrez.email variable to your own address. The PubMed administrators might need to contact you if you go over fair usage of the database (instead of blocking your IP!).
from Bio import Entrez  # Biopython; the Medline parser is not needed here
an_abstract = """
Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams.
Our primary contribution is a local sliding-window SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemyascomycete phylum.
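The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. Resulting orthologous clusters are comparable to those obtained by manual curation.
"""
Entrez.email = "firstname.lastname@example.org"  # replace with your own address
# Assumption: the original definition of `query` is missing from the post;
# here the whole abstract, with whitespace collapsed, is sent as free text.
query = " ".join(an_abstract.split())
handle = Entrez.esearch(db="pubmed", term=query)
search_results = Entrez.read(handle)
handle.close()
# IdList is a list of PubMed IDs, so print one URL per match.
for pmid in search_results["IdList"]:
    print("http://www.ncbi.nlm.nih.gov/pubmed/%s" % pmid)  # e.g. http://www.ncbi.nlm.nih.gov/pubmed/22151970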
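For the "close enough" check mentioned above, here is a small sketch using difflib from the standard library. It assumes you have already fetched the abstracts for the candidate IDs into a dict mapping PubMed ID to abstract text; that dict, the helper names and the 0.9 threshold are illustrative choices rather than anything prescribed by PubMed or Biopython.

```python
import difflib


def normalise(text):
    """Lower-case and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalised texts are identical."""
    # autojunk=False stops difflib from ignoring frequent characters in long
    # texts, at the cost of slower comparisons.
    matcher = difflib.SequenceMatcher(None, normalise(a), normalise(b), autojunk=False)
    return matcher.ratio()


def best_match(my_abstract, candidates, threshold=0.9):
    """Return (pmid, score) for the closest candidate, or None if below threshold."""
    scored = [(similarity(my_abstract, text), pmid) for pmid, text in candidates.items()]
    if not scored:
        return None
    score, pmid = max(scored)
    return (pmid, score) if score >= threshold else None
```

You would call `best_match(an_abstract, candidates)` and treat a `None` result as "needs a manual look". The right threshold depends on how much your copies of the abstracts differ from the PubMed versions (truncation, section labels, encoding), so it is worth tuning on a handful of known pairs first.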