I would like to use a tool that will take any valid PubMed query, extract the abstracts, and create a Wordle diagram. Ideally it would be a one-click operation that automatically posts to the Wordle site, but simply outputting text that can be copied and pasted would be okay too.
Does such a tool exist? If not, what would be your strategy for implementing it? If such a tool does not exist, I will offer a bounty of 150 points for anyone who implements it (awarded to the best working solution if multiple are offered). Ideally the solution would be hosted on Google App Engine, but code in a public code repository would be acceptable.
EDIT: Sorry, got impatient and implemented it myself at http://pubmed2wordle.appspot.com/. Winning answer goes to Lars for the outline and especially the Wordle advanced link. Also, I put the code on Google Code, in case anyone else wants to work on any of the potential enhancements.
If you can live with the Wordle being based on only the first N abstracts (for example, 200) returned from the PubMed query, it should not be so difficult to do. What I would do is the following:
Use the NCBI eutils ESearch method to retrieve a list of PMIDs that match your query.
Use the NCBI eutils EFetch method to retrieve the abstracts for this set of PMIDs.
Concatenate the abstracts and use an HTTP POST request to submit it to Wordle using its advanced interface.
This solution could be implemented on Google App Engine without too much trouble.
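A minimal sketch of the first two steps, assuming Python 3 and the public E-utilities endpoints. The function names and the `retmax=200` default are my own choices for illustration, not the code behind pubmed2wordle. The Wordle submission is left out because the thread does not give the advanced form's URL or field names.

```python
import json
import urllib.parse
import urllib.request

# Base URL of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query, retmax=200):
    """Build an ESearch URL that returns up to `retmax` PMIDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return EUTILS + "/esearch.fcgi?" + urllib.parse.urlencode(params)


def efetch_url(pmids):
    """Build an EFetch URL that returns plain-text abstracts for the PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def fetch_abstract_text(query, retmax=200):
    """Run ESearch, then EFetch, and return the concatenated abstract text."""
    with urllib.request.urlopen(esearch_url(query, retmax)) as resp:
        pmids = json.load(resp)["esearchresult"]["idlist"]
    if not pmids:
        return ""
    with urllib.request.urlopen(efetch_url(pmids)) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_abstract_text("your query here")` should give one block of text that can be pasted into Wordle by hand, which covers the copy-and-paste fallback mentioned in the question.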
If you want to make a Wordle cloud that is based on all PubMed abstracts that match your query, it is much harder since you cannot count on being able to retrieve the abstracts via NCBI eutils, and since the total amount of text could be too much to submit to Wordle. What I would do in that case would be:
Again use NCBI eutils ESearch to retrieve the (possibly long) list of PMIDs.
Retrieve the abstracts from a local, indexed copy of Medline.
Calculate all the word frequencies myself in a hash table.
Filter down the set of counts to include only the N most frequent words.
Submit the resulting counts to Wordle using its advanced interface.
This solution would obviously be far more work to implement and would require that you maintain a local mirror of Medline. Due to the amount of data involved, I don't see this solution running on Google App Engine.
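The counting and filtering steps could look roughly like this. It is only a sketch: the stopword set is a tiny placeholder, the tokenizing regex is one arbitrary choice, and the `word:count` line format is my assumption about what the Wordle advanced interface accepts for weighted words.

```python
import re
from collections import Counter

# Placeholder stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "with", "for", "was", "were"}


def top_words(abstracts, n=150):
    """Count words across all abstracts and keep the n most frequent."""
    counts = Counter()  # the hash table of word -> frequency
    for text in abstracts:
        # Lowercase tokens of two or more characters, starting with a letter.
        for word in re.findall(r"[a-z][a-z0-9-]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


def as_weighted_lines(pairs):
    """Format (word, count) pairs one per line as word:count (assumed format)."""
    return "\n".join(f"{word}:{count}" for word, count in pairs)
```

Because only the top N counts are submitted, the amount of text sent to Wordle stays small no matter how many abstracts matched the query.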
It also took me a little while to spot the advanced link. Very nice implementation - you might be able to work around the GAE timeout issues by downloading abstracts in a few chunks. GAE has a timeout of 10 seconds on HTTP requests, so chopping the data transfer into several smaller requests should work.
I'm quite confident. You are right that GAE also times out on the parent request, but that limit is 30 seconds, whereas HTTP requests made from within GAE time out after just 10 seconds. So you should be able to handle more abstracts by cutting the download into chunks, but only by a factor of 2-3.
Lars, you were indeed right. http://pubmed2wordle.appspot.com/ looks like it handles up to 500 pubs quite reliably now. Probably it could be increased a bit more by tuning the number vs size of requests, but 500 seems pretty good to me. Anyway, upvotes to you all around!
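For anyone curious, the chunking idea discussed here can be sketched as follows. The chunk size of 100 and the 25-second budget are guesses chosen to stay under the 10-second and 30-second limits mentioned above, and `fetch` is a hypothetical callable that downloads the abstracts for one batch of PMIDs. This is not the actual pubmed2wordle code.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(pmids, fetch, chunk_size=100, budget_seconds=25.0):
    """Fetch abstracts batch by batch, stopping before the overall deadline.

    Each call to `fetch` is one outbound HTTP request, so each one only has
    to finish within the per-request timeout; the budget guards the total.
    """
    deadline = time.monotonic() + budget_seconds
    parts = []
    for batch in chunked(pmids, chunk_size):
        if time.monotonic() >= deadline:
            break  # out of time: return what we have so far
        parts.append(fetch(batch))
    return "\n".join(parts)
```

Tuning `chunk_size` trades the number of requests against the size of each one, which is the knob mentioned above for pushing past 500 publications.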
The advanced link at Wordle.net -- hidden in plain sight! And to think I was just digging around in the JS source trying to figure out how I might hack it...
Okay, every once in a while I have to convince myself that I can program. So check it out: http://pubmed2wordle.appspot.com. (It doesn't work for huge queries, due to GAE timeout issues, I think. Executing on my localhost works fine...)
Hmm, how confident are you of that? I assumed GAE times out on the parent request (from you to GAE), so splitting up the child requests (from GAE to eutils) wouldn't help. But my assumption could be wrong...
As an alternative to the advanced form, you can download a command-line version of the Wordle engine from IBM, and then generate the tag-clouds on your own server.
You can also supply a file of additional stopwords that you don't want to appear in the pictures.
Well, this isn't exactly what you are asking for, but perhaps a step in the right direction?
XplorMed takes a PubMed query and gives you the main associations between the words in groups of abstracts. So you'd get the words most used and associated across the abstracts in the query. Perhaps it's only a step from there to creating a Wordle?
Edited to add: though, now that I look at it, I'm not sure that will work right for your request, or maybe it would take a couple of steps? :D
Incidentally, I tried to make one of the example queries as a hat tip to Lars, but sadly (or not) he's got too many publications...
Thank you - I'll go play with your tool myself now :)