What tools are useful for text mining of PDF-based literature? For example, suppose I had a list of several genes and several phenotypes, and wanted to look for associations between those genes and phenotypes in literature for which a PDF is available but the full text in HTML is not. Are there tools that can do this type of search efficiently?
I have installed Xapers, though I have never used it; it can index PDF files and other sources. I don't know whether you are looking for something fancy based on machine learning, or whether simple indexing and searching is good enough for your purposes.
There is also pdfgrep, which could be handy for quickly searching a few PDFs.
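If plain searching turns out to be enough, the gene/phenotype question could also be handled with a few lines of Python once the text is out of the PDF. The following is only a rough sketch, not a tested pipeline: it assumes the `pdftotext` command from poppler-utils is installed, and the function names (`pdf_to_text`, `cooccurrences`) are made up for illustration. It counts the sentences in which a gene name and a phenotype term appear together, which is a crude proxy for an association and will miss anything spread over several sentences.

```python
import re
import subprocess
from collections import Counter
from itertools import product


def pdf_to_text(path):
    """Extract text with the pdftotext CLI (assumed installed); '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def cooccurrences(text, genes, phenotypes):
    """Count sentences that mention both a gene and a phenotype (case-insensitive)."""
    counts = Counter()
    # PDF extraction breaks lines arbitrarily, so collapse all whitespace first.
    text = re.sub(r"\s+", " ", text)
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        for gene, phenotype in product(genes, phenotypes):
            # Word boundaries keep a short gene symbol from matching inside a longer word.
            gene_hit = re.search(rf"\b{re.escape(gene.lower())}\b", lowered)
            if gene_hit and phenotype.lower() in lowered:
                counts[(gene, phenotype)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "BRCA1 mutations are linked to breast cancer. TP53 was\n"
        "also studied. Loss of TP53 is associated with breast cancer and BRCA1."
    )
    for pair, n in cooccurrences(sample, ["BRCA1", "TP53"], ["breast cancer"]).items():
        print(pair, n)
```

For a folder of papers you would call `pdf_to_text` on each file and add up the counters. Scanned PDFs without a text layer would need OCR first, which this sketch does not attempt.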