Forum: Large Tables In Supplementary Pdf In Journal Articles
dario.garvan (Australia) wrote 21 months ago:

What are your opinions on this? When a table is embedded in a PDF, it can't be pasted correctly into a spreadsheet program, especially if it spans multiple pages. This makes computational operations on the table impossible. It would be easy for journals to have rules requiring tables to be provided as CSV or XLS.


moving to "Forum"

Reply written 21 months ago by Pierre Lindenbaum
Michael Dondrup (Bergen, Norway) wrote 21 months ago:

I agree that this is annoying and obfuscating; whether it is intentional or not, I do not know. I do not fully understand how someone with even the most minimal computing skills could have such an amazingly stupid idea as to store computational data in PDF. PDF exists so that documents look identical on any screen and printer, not to store data. Was it the journals allowing only PDF, or the authors? And that is while assuming scientists are at least half-way intelligent people (or possibly not).

Anyway it is possible to break the obfuscation, sometimes at least. Recently I wanted to extract data from this supplementary: http://www.nature.com/nbt/journal/v23/n8/extref/nbt1118-S4.pdf

To get the text I used pdf2txt.py, after experimenting a bit with the options: pdf2txt.py -o data/nbt.txt -W 10000 -M 1000 -L 1000 nbt1118-S4.pdf. The output still looks weird: the column headers are given as one letter per row, the table is fragmented, it contains special characters, etc.

Using the following Perl script, I was able to retrieve a clean tab-separated table. I wasn't able to retrieve the correct metadata, though, which is given by an "X" in certain columns. I think it is possible by counting the whitespace, but I didn't have time to test it.

#!/usr/bin/env perl
use strict;
use warnings;

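while (<>) {
  # Example of an input line:
  # 16846 AAGUAUAAAAGUUUAGUGUtc X X                     0.761
  chomp;
  # Capture the numeric ID, the sequence and the trailing score;
  # the "X" marker columns in between are matched but not kept.
  my @ar = m/\s*(\d+)\s+(\w+)[X ]+\s*(\d\.\d+)/;
  if (@ar) {
    print join("\t", @ar), "\n";
  }
}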
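
For readers who prefer Python, the same line filter can be sketched as below. This is only an illustration of the regular expression above, not a tested extraction method: the function name `parse_row` is made up here, and the extra field merely counts how many "X" marks appear on a row. It does not recover which columns they belong to, which would still require the whitespace counting mentioned above.

```python
import re

# Same pattern as the Perl script, with the marker region captured lazily
# so that it can be inspected instead of being thrown away.
ROW = re.compile(r"\s*(\d+)\s+(\w+)([X ]+?)\s*(\d\.\d+)")


def parse_row(line):
    """Return (id, sequence, number_of_X_marks, score) or None if no match."""
    match = ROW.search(line)
    if match is None:
        return None
    row_id, sequence, markers, score = match.groups()
    # Only the count of "X" marks is kept; their column positions are lost.
    return row_id, sequence, markers.count("X"), score


if __name__ == "__main__":
    import sys

    for raw in sys.stdin:
        parsed = parse_row(raw.rstrip("\n"))
        if parsed is not None:
            print("\t".join(str(field) for field in parsed))
```

Lines that do not look like table rows (page headers, footnotes) simply yield `None` and are skipped, exactly as in the Perl version.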

Please note that this is not a suggestion for a solution. Quite the opposite: it is totally messy and is meant to show the absurdity of such attempts. Furthermore, none of the approaches mentioned guarantees that the extracted data is completely correct. What I mean to say is that you can probably drive a nail into the wall using forceps instead of a hammer, but the fact that you can doesn't mean you should.

Reply written 21 months ago by Michael Dondrup
sarahhunter (Cambridge, UK) wrote 21 months ago:

You might find some of the work done by Manchester University on extracting information from PDFs interesting, particularly Utopia documents: see "Calling International Rescue: knowledge lost in literature and data landslide" by Attwood et al.

Ben (Edinburgh, UK) wrote 21 months ago:

There are some online tools that can do this pretty well, e.g. this seems to correctly convert everything from the PDF example in Michael's post, and Zamzar seems to get everything except the column headers. They split the PDF pages across sheets in Excel, but I guess you could then write each sheet to CSV and concatenate. It's not an ideal solution, but it's good enough for the odd occasion when you have to do this sort of thing.

I expect authors do this because of author guidelines or submission systems that ask for supplementary data as PDF. They are not using the data themselves in PDF form; if they were, they would inevitably realise its limitations.


The conversions are in fact really good, but there are downsides: Nitro cloud allows only 5 free conversions without registration, and with an account it is limited to 5 per month (or pay, or have some spare email addresses ;). As you say, the tables on different pages are each rendered in their own sheet. Exporting them manually takes 49 "save as CSV" clicks. I think a VBA script that first joins all sheets into a single one would be a bit more comfortable (and easy too, if you know VBA). Maybe like this?

Reply written 21 months ago by Michael Dondrup

I tried the macro and it worked with the Zamzar-converted XLS, after...

  • saving as "Macro enabled excel workbook" to allow for macros
  • removing additional text rows (header and footnotes) on sheets 1 and 49
  • fixing sheet 48, which had an extra empty column B from the conversion; removing it shifts all following columns left (not right)

After that, I have a single sheet with 2384 rows plus a header.
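
The same clean-up (one table per page, a stray empty column, a repeated header) could also be sketched in Python once each sheet has been exported to CSV. This is a minimal sketch under stated assumptions, not the macro used above: the function names are invented for illustration, it assumes every page starts with the same header row, and it assumes that no genuine data column is blank for an entire page. If a real column can be empty on some page (for example a marker column with no "X"), dropping empty columns would misalign the data and must not be used.

```python
import csv
import io


def read_csv_text(text):
    """Parse CSV text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


def drop_empty_columns(rows):
    """Remove columns that are blank in every row of one page."""
    if not rows:
        return rows
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    return [[row[i] for i in keep] for row in padded]


def merge_pages(pages):
    """Concatenate per-page tables, keeping the header row only once."""
    merged = []
    header = None
    for page in pages:
        # Skip rows that are completely blank, then clean up the columns.
        rows = [row for row in page if any(cell.strip() for cell in row)]
        rows = drop_empty_columns(rows)
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            rows = rows[1:]
        elif rows[0] == header:
            rows = rows[1:]
        merged.extend(rows)
    return merged
```

A possible use would be to read the exported files in page order, call `merge_pages` on the resulting list, and write the result with `csv.writer`. Free-text rows such as table titles and footnotes are not detected and would still need to be removed by hand, as in the list above.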

I think we have made a great step ahead towards reproducible science </irony>

Reply written 21 months ago by Michael Dondrup
Neilfws (Sydney, Australia) wrote 21 months ago:

There are solutions, as outlined in some of the answers, but do not expect them to work well (if at all) in all cases.

What you need to understand is that historically, the role of journals has not been to provide data in a usable form. The PDF is simply a hangover from the days of print. Journals assume that all most readers want to do is print out material to take away and read. The idea that people might want to mine published material is rather strange to many publishers and indeed some would actively seek to prevent it.

I can only suggest we all lobby journals to adopt more modern policies regarding provision of data.


I think you are correct in your assumption about historical reasons, but in addition there seems to be a lack of awareness of reproducible science. In fact, the fragility and complexity of all the methods demonstrated here make a mockery of the journals' approach to handling supplementary data, and provide the best argument for why one does not want to do it that way. We even found a need for OCR software to read sequences from image files.

Reply written 21 months ago by Michael Dondrup

I agree this appears to be a leftover from the premodern days of Charles Dickens, but I don't think it's purposefully obfuscatory (although it can have this effect when you can't...get...at...the...data.) Most manuscript submission websites seem to convert your stuff to .pdf, and there it stays. And there appears to be a disconnect between the editor/associate editor (who would understand our issues) and the guys who come in later to do the proofs (who are not scientists). This whole concept of "supplementary data" has grown silly anyway -- that's usually where the main work is that you need access to. A larger change in publishing format is clearly needed.

Reply written 21 months ago by Alex Paciorkowski
Maximilian Haeussler (UCSC) wrote 21 months ago:

PDF is a format for graphics or maybe English text. People should not use it for tables or any data that they want to be reused.

For the special case of tables in PDFs, this problem is annoying and common enough (government data) that someone wrote a special converter just for tables in PDFs: https://github.com/jazzido/tabula. I have never tried it, but it includes special code to identify rows and to remove headers and page breaks, so it really should work better.

Mary: If authors report motifs as graphics in a PDF, then the only motivations I can see are that they don't want their data to be used or that they forgot to provide it. You should email Matthieu Blanchette and ask for the raw data, which he definitely has. He is most likely aware of the problem. (If he doesn't reply: one of my colleagues works for him.)

As a general-purpose solution, I got very good results with OCR software such as OmniPage or ABBYY. It often produces good XML, or at least HTML, from PDFs that for some reason fail with pdftotext. You can also give the Java-based PDFBox a try, or Python's pyPdf or the PDFMiner mentioned above; in my hands there is not a lot of difference between these tools.

If you want to write something yourself, which I don't recommend, you need one of these PDF extraction libraries. They give you access to each individual character on every page and let you find out its font size, font type, position, etc. Cermine is supposedly a good tool for this, but I haven't tried it; see http://sciencesoft.web.cern.ch/node/120

For anyone working in text mining, PDFs are a time-consuming obstacle, but they are the de facto standard for scientific text. Tools like Papers or Google Scholar's parsers have to use various rules to find the author names, title and abstract in a PDF. They go for the biggest font on the first page (title), the non-English text underneath (authors), and maybe a single paragraph of indented or bold text after that (abstract). Another technique is to look for a DOI, which is easily recognizable with a regular expression, and then look up the metadata in CrossRef.
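
To make the DOI technique concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the code of any of the tools named above: the pattern is loosely modelled on expressions commonly cited for modern Crossref DOIs, the function name `find_dois` is invented here, and stripping trailing punctuation is a heuristic that would truncate the rare DOI that really ends in such a character. The actual CrossRef lookup is left out.

```python
import re

# Heuristic pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix made of a restricted set of characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return DOI-like strings in order of first appearance, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Sentence punctuation directly after a DOI gets matched too; trim it.
        doi = match.group(0).rstrip(".,;:)")
        if doi not in found:
            found.append(doi)
    return found
```

Run over the text extracted from the first page of a PDF, the first hit is usually the article's own DOI, which can then be used as the key for a metadata lookup.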


There is a PDF metadata standard, described here: How does Mekentosj Papers work?

Reply written 21 months ago by Jeremy Leipzig

Hi Max! I doubt they are trying to be hard to work with, and I'd definitely contact them if I really need these later. But I just thought that paper which I just happened to be reading was a great test of some of these strategies. Maybe there's something I don't understand about this format though. Have a look at the supplement and tell me if there's something I'm missing about it.

Reply written 21 months ago by Mary
Mary (Boston MA area) wrote 21 months ago:

Arrgggh...I had my first opportunity to try some of these converter tools out. I was hoping to get out those motifs in supplementary table 7. There's a lot of 'em. But it looks like this supplement is made of images. I tried Zamzar and got 64 tabs of nothing.

http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2684.html

I loved the paper anyway. But that was a bear.


Maybe some OCR software could.... but no, this is just absurd.

Reply written 21 months ago by Michael Dondrup
Jeremy Leipzig (Philadelphia, PA) wrote 21 months ago:

I think this is a problem that existing tools can start to tackle. Adobe largely gave up on Flash and I think it's in their best interest to work on making PDFs more open, as well.

http://tv.adobe.com/watch/accessibility-adobe/acrobat-tagging-pdf-content-as-a-table/
