Forum: Large Tables In Supplementary PDF In Journal Articles
dario.garvan wrote:

What are your opinions on this? When a table is embedded in a PDF, it can't be pasted correctly into a spreadsheet program, especially if it spans multiple pages. This makes computational operations on the table effectively impossible. It would be easy for journals to have rules requiring that tables be provided as CSV or XLS.


moving to "Forum"

-- reply by Pierre Lindenbaum
Michael Dondrup (Bergen) wrote:

I agree that this is annoying and obfuscating; whether it is intentional or not, I do not know. I do not understand how someone with even the most minimal computing skills could have the amazingly stupid idea of storing computational data in PDF: PDF is meant to make documents look identical on any screen and printer, not to store data. Was it the journals allowing only PDF, or the authors themselves, assuming scientists are at least half-way intelligent people (or possibly not)?

Anyway, it is possible to break the obfuscation, at least sometimes. Recently I wanted to extract data from this supplementary file: http://www.nature.com/nbt/journal/v23/n8/extref/nbt1118-S4.pdf

To get the text I used pdf2txt.py (part of pdfMiner): pdf2txt.py -o data/nbt.txt -W 10000 -M 1000 -L 1000 nbt1118-S4.pdf (I experimented a bit with the options). The output still looks weird: the column headers come out as one letter per row, the table is fragmented, it contains special characters, etc.

Using the following Perl script, I was able to retrieve a clean tab-separated table. I wasn't able to retrieve the metadata given by an "X" in certain columns, though. I think it is possible if one counts the whitespace to work out the column positions (see the sketch after the script), but I didn't have time to test it.

#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
  # example input line:
  # 16846 AAGUAUAAAAGUUUAGUGUtc X X                     0.761
  chomp;
  # capture id, sequence and score; the "X" flag columns are skipped over by [X ]+
  my @ar = m/\s*(\d+)\s+(\w+)[X ]+\s*(\d\.\d+)/;
  if (@ar) {
    print join("\t", @ar), "\n";
  }
}
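
Picking up the untested whitespace-counting idea: since pdf2txt.py roughly preserves the horizontal layout, the "X" flags could in principle be read off by character position. A rough, untested sketch; the column offsets in @flag_cols are hypothetical placeholders and would have to be measured in the real pdf2txt.py output first:

#!/usr/bin/env perl
# Untested sketch: recover the "X" flag columns by character offset.
use strict;
use warnings;

my @flag_cols = (30, 34, 38);    # hypothetical start columns of the flag fields

while ( my $line = <> ) {
  chomp $line;
  my ( $id, $seq, $score ) = $line =~ m/\s*(\d+)\s+(\w+)[X ]+\s*(\d\.\d+)/
    or next;
  # mark each expected flag column with 1 if an "X" sits at that offset
  my @flags = map {
    my $c = length($line) > $_ ? substr( $line, $_, 1 ) : ' ';
    $c eq 'X' ? 1 : 0;
  } @flag_cols;
  print join( "\t", $id, $seq, @flags, $score ), "\n";
}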

Please note that this is not a suggestion for a solution, quite the opposite: it is totally messy and is meant to show the absurdity of such attempts. Furthermore, none of the mentioned approaches guarantees that the extracted data is completely correct. What I mean to say is: you can probably drive a nail into the wall with forceps instead of a hammer, but the fact that you can doesn't mean you should.

-- reply by Michael Dondrup
sarahhunter (Cambridge, UK) wrote:

You might find some of the work done at the University of Manchester on extracting information from PDFs interesting, particularly Utopia Documents: see "Calling International Rescue: knowledge lost in literature and data landslide" by Attwood et al.

Ben (Edinburgh, UK) wrote:

There are some online tools that can do this pretty well; e.g. Nitro Cloud seems to correctly convert everything from the PDF example in Michael's post, and Zamzar seems to get everything except the column headers. They split the PDF pages across sheets in Excel, but I guess you could then write to CSV and concatenate; not an ideal solution, but good enough for the odd occasion you have to do this sort of thing.

I expect authors do this because of author guidelines or submission systems that ask for supplementary data as PDF; they're presumably not using the data themselves in PDF form, or they would inevitably realise its limitations.


The conversions are in fact really good, but there are downsides: Nitro Cloud allows only 5 free conversions without registration, and with an account it is limited to 5 per month (or pay, or have some spare email addresses ;). As you say, the tables from different pages are each rendered in their own sheet; exporting them manually means 49 "save as CSV" clicks. I think a VBA script to first join all sheets into a single one should be a bit more comfortable (and easy too, if you know VBA). Maybe like this?
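
Alternatively, if VBA is not to your taste, a small Perl script can do the merging outside Excel. A rough sketch, assuming the Spreadsheet::ParseExcel module is installed and the converter produced an .xls file (untested on the actual Zamzar output; the file names in the usage comment are hypothetical):

#!/usr/bin/env perl
# Rough sketch: dump all sheets of an .xls workbook into one tab-separated stream.
# Usage (hypothetical file names): perl merge_sheets.pl nbt1118-S4.xls > nbt1118-S4.tsv
use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $file     = shift or die "usage: $0 workbook.xls\n";
my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse($file) or die $parser->error(), "\n";

for my $sheet ( $workbook->worksheets() ) {
    my ( $row_min, $row_max ) = $sheet->row_range();
    my ( $col_min, $col_max ) = $sheet->col_range();
    for my $row ( $row_min .. $row_max ) {
        my @values;
        for my $col ( $col_min .. $col_max ) {
            my $cell = $sheet->get_cell( $row, $col );
            push @values, defined $cell ? $cell->value() : '';
        }
        print join( "\t", @values ), "\n";
    }
}

Repeated page headers and footnotes would still have to be stripped afterwards.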

-- reply by Michael Dondrup

I tried the macro and it worked with the Zamzar-converted xls, after...

  • saving as a "macro-enabled Excel workbook" to allow macros
  • removing the additional text rows (header and footnotes) on sheets 1 and 49
  • fixing sheet 48, which had an extra empty column B from the conversion; removing the empty column B shifts all columns to the left

After that, I have a single sheet with 2384 rows plus a header.

I think we have taken a great step forward towards reproducible science </irony>

-- reply by Michael Dondrup
Neilfws (Sydney, Australia) wrote:

There are solutions, as outlined in some of the answers, but do not expect them to work well (if at all) in all cases.

What you need to understand is that historically, the role of journals has not been to provide data in a usable form. The PDF is simply a hangover from the days of print. Journals assume that all most readers want to do is print out material to take away and read. The idea that people might want to mine published material is rather strange to many publishers and indeed some would actively seek to prevent it.

I can only suggest we all lobby journals to adopt more modern policies regarding provision of data.


I think you are correct about the historical reasons, but in addition there seems to be a lack of awareness of reproducible science. In fact, the fragility and complexity of all the methods demonstrated here ridicule the journals' approach to handling supplementary data and provide the best argument for why one does not want to do it that way; we even found ourselves needing OCR software to read sequences from image files.

-- reply by Michael Dondrup

I agree this appears to be a leftover from the premodern days of Charles Dickens, but I don't think it's purposefully obfuscatory (although it can have this effect when you can't...get...at...the...data.) Most manuscript submission websites seem to convert your stuff to .pdf, and there it stays. And there appears to be a disconnect between the editor/associate editor (who would understand our issues) and the guys who come in later to do the proofs (who are not scientists). This whole concept of "supplementary data" has grown silly anyway -- that's usually where the main work is that you need access to. A larger change in publishing format is clearly needed.

-- reply by Alex Paciorkowski
Maximilian Haeussler (UCSC) wrote:

PDF is a format for graphics, or maybe for English text. People should not use it for tables or for any data that they want to be reused.

For the special case of tables in PDFs, this problem is annoying and common enough (e.g. for government data) that someone wrote a dedicated converter just for tables in PDFs: https://github.com/jazzido/tabula. I have never tried it, but it includes special code to identify rows and to remove headers and page breaks, so it really should work better.

Mary: If authors report motifs as graphics in a PDF, then the only motivation I can see is that they don't want their data to be used, or they forgot to provide it. You should email Matthieu Blanchette and ask for the raw data, which he definitely has. He is most likely aware of the problem. (If he doesn't reply: one of my colleagues works for him.)

As a general-purpose solution, I got very good results with OCR software like OmniPage or ABBYY. It often produces good XML, or at least HTML, from PDFs that for some reason fail with pdftotext. You can give the Java-based PDFBox a try, or Python's pyPdf or pdfMiner cited above; there is not a lot of difference between these tools in my hands.
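
If you try pdftotext, its -layout option (which tries to maintain the original physical layout) is usually worth a shot for tables, e.g. pdftotext -layout nbt1118-S4.pdf nbt1118-S4.txt; the file names here are just reused from the example above.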

If you want to write something yourself, which I don't recommend, you need one of these PDF-extraction libraries. They give you access to each individual character on all pages and let you find out the font size, font type, position, etc. CERMINE is supposedly a good tool for this, but I haven't tried it; see http://sciencesoft.web.cern.ch/node/120

For anyone working in text mining, PDFs are a time-consuming obstacle, but they are the de facto standard for scientific text. Tools like Papers or Google Scholar's parsers have to use various rules to find the author names, title and abstract in a PDF. They go for the biggest font on the first page (the title), the non-English text underneath (the authors), and then maybe a single paragraph of indented or bold text (the abstract). Another technique is to look for a DOI, which is easily recognizable with a regular expression, and then look up the metadata in CrossRef.
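
To illustrate the last point, here is a minimal sketch of such a DOI scan over extracted text; the regular expression follows the common "10.<registrant code>/<suffix>" shape and is only an approximation of what real DOIs allow:

#!/usr/bin/env perl
# Rough sketch: scan text extracted from a PDF for DOI-like strings.
use strict;
use warnings;

while ( my $line = <> ) {
    while ( $line =~ m{\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)}g ) {
        print "$1\n";    # each candidate DOI could then be looked up in CrossRef
    }
}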


There is a PDF metadata standard, described here: How does Mekentosj Papers work?

-- reply by Jeremy Leipzig

Hi Max! I doubt they are trying to be hard to work with, and I'd definitely contact them if I really need these data later. But I just thought that the paper, which I happened to be reading, was a great test of some of these strategies. Maybe there's something I don't understand about this format, though. Have a look at the supplement and tell me if there's something I'm missing about it.

-- reply by Mary
Mary (Boston MA area) wrote:

Arrgggh...I had my first opportunity to try some of these converter tools out. I was hoping to get out those motifs in supplementary table 7. There's a lot of 'em. But it looks like this supplement is made of images. I tried Zamzar and got 64 tabs of nothing.

http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2684.html

I loved the paper anyway. But that was a bear.


Maybe some OCR software could.... but no, this is just absurd.

-- reply by Michael Dondrup
Jeremy Leipzig (Philadelphia, PA) wrote:

I think this is a problem that existing tools can start to tackle. Adobe largely gave up on Flash and I think it's in their best interest to work on making PDFs more open, as well.

http://tv.adobe.com/watch/accessibility-adobe/acrobat-tagging-pdf-content-as-a-table/
