In the context of the Open Data movement, organisations (whether public administrations or private corporations) are increasingly releasing data into the public domain. The intention may be to become more transparent or to encourage developers to build useful applications on top of the published data.

To make re-use easy, this information should ideally be stored in a well-structured, machine-readable file, formatted as XML, CSV or Excel. However, this is not always the case: although such organisations are willing to share their data, the format is often poorly chosen, which in some cases makes the information practically useless. PDF files are a case in point. PDF was originally designed for content meant to be printed. That is why these files support paging and paper-like sizing and can contain indexes, but they are in no way suited to storing large amounts of structured data, as we expect from Open Data.

Activists, journalists and researchers who want to analyse large amounts of information published in PDF files often have to give up because of the effort involved in extracting all the numbers from the files. That is why we want to introduce you to Tabula, a tool that helps extract the information contained in tables inside PDF files.

Developed by Manuel Aristarán with the help of other fellows working on data journalism, Tabula can be installed on any computer (Windows, Mac or Linux) and, as if by magic, extracts the information from tables in PDF files and exports it directly to a neatly formatted CSV file. The interface makes the tool really easy to use: the user simply “draws” a box to select the relevant information. This saves a lot of valuable time.
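Once a table has been exported to CSV, it can be analysed with a few lines of code. The following is only a minimal sketch in Python, using the standard `csv` module: the sample data, the column names and the helper functions `load_table` and `column_total` are made up for illustration and are not part of Tabula. With a real export you would read the text from the CSV file you saved instead of the inline sample.

```python
import csv
import io


def load_table(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def column_total(rows, column):
    """Sum a numeric column, skipping cells that are empty or not numbers."""
    total = 0.0
    for row in rows:
        # Remove thousands separators such as "1,200" before converting.
        cell = (row.get(column) or "").replace(",", "").strip()
        try:
            total += float(cell)
        except ValueError:
            continue  # e.g. "n/a" or a blank cell
    return total


# Hypothetical sample standing in for a CSV file exported from a PDF table.
sample = 'Department,Budget\nHealth,"1,200"\nEducation,950\nCulture,n/a\n'

rows = load_table(sample)
print(column_total(rows, "Budget"))
```

Cells that cannot be read as numbers are skipped rather than treated as errors, since tables taken from PDFs often contain footnote marks or placeholders such as "n/a".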

It is important to note, however, that only text-based PDFs are supported for now, not scanned documents, whose internal structure is significantly different. Support for scanned documents would make the tool super powerful and sits at the top of the improvements wish-list. Did we mention that Tabula is Open Source? That means you can help improve it with code if you are a developer (OCR gurus more than welcome!), suggest improvements, or give feedback as a user.

