Friday, March 18, 2011

Converting PDF Tables

If you're someone that uses PDF's on a frequent basis, there's one thing that probably drives you the most crazy, and that's tabular data in PDF's generally cannot be copy and pasted into an Excel spreadsheet.  This is because a PDF is designed for visualizing and printing a document, a digital version of a paper document.  Thus PDF's are not designed to be used for analytical purposes like a Microsoft Excel spreadsheet where you can sum up rows, convert tabular data into a pivot table, etc.

To be fair to Adobe, PDF's are capability of storing the structural metadata a table.  Generally, when you use PDF printer software, even Adobe PDF Writer or Adobe Distiller, the software converts a table into lines with coordinates and positions data within the bounds of the lines that represent a cell of the table instead of retaining the table's metadata of table, row header, column header, row, and column.

The solution is to convert the PDF document to a format where tables are represented as tables with structural information so that software such as Microsoft Excel can recognize the table and import it into the program.

Before we get into software, please note, I've personally used a few of the PDF converter software products out there, but I'll post links to software vendors that I personally know as being reputable because some software vendor's websites can look a bit shady.  Again, I've used some of them myself and I'm not saying that they're not trustworthy, but please use good judgement when downloading software from the internet.  If you're not sure check for the software on a reputable freeware/nagware/shareware/trial software library such as CNET Download.

PDF to Text
The seemingly easiest solution is to convert the PDF to a text file with the data positioned correctly so that it looks like a table, but this isn't as easy as it seems.  PDF's do not store textual data the you visualize it.  Changes to the PDF are appended to the document, so when you extract the text from the PDF, you may find you textual data interlaced and positioned incorrectly.  Remember, PDF stores positional data for all elements of the PDF document so that when it is rendered to the screen or to a printer, everything is positioned correctly, but the data is not necessarily in order when extracted from the document.  The textual data must be read and reassembled.  Generally the free PDF to Text converters have the problem where PDF's with updated text is appended to the document generates a text file with the text out of order.

Here's a Google search to get you started:  convert pdf to text.

PDF to HTML
PDF to HTML converters suffer from a far different problem than PDF to Text converters because HTML represent structured and unstructured data.  Most PDF to HTML converts take the easy way out and instead of converting tabular data to <table>...</table> tags, it generates a HTML file with what looks to be a <div> tag explosion.  These <div> tags have absolute positioning so essentially, what you have is a direct PDF to HTML conversion which doesn't really help your cause any because there's still no structural information, but at the very least, it "looks" right.

Here's a Google search to get you started:  convert pdf to html.

PDF to Excel
This is probably the best solution for what you're trying to accomplish, convert visual tables into an actual table of data.  However, it's the hardest to acheive because most of the software products available are not free and tend to be trial versions.  Thre are 2 types of PDF to Excel converters and most of the software products to convert PDF to Excel falls into the first category.

1.  Manually map the PDF tables for extraction.
This method allows you to open a PDF file and draw a table over the tabular data.  The software product then uses the metadata created by you to extract the tabular data into an Microsoft Excel spreadsheet.

Here's a Google search to get you started:  convert pdf to excel.

2.  Automatically convert the PDF to a spreadsheet
This is my preferred version because it's automatic and the entire document gets converted into an Excel spreadsheet.  You can get wierd results if you've got a combination of both text and tabular data in the same document, but it works out ok.

Here I can vouch for a software product, Nuance PDF Converter can automatically convert a PDF into Word or Excel or other file types.  It generally works well and a trial version can be downloaded here from CNET Download.

No comments:

Post a Comment