Monday, May 18, 2015

DIY PDF to Excel / Spreadsheet conversion

1. If you want to do multiple pages at a go, then you'll first need to extract just those pages from the bigger document. For that, you can use:


2. Then, this site will convert that pdf to excel for you.. it was pretty accurate for me:
If not that, then my second choice is: 
You might have to clean up the converted excel, re-align stuff, etc.


3. In case it's just a few pages, or if you need a specific part of the page only, then better to use Tabula: 
It's a portable software that runs on your computer, works through your browser. You visually draw a box on the page and it converts that to table. Either copy it to clipboard and then paste in excel using import text wizard, or save it as CSV and then load it in excel.


4. Now, the converted file may have weird characters, if your document was using legacy fonts like Shree-Dev to display vernacular language data.
Here's a site having converters to convert such text to Unicode format, which can then be read, copy-pasted, searched:
You'll get HTML files that have scripts in them. Save them locally and run them in your browser. (note: certain browsers might block the script from working unless you activate it.)

For Pune's Budget book, which was using a font similar to but slightly different from Shree-Dev, I made a customized converter, you can copy it from here:
Convert, copy, and in excel, select just the topmost cell of your selection and press ctrl+v. all the cells you originally copied out from should be replaced.

In case this was in regular ShreeDev font, try this:

The text might get garbled up in the pdf-to-excel conversion. Try this mid-development version of Tabula: https://drive.google.com/open?id=0B3gxOiUzXTR-VkRmU1I3MElXYncGo
For more details, see this thread: https://github.com/tabulapdf/tabula/issues/303 . It handles legacy Devnagri scripts and does the conversion perfectly without distorting.
(Not confirmed but might work better with other Indian scripts too. But doesn't work with Unicode well the same way all others don't.. that seems to be an inherent problem in PDF technology.)

If is still happens (happened a lot on my end and later I got Tabula devs to fix it), then you have to copy-paste the text from the pdf. Press and hold the Ctrl, Alt buttons and draw a box (box-select) in the pdf-reader, that should let you select all the text in a column. First paste that in a simple text editor like Notepad.

Where there was more than one line of text in a cell, you'll have to "unwrap" them back to one line. (switch off wrapping by Format > Word Wrap). Where there were blank cells, you'll have to press enter, enter to space out the lines. Once you're sure it's exactly as it was in the PDF, then you can safely paste it back to the excel. Use the adjoining cells to make sure the stuff is well aligned.


5. For the figures, if they have commas like this: 52,50,000 then they'll be stored in the excel as text, not as numbers. To make them as numbers, select all the cells having the numbers, copy, and paste here:
..and press "zap commas". Copy the output and paste it back to the excel, with the first cell of your copied block selected.
(you can look for other such converters too or just make your own.. my file's source code is here: https://gist.github.com/answerquest/273360c7098428f2449a

No comments:

Related Posts with Thumbnails