Friday, May 1, 2015

Sharing tools and best practices for budget data analysis

Yes, I'm bad at sharing what I myself am up to. But for some reason when communicating with a person it happens nicely. So, forwarding an email:

Hi,


2 open source tools that were used:

PDFtk4all : for splitting heterogeneous PDFs into homogeneous sections, so that each file has only one kind of table. http://pdftk4all.sourceforge.net/

Tabula : for refined conversion from PDF to Excel. http://tabula.technology

Go down this thread: https://github.com/tabulapdf/tabula/issues/303 to get a Dropbox link to a version that handles Indian scripts and does the conversion perfectly, without distorting the text. (They still have to push the update to the main version.) The conversion only happens nicely if the PDF uses legacy fonts, that is. If the data is in Unicode, then God help us. Whatever you do, please don't persuade governments to publish large tables as PDFs in Unicode! If they are switching to Unicode, then they should stop making PDFs of the docs and just give a link that we can compare the file with for authenticity.


Our biggest hurdle is usually getting the data into Unicode format.
Here is a link to the converter I used: https://gist.github.com/answerquest/74c13f73f1bfb21c3177

That one was for a particular font that the PMC uses.

For Shree-Dev, which is a more popular font and which the bus authority here uses, here's the converter:

Detail: There's a substitution array of about 200 characters that we usually have to map to their Unicode equivalents. The converter script basically does a quick find+replace on all the substitution pairs. But in Hindi/Marathi we also have vowel "maatras" and a variety of combos, and that can get a little tricky, hence some scripting is needed. I didn't write the script myself. I got it from an online group and adapted it to work with what I had.
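To make the find+replace idea concrete, here is a minimal Python sketch. It is not the converter from the gist above, just an illustration under assumptions: the substitution pairs are hypothetical (a handful of made-up legacy codes, not the table of any specific font), and the function name `legacy_to_unicode` is my own. It does the substitution in a single regex pass rather than repeated find+replace, and then handles one well-known maatra quirk: legacy fonts type the short-i maatra before its consonant, while Unicode stores it after.

```python
import re

# Hypothetical substitution pairs (legacy code -> Unicode text).
# A real font needs roughly 200 pairs; these few are made up for illustration.
SUBSTITUTIONS = {
    "d": "\u0915",  # ka
    "e": "\u092E",  # ma
    "y": "\u0932",  # la
    "k": "\u093E",  # aa maatra
    "f": "\u093F",  # short-i maatra, typed BEFORE its consonant in legacy fonts
    "Z": "\u094D",  # halant (virama), joins consonants into a conjunct
}


def legacy_to_unicode(text, table=SUBSTITUTIONS):
    """Convert legacy-font text to Unicode using a substitution table."""
    # Try longer legacy codes first so multi-character codes are not
    # broken up by shorter ones. One regex pass also avoids re-replacing
    # text that has already been converted.
    codes = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))
    converted = pattern.sub(lambda m: table[m.group(0)], text)

    # Move the short-i maatra behind the consonant (or halant-joined
    # consonant cluster) that follows it, which is where Unicode expects it.
    consonant = "[\u0915-\u0939]"
    cluster = f"(?:{consonant}\u094D)*{consonant}"
    return re.sub(f"\u093F({cluster})", "\\1\u093F", converted)


if __name__ == "__main__":
    for sample in ("dey", "fdy", "dke"):
        print(sample, "->", legacy_to_unicode(sample))
```

Characters that are not in the table (digits, spaces, punctuation) pass through unchanged. A real converter needs more reordering rules than this one, which is where the "little tricky" part comes in.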

An example of letter-substitutions seen across some popular legacy fonts:

-----------------

Also, I've recently been working on using mapping to facilitate budget analysis. Here's one outcome, for browsing through this year's Participatory Budget for Pune:

One significant breakthrough here was learning how to combine geolocation data with a separate Excel file holding the info we want the map to show on the marked areas. How that happened is partially explained here: http://gis.stackexchange.com/questions/143730/leafletjs-combine-location-data-in-geojson-kml-gpx-with-other-information-in-c
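For anyone wanting to try the same kind of join, here is a rough Python sketch of the idea. It is not the approach from the linked question, just one way to do it as a preprocessing step: it assumes the location data is GeoJSON, the table has been exported to CSV, and both share a key column. The function name, the column names (`ward`, `works`, `amount`) and the sample values are all made up for illustration.

```python
import csv
import io
import json


def merge_table_into_geojson(geojson, rows, key):
    """Copy each table row's columns into the feature that has the same key."""
    rows_by_key = {row[key]: row for row in rows}
    for feature in geojson["features"]:
        # GeoJSON allows "properties" to be null, so fall back to an empty dict.
        props = feature.get("properties") or {}
        feature["properties"] = props
        value = props.get(key)
        match = rows_by_key.get(str(value)) if value is not None else None
        if match:
            # Add every column except the key itself, which the feature already has.
            props.update({k: v for k, v in match.items() if k != key})
    return geojson


# Made-up sample data: three map features and a two-row table.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ward": "A"},
         "geometry": {"type": "Point", "coordinates": [73.85, 18.52]}},
        {"type": "Feature", "properties": {"ward": "B"},
         "geometry": {"type": "Point", "coordinates": [73.87, 18.53]}},
        {"type": "Feature", "properties": {"ward": "C"},
         "geometry": {"type": "Point", "coordinates": [73.89, 18.54]}},
    ],
}
sample_csv = "ward,works,amount\nA,3,150000\nB,5,275000\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
merged = merge_table_into_geojson(sample_geojson, rows, key="ward")

if __name__ == "__main__":
    print(json.dumps(merged, indent=2))
```

The merged file can then be loaded by the map library as ordinary GeoJSON, with the extra columns available as feature properties. Features with no matching row (ward C here) are left as they were.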


And.. did you know Google Spreadsheets turns our regular graphs into interactive HTML5 ones? We can embed them on a website, like so:
http://pugpune.weebly.com/punes-annual-budget.html (note: website is work in progress and this is a temporary address.)


All the visualizations you'll see on the pages there come from Google Spreadsheets, which, just like Excel, makes graphs out of table data. No coding needed; I just had to copy-paste the HTML code.


Then,
There are some quite nice open data sharing platforms out there, for example http://datahub.io/ and http://data.okfn.org/. I'm going to explore how to plug into them.


