Cleaning Data

Extracting Text from PDF

There are many ways to get OCR’d text out of a pdf, from APIs to python utilities to copy/paste. If you are on a mac, one of the easiest is to set up an automator script to generate output from any PDF you drop on top of the application.

Use Automator to extract text from PDFs

Regular Expressions

Understanding Regular Expressions lesson from the Programming Historian
- A great introduction to the power of regex using advanced find and replace in LibreOffice
Cleaning OCR’d text with Regular Expressions lesson from the Programming Historian
- A next level up in regex use, writing a python script to parse messy OCR’d data
Regular Expressions cheat sheet from the TextMate text editor
- A handy go to guide for the most commonly used regular expressions

OpenRefine

OpenRefine is like a spreadsheet program but with extra special features that let you perform batch operations to clean data and turn messy OCR’d or scrape text into nice regular tabular data. There are a lot of great resources for getting started out there, but here are a few to get you started.

Miriam Posner’s introduction to OpenRefine
- A basic introduction to the tool, that will give you the key features for working from messy data that is already in a table
Cleaning data with OpenRefine lesson from the Programming Historian
- A more in-depth exploration including using the GREL language to do more advanced programmatic cleaning
Fetch and Parse data with OpenRefine lesson from the Programming Historian
- A power application that uses GREL to fetch URLs and parse web content into csv files

More OpenRefine resources from Miriam Posner

Cleaning Data

Extracting Text from PDF

Regular Expressions

OpenRefine

Author: Austin

Leave a Reply Cancel reply