Cleaning Data

Extracting Text from PDF

There are many ways to get OCR’d text out of a PDF, from APIs to Python utilities to copy/paste. If you are on a Mac, one of the easiest is to set up an Automator application that generates text output from any PDF you drop on top of it.

Use Automator to extract text from PDFs
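If you are not on a Mac, a Python utility can do roughly the same job. The sketch below is one possible approach, not the Automator workflow itself: it assumes the third-party pypdf package is installed and that the PDF already has an OCR text layer (pypdf reads existing text; it does not perform OCR). The function names are made up for illustration.

```python
from pathlib import Path


def text_path_for(pdf_path: Path) -> Path:
    """Return the path of the .txt file to write next to the given PDF."""
    return pdf_path.with_suffix(".txt")


def extract_text(pdf_path: Path) -> str:
    """Return the text layer of every page, separated by blank lines."""
    # Imported here so the module still loads if pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # Pages without a text layer contribute an empty string.
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


def convert(pdf_path: Path) -> Path:
    """Write the PDF's text to a sibling .txt file and return that path."""
    out_path = text_path_for(pdf_path)
    out_path.write_text(extract_text(pdf_path), encoding="utf-8")
    return out_path
```

Calling `convert(Path("scan.pdf"))` would write `scan.txt` beside the original, which mirrors the drop-a-PDF, get-a-text-file behavior described above.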

 


Regular Expressions
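As a small illustration of what regular expressions can do for OCR’d text, here is a sketch using Python’s built-in `re` module. The three patterns are only examples of common cleanup steps, and the function name is hypothetical; real documents usually need patterns tuned to their own quirks.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few common regex cleanups to OCR'd text."""
    # Rejoin words hyphenated across a line break: "regu-\nlar" -> "regular".
    # Caveat: this also joins genuine compounds split at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces,
    # leaving blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

The same patterns can usually be pasted into a text editor’s find-and-replace dialog if it supports regular expressions, though the syntax for backreferences varies between tools.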

 


OpenRefine

OpenRefine is like a spreadsheet program, but with extra features that let you perform batch operations to clean data and turn messy OCR’d or scraped text into regular tabular data. There are many good resources for getting started; here are a few.
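To give a feel for the kind of batch operation involved, here is a rough Python sketch of the idea behind OpenRefine’s "fingerprint" clustering, which groups values that differ only in case, punctuation, or word order. This is an approximation for illustration, not OpenRefine’s own code: it omits steps such as normalizing accented characters, and the function names are invented.

```python
import re
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a value to a normalized key for comparison."""
    # Trim, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    # Split into tokens, remove duplicates, sort, and rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint but are spelled differently."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    # Keep only keys shared by more than one distinct spelling.
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}
```

For example, `cluster(["Smith, John", "John Smith", "Jane Doe"])` groups the first two entries under the key `"john smith"`, which is the sort of candidate merge OpenRefine would offer for review.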

 

More OpenRefine resources from Miriam Posner


Author: Austin
