Cleaning Text Data

For any sort of textual analysis project, DHers need to spend LOTS of time cleaning the data first before they can analyze it or create meaningful visualizations. Today we are going to step through the process of taking a data table of art history data and cleaning it up, so that next time we can turn it into a nice looking visualization as an interactive timeline.

There are many many MANY ways to do this kind of data manipulation from code based solutions (most commonly in DH and data science circles using the Python or R languages), to simply using find and replace in spreadsheets like MS Excel or Google Sheets.

We are going to use a fantastic tool called OpenRefine to clean our data, which we will then import to a Google Sheet to create a timeline in the KnightLab’s beautiful TimelineJS framework.

OpenRefine is like a spreadsheet program but with extra special features that let you perform batch operations to clean data and turn messy OCR’d or scrape text into nice regular tabular data.

Here are the data manipulation tasks we will take with OpenRefine

Clean up spelling inconsistencies
Remove extra white space
Format dates and numbers using regular expressions
Extract and combine values

Note: Extracting Text from PDF

We are starting with table data here, but many projects will first need to pull data from PDF files using Optical Character Recognition. There are many ways to get OCR’d text out of a pdf, from APIs to python utilities to copy/paste. If you are on a mac, one of the easiest is to set up an automator script to generate output from any PDF you drop on top of the application.

Use Automator to extract text from PDFs

1. Install OpenRefine

Head to www.openrefine.org/download and download OpenRefine. It’s available for Windows, Mac and Linux

NOTE: If you’re on a Mac and, when you try to open OpenRefine, you get a message saying that you can’t open software from an unidentified developer, do the following: right-click the program and choose Open, then click Open again.

2. Download our sample data

We are going to be using a sample dataset developed for an “Introduction to OpenRefine,” tutorial developed by Owen Stephens on behalf of the British Library. We will use the same data, but do slightly different things with it today.

Download the data from github here: BL-Flickr-Images-Book.csv
Save it to your hard drive to import into OpenRefine

3. Clean data in Open Refine

We will walk through the steps together but this is what we are going to do, along with some text you’ll need to copy and paste to get going

Import the data and create a project
Create a text facet on Place of Publication field
Cluster the Place of Publication field to clean up inconsistencies
Use Edit Cells > Common Transforms to
- Convert all values in Publisher field to Title Case
- Remove leading and trailing white space
Clean up the Date of Publication column so that we can create a timeline. We will do this using custom text transforms and regular expressions (see below).
1. Create a facet on Date of Publication
2. Clean up non-date characters with the following regex expression in the Edit Cells > Transform box
  - Remove all secondary dates that follow first
    - value.replace(/\s[.*\]/,””)
  - Remove [].? characters. First, sort column alphabetically so our preview shows same as facet
    - value.replace(/[?\.\[\]]/,””) Note we have to escape .[] because they have special meaning in regex
  - Capture groups. Group 1: any non-digit characters at the beginning, Group 2: the first 4 digit date, Group 3: any additional characters after that date
    - value.replace(/(\D)(\d{4})(.$)/,”$2″)
  - Add column “Date 2” from column “Publisher”:
    - value.match(/.(\d{4})./)[0]
  - Merge columns. Select all blank Date of Publication cells and transform
    - cells[“Date 2”].value
  - Remove “Date 2” column
  - Export csv

Regular Expressions

Regular Expressions are a fantastically useful tool to match patterns in text to do advanced find, replace, and extract operations. They are implemented slightly differently in many different applications, but here are some general resources to get you started.

RegExr is a great browser based tool to learn and try out regular expressions, much like paste bin sites like jsBin or CodePen that we’ve used for HTML, CSS and JavaScript
Understanding Regular Expressions lesson from the Programming Historian
- A great introduction to the power of regex using advanced find and replace in LibreOffice
Cleaning OCR’d text with Regular Expressions lesson from the Programming Historian
- A next level up in regex use, writing a python script to parse messy OCR’d data
Regular Expressions cheat sheet from the TextMate text editor
- A handy go to guide for the most commonly used regular expressions

OpenRefine Resources

There are a lot of great resources for getting started out there, but here are a few to get you started.

Miriam Posner’s introduction to OpenRefine
- A basic introduction to the tool, that will give you the key features for working from messy data that is already in a table
Cleaning data with OpenRefine lesson from the Programming Historian
- A more in-depth exploration including using the GREL language to do more advanced programmatic cleaning
Fetch and Parse data with OpenRefine lesson from the Programming Historian
- A power application that uses GREL to fetch URLs and parse web content into csv files

More OpenRefine resources from Miriam Posner

Hacking the Humanities 2019

Note: Extracting Text from PDF

1. Install OpenRefine

2. Download our sample data

3. Clean data in Open Refine

Regular Expressions

OpenRefine Resources

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Authors

License

Sources

Meta