Cleaning Data

Extracting Text from PDF

There are many ways to get OCR’d text out of a PDF, from APIs to Python utilities to copy/paste. If you are on a Mac, one of the easiest is to set up an Automator script to generate output from any PDF you drop on top

Continue reading

Text Analysis 101: XML, TEI, and VoyantTools

Over the past few weeks we have discussed and seen how the modern dynamic web and the digital humanities projects it hosts comprise structured data (usually residing in a relational database) that is served to the browser in response to a user request and rendered there as HTML markup. This week we are exploring how

Continue reading