Week 6: Text Analysis + XML – Hacking the Humanities

February 12, 2018 Austin

Cleaning Data

Extracting Text from PDF

There are many ways to get OCR’d text out of a PDF, from APIs to Python utilities to copy/paste. If you are on a Mac, one of the easiest is to set up an Automator script to generate output from any PDF you drop on top
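Whichever route produces the text, OCR output pulled from a PDF usually arrives with hard line wraps and words split by end-of-line hyphens. The following is a minimal sketch of a cleanup pass, not a method described in the post itself. The function name `clean_ocr_text` and the sample string are hypothetical, and it uses only Python's standard `re` module.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy text copied or extracted from an OCR'd PDF (illustrative helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "digi-\ntal" -> "digital".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks, then unwrap the lines
    # inside each paragraph and collapse runs of whitespace.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    # Hypothetical sample standing in for pasted OCR output.
    sample = "Digi-\ntal  humanities\nprojects\r\n\r\nSecond   para-\ngraph."
    print(clean_ocr_text(sample))
```

One trade-off to keep in mind: the hyphen rule also joins genuine compounds that happen to break at a line end (for example "well-" followed by "known" becomes "wellknown"), so spot-check the result before running any analysis on it.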

February 7, 2018 Austin

Text Analysis 101: XML, TEI, and VoyantTools

Over the past few weeks we have discussed and seen how the modern dynamic web and the digital humanities projects it hosts consist of structured data (usually residing in a relational database) that is served to the browser in response to a user request and rendered there in HTML markup. This week we are exploring how