Update on the Carletonian project

Progress

So far, we have received a large batch of PDFs from the digital archivist on a thumb drive, one for every page of every issue of the Carletonian. We converted all of these into text files. We then realized that Voyant Tools limits how many files you can upload to a corpus, but places no limit on file size, so we combined the text files into one file per year using another Automator workflow. These combined files are stored in a Google Drive folder.

 

We also made progress on narrowing down our plan to analyze our data: we decided to search through our data set, using n-gram searches and topic modeling, on a few different topic fronts, each comparing how the Carletonian's focus on certain topics changed over time. These topic fronts are academic (science vs. humanities, math vs. physics, English vs. history, Spanish vs. French, etc.), athletic (basketball vs. baseball vs. volleyball, etc.), social (men vs. women, diversity), and geographic (Europe vs. Africa vs. Asia). To track each topic, we will compile lists of relevant words to search for.
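As a rough illustration of the word-list approach, here is a minimal Python sketch that counts how often each topic's words appear in one year's combined text. The topic lists here are placeholders for illustration, not our final word lists.

```python
from collections import Counter
import re

# Placeholder topic word lists -- illustrative only, not our final lists.
TOPICS = {
    "science": ["physics", "chemistry", "biology", "laboratory"],
    "humanities": ["history", "english", "philosophy", "literature"],
}

def topic_counts(text):
    """Count how often each topic's words appear in one year's combined text."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return {topic: sum(words[w] for w in wordlist)
            for topic, wordlist in TOPICS.items()}
```

Running this over each year file would give a per-year tally for each topic front, which is the comparison over time we are after.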

 

Problems and proposed solutions

The first issue we encountered was getting our hands on all the issues of the Carletonian in file form. At first, we thought we might have to download each page individually, but by communicating with the digital archivist we were able to get them all on a thumb drive.

 

Our original plan for extracting text from the Carletonian PDFs was to use a Python script with regular expressions, combined with a free OCR API. While this solution did work, it was not easy to get a clean text extraction: there were many misspelled words, and formatting was an issue. To solve this, we switched to an Automator workflow that takes multiple PDFs as input and outputs a text file for each one. This method proved much more efficient and produced a more accurate text extraction than the Python script and API.
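For reference, the kind of regex cleanup our original script attempted can be sketched as follows. The patterns are illustrative of typical OCR cleanup, not the exact script we used.

```python
import re

def clean_ocr(text):
    """Minimal OCR cleanup: rejoin hyphenated line breaks, collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)    # rejoin words split across lines
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Even with cleanup like this, misrecognized characters remain, which is part of why the Automator approach won out for us.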

 

We also found that there is a limit to how many files you can upload to a corpus in Voyant Tools. So, we used another Automator workflow to combine the files for each year into one file, drastically decreasing the number of files while losing no data.

 

Tools and Techniques

 

To extract text from each Carletonian PDF, we have been using the Automator workflow pictured below. Using Automator has greatly improved our efficiency.

 

To combine all the text files from a given year, we used the following workflow application.

 

To run our analysis, we plan to use the n-gram and topic modeling features provided by Voyant Tools, hopefully gaining some insight into how Carleton has changed from 1887 to 2016.
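To make concrete what an n-gram search does, here is a minimal Python sketch of the idea behind the feature Voyant Tools provides: sliding a window of n words across the text and collecting each sequence.

```python
def ngrams(text, n=2):
    """Return the list of consecutive n-word sequences in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Counting how often particular n-grams appear in each year file is what lets us compare topic coverage across decades.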

 

Deliverables

 

So far, our project is on track. After a slow start, we have improved our data-gathering efficiency with the help of Carleton's Digital Archivist and Automator workflows. As of this writing, we have converted about 40 years of PDFs into usable text files, and we are on track to have at least 60 years (1956 to 2016) by Wednesday. At that point we will begin trial runs of our text analysis using Voyant Tools. As we continue testing various methods of analysis, we will continue to add to our text file database. By the end of the project, we hope to have converted every Carletonian issue from 1887 to 2016.

 

Author: lieberkotzo

http://orenlieberkotz.org/
