Academic Catalog Analysis: Max and Cole
Progress
We spoke with both Nat Wilson and Hsianghui Liu-Spencer from the Carleton archives to learn more about the documents, including how they were OCR'ed and what content is available. Unfortunately, the information they had was not helpful for our project specifically.
So far, we have gathered PDFs and OCR-processed text files of every academic catalog from 1900 to 2000. Downloading them individually took a long time. We have not yet built anything, but we have a plan in place.
Problems
Our initial issue is the variance in format and OCR quality across the catalogs. This has forced us to take a different route: our new plan is to look at overall trends in word use and frequency.
Tools and techniques
We are not yet sure exactly what software we will use to analyze the catalogs. We have considered Voyant Tools, but we would like to look into other software that can analyze several files at once. We will also need to add several stop words specific to the catalogs, such as "course," "credit," and "professor."
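Whatever tool we settle on, the core idea is simple: count word frequencies across many files while skipping a custom stop-word list. A minimal Python sketch of that idea (the stop words and sample snippets below are placeholders, not our final list):

```python
from collections import Counter
import re

# Placeholder stop-word list: a few catalog-specific terms plus common function words.
STOP_WORDS = {"course", "credit", "professor", "the", "and", "of", "a", "in"}

def word_frequencies(texts):
    """Count word frequencies across several documents, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# Stand-in snippets; in practice these would be the OCR'ed catalog files.
catalogs = [
    "The course in Latin carries one credit. Professor Smith teaches Latin.",
    "A course in Chemistry: laboratory credit required.",
]
freqs = word_frequencies(catalogs)
print(freqs.most_common(3))
```

Running this kind of count per decade would let us chart how vocabulary shifts across the century without depending on any one catalog's layout.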
Timeline
We are on track for delivery.
Understand and evaluate the best tools and data-storage options: by Feb. 16 (met)
Clean up data and start working on stop-words list: Monday, Feb. 26
Website started: Friday, March 2
Presentation: Friday, March 9
Author: Max Goldberg
http://goldbergmax.com/about/
Max and Cole,
As we discussed in class, there are many more or less intensive approaches to text analysis beyond Voyant, which usually just serves as a first taste of what a text's possibilities are. One of the tools more commonly used for DH projects is AntConc, and there is a good tutorial on corpus analysis with AntConc on the Programming Historian website.
Once you find some useful comparisons, we can explore further visualization options.