Data collection and zoobooks

group members: Genesis Rojas, Bloosom M. Booth, Quang Tran Dang, Selam Nicola

progress:

So far, the stalkermap group has found and collected data for the years 1955-1966 from zoobooks. We converted PDF downloads from Digital Collections on the Carleton Archive page to txt format using PDFMiner through the terminal/command prompt. After converting, we manually went through the zoobook data, which consisted of student names, their high schools, and states. We then manually removed all unnecessary information, such as page numbers, the college name, and other miscellaneous text, and corrected misspelled state and country names. Once that was done, we used the Sublime Text editor to manually adjust our script and ran it to convert the txt files into CSV.

Problems:

Members of our group had different operating systems, which led to difficulty when trying to install software. This was fixed by looking up online tutorials and guides on how to install the software on each OS.

Applications and language:

PDFMiner – a tool for extracting information from PDF files, written in Python. It includes a converter that can transform PDF files into other text formats (HTML, TXT).
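For reference, PDFMiner ships with a command-line converter, pdf2txt.py, which is what makes the PDF-to-txt step scriptable; a conversion like ours might look like this (the filename here is hypothetical):

```shell
# Convert one downloaded zoobook PDF to plain text (example filename).
pdf2txt.py -o zoobook_1960.txt zoobook_1960.pdf
```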

regex_replace.py – our file-conversion script, which uses regular expressions to convert plain text into tabular data.
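The actual patterns in regex_replace.py are specific to our zoobook files, but the general approach can be sketched as follows; the line pattern, field names, and example input here are hypothetical, not the script's real ones:

```python
import csv
import re

# Hypothetical input format: one cleaned-up student per line, e.g.
# "Jane Doe, Central High School, Minnesota"
# The real patterns in regex_replace.py differ; this is only a sketch.
LINE_PATTERN = re.compile(r"^(?P<name>[^,]+),\s*(?P<school>[^,]+),\s*(?P<state>.+)$")

def txt_to_rows(lines):
    """Turn cleaned plain-text lines into (name, school, state) tuples,
    skipping any line that does not match the expected pattern."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append((match["name"], match["school"], match["state"]))
    return rows

def write_csv(rows, path):
    """Write the extracted rows out as a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "high_school", "state"])
        writer.writerows(rows)
```

Lines that do not match the pattern (leftover page numbers, for instance) are simply dropped, which is one reason the manual cleanup step matters before the script runs.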


Below is a step-by-step layout of how we transcribed our data:

1. Converted the PDF files downloaded from Digital Collections to txt format using PDFMiner.
 
2. We manually edited the data so that each student’s information starts on a new line. We deleted anything else that was not relevant to our project and fixed some misspellings that resulted from the PDFMiner conversion.

3. Ran our regex_replace.py script to convert the data into CSV. CSV stands for comma-separated values and is a widely used format for storing tabular data in plain text.

Deliverable: 

This is week 7, and we are finishing up transcribing our data. We still need data for the years 1916-1954. In week 8, we’ll start formulating our presentation strategies. In weeks 9 and 10, we’ll be finalizing our story map.


nicolas

2 Comments

  1. Team Stalkermap,

    Looks like you’ve made great progress and have gotten a lot done! I can’t wait to see what you all come up with. A few comments and suggestions.

    –If you run into trouble finding comparable data for pre-1955, I think it would be fine if you used the zoobooks from 1966 to the present instead. That way you could stick to the same methodology and get a long time series over which to look for patterns.

    –Since you already have a test set of data for 10 years, I would suggest splitting your group’s responsibilities and assigning one or two members to continue processing data, while one of the others decides who will host the final project on their server and starts setting up, and another begins trying out the test data in Palladio, if that is what you will be using.

    The more diversified you get now, the quicker you will encounter roadblocks and the longer you will have to correct course.

  2. That’s cool that you guys have been able to use scripts and technology to sort through data. Our group’s methods are, unfortunately, much closer to the manual grunt work of the conventional humanities than what I would expect for a digital humanities project. I’m excited to see your final product! It should be interesting.
