Transcribing Challenges

Last week I spent quite some time writing a script to pull out the information from text files and convert it to a tabular format. The Zoobook data is not the worst thing ever, but there is a lot of it and processing everything manually is certainly not the way to go.

Here are several approaches I took in order to complete this task:

  1. The first script that I wrote uses some of Python's built-in string methods, such as replace(), split(), and strip(). The code opens a text file named by the user on the command line and reads it line by line. It then splits each line into a list of words and loops through every element of the list to check whether it is a key in my premade dictionary that maps states to integers.
  2. The second script is based on word frequency. The program first imports a big database of all cities in the world. It then loops through the text to count how often each city appears. This algorithm is slow, and it also counts students' names that happen to be spelled the same as a city, so I decided to abandon it.
  3. The third script is based on regular expressions, and the majority of it was taken from the Programming Historian website (thanks, Austin). It uses Python's re library to clean up the data and search for specific patterns, and it writes the matches to a CSV file. For more information, visit this post.
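
To give a feel for the first approach, here is a minimal sketch rather than the actual script. The dictionary contents, the assumed line format ("Name, City, ST"), and the function names are all made up for illustration.

```python
import sys

# Hypothetical lookup table: the real dictionary of states is not shown in this post.
STATE_CODES = {"MN": 1, "WI": 2, "IA": 3}


def find_state_codes(line, state_codes=STATE_CODES):
    """Return the integer code of every state abbreviation found in one line."""
    # Turn commas into spaces so that "Northfield,MN" still splits into two words.
    words = line.replace(",", " ").split()
    codes = []
    for word in words:
        # Drop punctuation stuck to the word; case is left alone so that
        # the ordinary word "in" is not mistaken for "IN".
        token = word.strip(".;:()")
        if token in state_codes:
            codes.append(state_codes[token])
    return codes


def main(path):
    """Read the file line by line and print each line that mentions a known state."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            codes = find_state_codes(line)
            if codes:
                print(line.strip(), codes)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Matching word by word like this cannot catch multi-word state names such as "New York", which is one reason the sketch assumes two-letter abbreviations.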

For anyone interested in doing something similar, I recommend visiting the Programming Historian website, which has great Python and regular expression tutorials that I found particularly useful.
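
As a rough illustration of the regular expression approach (not the Programming Historian code, and not my full script), the sketch below cleans up whitespace, matches one assumed entry format, and builds CSV text. The "Name, City, ST" pattern, the column names, and the function names are assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical pattern: "Name, City, ST", where ST is a two-letter state code.
ENTRY = re.compile(
    r"^\s*(?P<name>[^,]+?)\s*,\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Z]{2})\s*$"
)


def clean(line):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", line).strip()


def to_rows(lines):
    """Keep only the lines that match ENTRY, as [name, city, state] rows."""
    rows = []
    for line in lines:
        match = ENTRY.match(clean(line))
        if match:
            rows.append([match["name"], match["city"], match["state"]])
    return rows


def to_csv_text(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "city", "state"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Lines that do not fit the pattern (page headers, stray numbers) are simply skipped, so it is worth checking how many lines were dropped before trusting the output.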

Happy 7th week!

Quang

Quang Tran

4 Comments

  1. That’s actually really cool stuff! It’s unfortunate that you ran into the challenges you did; I don’t know what you could do to stop it from counting student names as cities. I’d like to know more about the third script, though. How exactly does it clean up the data? Also, how nicely organized is the Zoobook data you’ve been working with? Hopefully slight misspellings of names or cities haven’t been counted as cities or different names, respectively.

  2. Quang,

    Nice post. I like how you let us in on the process, by describing each attempt and why it wasn’t working. It would be really helpful to include an example or two here and there for the benefit of people like Pallav who want to know more. An example of a particular regular expression and what it does, for instance?

  3. Awesome post! Linking it to the Programming Historian website was a good call, as it allowed me to try some of the material you mentioned and learn about it in greater depth. A couple of screenshots would have been a nice visual aid to your bullet points (which, by the way, made it easy to follow).
