Institute for Quantitative Social Science, Harvard University
Konstantin Kashin is a Fellow at the Institute for Quantitative Social Science at Harvard University and will be joining Facebook's Core Data Science group in September 2015. Konstantin develops new statistical methods for diverse applications in the social sciences, with a focus on causal inference, text as data, and Bayesian forecasting. He holds a PhD in Political Science and an AM in Statistics from Harvard University.
This week I wanted to write a Python script that was able to extract text from both pdf files and Microsoft Word documents (both .doc and .docx formats). This actually proved to be rather difficult, particularly when it came to both Microsoft Word since there was no one utility that was able to handle both the old Word format and the more recent .docx one. This post is a summary of the utilities that I came across and what I finally used to complete this task.