Institute for Quantitative Social Science, Harvard University
Konstantin Kashin is a Fellow at the Institute for Quantitative Social Science at Harvard University and will be joining Facebook's Core Data Science group in September 2015. Konstantin develops new statistical methods for diverse applications in the social sciences, with a focus on causal inference, text as data, and Bayesian forecasting. He holds a PhD in Political Science and an AM in Statistics from Harvard University.
I recently had the honor of giving a talk with Adam Glynn at The Future of Big Data Symposium at the University of Nebraska-Lincoln. The talk, entitled “Challenges of Big Data in the Social Sciences”, can be viewed here:
In the talk, we presented results from two papers (in addition to broader reflections on big data in the social sciences):
I am a big fan of interactive data visualization for data exploration and presenting research findings, and thus I wanted to share a new site Sergiy Nesterko recently launched called Databits. The site features interactive visualizations built using a variety of tools (primarily D3.js, although there are a few recent examples in Processing.js), and the goal is to get a community of people interested in data visualization to showcase their work (including open-source code), interact with one another, and hopefully hone their skills in the process! It’s very much in its initial stages, but please join and contribute if you have visualizations you want to share.
I’ve uploaded a few of my visualizations here and hope to upload more in the near future. One of the nice features of Databits is the ease with which you can embed a visualization in another website:
I’ve finally updated and uploaded a detailed note on maximum likelihood estimation, based in part on material I taught in Gov 2001. It is available in full here.
To summarize the note without getting into too much math, let’s first define the likelihood as proportional to the joint probability of the data conditional on the parameter of interest ($\theta$): $$L(\theta|\mathbf{x}) \propto f(\mathbf{x}|\theta) = \prod\limits_{i=1}^n f(x_i|\theta)$$ The maximum likelihood estimate (MLE) of $\theta$ is the value of $\theta$ in the parameter space $\Omega$ that maximizes the likelihood function: $$\hat{\theta}_{MLE} = \operatorname*{arg\,max}_{\theta \in \Omega} L(\theta|\mathbf{x}) = \operatorname*{arg\,max}_{\theta \in \Omega} \prod\limits_{i=1}^n f(x_i|\theta)$$
This turns out to be equivalent to maximizing the log-likelihood function (which is often simpler to work with): $$\hat{\theta}_{MLE} = \operatorname*{arg\,max}_{\theta \in \Omega} \log L(\theta|\mathbf{x}) = \operatorname*{arg\,max}_{\theta \in \Omega} \ell (\theta|\mathbf{x}) = \operatorname*{arg\,max}_{\theta \in \Omega} \sum\limits_{i=1}^n \log f(x_i|\theta)$$
One can find the MLE either analytically (using calculus) or numerically (by using R or another program).
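As an example of the analytic approach, suppose we observe $n$ independent draws $x_1, \ldots, x_n$ from a Poisson distribution with unknown rate $\lambda$. The log-likelihood is $$\ell(\lambda|\mathbf{x}) = \sum\limits_{i=1}^n \big( x_i \log \lambda - \lambda - \log x_i! \big)$$ Setting its derivative with respect to $\lambda$ to zero gives $$\dfrac{\partial \ell(\lambda|\mathbf{x})}{\partial \lambda} = \dfrac{\sum_{i=1}^n x_i}{\lambda} - n = 0 \quad \Longrightarrow \quad \hat{\lambda}_{MLE} = \dfrac{1}{n}\sum\limits_{i=1}^n x_i = \bar{x}$$ Since the second derivative, $-\sum_{i=1}^n x_i / \lambda^2$, is negative, this is indeed a maximum: the MLE of a Poisson rate is simply the sample mean.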
Suppose that we want to visualize the log-likelihood curve for data drawn from a Poisson distribution with an unknown parameter $\lambda$. The data we observe is {2,1,1,4,4,2,1,2,1,2}. In R, we can do this quite simply as:
# observed data
my.data <- c(2,1,1,4,4,2,1,2,1,2)
# Poisson log-likelihood as a function of lambda
pois.ll <- function(x) return(sum(dpois(my.data, lambda=x, log=TRUE)))
# vectorize so that curve() can evaluate it over a grid of lambda values
pois.ll <- Vectorize(pois.ll)
curve(pois.ll, from=0, to=10, lwd=2, xlab=expression(lambda), ylab="Log-Likelihood")
We already know (based on analytic solutions) that the MLE for $\lambda$ in a Poisson distribution is just the sample mean, which comes out to 2 in this case. Thus, we can mark it on the log-likelihood curve to produce the following graph:
If we wanted to maximize the log-likelihood in R (on the parameter space [0,100], chosen because it’s sufficiently wide to encompass the MLE), we could have done:
opt <- optimize(pois.ll, interval=c(0,100),maximum=TRUE)
opt$maximum # gives MLE
opt$objective # gives value of log-likelihood at MLE
R confirms our analytic solution.
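The same check can be sketched in Python as well (a hypothetical translation of the R code above, not part of the original note). Here, Newton-Raphson on the score function converges to the sample mean in a handful of steps:

```python
import math
import numpy as np

data = np.array([2, 1, 1, 4, 4, 2, 1, 2, 1, 2])

def log_lik(lam):
    # full Poisson log-likelihood, including the log(x_i!) terms
    return float(np.sum(data * np.log(lam) - lam)
                 - sum(math.lgamma(x + 1) for x in data))

# Newton-Raphson on the score: d(log L)/d(lambda) = sum(x)/lambda - n
lam = 1.0
for _ in range(50):
    score = data.sum() / lam - len(data)
    hessian = -data.sum() / lam ** 2
    lam -= score / hessian

print(lam)           # the MLE: the sample mean, 2.0
print(log_lik(lam))  # value of the log-likelihood at the MLE
```

Because the score for the Poisson is linear in $\sum_i x_i$, the iteration converges very quickly to the sample mean.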
Why do we use maximum likelihood estimation? It turns out that subject to regularity conditions the following properties hold for the MLE:
Consistency: as sample size ($n$) increases, the MLE ($\hat{\theta}_{MLE}$) converges in probability to the true parameter, $\theta_0$. $$\hat{\theta}_{MLE} \overset{p}{\longrightarrow} \theta_0$$
Normality: as sample size ($n$) increases, the MLE is approximately normally distributed with mean equal to the true parameter ($\theta_0$) and variance equal to the inverse of the expected sample Fisher information at the true parameter. Since the true parameter is unknown, we can invoke the consistency of the MLE and instead approximate the variance with the inverse of the observed sample Fisher information evaluated at the MLE, denoted $\mathcal{J}_n(\hat{\theta}_{MLE})$. The observed sample Fisher information is the negative of the second derivative of the log-likelihood curve. $$\hat{\theta}_{MLE} \sim \mathcal{N} \left(\theta_0, \Big(\underbrace{- \Big( \dfrac{\partial^2 \ell(\theta|\mathbf{x})}{\partial \theta^2} \Big|_{\theta=\hat{\theta}_{MLE}} \Big)}_{\mathcal{J}_n(\hat{\theta}_{MLE})} \Big)^{-1} \right)$$
Efficiency: as sample size increases, the variance of the MLE attains the Cramér-Rao lower bound, so no other consistent estimator has a lower asymptotic variance.
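To make the normality result concrete, return to the Poisson example from above. The second derivative of the Poisson log-likelihood is $\partial^2 \ell / \partial \lambda^2 = -\sum_{i=1}^n x_i / \lambda^2$, so the observed sample Fisher information evaluated at the MLE is $$\mathcal{J}_n(\hat{\lambda}_{MLE}) = \dfrac{\sum_{i=1}^n x_i}{\hat{\lambda}_{MLE}^2} = \dfrac{20}{2^2} = 5$$ which gives an approximate standard error of $\sqrt{1/5} \approx 0.45$ for $\hat{\lambda}_{MLE} = 2$.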
Here is a visualization I constructed using D3.js, based on a visualization for Harvard’s Stat 221 class of a network of individuals for whom HIV status is known (original visualization here). I wanted the visualization to maximally exploit the information available in the data, such as whether friendships are mostly seroconcordant (between individuals of the same HIV status) or serodiscordant. I also wanted to see whether most friendships were between individuals of the same gender. Hence, I adapted the hive plot template for this network data. Here is a static picture of the resultant network:
Note that the two axes of the hive plot connote the HIV status of the individuals. The nodes are colored and ordered first by gender; within each gender, they are then ordered by the number of links, with nodes closer to the origin of the axis being better connected. The links are also color-coded by whether they are friendships between the same gender or opposite genders. Whether a link is serodiscordant or seroconcordant is evident from its placement relative to the axes.
The full, interactive visualization is available here and the code is available on Github.
For those new to D3.js, here are some great resources for getting started:
This week I wanted to write a Python script that could extract text from both pdf files and Microsoft Word documents (both .doc and .docx formats). This actually proved rather difficult, particularly for Microsoft Word, since no single utility was able to handle both the old .doc format and the more recent .docx one. This post is a summary of the utilities I came across and what I finally used to complete this task.
First, with regards to pdf files, the main Python library for opening pdf files is PDFMiner. Several additional libraries essentially serve as wrappers to PDFMiner, including Slate. Slate is significantly simpler to use than PDFMiner, but this simplicity comes at the cost of functionality. Even though I first tried Slate, it ended up not performing well for the pdfs I was working with. Specifically, it did not fully respect the original spacing between words, thereby cutting certain words into multiple fragments or concatenating others. I thus switched to PDFMiner because of its customizability. Out of the box, PDFMiner’s pdf2txt.py command-line utility exhibited a similar problem with word spacing. However, this turned out to be extremely easy to fix by tuning the word-margin option passed to pdf2txt.py. Specifically, I ran the following in the command line:
pdf2txt.py -o foo.txt -W 0.5 foo.pdf
When it comes to Word 2007 .docx files, the Python-based utility that did the job was the python-docx library, which I ran from the command line as follows:
./example-extracttext.py foo.docx foo.txt
For older Word documents (for example Word 2003), the python-docx library does not work. I ended up using the C-based antiword utility. Originally a Linux-based utility, antiword (version 0.37) can be installed on Mac OS X as follows:
brew update
brew install antiword
From within Python, I was then easily able to convert a .doc document to text:
os.system('antiword foo.doc > foo.txt')
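For convenience, the three tools can be wrapped in a single dispatch function keyed on the file extension. This is only a sketch, under the assumption that pdf2txt.py, python-docx’s example-extracttext.py, and antiword are all on the PATH:

```python
import os
import subprocess

def extract_text(path, out_txt):
    """Pick one of the extraction tools described above by file extension.

    Assumes pdf2txt.py, example-extracttext.py, and antiword are
    installed and on the PATH (a sketch, not a tested pipeline).
    """
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # PDFMiner with the word-margin fix
        return subprocess.call(["pdf2txt.py", "-o", out_txt, "-W", "0.5", path])
    elif ext == ".docx":
        # python-docx example script
        return subprocess.call(["example-extracttext.py", path, out_txt])
    elif ext == ".doc":
        # antiword writes to stdout, so redirect through the shell
        return os.system("antiword {0} > {1}".format(path, out_txt))
    else:
        raise ValueError("unsupported format: " + ext)
```

With this in place, a directory of mixed pdf, .doc, and .docx files can be converted in a single loop.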