Institute for Quantitative Social Science, Harvard University
Konstantin Kashin is a Fellow at the Institute for Quantitative Social Science at Harvard University and will be joining Facebook's Core Data Science group in September 2015. Konstantin develops new statistical methods for diverse applications in the social sciences, with a focus on causal inference, text as data, and Bayesian forecasting. He holds a PhD in Political Science and an AM in Statistics from Harvard University.
In this post, I’m going to go through how to make plots of distributions (either density plots or histograms) in ggplot2. I’m going to draw upon examples of Fisherian testing in the context of causal inference, but the examples should be completely understandable without knowledge of Fisher’s approach to inference.
For example, suppose that we want to plot the randomization distribution of a test statistic (difference in means in this case) under the sharp null of zero treatment effect across all units. The possible values of the test statistic across 1000 randomly sampled randomizations are in the vector
test.stat. Moreover, the actually observed test statistic is
tau.hat. Not only do we want to plot the density, but we want to shade in the area under the density that corresponds to the one-sided p-value of this test statistic.
We can first turn the vector into a data frame that contains the values as x and the corresponding density as y:
dd <- with(density(test.stat),data.frame(x,y))
ggplot(data = dd, mapping = aes(x = x, y = y)) + geom_line(color="black") + layer(data = dd, mapping = aes(x=ifelse(x < tau.hat,x,tau.hat), y=y), geom = "area", geom_params=list(fill="red",alpha=.3)) + scale_y_continuous(limits = c(0,max(dd$y)), name="Density") + scale_x_continuous(name="Difference in Means") + geom_vline(aes(xintercept=tau.hat.1a), color="red", linetype="dashed") + geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-42, y2=0.015, text2="T[obs]"), color=I("red"), parse=TRUE) + geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-60, y2=0.0025, text2="p-value = 0.041"), size=3)
Several comments on the syntax passed to the
ggplot() function above:
layer()function with the
geom="area"option fills the area under the distribution corresponding to the one-sided p-value (probability of observing a value of the test statistic at least as small as the one we observe).
geom_text()function in this case by creating a data frame within the actual
ggplotto write expressions with text, such as subscripts which I use above, I pass the
Here’s another example that involves plotting the distribution of p-values under the sharp null of a constant additive treatment effect for each unit. I wanted to plot the values of the resultant test statistics on the x-axis, the p-values for the sharp null on the y-axis, and then shade the 95% Fisherian confidence interval. In this case,
tau.vector contains the values of the test statistic (difference in means).
p.vector is a corresponding vector that contains the one-sided exact Fisher p-value for the sharp null hypothesis that the effect is 0. Note that I placed a dotted horizontal red line at a p-value of 0.025 to obtain approximately 95% coverage on my confidence interval.
ggplot(data = CI.2c, mapping = aes(x = tau.vector, y = p.vector)) + geom_line(color="black") + scale_y_continuous(limits = c(0,max(CI.2c$p.vector)),name="p-value") + scale_x_continuous(name="Difference in Means") + layer(data = CI.2c, mapping = aes(x=ifelse((lb.2c < tau.vector & tau.vector < ub.2c),tau.vector,lb.2c), y=p.vector), geom = "area", geom_params=list(fill="yellow",alpha=.3)) + geom_hline(aes(yintercept=0.025), color="red", linetype="dashed") + geom_text(mapping=aes(x2,y2,label = text2), data=data.frame(x2=-100, y2=0.1, text2="95% CI: [-83,6.33]"), size=5)
Here’s a very simple histogram example. Suppose that
pval.5 contains a vector of p-values across all possible assignment vectors. We want to plot the distribution of Fisher’s exact p-values, which should be uniformly distributed between 0 and 1 under the null (and generally when dealing with continuous test statistics). We can do so simply using ggplot as:
ggplot() + geom_histogram(aes(x = pval.5), binwidth=0.1, linetype=1, alpha=.25, color="black", fill="red") + scale_y_continuous(name="Count") + scale_x_continuous(name="p-value")