The intriguing results of a new study out of Jamaica have caught the attention of scholars and journalists alike. Citing data collected over 20 years, researchers released a paper last March comparing a group of toddlers from low-income families who received psychosocial stimulation to a comparable group who received no such treatment, and they found that early educational intervention increased the former’s average earnings by 42%. Moreover, the earnings of the stimulated underprivileged group caught up with those of their better-off peers, suggesting that early intervention is a key driver in reducing inequality. Children in that group were also more likely to migrate to wealthier countries, such as the U.S. and U.K., which broadened their educational and professional opportunities. In short, there are sweeping political implications here, namely: how much should a government invest in early childhood education?
Before we can talk about the politics, though, we need to look at the methodology. I’ve been asked to comment by Prof. Hal Pashler of the University of California-San Diego (among others), so I’ll use this short essay to make a few points about how the experiment was set up, and how to think about the researchers’ numerical results. As I’ll explain, I think the basic premise is plausible, but the small sample size, among other issues, makes me question whether the earnings boost is as large as the authors claim. I’ll add here that the latter part of this essay is aimed at readers who are familiar with the report in particular, and with statistics in general.
The paper — “Labor Market Returns to Early Childhood Stimulation: A 20-year Followup to an Experimental Intervention in Jamaica,” by Paul Gertler, James Heckman, Rodrigo Pinto, Arianna Zanolini, Christel Vermeersch, Susan Walker, Susan M. Chang, and Sally Grantham-McGregor — touches on a debate that is fundamentally political: How effective are educational interventions? Education is a huge sector of the economy and a huge part of our lives, and the past several decades have seen an explosion of education research. There remains a lot of controversy about what works and what doesn’t, and for which students. Long-term experiments on children’s lives are costly in time, money, and human resources. As a result, major decisions on education policy can turn on the statistical interpretation of small, idiosyncratic data sets — in this case, a study of 129 Jamaican children.
There is another debate as well, one that relates to perennial questions of nature and nurture. At the most basic level there is an ongoing disagreement between those who believe that poor people are inevitably poor, and that little can (or should) be done about it, and those who ascribe socioeconomic differences to differences in opportunity. The first group (roughly speaking, conservatives) tends to think of economic inequality as a product of human nature, and its members tend to favor accepting some social inequality rather than attempting to correct it through economic intervention. The second group (roughly speaking, liberals) recommends redistribution and other measures for directly and indirectly reducing economic inequality.
Beyond differences in values, liberals and conservatives tend to have different views on the effects of experiences on life outcomes. Liberals tend to favor “nurture” explanations, which imply that, with sufficient resources, society can reduce poverty and inequality. Conservatives lean toward “nature” explanations, under which people will end up where they end up, and they view government policies as having little effect except to throw grit into the economic engine and reduce prosperity for all. Liberals latch on to stories about amazing teachers who lead low-income children to surprising success, while conservatives groove on biologically-flavored stories such as those of identical twins, separated at birth, who meet decades later and find they dress the same way, have similar jobs, and favor the same flavor of ice cream. These links are not absolute, but they give a political tinge to what could otherwise, in theory, be a purely scientific debate.
So this brings us back to the report by Gertler and his colleagues. Their claim that early childhood stimulation dramatically raised earnings provides a big boost to the “nurture” side of the story. An interesting back-story here is that the second author on the paper is James Heckman, the Nobel Prize winner from the University of Chicago’s economics department, traditionally a stronghold of political conservatism. Amid the commentary on the blogosphere, Pashler asked whether I was bothered by the statistical challenges in this small-sample study.
I replied that the two key concerns seem to be: (1) the very small sample size (thus, unless the effect is huge, it could get lost in the noise) and (2) the correlation of the key outcome (earnings) with emigration. The authors are aware of the challenges of interpreting the result in light of the emigration factor, but I’d like to see more here. In particular, I’d suggest including graphs of the individual observations. And, as always in such settings, I’d like to see the raw comparison: what are these earnings, which, when averaged, differ by 42%? Finally, I’d like to see these data broken down by emigration status; that issue in particular worried me. Once I have a handle on the raw comparisons, I’d then like to see how they fit into the regression analyses.
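To make concrete what I mean by the “raw comparison,” here is a minimal sketch. All numbers below are invented for illustration — the sample sizes, earnings distributions, and emigration rates are placeholders, not the study’s data. The point is that a headline percent difference should be traceable to simple group averages like these, overall and split by emigration status.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data in the spirit of the study (all numbers invented):
# annual earnings for treated and control children, plus an emigration flag.
n = 64  # roughly half of the 129 children per arm
treat = np.exp(rng.normal(9.3, 0.9, n))   # simulated earnings, treated group
ctrl = np.exp(rng.normal(9.0, 0.9, n))    # simulated earnings, control group
emig_t = rng.random(n) < 0.3              # simulated emigration indicator, treated
emig_c = rng.random(n) < 0.2              # simulated emigration indicator, control

def raw_comparison(t, c):
    """Percent difference in mean earnings, treated relative to control."""
    return 100 * (t.mean() - c.mean()) / c.mean()

print(f"overall:   {raw_comparison(treat, ctrl):+.0f}%")
print(f"stayers:   {raw_comparison(treat[~emig_t], ctrl[~emig_c]):+.0f}%")
print(f"emigrants: {raw_comparison(treat[emig_t], ctrl[emig_c]):+.0f}%")
```

In the real analysis one would also plot each child’s earnings by group; with 129 observations, a single dotplot can show the entire data set.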
The published analysis generally seems thoughtful and reasonable, as one would expect given that Heckman is a coauthor. But I’m inclined to simply ignore the section on the permutation tests, because my substantive interest is in the size of the effect (and the size of interactions), not in the test of whether the effect is exactly zero. I wouldn’t have thought of multiple comparisons as being too much of a problem — after all, earnings are the usual outcome that economists look at in such studies — so I wasn’t quite sure why the authors devoted a section (4.1.4) to discussing how to account for multiple comparisons.
Furthermore, I didn’t really look at the comparison with the nonrandomized group. That analysis might be just fine, but it’s the randomized analysis that is the headline result, so that’s what I focused on.
Overall, I have no reason to doubt the direction of the effect, namely, that psychosocial stimulation should be good. But I’m skeptical of the claim that income differed by 42%, because of the statistical significance filter. In section 2.3, the authors do a lot of hypothesizing based on some comparisons being statistically significant and others being non-significant. There’s nothing wrong with speculation, but at some point you’re chasing noise and picking winners, which leads to overestimates of the magnitudes of effects.
So those are my thoughts. My goal here is not to “debunk” but to understand and quantify. This story relates to a debate within statistical methodology regarding the interpretation of statistically significant findings from small studies. Traditionally, once a result reaches the 5% significance level in a randomized experiment, it is taken as a (provisional) scientific fact. Yes, everyone knows that these experiments are not perfect: participants drop out of the study, measurements can be systematically in error, multiple hypothesis testing affects the meaning of p-values, and individual studies have their own peculiarities, such as different rates of migration in treatment and control groups (as in this particular example). Still, such problems are typically taken to just slightly weaken the confidence with which we can believe statistically significant findings.
More recently, though, a group of researchers in medicine and psychology — including John Ioannidis, Uri Simonsohn, Brian Nosek, and others — have expressed concern that statistically significant published results routinely overestimate the sizes of effects, often representing nothing more than patterns in noise. I have found these arguments convincing enough that whenever I see a published claim, I tend to think it is an overestimate.
It’s just the nature of scientific reporting: the estimates near zero remain unpublished or get adjusted higher (based on decisions arising from reasonable scientific judgments), while the high estimates stand. I have every reason to think the effect of childhood stimulation is positive for most children — I would have been inclined to believe this already, and this study provides further evidence in that direction — but I can only assume that the 42% number is an overestimate.
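The significance filter is easy to demonstrate by simulation. The numbers below are assumptions chosen for illustration, not estimates from the paper: a modest true effect, a study of roughly the Jamaica sample’s size, and a 5% two-sided significance threshold. Averaging only the estimates that clear the threshold systematically overshoots the truth.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.2   # assumed true effect, in standard-deviation units
n = 60              # per-group sample size, roughly the scale of the study
sims = 10_000       # number of simulated small experiments

ests, sig = [], []
for _ in range(sims):
    t = rng.normal(true_effect, 1, n)          # treated outcomes
    c = rng.normal(0, 1, n)                    # control outcomes
    diff = t.mean() - c.mean()                 # estimated effect
    se = np.sqrt(t.var(ddof=1) / n + c.var(ddof=1) / n)
    ests.append(diff)
    sig.append(abs(diff / se) > 1.96)          # "statistically significant"?

ests, sig = np.array(ests), np.array(sig)
print(f"true effect:                 {true_effect:.2f}")
print(f"mean estimate, all studies:  {ests.mean():.2f}")   # roughly unbiased
print(f"mean estimate, significant:  {ests[sig].mean():.2f}")  # inflated
```

The estimator itself is unbiased; the bias comes entirely from conditioning on significance, which is exactly the worry with a 42% estimate emerging from a study of 129 children.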
Where does that leave us, then? If we can’t really trust the headline number from a longitudinal randomized experiment, what can we do? We certainly can’t turn around and gather data on a few thousand more children; even if we did, we’d have to wait another 20 years for the results. What can we say right now?
My unsatisfactory answer: I’m not sure. The challenge is that earnings are highly variable. We could look at the subset of participants who did not emigrate, or, if there is a concern that the treatment could affect emigration, we could perform an analysis such as principal stratification, which matches approximately equivalent children in the two groups to estimate the effect among the children who would not have emigrated under either condition. Given that there were four groups, I’d also run some alternative analyses rather than simply pooling multiple conditions, as was done in the article. But I’m still a little bit stuck. On one hand, given the large variability in earnings, it’s going to be difficult to learn much from this sort of small-sample between-person study. On the other hand, there aren’t a lot of good experimental studies out there, so it does seem like this one should inform policy in some way. In short, we need to keep thinking of ways to extract the useful information from this study in a larger policy context.
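A back-of-the-envelope calculation shows just how much the variability in earnings limits what a study of this size can resolve. The standard deviation of log earnings used below (0.9) is an assumed round number, not a figure from the paper; the calculation asks how wide a 95% interval for the earnings ratio would be even if the true effect were exactly zero.

```python
import numpy as np

# Assumed inputs (illustrative, not from the paper):
sd_log = 0.9        # standard deviation of log earnings within each group
n_per_group = 64    # roughly half of 129 children per arm

# Standard error of the difference in mean log earnings between two groups,
# then the corresponding 95% interval for the earnings *ratio* under a
# true effect of zero.
se = sd_log * np.sqrt(2 / n_per_group)
lo, hi = np.exp(-1.96 * se), np.exp(1.96 * se)

print(f"95% interval for the earnings ratio under zero effect: "
      f"{100 * (lo - 1):+.0f}% to {100 * (hi - 1):+.0f}%")
```

Under these assumptions, pure noise can routinely produce raw differences of tens of percent, so a 42% point estimate sits uncomfortably close to what the study could detect at all — which is why I’d want the raw comparisons and the breakdowns before taking the magnitude at face value.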