A blog about evolving humans and evolving psychological methods
Long ago and far away, in Chicago, in 2006, I submitted one of my first papers as a graduate student. The topic was controversial, and so we were not particularly surprised, when the reviews came back, to see that the reviewers were skeptical of the conclusions we drew from our findings. They wanted more (as JPSP reviewers often do). They thought maybe we had overlooked a moderator or two…in fact, they could think of a whole laundry list of moderators that might produce the effect they thought we should have found in our data. So we ran 1,497 additional tests.
No, seriously. We counted. 1,497 post-hoc analyses to make sure that we hadn’t somehow overlooked the tests that would support Perspective X. We conducted them all and described them in the article (but there was still no systematic evidence for Perspective X).
If your work involves controversy, you’ve probably experienced something like this. It’s been standard operating procedure, at least in some areas of psychology.
Now, fast forward to today. I’m about to launch a new study in the same controversial topic area, and it’s likely that we’ll get results that someone doesn’t like, one way or another. But this time, before we start conducting the study, we write up an analysis plan and submit it to Comprehensive Results in Social Psychology (CRSP), which specializes in registered reports. The analysis plan goes out for review, and reviewers—who have the luxury of not knowing whether the data will support Perspective X or Y or Z—thoughtfully recommend a small handful of additional analyses that could shed better light on the research question.
The analysis plan that emerges is one that everyone agrees should offer the best test of the hypotheses; importantly, the tests will be meaningful however they turn out. We run the study and report the tests. We submit the paper.
And then, instead of getting a decision letter back with 1,497 suggestions for additional analyses that someone thought would surely show support for Perspective X…the paper is simply published. The data get to stand as they are, with no poking and prodding to try to make them say something else.
There’s a lot to like about this brave new world.
Our new paper in CRSP addresses whether attractiveness (as depicted in photographs of opposite-sex partners) is more appealing to men than to women. I, like most other evolutionary psychologists, had always assumed that the answer to this question was “yes.”
But you know what? Those prior studies finding that sex difference in photograph contexts? Most of them were badly underpowered by today’s standards. Our CRSP paper was powered to detect a sex difference of q = .10 (i.e., a small effect) or larger, using a sample of N = ~1,200 participants and ~600 photographs. These photographs came from the Chicago Face Database, and we used the database’s attractiveness ratings for each face (based on a sample of independent raters).
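For a rough sense of what "powered to detect q = .10" demands, here is a back-of-the-envelope sketch. It is not the power analysis from the paper. It assumes the simplest textbook setup: two independent groups (men and women), one correlation per group, compared via Fisher's z, where q is the difference between the two z-transformed correlations. The function name and the defaults (two-sided alpha = .05, 80% power) are illustrative choices, and the paper's crossed participants-by-photographs design would not follow this formula.

```python
import math
from statistics import NormalDist


def n_per_group_for_q(q, alpha=0.05, power=0.80):
    """Approximate per-group n needed to detect Cohen's q.

    Assumes two independent groups of equal size and a two-sided
    z-test on the difference between Fisher z-transformed correlations.
    The standard error of that difference is sqrt(2 / (n - 3)).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    n = 3 + 2 * ((z_alpha + z_power) / q) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical illustration: per-group n for a range of effect sizes.
    for q in (0.10, 0.30, 0.50):
        print(f"q = {q:.2f}: about {n_per_group_for_q(q)} per group")
```

Under these simplified assumptions, q = .10 calls for roughly 1,570 people per group, which gives a feel for why a typical single study of a few dozen or a few hundred participants would have little chance of detecting an effect that small.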
The paper has two take-home lessons that are relevant to the broader discussion of best practices:
1. Even though prior studies of this sex difference were underpowered, the sex difference was there in our new study: r(Men) = .41, r(Women) = .28, q = .13, 95% CI (.08, .18). There is no chance that the prior studies were powered to find a sex difference as small as what we found. But it was hiding in there, nevertheless.
Lesson #1: Perhaps weakly powered studies in the published literature can still manage to converge on truth. At least, perhaps this happens in cases where the presence or absence of p < .05 is/was not a hard criterion for publication. Sex differences might be one such example. (Still no substitute for a high-powered, direct test, of course.)
2. In this literature, scholars have posited many moderators in an attempt to explain why some studies show sex differences and some do not. For example, sex differences in the appeal of attractiveness are supposed to be bigger when people imagine a serious relationship, or when people evaluate potential partners in the low-to-moderate range of attractiveness. Sometimes, sex differences are only supposed to emerge when 2 or 3 or 4 moderators combine, like the Moderator Avengers or something. That wasn’t the case here: These purported moderators did not alter the size of the sex difference in the predicted manner, whether alone or in Avenger-mode combination.
Lesson #2: Perhaps we should be extremely skeptical of moderators that are hypothesized, frequently post hoc, to explain why Study X shows a significant finding but Study Y does not. Moderators within study? I’m on board. Moderators across studies? I’ll believe it when I see it meta-analytically.
For every single research question I dream up going forward, I will consider whether it could be a good candidate for a registered report. When I think about an idealized, all-caps form of SCIENCE that stays untethered from prior perspectives or ideology, that CRSP experience pretty much captures it. 
This statement may shock some who think of me as some sort of sex-differences naysayer. Rather, my perspective is that this sex difference is larger in photograph contexts than live face-to-face contexts. Indeed, q = .13 is about 2-4 times larger than meta-analytic estimates of the same sex difference in initial attraction contexts or established close relationships (which are q = .05 or smaller). (Does it make me a naysayer to suggest that the sex differences here are extremely small, and that prior single studies are unlikely to have been powered to detect them?)
And did I mention fast? This project went from “vague idea” to “in press” in less than 11 months. My prior best time for an empirical piece was probably twice as long.