Nature Human Behaviour recently published an article with the title "Underrepresented minority faculty in the USA face a double standard in promotion and tenure decisions." They also published a comment on it, which said the authors "find double standards negatively applied to scholars of colour, and especially women of colour, even after accounting for scholarly productivity." Science published a piece on the article, titled "Racial bias can taint the academic tenure process—at one particular point." So the study must have found strong evidence, right?
Their key findings involved college-level (e.g., Arts & Sciences, Business, Fine Arts) tenure and promotion committees, where Black or Hispanic candidates got more negative votes than White or Asian candidates. I'll look at the probability of getting a unanimous vote of support, which is where they found the strongest evidence of a difference. In a binary logistic regression with several control variables (university, discipline, number of external grants, time in position, and whether the decision involves tenure or promotion to full professor), the estimated effect of URM status is -.581 with a standard error of .246. That gives a t-ratio of -2.36 and a p-value of .018. If you prefer odds ratios, it's an estimate of .56 with a 95% confidence interval of .35 to .91. That's reasonably strong evidence by conventional standards.
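To make the arithmetic explicit, here's a short Python check of how that coefficient converts into the t-ratio, p-value, odds ratio, and confidence interval reported above (this just redoes the conversions from the published numbers; it's not a re-run of the model):

```python
import math
from scipy import stats

est, se = -0.581, 0.246  # URM coefficient and standard error from the model

z = est / se                                   # t-ratio: about -2.36
p = 2 * stats.norm.sf(abs(z))                  # two-sided p-value: about .018
odds_ratio = math.exp(est)                     # about .56
ci_low = math.exp(est - 1.96 * se)             # about .35
ci_high = math.exp(est + 1.96 * se)            # about .91

print(f"t = {z:.2f}, p = {p:.3f}, OR = {odds_ratio:.2f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```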
What about scholarly productivity? They calculated the "h-index," which is based on citation counts,* and standardized it to have a mean of zero and a standard deviation of one. If you add it as a variable:

          est     se   t-ratio      p
H        .375   .129     2.91    .004
URM     -.301   .299    -1.01    .315
Now the estimated effect of URM status is nowhere near statistical significance. The estimate is still large (the odds ratio is about .74 and the 95% confidence interval goes as low as .41), so the conclusion shouldn't be that there is little or no discrimination--there's just not enough information in the sample to say much either way.**
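The same conversion as before, again using only the coefficient and standard error from the table:

```python
import math

est, se = -0.301, 0.299  # URM coefficient after adding the h-index

print(round(math.exp(est), 2))              # odds ratio: about .74
print(round(math.exp(est - 1.96 * se), 2))  # lower CI bound: about .41
print(round(math.exp(est + 1.96 * se), 2))  # upper CI bound: about 1.33
```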
But the authors didn't report the previous analysis--I calculated it from the replication data. They gave an analysis including the h-index, URM status, and the interaction (product) of those variables:

          est     se   t-ratio      p
H        .330   .128     2.59    .010
URM     -.041   .337    -0.12    .903
H*URM   1.414   .492     2.87    .004
That means that H has a different effect for URM and White or Asian (WA) candidates: for WA it's .33, and for URM it's .33+1.41=1.74. The URM coefficient gives the estimated effect of URM status at an h-index of zero. At other values of the h-index, the estimated effect of URM status is -.041+1.41*H. For example, the 10th percentile of the h-index is about -1, so the estimated effect is about -1.46. The 90th percentile of the h-index is about 1, so the estimated effect of URM status is about 1.37. That is, URM faculty with below-average h-indexes have a lower chance of getting unanimous support than WA candidates with the same h-index, but URM faculty with above-average h-indexes have a higher chance. This is a "double standard" in the sense of URM and WA faculty being treated differently, but not in the sense of URM faculty consistently being treated worse.
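A quick way to see this is to evaluate the implied URM effect at a few h-index values, using the coefficients from the table above (a sketch; standard errors for these combinations would require the coefficient covariance matrix from the replication data, which I'm not using here):

```python
b_urm, b_int = -0.041, 1.414  # URM main effect and H*URM interaction

def urm_effect(h):
    """Estimated log-odds effect of URM status at standardized h-index h."""
    return b_urm + b_int * h

# Roughly the 10th, 50th, and 90th percentiles of the standardized h-index.
for h in (-1.0, 0.0, 1.0):
    print(f"h = {h:+.0f}: URM effect = {urm_effect(h):+.3f}")
# h = -1: -1.455;  h = 0: -0.041;  h = +1: +1.373
```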
The authors describe this as "differential treatment in outcomes is present for URM faculty with a below-average H-index but not for those with an average or above-average H-index." They suggest an "intriguing question for future research: do URM faculty with an above average h-index perform better than non-URM faculty with the same h-index?" But the interaction model is symmetrical--what justifies treating the estimate for low h-indexes as a finding and the estimate for high h-indexes as just an "intriguing question"? You could fit a model in which URM faculty are disadvantaged at lower levels of productivity but there is no difference at moderate or high levels of productivity (one way to specify such a model is sketched below). I've done this, and it fits worse than the standard interaction model, although there's not enough information to make a definitive choice between them.
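Here is a minimal sketch of that comparison using statsmodels, run on synthetic stand-in data (the column names, the data-generating step, and the omission of the paper's control variables are all my assumptions for illustration, not the authors' setup):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the real analysis also controls for university,
# discipline, grants, time in position, and decision type (omitted here).
rng = np.random.default_rng(0)
n = 1500
df = pd.DataFrame({"h": rng.standard_normal(n),
                   "urm": rng.binomial(1, 0.15, n)})
true_logit = 0.3 * df["h"] + 1.4 * df["urm"] * np.minimum(df["h"], 0)
df["unanimous"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Standard interaction model: the URM effect varies linearly with h.
m_interaction = smf.logit("unanimous ~ h * urm", data=df).fit(disp=False)

# Alternative: a URM effect that operates only below the mean (h < 0);
# min(h, 0) is zero for above-average h, so the URM term vanishes there.
df["h_below"] = np.minimum(df["h"], 0)
m_piecewise = smf.logit("unanimous ~ h + urm:h_below", data=df).fit(disp=False)

print(m_interaction.aic, m_piecewise.aic)  # lower AIC indicates better fit
```

The piecewise term is zero whenever the h-index is above average, so "no difference at moderate or high productivity" is built in by construction, and the two specifications can then be compared on AIC or log-likelihood.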
The result about the interaction between URM status and the h-index is interesting, but it doesn't support the claim of a general bias against URM faculty. So why is this study being hyped as strong evidence of bias? One obvious factor is that many people believe, or want to believe, that there is a lot of bias in universities, so they'll seize on anything that seems to support this claim. A second is that people get confused by interaction effects.

But I think there's a third contributing factor: Nature and Science come from a tradition of scientific reports: just give your "findings," not a lot of discussion of how you got there and things you tried along the way. Journals in sociology, political science, and economics come from a more literary tradition--leisurely essays in which people developed their ideas or reviewed previous work. This tradition continued even after quantitative research appeared: articles are longer and have more descriptive statistics and more discussion of alternative models. If this paper had been submitted to a sociology journal, I'm pretty sure some reviewer would have insisted that if you're going to show an interaction model, you also have to show the model with only the main effects of the h-index and URM status. That would have made it clear that the data doesn't provide strong evidence of discrimination. It also might have led to noticing that there's a lot of missing data for the h-index (about 30% of the cases), which is another source of uncertainty.
*The largest number h for which you have h articles with h or more citations each, as counted by Google Scholar.
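In code, that definition looks like this (the citation counts are made up for illustration):

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

print(h_index([10, 8, 5, 4, 3, 0]))  # 4: four papers with >= 4 citations
```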
**This is not the authors' fault--it's hard to collect this kind of data, and as far as I know they are the first to do so.