The core argument of the case is about the chance of "false positives." The great majority of the hypotheses proposed in the social sciences are of the form "x is associated with y" (controlling for other factors relevant to y). If the observed data would be unlikely under the "null hypothesis" that "x is *not* associated with y" (controlling for other factors), you count it as support for the hypothesis that "x is associated."

Suppose that for every ten proposed hypotheses that are true, there are 100 that are false. Using a .05 level means that we can expect a statistically significant association for five of the false ones. Suppose a statistically significant association is found for 80% of the true hypotheses, which is the power that people usually aim for in designing experiments; then 5 out of 13, or almost 40%, of the statistically significant associations will represent false hypotheses. The authors' proposal is that researchers should change the standard of statistical significance to .005 and continue to aim for 80% power (which would mean bigger experiments). With that standard, there would still be 8 statistically significant associations that represent true hypotheses, but only 0.5 (6% of the total) that are spurious.
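The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function name `spurious_share` and its parameter names are made up for illustration and do not come from the paper.

```python
def spurious_share(n_true, n_false, alpha, power):
    """Share of statistically significant results that come from false hypotheses.

    n_true, n_false: number of proposed hypotheses that are true / false
    alpha: significance level (chance a false hypothesis reaches significance)
    power: chance a true hypothesis reaches significance
    """
    true_hits = power * n_true    # true hypotheses that reach significance
    false_hits = alpha * n_false  # false hypotheses that reach significance by chance
    return false_hits / (true_hits + false_hits)


# The scenario from the text: 10 true hypotheses for every 100 false ones, 80% power.
share_05 = spurious_share(10, 100, alpha=0.05, power=0.80)
share_005 = spurious_share(10, 100, alpha=0.005, power=0.80)

print(f"alpha = .05:  {share_05:.1%} of significant results are spurious")
print(f"alpha = .005: {share_005:.1%} of significant results are spurious")
```

The two shares work out to 5/13 (about 38.5%) and 0.5/8.5 (about 5.9%), matching the "almost 40%" and "6%" figures above.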

The ratio of true to false proposed hypotheses is crucial here. If it's 1:1, then with 80% power and a 5% significance level, only about 6% of the statistically significant associations are spurious. The authors offer some evidence that the ratio is about 1:10 for psychology experiments, and say that a "similar number has been suggested in cancer clinical trials, and the number is likely to be much lower in biomedical research." They also address the possible objection that the "threshold for statistical significance should be different for different research communities." They say that they agree, and note that genetics and high-energy physics have gone for a higher standard (a t-ratio of about 5), but they don't even address the possibility that a lower standard might be appropriate. That is, they seem to take a 10:1 ratio of false to true hypotheses as the minimum, and recommend the .005 standard as a baseline suitable to all fields. They return to this point in the concluding remarks, where they say that since the .05 level was established "a much larger pool of scientists are now asking a much larger number of questions, possibly with much lower prior odds of success."

This isn't convincing to me. In the papers I read (published or for review), most of the hypotheses about relations between variables seem pretty plausible. Even if I don't find the reasoning that leads to the prediction convincing, and often I don't, it's not hard to think of an alternative argument (or several arguments) that leads to the same prediction. The idea that more scientists asking more questions means lower prior odds of success isn't compelling either. In some fields, theory has developed, and that should let you make reasonable predictions on more questions. In others, there's at least more evidence, meaning more examples to draw on in making predictions. So I doubt there is a tendency for the prior odds in favor of proposed hypotheses to decline.
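To see how much the result depends on the assumed ratio of false to true hypotheses, the same calculation can be repeated across a range of ratios. This is a rough sketch: only the 1:1 and 10:1 ratios come from the discussion above, the other values are arbitrary illustrations, and the function name `spurious_share` is hypothetical.

```python
def spurious_share(false_per_true, alpha, power=0.80):
    """Share of significant results that are spurious, per one true hypothesis."""
    false_hits = alpha * false_per_true  # expected false positives per true hypothesis
    true_hits = power                    # expected true positives per true hypothesis
    return false_hits / (false_hits + true_hits)


# False hypotheses per true one. Only 1 and 10 appear in the text;
# 2, 5 and 20 are illustrative assumptions.
ratios = [1, 2, 5, 10, 20]

rows = [(r, spurious_share(r, 0.05), spurious_share(r, 0.005)) for r in ratios]

print("false:true    .05    .005")
for r, s05, s005 in rows:
    print(f"{r:>7}:1  {s05:6.1%}  {s005:6.1%}")
```

With 80% power, a 1:1 ratio at the .05 level gives the same spurious share as a 10:1 ratio at the .005 level, so the case for the stricter threshold rests on how pessimistic the assumed prior ratio is.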

If they were just making a suggestion about how to interpret the .05 significance level, I would not object, and in fact would generally agree (see my book Hypothesis Testing and Model Selection in the Social Sciences). But realistically, a "default" of .005 would mean it would become difficult to publish work in which the key parameter estimates were not statistically significant at that level, just as it's now difficult to publish work in which the key parameter estimates aren't significant at the .05 level.* That would be a loss, not a gain, especially with non-experimental data, where a bigger sample is usually not an option.

*They say results that didn't reach .005 "can still be important and merit publication in leading journals if they address important research questions with rigorous methods," but I'm confident that the great majority of reviewers and editors would say that about the .05 level today. Importance and rigor are matters of judgment, so there's usually disagreement among reviewers; the "default" level of significance is objective, so it takes on outsize importance.
