## Wednesday, March 23, 2016

### Five imaginary surveys

Carl Morris offered three examples of poll results (see this post by Andrew Gelman for a link and discussion). The numbers are the counts of respondents who say they favor the candidate from Party A in a two-person race (no one is undecided). He adds that "Party A candidates have always gotten between 40% and 60% of the vote and have won about half of the elections."

15 out of 20 (75% for A)
115 out of 200 (57.5%)
1,046 out of 2,000 (52.3%)

The one-sided p-value for the hypothesis that exactly half support candidate A is about .021 for each example. But Morris argues that they provide different levels of evidence for the proposition that A is ahead: strongest for 1,046 out of 2,000, then 115 out of 200, with 15 out of 20 giving the weakest evidence. For the explanation, read the original paper, but basically his point is that, given the experience of other elections, you should compare the pessimistic hypothesis ("I have just under 50%") to an alternative hypothesis that's consistent with that experience, like "I have more than 50% but less than 60%."
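As a rough check on the p-values quoted above, here is a minimal sketch that computes the exact one-sided binomial tail probability for each of the three polls. The helper name `binom_upper_tail` is made up for this illustration, and the choice of an exact one-sided test is an assumption about how the figures were obtained; a normal approximation would give slightly different third decimals.

```python
import math

def binom_upper_tail(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    log_p, log_q = math.log(p), math.log(1 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return total

# The three polls from Morris's example: (supporters of A, sample size).
polls = [(15, 20), (115, 200), (1046, 2000)]
for k, n in polls:
    print(f"{k:>5} of {n:>5}: share = {k / n:.3f}, one-sided p = {binom_upper_tail(k, n):.4f}")
```

All three tail probabilities should land close to .02 even though the observed shares range from 52.3% to 75%, which is the point of the example.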

What if the poll showed support from 10,145 out of 20,000 (50.7%)? The p-value would again be about .021, and Morris's approach would show even stronger evidence for the proposition that A was ahead than in the 1,046 out of 2,000 example. However, a candidate might reasonably find it less encouraging than 1,046 out of 2,000, and describe the results as indicating that the election was "too close to call." Morris's analysis and the p-value are both based on the assumption that you had a random sample. But in an election poll, you know that's not quite true: even if you contact a random sample (which is difficult), non-response is almost certainly not completely random. So in addition to sampling error, there is some extra uncertainty. It's hard to say exactly how much, but it's safe to say that it's at least 1-2 percentage points, even for high-profile races.
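To make the "too close to call" intuition concrete, here is a small sketch under a deliberately crude assumption: non-response shifts the poll's expected share for A by a fixed amount (`bias`, a hypothetical parameter) even when true support is exactly 50%. The code then asks how surprising each result would be in that case, using a normal approximation with continuity correction. This is only an illustration of the argument, not a model of real non-response.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_normal(k, n, p):
    """Approximate P(X >= k) for X ~ Binomial(n, p), with continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return 1 - NormalDist().cdf((k - 0.5 - mean) / sd)

polls = [(1046, 2000), (10145, 20000)]

# Hypothetical shifts in the poll's expected share when true support is exactly 50%.
for bias in (0.0, 0.01, 0.02):
    for k, n in polls:
        p_value = upper_tail_normal(k, n, 0.5 + bias)
        print(f"bias = {bias:.2f}, {k:>6} of {n:>6}: one-sided p = {p_value:.3f}")
```

With no bias the two polls look about equally surprising under a tied race. Once a shift of a point or two is allowed, the 50.7% result stops being surprising at all, while the 52.3% result still leans toward A, which matches the candidate's instinct.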

What if the poll showed support from 27 out of 30? The p-value is about .000004, and with a prior distribution like that used by Morris, the posterior probability that candidate A is ahead is very near one. That is, the two approaches agree that this provides stronger evidence than any of the other examples. But I think that a reasonable candidate would suspect that there was something wrong with the poll: that there was some kind of mistake or deception.
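The two calculations for this example can be sketched as follows. The p-value is an exact binomial tail. The posterior uses a grid approximation with a normal prior centered at 50% with a standard deviation of 5 points; that prior is an assumption chosen here to match the "between 40% and 60%" history, and may not be exactly the one Morris used. The function names are likewise made up for the illustration.

```python
import math

def exact_p_value(k, n):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, 0.5), using integer arithmetic."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n

def posterior_prob_ahead(k, n, prior_mean=0.5, prior_sd=0.05, grid=10_000):
    """Grid approximation to P(theta > 0.5 | k of n) under an assumed normal prior."""
    thetas = [(i + 0.5) / grid for i in range(grid)]  # midpoints, never exactly 0.5
    log_w = [k * math.log(t) + (n - k) * math.log(1 - t)
             - 0.5 * ((t - prior_mean) / prior_sd) ** 2
             for t in thetas]
    top = max(log_w)  # subtract the maximum before exponentiating to avoid underflow
    w = [math.exp(x - top) for x in log_w]
    return sum(wi for t, wi in zip(thetas, w) if t > 0.5) / sum(w)

print(f"27 of 30: one-sided p = {exact_p_value(27, 30):.2e}")
for k, n in [(15, 20), (115, 200), (1046, 2000), (10145, 20000), (27, 30)]:
    print(f"{k:>6} of {n:>6}: P(A ahead | data) = {posterior_prob_ahead(k, n):.3f}")
```

Neither calculation has any way to flag the 27 out of 30 result as suspicious, since both take the data at face value. The suspicion comes from outside the model.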

This is not to say that there's any mistake in Morris's analysis, just that things get more complicated as you get closer to the problems of interpreting actual results. These examples are also relevant to the situation faced by someone asking "does x affect y, after controlling for other relevant factors?" (e.g., does income affect chances of voting for Donald Trump?). You could divide the range of parameter estimates into four groups:
a.  too small to be of interest
b.  "normal" size
c.  surprisingly large
d.  ridiculously large

People often characterize (a) in terms of "substantive significance," but it can also be seen as parallel to the uncertainty in even well-conducted surveys. In an observational study, the specification of "other relevant factors" is almost certainly wrong or incomplete, so if you have a very small parameter estimate, it's reasonable to suspect that it would be zero or the opposite sign under some reasonable alternative specifications. In effect, it's "too close to call." The second, (b), is the common situation in which the variable makes some difference, but not all that much; often it's one of a large number of factors. Establishing something like that may be an advance in knowledge, but usually isn't very exciting. A sufficiently large value (c) is different: it suggests we may have to fundamentally change the way we think about something (as I recall, people said things like that about the LaCour "study" of personal influence). Then there's (d), which could be a result of mistakes in recording the data, or miscoding, or some gross error in model specification (politeness prevents me from offering examples). The problem is that the values for (c) and (d) overlap, just as the 27 out of 30 example could indicate that the candidate is going to win by a historically unprecedented margin, or merely that there was some kind of mistake.