Thursday, April 18, 2024

The problem is you?, part 1

 The Atlantic recently published a critical review of the new book by Tom Schaller and Paul Waldman, White Rural Rage: the Threat to American Democracy. The review, by Tyler Austin Harper, concluded by saying that they were not just wrong, but had it backwards--the threat is from the cities and suburbs:  

"Schaller and Waldman are right: There are real threats to American democracy, and we should be worried about political violence. But by erroneously pinning the blame on white rural Americans, they’ve distracted the public from the real danger. The threat we must contend with today is not white rural rage, but white urban and suburban rage.

Instead of reckoning with the ugly fact that a threat to our democracy is emerging from right-wing extremists in suburban and urban areas, the authors of White Rural Rage contorted studies and called unambiguously metro areas 'rural' so that they could tell an all-too-familiar story about scary hillbillies. Perhaps this was easier than confronting the truth: that the call is coming from inside the house. It is not primarily the rural poor, but often successful, white metropolitan men who imperil our republic."

The report that Harper links to says:  "the more rural a county, the lower its rate of sending insurrectionists, a finding which is significant with a p-value <.01%."  A just-published paper by Robert A. Pape, Kyle D. Larson, and Keven G. Ruby in PS: Political Science and Politics gives a more detailed analysis.  The results are from a negative binomial regression in which the dependent variable is the number of people from a county who were charged with crimes related to the January 6 attack on the Capitol.  The number is estimated to be 2.88 times as large in urban as in rural counties, controlling for other factors.

Of course, the population of the county is one of the other factors.  But a negative binomial regression predicts the logarithm of the dependent variable, and their control is population itself (in 100,000s).  The estimated coefficient for population is .148, meaning that the natural log of the predicted number of insurrectionists goes up by .148 for every 100,000 increase in county population.  If the natural log of the predicted number goes up by .148, the predicted number goes up by about 16%.*  If you're starting from a population of 1,000, an increase of 100,000 means that population goes up by a factor of about 100; if you're starting from a population of 1,000,000, it's 10%; if you're starting from a population of 5,000,000, it's only 2%.  So the model controlling for population builds in a relationship between county population and the chance that a person will be an insurrectionist:  declining and then increasing.  The figure shows the nature and size of the relationship using their estimate:
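The arithmetic in this paragraph can be checked with a short sketch. The coefficient .148 is the one quoted above; the function names are made up for illustration, and the calculation holds every other variable in the model fixed.

```python
import math

COEF_PER_100K = 0.148  # coefficient on population (in 100,000s), as quoted in the post


def count_multiplier(added_people, coef=COEF_PER_100K):
    """Factor by which the predicted count changes when population rises by added_people."""
    return math.exp(coef * added_people / 100_000)


def population_growth_factor(start_population, added_people=100_000):
    """Factor by which the population itself changes for the same increase."""
    return (start_population + added_people) / start_population


# The predicted count is multiplied by the same amount for every extra 100,000 people,
# while the proportional change in population shrinks as the county gets bigger.
multiplier = count_multiplier(100_000)
for start in (1_000, 1_000_000, 5_000_000):
    growth = population_growth_factor(start)
    print(f"start={start:>9,d}  population x{growth:.2f}  predicted count x{multiplier:.3f}")
```

Because the multiplier on the predicted count is constant while the proportional population change is not, the implied per-person rate cannot be constant across county sizes.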


The number 1 on the y-axis represents the rate in a county of average size (about 100,000).  In a county with population of 10,000, the rate is about 8.5; in a county with 500,000, it's about .4, and in one of 5,000,000, it's about 80.  The biggest county in the United States (Los Angeles) has a population of about 10,000,000, but I don't extend the x-axis that far because it would make the figure too hard to read.   Of course, there is no reason to expect that there really is a relationship of this form.
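A minimal sketch of the curve described here, assuming the predicted count is proportional to exp(coef × population/100,000) with all other variables held fixed, so that the implied per-person rate is that count divided by population. The default coefficient is the .148 from this post and is only an illustration; a different estimate changes the magnitudes, especially for the largest counties, but not the U shape. The function names are hypothetical.

```python
import math


def relative_rate(population, coef=0.148, reference_population=100_000):
    """Implied per-person rate relative to a county of reference_population."""
    p = population / 100_000
    p_ref = reference_population / 100_000
    implied = math.exp(coef * p) / p
    reference = math.exp(coef * p_ref) / p_ref
    return implied / reference


def population_at_minimum(coef=0.148):
    """Population where exp(coef * p) / p is smallest (derivative is zero at p = 1 / coef)."""
    return 100_000 / coef


for pop in (10_000, 100_000, 500_000, 5_000_000):
    print(f"{pop:>9,d}  relative rate {relative_rate(pop):8.2f}")
print(f"rate is lowest near a population of {population_at_minimum():,.0f}")
```

The rate falls until the population reaches 100,000 divided by the coefficient and rises after that, which is the "declining and then increasing" pattern the specification builds in.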

A straightforward alternative would be to model the rate--number of insurrectionists (x) divided by county population (n).  But log(x/n)=log(x)-log(n), so you could express that by a regression with log(x) as the dependent variable and log(n) as one of the predictors.  Then a coefficient of 1.0 on log(n) would mean that the rate was the same across different county populations; a coefficient of less than one would mean it was higher in counties with smaller populations and a coefficient of greater than 1.0 would mean it was higher in counties with larger populations.  
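Written out, the argument looks like this. The symbols are assumptions introduced for the sketch: γ is the coefficient on log(n), z stands for the other predictors with coefficients β, and E[x] is the expected count, which is what a negative binomial regression models on the log scale.

```latex
\begin{align*}
\log\!\left(\frac{x}{n}\right) &= \log x - \log n \\
\log E[x] &= \beta_0 + \gamma \log n + z'\beta \\
\frac{E[x]}{n} &= n^{\gamma - 1}\,\exp\!\left(\beta_0 + z'\beta\right)
\end{align*}
```

With γ = 1 the factor n^{γ-1} equals 1 and the rate does not depend on population; with γ < 1 the rate falls as n grows, and with γ > 1 it rises.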

What happens if you use log(population) rather than population as a control variable?

                                      Population    Log(population)

% white population decline             .111***       .035
                                      (.019)        (.020)

manufacturing employment decline       .011         -.006
                                      (.0054)       (.006)

extra Trump %                         -.039***       .003
                                      (.0081)       (.0082)

% non-Hispanic white                   .009***       .014***
                                      (.0033)       (.003)

Metro county                          1.095***       .326*
                                      (.1335)       (.135)

Distance to DC                        -.304***      -.210***
                                      (.0623)       (.051)

(log) population                       .148***       .999***
                                      (.0210)       (.056)

The fit of the model with the logarithm of population as the control is better.  Several of the estimates for the other variables change substantially.  The estimate for metro counties is still statistically significant, but not overwhelmingly so (p=.019), and is much smaller than when using population.  So I don't think that the evidence justifies sweeping condemnation of urban and suburban men.
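To see what "much smaller" means on the scale of counts, the metro coefficients can be exponentiated into rate ratios. This is a rough sketch using the coefficients and standard errors as printed in the table above; the interval is a normal-approximation (Wald) interval, which is an assumption of the sketch rather than something taken from the regression output.

```python
import math

# Metro county coefficient and standard error under each population specification,
# copied from the table in this post.
METRO = {
    "population": (1.095, 0.1335),
    "log(population)": (0.326, 0.135),
}


def rate_ratio(coef):
    """Multiplicative effect on the predicted count for metro versus non-metro counties."""
    return math.exp(coef)


def wald_interval(coef, se, z=1.96):
    """Approximate 95% interval for the rate ratio, assuming a normal sampling distribution."""
    return math.exp(coef - z * se), math.exp(coef + z * se)


for control, (coef, se) in METRO.items():
    low, high = wald_interval(coef, se)
    print(f"{control:<16} ratio {rate_ratio(coef):.2f}  (about {low:.2f} to {high:.2f})")
```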

I have experimented with other specifications of the model, but this is enough for one post.  

*My figures are from my analyses using their replication data file, which are slightly different from the numbers implied by their tables.  
  


6 comments:

  1. Thanks for this interesting post. It is a subtle but important error you found there.

    I have some difficulty understanding the model, specifically how to interpret the coefficient for Metro County in a model that also controls for the log of population. My guess is that the two variables are highly correlated (I am not from the US). I assume Metro County is a yes/no variable (I did not find how Metro County is defined in your linked paper). Then a coefficient of .33 means: comparing counties with the same log(pop), Metro counties sent about 40% more insurrectionists. But is it really possible to vary one while fixing the other?

    Furthermore, should there not be an interaction? With the usual ceteris paribus assumption in a model, this coefficient is the same for all levels of log(pop). But does this make sense? Comparing very small counties (population-wise), a Metro county sends 40% more. Comparing very big counties, a Metro county also sends 40% more. But does Metro (yes/no) mean the same for big counties as for small counties? Furthermore, Metro seems related to population. So for very large populations I guess all counties are Metro counties, whereas for small populations no county is. Is that not a problem? What do you think about including population density in the model? Or maybe something like "share of population within county living in Metro areas"?

    Best

  2. It makes more sense when you think of it as predicting the (log of) the proportion who participate in insurrection. Then you can say that the rate is higher for urban counties. The urban variable is 0/1, but the data also had a 6-category variable that could be regarded as degree of urbanism. I didn't discuss that because it didn't change any conclusions. My main point was that the metro county estimate is much smaller when you change the population specification, but I suspect that even the one reported here is exaggerated--in principle, it doesn't seem likely to make much difference. I may return to that in a future post.

  3. Thanks, I understand this. However, I still have difficulties. Let me rephrase my problem: "Urban" is closely related to population density. A county is urban if it is dense and rural if it is not. Let's use density as a predictor instead of urban; I think my problem becomes clearer then. Density is population / area, so the model has two predictors: population / area and population. Obviously population and density are correlated. Furthermore, if density goes up while population stays constant, then area has to decrease. I think it would be cleaner to use area and population as predictors in this case, or maybe to omit population and use only density.

    I have an objection about the paper you cite as well.
    On page 6 the authors write: "The negative binomial is preferable to Poisson regression when the variance of the dependent variable is greater than the mean (Hilbe 2007), as it was in our case (i.e., variance=1.17, mean=0.30)."

    This does not make sense. The authors are looking at the marginal distribution of the outcome. But the assumption of a Poisson model is that the conditional expectation E(Xb) equals the conditional variance V(Xb). The unconditional expectation and variance have nothing to do with it.
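    One way to make the marginal-versus-conditional distinction concrete is the law of total variance. The sketch below assumes Y is Poisson given the predictors X, with conditional mean μ(X) = E(Y | X); μ(X) is notation introduced here for illustration.

    ```latex
    \begin{align*}
    \operatorname{Var}(Y) &= E\big[\operatorname{Var}(Y \mid X)\big] + \operatorname{Var}\big(E[Y \mid X]\big) \\
                          &= E\big[\mu(X)\big] + \operatorname{Var}\big(\mu(X)\big) \\
                          &= E[Y] + \operatorname{Var}\big(\mu(X)\big) \;\ge\; E[Y]
    \end{align*}
    ```

    So whenever the predictors shift the conditional mean at all, the marginal variance exceeds the marginal mean even if the conditional Poisson assumption holds exactly.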

    I think negative binomial has some real disadvantages compared to poisson, but I forgot which. I will check.

  4. The more I think about the cited study as well as your model, the more dubious everything appears to me. I have several issues, some of them statistical, some of them theoretical. Let's start with statistical issues.

    The first is the dependent variable we care about. You want to estimate the association between degree of urbanity and the rate of insurrectionists that counties sent to Washington, r = x/n. But what exactly is n? I suggest it is not the population of the counties. In the cited paper the authors write that of the insurrectionists 93% were white and 85% male. So n should, for now, be the number of white males. Further, I think only people who voted for Trump went to Washington. So even better, your outcome should be r = x / n, where n is the number of white males who voted for Trump. Then you can simply add density as a predictor; there is no need to include the population, and it is not of interest. Furthermore, modeling the rate in the way I suggest removes a second issue with your (and their) model: the interpretation of the coefficient of the share of non-Hispanic whites. I think what is of interest is not the association between the number of insurrectionists and the share of whites but between the rate and the share of whites. Assume you model the rate and the coefficient of the share is negative. That means that where more whites live, relatively fewer were sent.

    There is another problem with your outcome. You write that log(x/n) = log(x) - log(n), which is true in a deterministic setting. However, x is a random variable. Let's say x = zb + e, where z are covariates, b are the true parameters, and e is a random component. If you take the log of x, that implies you take the log of e as well. So you assume the errors are lognormally distributed, since you fit a linear model. I don't know if this assumption is justified. I think it is better to fit a Poisson model to the rate (Poisson works with non-integer outcomes; only non-negativity is required).

    That brings me to a more fundamental issue: the outcome itself. 85% of the counties did not send a single person, and only about 100 counties sent more than 2. So you try to estimate parameters with very little variation in the outcome. Don't you think this is problematic? Whether a county sent 1 or 0 does not seem meaningful at all. Maybe people in other counties wanted to go, but fell ill? I think the number that did go might be completely different if we hypothetically repeated the situation. Furthermore, if more than 3 went, this might be a group, which further adds randomness. This brings me to my last issue:

    In short, I think it would be much better to look at the people who gathered in Washington that day, and not at the ones who actually stormed the Capitol. I think storming the Capitol involves not only a strong, and potentially dangerous, support of Trump (which you try to estimate with county characteristics) but also some criminal element, or at least some comfort with confronting the police. So maybe counties sent more (who did the storming) where white male Trump supporters are more likely to be criminals? I know this is a stretch, but I think results could look very different if there were data on all the people who gathered there.

    It would be great to hear your opinion on this! I am working on a modeling problem that shares some characteristics with yours, and I have been thinking about these issues for a long time. However, as always, if you do much thinking on your own, serious mistakes are likely.

    Little correction of my comment above, I meant E(Y | Xb) and V(Y | Xb) and not E(Xb) and V(Xb), that does not make sense.

  5. I am going to write another post addressing some of these issues, but here is a quick answer. On the distribution, I don't think it's a problem--it just means that there is less information than someone might think. The counties with small populations and zero insurrectionists don't add much information, even though they are numerous. The use of a negative binomial model is also reasonable, since the Poisson is a special case. (My results were also from a negative binomial regression.) On density, I agree that it would be an alternative and possibly better way of representing "urbanism"--it just wasn't in the data set. I might add it and see how it affects the results.

    The issue of who is "at risk" is interesting, and will be the focus of the future post. It seems very likely that Trump voters were more likely to participate, but the chance that a Trump voter would go might depend on the share of Biden voters in the area (e. g., maybe Trump voters feel more threatened when they are part of a small minority), so the model should take account of Biden voters (and non-voters). But exactly how should you do that.....?


  6. "the model should take account of Biden voters (and non-voters). But exactly how should you do that.....?"

    If you specify your outcome as x/n, where n corresponds to white males who voted for Trump, you can simply add the share of Biden voters as a predictor, I think (or the share of Trump voters, or the difference; probably you can't add both since they are correlated). Let's assume you add the share of Biden voters. If the coefficient is positive, then where more people voted for Biden, more white males who voted for Trump became insurrectionists. This would be consistent with the idea that Trump voters feel threatened and that this radicalizes them.

    Regarding Poisson vs. negative binomial: I am not sure that Poisson is really a special case of negative binomial. I know what you mean, because you can express the variance in a negative binomial model as var = mu + mu^2 / theta, and in Poisson theta goes to infinity. But unless the overdispersion parameter theta is known, you have to estimate it. In this case negative binomial regression is not really a generalized linear model anymore, I think? Also, I think if the model is correctly specified, E(Y | Xb) = exp(Xb), then Poisson and negative binomial should return the same coefficients, I guess? Further, estimating theta should have some disadvantage, since there is no free lunch, as the saying goes. In any case, it would be interesting to run both and see if they differ.
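    The variance relationship mentioned here can be written out as follows. This uses the same mu and theta as above, with the conditioning on Xb made explicit; it is only a sketch of the limiting argument, not a statement about how any particular software estimates theta.

    ```latex
    \begin{align*}
    \operatorname{Var}(Y \mid Xb) &= \mu + \frac{\mu^{2}}{\theta}, \qquad \mu = E(Y \mid Xb) = \exp(Xb) \\
    \lim_{\theta \to \infty} \operatorname{Var}(Y \mid Xb) &= \mu
    \end{align*}
    ```

    In the limit the variance equals the mean, which is the Poisson case; for any finite theta the extra term mu^2 / theta allows the variance to exceed the mean.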

    Regarding the many counties and many zeros: I think two issues are involved here. First, the division of the US into counties is, from our point of view, arbitrary. Second, because the division into counties is arbitrary, so is the number of regions we consider. Let me explain: Suppose we are real scientists and we actually want to connect our model meaningfully to reality. Our scientific question is: How does the share of Biden voters influence the radicalization of Trump voters? Well, the share of Biden voters in what area? This depends upon what Trump voters look at. Let's assume Trump voters are radicalized if they are in the minority. But what is their point of reference? The US? Might be, if they watch national news and they care about the election only. Or maybe they care about their federal state. Maybe they live in a big city and there are Biden posters all over their neighborhood, so they care about their immediate neighborhood only. Maybe they live in a village and only want to make sure their village is Democrat-free. I don't know, but simply assuming that counties are relevant here, without ever stating this assumption explicitly, seems like a mistake to me.

    I want to offer another way to look at this: The division into counties is between two extremes. On the one hand is the division into 50 federal states, on the other (we can certainly imagine this) the division into 1 km² patches. Since the US spans almost 10 million square kilometers, there are then 10 million regions. Assume we have all the relevant data and we estimate your model on the level of counties, federal states, and 1 km² patches. I am sure the estimates differ by a lot. Which should we trust?

    Regarding the sheer number of zeros I am not sure why it matters. I think it does. Assume we have 10m regions and 1000 insurrectionists. This means that at least 10m - 1000 regions do not send a single one. I think this causes trouble for the estimation. I am going to think about this more and try to simulate some data.

    Looking forward to part three!
