## Tuesday, June 17, 2014

### Stormy weather

Since my last post, the authors of the hurricane study have issued statements defending their analysis.  They address the issue of taking the logarithm of normalized damage, but what they say doesn't make sense,  for reasons discussed by Jeremy Freese.

I thought it might be useful to show why the issue matters so much in this case.  Suppose you fit a model with just one independent variable, the logarithm of damage.  This is a reasonable thing to do because neither masculinity/femininity of name nor minimum pressure is statistically significant if you include them.  The equation is:
predicted log(deaths)= -1.909+.580*log(damage)

What a negative binomial regression predicts is not the number of deaths itself, but the logarithm of the expected number of deaths.  So to translate it into the number of deaths:
deaths=.148*(damage**0.58)                        [e**-1.909=.148]
Fortunately, you don't even have to do the algebra--you just need to know that exp(log(x))=x, and give a command like pred=exp(-1.909+.580*ldam) and your statistics program will calculate the predicted number.  Then you can see how predicted values are related to damage.
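Here is a minimal sketch of that calculation in Python, using only the two coefficients quoted above. The function names and the damage values in the loop are my own, chosen for illustration; they are not taken from the data set. It computes the prediction both ways, by exponentiating the linear predictor and by using the power form from the algebra, so you can check that they agree.

```python
import math

# Coefficients as quoted in the post for the log-damage model.
INTERCEPT = -1.909
SLOPE = 0.580

def predicted_deaths(damage):
    """Exponentiate the linear predictor: exp(-1.909 + .580*log(damage))."""
    ldam = math.log(damage)
    return math.exp(INTERCEPT + SLOPE * ldam)

def predicted_deaths_power_form(damage):
    """The same prediction after doing the algebra: e**-1.909 * damage**.58."""
    return math.exp(INTERCEPT) * damage ** SLOPE

# Hypothetical damage values, chosen only for illustration.
for damage in (10, 1_000, 50_000):
    print(damage,
          round(predicted_deaths(damage), 2),
          round(predicted_deaths_power_form(damage), 2))
```

The two columns match because exp(a + b*log(x)) is the same thing as exp(a) * x**b.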

The overall magnitudes are reasonable. The highest predicted number of deaths is about 100; the highest actual number in the data set was about 200, but of course some hurricanes are going to cause more deaths than predicted.

Calculating predicted values for the model with normalized damage is more complicated, because minimum pressure has a statistically significant effect.  But if you set minimum pressure and masculinity-femininity of names at their means, you get
predicted log(deaths)=1.95+.0000809*damage
or
deaths=7.02 e**(.0000809*damage)

The resulting figure:

This is hard to read, since the predicted values for the few largest hurricanes are so much bigger than the predicted values for all the others.  But you can still see that for the biggest hurricanes the predicted values are much larger than the actual number of deaths.  If you use a log scale for both axes, you get:

Here you can see another odd thing. The predicted values barely increase as you go from the hurricanes that did the least damage to the medium ones.  As a result, the predicted deaths for the hurricanes that did the least damage are almost all too high--in fact, the 19 hurricanes that did the least property damage all caused fewer deaths than they were "supposed" to.  In contrast, most medium hurricanes caused more deaths than predicted, and the few largest hurricanes caused fewer deaths than predicted.
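The flatness at the low end and the blow-up at the high end both follow from the functional form, and a short Python sketch makes that visible. It uses only the coefficients quoted in this post, with the other predictors held at their means as described above. The function names and the damage values are hypothetical, picked to span a wide range rather than taken from the data.

```python
import math

def log_damage_model(damage):
    """Model with log(damage): exp(-1.909 + .580*log(damage))."""
    return math.exp(-1.909 + 0.580 * math.log(damage))

def raw_damage_model(damage):
    """Model with damage itself, other predictors held at their means:
    exp(1.95 + .0000809*damage)."""
    return math.exp(1.95 + 0.0000809 * damage)

# Hypothetical damage values spanning a wide range, for illustration only.
print(f"{'damage':>7} {'log model':>10} {'raw model':>10}")
for damage in (1, 100, 1_000, 10_000, 50_000):
    print(f"{damage:>7} {log_damage_model(damage):10.2f} {raw_damage_model(damage):10.2f}")
```

Going from a damage value of 1 to 1,000 multiplies the raw-damage prediction by only about 1.08, while the log-damage prediction rises by a factor of about 55. At the top of the range the exponential in the raw-damage model takes over and its predictions climb far above those of the log-damage model.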

Damage is such an important predictor of deaths that it's not enough to sort of control for it--you need to control for it correctly. If you don't do that, nothing you do from that point on will give you the right results.