Sunday, April 26, 2020

Kicks in a plane

In 2016, a paper came out saying that the presence of a first-class cabin led to more incidents of "air rage" and that front boarding--where the economy passengers had to endure the humiliation of walking through first class before getting to their seats--led to even more. It got a good deal of media attention at the time, and has 56 citations in Google Scholar, which is pretty good for a 2016 publication. I was reminded of it when looking at an old post on Andrew Gelman's blog yesterday. He mentioned that he, Marcus Credé, and Carol Nickerson had a letter in the journal that published the article (PNAS), and that the authors, Katherine DeCelles and Michael Norton, had a reply. After reading those, I think I see where the analysis went wrong.

The dependent variable was whether or not an "air rage" incident happened on the flight. Two important influences on the chance of an incident are the number of passengers and the duration of the flight (their data apparently don't include either one, but they do include the number of seats and the distance of the flight, which can stand in for them). As a starting point, let's suppose that every passenger has a given chance of causing an incident for every mile he or she flies. Then the chance of an incident on a particular flight is approximately:

p=knd

p is the probability of an incident, k is the chance per passenger-mile, n is the number of passengers, and d is the distance.  It's approximate because some incidents might be the second, third, etc. on a flight, but the approximation is good when the probabilities are small, which they are (a rate of about 2 incidents per thousand flights).  When you take logarithms, you get
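
As a quick check on how good that approximation is, here is a minimal Python sketch. It assumes each passenger-mile is an independent trial with chance k, so the exact probability of at least one incident is 1 - (1 - k)^(nd). The value of k and the flight sizes are made up to give roughly 2 incidents per thousand flights on a typical flight; they are not taken from the paper's data.

```python
import math

# Assumed chance of an incident per passenger-mile (hypothetical value,
# chosen so a 150-passenger, 1,000-mile flight has p of about 0.002).
K = 0.002 / (150 * 1000)

def approx_p(n, d, k=K):
    """Linear approximation from the text: p = k * n * d."""
    return k * n * d

def exact_p(n, d, k=K):
    """Chance of at least one incident in n*d independent passenger-miles."""
    # 1 - (1 - k)**(n*d), written with log1p/expm1 to keep precision for tiny k
    return -math.expm1(n * d * math.log1p(-k))

# Hypothetical flights: (passengers, miles)
for n, d in [(50, 300), (150, 1000), (300, 2500), (400, 5000)]:
    a, e = approx_p(n, d), exact_p(n, d)
    print(f"n={n:3d} d={d:4d}  approx={a:.5f}  exact={e:.5f}  rel.err={(a - e) / e:.2%}")
```

The linear formula always sits a little above the exact value, and the gap grows as knd grows, so the approximation is tightest for the small probabilities that matter here.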

log(p)=log(k) + log(n) + log(d)

DeCelles and Norton used logit models--that is, the logit, log(p/(1-p)), was a linear function of some predictors. (When p is small, log(1-p) is close to zero, so the logit is approximately log(p).) So while they included the number of seats and distance as predictors, it would have been more reasonable to include the logarithms of those variables. What if the true relationship is the one I've given above, but you fit a logit using the number of seats as a predictor?
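
One way to explore that question without the original data is a small deterministic sketch in Python. Everything in it is an assumption for illustration: a hypothetical fleet of plane sizes, a fixed distance folded into a single constant (K_TIMES_D), and a hand-rolled Newton fit (fit_logit) that fits the logit to the assumed true probabilities directly, so there is no sampling noise. It is not a reconstruction of the authors' model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ps, iters=50):
    """Fit logit(q) = a + b*x to known probabilities ps by Newton's method.

    This maximizes the expected Bernoulli log-likelihood, i.e. what a logit
    fit would converge to with unlimited data at each x.
    """
    mean_p = sum(ps) / len(ps)
    a, b = math.log(mean_p / (1.0 - mean_p)), 0.0  # start at intercept-only fit
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, p in zip(xs, ps):
            q = sigmoid(a + b * x)
            w = q * (1.0 - q)
            g0 += q - p            # gradient w.r.t. a
            g1 += (q - p) * x      # gradient w.r.t. b
            h00 += w               # Hessian entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a -= (h11 * g0 - h01 * g1) / det   # Newton step: solve the 2x2 system
        b -= (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical fleet; K_TIMES_D is an assumed value of k*d chosen so that a
# 150-seat plane has p = 0.002. Seats stand in for passengers here.
seats = [50, 100, 150, 200, 250, 300, 350, 400]
K_TIMES_D = 0.002 / 150
true_p = [K_TIMES_D * n for n in seats]          # p = k*n*d, as in the text

# Model 1: seats entered linearly (in hundreds, only to keep numbers tidy).
a_lin, b_lin = fit_logit([n / 100 for n in seats], true_p)
pred_lin = [sigmoid(a_lin + b_lin * n / 100) for n in seats]

# Model 2: log(seats) as the predictor.
a_log, b_log = fit_logit([math.log(n) for n in seats], true_p)
pred_log = [sigmoid(a_log + b_log * math.log(n)) for n in seats]

resid_lin = [p - q for p, q in zip(true_p, pred_lin)]   # actual - predicted
resid_log = [p - q for p, q in zip(true_p, pred_log)]

print("seats  actual   fit(seats)  fit(log seats)")
for n, p, ql, qg in zip(seats, true_p, pred_lin, pred_log):
    print(f"{n:5d}  {p:.5f}  {ql:.5f}     {qg:.5f}")
```

Under these assumptions, the fit on seats misses in one direction for mid-sized planes and in the other for the smallest and largest ones, while the fit on log(seats) tracks the assumed rates closely with a slope near 1.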

The result is systematic discrepancies between the predicted and actual values. That's relevant to the estimates of the other predictors. For example, suppose that small planes don't have first class, large planes have first class and boarding in the middle, and medium-sized planes have first class and front boarding. Then a model that adds variables for those qualities will find that first class with front boarding has higher rates than expected given the number of seats, which is exactly what DeCelles and Norton appeared to find.

The authors didn't release their data, so I don't know the shape of the actual relationship. But I would be willing to bet (and give long odds) that a model using logs would fit better and would produce substantially different estimates for their variables of interest.  The general point is that when a control variable is a strong predictor, it's not enough to include it--you have to include it in the right form.  Fortunately, this usually isn't hard to do--in addition to trying x as a control variable, try log(x) too, especially if you're using a logit, Poisson regression, or other model for binary or count data.

PS: This is the same problem that led to a spurious result in the hurricane names study.
