Monday, July 28, 2014

Who's talking?

I remember hearing about some study finding that women spoke an average of 20,000 words per day while men spoke only 7,000.  It's hard to get an accurate count of something like that, so I figured it was based on a small non-representative sample.  According to a recent story in the New York Times, it probably wasn't based on any kind of sample:  it seems that the numbers were just invented.

The story went on to suggest that "this stereotype may dovetail with the idea that what women have to say isn’t important -- that it’s 'fluff,'" and that "such stereotypes [may] make women less likely to speak up, or men less likely to hear them..."  I had a different impression--that it was associated with the idea that women had more "emotional intelligence" than men.  A 2000 Gallup survey has some relevant information.  It listed a number of characteristics and asked if each was "generally more true of men or more true of women" (people could volunteer that there was no difference).  It also asked if "the country would be governed better or governed worse if more women were in political office" and "if you were taking a new job and had your choice of a boss would you prefer to work for a man or a woman?"  The characteristics were:  aggressive, emotional, talkative, intelligent, courageous, patient, creative, ambitious, easy-going, and affectionate.  77% say women are more talkative, 11% say men, and 10% say no difference, which is about the same as when the question was first asked in the 1940s.

Opinions about which sex is more intelligent, courageous, and patient help to predict opinions about whether more women in office would mean better or worse government.  Opinions about which sex is more intelligent, courageous, and easy-going help to predict preferences about a man or woman as boss.  That is, people who see women as more intelligent, courageous, patient, or easy-going are more likely to think that the country would be governed better or prefer a woman as a boss.  The others, including talkative, do not have a statistically significant relationship.  (For what it's worth, the estimates for talkative are positive --favorable-- with t-ratios of 1.3 and 1.0).

You might wonder if the belief that women are more talkative is part of a pattern, going with negative views about women's intelligence, courage, etc.  It has a significant negative association with courageous--that is, people who see women as more talkative tend to see men as more courageous--but not with views about which sex is more intelligent, easy-going, or patient.  Overall, the correlations with opinions about other qualities were low.

So in conclusion, the stereotype doesn't seem to matter much either way.


Monday, July 7, 2014

Another "don't know" problem

When looking at the tabulations for the questions in my last post, I noticed a difference in the percent of "don't know" answers of self-described liberals and conservatives.  I then checked the others to see if the pattern persisted.  It did, and there are strong parallels between the gender and liberal/conservative differences.  I'll give the average for the eleven questions to make it simpler:

                    Correct     Incorrect      DK
Liberals              38%          33%         29%
Men                   35%          31%         33%
Conservatives         31%          28%         40%
Women                 30%          29%         41%


Liberals and men give more correct answers, more incorrect answers, and fewer don't knows.  The ratio of correct to incorrect answers is about the same in all groups (slightly higher among men and liberals).
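
The ratio claim can be checked with a few lines of Python.  This is only a sketch that uses the rounded averages from the table above; the variable names are illustrative and not part of the original tabulations.

```python
# Average percentages from the table above: (correct, incorrect, don't know).
averages = {
    "Liberals": (38, 33, 29),
    "Men": (35, 31, 33),
    "Conservatives": (31, 28, 40),
    "Women": (30, 29, 41),
}

# Ratio of correct to incorrect answers for each group.
ratios = {group: correct / incorrect
          for group, (correct, incorrect, _dk) in averages.items()}

for group, ratio in ratios.items():
    print(f"{group:14s} {ratio:.2f}")
```

Because the inputs are rounded to whole percentages, small differences between the ratios should not be read too closely.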

So what's going on?  Although I still think my point about gender differences from the last post is partly correct, it seems to be incomplete.  My interpretation:
(a) People sometimes interpret "conservative" to mean "cautious" (as I've discussed in other posts, a significant number of people seem to understand liberal and conservative in non-political senses).
(b) Differences in "don't knows" don't involve people who know or have no idea, but people who "sort of" know, or could make a fairly good guess.  Conservatives and women who are in that middle group may be less likely to venture an answer.

Some insight is provided by the question:   "Which one of the following people is not a college dropout:   Facebook founder Mark Zuckerberg, designer Ralph Lauren, entertainer Ellen Degeneres, Apple founder Steve Jobs, President Calvin Coolidge, movie mogul David Geffen, and oil magnate John D. Rockefeller?"
It seems safe to assume that very few people definitely knew the true answer or confidently believed an incorrect answer.*  But you could apply some pieces of common knowledge (e. g., that a lot of people who became rich from computers or the internet were college dropouts) to make an educated guess.   So the group differences in "don't knows" were essentially a matter of willingness to try.  

                    Correct     Incorrect      DK
Liberals              20%          43%         27%
Men                   14%          53%         33%
Conservatives          8%          50%         42%
Women                 11%          44%         45%

Men were more willing to try than women, but there was little or no difference in the probability of getting it right if they tried.  Conservatives were less willing to try than liberals. Given the fairly small number of liberals and large number of don't knows, the liberal/conservative differences in the conditional probability of getting it right, although large, are not statistically significant.  
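
The "probability of getting it right if they tried" is the share of correct answers among people who gave any answer.  A rough sketch of that arithmetic, using the rounded percentages from the table above (the function name is illustrative):

```python
# Percentages from the dropout question: (correct, incorrect, don't know).
# Rows are used as printed; the Liberals row sums to 90 rather than 100.
dropout = {
    "Liberals": (20, 43, 27),
    "Men": (14, 53, 33),
    "Conservatives": (8, 50, 42),
    "Women": (11, 44, 45),
}

def share_correct_if_tried(correct, incorrect):
    """Share of correct answers among respondents who offered any answer."""
    return correct / (correct + incorrect)

conditional = {group: share_correct_if_tried(c, i)
               for group, (c, i, _dk) in dropout.items()}

for group, share in conditional.items():
    print(f"{group:14s} {share:.2f}")
```

This only reproduces the point estimates.  It says nothing about statistical significance, which depends on the group sizes.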

*Coolidge graduated from Amherst College, so I count him as the correct answer.  I'm not sure it's accurate to call Rockefeller a college dropout--see his biography here--but he didn't have a college degree.  



Tuesday, July 1, 2014

What's the matter with men and/or women?

 Recently (actually, six weeks ago, but I lose my sense of time when the semester is over) the New York Times had a piece called "Women and the 'Don't Know' Problem," about the reasons that women are more likely to say "don't know" in polls than men are.  It started out by saying that women were less willing to express opinions than men were, but then turned to suggesting that men were more likely to claim knowledge that they don't actually have--"men comfortably hold forth on topics that they have little expertise on."  That theme was picked up in the reader comments, many of which comfortably held forth about the basic psychology of men and women.

An alternative hypothesis is that most poll questions are about politics and public affairs, and men may be more interested in those topics (or feel more obligation to be somewhat informed about them) than women are.  In order to choose between them we need to compare men and women on a range of questions, both political and non-political.  There is a series of surveys by Vanity Fair/CBS News which occasionally ask factual multiple-choice questions on a wide variety of issues.  I looked up the last 11 (it was going to be ten, but the last survey I looked at included two) examples, which involved:  what Donald Trump had said about himself, who Bubba Watson is, how many justices are on the Supreme Court, who Jamie Dimon is, how many universities are in the Ivy League, what Kwanzaa is, who Judd Apatow is, who Wayne LaPierre is, which one of a list of people was not a college dropout, where Northwestern University is located, and who Thomas Paine was.

The results:
                    women               men
                 c    i   dk        c    i   dk
Trump           56   20   25       46   30   23
Watson          23   27   51       39   23   37
Supremes        36   49   16       45   49    5
Dimon           12   18   70       16   22   62
Ivy League      31   48   20       38   46   15
Kwanzaa         63   16   21       57   19   24
Apatow          12   22   65       16   21   62
LaPierre        19   26   56       30   23   46
dropout         11   44   46       14   53   33
Northwestern    29   36   35       39   37   23
Paine           41   16   44       47   18   35


Women are more likely to say that they don't know for ten of the eleven questions.  The idea that men are more likely to claim knowledge even if they don't have it suggests that the ratio of correct to incorrect answers will be higher among women.  But that's true for only two of the questions, Donald Trump and Kwanzaa.  Men are more likely to offer correct answers on nine of the questions, and more likely to offer incorrect answers on only six.
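
The counts in this paragraph can be reproduced directly from the table.  The following is a small sketch with the percentages typed in by hand; the dictionary layout and variable names are illustrative.

```python
# Percentages from the table above, per question:
# ((women correct, incorrect, dk), (men correct, incorrect, dk)).
results = {
    "Trump":        ((56, 20, 25), (46, 30, 23)),
    "Watson":       ((23, 27, 51), (39, 23, 37)),
    "Supremes":     ((36, 49, 16), (45, 49, 5)),
    "Dimon":        ((12, 18, 70), (16, 22, 62)),
    "Ivy League":   ((31, 48, 20), (38, 46, 15)),
    "Kwanzaa":      ((63, 16, 21), (57, 19, 24)),
    "Apatow":       ((12, 22, 65), (16, 21, 62)),
    "LaPierre":     ((19, 26, 56), (30, 23, 46)),
    "dropout":      ((11, 44, 46), (14, 53, 33)),
    "Northwestern": ((29, 36, 35), (39, 37, 23)),
    "Paine":        ((41, 16, 44), (47, 18, 35)),
}

# Questions on which women say "don't know" more often than men.
women_more_dk = [q for q, (w, m) in results.items() if w[2] > m[2]]

# Questions on which the correct/incorrect ratio is higher among women.
women_higher_ratio = [q for q, (w, m) in results.items()
                      if w[0] / w[1] > m[0] / m[1]]

# Questions on which men give more correct / more incorrect answers.
men_more_correct = [q for q, (w, m) in results.items() if m[0] > w[0]]
men_more_incorrect = [q for q, (w, m) in results.items() if m[1] > w[1]]

print(len(women_more_dk), women_higher_ratio,
      len(men_more_correct), len(men_more_incorrect))
```

Ties (such as the incorrect answers on the Supreme Court question) are not counted for either side.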

Overall, men just seem more likely to know the right answer (or be willing and able to make an educated guess) on most of the questions.  Of course, these questions aren't a representative sample of anything.  There are a couple on which you would expect men to have more knowledge (e. g., that Bubba Watson is a golf pro).  But it is noteworthy that on the two purely political questions--the Supreme Court and Wayne LaPierre--men are much more likely to identify the correct answer, and no more likely to pick the incorrect answer.

This suggests that the "problem" doesn't result from a general psychological tendency of women or men--it's that most polls focus on issues that men are more likely to know about, or have opinions about.

[Source:  iPOLL, Roper Center for Public Opinion Research]


Thursday, June 26, 2014

Rise and Fall

In a column last week, Greg Mankiw said "According to a recent study, if your income is at the 98th percentile of the income distribution -- that is, you earn more than 98 percent of the population -- the best guess is that your children, when they are adults, will be in the 65th percentile."  Of course, you wouldn't expect those children to do as well as their parents--there's not much room to rise and a lot of room to fall--but that was a bigger decline than I would have expected, so I decided to take a closer look.  

 The study refers to people born in 1980-82 and "when they are adults" is age 30.  Of course, 30-year-olds generally earn less than middle-aged people, but the authors of the study say that relative positions have pretty much stabilized by then--that is, we'd see about the same pattern if we came back 20 years later. 

Here is the pattern for people whose parents were in the 60th percentile.





The large number of people in the second percentile (actually about 6% of all 30-year olds) had zero income.  That makes it hard to read, so here is the figure showing just the lower part of the y-axis.  




The most common destination is in the low 70s.  The chances of rising above that level drop off pretty sharply.  But overall, the differences are pretty small:  you could say that people from the 60th percentile are about equally likely to end up at any point in the distribution.


Here is the 80th percentile.  It's a similar basic pattern, although the chances of winding up near the top are higher and the chances of ending near the bottom are lower.




Here are people whose parents were in the 98th percentile.  This looks different--the higher the ranking, the better your chances of getting there.  The most likely destination is the 99th percentile--even higher than the  "percentile" of zero earnings.  


 I haven't calculated the mean percentile--I'll look at this more later--but this seems to give a different picture than Mankiw's summary.

Friday, June 20, 2014

Final comments on the hurricane study

I sent a letter to the editors of the Proceedings of the National Academy of Sciences summarizing the point I made in my post of June 12.  They declined to publish it (I hope that's because they had already accepted one or more letters making the same point), but I have posted it on my faculty page at the UConn Department of Sociology.  

It occurred to me that the hurricane study omitted the two hurricanes that caused the largest number of deaths (Audrey in 1957 and Katrina in 2005) because the models couldn't fit them--basically, they had too many deaths to be plausible under the distributions that they used.  But both hurricanes had female names, so they should be counted as some kind of favorable evidence for their hypothesis.  What sort of models could accommodate all of the hurricanes?  There are two reasonable approaches:

1.  Sophisticated:  The Cox proportional hazards model is widely used for duration data--time until some event.  Count data is like duration data in the sense that there is only one possible direction of change--just as a person who's turned 90 can't go back and die at 89, a hurricane that's killed 90 can't go back and wind up killing only 89.  So the Cox model can reasonably be applied to count data, although I don't think I've ever seen that done.  The model is useful because it makes minimal assumptions about the distribution--it essentially just tries to predict which cases will rank higher than others.  If you add Katrina and Audrey to the data set (I scored them both as highly feminine names), the estimated effect of feminine name is .031 with a standard error of .032, which is nowhere near statistical significance.

2.  Simple:  take the logarithm of (deaths+0.25).  You need to add the small constant because many hurricanes caused zero deaths and the logarithm of zero is undefined.  The exact value doesn't matter much.  Then do an ordinary linear regression with the log of (deaths+.25) as dependent variable and log of damage and hurricane name as the predictors.  The estimated effect of feminine name is .024 with a standard error of .043, again nowhere near significance.  The residuals from the model are approximately normally distributed, meaning that the estimate and se are trustworthy.

The interaction between name and damage is not statistically significant either way.
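
To see how far the two estimates are from conventional significance, each can be divided by its standard error.  This is a minimal sketch assuming a normal approximation for the ratio; the helper name is illustrative, and the numbers are the ones reported above.

```python
import math

def z_and_p(estimate, std_error):
    """Return the z-ratio and two-sided p-value under a normal approximation."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Estimated effect of a feminine name and its standard error in each model.
cox_z, cox_p = z_and_p(0.031, 0.032)   # Cox model
ols_z, ols_p = z_and_p(0.024, 0.043)   # regression on log(deaths + 0.25)

print(f"Cox: z = {cox_z:.2f}, p = {cox_p:.2f}")
print(f"OLS: z = {ols_z:.2f}, p = {ols_p:.2f}")
```

Both ratios fall well short of the usual cutoff of about 1.96.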

Tuesday, June 17, 2014

Stormy weather

Since my last post, the authors of the hurricane study have issued statements defending their analysis.  They address the issue of taking the logarithm of normalized damage, but what they say doesn't make sense,  for reasons discussed by Jeremy Freese.

I thought it might be useful to show why the issue matters so much in this case.  Suppose you fit a model with just one independent variable, the logarithm of damage.  This is a reasonable thing to do because neither masculinity/femininity of name nor minimum pressure are statistically significant if you include them.  The equation is:
predicted log(deaths)= -1.909+.580*log(damage)

What is being predicted in a negative binomial regression is not the number of deaths, but the logarithm of the number of deaths.  So to translate it into the number of deaths:
deaths=.148*(damage**0.58)                        [e**-1.909=.148]
Fortunately, you don't even have to do the algebra--you just need to know that exp(log(x))=x, and give a command like pred=exp(-1.909+.580*ldam) and your statistics program will calculate the predicted number.  Then you can see how predicted values are related to damage.
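
For readers without a statistics package at hand, the same calculation can be sketched in Python.  The coefficients are the ones in the equation above; the function names and the sample damage value are only illustrative.

```python
import math

INTERCEPT = -1.909   # from the fitted equation above
SLOPE = 0.580        # coefficient on log(damage)

def predicted_deaths(damage):
    """Exponentiate the predicted log(deaths) to get predicted deaths."""
    return math.exp(INTERCEPT + SLOPE * math.log(damage))

def predicted_deaths_power_form(damage):
    """The same prediction written as a power function of damage."""
    return math.exp(INTERCEPT) * damage ** SLOPE

# Hypothetical damage value, used only to show the two forms agree.
print(predicted_deaths(1000.0), predicted_deaths_power_form(1000.0))
```

The two functions are algebraically identical, so which one to use is a matter of convenience.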


The general numbers are reasonable. The highest predicted number of deaths is about 100; the highest actual number in the data set was about 200, but of course some hurricanes are going to cause more deaths than predicted. 

Calculating predicted values for the model with normalized damage is more complicated, because minimum pressure has a statistically significant effect.  But if you set minimum pressure and masculinity-femininity of names at their means, you get
predicted log(deaths)=1.95+.0000809*damage
or
deaths=7.02 e**(.0000809*damage)

The resulting figure:


This is hard to read, since the predicted values for the few largest hurricanes are so much bigger than the predicted values for all the others.  But you can see that the predicted values for the biggest hurricanes are also much larger than the actual numbers of deaths.  If you use a log scale for both axes, you get:


Here you can see another odd thing. The predicted values barely increase as you go from the hurricanes that did the least damage to the medium ones.  As a result, the predicted deaths for the hurricanes that did the least damage are almost all too high--in fact, the 19 hurricanes that did the least property damage all caused fewer deaths than they were "supposed" to.  In contrast, most medium hurricanes caused more deaths than predicted, and the few largest hurricanes caused fewer deaths than predicted.
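
The contrast between the two specifications can also be seen without the figures.  The sketch below plugs the same damage values into both fitted equations from this post.  The damage values are hypothetical, chosen only to span small to very large storms, and the sketch assumes both equations measure damage in the same units.

```python
import math

def log_damage_model(damage):
    """Predicted deaths from the model that uses log(damage)."""
    return math.exp(-1.909 + 0.580 * math.log(damage))

def raw_damage_model(damage):
    """Predicted deaths from the model that uses damage itself,
    with the other predictors held at their means."""
    return math.exp(1.95 + 0.0000809 * damage)

# Hypothetical damage values (not taken from the data set).
print(f"{'damage':>8} {'log model':>10} {'raw model':>10}")
for damage in (1, 100, 1_000, 10_000, 75_000):
    print(f"{damage:>8,d} {log_damage_model(damage):10.1f} "
          f"{raw_damage_model(damage):10.1f}")
```

With the raw-damage equation, the prediction hardly moves between the smallest and the medium values and then grows very quickly, which is the pattern described above.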

Damage is such an important predictor of deaths that it's not enough to sort of control for it--you need to control for it correctly. If you don't do that, nothing you do from that point on will give you the right results.

Thursday, June 12, 2014

The Story of the Hurricane

I was looking at Andrew Gelman's blog yesterday and saw a post on a study saying that hurricanes with female names caused more deaths than hurricanes with male names.  The study came out a couple of weeks ago; I seem to recall hearing some news reports and assuming there was probably something wrong with it, but I didn't give it any more thought.  This morning I looked at the New York Times and saw that Nicholas Kristof gave about half of his column to uncritically recounting the claims of the study. Then I looked a little more and saw that there had been a lot of news coverage, and a lot of critical commentary.  But the criticisms seemed sort of peripheral, or raised questions without really identifying a specific flaw.  So I read the original paper, downloaded the data, and did some analysis.  Their claim does not stand up, and here is my attempt to explain why not.  It's possible that someone beat me to it (in fact, I hope someone did, since the problem was so basic), but given the nature of the internet the more places it appears the better.

A statistical model uses variables to predict the value of another variable (total deaths resulting from the hurricane).  The "deviance" is a measure of how much of total deaths is not predicted.  So the goal is to get a small deviance using a small number of predictors.  


Here are the deviance and number of predictors in two of their models:

Deviance   Predictors   Variables
136.1      3            female name, storm damage, barometric pressure
121.8      5            the same three, plus interactions (products) of female and damage, female and pressure


Here are the deviance and number of predictors from two alternative models that I fit:

Deviance   Predictors   Variables
 97.5      3            female name, logarithm of storm damage, barometric pressure
 95.3      5            the same three, plus interactions of female and log damage, female and pressure

The models using the logarithm rather than the original variable had much lower deviance.  Adding the two interactions to the model with the logarithm reduced the deviance by 2.2, but the usual standard is that adding two predictors has to reduce the deviance by at least 6 to qualify as evidence that there's anything there (i.e., a reduction of less than 6 is not "statistically significant").  So the best model has a deviance of 97.5 and three predictors.  In that model, the estimated effect of the "femaleness" of the name (which they treat as a matter of degree) is .024, with a standard error of .036, which is not statistically significant, or close to statistically significant.  
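
The "at least 6" rule of thumb corresponds to the 5% critical value of a chi-square distribution with two degrees of freedom.  The sketch below assumes the drop in deviance is compared against that distribution, and it relies on the fact that with two degrees of freedom the upper-tail probability is simply exp(-x/2).  The function name is illustrative.

```python
import math

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)

# 5% critical value for 2 degrees of freedom (about 5.99).
critical_value = -2 * math.log(0.05)

# Drop in deviance from adding the two interactions, from the tables above.
log_models_drop = 97.5 - 95.3     # models using log damage
raw_models_drop = 136.1 - 121.8   # models using raw damage

print(f"critical value: {critical_value:.2f}")
print(f"log damage: drop = {log_models_drop:.1f}, p = {chi2_sf_2df(log_models_drop):.3f}")
print(f"raw damage: drop = {raw_models_drop:.1f}, p = {chi2_sf_2df(raw_models_drop):.4f}")
```

The interactions clear the threshold only in the models that use raw damage, and those models fit much worse overall.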

So the flaw was that they controlled for the dollar value of damage when they should have controlled for the logarithm of damage.    With the right control, there is no evidence that the gender of the name makes any difference.  

Notes:  1. the paper and data, published in the Proceedings of the National Academy of Sciences