Blog

Polling 101

I teach a graduate-level course every spring semester on survey and experiment methods in economics and the social sciences.  In this election season, I thought it might be worthwhile to share a few of the things I discuss in the course so that you might more intelligently interpret some of the survey research results being continuously reported in the newspapers and on the nightly news. 

You've been hiding under a rock if you haven't by now seen reports of polls on the likelihood of Trump or Clinton winning the presidential election.  Almost all these polls will report (often in small font) something like "the margin of error is plus or minus 3 percent".  

What does this mean?

In technical lingo it means the "sampling error" is +/- 3% with 95% confidence.  This is the error that comes about from the fact that the polling company doesn't survey every single voter in the U.S.  Because not every single voter is sampled, there will be some error, and this is the error you see reported alongside the polls.  Let's say the projected percent vote for Trump is 45% with a "margin of error" of 3%.  The interpretation would be that if we were to repeatedly sample potential voters, 95% of the time we would expect to find a voting percentage for Trump that is between 42% and 48%.

The thought experiment goes like this: imagine you had a large basket full of a million black and white balls.  You want to know the percentage of balls in the basket that are black.  How many balls would you have to pull out and inspect before you could be confident of the proportion of balls that are black?  We can construct many such baskets where we know the truth about the proportion of black balls and try different experiments to see how accurate we are in many repeated attempts where we, say, pull out 100, 1,000, or 10,000 balls.  The good news is that we don't have to manually do these experiments because statisticians have produced precise mathematical formulas that give us the answers we want.  
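For readers who would rather run the basket experiment than take the formulas on faith, here is a minimal simulation sketch. The function name, the 45% true share of black balls, and the number of repetitions are all assumptions chosen for illustration, and drawing with replacement is used as a stand-in for pulling a small sample out of a very large basket.

```python
import math
import random


def simulate_coverage(true_share=0.45, n=1000, trials=2000, seed=0):
    """Repeat the basket experiment `trials` times.

    Each trial draws n balls (with replacement, which closely approximates
    sampling from a very large basket) and checks whether the sampled share
    of black balls lands within the nominal 95% margin of the true share.
    Returns the fraction of trials that land inside the margin.
    """
    rng = random.Random(seed)
    # Nominal 95% margin of error for a proportion (normal approximation).
    margin = 1.96 * math.sqrt(true_share * (1 - true_share) / n)
    hits = 0
    for _ in range(trials):
        black = sum(rng.random() < true_share for _ in range(n))
        if abs(black / n - true_share) <= margin:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        margin = 1.96 * math.sqrt(0.45 * 0.55 / n)
        share_inside = simulate_coverage(n=n, trials=500)
        print(f"n={n:>6}: margin ~ +/-{margin:.3f}, "
              f"share of samples inside margin = {share_inside:.3f}")
```

The share of samples landing inside the margin should hover near 95% for each sample size, while the margin itself shrinks as more balls are drawn.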

As it turns out, you need to sample about 1,000 to 1,500 people (the answer is 1,067 to be precise) out of the U.S. population to get a sampling error of 3%, and thus most polls use this sample size.  Why not a 1% sampling error, you might ask?  Well, you'd need to survey almost 10,000 respondents to achieve a 1% sampling error, and the roughly 9x increase in cost is probably not worth a measly two percentage point increase in accuracy. 
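The sample sizes quoted above can be checked with the standard textbook formula for a proportion. The sketch below assumes 95% confidence (z = 1.96), the worst-case share of 50%, and a population large enough that no finite-population correction is needed; the function names are my own.

```python
import math


def required_sample_size(margin, p=0.5, z=1.96):
    """Sample size needed for a given margin of error on a proportion.

    Uses n = z^2 * p * (1 - p) / margin^2, with p = 0.5 as the worst case.
    """
    return z ** 2 * p * (1 - p) / margin ** 2


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error on a proportion for a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for target in (0.03, 0.01):
        n = required_sample_size(target)
        print(f"margin of {target:.0%}: about {round(n):,} respondents")
```

Because the margin shrinks with the square root of the sample size, cutting the error to a third requires about nine times as many respondents.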

Here is a key point: the 3% "margin of error" you see reported on the nightly news is only one kind of error.  The true error rate is likely something much larger because there are many additional types of error besides just sampling error. However, these other types of errors are more difficult to quantify, and thus, are not reported.

For example, a prominent kind of error is "selection bias" or "non-response error" that comes about because the people who choose to answer the survey or poll may be systematically different than the people who choose not to answer the survey or poll.  Alas, response rates to surveys have been falling quite dramatically over time, even for "gold standard" government surveys (see this paper or listen to this podcast).  Curiously, those nightly news polls don't tell you the response rate, but my guess is that it is typically far less than 10% - meaning that less than 10% of the people they tried to contact actually told them whether they intend to vote for Trump or Clinton or someone else.  That means more than 90% of the people they contacted wouldn't talk to them.  Is there something special about the ~10% willing to talk to the pollsters that is different than the ~90% of non-respondents?  Probably.  Respondents are probably much more interested in and passionate about their candidate and politics in general.  And yet, we - the consumers of polling information - are rarely told anything about this potential error.

One way pollsters try to partially "correct" for non-response error is through weighting.  To give a sense for how this works, consider a simple example.  Let's say I surveyed 1,000 Americans and asked whether they prefer vanilla or chocolate ice cream.  When I get my data back, I find that there are 650 males and 350 females.  Apparently males were more likely to take my survey.  Knowing that males might have different ice cream preferences than females, I know that my estimate of the most popular ice cream flavor will likely be biased if I don't do something.  So, I can create a weight.  I know that the true proportion of the US population is roughly 50% male and 50% female (in actuality, there are slightly more females than males but let's put that to the side).  So, what I need to do is make the female respondents "count" more in the final answer than the males.  When we typically take an average, each person has a weight of one (we add up all the answers - implicitly multiplied by a weight of one - and divide by the total).  A simple correction in our ice cream example would be to make females have a weight of 0.5/0.35=1.43 and males have a weight of 0.5/0.65=0.77.  Females will count more than one and males will count less.  And, I report a weighted average: add up all the female answers (and multiply by a weight of 1.43) and add to them all the male answers (multiplied by 0.77), and divide by the total.  
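Here is a minimal sketch of that weighting arithmetic. The 650/350 split and the 50/50 population shares come from the example above; the flavor preferences (60% of males and 40% of females preferring chocolate) are made-up numbers used only to show how the weighted and unweighted answers can differ, and the function names are hypothetical.

```python
def group_weights(sample_counts, population_shares):
    """Weight for each group = population share / sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }


def weighted_share(sample_counts, yes_counts, weights):
    """Weighted share answering 'yes': weighted yes answers / weighted total."""
    weighted_yes = sum(weights[g] * yes_counts[g] for g in sample_counts)
    weighted_total = sum(weights[g] * sample_counts[g] for g in sample_counts)
    return weighted_yes / weighted_total


if __name__ == "__main__":
    sample_counts = {"male": 650, "female": 350}      # from the example above
    population_shares = {"male": 0.5, "female": 0.5}  # rough population split
    # Hypothetical answers: number in each group preferring chocolate.
    prefers_chocolate = {"male": 390, "female": 140}

    weights = group_weights(sample_counts, population_shares)
    equal_weights = {group: 1.0 for group in sample_counts}

    print("weights:", {g: round(w, 2) for g, w in weights.items()})
    print("unweighted chocolate share:",
          round(weighted_share(sample_counts, prefers_chocolate, equal_weights), 3))
    print("weighted chocolate share:",
          round(weighted_share(sample_counts, prefers_chocolate, weights), 3))
```

With these made-up preferences, the over-sampled males pull the unweighted chocolate share above the weighted one, which is exactly the bias the weights are meant to remove.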

Problem solved, right?  Hardly.  For one, gender is not a perfect predictor of ice cream preference.  And the reason someone chooses to respond to my survey almost certainly has something to do with more than gender.  Moreover, weights can only be constructed using variables for which we know the "truth" - or have Census Bureau data which reveals the characteristics of the whole population.  But, in the case of political polling, we aren't trying to match up with the universe of U.S. citizens but the universe of U.S. voters.  Determining the characteristics of voters is a major challenge that is in constant flux.  

In addition, when we create weights, we could end up with a few people having a disproportionate effect on the final outcome - dramatically increasing the possible error rate. Yesterday, the New York Times ran a fantastic story by Nate Cohn illustrating exactly how this can happen.  Here are the first few paragraphs:

There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election.

He is sure he is going to vote for Donald J. Trump.

And he has been held up as proof by conservatives — including outlets like Breitbart News and The New York Post — that Mr. Trump is excelling among black voters. He has even played a modest role in shifting entire polling aggregates, like the Real Clear Politics average, toward Mr. Trump.

How? He’s a panelist on the U.S.C. Dornsife/Los Angeles Times Daybreak poll, which has emerged as the biggest polling outlier of the presidential campaign. Despite falling behind by double digits in some national surveys, Mr. Trump has generally led in the U.S.C./LAT poll. He held the lead for a full month until Wednesday, when Hillary Clinton took a nominal lead.

Our Trump-supporting friend in Illinois is a surprisingly big part of the reason. In some polls, he’s weighted as much as 30 times more than the average respondent, and as much as 300 times more than the least-weighted respondent.

Here's a figure they produced showing how this sort of "extreme" weighting affects the polling result reported:

The problem here is that when one individual in the sample counts 30 times more than the typical respondent, the effective sample size is actually something much smaller than actual sample size, and the "margin of error" is something much higher than +/- 3%.
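One common way to put a number on "effective sample size" is Kish's approximation, which divides the squared sum of the weights by the sum of the squared weights. The sketch below applies it to a stylized sample of 1,000 in which one respondent carries a weight of 30 and everyone else a weight of 1. Both the choice of Kish's formula and the weights are assumptions for illustration; they are not the actual weights or method of the U.S.C./LAT poll, where many respondents have unequal weights.

```python
import math


def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error on a proportion for a sample of (effective) size n."""
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    equal = [1.0] * 1000
    # Stylized, hypothetical weights: one respondent counts 30 times the rest.
    skewed = [1.0] * 999 + [30.0]

    for label, weights in (("equal weights", equal), ("one weight of 30", skewed)):
        n_eff = kish_effective_sample_size(weights)
        print(f"{label}: effective n = {n_eff:.0f}, "
              f"margin of error = +/-{margin_of_error(n_eff):.1%}")
```

With equal weights the effective sample size equals the actual sample size; a single heavily weighted respondent cuts it sharply and widens the margin of error accordingly.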

There are many additional types of biases and errors that can influence survey results (e.g., How was the survey question asked? Is there an interviewer bias? Is the sample drawn from a list of all likely voters?).   This doesn't make polling useless.  But, it does mean that one needs to be a savvy consumer of polling results.  It's also why it's often useful to look at aggregations across lots of polls or, my favorite, betting markets.

Land Use in the United States

In our departmental seminar on Friday, we had a speaker with a background in forestry.  He showed some graphs related to the amount of forest land in the United States, and I have to say I was a bit surprised how much land is in forest.  

Here are some useful (if slightly dated) figures on land use from the USDA Economic Research Service. The figure, taken from a longer document, shows the breakdown:

Of all the land in the U.S., only 14.8% is in cropland used for crops (it's 17.7% in the contiguous 48 states).  27.1% is in grassland or pasture (32.3% in the 48 contiguous states).  About a quarter of the land (both in the US as a whole and in the lower 48) is in forest that is not grazed, and another 5.6 to 6.7% is in grazed forest land.   By the way: Special uses includes: "rural transportation, national/ State parks, wilderness and wildlife areas, national defense and industrial areas, and farmsteads and farm roads" and miscellaneous land includes "marshes, open swamps, bare rock areas, desert, and tundra."

Also from the 2007 report:

Total cropland increased in the late 1940s, declined from 1949 to 1964, increased from 1964 to 1978, and decreased again from 1978 to 2007. Between 2002 and 2007, total cropland decreased by 34 million acres to its lowest level since this series began in 1945 . . .

These are useful statistics in light of the common sorts of things I read like "agriculture has more impact on the environment than any other human activity" or "agriculture is the biggest threat to the environment."  

Zilberman on the Slow and Natural Food Movement

David Zilberman, an agricultural economist at UC Berkeley, has an interesting blog post on the slow and natural food movements. The timing of his piece is impeccable given the long, aggressive defense of the food movement Michael Pollan just wrote in the New York Times Magazine. After a bit of praise for the movements, Zilberman gets to some critiques.

Here are the core criticisms:

However, most of these bodies of thought emphasize advocacy and are short on analysis. In particular, they underemphasize several factors. First, they underemphasize tradeoffs and costs. There are tradeoffs on the demand side, where consumers choose food based on cost, taste, and convenience. Fast food is a huge industry for a reason. The development of ready-to-cook and ready-to-eat meals, modern equipment (electric stoves, refrigerators, and microwaves), and modern supermarkets have been contributors in enabling women to join the job market. At the same time, there are tradeoffs on the supply side between cost of production and technology.

and

Second, the naturalized paradigms undervalue the importance of technology in production and distribution. Modern lifestyle is the result of immense innovations in medicine, biology, communication, etc. I am very aware of the risks that technologies pose, but when I see a poor farmer in Ivory Coast with a cell phone and bicycle, I realize the power of technology. ... The challenge is how to use it appropriately and spread its distribution broadly rather than giving up on it.

and

Third, the naturalist paradigm underestimates the importance of heterogeneity among people and regions. Differences in income lead to different food choices. ... There is a huge difference between farmers in Iowa that obtain more than 10 tons/Hectare of corn and farmers in Africa that may obtain 1.5 tons/Hectare. ... I don’t expect people to use the same techniques everywhere, and that different technologies are appropriate in different locations.

On his last point, I fully agree:

Heterogeneity brings me to a larger point. There is a place for both industrial and naturalized agricultural systems. The naturalization paradigm is leading to the emergence of higher-end restaurants and fresh food supply linking the farmer to the consumer, each of which have limited reach but are important source of income and innovation in agriculture. At the same time, the majority of people will be dependent on industrialized agriculture. The two can coexist and coevolve.

Assorted Links

  • The New York Times published several letters responding to my piece on environmental improvements at large farms.  I'm happy to have sparked a conversation!
  • My paper with Trey Malone on the effect of California's animal welfare laws on egg prices has been officially published by the Journal of Agricultural and Resource Economics.  I should note that the state of Massachusetts has a ballot initiative before voters this November that will, if passed, make its egg-housing policies similar to those in California.
  • Speaking of Trey, another of our co-authored papers was finally released by the Journal of Entrepreneurship and Public Policy.  The paper looks at the effect of state alcohol laws on the number and type of breweries in the state.  "We find that allowing breweries to sell beers on-premises as well as allowing for breweries to self-distribute have statistically significant relationships with the number of microbreweries, brewpubs, and breweries."
  • Interesting result from analyses of NHANES data: "A very small percentage (2.1%) of the U.S. population aged 1 year and older identified themselves as vegetarians; and within this group, only about 3% were true vegans - they did not report consuming any animal protein sources on any given day" and "Among these self identified vegetarians, almost half reported consumption of meat, poultry, or seafood."  

Why is the milk at the back of the store?

That was the question asked in a Planet Money podcast, which re-aired earlier this week.  

The conventional answer was provided by the food writer Michael Pollan:

I’ve come to understand the landscape of a grocery store as a brilliantly designed landscape to get us to buy as much food and as much expensive food as possible. So my general impression is that the milk is in the back.

And it’s - but - and it’s not just that the milk is in the back. It’s also usually very far from the bread. Both of them are very common items that everybody needs, and so it makes you cover a lot of ground if you want them.

Another perspective was provided by the economist Russ Roberts:

Russ thinks the reason the milk is in the back is practical. It’s easier to keep the milk cold if it’s there. The delivery trucks come into the back of the store. Milk goes right into this refrigerated room that’s often right behind the cooler where you grab your milk. No one has to lug the milk through the store to some cooler in the front.

ROBERTS: Milk spoils very easily. I was told that for every degree of temperature it rises, it loses a day of being available and being sellable. So the argument I’m making, which is kind of a radical argument, is that you and I want the milk in the back, even though it’s a little less convenient. If it were in the front, it would be more expensive, and we’re not willing to pay that extra price. So I think they’re actually doing what we want, not what they want.

How can we know whose view is right?  To answer the question, we'd need to observe a world where milk doesn't need to be refrigerated and then see where grocery stores - in this alternative universe - place the milk.

As it turns out, such an alternative universe actually exists!  It's called France.

I spent part of 2011 on sabbatical in Paris, and when first grocery shopping I was surprised to find the milk often sitting right next to the laundry detergent or the cereal, unrefrigerated.  How is this possible? For reasons that aren't entirely clear to me, much of the milk sold in France is ultra-high temperature (UHT) pasteurized.  The process makes the milk "shelf stable" - it doesn't spoil when left unrefrigerated (I personally thought it tasted pretty terrible).

So, where do French grocery stores stock the milk?  I only have anecdotal evidence based on my own shopping experiences, but by and large I'd say it was NOT at the back of the store. It was often situated somewhere near the center of the store.  Moreover, some stores sold both UHT milk and "regular" refrigerated milk, and the refrigerated milk was typically at the back in a refrigerated case while the UHT milk was situated elsewhere in the store.  

My take: Russ Roberts 1, Michael Pollan 0.

P.S.  There is another line of evidence in favor of Russ's view.  Where do you typically find (unrefrigerated) soy milk in your grocery store?  In our local Walmart, it's in the center of the store, not at the back.