A recent article in the journal PLoS ONE by the anti-sugar crusader Robert Lustig and three co-authors has created quite a stir by purporting to show that increased sugar consumption causes diabetes. In the paper, the authors stop just shy of saying "cause," but that is the inference drawn by many in the media (see, for example, this story in Bloomberg, among other places), who say things like:
Excessive sugar consumption may be the main driver of a global rise in diabetes,
Moreover, on Mark Bittman's NYT blog, Lustig is quoted as saying:
This study is proof enough that sugar is toxic. Now it’s time to do something about it.
There is no way a study like this (comparing differences across countries) can firmly establish causation. At most, the study indicates an interesting (and perhaps suggestive) correlation that might warrant a randomized controlled trial. Nonetheless, I was intrigued and wanted to check out the evidence for myself.
The evidence from Lustig and colleagues comes from linking data on diabetes prevalence rates across countries (which I was able to easily find online here) with data from the UN FAO on the availability of calories from different foodstuffs in different countries (after a bit of digging, I was also able to find it online here - go to the "food balance sheets"). After a bit of effort, I downloaded both data sets for the most recent years available, merged them, and checked out the claims made in the paper.
At first blush, I find very similar results to the ones reported in the paper. Holding constant total calories available, a simple linear regression shows that for every 100 kcal increase in per-capita sugar availability, the prevalence of diabetes goes up by 1.3 percentage points (say, from the sample mean of 8.5% to 9.8%). The estimated equation is:
(% with diabetes)=1.067+0.013*(per-capita available sugar kcal)+0.001*(per capita total available kcal)
My estimate is a little higher than the one reported in the paper probably because I'm not controlling for other factors (like GDP, kcal intake from meat, etc.) as the authors did. Moreover, I'm using data on diabetes from 2012 whereas the authors used 2011 and older data (note: I use data from 174 countries in my estimates). The only coefficient significant at the p=0.05 level in the above equation is the 0.013 estimate associated with sugar.
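To make the arithmetic concrete, here is a minimal Python sketch that plugs numbers into the unweighted equation above. The coefficients are the ones reported in this post; the helper name and the example inputs (400 and 500 sugar kcal, 2,800 total kcal) are made up purely for illustration and are not taken from the data.

```python
# Coefficients from the unweighted equation reported above.
INTERCEPT = 1.067
B_SUGAR = 0.013   # percentage points per per-capita sugar kcal
B_TOTAL = 0.001   # percentage points per per-capita total kcal

def predicted_prevalence(sugar_kcal, total_kcal):
    """Predicted % of the population with diabetes (hypothetical helper)."""
    return INTERCEPT + B_SUGAR * sugar_kcal + B_TOTAL * total_kcal

# Illustrative inputs only: two countries with the same total calories
# that differ by 100 kcal of per-capita sugar availability.
low = predicted_prevalence(sugar_kcal=400, total_kcal=2800)
high = predicted_prevalence(sugar_kcal=500, total_kcal=2800)
print(round(high - low, 2))  # 1.3
```

The 1.3 percentage point gap depends only on the sugar coefficient, because total calories are held fixed in the comparison.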
So far so good - the correlation is confirmed.
But let's get to the nitty gritty of the interpretation. The data is at the country level. So, what this implies is that a country that increases per-capita sugar availability by 100kcal will tend to have a 1.3 percentage point increase in the percent of the population with diabetes.
But, we don't really care about countries per se. We care about people. There are a lot more people in some countries than others. In the data set, the range is from a low of 0.00066 million adults to 980 million adults. Shouldn't this factor into the analysis? If we care about how many people in the world have diabetes, we'd better pay a lot more attention to China than to Luxembourg.
We know from the mini-scandal associated with the claim that small schools outperform larger ones (see one account here) that outcomes from small schools (or small countries) tend to be a lot more variable (with more outliers) than outcomes from large schools (or large countries). That's just basic statistics.
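A rough way to see the small-versus-large point is the textbook standard error of a proportion, sqrt(p(1-p)/n). The sketch below treats a prevalence rate as if it were a proportion observed on n independent people, which is a simplification of how country-level figures are actually produced; the population sizes and the function name are illustrative assumptions.

```python
import math

def prevalence_standard_error(p, n):
    """Standard error of a proportion p observed on n independent people."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative numbers only: the same underlying prevalence (8.5%)
# in a small population and in a large one.
se_small = prevalence_standard_error(0.085, 1_000)
se_large = prevalence_standard_error(0.085, 1_000_000)

# The ratio is sqrt(1_000_000 / 1_000) = sqrt(1000), about 31.6.
print(se_small / se_large)
```

Under this simplification, a population a thousand times smaller gives an estimate that is roughly 32 times noisier, which is why extreme values tend to show up among the small units.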
Intuitively, we should want a larger country to count more than a smaller one. After all, there are many more people in larger countries - so if we want to think about the prevalence of diabetes in the world (rather than the average prevalence rate across countries), we'd want to calculate a weighted average, where larger countries get more weight (because they have more people). The more people, the higher the weight.
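The difference between the two averages can be sketched in a few lines of Python. The three "countries" and their numbers below are hypothetical and chosen only to make the contrast visible; they are not figures from the data set.

```python
# Hypothetical (prevalence %, adult population in millions) pairs;
# these are NOT the real figures for any country.
countries = {
    "Large": (9.0, 900.0),
    "Medium": (6.0, 50.0),
    "Tiny": (15.0, 0.5),
}

def simple_mean(data):
    """Average of country prevalence rates: every country counts equally."""
    return sum(p for p, _ in data.values()) / len(data)

def weighted_mean(data):
    """Population-weighted average: every adult counts equally."""
    total_pop = sum(pop for _, pop in data.values())
    return sum(p * pop for p, pop in data.values()) / total_pop

print(round(simple_mean(countries), 2))    # 10.0
print(round(weighted_mean(countries), 2))  # 8.85
```

In this toy example the tiny country with a very high rate pulls the simple average up to 10%, while the weighted average stays close to the rate in the large country, where almost all of the people live.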
Likewise, when we want to run analyses like the one above, we want to give more weight to countries with more people. We can do this by running a weighted regression, where each country gets a weight proportional to its population size. This converts the equation from one about how countries differ to one about how individuals differ. Stated differently, the weighted regression places the estimates at the level of the individual (picked at random from any country) rather than the level of the country (picked at random from a group of countries).
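For readers who want to see the mechanics, here is a minimal sketch of unweighted versus population-weighted least squares in Python with NumPy. It is not the SAS code used for the estimates in this post; the function name `fit_linear` and the five-country data set are invented for illustration, and the sketch returns only coefficients (no standard errors or p-values).

```python
import numpy as np

def fit_linear(sugar, total, y, weights=None):
    """Least-squares fit of y = b0 + b1*sugar + b2*total.

    With weights, minimizes sum(w_i * residual_i**2) (weighted least
    squares); without, all countries count equally (ordinary least squares).
    """
    sugar, total, y = (np.asarray(a, dtype=float) for a in (sugar, total, y))
    X = np.column_stack([np.ones_like(sugar), sugar, total])
    if weights is None:
        weights = np.ones_like(y)
    # Scaling each row by sqrt(w_i) turns the weighted problem into an
    # ordinary one that np.linalg.lstsq can solve.
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    coefs, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return coefs  # [intercept, sugar slope, total-kcal slope]

# Made-up data for five hypothetical countries (not the real data set).
sugar = [200, 300, 400, 500, 600]        # per-capita sugar kcal
total = [2500, 2600, 3200, 2900, 3400]   # per-capita total kcal
diabetes = [6.0, 7.5, 9.0, 12.0, 9.5]    # % with diabetes
adults = [900, 700, 40, 0.5, 5]          # adult population, millions

print(fit_linear(sugar, total, diabetes))                  # unweighted
print(fit_linear(sugar, total, diabetes, weights=adults))  # population-weighted
```

With weights this lopsided, the weighted fit is driven almost entirely by the two largest hypothetical countries, which is exactly the behavior a population-weighted regression is meant to have.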
Here is the equation I get when I weight by a country's adult population:
(% with diabetes)=0.692+0.002*(per-capita available sugar kcal)+0.002*(per capita total available kcal)
Now, the effect of sugar falls dramatically (and most importantly, it is no longer statistically significant at standard levels; the p-value is 0.074). A 100 kcal increase in per-capita sugar availability only increases the % with diabetes by 0.2 percentage points (rather than 1.3 as previously estimated). Moreover, total energy from all sources is now significant and roughly the same magnitude as sugar. Thus, what matters in this framework is total kcal from any food source. Moreover, this regression suggests that a sugar calorie is roughly the same as any other calorie insofar as affecting diabetes.
The paper at PLoS ONE says "regressions are population weighted." But I'm wondering whether that is indeed the case. It could be true. I don't have access to all their data and I'm not including all their controls.
I'm happy to share the data and SAS code with anybody who cares to see it.
The nice thing about the web is that you get feedback. Here's an update. The source that reports diabetes prevalence actually reports three measures. In the regressions above, I used national prevalence (total number with diabetes divided by total population). However, as indicated at the data source here, they also report an age-adjusted measure that is likely more useful for comparing across countries that might have different mean ages.
When I use this "IGT comparative prevalence" measure, as they call it, then I get exactly the opposite of the results mentioned above. When the data are NOT weighted, the sugar coefficient is only 0.0019 (p-value 0.27). But, when the data ARE weighted by adult population, the sugar coefficient is 0.01277 (p-value < 0.001).
So, there is an interesting mix of things going on here between the population, weighting, and age adjustment. Just out of curiosity, and for some robustness checks, I did two things. First, I re-ran the "preferred" model with population weighting using "IGT comparative prevalence" diabetes but included population as an explanatory variable. When I do this, sugar is no longer statistically significant (the estimate is 0.00242 with a p-value of 0.107), but population is (the estimate suggests larger populations have lower diabetes prevalence). I can't quite figure out what is going on here. One possibility is that, because the model is weighting by population while the dependent variable (and the independent variables) are per-capita (i.e., are divided by population), the combination is producing some unexpected results.
Second, I ran a quantile regression to see how the results hold up at the median (rather than the mean, which is more sensitive to outliers). Using IGT comparative prevalence and adult population as a weight, with only sugar and total calories as explanatory variables, I find that the sugar effect at the median is 0.0148, but the 95% confidence interval is (-0.0191, 0.0217) when using the SAS default rank method of calculating standard errors. The 95% confidence interval changes to (0.0041, 0.0254) when using an alternative resampling method. So, whether the median effect is statistically significant depends on which method of calculating standard errors is used.
Here is the plot of the "sugar effect" at each quantile. The first shows the 95% confidence intervals determined by the resampling method and the second uses the SAS default (I have to admit that I'm not sure which method is preferred in this case).