March 21, 2009
Integrated nutrition, lifestyle and health database: epidemiological information for an improved understanding of diseases of civilization
Author: O Primitivo. Category: Health
Figure: "Nutrition, Lifestyle and Health Database",
Canibais e Reis (2009).
Nutrition, lifestyle and health data sources
A lot of nutrition and health statistics are freely available on-line. Unfortunately, most of this data is dispersed and in different formats, which makes it relatively difficult to compare data from different sources. WHO publishes health data but no food statistics. FAO has food statistics but doesn’t provide any health data. Epidemiological data that crosses these two types of data is an essential tool for studying and formulating associations between modern lifestyle and the diseases of civilization, and because I couldn’t find any freely available on-line database that does this, I decided to build my own. I started by collecting data from some of the sources listed at the end of this article. To be more precise, the main sources I used were the following:
- FAO Statistical Yearbook 2005-2006 - consumption, for world nutritional data (energy intake, macronutrient distribution, etc. for more than 180 countries);
- FAOSTAT consumption - crops and FAOSTAT consumption - livestock and fish, for detailed world nutrition data (animal and vegetal products, cereals, fruits, sugars, olive oil, butter, fish, meat, and much more);
- Health statistics (British Heart Foundation), for European and world disease statistics (heart related, obesity, diabetes, blood pressure, total cholesterol);
- WHO Global Health Atlas and WHO Statistical Information System, for general world health statistics (mortality, socio-economic, drinking water, sanitation, tobacco use, etc.).
Because of the importance of Hormone D I also included two sun-exposure-related parameters: latitude (the average latitude of each country) and the corresponding annual insolation. (In the end, I didn’t find the important correlation of these parameters with health and disease that I had expected.) Also, to my disappointment, I couldn’t find any recent world data on saturated fat, only data from 1998, so this database doesn’t include saturated fat. If you know where I can find it, please let me know by commenting on this article.
After joining all the raw data into a single Excel workbook with several worksheets (DATABASE-FAOybk-BHFstats-WHOsys.zip), the major problem arose: merging everything into a homogeneous format. Because most data was available for 2003, I chose this year for most parameters. Some parameters refer to nearby years; for example, total cholesterol data refers to 2005. The fully integrated database obtained this way includes data from 167 countries and a total of 106 parameters (nutritional, health, lifestyle and disease-related parameters).
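The merging step can be pictured as an outer join on country name. The following is only a minimal Python sketch of that idea, not the Excel procedure used for the workbook: the function name `merge_sources`, the parameter names and the placeholder health value are assumptions made for illustration, and it assumes country names are already spelled identically in every source. The two energy values are the FAO figures quoted later in this article.

```python
def merge_sources(sources):
    """Outer-join several {country: {parameter: value}} tables on country name.

    A parameter that a source does not report for a country is left as None.
    """
    # Collect every parameter name, keeping first-seen order.
    all_params = []
    for table in sources:
        for row in table.values():
            for param in row:
                if param not in all_params:
                    all_params.append(param)

    merged = {}
    for table in sources:
        for country, row in table.items():
            # Start each country with all parameters missing, then fill in.
            merged.setdefault(country, dict.fromkeys(all_params))
            merged[country].update(row)
    return merged


# Hypothetical miniature inputs (parameter names are illustrative only).
fao = {"Portugal": {"energy_kcal": 3750}, "USA": {"energy_kcal": 3770}}
who = {"USA": {"example_health_param": 1.0}}  # placeholder value, not real data

merged = merge_sources([fao, who])
print(merged["Portugal"])
```

Countries missing from one source end up with gaps (None), which is exactly the situation the next section deals with.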
167 countries reduced to 86 countries
Unfortunately this "full" database of 167 countries (see worksheet ‘FULL-2003’) has many missing values in quite a few countries, especially in the world obesity and blood pressure data from the British Heart Foundation and in the world socio-economic data from the World Health Organization. So, in order to get a really complete database, without any missing values, 81 countries (rows) had to be removed, as well as 6 parameters (columns). I don’t consider this a major problem, as the final database (see worksheet ‘FINAL-2003’) still includes 86 countries and 100 parameters, and this is more than enough to reach some interesting conclusions. At last, a truly complete database!
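For readers who want to reproduce this kind of reduction outside Excel, here is a minimal Python sketch of one way to do it: drop the sparsest parameters first, then drop every country that still has a gap. The rule used to pick which columns to drop (`max_missing_share`) is an assumption, since the article does not say how the 6 parameters were chosen, and the table below is made-up toy data, not values from the workbook.

```python
def complete_cases(table, max_missing_share=0.5):
    """Return a table with no missing values.

    table: {country: {parameter: value or None}}, every row with the same keys.
    Step 1 drops parameters missing in more than max_missing_share of countries.
    Step 2 drops countries that still have at least one missing value.
    """
    countries = list(table)
    params = list(next(iter(table.values())))
    kept_params = [
        p for p in params
        if sum(table[c][p] is None for c in countries) / len(countries)
        <= max_missing_share
    ]
    return {
        c: {p: table[c][p] for p in kept_params}
        for c in countries
        if all(table[c][p] is not None for p in kept_params)
    }


# Toy table with invented values, only to show the mechanics.
full = {
    "Country A": {"param_1": 1.0, "param_2": 10.0, "param_3": None},
    "Country B": {"param_1": 2.0, "param_2": 20.0, "param_3": None},
    "Country C": {"param_1": 3.0, "param_2": None, "param_3": None},
    "Country D": {"param_1": 4.0, "param_2": 40.0, "param_3": 7.0},
}

final = complete_cases(full)
print(sorted(final), list(final["Country A"]))
```

Dropping sparse columns before rows keeps more countries than dropping rows alone would: here `param_3` goes first, so only Country C is lost.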
Perhaps the first question to be asked is: how reliable is this database? Well, the data sources must be considered reliable, as they are provided by well-known international and world health authorities. You may check them by visiting the links provided at the end of this article. Concerning the work I’ve done here, it must be noted that processing such relatively large amounts of data, available from different sources and in variable formats, without anyone helping me or validating the work, is certainly prone to some degree of error. Because of this, I ask anyone using this data to always double check their results against the original sources, especially if you find some unexpected correlations. Also, please report to me any errors that you may find.
Correlation does not imply causation
After the above warnings, one could ask what can be done with this data, a total of 100 parameters from 86 countries. Well, if you’re into statistical analysis, and this includes canonical analysis, redundancy analysis, principal component analysis, etc., then quite a lot of hypotheses can be formulated and tested. But the simplest analysis to begin with is trying to understand how each parameter relates to the other parameters. In statistics this is called correlation analysis, and most certainly everyone who studied some math at university understands what it means and how it is done (see worksheet ‘CORREL-2003’).
In the correlation analysis I’ve done, I first divided the existing 100 parameters into 70 ‘nutrition & lifestyle’ parameters and 30 ‘health and disease’ parameters. These can be combined into 2100 (70 x 30) pairs of parameters, and their correlations can be ranked. That’s what I did in worksheets ‘BEST’ and ‘WORST’. Notice that in epidemiological observational data, ‘correlation’ does not necessarily imply ‘causality’. Assuming that it does is a common mistake that some investigators make, due to a-priori conjectures that don’t exist in reality. Regarding this issue, please read the Wikipedia correlation does not imply causation article and also the correlations does not equal causation article by Stephan, the author of the Whole Health Source blog.
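The pairing and ranking described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the formulas used in the workbook: it uses the ordinary Pearson coefficient, ranks pairs by absolute value of r (the ‘BEST’ and ‘WORST’ worksheets may order them differently), and runs on invented toy columns named `x1`, `x2`, `y1`, `y2`.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_pairs(lifestyle, health):
    """Correlate every lifestyle column with every health column.

    Returns (lifestyle_name, health_name, r) tuples, strongest |r| first.
    """
    pairs = [
        (a, b, pearson(lifestyle[a], health[b]))
        for a, b in product(lifestyle, health)
    ]
    return sorted(pairs, key=lambda t: abs(t[2]), reverse=True)


# Invented toy columns: one value per country, same country order everywhere.
lifestyle = {"x1": [1, 2, 3, 4], "x2": [1, 3, 2, 4]}
health = {"y1": [2, 4, 6, 8], "y2": [4, 1, 3, 2]}

ranked = rank_pairs(lifestyle, health)
for name_a, name_b, r in ranked:
    print(f"{name_a} vs {name_b}: r = {r:+.2f}")
```

With 70 lifestyle columns and 30 health columns the same function would return the 2100 pairs mentioned above.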
Confounding factors
According to Wikipedia, "a confounding variable, or confounding factor, lurking variable or confounder, is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable. The methodologies of scientific studies therefore need to control for these factors to avoid a type 1 error; an erroneous ‘false positive’ conclusion that the dependent variables are in a causal relationship with the independent variable. Such a relation between two observed variables is termed a spurious relationship".
Perhaps the strongest confounding factor in epidemiological studies involving several countries across different continents, as Stephan already pointed out to me, is wealth. Richer countries have more access to food and thus live with higher energy intakes. There are a few exceptions to this rule, of course, and I would cite here Portugal, ranked nr. 2 in the world in 2003 with 3750 kcal/day (against only 2780 kcal in 1980), right after the nr. 1 USA with 3770 kcal (I personally don’t believe this, but that’s the FAO data).
Because energy intake is a primary key parameter in health, probably much more than macronutrient distribution, physical activity or even food quality combined, it must also be the strongest confounder of all. (I know many people don’t agree with this, but the high correlation of r=0.82 between energy intake and years of life lost to non-communicable diseases, or diseases of civilization, strongly suggests it, although it doesn’t prove it.) One possible way to overcome this difficulty would be to compare only countries within the same class of energy intake, and this is what I’ve done in this database by providing a filtering option by "energy ranges". Or even better, if sufficient data is available for several consecutive years, different countries could be compared using data from distinct years, as long as energy intake for all of them is matched to a certain value or a narrow range around that value, for example 2800 kcal.
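The "energy ranges" idea amounts to grouping observations into bands of similar energy intake and comparing only within a band. Here is a minimal Python sketch of that grouping; the 500 kcal band width and the function name `group_by_energy_band` are assumptions (the workbook’s own ranges may differ), and the three observations are the FAO figures quoted in the previous section, treated as country-year rows so that different years can be matched as suggested above.

```python
def group_by_energy_band(rows, width=500):
    """Group (label, energy_kcal) observations into bands of equal width.

    Returns {(low, high): [labels]} where low <= energy_kcal < high.
    """
    bands = {}
    for label, energy_kcal in rows:
        low = int(energy_kcal // width) * width
        bands.setdefault((low, low + width), []).append(label)
    return bands


# Country-year observations, using the energy intakes quoted in this article.
observations = [
    ("Portugal 2003", 3750),
    ("USA 2003", 3770),
    ("Portugal 1980", 2780),
]

bands = group_by_energy_band(observations)
for (low, high), labels in sorted(bands.items()):
    print(f"{low}-{high} kcal: {labels}")
```

Any correlation would then be computed separately inside each band, so that differences in energy intake (and, indirectly, wealth) weigh less on the result.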
Visualizing data
Another way of analyzing data, and possibly the best one, is to draw some charts. This would be easy if we only had a few countries and a couple of parameters, but our database now has 100 parameters. For this purpose, I created a special Excel worksheet (see ‘CHARTS’) that automates the production of graphs based on the combination of any 2 parameters listed in the database. This allows for a total of 10,000 (100 x 100) different graphs, although of course most pairs of parameters aren’t correlated at all. I chose not to limit the combinations to those 70 x 30 pairs, because you might want to correlate, for example, ‘energy intake’ with ‘sugars and sweeteners’. This might be interesting if you want to explore the hypothesis that a variation in a certain parameter does not produce any effect on some other parameter. See below for a further discussion on this subject.
Please note that these automatic charts require Excel macros to be enabled, because Visual Basic is needed to generate them. (I couldn’t get this working in Open Office’s Calc or even with GO-OO, because their VBA support is still too limited to run this specific Excel/VBA example. I hope the open source community solves this problem in the next couple of years, as VBA is an incredible development environment that would be extremely useful in Open Office once it is fully implemented and compatible with Microsoft’s VB language.)
Multivariate statistical analysis
Things now get a bit more complicated because we have a lot of parameters, exactly 70 ‘nutrition & lifestyle’ parameters and 30 ‘health and disease’ parameters, and we would like to understand how some of those 70 parameters influence some of the other 30. Am I making myself clear here? For example, if you want to understand the influence of food and lifestyle on mortality and longevity, you might choose ‘energy intake’, ‘fat intake’ and ‘tobacco consumption’ and try to understand how these 3 parameters influence ‘adult mortality rate’, ‘mortality rate for cardiovascular disease’ and ‘healthy life expectancy’. Get the idea?
But, just to make things worse, as we already know, many of these parameters cannot be considered statistically independent. Fortunately mathematicians came up with a method called canonical correlation analysis (CCA), which looks for linear combinations of the two sets of parameters that are as strongly correlated with each other as possible, with each new pair of combinations uncorrelated with the previous ones. I suppose most multivariate analysis books explain this better than I can, so I will not develop the subject here. I must say that, despite being no expert in this area, I did some CCA on the parameters I found most relevant for a set of conditions, using the shareware version of XLSTAT, which can be downloaded here.
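For reference, the textbook formulation of canonical correlation analysis can be written compactly. The notation below is an assumption introduced here, not taken from XLSTAT or from the workbook: X is the vector of chosen ‘nutrition & lifestyle’ parameters, Y the vector of chosen ‘health and disease’ parameters, and the Sigma matrices are their covariance blocks.

```latex
% First canonical correlation: the best-correlated pair of linear combinations.
\[
\rho_1 \;=\; \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top}X,\; b^{\top}Y\right)
       \;=\; \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
             {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
\]
% All squared canonical correlations are eigenvalues of one matrix:
\[
\Sigma_{XX}^{-1}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\; a_k \;=\; \rho_k^{2}\, a_k
\]
```

With 3 parameters on each side, as in the example of the previous paragraph, there are at most 3 canonical correlations, and the eigenvalues reported by CCA software are these squared values.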
If you are experienced in multivariate analysis, I would love to hear your comments and suggestions concerning the analysis I did here. I really need help in this area from an expert in statistics.
Some inconvenient truths?
If you arrived here and want to try the database for yourself, please download it from the link below (DATABASE-FAOybk-BHFstats-WHOsys.zip). As I already mentioned, this is an Excel workbook that includes several worksheets and requires VBA/macro support in order to produce the automated Excel charts described above. I consider this database just a preliminary version that may contain errors, so I urge you to be very careful and double check any data you extract from it against the original FAOSTAT, WHOSIS and BHF data, which can be accessed from the list of links at the end of this article. This is an ongoing effort, so please report any errors that you may find so that I can correct them in future versions of this database.
If everything is correct and no major errors are found, we can start looking at the best correlated parameters and their corresponding graphs, and perhaps start finding by ourselves some of what I would call ‘inconvenient truths’. For example, as many of us already know and as some health blogs recently reported, namely the Whole Health Source blog and the Hyperlipid blog, which by the way are two excellent blogs that I visit on a daily basis, total cholesterol doesn’t seem to be related to heart disease at all, contrary to what we are constantly told by health authorities. Have a look at this graph, cholesterol-cardiovasc-men.gif, and you’ll understand that it takes a lot of imagination to draw a trend line between ‘total cholesterol, men, 2005’ and ‘cardiovascular disease mortality, both sexes, 2002’ anywhere in it.
Also, if you graph ‘total cholesterol, female, 2005’ against ‘life expectancy at birth, female, 2003’ you will get this cholesterol-longevity-female.gif, a truly high correlation of r=0.73, where a higher total cholesterol level would supposedly provide higher longevity for females. I don’t believe the contrary, that lower total cholesterol levels would provide longer life, as this would go against the evidence. As we have seen before, and it is never too much to repeat this epidemiological mantra, ‘correlation does not imply causation’. On the other hand, I suppose a strong inverse correlation between ‘total cholesterol, men, 2005’ and ‘adult mortality rate, both sexes, 2003’ necessarily means that higher cholesterol levels by themselves cannot imply higher mortality. In other words, if two variables are highly correlated, even though one of them does not necessarily cause the other, we can assume non-causality for the exactly inverse relation. Is this clear, or not?
Final comments
If you try this type of parametric analysis, or even the canonical analysis described above, you may find yourself doing science and perhaps history. I’m not joking! I say this because, until now, I haven’t seen any analysis of this type with more than a few countries. The classical example of this kind of epidemiological observational analysis, and of what can go wrong if we use only a very small sample of the whole data, is what Dr. Ancel Keys did in the early 1950s when he ‘established’ his never proven dietary fat-heart disease hypothesis. This has been the subject of some previous posts of mine, written in Portuguese, so I will not repeat them here.
Just to finish this article, it must be noted that this Excel database, despite being the result of some really hard and careful (unpaid) work, is provided as is, without any warranty whatsoever, either explicit or implicit, of adequacy for any purpose. Anyway, despite this necessary disclaimer, I sincerely believe it works fine. You can test it just to make sure it really does, and perhaps improve it and/or report any errors that you may find, so that future versions of this free database are more reliable and error free.
In the next few weeks I’ll start my own data mining on this data, using some techniques that are familiar to me, like multiple non-linear regression and a powerful genetic algorithm called gene expression programming, which was created by a Portuguese scientist.
Note to Portuguese readers: This article was exceptionally published in English in order to make it easier to exchange ideas with other bloggers who are interested in this kind of statistical information but do not speak Portuguese. Fortunately, most Portuguese and Brazilians can read and/or understand English, while the reverse is not so often the case.
Nutrition, lifestyle & health database:
DATABASE-FAOybk-BHFstats-WHOsys.zip (1.95 Mb)
Last updated: 29.03.2009; Only minor changes.
On-line freely available databases:
WHO Global Health Atlas
WHO Statistical Information System
WHO burden of disease: 2004 update
WHO comparative health risks
WHO disease and injury estimates 2004
WHO disease and injury country estimates
WHO data sources and methods
WBank burden of disease & risk factors
FAOSTAT consumption - crops
FAOSTAT consumption - livestock and fish
FAO Statistical Yearbook 2004 - consumption
FAO Statistical Yearbook 2004 - countries
Health statistics (British Heart Foundation)
European Cardiovascular Statistics
Obesity, physical activity and diet in England
Related websites:
Epidemiology (Wikipedia)
XLSTAT, Statistical software for MS Excel
BioEstat 5.0 (Portuguese software)
Correlation (Wikipedia)
Correlation does not imply causation (Wikipedia)
Confounding variable (Wikipedia)
Canonical analysis (Wikipedia)
Understanding canonical correlation analysis
Canonical Analysis (StatSoft)
Why There Is No Statistical Test for Confounding, Why Many Think There Is, and Why They Are Almost Right
In Portuguese:
Uma Breve Introdução à Epidemiologia (Rev. Vigilância em Saúde Pública)
Texto Introdutório de Epidemiologia (Univ. Federal do Rio de Janeiro)
Os Caminhos da Estatística e suas Incursões pela Epidemiologia
Epidemiologia, demografia, história da medicina (Paulo Lotufo)