August 7, 2010
CHINA STUDY: Which diet components are associated with more and less vascular disease mortality?
Following up on the CHINA PROJECT database I compiled a few weeks ago, I just got the idea of testing which combinations of diet items are associated with more or less of a certain health condition. So, instead of just correlating single diet or nutrient items with mortality items, as was done in the CHINA PROJECT/STUDY, why not correlate linear combinations of diet items, to try to identify which combinations of foods are actually promoting those mortalities? I’m not an epidemiologist and I don’t know if this is a common procedure or not, but I suppose this is an interesting hypothesis. For example, let’s start by trying to find which groups of diet components from the CHINA PROJECT data are associated with more or less vascular disease mortality (M067-VASC STRc). To do this, we must select from the CHINA PROJECT the diet items that are most important. You can find the full list of 639 items here. The vascular disease mortality item is this one:
M067 - VASC-STRc - mortality ALL VASCULAR DISEASE EXCEPT STROKE AGE 35-69 (stand. rate/100,000) (ICD9 390-459, excl 430-8 & 416-7)
Source: CHINA PROJECT Mortality Correlations, M067-VASC STRc (Adapted).
In the table above we have some of the most relevant/significant correlations found in the CHINA PROJECT between a few diet items and vascular mortality. These numbers, theoretically ranging from -100 to +100, are correlations expressed as percentages. So, for example, the correlation of item M067-VASC STRc with item D038 - WHTFLOUR (diet survey WHEAT FLOUR INTAKE g/day/reference man) is +65%, or r=+0.65, which is a strong, positive and significant association given the usual large sample size of the CHINA PROJECT items. Since we have all the data from the CHINA PROJECT, we can now combine some diet items, those that can be added, and see what the correlation of these grouped items would be with, for example, our item M067. This way, we could find which combinations of diet components better explain a higher or lower vascular disease mortality. So, from the items list I decided to choose these 22 items (Diet Survey) that I find most relevant:
D037 - RICE - diet survey RICE INTAKE (g/day/reference man, air-dry basis)
D038 - WHTFLOUR - diet survey WHEAT FLOUR INTAKE (g/day/reference man, air-dry basis)
D039 - OTHCEREAL - diet survey OTHER CEREAL INTAKE (g/day/reference man, air-dry basis)
D040 - STCHTUBER - diet survey STARCHY TUBER INTAKE (g/day/reference man, fresh weight)
D041 - LEGUME - diet survey LEGUME AND LEGUME PRODUCT INTAKE (g/day/reference man, fresh weight)
D042 - LIGHTVEG - diet survey LIGHT COLOURED VEGETABLE INTAKE (g/day/reference man, fresh weight)
D043 - GREENVEG - diet survey GREEN VEGETABLE INTAKE (g/day/reference man, fresh weight)
D045 - FRUIT - diet survey FRUIT INTAKE (g/day/reference man, fresh weight)
D046 - NUTS - diet survey NUT INTAKE (g/day/reference man, as-consumed basis)
D047 - MILK - diet survey MILK AND DAIRY PRODUCTS INTAKE (g/day/reference man, as-consumed basis)
D048 - EGGS - diet survey EGG INTAKE (g/day/reference man, as-consumed basis)
D049 - MEAT - diet survey MEAT INTAKE (red meat and poultry) (g/day/reference man, as-consumed basis)
D050 - REDMEAT - diet survey RED MEAT (pork, beef, mutton) INTAKE (g/day/reference man, as-consumed basis)
D051 - POULTRY - diet survey POULTRY INTAKE (g/day/reference man, as-consumed basis)
D052 - FISH - diet survey FISH INTAKE (g/day/reference man, as-consumed basis)
D053 - ANIMFAT - diet survey ADDED ANIMAL FAT (for cooking, spreading etc) INTAKE (g/day/reference man)
D054 - VEGOIL - diet survey ADDED VEGETABLE OIL (for cooking etc) INTAKE (g/day/reference man)
D056 - STCHSUGAR - diet survey PROCESSED STARCH AND SUGAR INTAKE (g/day/reference man, as-consumed basis)
D057 - ADDEDSALT - diet survey INTAKE OF ADDED SALT (g/day/reference man)
D058 - SPICE - diet survey SPICE INTAKE (g/day/reference man)
D060 - BEER - diet survey BEER INTAKE (g/day/reference man)
D061 - WINE - diet survey WINE INTAKE (g/day/reference man)
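Written out as a formula, "combining items and correlating the group with M067" could look like the sketch below. The notation is introduced here only for illustration and is not from the CHINA PROJECT: d_{ik} is the intake of diet item k in county i, p_k is 1 if item k is included in the group and 0 otherwise, m_i is the M067 value in county i, and n is the number of counties with complete data.

```latex
s_i = \sum_{k=1}^{22} p_k \, d_{ik}, \qquad p_k \in \{0, 1\}

r = \frac{n \sum_i m_i s_i - \sum_i m_i \sum_i s_i}
         {\sqrt{n \sum_i m_i^2 - \left(\sum_i m_i\right)^2}\;
          \sqrt{n \sum_i s_i^2 - \left(\sum_i s_i\right)^2}}
```

This is the ordinary Pearson correlation coefficient, applied to the summed column s instead of a single diet item.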
There is just a "small" problem now. How many different groups of diet items can we generate by combining these 22 diet items? If we only had 2 diet items, we would have 4 possible groups: diet1=item1, diet2=item2, diet3=item1+item2 and diet4=none. If we had 3 items, we would have 8 possible groups (try all combinations and you’ll find 8). So, if we have n items, we have 2×2×…×2 (n times) possible diets, or 2^n. For n=22 diet items, we have 4,194,304 different diets. Yes, more than four million groups (they are not real diets, only combinations of diet items, which is slightly different) can be generated from only 22 food items. So, in order to solve this large problem, we need to build a computer routine to calculate the correlation coefficients of all our 4.2 million groups of items with the mortality item M067-VASC STRc. The groups with the highest positive association with this last parameter might be good candidates for "vascular disease promoting diets". On the other hand, those with the strongest negative association could possibly be considered "anti-vascular disease diets", because of their possibly protective role.
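The counting argument and the brute-force search can be illustrated with a small sketch. It is written in Python for brevity, not in the Visual Basic 6 of the actual program, and everything in it is an assumption made for illustration: the function names `pearson` and `scan_combinations`, the three toy items and the five made-up "county" values. It only shows the idea of trying every 0/1 mask and keeping the extreme correlations.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    dx = n * sxx - sx * sx
    dy = n * syy - sy * sy
    if dx == 0 or dy == 0:
        return 0.0
    return (n * sxy - sx * sy) / sqrt(dx * dy)


def scan_combinations(items, target):
    """Try every 0/1 mask over the item columns and return the extreme correlations.

    items  -- dict mapping an item name to its list of values (one per county)
    target -- list of outcome values, same length as each item column
    """
    names = list(items)
    best = (None, -2.0)   # (mask, r) with the highest r seen so far
    worst = (None, 2.0)   # (mask, r) with the lowest r seen so far
    count = 0
    for mask in product((0, 1), repeat=len(names)):
        count += 1
        # Sum the selected columns, county by county.
        combined = [
            sum(flag * items[name][i] for flag, name in zip(mask, names))
            for i in range(len(target))
        ]
        r = pearson(combined, target)
        if r > best[1]:
            best = (mask, r)
        if r < worst[1]:
            worst = (mask, r)
    return count, best, worst


# Made-up numbers: three hypothetical items in five hypothetical counties.
toy_items = {
    "item1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "item2": [5.0, 3.0, 4.0, 1.0, 2.0],
    "item3": [2.0, 2.0, 1.0, 3.0, 1.0],
}
toy_target = [10.0, 12.0, 15.0, 19.0, 20.0]

count, best, worst = scan_combinations(toy_items, toy_target)
print(count)     # 8 masks for 3 items, including the empty one
print(2 ** 22)   # 4194304 masks for 22 items
```

With 22 items the same loop would run about 4.2 million times, which is why the real run takes hours.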
So, the code for the program I wrote is at the end of this post, and it can also be downloaded, along with all the necessary files, from here. This is for Visual Basic 6 and the input files and executable are included. There are 2 input files: china-project/data1989.csv, which is all the CHINA PROJECT 1989 data; and the myparam.txt file, which is the list of parameters we want to correlate. The first line of this file is the mortality item (M067-VASC STRc) and the other 22 lines are the diet items we want to correlate with the mortality item. Simply execute the Project1.exe file (it has no virus, but always scan executables downloaded from the Internet with your virus scanner before running them anyway), wait about 2 hours, and the results will be written to the file myoutput.txt. Notice this program runs in the background and doesn’t open any window. You can tell it’s running (or cancel the process) by looking in the Task Manager for Project1.exe and/or by opening the file myoutput.txt during execution. After every 1 million correlations calculated (this takes about 30 minutes on my old Intel Core 2 Duo E6550 @ 2.33GHz with 2 GB of RAM), the program writes a progress line to this file. If you want to test correlations with different parameters, you can change the values in myparam.txt, but you can’t change the number of lines of that file, which should always be 23: the first line for the mortality item (you can change this to any item other than M067) and the other 22 lines for the diet items (you can also change these items, but not their total number) that will be combined in all possible ways and correlated with the item in the first line. If you want to test more or fewer than 22 items, you must change the code and recompile the program.
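The rules for the parameter file (one outcome line followed by exactly 22 item lines, each matching a column name in the data file) can be sketched as a small check. This is a Python illustration, not part of the Visual Basic 6 program; the function name `resolve_params`, the `expected_items` argument and the header names in the example are hypothetical, since the exact column names in data1989.csv are not shown in this post. Unlike the original program, the sketch stops with an error when the line count is wrong or a name is not found.

```python
def resolve_params(param_text, header, expected_items=22):
    """Map the lines of a parameter file to 0-based column positions in the header.

    param_text     -- contents of the parameter file: outcome first, then the items
    header         -- list of column names taken from the first row of the data file
    expected_items -- number of diet items the program is hard-wired for
    """
    names = [line.strip() for line in param_text.splitlines() if line.strip()]
    if len(names) != expected_items + 1:
        raise ValueError(f"expected {expected_items + 1} lines, got {len(names)}")
    position = {name: idx for idx, name in enumerate(header)}
    missing = [name for name in names if name not in position]
    if missing:
        raise ValueError(f"not found in header: {missing}")
    outcome, *items = names
    return position[outcome], [position[name] for name in items]


# Tiny example with hypothetical header names and only 2 items instead of 22.
header = ["County", "D037", "D038", "M067"]
outcome_col, item_cols = resolve_params("M067\nD037\nD038\n", header, expected_items=2)
print(outcome_col, item_cols)   # 3 [1, 2]
```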
The final result of this run will look like this: myoutput.txt. The program only outputs a correlation when it beats the previous ones found (a new maximum or a new minimum), otherwise the output file would easily reach several megabytes (I had this problem in the first versions of my program). This procedure left me thinking for a while, but I decided not to change the program. This way, we might miss some important correlations, but we will never miss the strongest of them. Anyway, I didn’t change it because I just wanted a sample of several good positive/negative correlations that could provide some degree of sensitivity to the effect of including or omitting some food items on the correlations. Since I got about 70 results for each positive/negative group, I thought this would be enough for my purpose. If you believe this might introduce some kind of bias, please let me know. We could also save all results with an absolute correlation above, for example, 0.6; this might be an even better procedure. Anyway, let’s analyse the results we got with this possibly less perfect method. As you can notice, the output file has several similar lines, like this one:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 -0.2119527
There are 22 numbers, 1 or 0, that represent which diet items are present (1) and which are absent (0). In the example above, only the last two items are present, and those are D060 and D061. The last number is the correlation of this group of diet items (D060+D061) with the mortality item M067. So, in order to analyse all the output more easily, I’ve created the Excel file mycorrelations.xls, ranking some of the most relevant positive and negative associations. This Excel worksheet was also useful to validate the correlations calculated by the computer program, which are equal to those calculated in Excel, so the computer routine is doing the correct calculations. For easier analysis of the results, I created these 2 pages (please print them in color if possible): m067-diets-correlations.pdf. The first half of the 1st page represents the diets with the highest positive correlations with M067-VASC STRc. The red cells are for items that appear to be more critical and could promote vascular disease in the amounts consumed in the context of the CHINA PROJECT. These items are (on the left is the correlation of that item with M067):
+65.1 D038 - WHTFLOUR - diet survey WHEAT FLOUR INTAKE (g/day/reference man, air-dry basis)
+49.5 D047 - MILK - diet survey MILK AND DAIRY PRODUCTS INTAKE (g/day/reference man, as-consumed basis)
+22.9 D058 - SPICE - diet survey SPICE INTAKE (g/day/reference man)
+15.4 D042 - LIGHTVEG - diet survey LIGHT COLOURED VEGETABLE INTAKE (g/day/reference man, fresh weight)
+14.7 D057 - ADDEDSALT - diet survey INTAKE OF ADDED SALT (g/day/reference man)
-13.8 D043 - GREENVEG - diet survey GREEN VEGETABLE INTAKE (g/day/reference man, fresh weight)
-8.7 D053 - ANIMFAT - diet survey ADDED ANIMAL FAT (for cooking, spreading etc) INTAKE (g/day/reference man)
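For anyone who prefers to read the raw output file rather than the Excel sheet, the line format described earlier (22 flags followed by one correlation) can be decoded with a few lines of code. The sketch below is in Python, not part of the original program, and `decode_line` is a hypothetical helper. It assumes the flags follow the order of the 22 diet items listed earlier in this post, which matches the D060+D061 example, and it skips any line that does not have 22 flags plus one number, such as the progress lines.

```python
# The 22 diet item codes, in the order they are listed earlier in this post.
ITEM_CODES = [
    "D037", "D038", "D039", "D040", "D041", "D042", "D043", "D045",
    "D046", "D047", "D048", "D049", "D050", "D051", "D052", "D053",
    "D054", "D056", "D057", "D058", "D060", "D061",
]


def decode_line(line, codes=ITEM_CODES):
    """Return (codes of the items present, correlation), or None for other lines."""
    fields = line.split()
    if len(fields) != len(codes) + 1:
        return None   # progress lines and blank lines end up here
    flags, r_text = fields[:-1], fields[-1]
    if any(flag not in ("0", "1") for flag in flags):
        return None
    present = [code for code, flag in zip(codes, flags) if flag == "1"]
    return present, float(r_text)


# Rebuild the example line: 20 zeros, two ones, then the correlation.
sample = " ".join(["0"] * 20 + ["1", "1", "-0.2119527"])
print(decode_line(sample))   # (['D060', 'D061'], -0.2119527)
```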
Notice from page 1 of m067-diets-correlations.pdf that the highest correlation among the 4.2 million combinations of diet items is only r=+0.71, actually very close to the correlation of a single item, D038 - WHTFLOUR, which is r=+0.65. We must never forget that an association doesn’t prove causation, but given this extraordinary result and the scientific/evolutionary information we have about wheat flour and its derived products, I have no doubt that consuming a lot of wheat/bread/pasta, the so-called "healthy" mediterranean whole grains that nutritionists highly recommend, helps promote vascular diseases. Dr. William Davis, a known cardiologist and blogger, even calls wheat "the nicotine of food". In fact, this seems to be the most powerful single diet item, strongly and significantly associated with M067-VASC STRc, when compared to any imaginable group of diet items. Wheat is also positively associated with several types of disease and mortality, as the results of the CHINA PROJECT show. I suppose this is another quite powerful result against wheat, but you be the judge.
Given the high correlation of D047-MILK and the mild correlations of D057-ADDEDSALT and D058-SPICE with M067, we could also argue that milk and, in a minor role, salt and spice intake might promote, or are associated with, vascular diseases. As far as these diseases are concerned, D045-FRUIT, D049-MEAT, D050-REDMEAT and D054-VEGOIL are most probably neutral, since they don’t appear in the list of strongest positive associations and also don’t look protective in this type of analysis. I must say that D054-VEGOIL is a surprise to me, so a more in-depth analysis would be necessary. The types of vegetable oils and their amounts might explain this relatively unexpected result. I don’t believe vegetables promote vascular diseases; after all, the correlation of green vegetables with M067 is negative (-13.8), and the apparent contribution of vegetables here is probably masked by the increase of some other item like wheat or milk.
On the other hand, there are quite a few diet items that look to be protective against vascular diseases. From this analysis of 4.2 million diet items combined, I’ve identified these probably protective items:
-49.6% D037 - RICE - diet survey RICE INTAKE (g/day/reference man, air-dry basis)
-40.4% D052 - FISH - diet survey FISH INTAKE (g/day/reference man, as-consumed basis)
-33.2% D048 - EGGS - diet survey EGG INTAKE (g/day/reference man, as-consumed basis)
-32.8% D051 - POULTRY - diet survey POULTRY INTAKE (g/day/reference man, as-consumed basis)
-19.2% D061 - WINE - diet survey WINE INTAKE (g/day/reference man)
-18.0% D060 - BEER - diet survey BEER INTAKE (g/day/reference man)
-3.7% D039 - OTHCEREAL - diet survey OTHER CEREAL INTAKE (g/day/reference man, air-dry basis)
The numbers on the left represent the correlation of each item with M067. The strongest negative associations are for rice, fish and poultry consumption. The strongest negative association I could obtain with the exhaustive combination of food items was only r=-0.64. Again, the combination of several food items, just a few of them or a lot of them, doesn’t explain (from a statistical point of view) the presence or absence of vascular diseases much better than the associations with some isolated diet items. Anyway, the difference here is a bit larger than in the previous case, where wheat proves to be a powerful vascular disease promoter. In my analysis, fish, eggs, maybe poultry, and rice consumption are clearly protective. Wine and beer also look protective; don’t ask me why about this last item, but it does help increase the negative association with vascular diseases.
Regarding other conditions, I tried other simulations, with disappointing results. For example, for M029-Colorectal cancer, M045-Diabetes, M062-Hypertension and M065-Stroke, the maximum positive associations that I found among the 4.2 million combinations of diet items were +0.505, +0.364, +0.762 and +0.459, respectively. Except for hypertension, where milk consumption is the major problematic item, these weaker correlations suggest that there must be much more to cancer, diabetes and stroke than just the diet. Since we don’t know exactly what, we can simply call it "lifestyle".
PS: Some last notes that are important to retain from this analysis. No food item has an absolute value as far as protection against a certain disease or health disorder is concerned (unless it’s a clearly bad food, like wheat is for the vascular system, and not only for the vascular system). Foods work synergistically: each food item has its intrinsic value, but it also has the relative value of replacing some other food. And this can be good, bad or neutral depending on the replacement. For example, replacing Coca-Cola with beer is probably better than replacing water with beer. You get the idea, don’t you? Also, the maximum absolute correlations obtained here (+0.71 and -0.65), which are still far from +1/-1, should give us a clue that there is more to vascular diseases than just the diet. Diet is certainly important because it explains a lot, but lifestyle, environmental pollution, urbanization/industrialisation and some other factors we don’t even know about cannot be forgotten.
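As a rough worked figure for how much "a lot" is: under the usual reading of a squared Pearson correlation as the share of variance accounted for by a straight-line fit (an interpretation added here for illustration, not a result from the CHINA PROJECT), the two extreme correlations quoted above give:

```latex
r_{\max}^2 = (0.71)^2 \approx 0.50, \qquad r_{\min}^2 = (-0.65)^2 \approx 0.42
```

So, on this reading, about half of the county-to-county variation in M067 lines up with the best group of diet items, and the other half is left to everything else.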
Download: m067-diets-correlations.pdf
'Var Declaration
Dim file_name As String
Dim fnum As Integer
Dim whole_file As String
Dim lines As Variant
Dim one_line As Variant
Dim num_rows As Long
Dim num_cols As Long
Dim data1989() As String
Dim R As Long
Dim C As Long
Dim vector(85) As Variant
Dim mycontrol As String
Dim myparam() As Variant
Dim correlmin As Single
Dim correlmax As Single

Private Sub Form_Load()
    file_name = App.Path & "\data1989.csv"

    ' Load the file.
    fnum = FreeFile
    Open file_name For Input As fnum
    whole_file = Input$(LOF(fnum), #fnum)
    Close fnum

    ' Break the file into lines.
    lines = Split(whole_file, vbCrLf)

    ' Dimension the array.
    num_rows = UBound(lines)
    one_line = Split(lines(0), ",")
    num_cols = UBound(one_line) + 1
    ReDim data1989(0 To num_rows, 1 To num_cols)

    ' Copy the data into the array (row 0 holds the item names).
    For R = 0 To num_rows - 1
        If Len(lines(R)) > 0 Then
            one_line = Split(lines(R), ",")
            For C = 1 To num_cols
                data1989(R, C) = one_line(C - 1)
            Next C
        End If
    Next R

    'Get mycontrol and myparam
    file_name = App.Path & "\myparam.txt"

    ' Load the file.
    fnum = FreeFile
    Open file_name For Input As fnum
    whole_file = Input$(LOF(fnum), #fnum)
    Close fnum

    ' Break the file into lines.
    lines = Split(whole_file, vbCrLf)

    ' Dimension the array.
    num_rows = UBound(lines)
    If lines(num_rows) = "" Then num_rows = num_rows - 1
    ReDim myparam(0 To num_rows, 0 To 1)

    ' For each parameter name, find its column in the data (639 items).
    For i = 0 To num_rows
        myparam(i, 0) = lines(i)
        For j = 1 To 639
            If data1989(0, j) = myparam(i, 0) Then myparam(i, 1) = j
        Next j
    Next i

    'Open output file
    file_name = App.Path & "\myoutput.txt"
    fnum = FreeFile
    Open file_name For Output As fnum

    Counter = 0
    correlmin = 999
    correlmax = -999

    ' One 0/1 loop per diet item: 2^22 combinations in total.
    For p01 = 0 To 1
    For p02 = 0 To 1
    For p03 = 0 To 1
    For p04 = 0 To 1
    For p05 = 0 To 1
    For p06 = 0 To 1
    For p07 = 0 To 1
    For p08 = 0 To 1
    For p09 = 0 To 1
    For p10 = 0 To 1
    For p11 = 0 To 1
    For p12 = 0 To 1
    For p13 = 0 To 1
    For p14 = 0 To 1
    For p15 = 0 To 1
    For p16 = 0 To 1
    For p17 = 0 To 1
    For p18 = 0 To 1
    For p19 = 0 To 1
    For p20 = 0 To 1
    For p21 = 0 To 1
    For p22 = 0 To 1

        Counter = Counter + 1

        ' Progress line every 1 million combinations.
        If Counter / 1000000# = Int(Counter / 1000000#) Then Print #fnum, Counter, Date, Time

        ' Build the combined diet column for the 85 data rows.
        For i = 1 To 85
            vector(i) = p01 * Val(data1989(i, myparam(1, 1))) _
                + p02 * Val(data1989(i, myparam(2, 1))) _
                + p03 * Val(data1989(i, myparam(3, 1))) _
                + p04 * Val(data1989(i, myparam(4, 1))) _
                + p05 * Val(data1989(i, myparam(5, 1))) _
                + p06 * Val(data1989(i, myparam(6, 1))) _
                + p07 * Val(data1989(i, myparam(7, 1))) _
                + p08 * Val(data1989(i, myparam(8, 1))) _
                + p09 * Val(data1989(i, myparam(9, 1))) _
                + p10 * Val(data1989(i, myparam(10, 1)))
            vector(i) = vector(i) _
                + p11 * Val(data1989(i, myparam(11, 1))) _
                + p12 * Val(data1989(i, myparam(12, 1))) _
                + p13 * Val(data1989(i, myparam(13, 1))) _
                + p14 * Val(data1989(i, myparam(14, 1))) _
                + p15 * Val(data1989(i, myparam(15, 1))) _
                + p16 * Val(data1989(i, myparam(16, 1))) _
                + p17 * Val(data1989(i, myparam(17, 1))) _
                + p18 * Val(data1989(i, myparam(18, 1))) _
                + p19 * Val(data1989(i, myparam(19, 1))) _
                + p20 * Val(data1989(i, myparam(20, 1))) _
                + p21 * Val(data1989(i, myparam(21, 1))) _
                + p22 * Val(data1989(i, myparam(22, 1)))
            ' Drop rows where any of the 22 items is missing or non-numeric.
            For j = 1 To 22
                If Not (IsNumeric(data1989(i, myparam(j, 1)))) Then vector(i) = ""
            Next j
        Next i

        'compute the correlation here
        mycorrel = correl_mycontrol_vector

        ' Only print a new minimum or a new maximum.
        If mycorrel < correlmin Or mycorrel > correlmax Then
            Print #fnum, p01; p02; p03; p04; p05; p06; p07; p08; p09; p10; _
                p11; p12; p13; p14; p15; p16; p17; p18; p19; p20; _
                p21; p22; mycorrel
        End If
        If mycorrel < correlmin Then correlmin = mycorrel
        If mycorrel > correlmax Then correlmax = mycorrel

    Next p22, p21, p20, p19, p18, p17, p16, p15, p14, p13, p12, _
        p11, p10, p09, p08, p07, p06, p05, p04, p03, p02, p01

    Close fnum
    End
End Sub

' Pearson correlation between the control item (line 1 of myparam.txt) and vector().
Function correl_mycontrol_vector() As Single
    n = 0: x = 0: y = 0: xy = 0: X2 = 0: Y2 = 0
    For i = 1 To 85
        If IsNumeric(data1989(i, myparam(0, 1))) And IsNumeric(vector(i)) Then
            n = n + 1
            x = x + data1989(i, myparam(0, 1))
            y = y + vector(i)
            xy = xy + data1989(i, myparam(0, 1)) * vector(i)
            X2 = X2 + data1989(i, myparam(0, 1)) ^ 2
            Y2 = Y2 + vector(i) ^ 2
        End If
    Next i
    correl_mycontrol_vector = 0
    If (n * X2 - x ^ 2) <> 0 And (n * Y2 - y ^ 2) <> 0 Then
        correl_mycontrol_vector = (n * xy - x * y) / (n * X2 - x ^ 2) ^ 0.5 / (n * Y2 - y ^ 2) ^ 0.5
    End If
End Function
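The output rule in the listing above (print only when `mycorrel` sets a new minimum or maximum) can be compared with the alternative of keeping every result above a cutoff. The sketch below is in Python rather than Visual Basic 6, the function names `record_filter` and `threshold_filter` are hypothetical, and the stream of correlations is made up. It only illustrates how the two rules differ on the same input.

```python
def record_filter(values):
    """Keep a value only when it sets a new minimum or maximum, as the program does."""
    lo, hi = 999.0, -999.0   # same starting sentinels as correlmin / correlmax
    kept = []
    for r in values:
        if r < lo or r > hi:
            kept.append(r)
        lo = min(lo, r)
        hi = max(hi, r)
    return kept


def threshold_filter(values, cutoff=0.6):
    """Keep every value whose absolute correlation is above the cutoff."""
    return [r for r in values if abs(r) > cutoff]


# Made-up correlations, in the order a run might produce them.
stream = [0.0, 0.10, -0.20, 0.65, 0.64, -0.61, 0.70, -0.63]
print(record_filter(stream))      # [0.0, 0.1, -0.2, 0.65, -0.61, 0.7, -0.63]
print(threshold_filter(stream))   # [0.65, 0.64, -0.61, 0.7, -0.63]
```

In this toy stream the record rule drops 0.64, because 0.65 was seen first, and keeps weak early values such as 0.1. The cutoff rule does the opposite, at the cost of a larger output file.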