Jiawei Qi

Diabetes analysis of Pima Indian women

1 Introduction
To investigate how the incidence of diabetes depends on characteristics of Pima Indian women, a study was conducted on 768 randomly selected Pima Indian female patients who were at least 21 years old. Patient characteristics, including the number of times pregnant and age (years), were recorded. Several other physical measurements that might be closely related to diabetes were also taken: plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg / (height in m)^2), and diabetes pedigree function.

The objective of this study is to investigate how diabetes incidence is affected by the patients' characteristics and these measurements. We use a generalized linear model in which the logit of the probability that a Pima Indian woman has diabetes is a linear combination of the covariates listed above. R software is used for modeling and statistical analysis.

In summary, the logit of the probability of a Pima Indian woman having diabetes depends on the covariates listed above, together with their higher-order terms and their interactions.

Symbol            Type          Description
Preg ($x_1$)      Continuous    Number of times pregnant
Glu ($x_2$)       Continuous    Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Dias ($x_3$)      Continuous    Diastolic blood pressure (mm Hg)
Thick ($x_4$)     Continuous    Triceps skin fold thickness (mm)
Insul ($x_5$)     Continuous    2-hour serum insulin (mu U/ml)
Mass ($x_6$)      Continuous    Body mass index (weight in kg / (height in m)^2)
Pedi ($x_7$)      Continuous    Diabetes pedigree function
Age ($x_8$)       Continuous    Age (years)
class (response)  Categorical   Class variable (0 or 1)
$\beta_j$         Coefficient   Relationship between the j-th explanatory variable and the logit of the probability
$\beta_{i,j}$     Coefficient   Relationship between the interaction of the i-th and j-th explanatory variables and the logit of the probability
$\beta_{im}$      Coefficient   Relationship between the m-th order term of the i-th explanatory variable and the logit of the probability

Table 1: Notation used in data modeling and analysis

2 Data Modeling

2.1 Data Preparation
Apart from patient characteristics such as the number of times pregnant and the diabetes status (class variable), none of the physical measurements should be 0, because a living person cannot have, for example, a plasma glucose concentration of 0. The dataset nevertheless contains many zeros, so we conclude that these zeros mark missing data. The simplest way to deal with missing data is to delete the incomplete records and analyze only the complete ones. In our case this would delete half of the original data, losing much information and reducing the accuracy of the final model. We therefore estimate the missing values with an appropriate method and build our model on the completed dataset. Many techniques exist for handling missing values, such as maximum likelihood, multiple imputation, fully Bayesian methods, and weighted estimating equations. Since our data are stored in matrix form, we estimate the missing values in the matrix with a k-nearest-neighbors algorithm. This method is also simple to code and fully compatible with generalized linear model fitting in R. More details can be found in the EMV package in R.
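
The idea behind nearest-neighbor imputation can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the EMV implementation: the function name `knn_impute`, the use of unscaled Euclidean distance, the restriction to complete rows as neighbors, and the toy records are all hypothetical choices made for this example.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest complete rows.

    Distance is Euclidean over the columns observed in the incomplete row.
    Generic illustration only; not the algorithm of the EMV package.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row")
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        # Keep the k complete rows closest to the incomplete one.
        neighbors = sorted(complete, key=distance)[:k]
        new_row = [
            v if v is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, v in enumerate(row)
        ]
        filled.append(new_row)
    return filled

# Hypothetical toy records: (glucose, diastolic pressure, BMI).
# None stands for a measurement that was recorded as 0.
toy = [
    [120.0, 70.0, 32.0],
    [118.0, 72.0, 31.0],
    [90.0, 60.0, 22.0],
    [160.0, 90.0, 40.0],
    [119.0, None, 31.5],
]
completed = knn_impute(toy, k=2)
print(completed[-1])  # [119.0, 71.0, 31.5]
```

In practice the columns would probably be standardized before computing distances, since the measurements are on very different scales.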

2.2 Preliminary data plots
Before fitting our model, we take a preliminary look at the data. Figure 1 shows the frequency distributions of all covariates and the response variable, and Figure 2 shows box-plots of each covariate. Figure 2 suggests that there may be some extreme points. We analyze these extreme points further when fitting the model.

Figure 1: Frequency distributions of all covariates and response variable

Figure 2: Box-plots of covariates

2.3 Model fitting
Each person either has diabetes or does not, which is expressed by the class variable taking the value 1 or 0. We therefore set up a Bernoulli model for each person. Our starting model is:
$$\mathrm{logit}(\pi_i) = \beta_0 + \sum_{j=1}^{8} \beta_j x_{i,j}, \quad i = 1, \dots, 768$$

where $\pi_i$ stands for the probability that the i-th patient tests positive for diabetes. Fitting the model gives a residual deviance of 711.79 on 759 degrees of freedom. The coefficients are:

Term        Estimate
Intercept   -9.189
preg         0.1242
glu          0.0384
dias        -0.0094
thick        0.0035
insul       -9.5e-4
mass         0.0939
pedi         0.8696
age          0.0132
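
For readers less familiar with the quantities quoted above, the following minimal Python sketch shows how a residual deviance and its degrees of freedom arise for Bernoulli responses. The toy responses and fitted probabilities are hypothetical; only the counts 768 and 9 come from the text (768 patients, 8 covariates plus an intercept).

```python
import math

def bernoulli_deviance(y, p):
    """Residual deviance for 0/1 responses: -2 times the log-likelihood.

    For ungrouped binary data the saturated model has log-likelihood 0,
    so the deviance reduces to this expression.
    """
    total = 0.0
    for yi, pi in zip(y, p):
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -2.0 * total

# Hypothetical responses and fitted probabilities, for illustration only.
y_toy = [1, 0, 0, 1]
p_toy = [0.8, 0.3, 0.1, 0.6]
dev_toy = bernoulli_deviance(y_toy, p_toy)

# Residual degrees of freedom: observations minus estimated coefficients.
n_obs, n_coef = 768, 9
resid_df = n_obs - n_coef
print(round(dev_toy, 3), resid_df)
```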

The dispersion parameter of the binomial distribution is 1. Our estimate of the dispersion parameter is 711.79/759, approximately 0.94, which is less than 1 and suggests under-dispersion. We attribute this under-dispersion to the fit being good enough that the residuals are all very small. Since we still need to add interaction effects and higher-order effects of the covariates to this minimal model, we do not remove terms that an F test would find non-significant in the minimal model. We now add interactions to the minimal model, allowing interaction terms among up to four covariates, and use the minimal model as the starting point for selecting the best model:
$$\mathrm{logit}(\pi_i) = \beta_0 + \sum_{j=1}^{8} \beta_j x_{i,j} + \sum_{i,j,k,l} \beta_{i,j,k,l}\, x_i x_j x_k x_l, \quad i = 1, \dots, 768$$
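
To give a sense of the size of the candidate set, the sketch below enumerates interaction terms among the eight covariates. It assumes that "up to four covariates" means all products of two, three or four distinct covariates; the function name `interaction_terms` is hypothetical, and the `a:b` labels simply mimic R's interaction notation.

```python
from itertools import combinations

covariates = ["preg", "glu", "dias", "thick", "insul", "mass", "pedi", "age"]

def interaction_terms(names, max_order=4):
    """List interaction terms in 'a:b' notation, from order 2 up to max_order."""
    terms = []
    for order in range(2, max_order + 1):
        for combo in combinations(names, order):
            terms.append(":".join(combo))
    return terms

candidates = interaction_terms(covariates)
print(len(candidates))   # 28 + 56 + 70 = 154 candidate terms
print(candidates[:3])    # ['preg:glu', 'preg:dias', 'preg:thick']
```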

The new coefficients are:
Term         Estimate
Intercept    -19.083
preg           0.6271
glu            0.08367
dias          -0.0102
thick          0.1498
insul         -0.01
mass           0.20822
pedi           2.0144
age            0.1389
thick:mass    -0.00352
preg:age      -0.0079
preg:thick    -0.0064
insul:pedi    -0.0056
glu:age       -0.0012
insul:age      0.00035

This model is significantly different from the minimal model. The residual deviance has decreased to 675.63 on 753 degrees of freedom. The previous two models do not include higher-order terms of the covariates, but such terms may change the whole model significantly and lead to very different predictions. We therefore add higher-order terms of each covariate to the model and test whether they affect the model significantly. After fitting and testing, we conclude that the cubic term of mass and the square term of age are both very important to our model; without these terms, the model is significantly different. After checking the significance of each term by partial F tests, we arrive at our final model:

class ~ glu + insul + mass + pedi + age + I(mass^2) + I(mass^3) + I(age^2) + insul:pedi

$$\mathrm{logit}(\pi_i) = \beta_0 + \beta_2 x_2 + \beta_5 x_5 + \beta_7 x_7 + \beta_6 x_6 + \beta_{62} x_6^2 + \beta_{63} x_6^3 + \beta_8 x_8 + \beta_{82} x_8^2 + \beta_{5,7} x_5 x_7, \quad i = 1, \dots, 768$$
In the F tests we use the significance level 0.05; a different significance level could lead to more or fewer terms. The coefficients are:

Term         Estimate
Intercept    -4.575086e+01
glu           3.784792e-02
insul         2.737438e-03
mass          2.585759e+00
pedi          1.925024e+00
age           3.163241e-01
I(mass^2)    -6.584153e-02
I(mass^3)     5.551122e-04
I(age^2)     -3.563231e-03
insul:pedi   -5.465680e-03

The residual deviance of the final model is 667.36 on 758 degrees of freedom.
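
As a rough cross-check of the three fits, the sketch below compares them by AIC and applies a deviance (likelihood-ratio) test to the one nested pair. These are swapped-in techniques, not the F tests used in the text. The deviances are read from this section with decimal points restored, which is an assumption about the source, and the helper names are hypothetical.

```python
import math

# (residual deviance, number of coefficients) as reported in this section,
# with decimal points restored (an assumption about the garbled source).
models = {
    "starting": (711.79, 9),
    "interaction": (675.63, 15),
    "final": (667.36, 10),
}

def aic_from_deviance(deviance, n_coef):
    """AIC for ungrouped binary data, where the saturated log-likelihood is 0."""
    return deviance + 2 * n_coef

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

aic = {name: aic_from_deviance(d, p) for name, (d, p) in models.items()}

# Nested comparison: the starting model is a special case of the interaction model.
drop = models["starting"][0] - models["interaction"][0]    # deviance drop
extra = models["interaction"][1] - models["starting"][1]   # added coefficients
p_value = chi2_sf_even_df(drop, extra)

for name, value in aic.items():
    print(name, round(value, 2))
print("deviance drop", round(drop, 2), "on", extra, "df, p =", p_value)
```

Under these assumptions the final model has the smallest AIC of the three, and the deviance drop from the starting model to the interaction model is far larger than chance would explain on 6 degrees of freedom.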

2.4 Diagnostic Checking
It is very important to perform diagnostic checking when building a regression model. We use several plots to demonstrate that our final model is appropriate.

Figure 3: Diagnostic plots of the model

Figure 4: Partial residuals vs covariates

Figure 5: Fitted values plot

Figure 6: Residuals plots

In Figure 3, the Cook statistic plots show 5 influential points in the dataset. Refitting the model without these suspicious points shows that they do not significantly affect our model. In Figure 4, the partial residuals in the first, third and fourth plots show no regular pattern, which indicates that the probability of testing positive for diabetes may not depend on the number of times pregnant, diastolic blood pressure or triceps skin fold thickness. In the 6th and 8th plots, the partial residuals show curved patterns, which suggests that the probability of testing positive for diabetes may depend on higher-order terms of mass and age. Both findings are consistent with our final model. Figure 5 is the fitted values plot and Figure 6 is the residuals plot. All the residuals lie within a band around 0 without any specific pattern. In summary, our fitted model is appropriate for the data we have.
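
The "band around 0" check can be illustrated with Pearson residuals. The text does not say which residual type Figure 6 shows, so Pearson residuals are an assumption here, as are the band limit of 2 and the toy data.

```python
import math

def pearson_residuals(y, p):
    """Pearson residuals (y - p) / sqrt(p * (1 - p)) for Bernoulli responses."""
    return [(yi - pi) / math.sqrt(pi * (1 - pi)) for yi, pi in zip(y, p)]

def outside_band(residuals, limit=2.0):
    """Indices of residuals whose absolute value exceeds the chosen limit."""
    return [i for i, r in enumerate(residuals) if abs(r) > limit]

# Hypothetical responses and fitted probabilities, for illustration only.
y_toy = [1, 0, 0, 1, 1]
p_toy = [0.8, 0.3, 0.1, 0.6, 0.05]
res = pearson_residuals(y_toy, p_toy)
flagged = outside_band(res)
print(flagged)  # [4]: a positive case that the fit considered very unlikely
```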

2.5 Predictions
Consider a 30-year-old Pima Indian woman with the following medical record: number of times pregnant 4; plasma glucose concentration at 2 hours in an oral glucose tolerance test 120; diastolic blood pressure 70 mm Hg; triceps skin fold thickness 20 mm; 2-hour serum insulin 80 mu U/ml; weight 82 kg; height 1.6 m; diabetes pedigree function 0.5. Our model gives the predicted value $\mathrm{logit}(\pi) = -0.4485$, thus $\pi = 0.39$. This means that this Pima Indian woman has a 39% probability of having diabetes.
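
The prediction can be reproduced by hand from the final-model coefficients. The sketch below assumes the coefficient values of Section 2.3 with decimal points restored and a height of 1.6 m; the dictionary keys and function names are hypothetical. Number of pregnancies, diastolic pressure and skin fold thickness do not enter because the final model does not contain them.

```python
import math

# Final-model coefficients from Section 2.3, with decimal points restored
# (an assumption about the garbled source).
coef = {
    "intercept": -4.575086e+01,
    "glu": 3.784792e-02,
    "insul": 2.737438e-03,
    "mass": 2.585759e+00,
    "pedi": 1.925024e+00,
    "age": 3.163241e-01,
    "mass2": -6.584153e-02,
    "mass3": 5.551122e-04,
    "age2": -3.563231e-03,
    "insul_pedi": -5.465680e-03,
}

def predict_logit(glu, insul, mass, pedi, age):
    """Linear predictor of the final model, including polynomial and interaction terms."""
    return (
        coef["intercept"]
        + coef["glu"] * glu
        + coef["insul"] * insul
        + coef["mass"] * mass
        + coef["pedi"] * pedi
        + coef["age"] * age
        + coef["mass2"] * mass ** 2
        + coef["mass3"] * mass ** 3
        + coef["age2"] * age ** 2
        + coef["insul_pedi"] * insul * pedi
    )

def inv_logit(eta):
    """Map a logit value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

bmi = 82.0 / 1.6 ** 2   # weight in kg divided by squared height in m
eta = predict_logit(glu=120.0, insul=80.0, mass=bmi, pedi=0.5, age=30.0)
prob = inv_logit(eta)
print(round(eta, 4), round(prob, 2))  # -0.4485 0.39
```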

3 Conclusion and further discussion
The final conclusion is the model obtained in Section 2. The fitted model has a simple form and makes it convenient to predict the probability of testing positive for diabetes. It is also easy to construct prediction intervals from our model. A few issues need more careful consideration. The first concerns the missing values in the original data. Besides the method used here, many other methods can handle missing data; which one is best for this study depends on comparing the predictions with real data. Second, how to deal with the extreme or influential points remains an open problem. More careful analysis is needed to determine whether they should be kept in the model.

Source:math.ucsb.edu
