
Data Analysis Using R Programming

The dataset heartdata.csv contains observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease in 500 counties.

We want to explain heart disease in terms of biking and smoking rates, using the following model:
heart.diseaseᵢ = β₁ + β₂ bikingᵢ + β₃ smokingᵢ + ϵᵢ
We assume that the error terms ϵᵢ are independent of the explanatory variables and that the ϵᵢ are independently distributed with expectation 0 and variance σ².
Step 1:
Load the data into R
Follow these four steps to import and load the data:
  1. In RStudio, go to File > Import dataset > From Text (base).
  2. Choose the data file heartdata.csv; an Import Dataset window pops up.
  3. In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables (biking, smoking, and heart.disease).
  4. Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen.
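If you prefer to load the file with code rather than through the menus, a minimal sketch (assuming heartdata.csv is saved in your R working directory) is:

heartdata <- read.csv("heartdata.csv")  # read the CSV into a data frame
head(heartdata)                         # peek at the first few rows to confirm the columns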
After you’ve loaded the data, check that it has been read in correctly using summary().
> summary(heartdata)
       id            biking          smoking        heart.disease    
 Min.   :  1.0   Min.   : 1.119   Min.   : 0.5259   Min.   : 0.5519  
 1st Qu.:125.2   1st Qu.:20.205   1st Qu.: 8.2798   1st Qu.: 6.5137  
 Median :249.5   Median :35.824   Median :15.8146   Median :10.3853  
 Mean   :249.5   Mean   :37.788   Mean   :15.4350   Mean   :10.1745  
 3rd Qu.:373.8   3rd Qu.:57.853   3rd Qu.:22.5689   3rd Qu.:13.7240  
 Max.   :498.0   Max.   :74.907   Max.   :29.9467   Max.   :20.4535 
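A couple of quick follow-up checks (a sketch, using the column names shown in the summary) confirm the number of rows and whether anything is missing:

nrow(heartdata)            # number of counties read in
colSums(is.na(heartdata))  # missing values per column
str(heartdata)             # variable types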

Step 2:
Make sure your data meet the assumptions.
We can use R to check that our data meet the four main assumptions for linear regression.
1. Collinearity
Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.

> cor(heartdata$biking, heartdata$smoking)

[1] 0.01513618

This is the correlation between our two explanatory variables, biking and smoking. The value, 0.015, is positive but very close to 0, which indicates that the two variables are essentially uncorrelated: as one increases, there is no tendency for the other to either increase or decrease. Collinearity is therefore not a concern, and both variables can be included in the model.
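To see all pairwise correlations at once, one option (a sketch, using the column names from the summary above) is to pass the numeric columns to cor():

cor(heartdata[, c("biking", "smoking", "heart.disease")])  # full correlation matrix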

2. Normality
Use the hist() function to check whether your dependent variable (and, for reference, the explanatory variables) is roughly normally distributed.

hist(heartdata$heart.disease)

The histogram for people with heart disease is not perfectly bell-shaped, but it is quite close to a normal distribution.

hist(heartdata$biking)

The histogram for biking shows some evidence of skewness; the distribution is not exactly bell-shaped or normal.

hist(heartdata$smoking)

The histogram for smoking likewise shows some evidence of skewness; the distribution is not exactly bell-shaped or normal.
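To compare the three distributions side by side, one option (a sketch using base graphics) is to draw all three histograms in one plotting window:

par(mfrow = c(1, 3))  # one row, three plots
hist(heartdata$heart.disease, main = "heart.disease")
hist(heartdata$biking, main = "biking")
hist(heartdata$smoking, main = "smoking")
par(mfrow = c(1, 1))  # reset the plotting layout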

library(ggpubr)  # ggqqplot() comes from the ggpubr package
ggqqplot(heartdata$heart.disease)

The Q-Q (quantile-quantile) plot can be used to check whether a variable, or the residuals of a model, follows a normal distribution. Here the points roughly form a straight diagonal line, which indicates that the normality assumption is met.
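If you do not have ggpubr installed, base R gives an equivalent check (a minimal sketch):

qqnorm(heartdata$heart.disease)  # quantile-quantile plot of the response
qqline(heartdata$heart.disease)  # reference line for a normal distribution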

3. Linearity
Check this using two residual plots: one for biking and heart disease, and one for smoking and heart disease.

model <- lm(heart.disease ~ biking, data = heartdata)

plot(model, 1)

The red line is approximately horizontal at zero. The residual plot shows no fitted pattern. We can assume a linear relationship between biking and heart disease.

model <- lm(heart.disease ~ smoking, data = heartdata)

plot(model, 1)

There is no pattern in the residual plot between smoking and heart disease as the red line is approximately horizontal at zero. We can assume a linear relationship between smoking and heart disease. 
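The same linearity check can also be made directly on the raw data with scatter plots, adding the fitted line from each simple regression (a sketch):

plot(heartdata$biking, heartdata$heart.disease, xlab = "biking", ylab = "heart.disease")
abline(lm(heart.disease ~ biking, data = heartdata))   # fitted line for biking

plot(heartdata$smoking, heartdata$heart.disease, xlab = "smoking", ylab = "heart.disease")
abline(lm(heart.disease ~ smoking, data = heartdata))  # fitted line for smoking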

4. Homoskedasticity
Check this after fitting the model, using the Breusch-Pagan test from the lmtest package.

> lmtest::bptest(model)

studentized Breusch-Pagan test

data:  model

BP = 5.7775, df = 2, p-value = 0.05564

Based on the Breusch-Pagan test above, the p-value is 0.056, which is above 0.05. The null hypothesis of homoskedasticity is not rejected at the 5% level, so we can say that the errors are homoskedastic.
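The df = 2 in the output indicates that the test was run on the model containing both predictors. To reproduce it explicitly (a sketch assuming the lmtest package is installed):

model <- lm(heart.disease ~ biking + smoking, data = heartdata)  # model with both predictors
lmtest::bptest(model)  # studentized Breusch-Pagan test for heteroskedasticity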

Step 3:
Perform the linear regression analysis
Run the regression using the lm() function and assign the regression output to the object named model.1
Print the summary of the regression using summary(), i.e. summary(model.1).
Interpret all the statistics shown in the model summary.
> model.1 <- lm(heart.disease ~ smoking + biking, data = heartdata)
> summary(model.1)

Call:
lm(formula = heart.disease ~ smoking + biking, data = heartdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1789 -0.4463  0.0362  0.4422  1.9331 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.984658   0.080137  186.99   <2e-16 ***
smoking      0.178334   0.003539   50.39   <2e-16 ***
biking      -0.200133   0.001366 -146.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.654 on 495 degrees of freedom
Multiple R-squared:  0.9796,  Adjusted R-squared:  0.9795 
F-statistic: 1.19e+04 on 2 and 495 DF,  p-value: < 2.2e-16
The Call line of the output shows the formula R used to fit the model: the predictors are smoking and biking, and the response variable is heart.disease, from the heart data.
From the coefficients table we can interpret the slopes as follows. For every additional 1% of the population that smokes, the percentage of people with heart disease increases by about 0.18 percentage points, as expected, and for every additional 1% of the population that bikes to work each day, the percentage of people with heart disease decreases by about 0.20 percentage points. About 98% of the variation in heart disease is explained by the two variables, as given by the R² of 0.9796. There is a positive relationship between heart disease and smoking (t value = 50.39 > 1.96) and a negative relationship between heart disease and biking (t value = -146.53 < -1.96). Both p-values are essentially 0, well below the 5% level, so they are statistically significant: we reject the null hypotheses of no effect and conclude that the percentage of people with heart disease is related both to the percentage of people who smoke and to the percentage of people who bike to work every day. The residual standard error, the average amount by which the response (heart disease) deviates from the fitted regression line, is 0.654. The fitted equation is: heart.disease = 14.98 + 0.18 × smoking − 0.20 × biking.
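To pull these quantities directly out of the fitted object, and to get confidence intervals for the slopes, a short sketch:

coef(model.1)               # intercept and slopes used in the fitted equation
confint(model.1)            # 95% confidence intervals for the coefficients
summary(model.1)$r.squared  # proportion of variation explained (R²)
sigma(model.1)              # residual standard error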
Diagnostic plots:
Check that the model is actually a good fit for the data by running the code below, then comment on the diagnostics.

model.1 <- lm(heart.disease ~ smoking + biking, data = heartdata)
par(mfrow = c(2, 2))
plot(model.1)
par(mfrow = c(1, 1))

The diagnostic plots show the residuals in four different ways.

Residuals vs Fitted: the x-axis displays the fitted values and the y-axis displays the residuals. From the plot, the spread of the residuals does not change with the fitted values, that is, the spread is constant, which indicates homoskedasticity.

Scale-Location: the red line is roughly horizontal across the plot and the spread of the residuals is roughly equal at all fitted values, with no clear pattern; the residuals are randomly scattered around the red line. The assumption of homoskedasticity is likely satisfied for the regression model.

Normal Q-Q: the data points near the tails do not fall exactly along the straight line, but for the most part the residuals appear to be normally distributed along a straight line.

Residuals vs Leverage: here we look at how the spread of the standardized residuals changes as the leverage increases, which can also be used to detect heteroskedasticity and non-linearity. The spread of the standardized residuals appears to decrease somewhat as leverage increases, hinting at mild heteroskedasticity. We can also see that observation #196 lies closest to the border of Cook's distance, but it does not fall outside the dashed line, so there are no influential points in our regression model.
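To check the influence of individual observations numerically rather than by eye (a sketch), the Cook's distances can be extracted from the fitted model:

cooks.d <- cooks.distance(model.1)      # one distance per observation
head(sort(cooks.d, decreasing = TRUE))  # the observations with the largest influence
which(cooks.d > 4 / nrow(heartdata))    # a common rule-of-thumb cutoff for flagging points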
Visualization:
We will use the following method to visualize our results: plot the relationship between biking and heart disease at different levels of smoking. In this example, smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.

Follow these steps:

1. Create a new data frame with the information needed to plot the model.
Use the function expand.grid() to create a data frame with the parameters you supply. Within this function you will: create a sequence from the lowest to the highest value of your observed biking data; and choose the minimum, mean, and maximum values of the variable smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.

plotting.data <- expand.grid(
  biking = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking)))

You should change the data frame name heart.data to whatever you named the R data frame after importing heartdata.csv. Review the function expand.grid() using ?expand.grid and make sure that you understand it. Also review the video and lecture slides from Lecture 3 to understand how to make predictions using R; this involves making a data frame with the x values for which you want to predict y.

2. Predict the values of heart disease based on your linear model.
Save your predicted y values as a new column in the data frame you just created:

plotting.data$predicted.y <- predict(model.1, newdata = plotting.data)

3. Round the smoking numbers to two decimals:

plotting.data$smoking <- round(plotting.data$smoking, digits = 2)

4. Change the smoking variable into a factor.
This allows you to plot the interaction between biking and heart disease at each of the three levels of smoking we chose.

plotting.data$smoking <- as.factor(plotting.data$smoking)

5. Make a scatter plot of the original data (heart.disease vs. biking) using ggplot:

library(ggplot2)
heart.plot <- ggplot(heart.data, aes(x = biking, y = heart.disease)) +
  geom_point()
heart.plot

6. Using geom_line(), add the regression lines:

heart.plot <- heart.plot +
  geom_line(data = plotting.data, aes(x = biking, y = predicted.y, color = smoking), size = 1.25)
heart.plot

7. Make the graph ready for publication by adding a theme and labels:

heart.plot <- heart.plot + theme_bw() +
  labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
       x = "Biking to work (% of population)",
       y = "Heart disease (% of population)",
       color = "Smoking \n (% of population)")
heart.plot

We have made a graph that shows the predicted relationship between heart disease and biking at 3 levels of smoking.
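As a quick numerical check of the same predictions, you can feed a single hypothetical county into predict(); the biking and smoking values below are made up purely for illustration:

predict(model.1, newdata = data.frame(biking = 50, smoking = 10))
# compare with the fitted equation: 14.98 + 0.18 * 10 - 0.20 * 50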
Attempt to create a similar graph using only the middle level of smoking, and add the prediction band to the plot using dashed or dotted lines. Review the function predict() to see how to add confidence or prediction limits.

new.data <- expand.grid(
  biking = seq(min(heartdata$biking), max(heartdata$biking), length.out = 30),
  smoking = c(mean(heartdata$smoking)))
new.data

new.data$band.y <- predict(model.1, newdata = new.data, interval = "prediction", level = 0.95)

new.data$smoking <- round(new.data$smoking, digits = 2)
new.data$smoking <- as.factor(new.data$smoking)

heart.plot2 <- ggplot(heartdata, aes(x = biking, y = heart.disease)) +
  geom_point() +
  stat_smooth(method = lm) +
  geom_line(data = new.data, aes(x = biking, y = band.y[, "fit"], color = smoking), size = 1.25)

heart.plot2 <- heart.plot2 + geom_line(data = new.data, aes(y = band.y[, "lwr"]), col = "coral2", linetype = "dashed")
heart.plot2 <- heart.plot2 + geom_line(data = new.data, aes(y = band.y[, "upr"]), col = "coral2", linetype = "dashed")
heart.plot2

heart.plot2 <- heart.plot2 + theme_bw() +
  labs(title = "Rates of heart disease (% of population) \n as a function of biking to work and smoking",
       x = "Biking to work (% of population)",
       y = "Heart disease (% of population)",
       color = "Smoking \n (% of population)")
heart.plot2
We have made a graph that shows the predicted relationship between heart disease and biking at the middle level of smoking (15.44%). The dashed lines show the 95% prediction interval: for a county with this level of smoking and a given percentage of people biking to work each day, there is a 95% chance that its percentage of people with heart disease falls between the lower and upper dashed lines.
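The prediction band is for individual counties and is wider than the confidence band for the mean response. To see the difference, the same kind of predict() call can be made with interval = "confidence" (a sketch; conf.newdata is a new helper data frame introduced here just for the comparison):

conf.newdata <- data.frame(
  biking = seq(min(heartdata$biking), max(heartdata$biking), length.out = 30),
  smoking = mean(heartdata$smoking))
conf.band <- predict(model.1, newdata = conf.newdata, interval = "confidence", level = 0.95)
head(conf.band)  # columns fit, lwr, upr for the mean response; narrower than the prediction band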
