This report examines the effect of transmission type (manual or automatic) on fuel efficiency in cars. The data come from the 1974 edition of Motor Trend magazine (the mtcars dataset in R). We expected the weight of the car to be a significant confounder in our analysis, because the choice of manual or automatic transmission is related to it.
We convert the categorical variables (cyl, gear, vs, carb) to factors and recode the transmission variable (am) with descriptive labels (0 = automatic, 1 = manual).
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars$vs <- factor(mtcars$vs)
mtcars$carb <- factor(mtcars$carb)
mtcars$am[mtcars$am =="0"] <- "automatic"
mtcars$am[mtcars$am =="1"] <- "manual"
mtcars$am <- as.factor(mtcars$am)
str(mtcars)
From the exploratory analysis (Figure 1, Appendix), a scatterplot matrix of all the variables in the dataset, we observe that mpg is clearly correlated with several other variables of interest: cyl, disp, hp, drat, wt, vs and am. Because we are mainly interested in the relation between mpg and transmission type, we also draw a box-and-whisker plot (Appendix), which shows that mpg tends to be higher for cars with a manual transmission.
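The correlations read off the scatterplot matrix can also be checked numerically. The following is a minimal sketch, assuming Pearson correlation on the original all-numeric dataset (before any factor conversion); the names `cars_num` and `mpg_cor` are hypothetical and not part of the original analysis.

```r
# Fresh, all-numeric copy of the built-in dataset
# (assumption: correlations are computed before factor conversion).
cars_num <- datasets::mtcars

# Pearson correlation of mpg with every other column; mpg is column 1,
# so -1 drops its correlation with itself.
mpg_cor <- cor(cars_num)["mpg", -1]

# Sort by absolute strength so the most strongly related variables come first.
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor, 2))
```

A negative value means mpg falls as the variable increases, and a positive value means mpg rises with it.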
We build several linear regression models using the factor variables prepared above, select the best model, and compare it with the base model using anova. After model selection, we also analyze the residuals.
To choose the best model, we use stepwise selection with the stepAIC() function from the MASS package, searching in both directions (direction = "both").
library(MASS)
fit <- lm(mpg~.,data=mtcars)
bestmodel <- stepAIC(fit, direction="both")
summary(bestmodel)$r.squared
## [1] 0.8659
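To see what the stepwise search is optimizing, the AIC of the full model and of the selected model can be compared directly. This is a sketch under stated assumptions: it rebuilds the factor variables on a local copy named `cars` (hypothetical), silences the step-by-step log with `trace = FALSE`, and uses base R's `AIC()`, which differs from the criterion printed by stepAIC() only by a constant for models fitted to the same data.

```r
library(MASS)

# Local copy with the same categorical variables converted to factors.
cars <- datasets::mtcars
for (v in c("cyl", "vs", "am", "gear", "carb")) {
  cars[[v]] <- factor(cars[[v]])
}

# Full model and stepwise selection in both directions (log suppressed).
full_model <- lm(mpg ~ ., data = cars)
selected <- stepAIC(full_model, direction = "both", trace = FALSE)

# Formula retained by the search, and AIC before and after selection.
print(formula(selected))
print(c(full = AIC(full_model), selected = AIC(selected)))
```

A lower AIC for the selected model indicates a better trade-off between fit and number of parameters.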
We now use anova to compare three models: (a) transmission only, (b) all variables, and (c) the best-fit combination of predictors cyl, hp, wt and am.
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ am
## Model 3: mpg ~ cyl + hp + wt + am
##   Res.Df RSS  Df Sum of Sq     F  Pr(>F)
## 1     15 120
## 2     30 721 -15      -600  4.99  0.0018 **
## 3     26 151   4       570 17.75 1.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
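The call that produced this table is not shown in the report. A comparison of this shape could be obtained along the following lines; this is a sketch, with the model order inferred from the table and the names `cars`, `full_model`, `base_model`, `best_model` and `aov_tab` chosen here for illustration.

```r
# Local copy with the categorical variables converted to factors.
cars <- datasets::mtcars
for (v in c("cyl", "vs", "am", "gear", "carb")) {
  cars[[v]] <- factor(cars[[v]])
}

# The three models, in the order they appear in the table.
full_model <- lm(mpg ~ ., data = cars)                    # Model 1: all variables
base_model <- lm(mpg ~ am, data = cars)                   # Model 2: transmission only
best_model <- lm(mpg ~ cyl + hp + wt + am, data = cars)   # Model 3: selected predictors

# Sequential F-tests between consecutive models.
aov_tab <- anova(full_model, base_model, best_model)
print(aov_tab)
```

Each row after the first tests the model against the one listed just before it, so the order of the arguments determines which comparisons are made.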
Please see the residual plots for our chosen regression model. We can conclude the following from the plot:
We also want to identify the impact of each observation on the regression coefficients. One approach is to consider how much the coefficient values change when the observation is left out; here we look at the leverage (hat values) of each observation.
leverage <- hatvalues(bestmodel)
head(sort(leverage, decreasing=TRUE), 3)
##       Maserati Bora Lincoln Continental       Toyota Corona
##              0.4714              0.2937              0.2778
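Leverage measures how unusual an observation's predictor values are, not how much the coefficients move when it is dropped. The leave-one-out change itself can be inspected with base R's `dfbetas()`. The sketch below assumes the selected model is mpg ~ cyl + hp + wt + am and refits it on a local copy; `cars`, `best_model` and `influence_am` are hypothetical names.

```r
# Local copy with the factors needed by the selected model.
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Refit the selected model (assumed formula, taken from the model comparison).
best_model <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Standardized change in the transmission coefficient when each car is
# left out of the fit; "ammanual" is the manual-vs-automatic coefficient.
influence_am <- dfbetas(best_model)[, "ammanual"]

# Cars whose removal shifts the transmission coefficient the most.
print(head(sort(abs(influence_am), decreasing = TRUE), 3))
```

Cars that rank high on both leverage and dfbetas are the ones most worth a closer look in the residual plots.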
The Cook's distance plot (Appendix) is consistent with this analysis, as the same cars are flagged in the residual plots.
From the t-test with mpg as the outcome and am as the predictor, we see that mean mpg differs significantly between manual and automatic transmissions.
t.test(mpg ~ am, data = mtcars)$statistic
##      t
## -3.767
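The t statistic alone does not show the size of the difference. A sketch of how the group means, confidence interval and p-value could be read from the same test follows; it uses R's default Welch (unequal variance) test, and `cars` and `tt` are hypothetical names.

```r
# Local copy with labelled transmission types.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Welch two-sample t-test of mpg by transmission type (R's default).
tt <- t.test(mpg ~ am, data = cars)

print(tt$estimate)   # mean mpg in each group
print(tt$conf.int)   # 95% confidence interval for automatic minus manual
print(tt$p.value)    # p-value for the difference in means
```

The negative sign of the statistic reflects the order of the factor levels: the difference is taken as automatic minus manual.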
From our multiple regression analysis above, we conclude the following:
Scatterplot matrix of mtcars
pairs(mpg~ ., data=mtcars)
Boxplot of miles per gallon by transmission type
boxplot(mpg ~ am, data = mtcars, col = "red", ylab = "miles per gallon")
Residual plot for our regression model
plot(bestmodel)
Cook's distance plot for regression model
plot(bestmodel, which=c(4,6))