C3.Q1 Write a complete model description for the analysis of the difference in masses of apples and oranges.
The answer very closely follows that given for the effect of sex on birth mass from chapter 2.
I fitted a model to investigate the difference in masses of apples and oranges according to
$y_i = a + b x_i + e_i$,
where $i$ indexes data from individual fruits,
$a$ is a model intercept (g), corresponding to the mean mass of apples,
$b$ is a coefficient representing the difference, in g, in mean mass between oranges and apples,
$x_i$ is an indicator variable, which takes a value of one if a given mass record ($y_i$) is for an orange, and zero otherwise;
and $e_i$ are residuals.
C3.Q2 I explained that lm() recodes factor levels as zeros and ones, so that the difference between two groups becomes the same as the regression of the response variable on the re-coded 0/1 data (see particularly the visual aid I attempted in figure 3.2). Prove this to yourself by manually re-coding fruit type as zeros and ones, and checking that it does the same thing as having apples and oranges coded as different factor levels.
The trick is to code an indicator variable that takes a value of one for records involving oranges, and a value of zero for records involving apples.
# recall that the data frame for the fruit example
# was called fakeData
fakeData$IsOrange<-(fakeData$fruit=="orange")+0
Recall from Using R that adding zero to a Boolean variable converts FALSE values to zeros and TRUE values to ones. We can use this new variable as a predictor in a regression.
newMod<-lm(mass~IsOrange,data=fakeData)
summary(newMod)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 49.92817  0.9186333 54.35049 2.039420e-21
## IsOrange    21.00659  1.2991436 16.16957 3.646652e-12
You will note that these are exactly the same coefficients as we estimated when using the factor as a predictor, rather than the zeros and ones. That is because conversion of the factor variable to zeros and ones, just as we did manually, is exactly what the lm() function did. So, what is the mean for apples?
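The equivalence can also be checked directly in code rather than by eye. The sketch below fits both versions of the model and compares the estimates. Since fakeData itself is not reproduced here, it uses a simulated stand-in called simData (the name, seed, sample sizes, means and standard deviation are all assumptions made for illustration), so the numbers will not match those printed above.

```r
# Stand-in for fakeData: simulated values, NOT the data used in the text
set.seed(1)
simData <- data.frame(
  fruit = factor(rep(c("apple", "orange"), each = 10)),
  mass  = c(rnorm(10, mean = 50, sd = 4), rnorm(10, mean = 70, sd = 4))
)

# Manual 0/1 recoding, as in the answer above
simData$IsOrange <- (simData$fruit == "orange") + 0

# Fit the model once with the factor and once with the manual indicator
factorMod    <- lm(mass ~ fruit, data = simData)
indicatorMod <- lm(mass ~ IsOrange, data = simData)

# The second coefficient is named differently in the two fits
# ("fruitorange" vs "IsOrange"), so compare estimates with names dropped
sameCoefs <- all.equal(unname(coef(factorMod)), unname(coef(indicatorMod)))
sameCoefs

# The 0/1 column that lm() builds from the factor matches the manual coding
sameCoding <- all.equal(unname(model.matrix(factorMod)[, 2]), simData$IsOrange)
sameCoding
```

Both comparisons should return TRUE. The call to model.matrix() shows the design matrix that lm() constructed from the factor, which is where the automatic recoding happens.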
For any mass value of an apple, the value of IsOrange will be zero. So, the mean for apples is given by the intercept (49.9 g) plus zero times the slope of the regression of fruit mass on the IsOrange variable, so it will just be the intercept (49.9 g). The mean for oranges is obtained as the intercept (49.9 g), plus one (the value of IsOrange for all oranges) times the coefficient for IsOrange (21 g), which is 49.9 g + 21 g = 70.9 g.
We can re-check that these reconstructed means exactly match the data.
tapply(fakeData$mass,fakeData$fruit,mean)
##    apple   orange
## 49.92817 70.93477
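The same reconstruction can be done from the fitted model object instead of by hand. This is a minimal sketch of that arithmetic, again using a simulated stand-in data frame (simData, with an assumed seed, sample sizes and means) rather than fakeData, so the values will differ from those in the text.

```r
# Stand-in for fakeData: simulated values, NOT the data used in the text
set.seed(1)
simData <- data.frame(
  fruit = factor(rep(c("apple", "orange"), each = 10)),
  mass  = c(rnorm(10, mean = 50, sd = 4), rnorm(10, mean = 70, sd = 4))
)
simData$IsOrange <- (simData$fruit == "orange") + 0

mod <- lm(mass ~ IsOrange, data = simData)
b <- coef(mod)

# Intercept plus (value of IsOrange) times the IsOrange coefficient
reconstructed <- c(
  apple  = b[["(Intercept)"]] + 0 * b[["IsOrange"]],
  orange = b[["(Intercept)"]] + 1 * b[["IsOrange"]]
)

# Group means computed directly from the data
groupMeans <- tapply(simData$mass, simData$fruit, mean)

# Compare the two sets of means, ignoring names and array attributes
meansMatch <- all.equal(unname(reconstructed), as.numeric(groupMeans))
meansMatch
```

The comparison should return TRUE: with a single 0/1 predictor, the least-squares fitted values are the two group means.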
If this doesn’t yet make sense, it should be an indication that some more attention to chapter 3 would be useful before moving on.