End of chapter questions for elementary statistical modelling, chapter 3

C3.Q1 Write a complete model description for the analysis of the difference in masses of apples and oranges.

The answer very closely follows that given for the effect of sex on birth mass from chapter 2.

I fitted a model to investigate the difference in masses y of apples and oranges according to
y_i = \alpha + \delta \cdot \phi_{orange,i}+ e_i,
where i indexes data from individual fruits, \alpha is the model intercept (g), corresponding to the mean mass of apples, and \delta is a coefficient representing the difference in mean mass (g) between oranges and apples (oranges minus apples). \phi_{orange,i} is an indicator variable, which takes a value of one if a given mass record (y_i) is for an orange, and zero otherwise; e_i are residuals.
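To see why \alpha is the apple mean and \delta is the difference, it can help to substitute the two possible values of the indicator into the model. The lines below are a sketch of that substitution; they assume, as is usual for this kind of model, that the residuals e_i average to zero, so that E[y_i] denotes the expected mass.

```latex
\phi_{orange,i} = 0 \ \text{(apple)}: \quad E[y_i] = \alpha + \delta \cdot 0 = \alpha
\phi_{orange,i} = 1 \ \text{(orange)}: \quad E[y_i] = \alpha + \delta \cdot 1 = \alpha + \delta
```

The difference between the two expected values is (\alpha + \delta) - \alpha = \delta, which is the quantity the model is built to estimate.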

C3.Q2 I explained that lm() recodes factor levels as zeros and ones, so that the difference between two groups becomes the same as the regression of the response variable on the re-coded 0/1 data (see particularly the visual aid I attempted in figure 3.2). Prove this to yourself by manually re-coding fruit type as zeros and ones, and checking that it does the same thing as having apples and oranges coded as different factor levels.

The trick is to code an indicator variable that takes a value of one for records involving oranges, and a value of zero for records involving apples.

# recall that the data frame for the fruit example
# was called fakeData
fakeData$IsOrange<-(fakeData$fruit=="orange")+0

Recall from Using R that adding zero to a Boolean variable converts FALSE values to zeros and TRUE values to ones. We can use this new variable as a predictor in a regression.

newMod<-lm(mass~IsOrange,data=fakeData)
summary(newMod)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 49.92817  0.9186333 54.35049 2.039420e-21
## IsOrange    21.00659  1.2991436 16.16957 3.646652e-12

You will note that these are exactly the same coefficients as we estimated when using the factor as a predictor, rather than the zeros and ones. That is because conversion of the factor variable to zeros and ones, just as we did manually, is exactly what lm() did. So, what is the mean for apples?

For any mass record of an apple, the value of IsOrange is zero. So the mean for apples is the intercept (49.9 g) plus zero times the slope of the regression of fruit mass on IsOrange, which is just the intercept (49.9 g). The mean for oranges is the intercept (49.9 g) plus one (the value of IsOrange for all oranges) times the coefficient for IsOrange (21 g), which is 49.9 g + 1 \cdot 21 g = 70.9 g.

We can re-check that these reconstructed means exactly match the data.

tapply(fakeData$mass,fakeData$fruit,mean)
##    apple   orange 
## 49.92817 70.93477
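The whole check can also be run end to end on made-up data. The sketch below is only an illustration: simData is a hypothetical stand-in for fakeData, simulated with arbitrary means and spread, so its numbers will not match those printed above. It fits the model once with the factor and once with the manual 0/1 variable, then compares the coefficients and the reconstructed group means.

```r
# Hypothetical stand-in for fakeData (an assumption for illustration only):
# simulated masses, so the values differ from those in the text.
set.seed(1)
simData <- data.frame(
  fruit = factor(rep(c("apple", "orange"), each = 10)),
  mass  = c(rnorm(10, mean = 50, sd = 4), rnorm(10, mean = 70, sd = 4))
)

# Manual 0/1 recoding: TRUE + 0 gives 1, FALSE + 0 gives 0
simData$IsOrange <- (simData$fruit == "orange") + 0

# Fit the same model twice: factor predictor vs. manual indicator
factorMod    <- lm(mass ~ fruit, data = simData)
indicatorMod <- lm(mass ~ IsOrange, data = simData)

# The coefficient names differ between the two fits, so compare values only
sameCoefs <- isTRUE(all.equal(unname(coef(factorMod)),
                              unname(coef(indicatorMod))))

# Reconstruct the group means from the indicator model:
# apples = intercept, oranges = intercept + coefficient for IsOrange
b <- unname(coef(indicatorMod))
modelMeans <- c(apple = b[1], orange = b[1] + b[2])

# Compare with the raw group means from the data
rawMeans  <- tapply(simData$mass, simData$fruit, mean)
sameMeans <- isTRUE(all.equal(unname(modelMeans), as.vector(rawMeans)))

print(sameCoefs)
print(sameMeans)
```

Both comparisons use all.equal() rather than ==, because the two routes to the same number can differ by tiny floating-point rounding errors.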

If this doesn’t yet make sense, it should be an indication that some more attention to chapter 3 would be useful before moving on.