End of chapter questions for elementary statistical modelling, chapter 9

C9.Q1 There are some baby unicorns with very large birth masses, in the range of 5 kg. Check that these lovely big unicornlings are not drivers of our key results when we relate birth mass to maternal age at parity.

Birth masses greater than 4.5 kg are quite rare, and are very much on the fringe of the distribution of residuals (this can be seen clearly from, e.g., figure 2.4):

table(unicorns$BirthWt>4.5)

# subset the data to remove the biggest outliers
d_sub<-subset(unicorns,unicorns$BirthWt<4.5)
# fit a new model to the data and look at 
# the parameter estimates, especially the slope
newModel<-lm(BirthWt~MumAge,data=d_sub)
summary(newModel)$coefficients

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 2.00062151 0.038665696 51.741510 1.485894e-289
## MumAge      0.04646412 0.006783165  6.849917  1.264459e-11

For comparison, here is the original model.

summary(mMatAge)$coefficients

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 2.01193964 0.040634103 49.513573 5.417337e-276
## MumAge      0.04667381 0.007128639  6.547366  9.176103e-11

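This seems to me like very little change: the slope moves from 0.0467 to 0.0465 kg per year of maternal age, a difference of about 0.2 g per year, or less than 0.5% of the original estimate.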

C9.Q2 What important aspect of checking for data points with potentially undue effects on our analysis was missing from the operation I told you to conduct in question 1? Rectify the situation.

In and of themselves, outlier residuals are not that important. High leverage comes from data points with extreme values of the predictor variable(s), and those data points that are really influential are those with extreme values of both the predictor and the response. A look at the raw data (say in the background to figure 2.4) doesn’t suggest any serious concerns. However, four out of five of the data points from the oldest mothers are below the regression line. So perhaps we could see what happens if we remove mothers with ages greater than ten years:

d_sub<-subset(unicorns,unicorns$MumAge<=10)
newModel<-lm(BirthWt~MumAge,data=d_sub)
summary(newModel)$coefficients

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 1.99197799 0.042037864 47.385328 1.380742e-260
## MumAge      0.05132986 0.007561621  6.788208  1.917646e-11

This regression is a tiny bit steeper. The difference is on the order of 5 g of offspring mass per year of maternal age. This is about 10% of the estimated slope, and about 0.25% of the mean mass of a newborn unicorn. This is not a serious concern for most biological purposes.
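The arithmetic behind the first two of those figures can be checked directly from the printed estimates. This is only a sketch: the names `slope_all`, `slope_sub` and `diff_kg` are mine, and the numbers are typed in from the coefficient tables above rather than extracted from the fitted models.

```r
# slope estimates copied from the coefficient tables above (kg per year)
slope_all <- 0.04667381   # full dataset (mMatAge)
slope_sub <- 0.05132986   # mothers aged ten or younger

# difference between the two slopes, in kg per year of maternal age
diff_kg <- slope_sub - slope_all

# the same difference expressed in grams per year
diff_g <- diff_kg * 1000
diff_g

# the difference as a percentage of the original slope
diff_pct <- 100 * diff_kg / slope_all
diff_pct
```

With the real model objects, the same quantities could be taken from `coef(newModel)["MumAge"]` and `coef(mMatAge)["MumAge"]` instead of typing the values in.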

If any set of data points could be severely pulling the regression upwards, it would be those with maternal ages of eight or above and masses of 3 kg or more. Let’s check what happens if we remove these.

# a bit of care is needed to make sure just to
# cut out the extreme values of both variables.
# I got it wrong myself on the first attempt.
d_sub<-subset(unicorns,unicorns$MumAge<8
     | unicorns$BirthWt<3)
newModel<-lm(BirthWt~MumAge,data=d_sub)
summary(newModel)$coefficients

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 2.06551699 0.039646291 52.098618 3.353146e-289
## MumAge      0.03116324 0.007080192  4.401468  1.188839e-05
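The logic of that subset is easy to get backwards, so here is a small check of it. The data frame `toy` is made up purely for illustration (one row for each combination of young/old mother and light/heavy offspring), and `to_drop`, `keep_or` and `keep_and` are my own names.

```r
# toy data (hypothetical values), one row per combination
toy <- data.frame(MumAge  = c(5, 5, 9, 9),
                  BirthWt = c(2.5, 3.5, 2.5, 3.5))

# rows we want to remove: old mother AND heavy offspring
to_drop <- toy$MumAge >= 8 & toy$BirthWt >= 3

# keeping everything else is the same as the "or" condition used above
keep_or <- toy$MumAge < 8 | toy$BirthWt < 3
identical(!to_drop, keep_or)

# the tempting "and" version also removes rows that are
# extreme on only one of the two variables
keep_and <- toy$MumAge < 8 & toy$BirthWt < 3

toy[keep_or, ]    # three rows: only the old-and-heavy row is gone
toy[keep_and, ]   # one row: too much has been removed
```

Negating "old and heavy" gives "young or light", which is why the subset uses `|` rather than `&`.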

In this last exercise, we cut out a bunch of data points in the upper-right portion of the scatter of the raw data. It is therefore inevitable that the fitted regression equation will have a less steep slope. It is hard to imagine a situation where the difference between the estimated slope of 0.031 $kg \cdot year^{-1}$ and the estimate of 0.047 $kg \cdot year^{-1}$ for the whole dataset would have a major effect on biological conclusions.
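A complement to picking subsets by eye is to compute leverage and influence for every data point, using `hatvalues()` and `cooks.distance()` from base R. The sketch below is not part of the original answer, and it runs on simulated data rather than the real `unicorns` data: the data frame `sim`, its parameter values, and the model name `m` are all assumptions made so that the example is self-contained.

```r
# simulated stand-in for the unicorns data (NOT the real data);
# column names follow the models above
set.seed(1)
n <- 200
sim <- data.frame(MumAge = sample(2:12, n, replace = TRUE))
sim$BirthWt <- 2 + 0.05 * sim$MumAge + rnorm(n, sd = 0.5)

m <- lm(BirthWt ~ MumAge, data = sim)

# leverage depends only on the predictor: in simple regression
# it grows with squared distance from the mean maternal age
lev <- hatvalues(m)
x <- sim$MumAge
lev_by_hand <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(lev), lev_by_hand)

# Cook's distance combines leverage with the size of the residual,
# so it flags points that are extreme in both senses
cook <- cooks.distance(m)

# the five rows with the largest Cook's distances
head(sim[order(cook, decreasing = TRUE), ], 5)
```

With the real data, the same two functions could be applied to `mMatAge`, and the rows with the largest Cook's distances would be natural candidates for the kind of remove-and-refit check done above.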
