The regression line predicts the average y value associated with a given x value. Note that it is also necessary to get a measure of the spread of the y values around that average. To do this, we use the root-mean-square error (r.m.s. error).

To construct the r.m.s. error, you first need to determine the residuals. Residuals are the differences between the actual values and the predicted values. They are denoted by $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed value for the ith observation and $\hat{y}_i$ is the predicted value.

Residuals can be positive or negative, depending on whether the predicted value underestimates or overestimates the actual value. Squaring the residuals, averaging the squares, and taking the square root gives us the r.m.s. error. You then use the r.m.s. error as a measure of the spread of the y values about the predicted y value.
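The three steps (square, average, take the root) can be sketched in a few lines of Python. This is only an illustration: the function names regression_line and rms_error and the five data points are made up for the example, and the average of the squares is taken over all n observations.

```python
import math

def regression_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rms_error(xs, ys):
    """Square the residuals, average the squares, take the square root."""
    slope, intercept = regression_line(xs, ys)
    # residual = actual value minus predicted value
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Hypothetical data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(rms_error(xs, ys))  # about 0.69, in the same units as y
```

For these points the regression line is y = 0.6x + 2.2, the residuals are -0.8, 0.6, 1.0, -0.6 and -0.2, and the average of their squares is 0.48.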

As before, you can usually expect 68% of the y values to be within one r.m.s. error, and 95% to be within two r.m.s. errors of the predicted values. These approximations assume that the data set is football-shaped.
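If you want to see the 68% and 95% figures at work, the following sketch simulates a football-shaped data set and counts how many points fall within one and two r.m.s. errors of the line. The data are artificial (a made-up line plus normal noise, with an arbitrary seed), so the fractions it prints are only approximately 0.68 and 0.95.

```python
import math
import random

random.seed(0)  # arbitrary fixed seed so the sketch is repeatable

# Made-up football-shaped data: y depends linearly on x, plus normal noise.
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
rms = math.sqrt(sum(e * e for e in residuals) / n)

# Fraction of points within one and within two r.m.s. errors of the line.
frac_one = sum(abs(e) <= rms for e in residuals) / n
frac_two = sum(abs(e) <= 2 * rms for e in residuals) / n
print(round(frac_one, 2), round(frac_two, 2))
```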

Squaring the residuals, averaging the squares, and then taking the square root to compute the r.m.s. error is a lot of work. Fortunately, algebra provides us with a shortcut (whose mechanics we will omit).

The r.m.s. error is also equal to $\sqrt{1 - r^2}$ times the SD of y.

Thus the r.m.s. error is measured on the same scale, with the same units, as $y$.

The term $\sqrt{1 - r^2}$ is always between 0 and 1, since r is between -1 and 1. It tells us how much smaller the r.m.s. error will be than the SD.
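As a rough check on the shortcut, the sketch below computes the r.m.s. error both ways, directly from the residuals and as the square root of 1 - r^2 times the SD of y, and compares the results. The helper names (sd, correlation, rms_error_direct, rms_error_shortcut) and the data are invented for this illustration. It assumes the SD is computed with n in the denominator; with n - 1 the two answers would not agree exactly.

```python
import math

def sd(values):
    """SD with n (not n - 1) in the denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    """Correlation coefficient r of x and y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (sd(xs) * sd(ys))

def rms_error_direct(xs, ys):
    """Square the residuals, average the squares, take the root."""
    n = len(xs)
    slope = correlation(xs, ys) * sd(ys) / sd(xs)
    intercept = sum(ys) / n - slope * sum(xs) / n
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in residuals) / n)

def rms_error_shortcut(xs, ys):
    """sqrt(1 - r^2) times the SD of y."""
    r = correlation(xs, ys)
    # max(...) guards against 1 - r*r dipping just below 0 from rounding
    return math.sqrt(max(0.0, 1 - r * r)) * sd(ys)

# Hypothetical data for the comparison.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(rms_error_direct(xs, ys), rms_error_shortcut(xs, ys))
```

Both numbers should agree up to rounding in the last few digits.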

For example, if all the points lie exactly on a line with positive slope, then r will be 1, and the r.m.s. error will be 0. This means there is no spread in the values of y around the regression line (which you already knew since they all lie on a line).

The residuals can also be used to provide graphical information. If you plot the residuals against the x variable, you expect to see no pattern. If you do see a pattern, it is an indication that there is a problem with using a line to approximate this data set.
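To see what a pattern in the residuals looks like, here is a small sketch that fits a line to data that are really curved (y equal to x squared, a deliberately bad case invented for the example) and prints the residual at each x value.

```python
# Made-up curved data: y = x^2, which a straight line cannot follow.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# residual = actual value minus predicted value
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
for x, e in zip(xs, residuals):
    print(x, e)
```

Here the fitted line is flat at y = 4, and the residuals run 5, 0, -3, -4, -3, 0, 5: positive at both ends and negative in the middle. That U shape is the kind of pattern that signals a line is the wrong summary for the data.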

To use the normal approximation in a vertical slice, consider the points in the slice to be a new group of Y's. Their average value is the predicted value from the regression line, and their spread or SD is the r.m.s. error from the regression.

Then work as in the normal distribution, converting to standard units and eventually using the table on page 105 of the appendix if necessary.
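The slice calculation can be sketched as follows. The numbers are hypothetical (a predicted value of 70, an r.m.s. error of 2, and a cutoff of 72), the function name fraction_below is made up, and the normal table is replaced by Python's math.erf, which gives the same area under the normal curve.

```python
import math

def fraction_below(cutoff, predicted, rms_error):
    """Approximate fraction of y values in a vertical slice below `cutoff`.

    Treats the slice as roughly normal with average `predicted`
    (from the regression line) and SD `rms_error`.
    """
    z = (cutoff - predicted) / rms_error  # convert to standard units
    # Area under the standard normal curve to the left of z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical slice: predicted y is 70, r.m.s. error is 2.
share = fraction_below(72, predicted=70, rms_error=2)
print(round(share, 2))  # cutoff is 1 standard unit above average: about 0.84
```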