Excel Companion Appendix C |
| Line Fitting Using Least Squares | Linefit.xls |
Mathematics modeling often starts with the collection of discrete data points and leads to the discovery of relationships between variables that can be approximated by mathematical functions. In Appendix A of the Excel Companion you learned how to display the graph of a function. In this Appendix you will be introduced to basic concepts and mechanics related to the process of fitting functions to data. This process, called regression, is fully explored in statistics courses where its power and limitations are described in detail. We introduce the basic concept here since it is natural for students to be curious about where functions come from.
Open the file Linefit.xls. The first sheet contains a
collection of points (X,Y) whose scatterplot suggests a linear
relationship between X and Y.
Figure C_1
Obviously, any line that we would accept as a reasonable fit
to this data must have a negative slope. As our first estimate of
the slope, we might guess that each 10 unit increase in X
corresponds to a unit drop of about 5 units in Y. If we were to
maneuver the edge of a transparent ruler on this scatterplot with
a slope equal to -0.5 then a reasonable guess for the y
intercept might be 265.

Figure C_2
In the worksheet Y=mX+b , the range A6:A34 has been named X, the range B6:B34 has been named Y, D2 has been named m, and D3 has been named b. Column B contains the actual values of Y. Column C contains estimated values of Y calculated with a slope m and a y-intercept b. Column D contains the difference between the actual Y and the estimated Y. This type of error is known as a "residual". Some residuals are positive and some negative. Column E contains the squares of each of these errors. We square the errors before adding to avoid having positive and negative errors cancel each other out. Cell H2 contains the Sum of the Squares of these Errors (SSE ) in column E. Observe what happens to SSE when you make slight changes to the slope and y-intercept. If, for example, you change the slope to -0.45 and the intercept to 262, the value of SSE decreases to 85.095. This line might be considered a better fit than Y = -0.5X + 265 under the criterion of "lower SSE". Another line, Y = -0.35 X + 257, produces an SSE of 75.575, which is even lower. Is it possible to determine the values of m and b that result in the least sum of the squares of the residuals?
| One way to find the slope and intercept of the line that satisfies a "least squares" criterion is to use Excel's Solver Add-in under Tools. Set H2 to be the target cell, and minimize SSE by changing the slope and intercept in cells D2 and D3. |
|
|
The results that we want appear in cells D2 and D3 and
represent the slope and intercept of that line which minimizes
SSE. According to the results produced by Solver, the
minimum SSE of 40.0595 corresponds to a line whose slope is
-.40184 and whose y-intercept is 260.1179.
Graphical calculators and statistical software programs have
built-in regression tools that can easily produce information
about regression lines or curves given the coordinates of the
points. Excel has such a tool built into its charting utility. It
is called "Trendline" and is illustrated below.
Figure C_4
4. Click on the Options tab and select Display Equation on Chart and Display R-Squared Value on the Chart.
R-Squared is computed using SSE and is one measure of the goodness of fit; the better the fit, the closer R-Squared is to 1. Next click OK.
Figure C_5
Trendline tells us that the line of best fit has the equation
y = -0.4018x + 260.12 and that the value of R-Squared is =
0.9803. Note that this is the same line that we obtained by
minimizing SSE using Solver. Trendlines are often used to make
predictions but must be used cautiously, in context, and with
common sense.
Figure C_6
In the trendline shown above, X = 95 would give us an
extrapolated Y value near 222. When X = 200 the corresponding
extrapolated value for Y is below 180. Note that X = 200 is far
from the observed data in the interval 13 < X <
85. Very little confidence can be associated with this second
extrapolation regardless of the context.
![]()
Copyright © Joseph F. Aieta, Babson
College 1997