Excel Companion Appendix C

 

Line Fitting Using Least Squares

Linefit.xls

Mathematics modeling often starts with the collection of discrete data points and leads to the discovery of relationships between variables that can be approximated by mathematical functions. In Appendix A of the Excel Companion you learned how to display the graph of a function. In this Appendix you will be introduced to basic concepts and mechanics related to the process of fitting functions to data. This process, called regression, is fully explored in statistics courses where its power and limitations are described in detail. We introduce the basic concept here since it is natural for students to be curious about where functions come from.

Open the file Linefit.xls. The first sheet contains a collection of points (X,Y) whose scatterplot suggests a linear relationship between X and Y.

Figure C_1

Obviously, any line that we would accept as a reasonable fit to this data must have a negative slope. As our first estimate of the slope, we might guess that each 10 unit increase in X corresponds to a unit drop of about 5 units in Y. If we were to maneuver the edge of a transparent ruler on this scatterplot with a slope equal to -0.5 then a reasonable guess for the y intercept might be 265.


Figure C_2

In the worksheet Y=mX+b , the range A6:A34 has been named X, the range B6:B34 has been named Y, D2 has been named m, and D3 has been named b. Column B contains the actual values of Y. Column C contains estimated values of Y calculated with a slope m and a y-intercept b. Column D contains the difference between the actual Y and the estimated Y. This type of error is known as a "residual". Some residuals are positive and some negative. Column E contains the squares of each of these errors. We square the errors before adding to avoid having positive and negative errors cancel each other out. Cell H2 contains the Sum of the Squares of these Errors (SSE ) in column E. Observe what happens to SSE when you make slight changes to the slope and y-intercept. If, for example, you change the slope to -0.45 and the intercept to 262, the value of SSE decreases to 85.095. This line might be considered a better fit than Y = -0.5X + 265 under the criterion of "lower SSE". Another line, Y = -0.35 X + 257, produces an SSE of 75.575, which is even lower. Is it possible to determine the values of m and b that result in the least sum of the squares of the residuals?

 

One way to find the slope and intercept of the line that satisfies a "least squares" criterion is to use Excel's Solver Add-in under Tools. Set H2 to be the target cell, and minimize SSE by changing the slope and intercept in cells D2 and D3.

 

FigureC_3

The results that we want appear in cells D2 and D3 and represent the slope and intercept of that line which minimizes SSE. According to the results produced by Solver, the minimum SSE of 40.0595 corresponds to a line whose slope is -.40184 and whose y-intercept is 260.1179.

Graphical calculators and statistical software programs have built-in regression tools that can easily produce information about regression lines or curves given the coordinates of the points. Excel has such a tool built into its charting utility. It is called "Trendline" and is illustrated below.

  1. Activate the chart containing the scatterplot by double -clicking on the border. The border changes to a rope - like frame. In some cases the chart is displayed in its own window.
  2. Activate the data series by clicking the mouse in the vicinity of the points in the scatterplot. Points will be displayed with a black, square background.
  3. From the main menu select Insert… Trendline…and then choose Linear as the Trend/Regression Type.

Figure C_4

4. Click on the Options tab and select Display Equation on Chart and Display R-Squared Value on the Chart.

R-Squared is computed using SSE and is one measure of the goodness of fit; the better the fit, the closer R-Squared is to 1. Next click OK.

Figure C_5

Trendline tells us that the line of best fit has the equation y = -0.4018x + 260.12 and that the value of R-Squared is = 0.9803. Note that this is the same line that we obtained by minimizing SSE using Solver. Trendlines are often used to make predictions but must be used cautiously, in context, and with common sense.

Figure C_6

In the trendline shown above, X = 95 would give us an extrapolated Y value near 222. When X = 200 the corresponding extrapolated value for Y is below 180. Note that X = 200 is far from the observed data in the interval 13 < X < 85. Very little confidence can be associated with this second extrapolation regardless of the context.

Return to Table of Contents  


Copyright © Joseph F. Aieta, Babson College 1997