Regression Analysis Using Excel

Regression Analysis

The goal of regression analysis is to describe the relationship between two variables based on observed data and to predict the value of the dependent variable based on the value of the independent variable. Even though we can make such predictions, this doesn’t imply that we can claim any causal relationship between the independent and dependent variables.

Definition 1: If y is a dependent variable and x is an independent variable, then the linear regression model provides a prediction of y from x of the form

Regression model formula

where α + βx is the deterministic portion of the model and ε is the random error. We further assume that for any given value of x the random error ε is normally and independently distributed with mean zero.

Observation: In practice, we will build the linear regression model from the sample data using the least-squares method. Thus we seek coefficients a and b such that

Equation straight line

For the data in our sample we will have

image1688

where ŷi is the y value predicted by the model at xi. Thus the error term for the model is given by

image1689

Example 1: For each x value in the sample data from Example 1 of One Sample Hypothesis Testing for Correlation, find the predicted value ŷ corresponding to x, i.e. the value of y on the regression line corresponding to x. Also find the predicted life expectancy of men who smoke 4, 24 and 44 cigarettes based on the regression model.

Forecast prediction Excel

Figure 1 – Obtaining predicted values for data in Example 1

The predicted values can be obtained using the fact that for any i, the point (xi, ŷi) lies on the regression line and so ŷi = a + bxi. E.g. cell K5 in Figure 1 contains the formula =I5*E4+E5, where I5 contains the first x value 5, E4 contains the slope b and E5 contains the y-intercept (referring to the worksheet in Figure 1 of Method of Least Squares). Alternatively, this value can be obtained by using the formula =FORECAST(I5,J5:J19, I5:I19). In fact, the predicted y values can be obtained, as a single unit, by using the array formula TREND. This is done by highlighting the range K5:K19 and entering the array formula =TREND(J5:J19, I5:I19) followed by pressing Ctrl-Shft-Enter.

The predicted values for x = 4, 24 and 44 can be obtained in a similar manner using any of the three methods defined above. The second form of the TREND formula can be used. E.g. to obtain the predicted values of 4, 24 and 44 (stored in N19:N21), highlight range O19:O21, enter the array formula =TREND(J5:J19,I5:I19,N19:N21) and then press Ctrl-Shft-Enter. Note that these approaches yield predicted values even for values of x that are not in the sample (such as 24 and 44). The predicted life expectancy for men who smoke 4, 24 and 44 cigarettes is 83.2, 70.6 and 58.1 years respectively.

Definition 2: We use the following terminology:

Picture101

The Residual is the error term of Definition 1. We also define the degrees of freedom dfTdfReg, dfRes, the sum of squares SST, SSReg, SSRes and the mean squares MST, MSReg, MSRes as follows:

Picture100

Property 1:

image1711image1712image1713

ObservationSST is the total variability of y (e.g. the variability of life expectancy in Example 1 of One Sample Hypothesis Testing for Correlation). SSReg represents the variability of y that can be explained by the regression model (i.e. the variability in life expectancy that can be explained by the number of cigarettes smoked), and so by Property 1, SSRes expresses the variability of y that can’t be explained by the regression model.

Thus SSReg/SST represents the percentage of the variability of y that can be explained by the regression model. It turns out that this is equal to the coefficient of determination.

Property 2:
image1720

Property 3:
image1721

Observation: Note that for a sample size of 100, a correlation coefficient as low as .197 will result in the null hypothesis that the population correlation coefficient is 0 being rejected (per Theorem 1 of One Sample Hypothesis Testing for Correlation). But when the correlation coefficient r = .197, then r2 = .039, which means that model variance SSReg is less than 4% of the total variance SST which is quite a small association indeed. Whereas this effect is “significant”, it certainly isn’t very “large”.

Observation: From Property 2, we see that the coefficient of determination r2 is a measure of the accuracy of the predication of the linear regression model. r2 has a value between 0 and 1, with 1 indicating a perfect fit between the linear regression model and the data.

Property 4:
image1724image1725

Definition 3:  The standard error of the estimate is defined as

Standard error estimate

Observation: The second assertion in Property 4 can be restated as

image1728

For large samples \frac{n-1}{n-2} ≈ 1 and so

image1730

Note that if r = .5, then

image5045

which indicates that the standard error of the estimate is still 86.6% of the standard error that doesn’t factor in any information about x; i.e. having information about x only reduces the error by 13.4%. Even if r = .9, then sy.x = .436·sy, which indicates that information about x reduces the standard error (with no information about x) by only a little over 50%.

Property 5:

a)  The sums of the y values is equal to the sum of the ŷ values; i.e. \sum_i{y_i} = \sum_i{\hat{y_i}}

b)  The mean of the y values and ŷ values are equal; i.e. ȳ = the mean of the ŷi

c)  The sums of the error terms is 0; i.e. \sum_i{e_i} = 0

d)  The correlation coefficient of x with ŷ is sign(b); i.e. rxŷ = sign(rxy)

e)  The correlation coefficient of y with ŷ is the absolute value of the correlation coefficient of x with y; i.e. r_{y\hat{y}} = |r_{xy}|

f)  The coefficient of determination of y with ŷ is the same as the correlation coefficient of x with y; i.e. r_{y\hat{y}}^2 = r_{xy}^2

Help with any types of dissertations

Undergraduate dissertation

Choose from hundreds of experts who can assist you in completing your undergraduate dissertation! Prices start at $10 per page, with potential discounts for longer orders or extended deadlines

Master’s dissertation

 

If you are in the process of completing a Master’s degree, we can provide you with an experienced writer to finish your dissertation. We strive to offer a quick turnaround on tailored papers at an affordable price, starting at $10.30 per page.

Ph.D. or doctoral dissertation

 

Hire from among our most skilled experts to save your time and ease your workload. Prices for Ph.D. assistance start at $10.60 per page.

Dissertation Writing Help
Need Our Services?
Thesis Writing Help
Editing & Proofreading Services
Need Help, Whatsapp Us Now