Multiple Linear Regression Analysis

Introduction
This chapter expands on the analysis of simple linear regression models and discusses the analysis of multiple linear regression models. A major portion of the results displayed in DOE++ is explained in this chapter because these results are associated with multiple linear regression. One of the applications of multiple linear regression models is Response Surface Methodology (RSM). RSM is a method used to locate the optimum value of the response and is one of the final stages of experimentation. It is discussed in Chapter 9. Towards the end of this chapter, the concept of using indicator variables in regression models is explained. Indicator variables are used to represent qualitative factors in regression models. Understanding indicator variables is important for understanding ANOVA models, which are the models used to analyze data obtained from experiments. These models can be thought of as first order multiple linear regression models where all the factors are treated as qualitative factors. ANOVA models are discussed in Chapter 6, Analysis of Experiments.

Multiple Linear Regression Model
A linear regression model that contains more than one predictor variable is called a multiple linear regression model. The following model is a multiple linear regression model with two predictor variables, $$x_1 $$ and $$x_2 $$:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\epsilon $$ (1)

The model is linear because it is linear in the parameters $$\beta_0 $$, $$\beta_1 $$ and $$\beta_2 $$. The model describes a plane in the three dimensional space of $$Y $$, $$x_1 $$ and $$x_2 $$. The parameter $$\beta_0 $$ is the intercept of this plane. Parameters $$\beta_1 $$ and $$\beta_2 $$ are referred to as partial regression coefficients. Parameter $$\beta_1 $$ represents the change in the mean response corresponding to a unit change in $$x_1 $$ when $$x_2 $$ is held constant. Parameter $$\beta_2 $$ represents the change in the mean response corresponding to a unit change in $$x_2 $$ when $$x_1 $$ is held constant.

Consider the following example of a multiple linear regression model with two predictor variables, $$x_1 $$ and $$x_2 $$: (2)

This regression model is a first order multiple linear regression model because the maximum power of the variables in the model is one. The regression plane corresponding to this model is shown in Figure 5.1. Also shown is an observed data point and the corresponding random error, $$\epsilon $$. The true regression model is rarely known (and therefore the values of the random error terms corresponding to observed data points remain unknown). However, the regression model can be estimated by calculating the parameters of the model for an observed data set. This is explained in Chapter 5, Estimating Regression Models Using Least Squares.



Figure 5.1: Regression plane for the model.

Figure 5.2 shows the contour plot for the regression model of Eqn. (2). The contour plot shows lines of constant mean response values as a function of $$x_1 $$ and $$x_2 $$. The contour lines for the given regression model are straight lines as seen on the plot. Straight contour lines result for first order regression models with no interaction terms.



Figure 5.2: Contour plot for the model.

A linear regression model may also take the following form:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_{12} x_1 x_2+\epsilon $$ (3)

A cross-product term, $$x_1 x_2 $$, is included in the model. This term represents an interaction effect between the two variables $$x_1 $$ and $$x_2 $$. Interaction means that the effect produced by a change in one predictor variable on the response depends on the level of the other predictor variable(s). The regression plane and contour plot for an example of a linear regression model with interaction are shown in Figures 5.3 and 5.4, respectively.



Figure 5.3: Regression plane for the model.



Figure 5.4: Contour plot for the model.

Now consider the regression model shown next: (4)

This model is also a linear regression model and is referred to as a polynomial regression model. Polynomial regression models contain squared and higher order terms of the predictor variables making the response surface curvilinear. As an example of a polynomial regression model with an interaction term consider the following equation: (5)

This model is a second order model because the maximum power of the terms in the model is two. The regression surface for this model is shown in Figure 5.5. Such regression models are used in RSM to find the optimum value of the response, $$Y $$ (for details see Chapter 9, Response Surface Methods). Notice that, although the shape of the regression surface is curvilinear, the regression model of Eqn. (5) is still linear because the model is linear in the parameters. The contour plot for this model is shown in Figure 5.6.

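All multiple linear regression models can be expressed in the following general form:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\ldots+\beta_k x_k+\epsilon $$ (6)

where $$k $$ denotes the number of terms in the model. For example, the model of Eqn. (5) can be written in the general form using $$x_3=x_1^2 $$, $$x_4=x_2^2 $$ and $$x_5=x_1 x_2 $$ as follows:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3+\beta_4 x_4+\beta_5 x_5+\epsilon $$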



Figure 5.5: Regression surface for the model.

Figure 5.6: Contour plot for the model.

Estimating Regression Models Using Least Squares
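Consider a multiple linear regression model with $$k $$ predictor variables:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\ldots+\beta_k x_k+\epsilon $$

Let each of the $$k $$ predictor variables, $$x_1 $$, $$x_2 $$, ..., $$x_k $$, have $$n $$ levels. Then $$x_{ij} $$ represents the $$i $$th level of the $$j $$th predictor variable, $$x_j $$. For example, $$x_{51} $$ represents the fifth level of the first predictor variable, $$x_1 $$, while $$x_{19} $$ represents the first level of the ninth predictor variable, $$x_9 $$. Observations, $$y_1 $$, $$y_2 $$, ..., $$y_n $$, recorded for each of these $$n $$ levels can be expressed in the following way:

$$\begin{array}{rcl} y_1 &=& \beta_0+\beta_1 x_{11}+\beta_2 x_{12}+\ldots+\beta_k x_{1k}+\epsilon_1 \\ y_2 &=& \beta_0+\beta_1 x_{21}+\beta_2 x_{22}+\ldots+\beta_k x_{2k}+\epsilon_2 \\ & \vdots & \\ y_i &=& \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\ldots+\beta_k x_{ik}+\epsilon_i \\ & \vdots & \\ y_n &=& \beta_0+\beta_1 x_{n1}+\beta_2 x_{n2}+\ldots+\beta_k x_{nk}+\epsilon_n \end{array} $$

The system of equations shown previously can be represented in matrix notation as follows:

$$y=X\beta+\epsilon $$ (7)

where:

$$y=\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad X=\begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix} \qquad \beta=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \qquad \epsilon=\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} $$

The matrix $$X $$ in Eqn. (7) is referred to as the design matrix. It contains information about the levels of the predictor variables at which the observations are obtained. The vector $$\beta $$ contains all the regression coefficients. To obtain the regression model, $$\beta $$ should be known. $$\beta $$ is estimated using least square estimates. The following equation is used:

$$\widehat{\beta}=(X'X)^{-1}X'y $$ (8)

where $$' $$ represents the transpose of the matrix while $$^{-1} $$ represents the matrix inverse. Knowing the estimates, $$\widehat{\beta} $$, the multiple linear regression model can now be estimated as:

$$\widehat{y}=X\widehat{\beta} $$ (9)

The estimated regression model is also referred to as the fitted model. The observations, $$y_i $$, may be different from the fitted values, $$\widehat{y}_i $$, obtained from this model. The difference between these two values is the residual, $$e_i $$. The vector of residuals, $$e $$, is obtained as:

$$e=y-\widehat{y} $$ (10)

The fitted model of Eqn. (9) can also be written as follows, using $$\widehat{\beta} $$ from Eqn. (8):

$$\widehat{y}=X\widehat{\beta}=X(X'X)^{-1}X'y=Hy $$ (11)

where $$H=X(X'X)^{-1}X' $$. The matrix, $$H $$, is referred to as the hat matrix. It transforms the vector of the observed response values, $$y $$, to the vector of fitted values, $$\widehat{y} $$.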
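The matrix calculations of Eqns. (8) through (11) can be sketched in a few lines of NumPy. This is only an illustration: the data are synthetic (they are not the values of Table 5.1), and the names `beta_hat`, `H`, `y_hat` and `e` are chosen here for readability.

```python
import numpy as np

# Synthetic illustration data (not the values of Table 5.1).
rng = np.random.default_rng(0)
n = 17
x1 = rng.uniform(40.0, 50.0, n)
x2 = rng.uniform(28.0, 32.0, n)
y = -150.0 + 1.2 * x1 + 12.0 * x2 + rng.normal(0.0, 5.0, n)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), x1, x2])

# Eqn. (8): beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Eqn. (11): hat matrix H = X (X'X)^{-1} X' and fitted values y_hat = H y
H = X @ XtX_inv @ X.T
y_hat = H @ y

# Eqn. (10): residuals e = y - y_hat
e = y - y_hat

print("estimated coefficients:", beta_hat)
```

The explicit inverse mirrors the notation of Eqn. (8). In practice a least squares solver such as `np.linalg.lstsq` is usually preferred for numerical stability and gives the same estimates.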

Example 5.1

An analyst studying a chemical process expects the yield to be affected by the levels of two factors, $$x_1 $$ and $$x_2 $$. Observations recorded for various levels of the two factors are shown in Table 5.1. The analyst wants to fit a first order regression model to the data. Interaction between $$x_1 $$ and $$x_2 $$ is not expected based on knowledge of similar processes. Units of the factor levels and the yield are ignored for the analysis.



Table 5.1: Observed yield data for various levels of two factors.

The data of Table 5.1 can be entered into DOE++ using the Multiple Regression tool as shown in Figure 5.7. A scatter plot for the data in Table 5.1 is shown in Figure 5.8. The first order regression model applicable to this data set having two predictor variables is:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\epsilon $$

where the dependent variable, $$Y $$, represents the yield and the predictor variables, $$x_1 $$ and $$x_2 $$, represent the two factors respectively. The $$X $$ and $$y $$ matrices for the data can be obtained as:



Figure 5.7: Multiple Regression tool in DOE++ with the data in Table 5.1.



Figure 5.8: Three dimensional scatter plot for the observed data in Table 5.1.

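The least square estimates, $$\widehat{\beta} $$, can now be obtained using Eqn. (8):

$$\widehat{\beta}=(X'X)^{-1}X'y $$

The elements of $$\widehat{\beta} $$ are the estimated regression coefficients $$\widehat{\beta}_0 $$, $$\widehat{\beta}_1 $$ and $$\widehat{\beta}_2 $$. The fitted regression model is:

$$\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1 x_1+\widehat{\beta}_2 x_2 $$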

In DOE++, the fitted regression model can be viewed using the Show Analysis Summary icon in the Control Panel. The model is shown in Figure 5.9.



Figure 5.9: Equation of the fitted regression model for the data in Table 5.1.

A plot of the fitted regression plane is shown in Figure 5.10. The fitted regression model can be used to obtain fitted values, $$\widehat{y}_i $$, corresponding to an observed response value, $$y_i $$. For example, the fitted value corresponding to the fifth observation is:

$$\widehat{y}_5=\widehat{\beta}_0+\widehat{\beta}_1 (47.3)+\widehat{\beta}_2 (29.9) $$



Figure 5.10: Fitted regression plane for the data of Table 5.1.

The observed fifth response value is $$y_5 $$. The residual corresponding to this value is:

$$e_5=y_5-\widehat{y}_5 $$

In DOE++, fitted values and residuals are available using the Diagnostic icon in the Control Panel. The values are shown in Figure 5.11. The fitted regression model can also be used to predict response values. For example, to obtain the response value for a new observation corresponding to 47 units of $$x_1 $$ and 31 units of $$x_2 $$, the value is calculated using:

$$\widehat{y}(47,31)=\widehat{\beta}_0+\widehat{\beta}_1 (47)+\widehat{\beta}_2 (31) $$



Figure 5.11: Fitted values and residuals for the data in Table 5.1.

Properties of the Least Square Estimators, $$\widehat{\beta} $$
The least square estimates, $$\widehat{\beta}_0 $$, $$\widehat{\beta}_1 $$, ..., $$\widehat{\beta}_k $$, are unbiased estimators of $$\beta_0 $$, $$\beta_1 $$, ..., $$\beta_k $$, provided that the random error terms, $$\epsilon_i $$, are normally and independently distributed. The variances of the $$\widehat{\beta} $$s are obtained using the $$(X'X)^{-1} $$ matrix. The variance-covariance matrix of the estimated regression coefficients is obtained as follows:

$$C=\widehat{\sigma}^2 (X'X)^{-1} $$ (12)

$$C $$ is a symmetric matrix whose diagonal elements, $$C_{jj} $$, represent the variance of the estimated $$j $$th regression coefficient, $$\widehat{\beta}_j $$. The off-diagonal elements, $$C_{ij} $$, represent the covariance between the $$i $$th and $$j $$th estimated regression coefficients, $$\widehat{\beta}_i $$ and $$\widehat{\beta}_j $$. The value of $$\widehat{\sigma}^2 $$ is obtained using the error mean square, $$MS_E $$, which can be calculated as discussed in the beginning of Chapter 5, Multiple Linear Regression Analysis. The variance-covariance matrix for the data in Table 5.1 is shown in Figure 5.12. It is available in DOE++ using the Show Analysis Summary icon in the Control Panel. Calculations to obtain the matrix are given in Example 5.3 in Chapter 5, Test on Individual Regression Coefficients. The positive square root of $$C_{jj} $$ represents the estimated standard deviation of the $$j $$th regression coefficient, $$\widehat{\beta}_j $$, and is called the estimated standard error of $$\widehat{\beta}_j $$ (abbreviated $$se(\widehat{\beta}_j) $$):

$$se(\widehat{\beta}_j)=\sqrt{C_{jj}} $$ (13)



Figure 5.12: The variance-covariance matrix for the data of Table 5.1.
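A minimal sketch of Eqns. (12) and (13) follows, assuming synthetic data with two predictor variables (not the data of Table 5.1). The variable names `C`, `ms_e` and `se` are illustrative; the error mean square is computed directly from the residuals and used as the estimate of the error variance.

```python
import numpy as np

# Synthetic illustration data (not the values of Table 5.1).
rng = np.random.default_rng(0)
n, k = 17, 2                      # observations, predictor variables
x1 = rng.uniform(40.0, 50.0, n)
x2 = rng.uniform(28.0, 32.0, n)
y = -150.0 + 1.2 * x1 + 12.0 * x2 + rng.normal(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x1, x2])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Error mean square SS_E / (n - (k + 1)), used as the estimate of sigma^2.
ms_e = (e @ e) / (n - (k + 1))

# Eqn. (12): variance-covariance matrix of the estimated coefficients.
C = ms_e * XtX_inv

# Eqn. (13): standard error of each coefficient from the diagonal of C.
se = np.sqrt(np.diag(C))
print("standard errors:", se)
```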

Hypothesis Tests in Multiple Linear Regression
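This section discusses hypothesis tests on the regression coefficients in multiple linear regression. As in the case of simple linear regression, these tests can only be carried out if it can be assumed that the random error terms, $$\epsilon_i $$, are normally and independently distributed with a mean of zero and variance of $$\sigma^2 $$.

Three types of hypothesis tests can be carried out for multiple linear regression models: the test for significance of regression, the test on individual regression coefficients ($$t $$ test), and the test on subsets of regression coefficients (partial $$F $$ test).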

Test for Significance of Regression
The test for significance of regression in the case of multiple linear regression analysis is carried out using the analysis of variance. The test is used to check if a linear statistical relationship exists between the response variable and at least one of the predictor variables. The statements for the hypotheses are:

$$H_0 : \beta_1=\beta_2=\ldots=\beta_k=0 $$
$$H_1 : \beta_j\ne 0 $$ for at least one $$j $$

The test for $$H_0 $$ is carried out using the following statistic:

$$F_0=\frac{MS_R}{MS_E} $$

where $$MS_R $$ is the regression mean square and $$MS_E $$ is the error mean square. If the null hypothesis, $$H_0 $$, is true then the statistic $$F_0 $$ follows the $$F $$ distribution with $$k $$ degrees of freedom in the numerator and $$n-(k+1) $$ degrees of freedom in the denominator. The null hypothesis, $$H_0 $$, is rejected if the calculated statistic, $$F_0 $$, is such that:

$$F_0>f_{\alpha,k,n-(k+1)} $$

Calculation of the Statistic $$F_0 $$
To calculate the statistic $$F_0 $$, the mean squares $$MS_R $$ and $$MS_E $$ must be known. As explained in Chapter 4, the mean squares are obtained by dividing the sum of squares by their degrees of freedom. For example, the total mean square, $$MS_T $$, is obtained as follows:

$$MS_T=\frac{SS_T}{dof(SS_T)} $$ (14)

where $$SS_T $$ is the total sum of squares and $$dof(SS_T) $$ is the number of degrees of freedom associated with $$SS_T $$. In multiple linear regression, the following equation is used to calculate $$SS_T $$:

$$SS_T=y'\left[I-\left(\frac{1}{n}\right)J\right]y $$ (15)

where $$n $$ is the total number of observations, $$y $$ is the vector of observations (that was defined in Chapter 5, Estimating Regression Models Using Least Squares), $$I $$ is the identity matrix of order $$n $$ and $$J $$ represents an $$n\times n $$ square matrix of ones. The number of degrees of freedom associated with $$SS_T $$, $$dof(SS_T) $$, is $$n-1 $$. Knowing $$SS_T $$ and $$dof(SS_T) $$ the total mean square, $$MS_T $$, can be calculated.

The regression mean square, $$MS_R $$, is obtained by dividing the regression sum of squares, $$SS_R $$, by the respective degrees of freedom, $$dof(SS_R) $$, as follows:

$$MS_R=\frac{SS_R}{dof(SS_R)} $$ (16)

The regression sum of squares, $$SS_R $$, is calculated using the following equation:

$$SS_R=y'\left[H-\left(\frac{1}{n}\right)J\right]y $$ (17)

where $$n $$ is the total number of observations, $$y $$ is the vector of observations, $$H $$ is the hat matrix (that was defined in Chapter 5, Estimating Regression Models Using Least Squares) and $$J $$ represents an $$n\times n $$ square matrix of ones. The number of degrees of freedom associated with $$SS_R $$, $$dof(SS_R) $$, is $$k $$, where $$k $$ is the number of predictor variables in the model. Knowing $$SS_R $$ and $$dof(SS_R) $$ the regression mean square, $$MS_R $$, can be calculated.

The error mean square, $$MS_E $$, is obtained by dividing the error sum of squares, $$SS_E $$, by the respective degrees of freedom, $$dof(SS_E) $$, as follows:

$$MS_E=\frac{SS_E}{dof(SS_E)} $$ (18)

The error sum of squares, $$SS_E $$, is calculated using the following equation:

$$SS_E=y'(I-H)y $$ (19)

where $$y $$ is the vector of observations, $$I $$ is the identity matrix of order $$n $$ and $$H $$ is the hat matrix. The number of degrees of freedom associated with $$SS_E $$, $$dof(SS_E) $$, is $$n-(k+1) $$, where $$n $$ is the total number of observations and $$k $$ is the number of predictor variables in the model. Knowing $$SS_E $$ and $$dof(SS_E) $$, the error mean square, $$MS_E $$, can be calculated. The error mean square is an estimate of the variance, $$\sigma^2 $$, of the random error terms, $$\epsilon_i $$.
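The sums of squares of Eqns. (15), (17) and (19) and the resulting $$F_0 $$ statistic can be sketched as follows. The data are synthetic (not Table 5.1) and all variable names are illustrative. The critical value is not computed here; it would be read from an $$F $$ table or a statistics library.

```python
import numpy as np

# Synthetic illustration data (not the values of Table 5.1).
rng = np.random.default_rng(0)
n, k = 17, 2                                   # observations, predictor variables
x1 = rng.uniform(40.0, 50.0, n)
x2 = rng.uniform(28.0, 32.0, n)
y = -150.0 + 1.2 * x1 + 12.0 * x2 + rng.normal(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x1, x2])

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
I = np.eye(n)                                  # identity matrix of order n
J = np.ones((n, n))                            # n x n matrix of ones

ss_t = y @ (I - J / n) @ y                     # Eqn. (15), n - 1 degrees of freedom
ss_r = y @ (H - J / n) @ y                     # Eqn. (17), k degrees of freedom
ss_e = y @ (I - H) @ y                         # Eqn. (19), n - (k + 1) degrees of freedom

ms_r = ss_r / k                                # Eqn. (16)
ms_e = ss_e / (n - (k + 1))                    # Eqn. (18)
f0 = ms_r / ms_e                               # statistic for significance of regression

print(f"SS_T={ss_t:.2f}  SS_R={ss_r:.2f}  SS_E={ss_e:.2f}  F0={f0:.2f}")
```

Because the model contains an intercept, the total sum of squares splits exactly into the regression and error sums of squares, which is a convenient check on the calculation.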

Example 5.2

The test for the significance of regression, for the regression model obtained for the data in Table 5.1, is illustrated in this example. The null hypothesis for the model is:

$$H_0 : \beta_1=\beta_2=0 $$

The statistic to test $$H_0 $$ is:

$$F_0=\frac{MS_R}{MS_E} $$

To calculate $$F_0 $$, first the sums of squares are calculated so that the mean squares can be obtained. Then the mean squares are used to calculate the statistic to carry out the significance test.

The regression sum of squares, $$SS_R $$, can be obtained as:

$$SS_R=y'\left[H-\left(\frac{1}{n}\right)J\right]y $$

The hat matrix, $$H $$, is calculated as follows using the design matrix $$X $$ from Example 5.1:

$$H=X(X'X)^{-1}X' $$

Knowing $$y $$, $$H $$ and $$J $$, the regression sum of squares, $$SS_R $$, can be calculated. The degrees of freedom associated with $$SS_R $$ is $$k $$, which equals two since there are two predictor variables in the data in Table 5.1. Therefore, the regression mean square is:

$$MS_R=\frac{SS_R}{dof(SS_R)}=\frac{SS_R}{2} $$

Similarly, to calculate the error mean square, $$MS_E $$, the error sum of squares, $$SS_E $$, can be obtained as:

$$SS_E=y'(I-H)y $$

The degrees of freedom associated with $$SS_E $$ is $$n-(k+1)=14 $$. Therefore, the error mean square, $$MS_E $$, is:

$$MS_E=\frac{SS_E}{dof(SS_E)}=\frac{SS_E}{14}=30.24 $$

The statistic to test the significance of regression can now be calculated as:

$$F_0=\frac{MS_R}{MS_E}=\frac{MS_R}{30.24} $$

The critical value for this test, corresponding to a significance level of 0.1, is:

$$f_{\alpha,k,n-(k+1)}=f_{0.1,2,14}\approx 2.73 $$

Since $$F_0>f_{0.1,2,14} $$, $$H_0 : \beta_1=\beta_2=0 $$ is rejected and it is concluded that at least one coefficient out of $$\beta_1 $$ and $$\beta_2 $$ is significant. In other words, it is concluded that a regression model exists between yield and either one or both of the factors in Table 5.1. The analysis of variance is summarized in Table 5.2.



Table 5.2: ANOVA table for the significance of regression test in Example 5.2.

Test on Individual Regression Coefficients ($$t $$ Test)
The $$t $$ test is used to check the significance of individual regression coefficients in the multiple linear regression model. Adding a significant variable to a regression model makes the model more effective, while adding an unimportant variable may make the model worse. The hypothesis statements to test the significance of a particular regression coefficient, $$\beta_j $$, are:

$$H_0 : \beta_j=0 $$
$$H_1 : \beta_j\ne 0 $$

The test statistic for this test is based on the $$t $$ distribution (and is similar to the one used in the case of simple linear regression models in Chapter 4):

$$T_0=\frac{\widehat{\beta}_j}{se(\widehat{\beta}_j)} $$ (20)

where the standard error, $$se(\widehat{\beta}_j) $$, is obtained from Eqn. (13). The analyst would fail to reject the null hypothesis if the test statistic, calculated using Eqn. (20), lies in the acceptance region:

$$-t_{\alpha/2,n-(k+1)}<T_0<t_{\alpha/2,n-(k+1)} $$

This test measures the contribution of a variable while the remaining variables are included in the model. For the model $$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3+\epsilon $$, if the test is carried out for $$\beta_1 $$, then the test will check the significance of including the variable $$x_1 $$ in the model that contains $$x_2 $$ and $$x_3 $$ (i.e. the model $$Y=\beta_0+\beta_2 x_2+\beta_3 x_3+\epsilon $$). Hence the test is also referred to as a partial or marginal test. In DOE++, this test is displayed in the Regression Information table.
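The $$t $$ statistic of Eqn. (20) can be computed for every coefficient at once, as in the sketch below. The data are synthetic (not Table 5.1), with 17 observations and two predictor variables so that the error degrees of freedom are 14. The critical value $$t_{0.05,14}=1.761 $$ is hard-coded from a $$t $$ table for a significance of 0.1, and the names `t0` and `significant` are illustrative.

```python
import numpy as np

# Synthetic illustration data (not the values of Table 5.1).
rng = np.random.default_rng(0)
n, k = 17, 2
x1 = rng.uniform(40.0, 50.0, n)
x2 = rng.uniform(28.0, 32.0, n)
y = -150.0 + 1.2 * x1 + 12.0 * x2 + rng.normal(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x1, x2])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

dof_e = n - (k + 1)                        # 14 error degrees of freedom
ms_e = (e @ e) / dof_e                     # estimate of sigma^2
se = np.sqrt(ms_e * np.diag(XtX_inv))      # Eqn. (13)

t0 = beta_hat / se                         # Eqn. (20), one statistic per coefficient

# Two-sided critical value t_{alpha/2, 14} for alpha = 0.1, taken from a t table.
t_crit = 1.761
significant = np.abs(t0) > t_crit          # True where H0: beta_j = 0 is rejected

for name, t, flag in zip(["beta_0", "beta_1", "beta_2"], t0, significant):
    print(f"{name}: T0 = {t:.3f}, significant at 0.1: {flag}")
```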

Example 5.3

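The test to check the significance of the estimated regression coefficients for the data in Table 5.1 is illustrated in this example. The null hypothesis to test the coefficient $$\beta_2 $$ is:

$$H_0 : \beta_2=0 $$

The null hypothesis to test $$\beta_1 $$ can be obtained in a similar manner. To calculate the test statistic, $$T_0 $$, we need to calculate the standard error using Eqn. (13).

In Example 5.2, the value of the error mean square, $$MS_E $$, was obtained as 30.24. The error mean square is an estimate of the variance, $$\sigma^2 $$. Therefore:

$$\widehat{\sigma}^2=MS_E=30.24 $$

The variance-covariance matrix of the estimated regression coefficients is:

$$C=\widehat{\sigma}^2 (X'X)^{-1}=30.24\,(X'X)^{-1} $$

From the diagonal elements of $$C $$, the estimated standard errors for $$\widehat{\beta}_1 $$ and $$\widehat{\beta}_2 $$ are:

$$se(\widehat{\beta}_1)=\sqrt{C_{11}} \qquad se(\widehat{\beta}_2)=\sqrt{C_{22}} $$

The corresponding test statistics for these coefficients are:

$$(T_0)_{\widehat{\beta}_1}=\frac{\widehat{\beta}_1}{se(\widehat{\beta}_1)} \qquad (T_0)_{\widehat{\beta}_2}=\frac{\widehat{\beta}_2}{se(\widehat{\beta}_2)} $$

The critical values for the present $$t $$ test at a significance of 0.1 are:

$$t_{\alpha/2,n-(k+1)}=t_{0.05,14}=1.761 \qquad -t_{\alpha/2,n-(k+1)}=-t_{0.05,14}=-1.761 $$

Considering $$\widehat{\beta}_2 $$, it can be seen that $$(T_0)_{\widehat{\beta}_2} $$ does not lie in the acceptance region of $$-t_{0.05,14}<T_0<t_{0.05,14} $$. The null hypothesis, $$H_0 : \beta_2=0 $$, is rejected and it is concluded that $$\beta_2 $$ is significant at $$\alpha=0.1 $$. This conclusion can also be arrived at using the $$p $$ value, noting that the hypothesis is two-sided. The $$p $$ value corresponding to the test statistic, $$(T_0)_{\widehat{\beta}_2} $$, based on the $$t $$ distribution with 14 degrees of freedom is:

$$p\ value=2\times (1-P(T\le |T_0|)) $$

Since the $$p $$ value is less than the significance, $$\alpha=0.1 $$, it is concluded that $$\beta_2 $$ is significant. The hypothesis test on $$\beta_1 $$ can be carried out in a similar manner.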

As explained in Chapter 4, in DOE++, the information related to the $$t $$ test is displayed in the Regression Information table as shown in Figure 5.13. In this table, the $$t $$ test for $$\beta_2 $$ is displayed in the row for the term Factor 2 because $$\beta_2 $$ is the coefficient that represents this factor in the regression model. Columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the $$t $$ test and the $$p $$ value for the $$t $$ test, respectively. These values have been calculated for $$\beta_2 $$ in this example. The Coefficient column represents the estimate of regression coefficients. These values are calculated using Eqn. (8) as shown in Example 5.1. The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Chapter 7. Columns labeled Low CI and High CI represent the limits of the confidence intervals for the regression coefficients and are explained in Chapter 5, Confidence Interval on Regression Coefficients. The Variance Inflation Factor column displays values that give a measure of multicollinearity. This is explained in Chapter 5, Multicollinearity.



Figure 5.13: Regression results for the data in Table 5.1.

Test on Subsets of Regression Coefficients (Partial F Test)
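The partial $$F $$ test can be considered to be the general form of the $$t $$ test mentioned in the previous section. This is because the test simultaneously checks the significance of including many (or even one) regression coefficients in the multiple linear regression model. Adding a variable to a model increases the regression sum of squares, $$SS_R $$. The test is based on this increase in the regression sum of squares. The increase in the regression sum of squares is called the extra sum of squares.

Assume that the vector of the regression coefficients, $$\beta $$, for the multiple linear regression model, $$y=X\beta+\epsilon $$, is partitioned into two vectors with the second vector, $$\boldsymbol{\beta}_2 $$, containing the last $$r $$ regression coefficients, and the first vector, $$\boldsymbol{\beta}_1 $$, containing the first $$(k+1-r) $$ coefficients as follows (bold symbols denote the sub-vectors, to distinguish them from the scalar coefficients $$\beta_1 $$ and $$\beta_2 $$):

$$\beta=\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} $$

with:

$$\boldsymbol{\beta}_1=[\beta_0,\beta_1,\ldots,\beta_{k-r}]' \qquad \boldsymbol{\beta}_2=[\beta_{k-r+1},\beta_{k-r+2},\ldots,\beta_k]' $$

The hypothesis statements to test the significance of adding the regression coefficients in $$\boldsymbol{\beta}_2 $$ to a model containing the regression coefficients in $$\boldsymbol{\beta}_1 $$ may be written as:

$$H_0 : \boldsymbol{\beta}_2=0 $$
$$H_1 : \boldsymbol{\beta}_2\ne 0 $$

The test statistic for this test follows the $$F $$ distribution and can be calculated as follows:

$$F_0=\frac{SS_R(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1)/r}{MS_E} $$ (21)

where $$SS_R(\boldsymbol{\beta}_2|\boldsymbol{\beta}_1) $$ is the increase in the regression sum of squares when the variables corresponding to the coefficients in $$\boldsymbol{\beta}_2 $$ are added to a model already containing $$\boldsymbol{\beta}_1 $$, and $$MS_E $$ is obtained from Eqn. (18). The value of the extra sum of squares is obtained as explained in the next section.

The null hypothesis, $$H_0 $$, is rejected if $$F_0>f_{\alpha,r,n-(k+1)} $$. Rejection of $$H_0 $$ leads to the conclusion that at least one of the variables in $$x_{k-r+1} $$, $$x_{k-r+2} $$, ..., $$x_k $$ contributes significantly to the regression model. In DOE++, the results from the partial $$F $$ test are displayed in the ANOVA table.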

Types of Extra Sum of Squares
The extra sum of squares can be calculated using either the partial (or adjusted) sum of squares or the sequential sum of squares. The type of extra sum of squares used affects the calculation of the test statistic of Eqn. (21). In DOE++, selection for the type of extra sum of squares is available in the Options tab of the Control Panel as shown in Figure 5.14. The partial sum of squares is used as the default setting. The reason for this is explained in the following section on the partial sum of squares.



Figure 5.14: Selection of the type of extra sum of squares in DOE++.

Partial Sum of Squares
The partial sum of squares for a term is the extra sum of squares when all terms, except the term under consideration, are included in the model. For example, consider the model:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_{12} x_1 x_2+\epsilon $$ (22)

Assume that we need to know the partial sum of squares for $$\beta_2 $$. The partial sum of squares for $$\beta_2 $$ is the increase in the regression sum of squares when $$\beta_2 $$ is added to the model. This increase is the difference in the regression sum of squares for the full model of Eqn. (22) and the model that includes all terms except $$\beta_2 $$. These terms are $$\beta_0 $$, $$\beta_1 $$ and $$\beta_{12} $$. The model that contains these terms is:

$$Y=\beta_0+\beta_1 x_1+\beta_{12} x_1 x_2+\epsilon $$ (23)

The partial sum of squares for $$\beta_2 $$ can be represented as $$SS_R(\beta_2|\beta_0,\beta_1,\beta_{12}) $$ and is calculated as follows:

$$SS_R(\beta_2|\beta_0,\beta_1,\beta_{12})=SS_R(\beta_0,\beta_1,\beta_2,\beta_{12})-SS_R(\beta_0,\beta_1,\beta_{12}) $$

For the present case, the sub-vectors of Eqn. (21) are $$\boldsymbol{\beta}_2=[\beta_2]' $$ and $$\boldsymbol{\beta}_1=[\beta_0,\beta_1,\beta_{12}]' $$. It can be noted that for the partial sum of squares $$\boldsymbol{\beta}_1 $$ contains all coefficients other than the coefficient being tested.

DOE++ has the partial sum of squares as the default selection. This is because the $$t $$ test explained in Chapter 5, Test on Individual Regression Coefficients, is a partial test, i.e. the $$t $$ test on an individual coefficient is carried out by assuming that all the remaining coefficients are included in the model (similar to the way the partial sum of squares is calculated). The results from the $$t $$ test are displayed in the Regression Information table. The results from the partial $$F $$ test are displayed in the ANOVA table. To keep the results in the two tables consistent with each other, the partial sum of squares is used as the default selection for the results displayed in the ANOVA table.

The partial sum of squares for all terms of a model may not add up to the regression sum of squares for the full model when the regression coefficients are correlated. If it is preferred that the extra sum of squares for all terms in the model always add up to the regression sum of squares for the full model then the sequential sum of squares should be used.
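The partial sum of squares can be sketched by refitting the model with one column of the design matrix removed and differencing the regression sums of squares. The example below uses a two-factor model with interaction, as in Eqn. (22), fitted to synthetic data with coded factor levels between -1 and 1 (an assumption made here so that the matrices are well conditioned). The helper `ss_r` and the labels `b1`, `b2`, `b12` are hypothetical names.

```python
import numpy as np

def ss_r(X, y):
    """Regression sum of squares y'[H - (1/n)J]y, as in Eqn. (17)."""
    n = len(y)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return float(y @ (H - np.ones((n, n)) / n) @ y)

# Synthetic illustration data with coded factor levels (not from the chapter).
rng = np.random.default_rng(1)
n = 20
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)
y = 10.0 + 3.0 * x1 + 2.0 * x2 + 1.5 * x1 * x2 + rng.normal(0.0, 1.0, n)

# Columns of the full design matrix for the model of Eqn. (22).
cols = {"b0": np.ones(n), "b1": x1, "b2": x2, "b12": x1 * x2}
X_full = np.column_stack(list(cols.values()))
ss_r_full = ss_r(X_full, y)

# Partial sum of squares: SS_R(full model) - SS_R(model without that term).
partial = {}
for name in ("b1", "b2", "b12"):
    X_reduced = np.column_stack([v for key, v in cols.items() if key != name])
    partial[name] = ss_r_full - ss_r(X_reduced, y)

print("partial sums of squares:", partial)
print("sum of partials:", sum(partial.values()), " full SS_R:", ss_r_full)
```

Comparing the last two printed values shows whether the partial sums of squares add up to the regression sum of squares for a given data set; in general they need not.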

Example 5.4

This example illustrates the partial $$F $$ test using the partial sum of squares. The test is conducted for the coefficient $$\beta_1 $$ corresponding to the predictor variable $$x_1 $$ for the data in Table 5.1.

The regression model used for this data set in Example 5.1 is:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\epsilon $$

The null hypothesis to test the significance of $$\beta_1 $$ is:

$$H_0 : \beta_1=0 $$

The statistic to test this hypothesis is:

$$F_0=\frac{SS_R(\beta_1|\beta_0,\beta_2)/r}{MS_E} $$

where $$SS_R(\beta_1|\beta_0,\beta_2) $$ represents the partial sum of squares for $$\beta_1 $$, $$r $$ represents the number of degrees of freedom for $$SS_R(\beta_1|\beta_0,\beta_2) $$ (which is one because there is just one coefficient, $$\beta_1 $$, being tested) and $$MS_E $$ is the error mean square that can be obtained using Eqn. (18) and has been calculated in Example 5.2 as 30.24.

The partial sum of squares for $$\beta_1 $$ is the difference between the regression sum of squares for the full model, $$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\epsilon $$, and the regression sum of squares for the model excluding $$\beta_1 $$, $$Y=\beta_0+\beta_2 x_2+\epsilon $$. The regression sum of squares for the full model, $$SS_R(\beta_0,\beta_1,\beta_2) $$, can be obtained using Eqn. (17) and has been calculated in Example 5.2.

The regression sum of squares for the model $$Y=\beta_0+\beta_2 x_2+\epsilon $$ is obtained as shown next. First the design matrix for this model, $$X_{\beta_0,\beta_2} $$, is obtained by dropping the second column in the design matrix of the full model, $$X $$ (the full design matrix, $$X $$, was obtained in Example 5.1). The second column of $$X $$ corresponds to the coefficient $$\beta_1 $$ which is no longer in the model.

The hat matrix corresponding to this design matrix is $$H_{\beta_0,\beta_2} $$. It can be calculated using $$H_{\beta_0,\beta_2}=X_{\beta_0,\beta_2}(X_{\beta_0,\beta_2}'X_{\beta_0,\beta_2})^{-1}X_{\beta_0,\beta_2}' $$. Once $$H_{\beta_0,\beta_2} $$ is known, the regression sum of squares for the model $$Y=\beta_0+\beta_2 x_2+\epsilon $$ can be calculated using Eqn. (17) as:

$$SS_R(\beta_0,\beta_2)=y'\left[H_{\beta_0,\beta_2}-\left(\frac{1}{n}\right)J\right]y $$

Therefore, the partial sum of squares for $$\beta_1 $$ is:

$$SS_R(\beta_1|\beta_0,\beta_2)=SS_R(\beta_0,\beta_1,\beta_2)-SS_R(\beta_0,\beta_2) $$

Knowing the partial sum of squares, the statistic to test the significance of $$\beta_1 $$ is:

$$F_0=\frac{SS_R(\beta_1|\beta_0,\beta_2)/1}{30.24} $$

The $$p $$ value corresponding to this statistic based on the $$F $$ distribution with 1 degree of freedom in the numerator and 14 degrees of freedom in the denominator is:

$$p\ value=1-P(F\le F_0) $$

Assuming that the desired significance is 0.1, since $$p $$ value < 0.1, $$H_0 : \beta_1=0 $$ is rejected and it can be concluded that $$\beta_1 $$ is significant. The test for $$\beta_2 $$ can be carried out in a similar manner. In the results obtained from DOE++, the calculations for this test are displayed in the ANOVA table as shown in Figure 5.15. Note that the conclusion obtained in this example can also be obtained using the $$t $$ test as explained in Example 5.3 in Chapter 5, Test on Individual Regression Coefficients. The ANOVA and Regression Information tables in DOE++ represent two different ways to test for the significance of the variables included in the multiple linear regression model.



Figure 5.15: ANOVA results for the data in Table 5.1.

Sequential Sum of Squares
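The sequential sum of squares for a coefficient is the extra sum of squares when coefficients are added to the model in a sequence. For example, consider the model:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_{12} x_1 x_2+\beta_3 x_3+\beta_{13} x_1 x_3+\beta_{23} x_2 x_3+\beta_{123} x_1 x_2 x_3+\epsilon $$ (24)

The sequential sum of squares for $$\beta_{13} $$ is the increase in the sum of squares when $$\beta_{13} $$ is added to the model observing the sequence of Eqn. (24). Therefore this extra sum of squares can be obtained by taking the difference between the regression sum of squares for the model after $$\beta_{13} $$ was added and the regression sum of squares for the model before $$\beta_{13} $$ was added to the model. The model after $$\beta_{13} $$ is added is as follows:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_{12} x_1 x_2+\beta_3 x_3+\beta_{13} x_1 x_3+\epsilon $$ (25)

This is because to maintain the sequence of Eqn. (24) all coefficients preceding $$\beta_{13} $$ must be included in the model. These are the coefficients $$\beta_0 $$, $$\beta_1 $$, $$\beta_2 $$, $$\beta_{12} $$ and $$\beta_3 $$.

Similarly the model before $$\beta_{13} $$ is added must contain all coefficients of Eqn. (25) except $$\beta_{13} $$. This model can be obtained as follows:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_{12} x_1 x_2+\beta_3 x_3+\epsilon $$ (26)

The sequential sum of squares for $$\beta_{13} $$ can be calculated as follows:

$$SS_R(\beta_{13}|\beta_0,\beta_1,\beta_2,\beta_{12},\beta_3)=SS_R(\beta_0,\beta_1,\beta_2,\beta_{12},\beta_3,\beta_{13})-SS_R(\beta_0,\beta_1,\beta_2,\beta_{12},\beta_3) $$

For the present case, the sub-vectors of Eqn. (21) are $$\boldsymbol{\beta}_2=[\beta_{13}]' $$ and $$\boldsymbol{\beta}_1=[\beta_0,\beta_1,\beta_2,\beta_{12},\beta_3]' $$. It can be noted that for the sequential sum of squares $$\boldsymbol{\beta}_1 $$ contains all coefficients preceding the coefficient being tested.

The sequential sums of squares for all terms will add up to the regression sum of squares for the full model, but the sequential sums of squares are order dependent.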
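The sequential calculation can be sketched by adding one column at a time to the design matrix and recording the increase in the regression sum of squares. For brevity the sketch uses a smaller model than Eqn. (24), namely two factors and their interaction, fitted to synthetic data with coded levels. The helper `ss_r` and the term labels are hypothetical names.

```python
import numpy as np

def ss_r(X, y):
    """Regression sum of squares y'[H - (1/n)J]y, as in Eqn. (17)."""
    n = len(y)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return float(y @ (H - np.ones((n, n)) / n) @ y)

# Synthetic illustration data with coded factor levels (not from the chapter).
rng = np.random.default_rng(1)
n = 20
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)
y = 10.0 + 3.0 * x1 + 2.0 * x2 + 1.5 * x1 * x2 + rng.normal(0.0, 1.0, n)

# Terms in the order in which they enter the model.
terms = [("b1", x1), ("b2", x2), ("b12", x1 * x2)]

X_current = np.ones((n, 1))          # intercept-only model; its SS_R is zero
ss_prev = ss_r(X_current, y)
sequential = {}
for name, column in terms:
    X_current = np.column_stack([X_current, column])
    ss_now = ss_r(X_current, y)
    sequential[name] = ss_now - ss_prev   # increase due to the newly added term
    ss_prev = ss_now

ss_r_full = ss_prev                  # after the last term the model is the full model
print("sequential sums of squares:", sequential)
print("sum:", sum(sequential.values()), " full SS_R:", ss_r_full)
```

Reordering the entries of `terms` generally changes the individual values while leaving their total unchanged, which illustrates the order dependence noted above.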

Example 5.5

This example illustrates the partial $$F $$ test using the sequential sum of squares. The test is conducted for the coefficient $$\beta_1 $$ corresponding to the predictor variable $$x_1 $$ for the data in Table 5.1. The regression model used for this data set in Example 5.1 is:

$$Y=\beta_0+\beta_1 x_1+\beta_2 x_2+\epsilon $$

The null hypothesis to test the significance of $$\beta_1 $$ is:

$$H_0 : \beta_1=0 $$

The statistic to test this hypothesis is:

$$F_0=\frac{SS_R(\beta_1|\beta_0)/r}{MS_E} $$

where $$SS_R(\beta_1|\beta_0) $$ represents the sequential sum of squares for $$\beta_1 $$, $$r $$ represents the number of degrees of freedom for $$SS_R(\beta_1|\beta_0) $$ (which is one because there is just one coefficient, $$\beta_1 $$, being tested) and $$MS_E $$ is the error mean square that can be obtained using Eqn. (18) and has been calculated in Example 5.2 as 30.24.

The sequential sum of squares for $$\beta_1 $$ is the difference between the regression sum of squares for the model after adding $$\beta_1 $$, $$Y=\beta_0+\beta_1 x_1+\epsilon $$, and the regression sum of squares for the model before adding $$\beta_1 $$, $$Y=\beta_0+\epsilon $$.

The regression sum of squares for the model $$Y=\beta_0+\beta_1 x_1+\epsilon $$ is obtained as shown next. First the design matrix for this model, $$X_{\beta_0,\beta_1} $$, is obtained by dropping the third column in the design matrix for the full model, $$X $$ (the full design matrix, $$X $$, was obtained in Example 5.1). The third column of $$X $$ corresponds to coefficient $$\beta_2 $$ which is no longer used in the present model.

The hat matrix corresponding to this design matrix is $$H_{\beta_0,\beta_1} $$. It can be calculated using $$H_{\beta_0,\beta_1}=X_{\beta_0,\beta_1}(X_{\beta_0,\beta_1}'X_{\beta_0,\beta_1})^{-1}X_{\beta_0,\beta_1}' $$. Once $$H_{\beta_0,\beta_1} $$ is known, the regression sum of squares for the model $$Y=\beta_0+\beta_1 x_1+\epsilon $$ can be calculated using Eqn. (17) as:

$$SS_R(\beta_0,\beta_1)=y'\left[H_{\beta_0,\beta_1}-\left(\frac{1}{n}\right)J\right]y $$

The regression sum of squares for the model $$Y=\beta_0+\epsilon $$ is equal to zero since this model does not contain any variables. Therefore:

$$SS_R(\beta_0)=0 $$

The sequential sum of squares for $$\beta_1 $$ is:

$$SS_R(\beta_1|\beta_0)=SS_R(\beta_0,\beta_1)-SS_R(\beta_0)=SS_R(\beta_0,\beta_1) $$

Knowing the sequential sum of squares, the statistic to test the significance of $$\beta_1 $$ is:

$$F_0=\frac{SS_R(\beta_1|\beta_0)/1}{30.24} $$

The $$p $$ value corresponding to this statistic based on the $$F $$ distribution with 1 degree of freedom in the numerator and 14 degrees of freedom in the denominator is:

$$p\ value=1-P(F\le F_0) $$

Assuming that the desired significance is 0.1, since $$p $$ value < 0.1, $$H_0 : \beta_1=0 $$ is rejected and it can be concluded that $$\beta_1 $$ is significant. The test for $$\beta_2 $$ can be carried out in a similar manner. This result is shown in Figure 5.16.



Figure 5.16: Sequential sum of squares for the data in Table 5.1.

Confidence Intervals in Multiple Linear Regression
Calculation of confidence intervals for multiple linear regression models is similar to that for simple linear regression models explained in Chapter 4, Simple Linear Regression Analysis.

Confidence Interval on Regression Coefficients
A 100($$1-\alpha $$) percent confidence interval on the regression coefficient, $$\beta_j $$, is obtained as follows:

$$\widehat{\beta}_j\pm t_{\alpha/2,n-(k+1)}\sqrt{C_{jj}} $$ (27)

The confidence intervals on the regression coefficients are displayed in the Regression Information table under the Low CI and High CI columns as shown in Figure 5.13.

Confidence Interval on Fitted Values, $$\widehat{y}_i $$
A 100($$1-\alpha $$) percent confidence interval on any fitted value, $$\widehat{y}_i $$, is given by:

$$\widehat{y}_i\pm t_{\alpha/2,n-(k+1)}\sqrt{\widehat{\sigma}^2 x_i'(X'X)^{-1}x_i} $$ (28)

where:

$$x_i=[1, x_{i1}, \ldots, x_{ik}]' $$

In Example 5.1 (Chapter 5, Estimating Regression Models Using Least Squares), the fitted value corresponding to the fifth observation, $$\widehat{y}_5 $$, was calculated. The 90% confidence interval on this value can be obtained as shown in Figure 5.17. The values of 47.3 and 29.9 used in the figure are the values of the predictor variables corresponding to the fifth observation in Table 5.1.



Figure 5.17: Confidence interval for the fitted value corresponding to the fifth observation in Table 5.1.

Confidence Interval on New Observations
As explained in Chapter 4, Simple Linear Regression Analysis, the confidence interval on a new observation is also referred to as the prediction interval. The prediction interval takes into account both the error from the fitted model and the error associated with future observations. A 100($$1-\alpha $$) percent confidence interval on a new observation, $$\widehat{y}_p $$, is obtained as follows:

$$\widehat{y}_p\pm t_{\alpha/2,n-(k+1)}\sqrt{\widehat{\sigma}^2 (1+x_p'(X'X)^{-1}x_p)} $$ (29)

where:

$$x_p=[1, x_{p1}, \ldots, x_{pk}]' $$

$$x_{p1} $$, ..., $$x_{pk} $$ are the levels of the predictor variables at which the new observation, $$\widehat{y}_p $$, needs to be obtained.

In multiple linear regression, prediction intervals should only be obtained at the levels of the predictor variables where the regression model applies. In the case of multiple linear regression it is easy to miss this. Having values lying within the range of the predictor variables does not necessarily mean that the new observation lies in the region to which the model is applicable. For example, consider Figure 5.18 where the shaded area shows the region to which a two variable regression model is applicable. The point corresponding to the $$p $$th level of the first predictor variable, $$x_{p1} $$, and the $$p $$th level of the second predictor variable, $$x_{p2} $$, does not lie in the shaded area, although both of these levels are within the range of the first and second predictor variables respectively. In this case, the regression model is not applicable at this point.



Figure 5.18: Predicted values and region of model application in multiple linear regression.
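The three kinds of intervals discussed in this section differ only in the quantity placed under the square root, as the following sketch shows. The data are synthetic (not Table 5.1), with 17 observations and two predictor variables, and the critical value $$t_{0.05,14}=1.761 $$ is hard-coded from a $$t $$ table for 90% intervals. The helper `interval` and its `new_observation` flag are hypothetical names.

```python
import numpy as np

# Synthetic illustration data (not the values of Table 5.1).
rng = np.random.default_rng(0)
n, k = 17, 2
x1 = rng.uniform(40.0, 50.0, n)
x2 = rng.uniform(28.0, 32.0, n)
y = -150.0 + 1.2 * x1 + 12.0 * x2 + rng.normal(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x1, x2])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
ms_e = (e @ e) / (n - (k + 1))        # estimate of sigma^2
t_crit = 1.761                        # t_{0.05,14} from a t table (90% intervals)

# Eqn. (27): interval on each regression coefficient.
se = np.sqrt(ms_e * np.diag(XtX_inv))
coef_low, coef_high = beta_hat - t_crit * se, beta_hat + t_crit * se

def interval(x_row, new_observation):
    """Interval at x_row = [1, x1, x2]; a new observation adds 1 under the root."""
    x_row = np.asarray(x_row, dtype=float)
    centre = x_row @ beta_hat
    leverage = x_row @ XtX_inv @ x_row
    extra = 1.0 if new_observation else 0.0
    half_width = t_crit * np.sqrt(ms_e * (extra + leverage))
    return centre - half_width, centre + half_width

point = [1.0, 47.0, 31.0]                                  # 47 units of x1, 31 units of x2
conf_int = interval(point, new_observation=False)          # Eqn. (28)
pred_int = interval(point, new_observation=True)           # prediction interval
print("confidence interval:", conf_int)
print("prediction interval:", pred_int)
```

The prediction interval is always wider than the confidence interval at the same point because it also accounts for the error of the future observation.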

Measures of Model Adequacy
As in the case of simple linear regression, analysis of a fitted multiple linear regression model is important before inferences based on the model are undertaken. This section presents some techniques that can be used to check the appropriateness of the multiple linear regression model.

Coefficient of Multiple Determination, $$R^2 $$
The coefficient of multiple determination is similar to the coefficient of determination used in the case of simple linear regression. It is defined as: (30)

indicates the amount of total variability explained by the regression model. The positive square root of is called the multiple correlation coefficient and measures the linear association between  and the predictor variables,, ....

The value of $$R^2$$ increases as more terms are added to the model, even if the new term does not contribute significantly to the model. An increase in the value of $$R^2$$ therefore cannot be taken as a sign that the new model is superior to the older model. A better statistic to use is the adjusted $$R^2$$ statistic, defined as follows:

$$R^2_{adj} = 1 - \frac{MS_E}{MS_T} = 1 - \frac{SS_E/(n-(k+1))}{SS_T/(n-1)}$$ (31)

The adjusted $$R^2$$ only increases when significant terms are added to the model. Addition of unimportant terms may lead to a decrease in the value of $$R^2_{adj}$$.

In DOE++, $$R^2$$ and $$R^2_{adj}$$ values are displayed as R-sq and R-sq(adj), respectively. Other values displayed along with these values are S, PRESS and R-sq(pred). As explained in Chapter 4, the value of S is the square root of the error mean square, $$MS_E$$, and represents the "standard error of the model."

PRESS is an abbreviation for prediction error sum of squares. It is the error sum of squares calculated using the PRESS residuals in place of the residuals, $$e_i$$, in Eqn. (19). The PRESS residual, $$e_{(i)}$$, for a particular observation, $$y_i$$, is obtained by fitting the regression model to the remaining $$n-1$$ observations. Then the value for a new observation, $$\hat{y}_p$$, corresponding to the observation in question, $$y_i$$, is obtained based on the new regression model. The difference between $$y_i$$ and $$\hat{y}_p$$ gives $$e_{(i)}$$. The PRESS residual, $$e_{(i)}$$, can also be obtained using $$h_{ii}$$, the $$i$$th diagonal element of the hat matrix, $$H$$, as follows:

$$e_{(i)} = \frac{e_i}{1 - h_{ii}}$$ (32)

R-sq(pred), also referred to as prediction $$R^2$$, is obtained using PRESS as shown next:

$$R^2_{pred} = 1 - \frac{PRESS}{SS_T}$$ (33)

The values of R-sq, R-sq(adj) and S are indicators of how well the regression model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. For example, higher values of PRESS or lower values of R-sq(pred) indicate a model that predicts poorly. Figure 5.19 shows these values for the data in Table 5.1. The values indicate that the regression model fits the data well and also predicts well.
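As a rough illustration of how these five quantities relate to one another, the following sketch computes them from the hat matrix for a small data set. The function name `fit_summary` and the data are hypothetical (they are not the data of Table 5.1), and the code is not intended to reproduce the DOE++ output format.

```python
import numpy as np

def fit_summary(X, y):
    """Return R-sq, R-sq(adj), S, PRESS and R-sq(pred) for a linear model
    with an intercept and the k predictor columns of X."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])          # design matrix
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T       # hat matrix
    e = y - H @ y                                  # ordinary residuals e_i
    ss_e = e @ e                                   # error sum of squares
    ss_t = np.sum((y - y.mean()) ** 2)             # total sum of squares
    ms_e = ss_e / (n - (k + 1))
    press_resid = e / (1.0 - np.diag(H))           # PRESS residuals e_(i)
    press = press_resid @ press_resid
    return {
        "R-sq": 1.0 - ss_e / ss_t,
        "R-sq(adj)": 1.0 - ms_e / (ss_t / (n - 1)),
        "S": np.sqrt(ms_e),
        "PRESS": press,
        "R-sq(pred)": 1.0 - press / ss_t,
    }

# Hypothetical data: 8 observations, 2 predictor variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([3.2, 2.9, 7.1, 6.8, 11.3, 10.9, 15.2, 14.7])

summary = fit_summary(X, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Because each PRESS residual is the ordinary residual divided by $$1-h_{ii}$$, PRESS is never smaller than the error sum of squares, so R-sq(pred) is never larger than R-sq.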



Figure 5.19: Coefficient of multiple determination and related results for the data in Table 5.1.

Residual Analysis
Plots of residuals, $$e_i$$, similar to the ones discussed in the previous chapter for simple linear regression, are used to check the adequacy of a fitted multiple linear regression model. The residuals are expected to be normally distributed with a mean of zero and a constant variance of $$\sigma^2$$. In addition, they should not show any patterns or trends when plotted against any variable or in a time or run-order sequence. Residual plots may also be obtained using standardized and studentized residuals. Standardized residuals, $$d_i$$, are obtained using the following equation:

$$d_i = \frac{e_i}{\sqrt{\hat{\sigma}^2}} = \frac{e_i}{\sqrt{MS_E}}$$ (34)

Standardized residuals are scaled so that the standard deviation of the residuals is approximately equal to one. This helps to identify possible outliers or unusual observations. However, standardized residuals may understate the true residual magnitude, hence studentized residuals, $$r_i$$, are used in their place. Studentized residuals are calculated as follows:

$$r_i = \frac{e_i}{\sqrt{\hat{\sigma}^2 (1 - h_{ii})}} = \frac{e_i}{\sqrt{MS_E (1 - h_{ii})}}$$ (35)

where $$h_{ii}$$ is the $$i$$th diagonal element of the hat matrix, $$H$$. External studentized (or studentized deleted) residuals may also be used. These residuals are based on the PRESS residuals mentioned above in the Coefficient of Multiple Determination, $$R^2$$ section. The reason for using the external studentized residuals is that if the $$i$$th observation is an outlier, it may influence the fitted model. In this case, the residual $$e_i$$ will be small and may not disclose that the $$i$$th observation is an outlier. The external studentized residual for the $$i$$th observation, $$t_i$$, is obtained as follows:

$$t_i = e_i \left[ \frac{n - k - 2}{SS_E (1 - h_{ii}) - e_i^2} \right]^{0.5}$$ (36)

Residual values for the data of Table 5.1 are shown in Figure 5.20. These values are available using the Diagnostics icon in the Control Panel. Standardized residual plots for the data are shown in Figures 5.21 to 5.23. DOE++ compares the residual values to the critical values on the $$t$$ distribution for studentized and external studentized residuals. For other residuals the normal distribution is used. For example, for the data in Table 5.1, the critical values on the $$t$$ distribution at a significance of 0.1 are $$t_{0.05,\,n-(k+1)}$$ and $$-t_{0.05,\,n-(k+1)}$$ (as calculated in Example 5.3, Chapter 5, Test on Individual Regression Coefficients). The studentized residual values corresponding to the 3rd and 17th observations lie outside the critical values. Therefore, the 3rd and 17th observations are outliers. This can also be seen on the residual plots in Figures 5.22 and 5.23.
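The three scaled residuals discussed in this section can be computed together from the hat matrix. The sketch below assumes $$k$$ predictor variables plus an intercept; the function name `scaled_residuals` and the data are hypothetical and are not the values of Table 5.1.

```python
import numpy as np

def scaled_residuals(X, y):
    """Return standardized (d), studentized (r) and external studentized (t)
    residuals for a linear model with an intercept and k predictors."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    h = np.diag(H)                                 # leverage values h_ii
    e = y - H @ y                                  # ordinary residuals
    ss_e = e @ e
    ms_e = ss_e / (n - (k + 1))
    d = e / np.sqrt(ms_e)                          # standardized residuals
    r = e / np.sqrt(ms_e * (1.0 - h))              # studentized residuals
    t = e * np.sqrt((n - k - 2) / (ss_e * (1.0 - h) - e ** 2))  # external
    return d, r, t

# Hypothetical data: 8 observations, 2 predictor variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([3.2, 2.9, 7.1, 6.8, 11.3, 10.9, 15.2, 14.7])

d, r, t = scaled_residuals(X, y)
n, k = X.shape
for i in range(n):
    print(f"obs {i + 1}: d = {d[i]: .3f}, r = {r[i]: .3f}, t = {t[i]: .3f}")
```

Since $$1-h_{ii}$$ never exceeds one, each studentized residual is at least as large in magnitude as the corresponding standardized residual, which is the sense in which standardized residuals can understate the residual magnitude.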



Figure 5.20: Residual values for the data in Table 5.1.

Figure 5.21: Residual probability plot for the data in Table 5.1.

Figure 5.22: Residual versus fitted values plot for the data in Table 5.1.

Figure 5.23: Residual versus run order plot for the data in Table 5.1.

Outlying $$x $$ Observations
Residuals help to identify outlying $$y$$ observations. Outlying $$x$$ observations can be detected using leverage. Leverage values are the diagonal elements of the hat matrix, $$h_{ii}$$. The $$h_{ii}$$ values always lie between 0 and 1. Values of $$h_{ii}$$ greater than $$2(k+1)/n$$ are considered to be indicators of outlying $$x$$ observations. [Note]
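A small sketch of the leverage check is shown next. The data are hypothetical and were chosen so that the last observation lies far from the others in the space of the predictor variables; the function name `leverage` is an assumption for this example.

```python
import numpy as np

def leverage(X):
    """Diagonal elements h_ii of the hat matrix for a model with an
    intercept and the k predictor columns of X."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    return np.diag(H)

# Hypothetical predictor levels; the 8th row is remote from the rest
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 7.0], [20.0, 1.0]])

h = leverage(X)
n, k = X.shape
threshold = 2 * (k + 1) / n                 # 2(k+1)/n rule from the text
flagged = np.flatnonzero(h > threshold)     # zero-based indices

print("leverage values:", np.round(h, 3))
print("threshold:", threshold)
print("flagged observations (1-based):", (flagged + 1).tolist())
```

For these values only the eighth observation exceeds the threshold of 0.75. Note that the leverage values depend only on the predictor levels, not on the response, and that they sum to $$k+1$$.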

Influential Observations Detection
Once an outlier is identified, it is important to determine if the outlier has a significant effect on the regression model. One measure to detect influential observations is Cook's distance measure, which is computed as follows:

$$D_i = \frac{r_i^2}{k+1} \cdot \frac{h_{ii}}{1 - h_{ii}}$$ (37)

To use Cook's distance measure, the $$D_i$$ values are compared to percentile values on the $$F$$ distribution with $$(k+1,\ n-(k+1))$$ degrees of freedom. If the percentile value is less than 10 or 20 percent, then the $$i$$th case has little influence on the fitted values. However, if the percentile value is close to 50 percent or greater, the $$i$$th case is influential, and fitted values with and without the $$i$$th case will differ substantially. [10]
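Cook's distance measure can be sketched from the studentized residuals and leverage values as follows. The function name `cooks_distance` and the data are hypothetical. The comparison against the $$F$$ percentile is left to a table or a statistics library and is not performed here.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_i for each observation of a linear model with an
    intercept and k predictors, using studentized residuals and leverage."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    h = np.diag(H)
    e = y - H @ y
    ms_e = (e @ e) / (n - (k + 1))
    r = e / np.sqrt(ms_e * (1.0 - h))          # studentized residuals
    return (r ** 2 / (k + 1)) * (h / (1.0 - h))

# Hypothetical data: 8 observations, 2 predictor variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([3.2, 2.9, 7.1, 6.8, 11.3, 10.9, 15.2, 14.7])

D = cooks_distance(X, y)
print(np.round(D, 4))
```

The same values are obtained by refitting the model without the $$i$$th observation and measuring how far the vector of fitted values moves, scaled by $$(k+1) MS_E$$. The formula above avoids the $$n$$ separate refits.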

Example 5.6

Cook's distance measure can be calculated as shown next. The distance measure is calculated for the first observation of the data in Table 5.1. The remaining values along with the leverage values are shown in Figure 5.24.

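The studentized residual corresponding to the first observation is obtained from Eqn. (35) as:

$$r_1 = \frac{e_1}{\sqrt{MS_E (1 - h_{11})}}$$

Cook's distance measure for the first observation can now be calculated as:

$$D_1 = \frac{r_1^2}{k+1} \cdot \frac{h_{11}}{1 - h_{11}}$$

The 50th percentile value for the $$F$$ distribution with $$(k+1,\ n-(k+1))$$ degrees of freedom is 0.83. Since all $$D_i$$ values are less than this value there are no influential observations.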



Figure 5.24: Leverage and Cook's distance measure for the data in Table 5.1.

Lack-of-Fit Test
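The lack-of-fit test for simple linear regression discussed in Chapter 4 may also be applied to multiple linear regression to check the appropriateness of the fitted response surface and see if a higher order model is required. Data for $$m$$ replicates may be collected as follows for all $$n$$ levels of the predictor variables (so that the total number of observations is $$nm$$):

$$y_{11}, y_{12}, \ldots, y_{1m}$$: $$m$$ repeated observations at the first level

$$y_{21}, y_{22}, \ldots, y_{2m}$$: $$m$$ repeated observations at the second level

...

$$y_{n1}, y_{n2}, \ldots, y_{nm}$$: $$m$$ repeated observations at the $$n$$th level

The sum of squares due to pure error, $$SS_{PE}$$, can be obtained as discussed in the previous chapter as:

$$SS_{PE} = \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2$$

where $$\bar{y}_i$$ is the mean of the replicates at the $$i$$th level. The number of degrees of freedom associated with $$SS_{PE}$$ is:

$$dof(SS_{PE}) = nm - n$$

Knowing $$SS_{PE}$$, the sum of squares due to lack-of-fit, $$SS_{LOF}$$, can be obtained as: [Note]

$$SS_{LOF} = SS_E - SS_{PE}$$

The number of degrees of freedom associated with $$SS_{LOF}$$ is:

$$dof(SS_{LOF}) = dof(SS_E) - dof(SS_{PE}) = [nm - (k+1)] - (nm - n) = n - (k+1)$$

The test statistic for the lack-of-fit test is:

$$F_0 = \frac{SS_{LOF}/dof(SS_{LOF})}{SS_{PE}/dof(SS_{PE})} = \frac{MS_{LOF}}{MS_{PE}}$$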
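The lack-of-fit calculation can be sketched as follows, assuming the same number of replicates, $$m$$, at each of $$n$$ distinct levels of the predictor variables. The function name `lack_of_fit` and the data are hypothetical. Comparing the statistic with a critical value on the $$F$$ distribution is left to a table or a statistics library.

```python
import numpy as np

def lack_of_fit(levels, Y):
    """Lack-of-fit decomposition for a first order model with an intercept.

    levels : (n, k) array, one row per distinct setting of the predictors
    Y      : (n, m) array, the m replicate responses at each setting
    """
    n, k = levels.shape
    m = Y.shape[1]
    # Pure error: variation of the replicates about their own level mean
    ss_pe = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2)
    dof_pe = n * m - n
    # Error sum of squares from fitting the model to all n*m observations
    Xd = np.column_stack([np.ones(n * m), np.repeat(levels, m, axis=0)])
    y = Y.ravel()                                # same ordering as np.repeat
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    ss_e = np.sum((y - Xd @ beta) ** 2)
    # Lack-of-fit is what remains of the error after removing pure error
    ss_lof = ss_e - ss_pe
    dof_lof = n - (k + 1)
    f0 = (ss_lof / dof_lof) / (ss_pe / dof_pe)
    return ss_pe, ss_lof, dof_pe, dof_lof, f0

# Hypothetical data: n = 5 settings of k = 2 predictors, m = 2 replicates each
levels = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
Y = np.array([[2.1, 1.9], [3.0, 3.2], [2.9, 3.1], [4.2, 3.8], [6.5, 6.1]])

ss_pe, ss_lof, dof_pe, dof_lof, f0 = lack_of_fit(levels, Y)
print(ss_pe, ss_lof, dof_pe, dof_lof, f0)
```

The test requires more distinct settings than model parameters ($$n > k+1$$). Otherwise no degrees of freedom remain for lack-of-fit.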

Polynomial Regression Models
Polynomial regression models are used when the response is curvilinear. The equation shown next presents a second order polynomial regression model with one predictor variable:

$$Y = \beta_0 + \beta_1 x_1 + \beta_{11} x_1^2 + \epsilon$$

Usually, coded values are used in these models. Values of the variables are coded by centering (expressing the levels of the variable as deviations from the mean value of the variable) and then scaling (dividing the deviations obtained by half of the range of the variable):

$$\text{coded value} = \frac{\text{actual value} - \text{mean}}{\text{half of range}}$$ (38)

The reason for using coded predictor variables is that many times $$x_1$$ and $$x_1^2$$ are highly correlated and, if uncoded values are used, there may be computational difficulties while calculating the $$(X'X)^{-1}$$ matrix to obtain the estimates, $$\hat{\beta}$$, of the regression coefficients using Eqn. (8).
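The effect of coding on the correlation between a variable and its square can be illustrated with a few lines of NumPy. The function name `code_variable` and the levels used are hypothetical.

```python
import numpy as np

def code_variable(x):
    """Code a variable: center on its mean, then divide by half of its range."""
    x = np.asarray(x, dtype=float)
    half_range = (x.max() - x.min()) / 2.0
    return (x - x.mean()) / half_range

# Hypothetical levels of a single predictor variable
x = np.array([100.0, 125.0, 150.0, 175.0, 200.0])
z = code_variable(x)                               # coded levels

corr_raw = np.corrcoef(x, x ** 2)[0, 1]            # uncoded x versus x^2
corr_coded = np.corrcoef(z, z ** 2)[0, 1]          # coded z versus z^2

print("coded levels:", z)
print("correlation, uncoded:", corr_raw)
print("correlation, coded:", corr_coded)
```

For these levels the uncoded correlation is above 0.99 while the coded correlation is zero. The correlation vanishes completely here only because the chosen levels are symmetric about their mean. In general, coding reduces the correlation without necessarily eliminating it.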

Qualitative Factors
The multiple linear regression model also supports the use of qualitative factors. [Note] For example, gender may need to be included as a factor in a regression model. One of the ways to include qualitative factors in a regression model is to employ indicator variables. Indicator variables take on values of 0 or 1. For example, an indicator variable may be used with a value of 1 to indicate female and a value of 0 to indicate male.

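In general, $$n-1$$ indicator variables are required to represent a qualitative factor with $$n$$ levels. As an example, a qualitative factor representing three types of machines may be represented as follows using two indicator variables:

$$x_1 = 1,\ x_2 = 0$$: machine type I
$$x_1 = 0,\ x_2 = 1$$: machine type II
$$x_1 = 0,\ x_2 = 0$$: machine type III

An alternative coding scheme for this example is to use a value of -1 for all indicator variables when representing the last level of the factor:

$$x_1 = 1,\ x_2 = 0$$: machine type I
$$x_1 = 0,\ x_2 = 1$$: machine type II
$$x_1 = -1,\ x_2 = -1$$: machine type III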

Indicator variables are also referred to as dummy variables or binary variables.
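Both coding schemes can be expressed in a short helper. The function name `indicator_coding` and the level labels are hypothetical and serve only to illustrate the idea.

```python
def indicator_coding(level, levels, last_as_minus_one=False):
    """Return the n-1 indicator variable values for one level of a
    qualitative factor with n levels.

    level             : the level to encode
    levels            : ordered list of all levels of the factor
    last_as_minus_one : if True, code the last level as -1 on every
                        indicator variable instead of 0
    """
    idx = levels.index(level)
    n_indicators = len(levels) - 1
    if idx == n_indicators:                      # the last level of the factor
        fill = -1 if last_as_minus_one else 0
        return [fill] * n_indicators
    return [1 if j == idx else 0 for j in range(n_indicators)]

machines = ["I", "II", "III"]                    # three machine types
for machine in machines:
    print(machine,
          indicator_coding(machine, machines),
          indicator_coding(machine, machines, last_as_minus_one=True))
```

Only $$n-1$$ indicator variables are generated because a full set of $$n$$ indicators would sum to the intercept column of the design matrix, making $$X'X$$ singular.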

Example 5.7

Consider data from two types of reactors of a chemical process shown in Table 5.3 where the yield values are recorded for various levels of factor $$x_1$$. Assuming there are no interactions between the reactor type and $$x_1$$, a regression model can be fitted to this data as shown next.

Since the reactor type is a qualitative factor with two levels, it can be represented by using one indicator variable. Let $$x_2$$ be the indicator variable representing the reactor type, with 0 representing the first type of reactor and 1 representing the second type of reactor.

Data entry in DOE++ for this example is shown in Figure 5.25. The regression model for this data is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$



Table 5.3: Yield data from two types of reactors for a chemical process.



Figure 5.25: Data from Table 5.3 as entered in DOE++.

The $$X$$ and $$y$$ matrices for the given data are:

The estimated regression coefficients for the model can be obtained using Eqn. (8) as:

Therefore, the fitted regression model is:

Note that since $$x_2$$ represents a qualitative predictor variable, the fitted regression model cannot be plotted simultaneously against $$x_1$$ and $$x_2$$ in a two dimensional space (because the resulting surface plot will be meaningless for the dimension in $$x_2$$). To illustrate this, a scatter plot of the data in Table 5.3 against $$x_2$$ is shown in Figure 5.26. It can be noted that, in the case of qualitative factors, the nature of the relationship between the response (yield) and the qualitative factor (reactor type) cannot be categorized as linear, quadratic, cubic, etc. The only conclusion that can be drawn for these factors is whether they contribute significantly to the regression model. This can be determined by employing the partial $$F$$ test of Chapter 5, Test on Subsets of Regression Coefficients (using the extra sum of squares of the indicator variables representing these factors). The results of the test for the present example are shown in the ANOVA table of Figure 5.27. The results show that $$x_2$$ (reactor type) contributes significantly to the fitted regression model.



Figure 5.26: Scatter plot of the observed yield values in Table 5.3 against $$x_2$$ (reactor type).



Figure 5.27: DOE++ results for the data in Table 5.3.

Multicollinearity
At times the predictor variables included in a multiple linear regression model may be found to be dependent on each other. Multicollinearity is said to exist in a multiple regression model with strong dependencies between the predictor variables.

Multicollinearity affects the regression coefficients and the extra sum of squares of the predictor variables. In a model with multicollinearity the estimate of the regression coefficient of a predictor variable depends on what other predictor variables are included in the model. The dependence may even lead to a change in the sign of the regression coefficient. In such models, an estimated regression coefficient may not be found to be significant individually (when using the $$t$$ test on the individual coefficient or looking at the $$p$$ value) even though a statistical relation is found to exist between the response variable and the set of the predictor variables (when using the $$F$$ test for the set of predictor variables). Therefore, you should be careful while looking at individual predictor variables in models that have multicollinearity. Care should also be taken while looking at the extra sum of squares for a predictor variable that is correlated with other variables. This is because in models with multicollinearity the extra sum of squares is not unique and depends on the other predictor variables included in the model. [Note]

Multicollinearity can be detected using the variance inflation factor (abbreviated $$VIF$$). $$VIF$$ for a coefficient $$\beta_j$$ is defined as:

$$VIF_j = \frac{1}{1 - R_j^2}$$ (39)

where $$R_j^2$$ is the coefficient of multiple determination resulting from regressing the $$j$$th predictor variable, $$x_j$$, on the remaining $$k-1$$ predictor variables. Mean values of $$VIF$$ considerably greater than 1 indicate multicollinearity problems.
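The definition translates directly into code: each predictor variable is regressed in turn on the remaining ones and the resulting $$R_j^2$$ is converted to a variance inflation factor. The function name `variance_inflation_factors` and the data below are hypothetical (they are not the data of Table 5.1).

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing the jth
    predictor column of X on the remaining columns (with an intercept)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]                                   # x_j as the response
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, target, rcond=None)[0]
        ss_e = np.sum((target - others @ beta) ** 2)
        ss_t = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ss_e / ss_t
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

# Hypothetical predictor levels: the two columns are strongly related
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])

vif = variance_inflation_factors(X)
print(vif)
```

With only two predictor variables, $$R_1^2$$ and $$R_2^2$$ both equal the squared correlation between the two variables, so the two variance inflation factors are identical.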

A few methods of dealing with multicollinearity include increasing the number of observations in a way designed to break up dependencies among predictor variables, combining the linearly dependent predictor variables into one variable, eliminating variables from the model that are unimportant or using coded variables. [Note]

Example 5.8

Variance inflation factors can be obtained for the data in Table 5.1. To calculate the variance inflation factor for $$x_1$$, $$R_1^2$$ has to be calculated. $$R_1^2$$ is the coefficient of determination for the model when $$x_1$$ is regressed on the remaining variables. In the case of this example there is just one remaining variable, which is $$x_2$$. If a regression model is fit to the data, taking $$x_1$$ as the response variable and $$x_2$$ as the predictor variable, then the design matrix and the vector of observations are:

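The regression sum of squares for this model can be obtained using Eqn. (17) as:

$$SS_R = y' \left[ H - \frac{1}{n} J \right] y$$

where $$H$$ is the hat matrix (and is calculated using $$H = X (X'X)^{-1} X'$$) and $$J$$ is the matrix of ones. The total sum of squares for the model can be calculated using Eqn. (31) as:

$$SS_T = y' \left[ I - \frac{1}{n} J \right] y$$

where $$I$$ is the identity matrix. Therefore:

$$R_1^2 = \frac{SS_R}{SS_T}$$

Then the variance inflation factor for $$x_1$$ is:

$$VIF_1 = \frac{1}{1 - R_1^2}$$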

The variance inflation factor for $$x_2$$, $$VIF_2$$, can be obtained in a similar manner. In DOE++, the variance inflation factors are displayed in the VIF column of the Regression Information Table as shown in Figure 5.28. Since the values of the variance inflation factors obtained are considerably greater than 1, multicollinearity is an issue for the data in Table 5.1.



Figure 5.28: Variance inflation factors for the data in Table 5.1.