Statistical Background on DOE

=Statistical Background=

Introduction
Variations occur in nature, be it the tensile strength of a particular grade of steel, caffeine content in your energy drink or the distance traveled by your vehicle in a day. Variations are also seen in the observations recorded during multiple executions of a process, even when all factors are strictly maintained at their respective levels and all the executions are run as identically as possible. The natural variations that occur in a process, even when all conditions are maintained at the same level, are often termed as noise. When the effect of a particular factor on a process is studied it becomes extremely important to distinguish the changes in the process caused by the factor from noise. A number of statistical methods are available to achieve this. This chapter covers basic statistical concepts that are useful in understanding the statistical analysis of data obtained from designed experiments. The initial sections of this chapter discuss the normal distribution and related concepts. The assumption of the normal distribution is widely used in the analysis of designed experiments. The subsequent sections introduce the standard normal, Chi-Squared, t and F distributions that are widely used in calculations related to hypothesis testing and confidence bounds. The final sections of this chapter cover hypothesis testing. It is important to gain a clear understanding of hypothesis testing because this concept finds direct application in the analysis of designed experiments to determine whether a particular factor is significant or not. [15]

Random Variables and the Normal Distribution
If you record the distance traveled by your car every day, these values will show some variation because it is unlikely that your car travels the same distance each day. If a variable $$X $$ is used to denote these values, then $$X $$ is termed a random variable (because of the diverse and unpredictable values $$X $$ can have). Random variables are denoted by uppercase letters, while a measured value of the random variable is denoted by the corresponding lowercase letter. For example, if the distance traveled by your car on January 1 was 10.7 miles then:


 * $$x=10.7\text{ miles} $$

A commonly used distribution to describe the behavior of random variables is the normal distribution. Summarizing a data set by only its mean and standard deviation often carries the implicit assumption that the data follow a normal distribution. A normal distribution (also referred to as the Gaussian distribution) is a bell-shaped curve (see Figure 3.1). The mean and standard deviation are the two parameters of this distribution. The mean determines the location of the distribution on the $$x $$ axis and is also called the location parameter of the normal distribution. The standard deviation determines the spread of the distribution (how narrow or wide) and is thus called the scale parameter of the normal distribution. The standard deviation, or its square, called the variance, gives an indication of the variability or spread of data. A large value of the standard deviation (or variance) implies that a large amount of variability exists in the data.



Figure 3.1: Normal probability density functions for different values of mean and standard deviation.

Any curve in Figure 3.1 is also referred to as the probability density function, or pdf, of the normal distribution, as the area under the curve gives the probability of occurrence of $$X $$ for a particular interval. For instance, if you obtained the mean and standard deviation for the distance data of your car as 15 miles and 2.5 miles respectively, then the probability that your car travels a distance between 7 miles and 14 miles is given by the area under the curve between these two values, which is calculated as 34.4% (see Figure 3.2). This means that on roughly 34 out of every 100 days your car travels, it can be expected to cover a distance in the range of 7 to 14 miles.



Figure 3.2: Normal probability density function with the shaded area representing the probability of occurrence of data between 7 and 14 miles.

On a normal probability density function, the area under the curve between the values of Mean-(3 x Standard Deviation) and Mean+(3 x Standard Deviation) is approximately 99.7% of the total area under the curve. For the distance example, this implies that almost all the time (or 99.7% of the time) the distance traveled will fall in the range of 7.5 miles to 22.5 miles. Similarly, Mean+/-(2 x Standard Deviation) covers approximately 95% of the area under the curve and Mean+/-(1 x Standard Deviation) covers approximately 68% of the area under the curve.
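The areas quoted above can be reproduced with a short Python sketch. It assumes the distance model from the text (mean 15 miles, standard deviation 2.5 miles) and uses `statistics.NormalDist` from the standard library; the helper name `interval_probability` is illustrative only.

```python
from statistics import NormalDist

# Distance model from the text: mean 15 miles, standard deviation 2.5 miles.
distance = NormalDist(mu=15, sigma=2.5)


def interval_probability(dist, low, high):
    """Area under the normal pdf between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Probability that the car travels between 7 and 14 miles.
p_7_to_14 = interval_probability(distance, 7, 14)
print(f"P(7 <= X <= 14) = {p_7_to_14:.3f}")

# Coverage of mean +/- k standard deviations for k = 1, 2, 3.
for k in (1, 2, 3):
    low = distance.mean - k * distance.stdev
    high = distance.mean + k * distance.stdev
    coverage = interval_probability(distance, low, high)
    print(f"k={k}: [{low}, {high}] covers {coverage:.4f}")
```

The coverage for a given k does not depend on the particular mean and standard deviation chosen, which is why the 68%, 95% and 99.7% figures apply to any normal distribution.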

Population Mean, Sample Mean and Variance
If data for all of the population under investigation is known, then the mean and variance for this population can be calculated as follows:

Population Mean: $$\mu=\frac{\sum_{i=1}^{N}x_i}{N} $$ (1)

Population Variance: $$\sigma^2=\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} $$ (2)

Here, $$N $$ is the size of the population.

The population standard deviation is the positive square root of the population variance.

Most of the time it is not possible to obtain data for the entire population. For example, it is impossible to measure the height of every male in a country to determine the average height and variance for males of a particular country. In such cases, results for the population have to be estimated using samples. This process is known as statistical inference. Mean and variance for a sample are calculated using the following relations:

Sample Mean: $$\bar{x}=\frac{\sum_{i=1}^{n}x_i}{n} $$ (3)

Sample Variance: $$s^2=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} $$ (4)

Here, $$n $$ is the sample size.

The sample standard deviation is the positive square root of the sample variance.

The sample mean and variance of a random sample can be used as estimators of the population mean and variance respectively. The sample mean and variance may be referred to as statistics. A statistic is any function of observations in a random sample.

You may have noticed that the denominator in the calculation of sample variance, unlike the denominator in the calculation of population variance, is ($$n-1 $$) and not $$n $$. The reason for this difference is explained in "Unbiased and Biased Estimators."
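As a minimal sketch of these calculations, the Python snippet below computes the sample mean, the sample variance with the ($$n-1 $$) denominator, and the variance with the $$n $$ denominator. The data values are hypothetical and chosen only for illustration; the `statistics` functions are used as a cross-check.

```python
import statistics

# Hypothetical daily distances in miles (illustrative values only).
distances = [10.7, 14.2, 16.1, 12.9, 18.3, 15.0, 13.8]

n = len(distances)
sample_mean = sum(distances) / n

# Sum of squared deviations about the sample mean.
squared_deviations = sum((x - sample_mean) ** 2 for x in distances)

sample_var = squared_deviations / (n - 1)  # sample variance, denominator n - 1
sample_std = sample_var ** 0.5             # sample standard deviation

# If the same values were the entire population, the denominator would be n.
population_var = squared_deviations / n

# Cross-check against the standard library implementations.
print(sample_mean, statistics.mean(distances))
print(sample_var, statistics.variance(distances))
print(population_var, statistics.pvariance(distances))
```

For the same data the value computed with ($$n-1 $$) is always the larger of the two, and the gap shrinks as $$n $$ grows.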

Central Limit Theorem
The Central Limit Theorem states that for large sample size $$n $$:

1. The sample means from a population are approximately normally distributed with a mean value equal to the population mean, $$\mu $$, even if the population is not normally distributed.

What this means is that if random samples are drawn from any population and the sample mean, $$\bar{x} $$, calculated for each of these samples, then these sample means would follow the normal distribution with a mean (or location parameter) equal to the population mean, $$\mu $$. Thus, the distribution of the statistic, $$\bar{x} $$, would be a normal distribution with mean $$\mu $$. The distribution of a statistic is called the sampling distribution.

2. The variance of the sample means would be $$n $$ times smaller than the variance of the population, $$\sigma^2 $$. This implies that the sampling distribution of the sample means would have a variance equal to $$\sigma^2/n $$ (or a scale parameter equal to $$\sigma/\sqrt{n} $$), where $$\sigma $$ is the population standard deviation. The standard deviation of the sampling distribution of an estimator is called the standard error of the estimator. Thus the standard error of sample mean $$\bar{x} $$ is $$\sigma/\sqrt{n} $$.

In short, the Central Limit Theorem states that the sampling distribution of the sample mean is a normal distribution with parameters $$\mu $$ and $$\sigma/\sqrt{n} $$ as shown in Figure 3.3.



Figure 3.3: Sampling distribution of the sample mean. The distribution is normal with the mean equal to the population mean and the variance equal to $$1/n $$ of the population variance.
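The two statements of the Central Limit Theorem can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the population is taken to be exponential with mean 1 and variance 1 (clearly not normal), and the sample size, number of samples and random seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 30              # size of each sample
num_samples = 5000  # number of samples drawn from the population

# Assumed population: exponential with rate 1, so mean 1 and variance 1.
pop_mean, pop_var = 1.0, 1.0

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

print(f"mean of sample means:     {mean_of_means:.4f} (population mean {pop_mean})")
print(f"variance of sample means: {var_of_means:.4f} (sigma^2/n = {pop_var / n:.4f})")
```

The mean of the sample means should land close to the population mean, and their variance close to $$\sigma^2/n $$, even though the individual observations are strongly skewed.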

Unbiased and Biased Estimators
If the mean value of an estimator equals the true value of the quantity it estimates, then the estimator is called an unbiased estimator (see Figure 3.4). For example, assume that the sample mean is being used to estimate the mean of a population. Using the Central Limit Theorem, the mean value of the sample means equals the population mean. Therefore, the sample mean is an unbiased estimator of the population mean.

If the mean value of an estimator is either less than or greater than the true value of the quantity it estimates, then the estimator is called a biased estimator. For example, suppose you decide to choose the smallest observation in a sample to be the estimator of the population mean. Such an estimator would be biased because the average of the values of this estimator would always be less than the true population mean. In other words, the mean of the sampling distribution of this estimator would be less than the true value of the population mean it is trying to estimate.



Figure 3.4: Example showing the distribution of a biased estimator which underestimates the parameter in question, along with the distribution of an unbiased estimator.

A case of biased estimation is seen to occur when sample variance, $$s^2 $$, is used to estimate the population variance, $$\sigma^2 $$, if the following relation is used to calculate the sample variance:

 * $$s^2=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n} $$

The sample variance calculated using this relation is, on average, less than the true population variance. This is because to calculate the sample variance, deviations with respect to the sample mean, $$\bar{x} $$, are used. Sample observations, $$x_i $$, tend to be closer to $$\bar{x} $$ than to $$\mu $$. Thus, the calculated deviations ($$x_i-\bar{x} $$) are smaller. As a result, the sample variance obtained is smaller than the population variance. To compensate for this, ($$n-1 $$) is used as the denominator in place of $$n $$ in the calculation of sample variance. Thus, the correct formula to obtain the sample variance is:

 * $$s^2=\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} $$

It is important to note that although using ($$n-1 $$) as the denominator makes the sample variance, $$s^2 $$, an unbiased estimator of the population variance, $$\sigma^2 $$, the sample standard deviation, $$s $$, still remains a biased estimator of the population standard deviation, $$\sigma $$. For large sample sizes this bias is negligible.
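The effect of the denominator can be seen in a simulation. The following sketch assumes a hypothetical normal population with mean 10 and variance 4, and repeatedly draws samples of size 5; the population, sample size, trial count and seed are all illustrative choices.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

mu, sigma = 10.0, 2.0  # hypothetical normal population, variance sigma^2 = 4
n = 5                  # small sample size makes the bias easy to see
trials = 20000

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)  # squared deviations about x_bar
    sum_biased += ss / n          # denominator n
    sum_unbiased += ss / (n - 1)  # denominator n - 1

avg_biased = sum_biased / trials
avg_unbiased = sum_unbiased / trials

print(f"average with denominator n:     {avg_biased:.3f}")
print(f"average with denominator n - 1: {avg_unbiased:.3f}")
print(f"true population variance:       {sigma ** 2:.3f}")
```

With the $$n $$ denominator the long-run average settles near $$(n-1)\sigma^2/n $$ (3.2 here), while the ($$n-1 $$) denominator brings it back to roughly the true variance of 4.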

Degrees of Freedom (dof)
Degrees of freedom refer to the number of independent observations made in excess of the unknowns. If there are 3 unknowns and 7 independent observations are taken, then the number of degrees of freedom is 4 (7 - 3 = 4). As another example, two parameters are needed to specify a line; therefore, there are 2 unknowns. If 10 points are available to fit the line, the number of degrees of freedom is 8 (10 - 2 = 8).

Standard Normal Distribution
A normal distribution with mean $$\mu=0 $$ and variance $$\sigma^2=1 $$ is called the standard normal distribution (see Figure 3.5). Standard normal random variables are denoted by Z. If X represents a normal random variable that follows the normal distribution with mean $$\mu $$ and variance $$\sigma^2 $$, then the corresponding standard normal random variable is:
 * $$Z=(X-\mu)/\sigma $$(5)

Z represents the distance of $$X $$ from the mean $$\mu $$ in terms of the standard deviation $$\sigma $$.



Figure 3.5: Standard normal distribution.

Chi-Squared Distribution
If Z is a standard normal random variable, then the distribution of $$Z^2 $$ is a Chi-Squared distribution (see Figure 3.6). A Chi-Squared random variable is represented by $$X^2 $$. Thus:

 * $$X^2=Z^2 $$(6)



Figure 3.6: Chi-Squared distribution.

The distribution of the variable $$X^2 $$ mentioned in the previous equation is also referred to as centrally distributed Chi-Squared with one degree of freedom. The degree of freedom is one here because here the Chi-Squared random variable is obtained from a single standard normal random variable Z. The previous equation may also be represented by including the degree of freedom into the equation as:


 * $$X_1^2=Z^2 $$

If $$Z_1,Z_2,Z_3...Z_m $$ are independent standard normal random variables then:

 * $$X_m^2=Z_1^2+Z_2^2+Z_3^2+...+Z_m^2 $$

is also a Chi-Squared random variable. The distribution of $$X_m^2 $$ is said to be centrally Chi-Squared with $$m $$ degrees of freedom, as the Chi-Squared random variable is obtained from $$m $$ independent standard normal random variables.
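The construction of a Chi-Squared random variable from standard normal random variables can be sketched by simulation. The snippet below assumes $$m=4 $$ and an arbitrary number of draws and seed. It checks two standard properties that are not stated in the text above: a centrally distributed Chi-Squared variable with $$m $$ degrees of freedom has mean $$m $$ and variance $$2m $$.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

m = 4          # degrees of freedom (illustrative choice)
draws = 20000  # number of simulated Chi-Squared values

# Each value is the sum of m squared independent standard normal draws.
chi_sq = [
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(m))
    for _ in range(draws)
]

mean_chi = statistics.fmean(chi_sq)
var_chi = statistics.variance(chi_sq)
min_chi = min(chi_sq)

print(f"simulated mean:     {mean_chi:.3f} (theory: {m})")
print(f"simulated variance: {var_chi:.3f} (theory: {2 * m})")
print(f"smallest value:     {min_chi:.4f} (never negative)")
```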

If $$Y $$ is a normal random variable with a nonzero mean and unit variance, then the distribution of $$Y^2 $$ is said to be non-centrally distributed Chi-Squared with one degree of freedom. Therefore, $$Y^2 $$ is a Chi-Squared random variable and can be represented as:

 * $$X_1^2=Y^2 $$

If $$Y_1,Y_2,Y_3...Y_m $$ are independent normal random variables of this kind then:

 * $$X_m^2=Y_1^2+Y_2^2+Y_3^2+...+Y_m^2 $$

is a non-centrally distributed Chi-Squared random variable with $$m $$ degrees of freedom.

Student's t Distribution (t Distribution)
If $$Z $$ is a standard normal random variable, and $$X_k^2 $$ is a Chi-Squared random variable with $$k $$ degrees of freedom, and both of these random variables are independent, then the distribution of the random variable $$T $$ such that:

 * $$T=\frac{Z}{\sqrt{X_k^2/k}} $$(7)

is said to follow the $$t $$ distribution with $$k $$ degrees of freedom.



Figure 3.7: $$t $$ distribution.

The $$t $$ distribution is similar in appearance to the standard normal distribution (see Figure 3.7). Both of these distributions are symmetric, reaching a maximum at the mean value of zero. However, the $$t $$ distribution has heavier tails than the standard normal distribution, implying that it has more probability in the tails. As the degrees of freedom, $$k $$, of the $$t $$ distribution approach infinity, the distribution approaches the standard normal distribution.

F Distribution
If $$X_u^2 $$ and $$X_\upsilon^2 $$ are two independent Chi-Squared random variables with $$u $$ and $$\upsilon $$ degrees of freedom, respectively, then the distribution of the random variable $$F $$ such that:
 * $$F=\frac{\frac{X_u^2}{u}}{\frac{X_\upsilon^2}{\upsilon}} $$(8)

is said to follow the $$F $$ distribution with $$u $$ degrees of freedom in the numerator and $$\upsilon $$ degrees of freedom in the denominator. The $$F $$ distribution resembles the Chi-Squared distribution (see Figure 3.8). This is because the $$F $$ random variable, like the Chi-Squared random variable, is non-negative and the distribution is skewed to the right (a right skew means that the distribution is asymmetric and has a long right tail). The $$F $$ random variable is usually abbreviated by including the degrees of freedom as $$F_{u,\upsilon} $$.



Figure 3.8: F distribution.

Hypothesis Testing
A statistical hypothesis is a statement about the population under study or about the distribution of a quantity under consideration. The null hypothesis, $$H_0 $$, is the hypothesis to be tested. It is a statement about a theory that is believed to be true but has not been proven. For instance, if a new product design is thought to perform consistently, regardless of the region of operation, then the null hypothesis may be stated as "$$H_0 $$: New product design performance is not affected by region." Statements in $$H_0 $$ always include exact values of parameters under consideration, e.g. "$$H_0 $$: The population mean is 100" or simply
 * "$$H_0:\mu=100 $$."

Rejection of the null hypothesis, $$H_0 $$, leads to the possibility that the alternative hypothesis, $$H_1 $$, may be true. Given the previous null hypothesis, the alternative hypothesis may be "$$H_1 $$: New product design performance is affected by region." In the case of the example regarding inference on the population mean, the alternative hypothesis may be stated as "$$H_1 $$: The population mean is not 100" or simply
 * "$$H_1:\mu $$ &ne; $$100 $$."

Hypothesis testing involves the calculation of a test statistic based on a random sample drawn from the population. The test statistic is then compared to the critical value(s) and used to make a decision about the null hypothesis. The critical values are set by the analyst.

The outcome of a hypothesis test is that we either "reject $$H_0 $$" or we "fail to reject $$H_0 $$." Failing to reject $$H_0 $$ implies that we did not find sufficient evidence to reject $$H_0 $$. It does not necessarily mean that there is a high probability that $$H_0 $$ is true. As such, the terminology "accept $$H_0 $$" is not preferred.

Example 3.1

Assume that an analyst wants to know if the mean of a certain population is 100 or not. The statements for this hypothesis can be stated as follows:


 * $$H_0:\mu=100 $$
 * $$H_1:\mu $$ &ne; $$100 $$

The analyst decides to use the sample mean as the test statistic for this test. The analyst further decides that if the sample mean lies between 98 and 102 it can be concluded that the population mean is 100. Thus, the critical values set for this test by the analyst are 98 and 102. It is also decided to draw out a random sample of size 25 from the population.

Now assume that the true population mean is 100 (i.e. $$\mu=100 $$) and the true population standard deviation is 5 (i.e. $$\sigma=5 $$). This information is not known to the analyst. Using the Central Limit Theorem, the test statistic (sample mean) will follow a normal distribution with a mean equal to the population mean, $$\mu $$, and a standard deviation of $$\sigma/\sqrt{n} $$, where $$n $$ is the sample size. Therefore, the distribution of the test statistic has a mean of 100 and a standard deviation of $$5/\sqrt{25}=1 $$. This distribution is shown in Figure 3.9.



Figure 3.9: Acceptance region and critical regions for the hypothesis test in Example 3.1.

The unshaded area in the figure bound by the critical values of 98 and 102 is called the acceptance region. The acceptance region gives the probability that a random sample drawn from the population would have a sample mean that lies between 98 and 102. Therefore, this is the region that will lead to the conclusion of "fail to reject $$H_0 $$". On the other hand, the shaded area gives the probability that the sample mean obtained from the random sample lies outside of the critical values. In other words, it gives the probability of rejection of the null hypothesis when the true mean is 100. The shaded area is referred to as the critical region or the rejection region. Rejection of the null hypothesis when it is true is referred to as type I error. Thus, there is a 4.56% chance of making a type I error in this hypothesis test. This percentage is called the significance level of the test and is denoted by $$\alpha $$. Here $$\alpha=0.0456 $$ or $$4.56\% $$ (area of the shaded region in the figure). The value of $$\alpha $$ is set by the analyst when they choose the critical values.

A type II error is also defined in hypothesis testing. This error occurs when the analyst fails to reject the null hypothesis when it is actually false. Such an error would occur if the value of the sample mean obtained is in the acceptance region bounded by 98 and 102 even though the true population mean is not 100. The probability of occurrence of type II error is denoted by $$\beta $$.
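The type I and type II error probabilities for Example 3.1 can be sketched in Python as follows. The function name `rejection_probability` is illustrative, and the alternative true mean of 103 used for the type II error is a hypothetical value chosen only to show the calculation. The exact tail area for the type I error evaluates to roughly 0.0455, which agrees with the 4.56% quoted above up to rounding.

```python
from math import sqrt
from statistics import NormalDist


def rejection_probability(true_mean, sigma=5.0, n=25, lower=98.0, upper=102.0):
    """Probability that the sample mean falls outside [lower, upper]."""
    # Sampling distribution of the sample mean: normal with sd sigma / sqrt(n).
    sampling_dist = NormalDist(mu=true_mean, sigma=sigma / sqrt(n))
    inside = sampling_dist.cdf(upper) - sampling_dist.cdf(lower)
    return 1.0 - inside


# Type I error: H0 is true (mean is 100) but the sample mean lands
# in the critical region.
alpha = rejection_probability(true_mean=100.0)

# Type II error for a hypothetical true mean of 103: H0 is false but
# the sample mean still lands in the acceptance region.
beta = 1.0 - rejection_probability(true_mean=103.0)

print(f"alpha = {alpha:.4f}")
print(f"beta (true mean 103) = {beta:.4f}")
```

Unlike $$\alpha $$, the value of $$\beta $$ depends on how far the true mean is from the hypothesized value, so it has to be evaluated for each alternative of interest.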

Two-Sided and One-Sided Hypotheses
As seen in the previous section, the critical region for the hypothesis test is split into two parts, with equal areas in each tail of the distribution of the test statistic. Such a hypothesis, in which the values for which we can reject $$H_0 $$ are in both tails of the probability distribution, is called a two-sided hypothesis.

The hypothesis for which the critical region lies only in one tail of the probability distribution is called a one-sided hypothesis. For instance, consider the following hypothesis test:


 * $$H_0:\mu=100 $$
 * $$H_1:\mu $$ > $$100 $$

This is an example of a one-sided hypothesis. Here the critical region lies entirely in the right tail of the distribution as shown in Figure 3.10.

The hypothesis test may also be set up as follows:


 * $$H_0:\mu=100 $$
 * $$H_1:\mu $$ < $$100 $$

This is also a one-sided hypothesis. Here the critical region lies entirely in the left tail of the distribution as shown in Figure 3.11.



Figure 3.10: One-sided hypothesis where the critical region lies in the right tail.



Figure 3.11: One-sided hypothesis where the critical region lies in the left tail.

Statistical Inference for a Single Sample
Hypothesis testing forms an important part of statistical inference. As stated previously, statistical inference refers to the process of estimating results for the population based on measurements from a sample. The next sections briefly discuss statistical inference for a single sample: inference on the mean of a population when the variance is known, inference on the mean when the variance is unknown, and inference on the variance of a normal population.

Inference on the Mean of a Population When the Variance Is Known
The test statistic used in this case is based on the standard normal distribution. If $$\bar{X} $$ is the calculated sample mean, then the standard normal test statistic is:


 * $$Z_0=\frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} $$(9)

where $$\mu_0 $$ is the hypothesized population mean, $$\sigma $$ is the population standard deviation and $$n $$ is the sample size.

Example 3.2

Assume that an analyst wants to know if the mean of a population, $$\mu $$, is 100. The population variance, $$\sigma^2 $$, is known to be 25. The hypothesis test may be conducted as follows:

1. The statements for this hypothesis test may be formulated as:


 * $$H_0:\mu=100 $$
 * $$H_1:\mu $$ &ne; $$100 $$

It is clear that this is a two-sided hypothesis. Thus the critical region will lie in both of the tails of the probability distribution.

2. Assume that the analyst chooses a significance level of 0.05. Thus $$\alpha=0.05 $$. The significance level determines the critical values of the test statistic. Here the test statistic is based on the standard normal distribution. For the two-sided hypothesis these values are obtained as:


 * $$z_{\alpha/2}=z_{0.025}=1.96 $$

and


 * $$-z_{\alpha/2}=-z_{0.025}=-1.96 $$

3. These values and the critical regions are shown in Figure 3.12. The analyst would fail to reject $$H_0 $$ if the test statistic, $$Z_0 $$, is such that:


 * $$-z_{\alpha/2}\leq Z_0\leq z_{\alpha/2} $$

or


 * -1.96 $$\leq $$ $$Z_0 $$ $$\leq $$ 1.96



Figure 3.12: Critical values and rejection region for Example 3.2 marked on the standard normal distribution.

4. Next the analyst draws a random sample from the population. Assume that the sample size, $$n $$, is 25 and the sample mean is obtained as $$\bar{x}=103 $$. The value of the test statistic corresponding to the sample mean value of 103 is:

 * $$Z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{103-100}{5/\sqrt{25}}=3 $$

Since this value does not lie in the acceptance region, we reject $$H_0:\mu=100 $$ at a significance level of 0.05.

P Value
In the previous example the null hypothesis was rejected at a significance level of 0.05. This statement does not provide information as to how far out the test statistic was into the critical region. At times it is necessary to know if the test statistic was just into the critical region or was far out into the region. This information can be provided by using the $$p $$ value.

The $$p $$ value is the probability of occurrence of the values of the test statistic that are either equal to the one obtained from the sample or more unfavorable to $$H_0 $$ than the one obtained from the sample. It is the lowest significance level that would lead to the rejection of the null hypothesis, $$H_0 $$, at the given value of the test statistic. The value of the test statistic is referred to as significant when $$H_0 $$ is rejected. The $$p $$ value is the smallest $$\alpha $$ at which the statistic is significant and $$H_0 $$ is rejected.

For instance, in the previous example the test statistic was obtained as $$Z_0=3 $$. Values that are more unfavorable to $$H_0 $$ in this case are values greater than 3. Then the required probability is the probability of getting a test statistic value either equal to or greater than 3 (this is abbreviated as $$P(Z\geq 3) $$). This probability is shown in Figure 3.13 as the dark shaded area on the right tail of the distribution and is equal to 0.0013 or 0.13% (i.e. $$P(Z\geq 3)=0.0013 $$). Since this is a two-sided test the $$p $$ value is:


 * p value = 2 x 0.0013 = 0.0026

Therefore, the smallest $$\alpha $$ (corresponding to the test statistic value of 3) that would lead to the rejection of $$H_0 $$ is 0.0026.
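The steps of Example 3.2, including the $$p $$ value, can be retraced with the Python standard library. This is a sketch using the numbers given in the example; the variable names are illustrative. Without intermediate rounding the two-sided $$p $$ value comes out near 0.0027, which matches the 0.0026 above up to rounding of the one-tail area.

```python
from math import sqrt
from statistics import NormalDist

# Values from Example 3.2.
mu_0 = 100.0   # hypothesized population mean
sigma = 5.0    # known population standard deviation (variance 25)
n = 25         # sample size
x_bar = 103.0  # observed sample mean
alpha = 0.05   # significance level

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Test statistic and two-sided critical value.
z_0 = (x_bar - mu_0) / (sigma / sqrt(n))
z_crit = std_normal.inv_cdf(1 - alpha / 2)
reject = abs(z_0) > z_crit

# p value: one-tail area beyond |z_0|, doubled for the two-sided test.
upper_tail = 1 - std_normal.cdf(abs(z_0))
p_value = 2 * upper_tail

print(f"Z0 = {z_0:.2f}, critical values = +/-{z_crit:.2f}, reject H0: {reject}")
print(f"one-tail area = {upper_tail:.4f}, p value = {p_value:.4f}")
```

Comparing `p_value` with `alpha` leads to the same decision as comparing `z_0` with the critical values, since both describe the same tail area.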



Figure 3.13: P value for Example 3.2.

Inference on Mean of a Population When Variance Is Unknown
When the variance, $$\sigma^2 $$, of a population (that can be assumed to be normally distributed) is unknown, the sample variance, $$S^2 $$, is used in its place in the calculation of the test statistic. The test statistic used in this case is based on the $$t $$ distribution and is obtained using the following relation:

 * $$T_0=\frac{\bar{X}-\mu_0}{S/\sqrt{n}} $$(10)

The test statistic follows the $$t $$ distribution with $$n-1 $$ degrees of freedom.

Example 3.3

Assume that an analyst wants to know if the mean of a population, $$\mu $$, is less than 50 at a significance level of 0.05. A random sample drawn from the population gives the sample mean, $$\bar{x} $$, as 47.7 and the sample standard deviation, $$s $$, as 5. The sample size, $$n $$, is 25. The hypothesis test may be conducted as follows:

1. The statements for this hypothesis test may be formulated as:

 * $$H_0:\mu=50 $$
 * $$H_1:\mu $$ < $$50 $$

It is clear that this is a one-sided hypothesis. Here the critical region will lie in the left tail of the probability distribution.

2. Significance level, $$\alpha=0.05 $$. Here, the test statistic is based on the $$t $$ distribution. Thus, for the one-sided hypothesis the critical value is obtained as:

 * $$-t_{\alpha,n-1}=-t_{0.05,24}=-1.7109 $$

This value and the critical regions are shown in Figure 3.14. The analyst would fail to reject $$H_0 $$ if the test statistic $$T_0 $$ is such that:

 * $$T_0\geq -t_{0.05,24} $$

3. The value of the test statistic, $$T_0 $$, corresponding to the given sample data is:

 * $$T_0=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{47.7-50}{5/\sqrt{25}}=-2.3 $$

Since $$T_0 $$ is less than the critical value of -1.7109, $$H_0:\mu=50 $$ is rejected and it is concluded that at a significance level of 0.05 the population mean is less than 50.

4. P value

In this case the $$p $$ value is the probability that the test statistic is either less than or equal to $$-2.3 $$ (since values less than $$-2.3 $$ are unfavorable to $$H_0 $$). This probability is equal to 0.0152.
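The arithmetic of Example 3.3 can be checked with a short Python sketch. The standard library has no $$t $$ distribution, so the helpers `t_pdf` and `t_cdf` below are illustrative functions (not part of any library) that integrate the textbook density of the $$t $$ distribution numerically with Simpson's rule. The critical value of -1.7109 is taken from the example rather than computed.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(t, k):
    """Density of the t distribution with k degrees of freedom."""
    coeff = exp(lgamma((k + 1) / 2) - lgamma(k / 2)) / sqrt(k * pi)
    return coeff * (1 + t * t / k) ** (-(k + 1) / 2)


def t_cdf(t, k, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry about zero."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, k) + t_pdf(a, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, k)
    area = total * h / 3  # integral of the pdf from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area


# Values from Example 3.3.
mu_0, x_bar, s, n = 50.0, 47.7, 5.0, 25
t_crit = -1.7109  # -t_{0.05,24}, as quoted in the example

t_0 = (x_bar - mu_0) / (s / sqrt(n))
reject = t_0 < t_crit          # left-tailed test
p_value = t_cdf(t_0, n - 1)    # P(T <= t_0) with n - 1 degrees of freedom

print(f"T0 = {t_0:.2f}, reject H0: {reject}, p value = {p_value:.4f}")
```

Evaluating `t_cdf(-1.7109, 24)` should return a value close to 0.05, which is a quick way to confirm that the numerical integration agrees with the tabulated critical value.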



Figure 3.14: Critical value and rejection region for Example 3.3 marked on the $$t $$ distribution.

Inference on Variance of a Normal Population
The test statistic used in this case is based on the Chi-Squared distribution. If $$S^2 $$ is the calculated sample variance and $$\sigma_0^2 $$ the hypothesized population variance then the Chi-Squared test statistic is:

 * $$X_0^2=\frac{(n-1)S^2}{\sigma_0^2} $$(11)

The test statistic follows the Chi-Squared distribution with $$n-1 $$ degrees of freedom.



Example 3.4

Assume that an analyst wants to know if the variance of a population exceeds 1 at a significance level of 0.05. A random sample drawn from the population gives the sample variance as 2. The sample size, $$n $$, is 20. The hypothesis test may be conducted as follows:

1. The statements for this hypothesis test may be formulated as:

 * $$H_0:\sigma^2=1 $$
 * $$H_1:\sigma^2 $$ > $$1 $$

This is a one-sided hypothesis. Here the critical region will lie in the right tail of the probability distribution.

2. Significance level, $$\alpha=0.05 $$. Here, the test statistic is based on the Chi-Squared distribution. Thus for the one-sided hypothesis the critical value is obtained as:

 * $$X_{\alpha,n-1}^2=X_{0.05,19}^2=30.1435 $$

This value and the critical regions are shown in Figure 3.15. The analyst would fail to reject $$H_0 $$ if the test statistic $$X_0^2 $$ is such that:

 * $$X_0^2\leq X_{0.05,19}^2 $$



Figure 3.15: Critical value and rejection region for Example 3.4 marked on the Chi-Squared distribution.

3. The value of the test statistic corresponding to the given sample data is:

 * $$X_0^2=\frac{(n-1)s^2}{\sigma_0^2}=\frac{(20-1)\times 2}{1}=38 $$

Since $$X_0^2 $$ is greater than the critical value of 30.1435, $$H_0:\sigma^2=1 $$ is rejected and it is concluded that at a significance level of 0.05 the population variance exceeds 1.

4. P value

In this case the $$p $$ value is the probability that the test statistic is greater than or equal to 38 (since values greater than 38 are unfavorable to $$H_0 $$). This probability is determined to be 0.0059.
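Example 3.4 can be retraced in the same spirit. Since the Python standard library has no Chi-Squared distribution, the helpers `chi2_pdf` and `chi2_upper_tail` below are illustrative functions that integrate the textbook Chi-Squared density with Simpson's rule; they are a sketch that is adequate for more than two degrees of freedom, not a general-purpose implementation. The critical value 30.1435 is taken from the example.

```python
from math import exp, lgamma, log


def chi2_pdf(x, k):
    """Density of the Chi-Squared distribution with k degrees of freedom (k > 2)."""
    if x <= 0:
        return 0.0  # valid at x = 0 only because k > 2 is assumed
    return exp((k / 2 - 1) * log(x) - x / 2 - (k / 2) * log(2) - lgamma(k / 2))


def chi2_upper_tail(x, k, steps=4000):
    """P(X^2 >= x): one minus the Simpson's-rule integral of the pdf on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = chi2_pdf(0.0, k) + chi2_pdf(x, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, k)
    return 1.0 - total * h / 3


# Values from Example 3.4.
n = 20              # sample size
s_sq = 2.0          # sample variance
sigma_0_sq = 1.0    # hypothesized population variance
chi_crit = 30.1435  # critical value quoted in the example (0.05, 19 dof)

chi_0 = (n - 1) * s_sq / sigma_0_sq
reject = chi_0 > chi_crit               # right-tailed test
p_value = chi2_upper_tail(chi_0, n - 1)

print(f"test statistic = {chi_0:.1f}, reject H0: {reject}, p value = {p_value:.4f}")
```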

Statistical Inference for Two Samples
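This section briefly covers statistical inference for two samples: inference on the difference in population means when the variances are known, inference on the difference in population means when the variances are unknown, and inference on the variances of two normal populations.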

Inference on the Difference in Population Means When Variances Are Known
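The test statistic used here is based on the standard normal distribution. Let $$\mu_1 $$ and $$\mu_2 $$ represent the means of two populations, and $$\sigma_1^2 $$ and $$\sigma_2^2 $$ their variances, respectively. Let $$\Delta_0 $$ be the hypothesized difference in the population means and $$\bar{X}_1 $$ and $$\bar{X}_2 $$ be the sample means obtained from two samples of sizes $$n_1 $$ and $$n_2 $$ drawn randomly from the two populations, respectively. The test statistic can be obtained as:

 * $$Z_0=\frac{\bar{X}_1-\bar{X}_2-\Delta_0}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} $$(12)

The statements for the hypothesis test are:

 * $$H_0:\mu_1-\mu_2=\Delta_0 $$
 * $$H_1:\mu_1-\mu_2 $$ &ne; $$\Delta_0 $$

If $$\Delta_0=0 $$, then the hypothesis will test for the equality of the two population means.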

Inference on the Difference in Population Means When Variances Are Unknown
If the population variances can be assumed to be equal then the following test statistic based on the $$t $$ distribution can be used. Let $$\bar{X}_1 $$, $$\bar{X}_2 $$, $$S_1^2 $$ and $$S_2^2 $$ be the sample means and variances obtained from randomly drawn samples of sizes $$n_1 $$ and $$n_2 $$ from the two populations, respectively. The weighted average, $$S_p^2 $$, of the two sample variances is:

 * $$S_p^2=\frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2} $$



$$S_p^2 $$ has ($${n_1+n_2}-2 $$) degrees of freedom. The test statistic can be calculated as:

 * $$T_0=\frac{\bar{X}_1-\bar{X}_2-\Delta_0}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} $$(13)

$$T_0 $$ follows the $$t $$ distribution with ($${n_1+n_2}-2 $$) degrees of freedom. This test is also referred to as the two-sample pooled t test.

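If the population variances cannot be assumed to be equal then the following test statistic is used:

 * $$T_0^*=\frac{\bar{X}_1-\bar{X}_2-\Delta_0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}} $$(14)

$$T_0^* $$ follows the $$t $$ distribution with $$\upsilon $$ degrees of freedom. $$\upsilon $$ is commonly approximated using the Welch-Satterthwaite relation:

 * $$\upsilon=\frac{\left(\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}\right)^2}{\frac{(S_1^2/n_1)^2}{n_1-1}+\frac{(S_2^2/n_2)^2}{n_2-1}} $$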
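The two-sample test statistics for unknown variances can be sketched as small Python functions. The function names and the summary statistics in the usage lines are hypothetical. For the unequal-variance case, the degrees of freedom are computed with the widely used Welch-Satterthwaite approximation, which is an assumption here; some textbooks use a slightly different variant of this formula.

```python
from math import sqrt


def pooled_t(x_bar1, x_bar2, s1_sq, s2_sq, n1, n2, delta_0=0.0):
    """Pooled two-sample t statistic and its degrees of freedom."""
    # Weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    t_0 = (x_bar1 - x_bar2 - delta_0) / (sqrt(sp_sq) * sqrt(1 / n1 + 1 / n2))
    return t_0, n1 + n2 - 2


def unequal_variance_t(x_bar1, x_bar2, s1_sq, s2_sq, n1, n2, delta_0=0.0):
    """t statistic without pooling, with Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1_sq / n1, s2_sq / n2
    t_star = (x_bar1 - x_bar2 - delta_0) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_star, dof


# Hypothetical summary statistics, for illustration only.
t_0, dof_pooled = pooled_t(10.0, 8.0, 4.0, 9.0, 10, 15)
t_star, dof_welch = unequal_variance_t(10.0, 8.0, 4.0, 9.0, 10, 15)

print(f"pooled:           T0  = {t_0:.3f} with {dof_pooled} degrees of freedom")
print(f"unequal variance: T0* = {t_star:.3f} with {dof_welch:.1f} degrees of freedom")
```

When the two sample variances and sample sizes are equal, both functions return the same statistic and the same degrees of freedom, which is a convenient sanity check.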



Inference on the Variances of Two Normal Populations
The test statistic used here is based on the $$F $$ distribution. If $$S_1^2 $$ and $$S_2^2 $$ are the sample variances obtained from samples drawn randomly from the two populations and $$n_1 $$ and $$n_2 $$ are the two sample sizes, respectively, then the test statistic that can be used to test the equality of the population variances is:

 * $$F_0=\frac{S_1^2}{S_2^2} $$(15)

The test statistic follows the $$F $$ distribution with ($$n_1-1 $$) degrees of freedom in the numerator and ($$n_2-1 $$) degrees of freedom in the denominator.

Example 3.5

Assume that an analyst wants to know if the variances of two normal populations are equal at a significance level of 0.05. Random samples drawn from the two populations give the sample standard deviations as 1.84 and 2, respectively. Both the sample sizes are 20. The hypothesis test may be conducted as follows:

1. The statements for this hypothesis test may be formulated as:

 * $$H_0:\sigma_1^2=\sigma_2^2 $$
 * $$H_1:\sigma_1^2 $$ &ne; $$\sigma_2^2 $$

It is clear that this is a two-sided hypothesis and the critical region will be located on both sides of the probability distribution.

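2. Significance level $$\alpha=0.05 $$. Here the test statistic is based on the $$F $$ distribution. For the two-sided hypothesis the critical values are obtained as:

 * $$f_{\alpha/2,n_1-1,n_2-1}=f_{0.025,19,19}\approx 2.53 $$

and

 * $$f_{1-\alpha/2,n_1-1,n_2-1}=f_{0.975,19,19}\approx 0.40 $$

These values and the critical regions are shown in Figure 3.16. The analyst would fail to reject $$H_0 $$ if the test statistic $$F_0 $$ is such that:

 * $$f_{1-\alpha/2,n_1-1,n_2-1}\leq F_0\leq f_{\alpha/2,n_1-1,n_2-1} $$

or

 * $$0.40\leq F_0\leq 2.53 $$

3. The value of the test statistic corresponding to the given data is:

 * $$F_0=\frac{s_1^2}{s_2^2}=\frac{1.84^2}{2^2}=0.8464 $$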

Since $$F_0 $$ lies in the acceptance region, the analyst fails to reject $$H_0:\sigma_1^2=\sigma_2^2 $$ at a significance level of 0.05.
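The test statistic of Example 3.5 is simple to verify in Python. In the sketch below the upper critical value of about 2.53 is an assumed value read from standard $$F $$ tables for 19 and 19 degrees of freedom at an upper-tail area of 0.025, not a computed result. The lower critical value is obtained from the reciprocal relation between the two tails of the $$F $$ distribution, which applies directly here because both degrees of freedom are equal.

```python
# Values from Example 3.5.
s1, s2 = 1.84, 2.0  # sample standard deviations
n1 = n2 = 20        # sample sizes

f_0 = s1 ** 2 / s2 ** 2        # ratio of the two sample variances
dof_num, dof_den = n1 - 1, n2 - 1

# Assumed table value (approximate) for the upper critical value.
f_upper = 2.53
# The lower critical value is the reciprocal of the upper one with the
# degrees of freedom swapped; they are equal here, so no swap is needed.
f_lower = 1 / f_upper

fail_to_reject = f_lower <= f_0 <= f_upper

print(f"F0 = {f_0:.4f} with ({dof_num}, {dof_den}) degrees of freedom")
print(f"acceptance region = [{f_lower:.3f}, {f_upper:.2f}], "
      f"fail to reject H0: {fail_to_reject}")
```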



Figure 3.16: Critical values and rejection region for Example 3.5 marked on the $$F $$ distribution.