Template:Rank Regression/Least Squares PE

Rank Regression or Least Squares Parameter Estimation
Using the idea of probability plotting, regression analysis mathematically fits the best straight line to a set of points, in an attempt to estimate the parameters. Essentially, this is a mathematically based version of the probability plotting method discussed previously.

Background Theory
The method of linear least squares is used for all regression analysis performed by Weibull++, except for the cases of the three-parameter Weibull, mixed Weibull, gamma and generalized gamma distributions, where a non-linear regression technique is employed. The terms linear regression and least squares are used synonymously in this reference. The term rank regression is used instead of least squares, or linear regression, because the regression is performed on the rank values, more specifically, the median rank values (represented on the y-axis). The method of least squares requires that a straight line be fitted to a set of data points such that the sum of the squares of the distances from the points to the fitted line is minimized. This minimization can be performed in either the vertical or the horizontal direction. If the regression is on $$X$$, the line is fitted so that the horizontal deviations from the points to the line are minimized. If the regression is on $$Y$$, the line is fitted so that the vertical deviations from the points to the line are minimized. This is illustrated in the following figure.



Rank Regression on $$Y$$
Assume that a set of data pairs $$({x_1},{y_1})$$, $$({x_2},{y_2})$$,..., $$({x_N},{y_N})$$ were obtained and plotted, and that the $$x$$-values are known exactly. Then, according to the least squares principle, which minimizes the vertical distance between the data points and the straight line fitted to the data, the best fitting straight line to these data is the straight line $$y=\hat{a}+\hat{b}x$$ (where the $$(\hat{ })$$ symbol indicates that the value is an estimate) such that:

 * $$\underset{i=1}{\overset{N}{\mathop \sum }}\,{{(\hat{a}+\hat{b}{{x}_{i}}-{{y}_{i}})}^{2}}=\underset{(a,b)}{\mathop{\min }}\,\underset{i=1}{\overset{N}{\mathop \sum }}\,{{(a+b{{x}_{i}}-{{y}_{i}})}^{2}}$$

where $$\hat{a}$$ and $$\hat{b}$$ are the least squares estimates of $$a$$ and $$b$$, and $$N$$ is the number of data points. The sum of squares is minimized by estimates $$\hat{a}$$ and $$\hat{b}$$ such that:


 * $$\hat{a}=\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}}}{N}-\hat{b}\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}}{N}=\bar{y}-\hat{b}\bar{x}$$


 * and:


 * $$\hat{b}=\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}{{y}_{i}}-\tfrac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}}}{N}}{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,x_{i}^{2}-\tfrac{{{\left( \underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}} \right)}^{2}}}{N}}$$
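The closed-form estimates for regression on $$Y$$ can be sketched in a few lines of Python. This is a minimal illustration, not the Weibull++ implementation; the function name `rank_regression_on_y` and its argument names are assumptions made for this example.

```python
def rank_regression_on_y(xs, ys):
    """Fit y = a + b*x by minimizing the vertical deviations.

    Hypothetical helper for illustration; returns (a_hat, b_hat).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Denominator of b_hat: sum(x^2) - (sum(x))^2 / N
    denom = sum_xx - sum_x ** 2 / n
    if denom == 0:
        raise ValueError("all x-values are identical; the slope is undefined")

    b_hat = (sum_xy - sum_x * sum_y / n) / denom
    a_hat = sum_y / n - b_hat * sum_x / n  # y_bar - b_hat * x_bar
    return a_hat, b_hat
```

For points that lie exactly on a line, the function should recover that line's intercept and slope up to floating-point rounding.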

Rank Regression on X
Assume that a set of data pairs $$({x_1},{y_1})$$, $$({x_2},{y_2})$$,..., $$({x_N},{y_N})$$ were obtained and plotted, and that the $$y$$-values are known exactly. The same least squares principle is applied, this time minimizing the horizontal distance between the data points and the straight line fitted to the data. The best fitting straight line to these data is the straight line $$x=\widehat{a}+\widehat{b}y$$ such that:


 * $$\underset{i=1}{\overset{N}{\mathop \sum }}\,{{(\widehat{a}+\widehat{b}{{y}_{i}}-{{x}_{i}})}^{2}}=\underset{(a,b)}{\mathop{\min }}\,\underset{i=1}{\overset{N}{\mathop \sum }}\,{{(a+b{{y}_{i}}-{{x}_{i}})}^{2}}$$

Again, $$\widehat{a}$$ and $$\widehat{b}$$ are the least squares estimates of $$a$$ and $$b$$, and $$N$$ is the number of data points. The sum of squares is minimized by estimates $$\widehat{a}$$ and $$\widehat{b}$$ such that:


 * $$\hat{a}=\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}}{N}-\hat{b}\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}}}{N}=\bar{x}-\hat{b}\bar{y}$$


 * and:


 * $$\widehat{b}=\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}{{y}_{i}}-\tfrac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}}}{N}}{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,y_{i}^{2}-\tfrac{{{\left( \underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}} \right)}^{2}}}{N}}$$
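Regression on $$X$$ swaps the roles of the two variables: the fitted line is $$x=\widehat{a}+\widehat{b}y$$ and the denominator of the slope uses the $$y$$-values. The sketch below is an illustration only; the names `rank_regression_on_x` and `as_y_line` are hypothetical and do not come from Weibull++.

```python
def rank_regression_on_x(xs, ys):
    """Fit x = a + b*y by minimizing the horizontal deviations.

    Hypothetical helper for illustration; returns (a_hat, b_hat).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_yy = sum(y * y for y in ys)

    # Denominator of b_hat: sum(y^2) - (sum(y))^2 / N
    denom = sum_yy - sum_y ** 2 / n
    if denom == 0:
        raise ValueError("all y-values are identical; the slope is undefined")

    b_hat = (sum_xy - sum_x * sum_y / n) / denom
    a_hat = sum_x / n - b_hat * sum_y / n  # x_bar - b_hat * y_bar
    return a_hat, b_hat


def as_y_line(a_hat, b_hat):
    """Rewrite x = a + b*y as y = intercept + slope*x (requires b != 0)."""
    if b_hat == 0:
        raise ValueError("b_hat is zero; the line cannot be solved for y")
    return -a_hat / b_hat, 1.0 / b_hat
```

The second helper is only a convenience for drawing the fitted line on the same axes as a regression on $$Y$$. With scattered data the two lines generally differ, because they minimize deviations in different directions.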

The corresponding relations for determining the parameters for specific distributions (i.e., Weibull, exponential, etc.), are presented in the chapters covering that distribution.
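As an illustration of how a specific distribution plugs into these equations, the sketch below fits a two-parameter Weibull distribution to complete data by regression on $$Y$$. Several pieces are assumptions that this section does not state: the median ranks are approximated with Benard's formula $$(i-0.3)/(N+0.4)$$, the standard linearization $$\ln (-\ln (1-F))=\beta \ln t-\beta \ln \eta $$ is used, and the function names are hypothetical. The distribution chapters remain the reference for the relations actually used.

```python
import math


def benard_median_ranks(n):
    """Approximate median ranks for n ordered failures (assumed approximation)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def weibull_rry(times):
    """Two-parameter Weibull fit by rank regression on Y, complete data only.

    Hypothetical helper for illustration; returns (beta_hat, eta_hat).
    """
    t = sorted(times)
    n = len(t)
    if n < 2 or t[0] <= 0:
        raise ValueError("need at least two positive failure times")

    # Linearized coordinates: x = ln(t), y = ln(-ln(1 - F))
    xs = [math.log(ti) for ti in t]
    ys = [math.log(-math.log(1.0 - f)) for f in benard_median_ranks(n)]

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all failure times are identical")

    b_hat = s_xy / s_xx             # slope of y = a + b*x
    a_hat = y_bar - b_hat * x_bar   # intercept

    beta_hat = b_hat                     # shape parameter equals the slope
    eta_hat = math.exp(-a_hat / b_hat)   # from a = -beta * ln(eta)
    return beta_hat, eta_hat
```

The slope is written here in centered form; it is algebraically the same as the sum-based expression for $$\hat{b}$$. Censored or interval data would need adjusted ranks, which this sketch does not handle.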

The Correlation Coefficient
The correlation coefficient is a measure of how well the linear regression model fits the data and is usually denoted by $$\rho $$. In the case of life data analysis, it measures the strength of the linear relation (correlation) between the median ranks and the data. The population correlation coefficient is defined as follows:


 * $$\rho =\frac{{{\sigma }_{xy}}}{{{\sigma }_{x}}{{\sigma }_{y}}}$$

where $${{\sigma }_{xy}}=$$ covariance of $$x$$ and $$y$$, $${{\sigma }_{x}}=$$ standard deviation of $$x$$, and $${{\sigma }_{y}}=$$ standard deviation of $$y$$.

The estimator of $$\rho $$ is the sample correlation coefficient, $$\hat{\rho }$$, given by,


 * $$\hat{\rho }=\frac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}{{y}_{i}}-\tfrac{\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}}\underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}}}{N}}{\sqrt{\left( \underset{i=1}{\overset{N}{\mathop{\sum }}}\,x_{i}^{2}-\tfrac{{{\left( \underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{x}_{i}} \right)}^{2}}}{N} \right)\left( \underset{i=1}{\overset{N}{\mathop{\sum }}}\,y_{i}^{2}-\tfrac{{{\left( \underset{i=1}{\overset{N}{\mathop{\sum }}}\,{{y}_{i}} \right)}^{2}}}{N} \right)}}$$

The range of $$\hat \rho $$  is  $$-1\le \hat{\rho }\le 1.$$



The closer the value is to $$\pm 1$$, the better the linear fit. Note that +1 indicates a perfect fit (the paired values ( $${{x}_{i}},{{y}_{i}}$$ ) lie on a straight line) with a positive slope, while -1 indicates a perfect fit with a negative slope. A correlation coefficient value of zero would indicate that the data are randomly scattered and have no pattern or correlation in relation to the regression line model.
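The sample correlation coefficient can be computed from the same sums used for the regression estimates. The following is a minimal sketch; the function name `sample_correlation` is an assumption for this example.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient rho_hat of paired values.

    Hypothetical helper for illustration.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    sum_x = sum(xs)
    sum_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n

    if s_xx <= 0 or s_yy <= 0:
        raise ValueError("x or y has no spread; the correlation is undefined")

    return s_xy / math.sqrt(s_xx * s_yy)
```

Points on a rising straight line should give a value of +1 and points on a falling straight line a value of -1, up to rounding. The numerator is shared with both slope estimates, so $$\hat{\rho }$$ has the same sign as the fitted slope.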

Comments on the Least Squares Method
The least squares estimation method works well for functions that can be linearized. For these distributions, the calculations are relatively easy and straightforward, having closed-form solutions that readily yield an answer without resorting to numerical techniques or tables. Further, this technique provides a good measure of the goodness-of-fit of the chosen distribution in the correlation coefficient. Least squares is generally best used with data sets containing complete data, that is, data consisting only of single times-to-failure with no censored or interval data. Chapter 4 details the different data types, including complete, left censored, right censored (or suspended) and interval data.

See also
 * Least Squares/Rank Regression Equations
 * Discussion on using grouped data with regression methods at Grouped Data Parameter Estimation.