Thayer Watkins
Silicon Valley
& Tornado Alley

The Distribution of Linear Regression Coefficients

The purpose of this page is to illustrate that a regression analysis of a data set represents but a single sample from a distribution. Thus the numerical values from the regression are only single-sample estimates of the population parameters.

The simple regression model assumes that there is a set of values for an independent variable denoted by {xi: i=1,...,n}. The values of a dependent variable y are generated by the following scheme:

y_i = α + β·x_i + u_i

where the u_i's are independent random variables with a normal distribution of mean 0 and variance σ².
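As a sketch, this data-generating scheme can be simulated directly. The parameter values alpha = 2.0, beta = 0.5, sigma = 1.0 and the x values 1, ..., 10 are illustrative assumptions, not taken from the text.

```python
# Simulate one sample from y_i = alpha + beta*x_i + u_i, where
# u_i ~ N(0, sigma^2). All parameter values here are illustrative.
import random

random.seed(0)
alpha, beta, sigma = 2.0, 0.5, 1.0
xs = list(range(1, 11))                        # independent variable values
ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]
```

Each refresh of the random seed would yield a different sample of u values, and hence different y values, from the same population.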

A given set of values for the random variable u constitutes a sample. The parameters α and β are the population values. Regression analysis gives values a and b which are sample estimates of the population parameters. What is shown below are histograms of the sample estimates for 2000 repeated samples.

Let the sample averages for x and y be denoted as x̄ and ȳ. The sample estimates of the variances of x and y and of the covariance of x with y are given by the formulas

Var(x) = mean(x²) − x̄²
Var(y) = mean(y²) − ȳ²
Cov(x,y) = mean(xy) − x̄·ȳ

With these definitions the formulas for the regression coefficients can be reduced to

b = Cov(x,y)/Var(x)
a = ȳ − b·x̄
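These moment formulas translate directly into code. The helper name `regress` is hypothetical, and Var and Cov here divide by n, matching the definitions above.

```python
# Compute b = Cov(x,y)/Var(x) and a = mean(y) - b*mean(x) from raw data.
def regress(xs, ys):
    n = len(xs)
    mx = sum(xs) / n                                  # mean of x
    my = sum(ys) / n                                  # mean of y
    var_x = sum(x * x for x in xs) / n - mx * mx      # mean(x^2) - mean(x)^2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
    b = cov_xy / var_x
    a = my - b * mx
    return a, b

# With noiseless data the population line is recovered exactly:
a, b = regress([1, 2, 3, 4, 5], [3.0 + 2.0 * x for x in [1, 2, 3, 4, 5]])
# a = 3.0, b = 2.0
```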

When the expression y_i = α + β·x_i + u_i is substituted into the formula for the regression coefficient b, the result reduces to:

b = β + Cov(x,u)/Var(x)

and thus the expected value of b is seen to be the population value β, because the expected value of Cov(x,u) is zero. The deviation of b from its expected value is b−β, which is thus equal to Cov(x,u)/Var(x). The variance of b can then be shown to be equal to

Var(b) = σ²/(n·Var(x))
and hence
Std(b) = σ/(√n·Std(x))
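This formula for Std(b) can be checked by simulation. The values alpha = 1, beta = 2, sigma = 1 and x = 1, ..., 10 below are assumptions for illustration.

```python
# Monte Carlo check of Std(b) = sigma/(sqrt(n)*Std(x)) over 2000 samples.
import math
import random

random.seed(1)
alpha, beta, sigma = 1.0, 2.0, 1.0
xs = list(range(1, 11))
n = len(xs)
mx = sum(xs) / n
var_x = sum(x * x for x in xs) / n - mx * mx          # population-style Var(x)

bs = []
for _ in range(2000):
    ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]
    my = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
    bs.append(cov_xy / var_x)

mean_b = sum(bs) / len(bs)
std_b = math.sqrt(sum((v - mean_b) ** 2 for v in bs) / len(bs))
theory = sigma / (math.sqrt(n) * math.sqrt(var_x))    # about 0.110 here
```

Over 2000 samples, mean_b should land close to β and std_b close to the theoretical value.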

It can be shown that

Var(a) = mean(x²)·Var(b)
and thus
Std(a) = σ·Rmsq(x)/(√n·Std(x))
where Rmsq(x) = root mean square of x = (mean(x²))^(1/2)
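The formula for Std(a) admits the same kind of simulation check; again alpha = 1, beta = 2, sigma = 1 and x = 1, ..., 10 are illustrative assumptions.

```python
# Monte Carlo check of Std(a) = sigma*Rmsq(x)/(sqrt(n)*Std(x)).
import math
import random

random.seed(2)
alpha, beta, sigma = 1.0, 2.0, 1.0
xs = list(range(1, 11))
n = len(xs)
mx = sum(xs) / n
msq_x = sum(x * x for x in xs) / n                    # mean of x^2
var_x = msq_x - mx * mx

a_vals = []
for _ in range(2000):
    ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]
    my = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
    b = cov_xy / var_x
    a_vals.append(my - b * mx)

mean_a = sum(a_vals) / len(a_vals)
std_a = math.sqrt(sum((v - mean_a) ** 2 for v in a_vals) / len(a_vals))
theory = sigma * math.sqrt(msq_x) / (math.sqrt(n) * math.sqrt(var_x))  # about 0.683
```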

Below are illustrated the histograms of the sample regression coefficients for 2000 samples.

Whenever the screen display is refreshed a new collection of 2000 samples is generated.

The Joint Distribution of the Sample Regression Coefficients

The sample estimate of slope coefficient is not independent of the estimate of the intercept coefficient. This may be visually observed by looking at the joint distribution. The frequencies of the combinations of a and b are coded as variation in color. In the display below the highest frequency combinations are displayed as bright red. The zero frequency combinations are displayed as black.

The visual display above demonstrates the correlation of the estimates of a and b. An analysis indicates that the correlation coefficient depends entirely on the distribution of the independent variable values. In particular, the correlation coefficient is equal to the negative of the ratio of the mean value of x to the root mean square value of x. For the above case the independent variable has the values {1,2,3,4,5,6,7,8,9,10} and thus the mean value is 5.5. The mean square value is 38.5 and its square root is 6.205. The ratio is 5.5/6.205 = 0.886, so the correlation coefficient is −0.886.
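This claimed correlation can also be verified numerically. The simulation below uses the x values from the text; the values alpha = 1, beta = 2, sigma = 1 are illustrative assumptions.

```python
# Monte Carlo check that corr(a, b) = -mean(x)/Rmsq(x) for x = 1..10.
import math
import random

random.seed(3)
alpha, beta, sigma = 1.0, 2.0, 1.0
xs = list(range(1, 11))
n = len(xs)
mx = sum(xs) / n
msq_x = sum(x * x for x in xs) / n
var_x = msq_x - mx * mx

a_vals, b_vals = [], []
for _ in range(2000):
    ys = [alpha + beta * x + random.gauss(0.0, sigma) for x in xs]
    my = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
    b = cov_xy / var_x
    a_vals.append(my - b * mx)
    b_vals.append(b)

ma = sum(a_vals) / len(a_vals)
mb = sum(b_vals) / len(b_vals)
cov_ab = sum((a - ma) * (b - mb) for a, b in zip(a_vals, b_vals)) / len(a_vals)
sa = math.sqrt(sum((a - ma) ** 2 for a in a_vals) / len(a_vals))
sb = math.sqrt(sum((b - mb) ** 2 for b in b_vals) / len(b_vals))
corr_ab = cov_ab / (sa * sb)
theory = -mx / math.sqrt(msq_x)                       # -5.5/6.205, about -0.886
```

The sample correlation corr_ab should come out negative and close to −0.886.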
