appletmagic.com Thayer Watkins Silicon Valley & Tornado Alley USA 


The purpose of this page is to illustrate that a regression analysis of a data set represents but a single sample from a distribution. Thus the numerical values obtained from the regression are only single-sample estimates of the population parameters.
The simple regression model assumes that there is a set of values for an independent variable denoted by {x_{i}: i=1,...,n}. The values of a dependent variable y are generated by the following scheme:
y_{i} = α + βx_{i} + u_{i}
where the u_{i}'s are independent random variables with a normal distribution of mean 0 and variance σ^{2}.
A given set of values for the random variable u constitutes a sample. The parameters α and β are the population values. Regression analysis gives values a and b which are sample estimates of the population parameters. What is shown below is the histogram of the sample estimates for 2000 repeated samples.
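Let the sample averages for x and y be denoted as x̄ and ȳ. The sample estimates for the variances of x and y and the covariance of y with x are given by the formulas
Var(x) = (1/n)Σ_{i}(x_{i} − x̄)^{2}
Var(y) = (1/n)Σ_{i}(y_{i} − ȳ)^{2}
Cov(y,x) = (1/n)Σ_{i}(y_{i} − ȳ)(x_{i} − x̄)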
With these definitions the formulas for the regression coefficients can be reduced to
b = Cov(y,x)/Var(x)
a = ȳ − b·x̄
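As a minimal sketch of the usual least-squares formulas, with the slope as the ratio of the covariance of y with x to the variance of x and the intercept as the mean of y minus the slope times the mean of x, the coefficients can be computed as follows. The function names and the data values are illustrative assumptions, not taken from the page.

```python
def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample variance of x and covariance of y with x (divisor n in both,
    # so the divisor cancels in the ratio below).
    var_x = mean([(x - x_bar) ** 2 for x in xs])
    cov_yx = mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])
    b = cov_yx / var_x       # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b


# Illustrative data: a noise-free line with intercept 2.0 and slope 0.5.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.0 + 0.5 * x for x in xs]
a, b = fit_line(xs, ys)
print(a, b)
```

With no random term in the data the estimates recover the intercept and slope of the generating line up to rounding error; with a random term added they would differ from sample to sample.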
When the expression y_{i} = α + βx_{i} + u_{i} is substituted into the formula for the regression coefficient b the result reduces to:
b = β + Cov(x,u)/Var(x)
and thus the expected value of b is seen to be the population value β because the expected value of Cov(x,u) is zero. The deviation of b from its expected value is b − β, which is thus equal to Cov(x,u)/Var(x). The variance of b can then be shown to be equal to
Var(b) = σ^{2}/Σ_{i}(x_{i} − x̄)^{2}
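The claims about the expected value and variance of b can be checked with a small simulation. The sketch below compares the average and variance of 2000 simulated slope estimates with β and with the standard result σ^{2}/Σ(x_{i} − x̄)^{2}. The population values ALPHA, BETA and SIGMA and the seed are assumptions chosen for illustration; the page does not state which values its display uses.

```python
import random

# Assumed population values (not given in the text).
ALPHA, BETA, SIGMA = 1.0, 2.0, 3.0
XS = list(range(1, 11))      # independent variable values 1..10
rng = random.Random(0)       # fixed seed so the run is repeatable


def slope(xs, ys):
    """Least-squares slope: sum of cross-deviations over sum of squared x-deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def draw_slope():
    """Generate one sample y_i = ALPHA + BETA*x_i + u_i and return its slope estimate."""
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in XS]
    return slope(XS, ys)


slopes = [draw_slope() for _ in range(2000)]
mean_b = sum(slopes) / len(slopes)
var_b = sum((b - mean_b) ** 2 for b in slopes) / len(slopes)

x_bar = sum(XS) / len(XS)
theory_var_b = SIGMA ** 2 / sum((x - x_bar) ** 2 for x in XS)

print("mean of b:", mean_b, "population slope:", BETA)
print("variance of b:", var_b, "theoretical:", theory_var_b)
```

The simulated mean and variance should lie close to the theoretical values but will not match them exactly, since 2000 samples is itself a finite collection.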
It can be shown that
Below are illustrated the histograms of the sample regression coefficients for 2000 samples.
Whenever the screen display is refreshed a new collection of 2000 samples is generated.
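A rough text-mode sketch of such a display follows. It generates 2000 samples, estimates a and b for each, and prints a histogram of each coefficient. The population values, the number of bins and the helper names are assumptions for illustration. The random generator is left unseeded so that each run, like each refresh of the display, produces a new collection of samples.

```python
import random

# Assumed population values and x values (illustrative only).
ALPHA, BETA, SIGMA = 1.0, 2.0, 3.0
XS = list(range(1, 11))
rng = random.Random()   # unseeded: a new collection of samples on every run


def fit_line(xs, ys):
    """Return the least-squares estimates (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def histogram(values, bins):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[k] += 1
    return lo, width, counts


def show(label, values, bins=15):
    """Print a text histogram and return the bin counts."""
    lo, width, counts = histogram(values, bins)
    print(label)
    for k, c in enumerate(counts):
        print(f"{lo + k * width:8.3f} | {'#' * (c // 10)} {c}")
    return counts


estimates = []
for _ in range(2000):
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in XS]
    estimates.append(fit_line(XS, ys))

intercept_counts = show("intercept a", [a for a, _ in estimates])
slope_counts = show("slope b", [b for _, b in estimates])
```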
The sample estimate of the slope coefficient is not independent of the estimate of the intercept coefficient. This may be visually observed by looking at the joint distribution. The frequencies of the combinations of a and b are coded as variations in color. In the display below the highest frequency combinations are displayed as bright red. The zero frequency combinations are displayed as black.
The visual display above demonstrates the correlation of the estimates of a and b. An analysis indicates that the correlation coefficient is a function entirely of the distribution of the independent variable values. In particular, the correlation coefficient is equal to the negative of the ratio of the mean value of x to the root mean square value of x. For the above case the independent variable has the values {1,2,3,4,5,6,7,8,9,10} and thus the mean value is 5.5. The mean square value is 38.5 and its square root is 6.205. The ratio is 5.5/6.205=0.886 so the correlation coefficient is then −0.886.
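The arithmetic above, and the claim that the correlation depends only on the x values, can be checked with the sketch below. It computes the negative of the ratio of the mean of x to the root mean square of x, then compares it with the correlation of a and b measured across 2000 simulated samples. The population values ALPHA, BETA and SIGMA and the seed are assumptions for illustration.

```python
import math
import random

XS = list(range(1, 11))   # the x values 1..10 used in the text
n = len(XS)

mean_x = sum(XS) / n                              # 5.5
rms_x = math.sqrt(sum(x * x for x in XS) / n)     # sqrt(38.5)
theory_corr = -mean_x / rms_x                     # about -0.886

# Assumed population values (not given in the text).
ALPHA, BETA, SIGMA = 1.0, 2.0, 3.0
rng = random.Random(1)    # fixed seed so the run is repeatable


def fit_line(xs, ys):
    """Return the least-squares estimates (a, b)."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def correlation(us, vs):
    """Pearson correlation coefficient of two equal-length lists."""
    u_bar, v_bar = sum(us) / len(us), sum(vs) / len(vs)
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return suv / math.sqrt(suu * svv)


intercepts, slopes = [], []
for _ in range(2000):
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in XS]
    a, b = fit_line(XS, ys)
    intercepts.append(a)
    slopes.append(b)

sim_corr = correlation(intercepts, slopes)
print("theoretical correlation:", round(theory_corr, 3))
print("simulated correlation:  ", round(sim_corr, 3))
```

Changing ALPHA, BETA or SIGMA should leave the simulated correlation near the same value, whereas changing the x values shifts it.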