Understanding linear regression

Scenario: A sample of 30, two key variables: Health expenditure and Household Income

Question: Does health care expenditure rise with household income?

Model : E = a + bY

E Health care expenditure
a Intercept, or the money spent when income is zero
b the slope of the (imagined) straight line relating the two variables
Y income

The task now is to find the line that best describes the relationship between the two variables.

Method to do the task: Ordinary least squares, or OLS. It finds the best line by minimizing the sum of the squared deviations.

What is 'squared deviations': A deviation is the vertical distance between an actual observation and the imagined 'best' line. Each deviation is squared, so that positive and negative deviations do not cancel each other out.

Why 'minimizing': The best line is the line that goes through a set (a cloud!) of observations such that the sum of the squared deviations is as small as possible. No other straight line gives a smaller sum.

In other words: The fitted/best line contains a series of dots, one for each observation. Taken together over all observations, the squared distances between each dot and its associated observation add up to the smallest possible total.
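The idea can be sketched in a few lines of Python. This is only an illustration: the five income/expenditure pairs are made up (they are not the sample of 30), and the helper names fit_ols and sum_squared_deviations are hypothetical. The code uses the standard closed-form OLS solution for one independent variable, then checks that nudging the line makes the sum of squared deviations larger.

```python
def fit_ols(incomes, expenditures):
    """Return (a, b) for the line E = a + b*Y that minimizes squared deviations."""
    n = len(incomes)
    mean_y = sum(incomes) / n
    mean_e = sum(expenditures) / n
    # Slope: co-variation of Y and E divided by the variation of Y.
    s_ye = sum((y - mean_y) * (e - mean_e) for y, e in zip(incomes, expenditures))
    s_yy = sum((y - mean_y) ** 2 for y in incomes)
    b = s_ye / s_yy
    # Intercept: the fitted line passes through the point of means.
    a = mean_e - b * mean_y
    return a, b


def sum_squared_deviations(a, b, incomes, expenditures):
    """Sum of squared vertical distances between observations and the line."""
    return sum((e - (a + b * y)) ** 2 for y, e in zip(incomes, expenditures))


# Made-up data for illustration only.
incomes = [10000, 20000, 30000, 40000, 50000]
expenditures = [4100, 5900, 8200, 9800, 12000]

a, b = fit_ols(incomes, expenditures)
best = sum_squared_deviations(a, b, incomes, expenditures)

# Any other line (here: a shifted intercept or a tilted slope) does worse.
shifted = sum_squared_deviations(a + 100, b, incomes, expenditures)
tilted = sum_squared_deviations(a, b + 0.01, incomes, expenditures)

print(f"a = {a:.2f}, b = {b:.4f}")
print(f"best = {best:.0f}, shifted = {shifted:.0f}, tilted = {tilted:.0f}")
```

Statistical packages do the same calculation internally and also report the extra statistics discussed further on.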

Let's say, after running it in SPSS, now we have this model:

E = 2,000 + 0.2Y
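To get a feel for what this equation says, it can be evaluated at a few incomes. The sketch below is only an illustration: the function name predicted_expenditure and the chosen income values are hypothetical, while the numbers 2,000 and 0.2 come from the model above.

```python
def predicted_expenditure(income, a=2000.0, b=0.2):
    """Health care expenditure predicted by E = a + b*Y."""
    return a + b * income


# Illustrative incomes, chosen arbitrarily.
for income in (0, 10000, 50000):
    print(f"Y = {income:>6}: E = {predicted_expenditure(income):.2f}")

# Effect of one extra dollar of income on predicted expenditure.
one_dollar_effect = predicted_expenditure(10001) - predicted_expenditure(10000)
print(f"One extra dollar of income adds about {one_dollar_effect:.2f} dollars")
```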

How to interpret it:

"2000" tells us the level of health care expenditure when income is zero. "0.2" says if income increases by one dollar, health care expenditure will increase by 20 cents (0.2*1 dollar).

But: We do not know yet whether the line is really a good one, or whether the relationship just happened by chance in this sample. How can we tell? Here we have to use the coefficient of determination, R square, and the t-statistic, t.

Understanding R square and t:

R square ranges between 0 and 1. The closer it is to 1, the larger the share of the variation in the dependent variable that the line explains.
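R square can be computed by hand as 1 minus the ratio of the unexplained variation (the sum of squared deviations from the line) to the total variation (the sum of squared deviations from the mean). The sketch below assumes a single independent variable and uses made-up data, not the sample of 30. The function names are hypothetical.

```python
def fit_ols(incomes, expenditures):
    """Closed-form OLS estimates (a, b) for E = a + b*Y."""
    n = len(incomes)
    mean_y = sum(incomes) / n
    mean_e = sum(expenditures) / n
    s_ye = sum((y - mean_y) * (e - mean_e) for y, e in zip(incomes, expenditures))
    s_yy = sum((y - mean_y) ** 2 for y in incomes)
    b = s_ye / s_yy
    return mean_e - b * mean_y, b


def r_squared(incomes, expenditures):
    """Share of the variation in expenditure explained by the fitted line."""
    a, b = fit_ols(incomes, expenditures)
    mean_e = sum(expenditures) / len(expenditures)
    # Unexplained variation: squared deviations from the fitted line.
    ss_residual = sum((e - (a + b * y)) ** 2 for y, e in zip(incomes, expenditures))
    # Total variation: squared deviations from the mean expenditure.
    ss_total = sum((e - mean_e) ** 2 for e in expenditures)
    return 1 - ss_residual / ss_total


# Made-up data for illustration only. The points lie almost on a line,
# so R square comes out close to 1.
incomes = [10000, 20000, 30000, 40000, 50000]
expenditures = [4100, 5900, 8200, 9800, 12000]

r2 = r_squared(incomes, expenditures)
print(f"R square = {r2:.4f}")
```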

t-statistic of 2 or more: the value of the estimated parameter is at least twice as large as its standard error. As a rule of thumb, we can be about 95 percent confident that the true parameter is not zero.

t-statistic of 3 or more: By the same rule of thumb, we can be about 99 percent confident that the true parameter is not zero.
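A t-statistic is simply the estimated parameter divided by its standard error. The sketch below uses the usual textbook standard-error formulas for a regression with one independent variable. The data are made up and the function name t_statistics is hypothetical.

```python
import math


def t_statistics(incomes, expenditures):
    """Return (t_a, t_b) for the OLS line E = a + b*Y."""
    n = len(incomes)
    mean_y = sum(incomes) / n
    mean_e = sum(expenditures) / n
    s_ye = sum((y - mean_y) * (e - mean_e) for y, e in zip(incomes, expenditures))
    s_yy = sum((y - mean_y) ** 2 for y in incomes)
    b = s_ye / s_yy
    a = mean_e - b * mean_y

    # Residual variance, with n - 2 degrees of freedom (two estimated parameters).
    ss_residual = sum((e - (a + b * y)) ** 2 for y, e in zip(incomes, expenditures))
    residual_variance = ss_residual / (n - 2)

    # Standard errors of the slope and of the intercept.
    se_b = math.sqrt(residual_variance / s_yy)
    se_a = math.sqrt(residual_variance * (1 / n + mean_y ** 2 / s_yy))

    return a / se_a, b / se_b


# Made-up data for illustration only.
incomes = [10000, 20000, 30000, 40000, 50000]
expenditures = [4100, 5900, 8200, 9800, 12000]

t_a, t_b = t_statistics(incomes, expenditures)
print(f"t for a = {t_a:.2f}, t for b = {t_b:.2f}")
```

Because the made-up points lie almost exactly on a line, both t-statistics here are far larger than anything one would normally see with real survey data.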

Now, let's say, SPSS or SAS or STATA gave us this table (t-statistics in parentheses):

E = 2,000 + 0.2 Y     R square = .47
   (2.52)   (3.40)     N = 30

How to interpret it:

For estimated parameter a (here 2,000), t = 2.52, which means we can be about 95 percent confident that a is not zero.

For estimated parameter b (here 0.2), t = 3.40, which means we can be about 99 percent confident that b is not zero.

R square = .47 is what? It means income explains about 47 percent of the variation in health care expenditure.
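The "2 or more" and "3 or more" rules of thumb can be checked against a t table. With N = 30 and two estimated parameters there are 28 degrees of freedom, for which the two-sided critical values are 2.048 (95 percent) and 2.763 (99 percent). The sketch below compares the reported t-statistics with those cutoffs. The function name is hypothetical.

```python
# Two-sided critical t values for 28 degrees of freedom
# (N = 30 observations minus 2 estimated parameters), from a standard t table.
CRITICAL_T_28_DF = {"95%": 2.048, "99%": 2.763}


def confidence_levels_passed(t_stat, critical=CRITICAL_T_28_DF):
    """List the confidence levels at which the parameter differs from zero."""
    return [level for level, cutoff in critical.items() if abs(t_stat) >= cutoff]


# t-statistics reported in the table above.
reported = {"a (intercept)": 2.52, "b (slope)": 3.40}

for name, t_stat in reported.items():
    print(name, t_stat, confidence_levels_passed(t_stat))
```

The intercept passes the 95 percent cutoff only, while the slope passes both, which matches the reading given above.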

If you now wonder what other factors explain the variation in health care expenditure, go and read about multiple regression. The rules are the same, but the model will have more independent variables.