Simple linear regression aims to find a linear relationship
between a response variable and a possible predictor variable by the
method of least squares.
A correlation coefficient is a number between -1 and 1 which
measures the degree to which two variables are linearly related. If
there is a perfect linear relationship with positive slope between the
two variables, we have a correlation coefficient of 1; if there is
positive correlation, whenever one variable has a high (low) value,
the other tends to have a high (low) value too. If there is a perfect
linear relationship with negative slope between the two variables, we
have a correlation coefficient of -1; if there is negative
correlation, whenever one variable has a high (low) value, the other
tends to have a low (high) value. A correlation coefficient of 0 means
that there is no linear relationship between the variables.
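As a rough illustration of the definition above, the following Python sketch computes the correlation coefficient from its usual formula, r = Sxy / sqrt(Sxx Syy). The function name pearson_r and the small data lists are made up for this illustration; they are not part of the examples in these notes.

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: two exact straight lines
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * v + 1 for v in x])     # positive slope
r_down = pearson_r(x, [10 - 3 * v for v in x])  # negative slope
print(r_up, r_down)
```

With an exact line of positive slope the result is 1, and with an exact line of negative slope it is -1, matching the two extreme cases described above.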
A regression equation allows us to express the relationship
between two (or more) variables algebraically. It indicates the
nature of the relationship between two (or more) variables. In
particular, it indicates the extent to which you can predict some
variables by knowing others, or the extent to which some are
associated with others.
A linear regression equation is usually written
Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable
e is the error term
The equation will specify the average magnitude of the expected
change in Y given a change in X.
The regression equation is often represented on a scatterplot by
a regression line.
The method of least squares is a criterion for fitting a
specified model to observed data. For example, it is the most
commonly used method of defining a straight line through a set of
points on a scatterplot.
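A minimal sketch of the least squares criterion for a straight line Y = a + bX, assuming the usual closed-form solution b = Sxy / Sxx and a = mean(Y) - b mean(X). The names fit_line and sse and the five data points are hypothetical, chosen only to show that the fitted line has a smaller sum of squared errors than nearby lines.

```python
def fit_line(x, y):
    """Return (a, b) minimising the sum of squared errors of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b


def sse(a, b, x, y):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, roughly following y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
print(a, b)
# Moving the intercept or the slope away from the fitted values increases the SSE
print(sse(a, b, x, y) < sse(a + 0.1, b, x, y))
print(sse(a, b, x, y) < sse(a, b - 0.1, x, y))
```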
The coefficient of determination, R² (the square of the multiple
correlation coefficient R), is a measure of the proportion of
variability in the response explained by, or due to, the regression
(linear relationship) in a sample of data. It is a number between zero
and one, and a value close to zero suggests a poor model.
A very high value of R² can arise even though the relationship between
the two variables is non-linear. The fit of a model should never
simply be judged from the R² value.
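To see why R² alone can mislead, the sketch below fits a straight line to data that are exactly quadratic. The data (y = x² for x = 1 to 10) are hypothetical and chosen only for this illustration; R² is computed as 1 - SSE/SST.

```python
# Hypothetical data: an exactly quadratic (non-linear) relationship
x = list(range(1, 11))
y = [v ** 2 for v in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares straight line y = a + b*x
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sse = sum(e ** 2 for e in residuals)       # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)  # total variation about the mean
r2 = 1 - sse / sst

print(round(r2, 3))
print([round(e, 1) for e in residuals])
```

Here R² is about 0.95 even though the straight line is the wrong model: the residuals are positive at both ends and negative in the middle, a systematic curve that a plot of the residuals would reveal but R² does not.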
Residual (or error) represents unexplained (or residual)
variation after fitting a regression model. It is the difference (or
left over) between the observed value of the variable and the value
suggested by the regression model.
A best regression model is sometimes developed in stages. A list of
several potential explanatory variables is available and this list is
repeatedly searched for variables which should be included in the
model. The best explanatory variable is used first, then the second
best, and so on. This procedure is known as stepwise regression.
Example on simple linear regression
Several college students applied their recently obtained statistical
knowledge to their hunt for the best deal on an apartment. One of
several determinants of monthly rent was apartment size. The students
collected a sample of 20 apartments with the following monthly rental
prices and square footage. Conduct a simple linear regression
analysis.
• The coefficient of correlation, r = 0.786, confirms the positive
linear relationship that we observed in the scatter plot.
• The next step is to perform the regression analysis.
• Click Stat > Regression > Regression
• Select Rent in Response
• Select Footage in Predictors
• Click Results and choose the second Display
• Click OK
Regression Analysis: Rent versus Footage
The regression equation is
Rent = 184 + 0.314 Footage
Predictor      Coef  SE Coef     T      P
Constant     183.70    51.12  3.59  0.002
Footage     0.31364  0.05823  5.39  0.000
S = 17.60   R-Sq = 61.7%   R-Sq(adj) = 59.6%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       1   8986.6  8986.6  29.01  0.000
Residual Error  18   5575.1   309.7
Total           19  14561.7
• The first part of the Minitab output gives the regression
equation:
The regression equation is
Rent = 184 + 0.314 Footage
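The fitted equation can be used to predict the rent for a given size. The sketch below uses the unrounded coefficients from the Minitab coefficient table (183.70 and 0.31364); the function name predict_rent and the 1000 square foot apartment are hypothetical inputs chosen for illustration.

```python
INTERCEPT = 183.70  # Constant, from the Minitab coefficient table
SLOPE = 0.31364     # Footage coefficient, from the same table


def predict_rent(footage):
    """Predicted monthly rent for an apartment of the given square footage."""
    return INTERCEPT + SLOPE * footage


# Hypothetical apartment of 1000 square feet
print(round(predict_rent(1000), 2))
# Expected change in rent for 100 additional square feet
print(round(predict_rent(1100) - predict_rent(1000), 2))
```

The slope says that each additional square foot is associated with an average increase of about 0.31 in monthly rent. Predictions are only trustworthy for sizes within the range of the 20 sampled apartments.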
S = 17.60   R-Sq = 61.7%   R-Sq(adj) = 59.6%
• About 62% of the total variation of the monthly rents about their
mean (the total sum of squares) can be explained by the regression
equation.
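The summary figures in the Minitab output can be reproduced from the analysis of variance table. The sketch below uses only the sums of squares and degrees of freedom printed in the output above; the variable names are my own.

```python
from math import sqrt

# Values copied from the Analysis of Variance table
ss_regression = 8986.6
ss_error = 5575.1
ss_total = 14561.7
df_regression = 1
df_error = 18

r_sq = ss_regression / ss_total       # proportion of variation explained
ms_error = ss_error / df_error        # mean square error
f_stat = (ss_regression / df_regression) / ms_error
s = sqrt(ms_error)                    # standard error of the regression
df_total = df_regression + df_error
r_sq_adj = 1 - (1 - r_sq) * df_total / df_error
r = sqrt(r_sq)                        # slope is positive, so r is positive

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"S         = {s:.2f}")
print(f"F         = {f_stat:.2f}")
print(f"r         = {r:.3f}")
```

These agree with the reported R-Sq = 61.7%, R-Sq(adj) = 59.6%, S = 17.60, F = 29.01 and the correlation r = 0.786. Taking the positive square root for r is valid here only because this is a simple regression with a positive slope.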
Example on multiple linear regression
• When a house needs to be appraised for a mortgage or property
taxes, the appraiser typically approaches the problem by selecting
four to six comparable homes in the area which have sold recently.
• Then the price is adjusted up or down to reflect differences
between the comparable homes.
• Suppose a homeowner in a residential area is interested in
predicting the value of her home, and has gathered the following data
on homes for sale in her area.
• The objective is to develop a useful regression model.
The Data
Scatter Plot
• To develop the model, we construct scatter plots to study the
relationship between the response variable (y) and each independent
variable, and calculate the correlation between all pairs of
variables.
• Consider the following model,
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε
with y = price, x1 = bedrooms, x2 = area and x3 = age.
• Click Graph > Matrix Plot
• Select Price, Bedrooms, SqFtArea and Age in Graph Variables
• Click Options, choose Upper right Matrix Display
• Click OK
• What can we say about the scatter plots?
• The first row of the matrix shows the relationships between the
response variable and the independent variables.
• The first scatter plot indicates a positive relationship between
price and number of bedrooms.
• The second scatter plot shows a positive relationship between price
and area of the home.
• The last scatter plot shows a negative relationship between price
and age of the home.
• The other three scatter plots show the relationships between pairs
of independent variables.
• The next step is to find the correlation between the response and
the independent variables.
Correlation
• Click Stat > Basic Statistics > Correlation
• Select Price, Bedrooms, SqFtArea and Age in Variables
• Click OK
• The evidence found in the scatter plots is supported by the
correlation values in the output.
• The highest correlation is between sale price and area of the home.
• The moderately high correlation between number of bedrooms and area
indicates possible multicollinearity.
Regression Equation
• Click Stat > Regression > Regression
• Select Price in Response
• Select Bedrooms, SqFtArea and Age in Predictors
• Click Results and choose the second Display
• Click OK
Least Squares Regression Equation
• The following model,
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε
with y = price, x1 = bedrooms, x2 = area and x3 = age, is fit to the
data.
• The first part of the Minitab output gives the regression equation:
The regression equation is
Price = 54686 + 3232 Bedrooms + 33.4 SqFtArea - 672 Age
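As an illustration of how the fitted equation is used, the sketch below evaluates it for one home. The coefficients are those printed in the regression equation; the function name predict_price and the example home (3 bedrooms, 2000 square feet, 10 years old) are hypothetical and do not come from the data set.

```python
def predict_price(bedrooms, sq_ft_area, age):
    """Predicted sale price from the fitted three-predictor equation."""
    return 54686 + 3232 * bedrooms + 33.4 * sq_ft_area - 672 * age


# Hypothetical home: 3 bedrooms, 2000 square feet, 10 years old
price = predict_price(3, 2000, 10)
print(round(price))

# Holding bedrooms and area fixed, one more year of age lowers the
# predicted price by the Age coefficient
print(round(predict_price(3, 2000, 11) - price))
```

Each coefficient is the expected change in price for a one-unit change in that predictor with the other two held fixed.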
Coefficient of Determination
• The coefficient of determination is R² = 68.4%.
• This means that approximately 68% of the total variation in home
prices is explained by the regression model.
• The remaining 32% is not explained by the regression model.
Testing The Usefulness Of The Model
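• Some hypothesis testing must be performed to determine whether the
model is useful in predicting sale price.
• To test whether the overall model is useful, the null and
alternative hypotheses are:
H0: β1 = β2 = β3 = 0
Ha: at least one of β1, β2, β3 is not zero
where β1, β2 and β3 are the coefficients of bedrooms, area and age.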
Hypothesis Testing
• The test statistic F = 23.07 and the p-value = 0.000 are given in
the analysis of variance table.
• Since the p-value = 0.000, we would reject H0 at any conventional
significance level.
• We have strong evidence to conclude that the model is useful for
predicting the sale price of residential property.
• The next step is to test the usefulness of the predictors.
Usefulness Of The Predictors
• The least useful predictor is the one with the highest p-value,
which in this example is the number of bedrooms.
• From the regression coefficient table,
Predictor     Coef  SE Coef      T      P
Constant     54686    13821   3.96  0.000
Bedrooms      3232     5151   0.63  0.535
SqFtArea    33.419    5.474   6.11  0.000
Age         -672.2    258.9  -2.60  0.014
• The p-value for Bedrooms is 0.535, so we do not reject the null
hypothesis that its coefficient is zero. There is not sufficient
evidence that the number of bedrooms is a useful predictor.
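Each T value in the coefficient table is the coefficient divided by its standard error. The short check below recomputes the ratios from the Coef and SE Coef columns above; the dictionary layout is my own.

```python
# (Coef, SE Coef, reported T) copied from the coefficient table
table = {
    "Constant": (54686, 13821, 3.96),
    "Bedrooms": (3232, 5151, 0.63),
    "SqFtArea": (33.419, 5.474, 6.11),
    "Age": (-672.2, 258.9, -2.60),
}

t_values = {}
for name, (coef, se_coef, reported_t) in table.items():
    t_values[name] = coef / se_coef  # t statistic for H0: coefficient = 0
    print(f"{name:9s} t = {t_values[name]:6.2f}  (reported {reported_t:.2f})")

# The predictor with the smallest |t| has the largest p-value
predictors = [name for name in table if name != "Constant"]
least_useful = min(predictors, key=lambda name: abs(t_values[name]))
print(least_useful)
```

A small t ratio means the estimate is small relative to its uncertainty, which is why Bedrooms has the largest p-value of the three predictors.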
New Model
• Since we do not have enough evidence that the number of bedrooms is
a useful predictor, try a new model that excludes the number of
bedrooms.
• Run the regression analysis again using area and age as predictors.
• Click Stat > Regression > Regression
• Select Price in Response
• Select SqFtArea and Age in Predictors
• Click Results and choose the second Display
• Click OK
Residual Analysis
• Residual analysis is used to determine whether the regression model
is misspecified and whether there are unusual observations or
outliers.
• The model assumes that the errors are independent and that the
probability distribution of the errors is normal with zero mean and
constant variance.
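A rough sketch of the kind of check residual analysis performs, for a one-predictor model. It computes fits and residuals and flags observations whose residual exceeds two standard errors. Dividing by S is a simplification I am assuming here: statistical packages usually also adjust each residual for leverage. The function name residual_report, the cutoff of 2 and the data are all hypothetical.

```python
from math import sqrt


def residual_report(x, y, cutoff=2.0):
    """Fit y = a + b*x and return (fits, residuals, indices of unusual points)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    fits = [a + b * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fits)]
    # Standard error of the regression, with n - 2 degrees of freedom
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    unusual = [i for i, e in enumerate(residuals) if abs(e / s) > cutoff]
    return fits, residuals, unusual


# Hypothetical data: an exact line y = 2x + 1 with one point shifted up by 10
x = list(range(1, 11))
y = [2 * v + 1 for v in x]
y[5] += 10

fits, residuals, unusual = residual_report(x, y)
print(unusual)
print(round(sum(residuals), 6))
```

The shifted observation (index 5) is the only one flagged. The residuals always sum to zero for a least squares fit with an intercept, so a zero mean by itself does not confirm the model; the pattern of the residuals against the fits is what matters.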
Residual Plots
• Click Stat > Regression > Regression
• Select Price in Response and SqFtArea and Age in Predictors
• Click Storage, choose Residuals and Fits
• Click Results and choose the second Display
• Click OK
• Click Stat > Regression > Residual Plots
• Select RESI1 in Residuals
• Select FITS1 in Fits
• Enter a Title
• Click OK
Residual Analysis
• Normally distributed residuals plot as a straight line on the normal
plot and as a mound-shaped histogram.
• Both plots indicate a normal distribution.
• The I Chart and the Residuals vs. Fits plot show a random pattern in
the residuals.
• The successful tests given with the output indicate that the errors
are independent and there are no outliers or unusual residuals.
Stepwise Regression
• It is a method of selecting, from a set of independent variables,
those that produce the best equation.
• At each step it selects the independent variable, not yet in the
model, that has the highest partial correlation with the response.
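The selection idea can be sketched as a simple forward procedure: at each step, add the candidate that raises R² the most, which is the candidate with the largest partial correlation (in absolute value) given the variables already chosen. This is a simplified sketch, not Minitab's procedure: it only adds variables, and it stops on a hypothetical R² gain threshold (min_gain) rather than a significance test. It assumes NumPy is available; the data are simulated.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of a least squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)


def forward_select(X, y, names, min_gain=0.01):
    """Add predictors one at a time while R^2 improves by at least min_gain."""
    chosen, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # R^2 of the current model plus each remaining candidate
        scores = {j: r_squared(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_r2 < min_gain:
            break
        chosen.append(j_best)
        remaining.remove(j_best)
        best_r2 = scores[j_best]
    return [names[j] for j in chosen], best_r2


# Simulated data: y depends on x0 and x2 but not on x1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)

selected, r2 = forward_select(X, y, ["x0", "x1", "x2"])
print(selected, round(r2, 3))
```

In this simulated case the procedure picks x0 first, then x2, and stops before adding the unrelated x1.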
• Consider the data given previously.
• A homeowner in a residential area is interested in predicting the
value of her home and has gathered data on homes for sale in her area.
• Use stepwise regression to identify significant variables.
• Click Stat > Regression > Stepwise
• Select Price in Response
• Select Bedrooms, SqFtArea and Age in Predictors
• Click OK
• At Step 1, SqFtArea was selected as the most useful predictor. The
model at Step 1 is
Price = 49046 + 35.0 SqFtArea
with R² = 61.7%.
• At Step 2, Age was added as a useful predictor. The model at Step 2
is
Price = 60794 + 35.4 SqFtArea - 645 Age
with R² = 68.0%.
• Stepwise regression has selected the variables SqFtArea and Age, the
same variables which we selected in the regression analysis process in
the previous example.