Statistics W1111
Columbia University
Instructions for lab 4
Part 1a:
Does spending more money on education have an impact on students' learning?
Some evidence suggests a positive impact, whereas other
evidence suggests hardly any impact at all. In this lab, we'll
look at a particular aspect of this question, namely the association
between statewide expenditures on education and SAT scores.
Open the data set satjse by clicking on the link. The file contains data from 1997 on states in the U.S. for the following variables:
State: State name.
Expenditure: Per pupil expenditure in state, in thousands of dollars.
Student/Faculty Ratio: Number of students per faculty member in state.
Salary: Average salary for teachers in state, in thousands of dollars.
Percent taking: Percentage of students in state taking the SAT, on a 0 to 100 scale (proportion times 100).
Verbal: Average score on verbal part of SAT for students who took it.
Math: Average score on math part of SAT for students who took it.
Total SAT Score: Average combined score on the SAT for students who took it.
Expenditure (100s): Per pupil expenditure in state, in hundreds of dollars.
These data were collected by Professor Lynn Guber of the University of Vermont.
Questions:
1) Describe the general trend in the relationship between expenditures and total SAT scores.
2) Describe the general trend in the relationship between percent taking and total SAT scores.
3) Describe the general trend in the relationship between percent taking and expenditures.
4) The political commentator George Will examined a similar
scatter plot of average combined SAT score against expenditures per
pupil. He concluded that, at the state level, spending more money
per pupil does not cause average SAT scores to increase (The Washington Post, September 12, 1993).
Explain why Will should not make this causal claim based on the scatter
plot. Explain how variables other than expenditures could explain
the pattern in the scatter plot. Use the information in the data
to make your arguments. You will lose credit if you fail to cite
evidence from the data.
Part 1b:
In Part 1a we saw that the relationship between expenditures and total score
was negative. But there was a third variable that was strongly
associated with expenditures and SAT scores, namely the percentage
taking the SAT. When we looked at scatter plots involving these
three variables, it appeared that the relationship between the
percentage taking the SAT and the scores was stronger than the
relationship between expenditures and scores, so that percent taking
the exam might explain the apparently negative relationship between
expenditures and scores.
Multiple regression is designed to parcel out the effects of these
variables. Let's run a multiple regression using the SAT data to
estimate the association of SAT scores and expenditures, controlling
for percent taking.
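Written out, the model we are about to fit looks roughly like the following. The subscript i (one state) and the symbols beta and epsilon are our own notation, not names from the data file.

```latex
\text{Total SAT}_i = \beta_0 + \beta_1\,\text{Expenditure}_i + \beta_2\,\text{Percent taking}_i + \varepsilon_i
```

Here \beta_1 is the quantity of interest: the expected difference in average total SAT score between two states that differ by one unit of expenditure (one thousand dollars per pupil) but have the same percent taking.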
Questions:
1) Fit a multiple regression of "Total SAT score" on
"Expenditures" and "Percent taking." To fit a multiple regression
in Stata, use the same command sequence as for a single-variable
regression (see the previous lab if you don't remember), except that now
you put all of your independent variables in the Independent Variables box
instead of only one.
a) What are the estimates of the intercept and the coefficients (slopes)? Write each estimate on your report.
b) What percentage of the variation in total SAT scores is explained by this regression?
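If you prefer typing to using the menus, the same fit can be requested from the Stata Command Window. This is only a sketch: total_sat and expenditure are assumed variable names, so check the names in your copy of the data first. (percent_taking is the name used in the command at the bottom of the page.)

```stata
* Sketch only: total_sat and expenditure are assumed variable names.
* "describe" lists the actual variable names in the data set.
describe

* Dependent variable first, then all independent variables.
regress total_sat expenditure percent_taking
```

In the output, the row labelled _cons is the intercept, and R-squared appears in the summary block above the coefficient table.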
2a) Let's examine plots of residuals versus each of the predictors
to make sure the model fits the data reasonably well. To do this
we need to tell Stata to store the residuals as a new variable.
In the Stata Command Window, type:
predict ehat,residuals
*Note: You can use the "predict" command for other things as well. For example, typing predict yhat adds a new variable containing the fitted values of y (the values on the regression line).*
Now you can look at scatterplots of the residuals
versus the predictors in the usual way we look at scatterplots of any
two variables. Nonrandom patterns in these plots (e.g., curves)
indicate the regression assumptions do not hold for these data. Describe what patterns (e.g., random or non-random) you see in the residual plots.
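As a rough sketch, the residual plots can also be drawn from the Command Window. The name expenditure is an assumption; ehat is the residual variable created by the predict command.

```stata
* Residuals against each predictor.
* yline(0) adds a horizontal reference line at zero.
* expenditure is an assumed variable name.
scatter ehat expenditure, yline(0)
scatter ehat percent_taking, yline(0)
```

With the reference line drawn, a well-behaved plot shows points scattered evenly above and below zero across the whole range of the predictor.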
2b) Based on the patterns you described in the residual plots, do you
think the regression assumptions hold for this model? A simple sentence
saying you think they hold or you think they do not hold will suffice.
Clearly, the plots don't look random for Percent taking. To deal with the curved pattern, let's try using the natural logarithm of Percent taking instead of Percent taking untransformed.
We do this because the graph of y = log(x) is a curve, as is the
relationship between "Total SAT score" and "Percent taking."
Create a new variable for "Log(Percent taking)." Try it by
yourself first (hint: Use the Stata Command Window). If you have
trouble, the correct command is at the bottom of the page.
3) Fit the multiple regression of "Total SAT score" on
"Expenditures" and "Log(Percent taking)." Perform model checks
like those in Question 2. Based on the plots of residuals, do you
think the regression assumptions hold for this model? A simple
sentence saying that you think they hold or do not hold will suffice.
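One possible command sequence for this question is sketched here. As before, total_sat and expenditure are assumed names; log_pct_take is the name given to the new variable by the command at the bottom of the page, so substitute your own name if you chose a different one.

```stata
* Sketch only: total_sat and expenditure are assumed variable names.
regress total_sat expenditure log_pct_take

* ehat already exists from Question 2, so the new residuals
* need a different name (or drop ehat first).
predict ehat_log, residuals

scatter ehat_log expenditure, yline(0)
scatter ehat_log log_pct_take, yline(0)
```

Note that predict always uses the most recently fitted regression, so run it right after the regress command whose residuals you want.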
4) Now that we're convinced our assumptions are satisfied, we can start thinking about our conclusions.
a) What are the estimates of the intercept and the coefficients (slopes)? Write each estimate on your report.
b) Write down 95% confidence intervals for the coefficients. How do you interpret these intervals?
c) What percentage of the variation in total SAT scores is explained by this regression?
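For reference, a 95% confidence interval for a coefficient has the usual form shown here. The notation is ours: n is the number of observations, and the n - 3 degrees of freedom come from estimating three coefficients (the intercept and two slopes).

```latex
\hat{\beta}_j \pm t^{*}_{n-3}\,\mathrm{SE}(\hat{\beta}_j)
```

Here t^{*}_{n-3} is the 97.5th percentile of the t distribution with n - 3 degrees of freedom. Stata's regression output reports these intervals in the last two columns of the coefficient table, so you do not need to compute them by hand.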
5) Based on the regression coefficients from 4a, do expenditures
appear to be positively or negatively associated with total SAT scores,
controlling for the (logarithm of) percent taking the test? A
short answer will suffice.
6a) Would you be willing to claim that raising expenditures
causes SAT scores to increase? Explain in at most two sentences
and use .
6b) Would you be willing to use this regression to make
predictions about SAT scores at the school level? Explain in at
most two sentences.
Part 2:
Extreme Observations and Influential Points
Follow the link http://www.stat.sc.edu/~west/javahtml/Regression.html
Watch how the regression equation changes as you add more points.
Try adding (i) extreme observations that drastically change
your regression line and (ii) a few extreme
observations that don't change the regression equation all that much.
You don't have to turn anything in for this part of the lab.
Part 3:
Draw your own regression line:
Follow the link http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
Try your best using 5 sets of data (i.e., play the game 5 times). Record the MSE from your line and the true MSE each time.
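As a reminder, the mean squared error of a line is usually defined as the average squared vertical distance between the points and the line. The formula here is the common textbook version; the applet may use a slightly different divisor.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```

Here \hat{y}_i is the height of the line at x_i. The least squares regression line has the smallest MSE of any line, so your line's MSE can match it but never beat it.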
Note: This lab draws heavily from assignments used by Professor
Jerry Reiter, Department of Statistical Science, Duke University.
His contribution is greatly appreciated.
*Here's the command:*
gen log_pct_take=log(percent_taking)