Statistics W1111
Columbia University
 

Instructions for lab 4


Part 1a:

Does spending more money on education have an impact on students' learning?    Some evidence suggests a positive impact, whereas other evidence suggests hardly any impact at all.  In this lab, we'll look at a particular aspect of this question, namely the association between statewide expenditures on education and SAT scores.

Open the data set satjse by clicking on the link.   The file contains data from 1997 on states in the U.S. for the following variables:

State:                                  state name.
Expenditure:                      Per pupil expenditure in state in thousands of dollars.
Student/Faculty Ratio:      Number of pupils per faculty member in state.
Salary:                               Average salary for teachers in state in thousands of dollars. 
Percent taking:                  Percentage of students taking the SAT, expressed as a percent (proportion times 100).
Verbal:                               Average score on verbal part of SAT for students who took it.
Math:                                 Average score on math part of SAT for students who took it.
Total SAT Score:               Average combined score of SAT for students who took it.
Expenditure (100s):           Per pupil expenditure in state in hundreds of dollars.

These data were collected by Professor Lynn Guber of the University of Vermont.

Questions:

1)  Describe the general trend in the relationship between expenditures and total SAT scores.

2)  Describe the general trend in the relationship between percent taking and total SAT scores.

3)  Describe the general trend in the relationship between percent taking and expenditures.

4)  The political commentator George Will examined a similar scatter plot of expenditures per pupil versus average combined SAT score.  He concluded that, at the state level, spending more money per pupil does not cause average SAT scores to increase (The Washington Post, September 12, 1993).

Explain why Will should not make this causal claim based on the scatter plot.  Explain how variables other than expenditures could explain the pattern in the scatter plot.  Use the information in the data to make your arguments.  Answers that fail to cite evidence from the data will lose credit.
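
If you prefer typing commands to using the menus, the scatter plots for Questions 1-3 can be sketched in the Stata Command Window roughly as follows.  The variable names total_sat_score and expenditure are assumptions made for illustration (only percent_taking appears elsewhere in this lab), so check the real names with describe first.

```stata
* Variable names below are assumptions; run -describe- to see the
* actual names in satjse and substitute them.
describe

* Scatter plots for Questions 1-3 (first variable goes on the y axis)
scatter total_sat_score expenditure
scatter total_sat_score percent_taking
scatter expenditure percent_taking

* Pairwise correlations, as numeric evidence to cite in Question 4
correlate total_sat_score expenditure percent_taking
```

The correlations are only a summary of the linear association; the plots are still needed to see the shape of each relationship.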

Part 1b:

In Part 1a we saw that the relationship between expenditures and total score was negative.  But there was a third variable that was strongly associated with both expenditures and SAT scores, namely the percentage taking the SAT.  When we looked at scatter plots involving these three variables, it appeared that the relationship between the percentage taking the SAT and the scores was stronger than the relationship between expenditures and scores, so percent taking the exam might explain the apparently negative relationship between expenditures and scores.

Multiple regression is designed to parcel out the effects of these variables.  Let's run a multiple regression using the SAT data to estimate the association of SAT scores and expenditures, controlling for percent taking.

Questions:


1)  Fit a multiple regression for "Total SAT score" on "Expenditures" and "Percent taking."  To fit a multiple regression in Stata, use the same command sequence as for a single-variable regression (see the previous lab if you don't remember), except now put all of your independent variables in the Independent Variables box instead of only one.

a)  What are the estimates of the intercept and the coefficients (slopes)? Write each estimate on your report.
b)  What percentage of the variation in total SAT scores is explained by this regression?
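
The menu sequence in Question 1 has a one-line equivalent in the Command Window.  The sketch below assumes the hypothetical variable names total_sat_score and expenditure; replace them with the names shown by describe.

```stata
* Hypothetical variable names; replace with those shown by -describe-.
* Syntax: regress <dependent variable> <independent variables>
regress total_sat_score expenditure percent_taking
```

In the output table, the row labeled _cons is the intercept, the other rows are the slopes, and R-squared appears in the header block at the upper right.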

2)  Let's examine plots of residuals versus each of the predictors to make sure the model fits the data reasonably well.  To do this we need to tell Stata to store the residuals as a new variable.  Using the Stata Command Window type:

predict ehat, residuals

*Note: You can use the "predict" command for other things as well.  For example, typing predict yhat adds a new variable containing the fitted values of y (the values on the regression line).*

Now you can look at scatterplots of the residuals versus the predictors in the usual way we look at scatterplots of any two variables.  Nonrandom patterns in these plots (e.g., curves) indicate the regression assumptions do not hold for these data.  Describe what patterns (e.g., random or non-random) you see in the residual plots.

2b) Based on the patterns you described in the residual plots above, do you think the regression assumptions hold for this model?  A simple sentence saying you think they hold or you think they do not hold will suffice.
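
As a rough sketch, the residual plots in Question 2 could be drawn from the Command Window like this.  It assumes the regression from Question 1 was the last model fit and that the expenditure variable is named expenditure, which is a guess.

```stata
* Store the residuals from the most recent regression.
* Skip this line if ehat already exists, or Stata will report an error.
predict ehat, residuals

* Residuals versus each predictor, with a reference line at zero.
* -expenditure- is an assumed variable name; check with -describe-.
scatter ehat expenditure, yline(0)
scatter ehat percent_taking, yline(0)

* Built-in alternative that plots residuals against one predictor:
rvpplot percent_taking
```

The horizontal line at zero makes it easier to judge whether the points scatter evenly above and below it or follow a curve.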

Clearly, the plots don't look random for Percent taking.   To deal with the curved pattern, let's try using the natural logarithm of Percent taking instead of Percent taking untransformed.  We do this because the graph of y = log(x) is a curve, as is the relationship between "Total SAT score" and "Percent taking."  Create a new variable for "Log(Percent taking)."  Try it by yourself first (hint: Use the Stata Command Window).  If you have trouble, the correct command is at the bottom of the page.

3)  Fit the multiple regression of "Total SAT score" on "Expenditures" and "Log(Percent taking)."  Perform model checks like those in Question 2.  Based on the plots of residuals, do you think the regression assumptions hold for this model?  A simple sentence saying that you think they hold or do not hold will suffice.

4)  Now that we're convinced our assumptions are satisfied, we can start thinking about our conclusions.
a)  What are the estimates of the intercept and the coefficients (slopes)? Write each estimate on your report.
b)  Write down 95% confidence intervals for the coefficients.  How do you interpret these intervals?
c)  What percentage of the variation in total SAT scores is explained by this regression?
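
Stata prints 95% confidence intervals in the last two columns of the regression table.  To see where those numbers come from, the sketch below rebuilds the interval for one slope as estimate plus or minus a t critical value times the standard error.  The names total_sat_score and expenditure are assumptions; log_pct_take is the variable created by the command at the bottom of the page.

```stata
* Hypothetical variable names except log_pct_take; adjust as needed.
regress total_sat_score expenditure log_pct_take

* Rebuild the 95% interval for the expenditure slope by hand:
* estimate +/- t critical value * standard error,
* using the residual degrees of freedom stored in e(df_r).
display _b[expenditure] - invttail(e(df_r), 0.025) * _se[expenditure]
display _b[expenditure] + invttail(e(df_r), 0.025) * _se[expenditure]
```

The two displayed values should match the interval shown for that variable in the regression table.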

5)  Based on the regression coefficients from 4a, do expenditures appear to be positively or negatively associated with total SAT scores, controlling for the (logarithm of) percent taking the test?  A short answer will suffice.

6a)  Would you be willing to claim that raising expenditures causes SAT scores to increase?  Explain in at most two sentences and use .
6b)  Would you be willing to use this regression to make predictions about SAT scores at the school level?  Explain in at most two sentences.


Part 2:

Extreme Observations and Influential Points

Follow the link http://www.stat.sc.edu/~west/javahtml/Regression.html

Watch how the regression equation changes as you add more points.  Try adding (i) extreme observations that drastically change your regression line and (ii) a few extreme observations that don't change the regression equation much.  You don't have to turn anything in for this part of the lab.

Part 3:

Draw your own regression line:

Follow the link http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html

Try your best using 5 sets of data (i.e., play the game 5 times).  Record the MSE from your line and the true MSE each time.

Note: This lab draws heavily from assignments used by Professor Jerry Reiter, Department of Statistical Science, Duke University.  His contribution is greatly appreciated.

*Here's the command:*
gen log_pct_take=log(percent_taking)