Statistics W2025 Homework
Homework 2
Due 5pm Wednesday February 6th in your dropbox on courseworks
Homework 3
Due 5pm Wednesday February 13th in your dropbox on courseworks
Homework 4
The Trientalis dataset comprises information on the number of daughter
tubers produced by mother plants of Trientalis europaea and a number of
independent variables. Within the dataset the variables consist of the
dry weight of the mother tuber (in milligrams), the leaf area (in
square millimeters), a fertilizer treatment (which we will ignore),
the number of flowers produced by the mother plant, and the number of
new daughter plants (which is the dependent variable). For some
observations, the weight is not available.
See if you can find a satisfactory Poisson regression model for the data. Explain how you arrived at your model.
What is your interpretation of the regression parameters?
I got the data from here.
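As a starting point, here is a minimal sketch of fitting and comparing Poisson regressions in R. It runs on simulated data because the dataset link is not reproduced here; the data frame `trientalis` and its column names (`weight`, `area`, `flowers`, `daughters`) are assumptions for illustration, not the names in the actual file.

```r
# Simulated stand-in for the Trientalis data; all column names are hypothetical.
set.seed(1)
n <- 200
trientalis <- data.frame(
  weight  = runif(n, 10, 100),    # dry weight of the mother tuber (mg)
  area    = runif(n, 100, 1000),  # leaf area (mm^2)
  flowers = rpois(n, 1)           # flowers on the mother plant
)
trientalis$daughters <- rpois(n, exp(-0.5 + 0.015 * trientalis$weight))
trientalis$weight[sample(n, 10)] <- NA  # mimic the unavailable weights

# Fit on complete cases so that nested models are compared on the same rows.
cc <- na.omit(trientalis)
full    <- glm(daughters ~ weight + area + flowers, family = poisson, data = cc)
reduced <- glm(daughters ~ weight, family = poisson, data = cc)

# Likelihood-ratio test for the dropped terms, plus AIC for both models.
print(anova(reduced, full, test = "Chisq"))
print(AIC(reduced, full))

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in that predictor.
rate_ratios <- exp(coef(reduced))
print(rate_ratios)
```

Dropping rows with a missing weight is only one way to handle the gaps; whichever approach you take, say so in your writeup.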
Due 5pm Friday February 22nd in your dropbox on courseworks
Homework 5
Please prepare a one-page project proposal. Your
project will be due after Spring break and should involve a
substantial data analysis and writeup. The last several chapters of
the Zuur textbook should give you some ideas for the kind of level I
have in mind. For this project, I expect you to use R and one of the
tools we have been studying (i.e., linear, logistic, or Poisson
regression). I would like you to do this in pairs. You can choose
your own partner or, if you let me know, I'll be happy to assign
you a partner. You might find this website helpful.
Due 5pm Wednesday February 27th - please email to dm2418
Homework 6
This homework concerns data on 250 groups that
went to a park. Each group was questioned about how many fish they
caught (count), how many children were in the group (child), how many
people were in the group (persons), and whether or not they brought a
camper to the park (camper). The primary goal is to model the number
of fish caught as a function of the other variables.
Your assignment is to explore the models we discussed in class
for Poisson and negative binomial regression with and without
adjustment for the excess number of zeroes and choose what you think
is the most appropriate model.
You can read the data with:
read.table(url("http://www.stat.columbia.edu/~madigan/W2025/data/fish.csv"), sep=",", header = TRUE)
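The comparison might be set up along the following lines. This is a sketch on simulated data that reuses the column names from the description above (count, child, persons, camper); the simulated coefficients are arbitrary. The zero-inflated fits assume the pscl package, which may not be installed, so that part only runs if it is available.

```r
library(MASS)  # glm.nb; ships with standard R installations

# Simulated stand-in for fish.csv; values are made up for illustration.
set.seed(2)
n <- 250
fish <- data.frame(
  persons = sample(1:4, n, replace = TRUE),
  camper  = rbinom(n, 1, 0.5)
)
fish$child <- rbinom(n, fish$persons - 1, 0.3)
# Overdispersed counts with extra zeros (e.g. groups that did not fish).
mu <- exp(-1 + 0.8 * fish$persons - 1.2 * fish$child + 0.5 * fish$camper)
fish$count <- ifelse(rbinom(n, 1, 0.3) == 1, 0, rnbinom(n, size = 1, mu = mu))

pois <- glm(count ~ child + persons + camper, family = poisson, data = fish)
nb   <- glm.nb(count ~ child + persons + camper, data = fish)

# Rough overdispersion check: Pearson chi-square over residual df.
# Values well above 1 suggest the Poisson variance assumption is too tight.
dispersion <- sum(residuals(pois, type = "pearson")^2) / df.residual(pois)
print(dispersion)
print(AIC(pois, nb))

# Zero-inflated versions with an intercept-only model for the excess zeros.
if (requireNamespace("pscl", quietly = TRUE)) {
  zip  <- pscl::zeroinfl(count ~ child + persons + camper | 1,
                         data = fish, dist = "poisson")
  zinb <- pscl::zeroinfl(count ~ child + persons + camper | 1,
                         data = fish, dist = "negbin")
  print(AIC(pois, nb, zip, zinb))
}
```

AIC is only one criterion; comparing observed and fitted numbers of zeros is another reasonable check.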
Due 5pm Wednesday March 6th in your dropbox on courseworks
Homework 7
This homework concerns data on 462 individuals:
SAH <- read.table("http://stat.columbia.edu/~madigan/W2025/data/SAHmissing.txt",header=TRUE,sep="\t")
The variables are:
sbp: systolic blood pressure
tobacco: cigarettes per day
ldl: LDL cholesterol
adiposity: measure of body fat
famhist: family history of heart disease
typea: score on a test of type A personality
obesity: body mass index
alcohol: ounces per day
age: age
chd: coronary heart disease present (1) or absent (0)
The goal is to build a generalized additive model to predict chd. I have omitted the value of CHD for 42 of the rows. Please email me your predictions for those 42 rows and I will score them.
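One possible workflow, sketched on simulated data with a subset of the variables listed above: fit a binomial GAM with mgcv on the rows where chd is known, then predict the rows where it is missing. The simulated values, the coding of famhist as "Absent"/"Present", and the 0.5 threshold are all assumptions for illustration.

```r
library(mgcv)  # gam; ships with standard R installations

# Simulated stand-in for the SAH data; values and famhist coding are made up.
set.seed(3)
n <- 462
SAH <- data.frame(
  age     = round(runif(n, 15, 64)),
  ldl     = rgamma(n, shape = 5, rate = 1),
  tobacco = rexp(n, rate = 0.3),
  famhist = factor(sample(c("Absent", "Present"), n, replace = TRUE))
)
eta <- -4 + 0.05 * SAH$age + 0.2 * SAH$ldl + 0.8 * (SAH$famhist == "Present")
SAH$chd <- rbinom(n, 1, plogis(eta))
SAH$chd[sample(n, 42)] <- NA  # mimic the 42 withheld outcomes

known   <- SAH[!is.na(SAH$chd), ]
unknown <- SAH[is.na(SAH$chd), ]

# Smooth terms for continuous predictors, a parametric term for the factor.
fit <- gam(chd ~ s(age) + s(ldl) + s(tobacco) + famhist,
           family = binomial, data = known, method = "REML")
print(summary(fit))

# Predicted probabilities for the rows with missing chd, thresholded at 0.5.
p_hat    <- predict(fit, newdata = unknown, type = "response")
chd_pred <- as.integer(p_hat > 0.5)
print(table(chd_pred))
```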
Due 5pm Friday March 15th in your dropbox on courseworks
Homework 8
Revisit the SAH data from Homework 7. This time just use the first 420 rows.
Hold out a random sample of 42 rows. Compare the performance (as measured by accuracy)
of trees and K-nearest-neighbor methods at predicting chd for the 42 hold-out rows.
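The hold-out comparison could be organized as in the sketch below, which uses rpart for the tree and knn from the class package. The data are simulated, the predictors are a subset of those listed in Homework 7, and k = 5 is an arbitrary choice for illustration rather than a recommendation.

```r
library(rpart)  # classification trees; ships with standard R installations
library(class)  # knn; ships with standard R installations

# Simulated stand-in for the first 420 rows of the SAH data.
set.seed(4)
n <- 420
dat <- data.frame(
  age     = runif(n, 15, 64),
  ldl     = rgamma(n, shape = 5, rate = 1),
  tobacco = rexp(n, rate = 0.3)
)
dat$chd <- factor(rbinom(n, 1, plogis(-4 + 0.05 * dat$age + 0.2 * dat$ldl)))

# Hold out a random sample of 42 rows.
hold      <- sample(n, 42)
train_set <- dat[-hold, ]
hold_set  <- dat[hold, ]

# Classification tree.
tree      <- rpart(chd ~ ., data = train_set, method = "class")
tree_pred <- predict(tree, newdata = hold_set, type = "class")
tree_acc  <- mean(tree_pred == hold_set$chd)

# KNN is distance based, so scale the predictors, applying the training
# means and standard deviations to the hold-out rows as well.
x_cols  <- c("age", "ldl", "tobacco")
x_train <- scale(train_set[, x_cols])
x_hold  <- scale(hold_set[, x_cols],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
knn_pred <- knn(x_train, x_hold, cl = train_set$chd, k = 5)
knn_acc  <- mean(knn_pred == hold_set$chd)

print(c(tree = tree_acc, knn = knn_acc))
```

With only 42 hold-out rows the accuracy estimates are noisy, so small differences between the two methods should be read with caution.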
Due 5pm Friday April 5th in your dropbox on courseworks
Homework 9
Provide an annotated version of this R code that explains what each line
of the code does.
Due 5pm Friday April 12th in your dropbox on courseworks
Homework 10
The data at:
read.table(url("http://www.stat.columbia.edu/~madigan/W2025/data/temperature.txt"), sep=" ", header = TRUE)
concern year-on-year global temperature changes from 1880 to 1987. The goal is to estimate the linear slope, i.e.,
the average increase in temperature per year.
(a) Fit a simple linear model. What are the estimated slope and associated confidence interval?
Check the residuals using the usual plot(MyModel) figures. Is anything unusual?
(b) Plot the auto-correlation function of the standardized residuals. Anything of concern?
(c) Refit the model using the two approaches from Chapter 6 that we went over in class (corCompSymm and corAR1).
Does either provide a better fit than the original lm? Do they produce similar estimates of the slope and associated
confidence interval as in (a)?
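The three parts can be sketched as follows with nlme. The series here is simulated with AR(1) noise, and the column names `year` and `temperature` are assumptions that may not match the headers in temperature.txt. The gls model without a correlation structure plays the role of the original lm so that the AIC values are comparable.

```r
library(nlme)  # gls, corCompSymm, corAR1; ships with standard R installations

# Simulated stand-in: linear trend plus AR(1) noise (made-up values).
set.seed(5)
year  <- 1880:1987
noise <- as.numeric(arima.sim(list(ar = 0.5), n = length(year), sd = 0.1))
temp  <- data.frame(year = year, temperature = 0.005 * (year - 1880) + noise)

# (a) Simple linear model: slope and its 95% confidence interval.
m_lm <- lm(temperature ~ year, data = temp)
print(confint(m_lm)["year", ])

# (b) Autocorrelation of the standardized residuals.
#     Drop plot = FALSE to draw the usual ACF plot instead of printing it.
print(acf(rstandard(m_lm), plot = FALSE))

# (c) GLS fits with no correlation, compound symmetry, and AR(1) errors.
m0    <- gls(temperature ~ year, data = temp, method = "REML")
m_cs  <- gls(temperature ~ year, data = temp, method = "REML",
             correlation = corCompSymm(form = ~ year))
m_ar1 <- gls(temperature ~ year, data = temp, method = "REML",
             correlation = corAR1(form = ~ year))
print(AIC(m0, m_cs, m_ar1))

# Slope estimate and confidence interval under the AR(1) model.
print(intervals(m_ar1, which = "coef"))
```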
Due 5pm Friday April 19th in your dropbox on courseworks