Statistics W2025 Homework
Due 5pm Wednesday February 6th in your dropbox on courseworks
Due 5pm Wednesday February 13th in your dropbox on courseworks
The Trientalis dataset comprises information on the number of daughter tubers produced by mother plants of Trientalis europaea and a number of independent variables. The variables in the dataset are the dry weight of the mother tuber (in milligrams), the leaf area (in square millimeters), a fertilizer treatment (which we will ignore), the number of flowers produced by the mother plant, and the number of new daughter plants (which is the dependent variable). For some observations, the weight is not available.
See if you can find a satisfactory Poisson regression model for the data. Explain how you arrived at your model.
What is your interpretation of the regression parameters?
I got the data from here.
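As a starting point, here is a minimal sketch of fitting and comparing Poisson regressions in R. It runs on simulated data, not the real Trientalis file: the data frame `plants` and the column names `weight`, `leafarea`, `flowers`, and `daughters` are assumptions made for illustration, and the simulated coefficients mean nothing.

```r
# Simulated stand-in for the Trientalis data; all column names are hypothetical.
set.seed(1)
n <- 100
plants <- data.frame(
  weight   = runif(n, 20, 200),    # mother tuber dry weight (mg)
  leafarea = runif(n, 500, 3000),  # leaf area (mm^2)
  flowers  = rpois(n, 1)           # number of flowers
)
plants$daughters <- rpois(n, exp(-0.5 + 0.008 * plants$weight))
plants$weight[sample(n, 10)] <- NA  # mimic the unavailable weights

# glm() drops rows with missing values by default (na.action = na.omit)
fit_full  <- glm(daughters ~ weight + leafarea + flowers,
                 family = poisson, data = plants)
fit_small <- update(fit_full, . ~ weight)

# Both fits use the same complete rows here, so their AICs are comparable
aic <- AIC(fit_full, fit_small)
print(aic)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in that predictor
rate_ratios <- exp(coef(fit_small))
print(rate_ratios)
```

Note that dropping weight from a model changes which rows are used, so compare models only on the same set of observations.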
Due 5pm Friday February 22nd in your dropbox on courseworks
Please prepare a one-page project proposal. Your project will be due after Spring break and should involve a substantial data analysis and writeup. The last several chapters of the Zuur textbook should give you some ideas for the level I have in mind. For this project, I expect you to use R and one of the tools we have been studying (i.e., linear, logistic, or Poisson regression). I would like you to do this in pairs. You can choose your own partner or, if you let me know, I'll be happy to assign you a partner. You might find this website helpful.
Due 5pm Wednesday February 27th - please email to dm2418
This homework concerns data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). The primary goal is to model the number of fish caught as a function of the other variables. Your assignment is to explore the models we discussed in class for Poisson and negative binomial regression with and without adjustment for the excess number of zeroes and choose what you think is the most appropriate model. You can read the data with: read.table(url("http://www.stat.columbia.edu/~madigan/W2025/data/fish.csv"), sep=",", header = TRUE)
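The following is a rough sketch of one way to check for overdispersion and excess zeros before moving to zero-adjusted models. It uses simulated data with the same column names as described above (an assumption; the real fish.csv may differ), and the simulation settings are arbitrary.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R installations

# Simulated stand-in for fish.csv (assumed column names, arbitrary parameters)
set.seed(2)
n <- 250
persons <- sample(1:4, n, replace = TRUE)
child   <- rbinom(n, persons - 1, 0.3)  # at least one adult per group
camper  <- rbinom(n, 1, 0.5)
mu      <- exp(-0.5 + 0.6 * persons - 1.0 * child + 0.5 * camper)
fished  <- rbinom(n, 1, 0.6)            # groups that did not fish catch nothing
fish <- data.frame(count = fished * rnbinom(n, mu = mu, size = 1),
                   child, persons, camper)

fit_pois <- glm(count ~ child + persons + camper, family = poisson, data = fish)
fit_nb   <- glm.nb(count ~ child + persons + camper, data = fish)

# Observed number of zeros versus the number each fitted model expects
obs_zero  <- sum(fish$count == 0)
pois_zero <- sum(dpois(0, fitted(fit_pois)))
nb_zero   <- sum(dnbinom(0, mu = fitted(fit_nb), size = fit_nb$theta))
print(c(observed = obs_zero, poisson = pois_zero, negbin = nb_zero))

print(AIC(fit_pois, fit_nb))
```

If a model expects far fewer zeros than were observed, a zero-inflated or hurdle version is worth trying; one option, assuming the pscl package is installed, is `zeroinfl()`.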
Due 5pm Wednesday March 6th in your dropbox on courseworks
This homework concerns data on 462 individuals:
SAH <- read.table("http://stat.columbia.edu/~madigan/W2025/data/SAHmissing.txt",header=TRUE,sep="\t")
The variables are:
sbp: systolic blood pressure
tobacco: cigarettes per day
ldl: LDL cholesterol
adiposity: measure of body fat
famhist: family history of heart disease
typea: score on a test of type A personality
obesity: body mass index
alcohol: ounces per day
chd: coronary heart disease present (1) or absent (0)
The goal is to build a generalized additive model to predict chd. I have omitted the value of chd for 42 of the rows. Please email me your predictions for those 42 rows and I will score them.
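A minimal sketch of the workflow, assuming the omitted chd values are coded as NA: fit a binomial GAM on the rows with known chd, then predict the rest. The data are simulated, only a few of the variables are included, and the famhist levels "Absent" and "Present" are guesses made for illustration.

```r
library(mgcv)  # provides gam(); mgcv ships with standard R installations

# Simulated stand-in for the SAH data (subset of variables, arbitrary parameters)
set.seed(3)
n <- 462
sah <- data.frame(
  sbp     = rnorm(n, 138, 20),
  tobacco = rexp(n, 1 / 4),
  ldl     = rnorm(n, 4.7, 2),
  famhist = factor(sample(c("Absent", "Present"), n, replace = TRUE))
)
p <- plogis(-3 + 0.01 * sah$sbp + 0.15 * sah$tobacco +
              0.8 * (sah$famhist == "Present"))
sah$chd <- rbinom(n, 1, p)
sah$chd[sample(n, 42)] <- NA  # mimic the 42 withheld outcomes

train <- sah[!is.na(sah$chd), ]
test  <- sah[is.na(sah$chd), ]

# Smooth terms for continuous predictors, a parametric term for the factor
fit <- gam(chd ~ s(sbp) + s(tobacco) + s(ldl) + famhist,
           family = binomial, data = train)

# Predicted probabilities for the withheld rows, thresholded at 0.5
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
```

The 0.5 threshold is one possible choice; `plot(fit)` shows the estimated smooths and helps decide which terms need to be nonlinear.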
Due 5pm Friday March 15th in your dropbox on courseworks
Revisit the SAH data from Homework 7. This time just use the first 420 rows. Hold out a random sample of 42 rows. Compare the performance (as measured by accuracy) of trees and K-nearest-neighbor methods at predicting chd for the 42 hold-out rows.
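One way the hold-out comparison could be set up is sketched below, using `rpart` for the tree and `knn()` from the class package. The data are simulated with a subset of the variables, and the choice of k = 5 is arbitrary.

```r
library(rpart)  # classification trees
library(class)  # provides knn()

# Simulated stand-in for the first 420 rows of the SAH data
set.seed(4)
n <- 420
sah <- data.frame(
  sbp     = rnorm(n, 138, 20),
  tobacco = rexp(n, 1 / 4),
  ldl     = rnorm(n, 4.7, 2)
)
p <- plogis(-3 + 0.01 * sah$sbp + 0.15 * sah$tobacco + 0.2 * sah$ldl)
sah$chd <- factor(rbinom(n, 1, p))

# Hold out a random sample of 42 rows
hold  <- sample(n, 42)
train <- sah[-hold, ]
test  <- sah[hold, ]

# Classification tree
tree      <- rpart(chd ~ ., data = train, method = "class")
tree_pred <- predict(tree, newdata = test, type = "class")
tree_acc  <- mean(tree_pred == test$chd)

# KNN is distance-based, so scale the predictors using the training
# means and standard deviations only
x_cols  <- c("sbp", "tobacco", "ldl")
train_x <- scale(train[, x_cols])
test_x  <- scale(test[, x_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
knn_pred <- knn(train_x, test_x, cl = train$chd, k = 5)
knn_acc  <- mean(knn_pred == test$chd)

print(c(tree = tree_acc, knn = knn_acc))
```

With only 42 hold-out rows the accuracy estimates are noisy, so small differences between the two methods should not be over-interpreted.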
Due 5pm Friday April 5th in your dropbox on courseworks
Provide an annotated version of this R code that explains what each line of the code does.
Due 5pm Friday April 12th in your dropbox on courseworks
The data at:
read.table(url("http://www.stat.columbia.edu/~madigan/W2025/data/temperature.txt"), sep=" ", header = TRUE)
concern year-on-year global temperature changes from 1880 to 1987. The goal is to estimate the linear slope, i.e., the average increase in temperature per year.
(a) Fit a simple linear model. What are the estimated slope and associated confidence interval?
Check the residuals using the usual plot(MyModel) figures. Anything unusual?
(b) Plot the auto-correlation function of the standardized residuals. Anything of concern?
(c) Refit the model using the two approaches from Chapter 6 that we went over in class (corCompSymm and corAR1). Does either provide a better fit than the original lm? Do they produce similar estimates of the slope and associated confidence interval as in (a)?
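The steps in (a) to (c) can be sketched roughly as follows. The series is simulated with AR(1) errors, and the column names `year` and `temperature` are assumptions; the real temperature.txt may use different names. Only the corAR1 fit is shown.

```r
library(nlme)  # provides gls(), corAR1(), corCompSymm()

# Simulated stand-in: linear trend plus AR(1) noise (arbitrary parameters)
set.seed(5)
temp <- data.frame(year = 1880:1987)
temp$temperature <- 0.005 * (temp$year - 1880) +
  as.numeric(arima.sim(list(ar = 0.5), n = nrow(temp), sd = 0.1))

# (a) Simple linear model: slope estimate and 95% confidence interval
fit_lm <- lm(temperature ~ year, data = temp)
lm_ci  <- confint(fit_lm)["year", ]

# (b) Autocorrelation of the standardized residuals
# (drop plot = FALSE to draw the usual ACF plot)
res_acf <- acf(rstandard(fit_lm), plot = FALSE)

# (c) Same mean model fitted by gls, without and with AR(1) errors
fit_gls <- gls(temperature ~ year, data = temp)
fit_ar1 <- gls(temperature ~ year, data = temp,
               correlation = corAR1(form = ~ year))

print(AIC(fit_gls, fit_ar1))
ar1_ci <- intervals(fit_ar1, which = "coef")$coef
print(lm_ci)
print(ar1_ci["year", ])
```

A compound symmetry structure can be supplied the same way through `correlation = corCompSymm(form = ~ year)`. Comparing against a gls fit with no correlation structure, instead of the lm object, keeps the AIC values on the same (REML) footing.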
Due 5pm Friday April 19th in your dropbox on courseworks