Instructions for lab 1
Lab Objective
To explore data with histograms and scatter plots.
Lab Procedures
Unit 1: Let's go to the movies!
What are the characteristics of U.S. movies that make the most
money? Let's address this question with the data set movies.dta.
It comprises data on the 203 top-grossing movies of all
time as of February 2000. The file is here. First, download the file to a convenient location. Now open Stata. Click on File - Open, select movies.dta, then click on Open.
So, this is Stata. Take a moment to get acquainted. Take a look at each of the menus across the top: Data, Graphics, Statistics, etc. Do any of the menu options look familiar?
Important: A great feature of Stata is that it keeps the names of the variables for you in a separate window. If you see this window now, it should say: ranking, name, type, domestic, foreign, etc. If you don't see this window, press Ctrl-6. Don't go forward until you see the Variables window.
There are missing data in this file. We'll ignore them for
simplicity. In general, when confronted with missing data, it is
best to get the advice of a professional statistician before doing
analyses.
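The menu steps above also have command equivalents that can be typed in the Command window. The following is a minimal sketch, assuming movies.dta was saved in Stata's current working directory (otherwise give the full path) and that your Stata version includes misstable (Stata 11 or later):

```stata
* Load the data set, discarding any data currently in memory
use "movies.dta", clear

* List the variable names, storage types, and labels
describe

* Report counts of missing values for the numeric variables
misstable summarize
```

Running describe is a quick way to confirm that the variables listed in the Variables window (ranking, name, type, domestic, foreign, etc.) were loaded.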
Questions:
Data Analysis Tip: The unit of
measurement for the three monetary variables is not stated.
That's bad practice. Always include a description of the units somewhere in
the file. Based on knowledge of movie revenues, it is clear that
the unit of measurement is $1,000,000.
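One way to record units inside a Stata file is with variable labels and a data set note. This is only a sketch: the label wording is an assumption, and only the two monetary variables named in the Variables window list (domestic and foreign) are shown.

```stata
* Attach a label stating the units to each monetary variable
label variable domestic "Domestic gross (millions of US dollars)"
label variable foreign  "Foreign gross (millions of US dollars)"

* Add a note to the data set as a whole
notes: Monetary variables are measured in millions of US dollars.

* Confirm that the labels now appear
describe domestic foreign
```

The labels and note stay with the data only if the file is saved afterwards.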
1) Describe the distribution of foreign grosses.
That is, say where most values are, note any outliers, and say
whether the distribution is tightly packed around its mean or is spread
out. Also, report the mean and standard deviation. A good answer here would be 2-3 sentences and one graphic (you could make several, but give me the one that best describes the distribution).
Stata Tip: Make a histogram using Graphics-Histogram, then
entering the variable name in the Variable box. If you don't remember the variable name, just click the variable in the Variables window and it should appear in the Histogram dialogue
box. You'll see lots of options to let you customize your
histogram. Feel free to explore (remember the 'good practice'
suggestions you gave on the quiz!).
To get summary statistics, go to Statistics-Summaries, tables & tests-Summary statistics-Summary statistics. Again, click the variable name in the Variables window to make it appear in the summary statistics dialogue.
To make a boxplot, go to Graphics-Box plot. You can enter more than one variable in the Variables field if you want. If you do, Stata makes side-by-side boxplots.
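The same graphs and statistics can be produced from the Command window. The sketch below assumes the data set is already open and uses the variable names shown in the Variables window (domestic and foreign); the frequency and detail options are optional extras, not something the lab requires.

```stata
* Histogram of foreign grosses, with counts instead of densities on the y-axis
histogram foreign, frequency

* Mean and standard deviation; detail adds percentiles and extreme values
summarize foreign, detail

* Side-by-side box plots of two variables
graph box domestic foreign
```

Each menu dialogue also echoes the command it runs in the Results window, which is a convenient way to learn the syntax.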
2) Which sentence best describes the distributions of domestic
and foreign grosses? You can just write the letter of your choice
on the lab report.
Choice a) Domestic and foreign grosses are very similar.
Choice b) Domestic and foreign grosses have similar
distributional shapes, but foreign grosses tend to be larger than
domestic grosses.
Choice c) Domestic and foreign grosses have similar
distributional shapes, but domestic grosses tend to be larger than
foreign grosses.
Choice d) The two distributions look nothing like each other,
because one has a long left tail and the other has a long right tail.
3) What is the name of the movie that is the clear outlier on
all three monetary variables?
4) We can examine the relationship between world-wide gross
and movie type using a box plot. To do this we need to expand our
boxplotting skills from the previous question because now we have a
continuous variable and a discrete variable. Go again to Graphics-Box plot. Put the continuous variable in the Variables field, then click the Over tab. In the Over 1 box, put the discrete variable in the Variables field. Answer the
questions below.
a) Out of Comedy and Romance movies,
which one has a distribution of world-wide grosses that is most similar
to the distribution of world-wide grosses for Adventure movies?
Justify your choice in at
most two sentences.
b) Compare the distributions for Comedy
movies and Horror movies. Do they have reasonably similar
medians? Is one more spread out than the other (if so, say which
one)?
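The box plot by movie type used in question 4 can also be drawn from the Command window. In this sketch, type is the variable name shown in the Variables window, while worldwide is a hypothetical name for the world-wide gross variable; check the Variables window for the actual name. The tabstat line is an optional addition for comparing medians and spreads numerically.

```stata
* Box plots of world-wide gross, one box per movie type
* (worldwide is a hypothetical variable name; substitute the real one)
graph box worldwide, over(type)

* Count, median, and interquartile range of world-wide gross by movie type
tabstat worldwide, by(type) statistics(n median iqr)
```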
Note: This lab draws heavily from assignments used by Professor Jerry Reiter, Department of Statistical Science, Duke University. His contribution is greatly appreciated.