Statistics W1111
  Columbia University
 

Instructions for lab 1


Lab Objective

To explore data with histograms and scatter plots.

Lab Procedures

Unit 1:  Let's go to the movies!

What are the characteristics of U.S. movies that make the most money?  Let's address this question with the data set movies.dta.   It comprises data on the 203 top grossing movies of all time as of February 2000.  The file is here.  First, download the file to a convenient location.  Now open Stata.  Click on File - Open, and select movies.dta, then click on Open.
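
Stata Tip: The menu steps in this lab generally have typed equivalents that you can enter in the Command window.  As a minimal sketch, the File - Open step could be typed as below.  This assumes movies.dta was saved in Stata's current working directory; if it was saved elsewhere, give the full path in quotes instead.

```stata
* Load the lab data; clear discards any data already in memory
use movies.dta, clear

* List the variable names, storage types, and labels
describe
```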

So, this is Stata.  Take a moment to get acquainted.  Take a look at each of the menus across the top: Data, Graphics, Statistics, etc.  Do any of the menu options look familiar?

Important: A great feature of Stata is that it keeps the names of the variables for you in a separate window.  If you see this window now, it should say: ranking, name, type, domestic, foreign, etc.  If you don't see this window, press Ctrl-6.  Don't go forward until you see the Variables window.

There are missing data in this file.  We'll ignore them for simplicity.  In general, when confronted with missing data, it is best to get the advice of a professional statistician before doing analyses.

Questions: 

Data Analysis Tip: The unit of measurement for the three monetary variables is not stated.  That's bad practice.  Always include a description of the units somewhere in the file.  Based on knowledge of movie revenues, it is clear that the unit of measurement is $1,000,000.

1)  Describe the distribution of foreign grosses.  That is, say where most values are, note any outliers, and say whether the distribution is tightly packed around its mean or is spread out.  Also, report the mean and standard deviation.  A good answer here would be 2-3 sentences and one graphic (you could make several, but give me the one that best describes the distribution).  

Stata Tip:  Make a histogram using Graphics-Histogram, then enter the variable name in the Variable box.  If you don't remember the variable name, just click the variable in the Variables window and it should appear in the Histogram dialogue box.  You'll see lots of options that let you customize your histogram; feel free to explore (remember the 'good practice' suggestions you gave on the quiz!).

To get summary statistics, go to Statistics-Summaries, tables & tests-Summary statistics-Summary statistics.  Again, click the variable name in the Variables window to make it appear in the summary statistics dialogue.

To make a boxplot, go to Graphics-Box plot.  You can enter more than one variable in the Variables field if you want.  If you do, Stata makes side-by-side boxplots.
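
The three menu paths in this tip can also be sketched as typed commands.  The variable names foreign and domestic are the ones listed in the Variables window; the options shown are only one reasonable choice, not the required one.

```stata
* Histogram of foreign gross (by default the vertical axis shows density)
histogram foreign

* Same histogram with counts on the vertical axis instead
histogram foreign, frequency

* Number of observations, mean, standard deviation, minimum, and maximum
summarize foreign

* Side-by-side box plots of two variables
graph box domestic foreign
```

Movies with a missing value for a variable are simply left out of that variable's graph and summary, so the number of observations that summarize reports may be smaller than 203.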

2)  Which sentence best describes the distributions of domestic and foreign grosses?  You can just write the letter of your choice on the lab report.
Choice a)  Domestic and foreign grosses are very similar.
Choice b)  Domestic and foreign grosses have similar distributional shapes, but foreign grosses tend to be larger than domestic grosses.
Choice c)  Domestic and foreign grosses have similar distributional shapes, but domestic grosses tend to be larger than foreign grosses.
Choice d)  The two distributions look nothing like each other, because one has a long left tail and the other has a long right tail.

3)  What is the name of the movie that is the clear outlier on all three monetary variables?  

4)  We can examine the relationship between world-wide gross and movie type using a box plot.  To do this we need to expand our boxplotting skills from the previous question, because now we have a continuous variable and a discrete variable.  Go again to Graphics-Box plot.  Put the continuous variable in the Variables field, then click the Over tab.  In the Over 1 box, put the discrete variable in the Variables field.  Answer the three questions below.

    a)  Out of Comedy and Romance movies, which one has a distribution of world-wide grosses that is most similar to the distribution of world-wide grosses for Adventure movies?  Justify your choice in at most two sentences.
   
    b)  Compare the distributions for Comedy movies and Horror movies.  Do they have reasonably similar medians?  Is one more spread out than the other (if so, say which one)?

    c)  If you directed a movie and wanted to make lots of money worldwide, which type appears to give you the best chance of doing so?  Base your answer on the results of the box plot.
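
For question 4, the Over tab corresponds to the over() option of the box plot command.  In the sketch below, type is the variable listed in the Variables window, but worldwide is only a placeholder: the handout does not give the name of the world-wide gross variable, so substitute the name exactly as it appears in your Variables window.

```stata
* worldwide is a placeholder name for the world-wide gross variable;
* replace it with the spelling shown in your Variables window
graph box worldwide, over(type)
```

Stata draws one box per movie type on a shared vertical axis, which is what makes the medians and spreads directly comparable.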

5)  Describe the relationship between domestic gross and foreign gross.  To make a scatter plot, use Graphics, then select Easy Graphics-Scatterplot.  Enter the continuous variable for the vertical axis in the Y box and the continuous variable for the horizontal axis in the X box. (Note: If you're using Stata Version 10--which is what's in the computer labs--you won't have an Easy Graphics menu.  Instead, choose the tab for Two-way comparisons/Scatter.  This is the first option under the graphics menu.  Enter a new graph and put the variables in the corresponding Y and X boxes.)  Items to include in your description are the general trend of the relationship (e.g., positive and linear, negative and linear, some other pattern, no clear pattern) and whether there are any outliers or points that do not fit the pattern.
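
A typed equivalent of the scatter plot dialogue, as a minimal sketch.  Which gross goes on which axis is your choice; putting foreign on the vertical axis here is only for illustration.

```stata
* The first variable goes on the vertical (Y) axis,
* the second on the horizontal (X) axis
twoway scatter foreign domestic
```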

6)  Report the three pairwise correlations between Foreign, Domestic, and World-wide gross.  You can find all three simultaneously by selecting Statistics-Summaries, tables & tests-Summary statistics-Pairwise correlations, then entering all three variables into the "Variables" box.  Do the correlations suggest strongly positive linear relationships, weakly positive linear relationships, no linear relationships, weakly negative linear relationships, or strongly negative linear relationships?
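
The Pairwise correlations dialogue corresponds to the pwcorr command.  In this sketch worldwide is again a placeholder for the world-wide gross variable, whose actual name you should copy from the Variables window.

```stata
* Correlation matrix for the three monetary variables
* (worldwide is a placeholder name)
pwcorr domestic foreign worldwide

* The obs option also reports how many movies each correlation is based on
pwcorr domestic foreign worldwide, obs
```

Because pwcorr computes each correlation from every movie that has both values present, the counts can differ from pair to pair when data are missing.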

7)  Why are the correlations between Domestic and World-wide, and Foreign and World-wide, stronger than the correlation between Domestic and Foreign?  The answer has to do with the definitions of the variables.

8)  Outliers can have a strong effect on correlations.  Let's check to see if excluding Titanic changes the correlations substantially.  First, open the data editor by clicking its icon in the toolbar (hover over the icons to see their labels).  To exclude Titanic, highlight the whole of row number 1 (for Titanic), select Delete and then Delete observation 1.  Now exit the data editor and be sure to click OK to accept your changes.  Now, re-calculate the correlations in (6).  Did the correlations get stronger or weaker?  Does the substance of your conclusions in (6) change very much when excluding Titanic?
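
The same check can be sketched with typed commands instead of the data editor.  This assumes, as stated above, that Titanic is stored in row 1; worldwide is a placeholder for the world-wide gross variable name.

```stata
* Confirm which movie is stored in the first row before deleting anything
list name in 1

* Delete the first observation from the data in memory
drop in 1

* Recompute the correlations without it (worldwide is a placeholder name)
pwcorr domestic foreign worldwide

* Reload the original file to get the full data back
use movies.dta, clear
```

The drop command changes only the copy of the data in memory; the file on disk is unaffected unless you save over it.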

Tip: If you want to see multiple scatterplots in the same window, you can use the Graphics-Scatterplot matrix command.  Try it now with foreign, domestic, and worldwide gross.  You don't have to write anything down, just take a look.
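
A typed equivalent of the scatterplot matrix, as a sketch; worldwide is a placeholder for the world-wide gross variable name shown in your Variables window.

```stata
* One scatterplot for every pair of the three variables
graph matrix foreign domestic worldwide
```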

Data Analysis Tip:  It is not acceptable to exclude outliers from analyses unless you have a scientific reason to do so (e.g., a data entry error, or maybe the outlying unit is not part of your target population).  Hiding outliers is fudging data to get results you want.  That is dishonest and unethical.  When you see outliers, do analyses with and without them.  When the results do not change much, report the results based on the full data set, and tell your audience that the results were not sensitive to the outliers.  When the results do change substantially, report both sets of analyses: one with and one without the outliers.  This honestly informs people that your conclusions are not on very solid ground, because particular data points affect the results greatly.

Note: This lab draws heavily from assignments used by Professor Jerry Reiter, Department of Statistical Science, Duke University.  His contribution is greatly appreciated.