<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>Statistical Modeling, Causal Inference, and Social Science</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/mlm/" />
<modified>2007-11-05T12:45:35Z</modified>
<tagline></tagline>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1</id>
<generator url="http://www.movabletype.org/" version="3.34">Movable Type</generator>
<copyright>Copyright (c) 2007, Andrew</copyright>
<entry>
<title>Counting churchgoers</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/counting_church_1.html" />
<modified>2007-11-05T12:45:35Z</modified>
<issued>2007-11-05T12:41:20Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1383</id>
<created>2007-11-05T12:41:20Z</created>
<content><![CDATA[<p>When studying religious attendance and voting, it's worth remembering <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2006/07/counting_church.html">this measurement issue</a>.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>When studying religious attendance and voting, it's worth remembering <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2006/07/counting_church.html">this measurement issue</a>.</p>]]>

</content>
</entry>
<entry>
<title>Income, religious attendance, and voting:  recent patterns and trends since 1992</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/income_religiou_1.html" />
<modified>2007-11-05T14:35:39Z</modified>
<issued>2007-11-05T05:17:39Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1117</id>
<created>2007-11-05T05:17:39Z</created>
<content><![CDATA[<p>I can't say I have much of an explanation for this, but it's interesting:<br />
<!--<br />
<img alt="religion.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/religion.png" width="792" height="612" /><br />
--><br />
<a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/religion.png"><img alt="religion_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/religion_small.png" width="398" height="277" /></a></p>

<p>Church attendance is a strong predictor of how high-income people vote, but not such a good predictor for low-income voters.</p>

<p>There's lots of talk about religion and income and voting, but people don't always know that <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2006/06/treatment_inter.html">interactions are important</a>.</p>

<p>Here are some time trends (from <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/thirds4.pdf">this paper with David Park</a>).  The graphs below show the difference in Republican vote between rich and poor, religious and non-religious, and their interaction (that is, the difference in differences), computed separately for each presidential election year:<br />
<!--<br />
<img alt="interactiontimetrends.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/interactiontimetrends.png" width="575" height="200" /><br />
--><br />
<a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/interactiontimetrends.png"><img alt="interactiontimetrends_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/interactiontimetrends_small.png" width="443" height="154" /></a></p>

<p>As others have noted (although not, as far as I know, looking at interactions), it all started in 1992.  We heard a lot about the Moral Majority back in 1980, but it doesn't seem to have started showing up in voting patterns until Clinton.</p>

<p>You can read more about interactions in the linked article.  The key points are that (a) higher-income voters support the Republicans and have done so for a while; (b) more recently, churchgoers have supported the Republicans; and (c) the difference between churchgoers and non-churchgoers is much greater for the rich than the poor.</p>

<p>P.P.S.  I posted the top graph several months ago but the recent interest in <a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/">these religiosity/income graphs</a> and <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">these state voting maps</a> motivated me to repost. </p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>I can't say I have much of an explanation for this, but it's interesting:<br />
<!--<br />
<img alt="religion.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/religion.png" width="792" height="612" /><br />
--><br />
<a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/religion.png"><img alt="religion_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/religion_small.png" width="398" height="277" /></a></p>

<p>Church attendance is a strong predictor of how high-income people vote, but not such a good predictor for low-income voters.</p>

<p>There's lots of talk about religion and income and voting, but people don't always know that <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2006/06/treatment_inter.html">interactions are important</a>.</p>

<p>Here are some time trends (from <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/thirds4.pdf">this paper with David Park</a>).  The graphs below show the difference in Republican vote between rich and poor, religious and non-religious, and their interaction (that is, the difference in differences), computed separately for each presidential election year:<br />
<!--<br />
<img alt="interactiontimetrends.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/interactiontimetrends.png" width="575" height="200" /><br />
--><br />
<a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/interactiontimetrends.png"><img alt="interactiontimetrends_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/interactiontimetrends_small.png" width="443" height="154" /></a></p>

<p>As others have noted (although not, as far as I know, looking at interactions), it all started in 1992.  We heard a lot about the Moral Majority back in 1980, but it doesn't seem to have started showing up in voting patterns until Clinton.</p>

<p>You can read more about interactions in the linked article.  The key points are that (a) higher-income voters support the Republicans and have done so for a while; (b) more recently, churchgoers have supported the Republicans; and (c) the difference between churchgoers and non-churchgoers is much greater for the rich than the poor.</p>

<p>P.P.S.  I posted the top graph several months ago but the recent interest in <a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/">these religiosity/income graphs</a> and <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">these state voting maps</a> motivated me to repost. </p>]]>

</content>
</entry>
<entry>
<title>Religiosity and income in the U.S.</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/religiosity_and.html" />
<modified>2007-11-05T14:52:32Z</modified>
<issued>2007-11-04T11:33:27Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1382</id>
<created>2007-11-04T11:33:27Z</created>
<content><![CDATA[<p>David noticed <a href="http://www.nytimes.com/2007/11/03/technology/03online.html?ref=todayspaper">this article</a> by Dan Mitchell reporting the well-known fact that people in richer countries tend to be less religious.  What about states in the U.S.?  We (that is, David Park, Joe Bafumi, Boris Shor, and I) look at it two ways.</p>

<p>First, here's a scatterplot of the 50 states, plotting average religious attendance vs. average income.  (Religious attendance is on a -2 to 2 scale, from "never" to "more than once a week," and average income was originally in dollars but has been rescaled to be centered at zero.):</p>

<p><!--<img alt="st.rel.inc0004.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/st.rel.inc0004.png" width="516" height="444" />--><br />
<img alt="st.rel.inc0004_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/st.rel.inc0004_small.png" width="450" height="347" /></p>

<p>States that voted for Bush in 2004 are in red and the Kerry-supporting states are blue.  You can see that people in richer states tend to be less religious, although the relation is far from a straight line.  There is also some regional variation (more religious attendance in the south, less in the northeast and west).</p>

<p>Second, here's a plot showing the correlation of religious attendance and individual income <em>within</em> each state.  We get a separate correlation for each state, and so we can plot these.  Here we plot the correlations vs. state income, using the same color scheme:</p>

<p><!--<img alt="corr.st.rel.inc0004.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/corr.st.rel.inc0004.png" width="516" height="444" />--><br />
<img alt="corr.st.rel.inc0004_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/corr.st.rel.inc0004_small.png" width="449" height="351" /></p>

<p>Again, there's quite a bit of variation from state to state, but overall we see a positive correlation between income and religiosity in poor states and a negative correlation in rich states.  To put it another way, in Mississippi, the richer people attend church more.  In Connecticut, the richer people attend church less.</p>

<p>(See also <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">here</a> for more on income and voting by state, and <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/income_religiou_1.html">here</a> for more on income, voting, and church attendance.)</p>

<p>P.S.  Typos fixed (thanks to commenters Derek and Sandapanda).</p>

<p>P.P.S.  Colors of Iowa and New Mexico fixed (thanks to commenter David).</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>David noticed <a href="http://www.nytimes.com/2007/11/03/technology/03online.html?ref=todayspaper">this article</a> by Dan Mitchell reporting the well-known fact that people in richer countries tend to be less religious.  What about states in the U.S.?  We (that is, David Park, Joe Bafumi, Boris Shor, and I) look at it two ways.</p>

<p>First, here's a scatterplot of the 50 states, plotting average religious attendance vs. average income.  (Religious attendance is on a -2 to 2 scale, from "never" to "more than once a week," and average income was originally in dollars but has been rescaled to be centered at zero.):</p>

<p><!--<img alt="st.rel.inc0004.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/st.rel.inc0004.png" width="516" height="444" />--><br />
<img alt="st.rel.inc0004_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/st.rel.inc0004_small.png" width="450" height="347" /></p>

<p>States that voted for Bush in 2004 are in red and the Kerry-supporting states are blue.  You can see that people in richer states tend to be less religious, although the relation is far from a straight line.  There is also some regional variation (more religious attendance in the south, less in the northeast and west).</p>

<p>Second, here's a plot showing the correlation of religious attendance and individual income <em>within</em> each state.  We get a separate correlation for each state, and so we can plot these.  Here we plot the correlations vs. state income, using the same color scheme:</p>

<p><!--<img alt="corr.st.rel.inc0004.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/corr.st.rel.inc0004.png" width="516" height="444" />--><br />
<img alt="corr.st.rel.inc0004_small.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/corr.st.rel.inc0004_small.png" width="449" height="351" /></p>

<p>Again, there's quite a bit of variation from state to state, but overall we see a positive correlation between income and religiosity in poor states and a negative correlation in rich states.  To put it another way, in Mississippi, the richer people attend church more.  In Connecticut, the richer people attend church less.</p>

<p>(See also <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">here</a> for more on income and voting by state, and <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/income_religiou_1.html">here</a> for more on income, voting, and church attendance.)</p>

<p>P.S.  Typos fixed (thanks to commenters Derek and Sandapanda).</p>

<p>P.P.S.  Colors of Iowa and New Mexico fixed (thanks to commenter David).</p>]]>

</content>
</entry>
<entry>
<title>Comparing red states and blue states:  more pretty graphs</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/comparing_red_s.html" />
<modified>2007-11-02T14:13:11Z</modified>
<issued>2007-11-02T13:39:45Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1379</id>
<created>2007-11-02T13:39:45Z</created>
<content><![CDATA[<blockquote>"Like upscale areas everywhere, from Silicon Valley to Chicago's North Shore to suburban Connecticut, Montgomery County [Maryland] supported the Democratic ticket in last year's presidential election, by a margin of 63 percent to 34 percent." -- David Brooks, 2001.</blockquote>

<p>Some of the discussions of <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">our red-state, blue-state maps</a> picked up on the differences between where national journalists live (the mid-Atlantic states and California) and other parts of America.  The income-voting pattern is different in the red states and the blue states.  We have some more thoughts on this (<strong>scroll down for the pretty graphs</strong>). </p>

<p>David Brooks, the New York Times op-ed columnist and author of <em>Bobos in Paradise</em> and <em>On Paradise Drive</em>, explored the differences between Red and Blue America in an influential article, "One Nation, Slightly Divisible,'' in the <em>Atlantic Monthly</em> shortly after the 2000 election.  Sometimes described as the liberals' favorite conservative, Brooks embodies the red-blue division within himself.  He has liberal leanings on social issues but understands the enduring appeal of traditional values---"today's young people seem happy with the frankness of the left and the wholesomeness of the right,'' and his economic views are conservative but he sees the need for social cohesion among rich and poor.  His <em>Atlantic</em> article compared Montgomery County, Maryland, the liberal, upper-middle-class suburb where he and his friends live, to rural, conservative Franklin County, Pennsylvania, a short drive away but distant in attitudes and values, with "no Starbucks, no Pottery Barn, no Borders or Barnes & Noble,'' plenty of churches but not so many Thai restaurants, "a lot fewer sun-dried-tomato concoctions on restaurant menus and a lot more meatloaf platters.''</p>

<p>Brooks lives in a liberal, well-off area.  It is characteristic of the east and west coasts that the richer areas tend to be more liberal, but in other parts of the country, notably the south, the correlation goes the other way.  A comparable journey in Texas would go from Collin County, a suburb of Dallas where George W. Bush received 71% of the vote, to rural Zavala County in the southwest, where Bush received only 25%.</p>

<p><img alt="scatterplot_texas.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/scatterplot_texas.png" width="396" height="306" /></p>

<p>The graph above shows the pattern:  Collin and Zavala (the dark circles on the scatterplot) are the richest and poorest counties in Texas, and there is a clear pattern that poor counties supported the Democrats while the Republicans won in middle-class and rich counties.</p>

<p>When we showed this to a political scientist, he asked about the state capital, noted for its liberal attitudes, vibrant alternative rock scene, and the University of Texas:  "What about Austin?  It must be rich and liberal.''  We looked it up.  Austin is in Travis County and makes up almost all its population.  Travis County has a median household income of $45,000 and gave George W. Bush 53% of the vote, putting it about midway between Collin and Zavala counties in the graph.</p>

<p>By comparison, the next graph shows the counties of Brooks's home state of Maryland:  here there is no clear pattern of county income and Republican vote.  We have indicated Montgomery County, the prototypical wealthy slice of Blue America, in bold, and it is not difficult to find poorer, more Republican-supporting counties nearby as comparisons.  Rich and poor counties look different in Blue America than in Red America.</p>

<p><img alt="scatterplot_maryland.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/scatterplot_maryland.png" width="396" height="306" /></p>

<p>We can also look at income and voting for individual voters in each state.  In Texas, there is a strong relation between income and voting:</p>

<p><img alt="texas.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/texas.png" width="265" height="263" /></p>

<p>In Maryland, the pattern is much weaker:</p>

<p><img alt="maryland.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/maryland.png" width="270" height="259" /></p>

<p>And here, by popular demand, is the notorious Kansas:</p>

<p><img alt="kansas.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/kansas.png" width="263" height="254" /></p>

<p>P.S.  Just to be clear, I think Brooks's observations about cultural differences between red and blue America are interesting and important; you just have to be careful when aligning these with income or wealth.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<blockquote>"Like upscale areas everywhere, from Silicon Valley to Chicago's North Shore to suburban Connecticut, Montgomery County [Maryland] supported the Democratic ticket in last year's presidential election, by a margin of 63 percent to 34 percent." -- David Brooks, 2001.</blockquote>

<p>Some of the discussions of <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">our red-state, blue-state maps</a> picked up on the differences between where national journalists live (the mid-Atlantic states and California) and other parts of America.  The income-voting pattern is different in the red states and the blue states.  We have some more thoughts on this (<strong>scroll down for the pretty graphs</strong>). </p>

<p>David Brooks, the New York Times op-ed columnist and author of <em>Bobos in Paradise</em> and <em>On Paradise Drive</em>, explored the differences between Red and Blue America in an influential article, "One Nation, Slightly Divisible,'' in the <em>Atlantic Monthly</em> shortly after the 2000 election.  Sometimes described as the liberals' favorite conservative, Brooks embodies the red-blue division within himself.  He has liberal leanings on social issues but understands the enduring appeal of traditional values---"today's young people seem happy with the frankness of the left and the wholesomeness of the right,'' and his economic views are conservative but he sees the need for social cohesion among rich and poor.  His <em>Atlantic</em> article compared Montgomery County, Maryland, the liberal, upper-middle-class suburb where he and his friends live, to rural, conservative Franklin County, Pennsylvania, a short drive away but distant in attitudes and values, with "no Starbucks, no Pottery Barn, no Borders or Barnes & Noble,'' plenty of churches but not so many Thai restaurants, "a lot fewer sun-dried-tomato concoctions on restaurant menus and a lot more meatloaf platters.''</p>

<p>Brooks lives in a liberal, well-off area.  It is characteristic of the east and west coasts that the richer areas tend to be more liberal, but in other parts of the country, notably the south, the correlation goes the other way.  A comparable journey in Texas would go from Collin County, a suburb of Dallas where George W. Bush received 71% of the vote, to rural Zavala County in the southwest, where Bush received only 25%.</p>

<p><img alt="scatterplot_texas.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/scatterplot_texas.png" width="396" height="306" /></p>

<p>The graph above shows the pattern:  Collin and Zavala (the dark circles on the scatterplot) are the richest and poorest counties in Texas, and there is a clear pattern that poor counties supported the Democrats while the Republicans won in middle-class and rich counties.</p>

<p>When we showed this to a political scientist, he asked about the state capital, noted for its liberal attitudes, vibrant alternative rock scene, and the University of Texas:  "What about Austin?  It must be rich and liberal.''  We looked it up.  Austin is in Travis County and makes up almost all its population.  Travis County has a median household income of $45,000 and gave George W. Bush 53% of the vote, putting it about midway between Collin and Zavala counties in the graph.</p>

<p>By comparison, the next graph shows the counties of Brooks's home state of Maryland:  here there is no clear pattern of county income and Republican vote.  We have indicated Montgomery County, the prototypical wealthy slice of Blue America, in bold, and it is not difficult to find poorer, more Republican-supporting counties nearby as comparisons.  Rich and poor counties look different in Blue America than in Red America.</p>

<p><img alt="scatterplot_maryland.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/scatterplot_maryland.png" width="396" height="306" /></p>

<p>We can also look at income and voting for individual voters in each state.  In Texas, there is a strong relation between income and voting:</p>

<p><img alt="texas.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/texas.png" width="265" height="263" /></p>

<p>In Maryland, the pattern is much weaker:</p>

<p><img alt="maryland.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/maryland.png" width="270" height="259" /></p>

<p>And here, by popular demand, is the notorious Kansas:</p>

<p><img alt="kansas.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/kansas.png" width="263" height="254" /></p>

<p>P.S.  Just to be clear, I think Brooks's observations about cultural differences between red and blue America are interesting and important; you just have to be careful when aligning these with income or wealth.</p>]]>

</content>
</entry>
<entry>
<title>The effect of voter identification laws on turnout</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/the_effect_of_v.html" />
<modified>2007-11-02T04:18:31Z</modified>
<issued>2007-11-02T05:07:31Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1381</id>
<created>2007-11-02T05:07:31Z</created>
<content><![CDATA[<p>Mike Alvarez, Delia Bailey, and Jonathan Katz just completed <a href="http://jkatz.caltech.edu/research/files/wp1267.pdf">this paper</a>:</p>

<blockquote>Since the passage of the “Help America Vote Act” in 2002, nearly half of the states have adopted a variety of new identification requirements for voter registration and participation by the 2006 general election. . . . In this paper we document the effect of voter identification requirements on registered voters as they were imposed in states in the 2000 and 2004 presidential elections, and in the 2002 and 2006 midterm elections. Looking first at trends in the aggregate data, we find no evidence that voter identification requirements reduce participation. Using individual-level data from the Current Population Survey across these elections, however, we find that the strictest forms of voter identification requirements — combination requirements of presenting an identification card and positively matching one’s signature with a signature either on file or on the identification card, as well as requirements to show picture identification — have a negative impact on the participation of registered voters relative to the weakest requirement, stating one’s name. . . .</blockquote>

<p>It looks interesting to me.  It's a hard problem to study because the number of states with changes is not large.  Also, I'm still a little baffled by Figures 5 and 6, where it says that Pr(voting) > 90% for "an average registered voter".  Turnout isn't really that high!  I thought it was closer to 70%?  Perhaps something was set to zero rather than the middle of the distribution when computing the average?  It shouldn't make a difference for the comparison but it would be good to get that intercept settled to a reasonable value.  One thing that would also help would be to plot the raw data on top of some of these graphs to make sure the model is doing what it's supposed to be doing.</p>

<p>I was also suspicious that in Figure 6, the confidence intervals have the same width for nonwhite as for white respondents.  There are fewer nonwhites, so I'd think their confidence intervals should be wider.  But since the model has no interactions, I guess it makes sense for the confidence intervals to have the same width here.  Also, I'd take Figures 7-9 and make them as one figure with 8 columns and 3 rows; this would allow easier comparisons.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Mike Alvarez, Delia Bailey, and Jonathan Katz just completed <a href="http://jkatz.caltech.edu/research/files/wp1267.pdf">this paper</a>:</p>

<blockquote>Since the passage of the “Help America Vote Act” in 2002, nearly half of the states have adopted a variety of new identification requirements for voter registration and participation by the 2006 general election. . . . In this paper we document the effect of voter identification requirements on registered voters as they were imposed in states in the 2000 and 2004 presidential elections, and in the 2002 and 2006 midterm elections. Looking first at trends in the aggregate data, we find no evidence that voter identification requirements reduce participation. Using individual-level data from the Current Population Survey across these elections, however, we find that the strictest forms of voter identification requirements — combination requirements of presenting an identification card and positively matching one’s signature with a signature either on file or on the identification card, as well as requirements to show picture identification — have a negative impact on the participation of registered voters relative to the weakest requirement, stating one’s name. . . .</blockquote>

<p>It looks interesting to me.  It's a hard problem to study because the number of states with changes is not large.  Also, I'm still a little baffled by Figures 5 and 6, where it says that Pr(voting) > 90% for "an average registered voter".  Turnout isn't really that high!  I thought it was closer to 70%?  Perhaps something was set to zero rather than the middle of the distribution when computing the average?  It shouldn't make a difference for the comparison but it would be good to get that intercept settled to a reasonable value.  One thing that would also help would be to plot the raw data on top of some of these graphs to make sure the model is doing what it's supposed to be doing.</p>

<p>I was also suspicious that in Figure 6, the confidence intervals have the same width for nonwhite as for white respondents.  There are fewer nonwhites, so I'd think their confidence intervals should be wider.  But since the model has no interactions, I guess it makes sense for the confidence intervals to have the same width here.  Also, I'd take Figures 7-9 and make them as one figure with 8 columns and 3 rows; this would allow easier comparisons.</p>]]>

</content>
</entry>
<entry>
<title>A statistician does web analytics</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/how_would_a_sta.html" />
<modified>2007-11-01T20:51:29Z</modified>
<issued>2007-11-01T20:15:50Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1380</id>
<created>2007-11-01T20:15:50Z</created>
<content><![CDATA[<p>I sometimes play with <a href="http://analytics.google.com">Google Analytics</a> to see the number of daily visitors on our blog and where they are coming from. The charts of daily visits look a bit like this:</p>

<p><img alt="googanal.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/googanal.png" width="460" height="85" /></p>

<p>Clearly, there is an upwards trend, but the influence of the day of the week messes everything up. I exported the data into a text file, and typed a line into <a href="http://www.r-project.org/">R</a>:<br />
<blockquote><i><br />
plot(stl(ts(read.table('visitors')[[1]],frequency=7),s.window="periodic"))</i><br />
</blockquote><br />
<img alt="decompose.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/decompose.png" width="482" height="400" /></p>

<p>The trend component shows what I am really interested in: the trough of summer, followed by a relatively consistent rising trend. Every now and then another site will refer to our blog, temporarily increasing the traffic, and Andrew's <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">cool voting plots</a> are responsible for the latest spike. </p>

<p>Setting the <i>stl</i> function's <i>t.window</i> parameter to 14, 21 or more will smooth the trend a bit more. The model is imperfect because new visitors do come in bursts, but leave more slowly. Perhaps we should do a better Bayesian model for time series decomposition, unless someone else has already done this.</p>]]></content>
<author>
<name>Aleks</name>
<url>http://stat.columbia.edu/~jakulin</url>
<email>jakulin@gmail.com</email>
</author>
<dc:subject>Miscellaneous Statistics</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>I sometimes play with <a href="http://analytics.google.com">Google Analytics</a> to see the number of daily visitors on our blog and where they are coming from. The charts of daily visits look a bit like this:</p>

<p><img alt="googanal.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/googanal.png" width="460" height="85" /></p>

<p>Clearly, there is an upwards trend, but the influence of the day of the week messes everything up. I exported the data into a text file, and typed a line into <a href="http://www.r-project.org/">R</a>:<br />
<blockquote><i><br />
plot(stl(ts(read.table('visitors'),frequency=7),s.window="periodic"))</i><br />
</blockquote><br />
<img alt="decompose.png" src="http://www.stat.columbia.edu/~cook/movabletype/mlm/decompose.png" width="482" height="400" /></p>

<p>The trend component shows what I am really interested in: the trough of summer, followed by a relatively consistent rising trend. Every now and then another site will refer to our blog, temporarily increasing the traffic, and Andrew's <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">cool voting plots</a> are responsible for the latest spike. </p>

<p>Setting the <i>stl</i> function's <i>t.window</i> parameter to 14, 21 or more will smooth the trend a bit more. The model is imperfect because new visitors do come in bursts, but leave more slowly. Perhaps we should do a better Bayesian model for time series decomposition, unless someone else has already done this.</p>]]>
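<![CDATA[<p>As a rough sketch of the <i>t.window</i> adjustment, here is the same decomposition run on made-up data. The simulated series, the variable names, and the choice of 21 are all assumptions for illustration; the real input would be the exported visitor counts.</p>

```r
# Hypothetical stand-in for the exported visitor counts: 20 weeks of daily
# data with a rising trend, a weekday effect, and noise (all numbers made up).
set.seed(1)
n <- 140
weekday <- rep(c(0, 5, 8, 8, 6, 2, -29), length.out = n)
visits <- 300 + 0.5 * seq_len(n) + weekday + rnorm(n, sd = 10)

# stl() wants a univariate series; frequency = 7 marks the weekly cycle.
x <- ts(visits, frequency = 7)

# Same call as in the post, then again with a wider trend window.
fit_default <- stl(x, s.window = "periodic")
fit_smooth  <- stl(x, s.window = "periodic", t.window = 21)

# Pull out the two trend components for comparison.
trend_default <- fit_default$time.series[, "trend"]
trend_smooth  <- fit_smooth$time.series[, "trend"]

plot(fit_smooth)
```

<p>R's documentation says <i>t.window</i> should be odd, hence 21 rather than 14 here. Note also that <i>read.table</i> returns a data frame; if <i>stl</i> complains that only univariate series are allowed, pulling out the column first (for example <i>read.table('visitors')[, 1]</i>) should help.</p>]]>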

</content>
</entry>
<entry>
<title>How to tell the difference between a theoretical statistician and an applied statistician</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/11/how_to_tell_the.html" />
<modified>2007-11-01T15:24:23Z</modified>
<issued>2007-11-01T15:21:06Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1378</id>
<created>2007-11-01T15:21:06Z</created>
<content><![CDATA[<p>The theoretical statistician uses x, the applied statistician uses y (because we reserve x for predictors).</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Miscellaneous Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>The theoretical statistician uses x, the applied statistician uses y (because we reserve x for predictors).</p>]]>

</content>
</entry>
<entry>
<title>Controversies over posterior predictive checks</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/controversies_o.html" />
<modified>2007-10-31T13:13:10Z</modified>
<issued>2007-10-31T13:00:16Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1377</id>
<created>2007-10-31T13:00:16Z</created>
<content><![CDATA[<p>This is a long one, but it's good stuff (if you like this sort of thing).  Dana Kelly writes,</p>

<blockquote>We've corresponded on this issue in the past, and I mentioned that I had been taken to task by a referee who claimed that model checks that rely on the posterior predictive distribution are invalid because they use the data twice.</blockquote>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Bayesian Statistics</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>This is a long one, but it's good stuff (if you like this sort of thing).  Dana Kelly writes,</p>

<blockquote>We've corresponded on this issue in the past, and I mentioned that I had been taken to task by a referee who claimed that model checks that rely on the posterior predictive distribution are invalid because they use the data twice.</blockquote>]]>
<![CDATA[<blockquote>I've finally tracked down the paper (<a href="http://www.stat.duke.edu/~berger/papers/98-16.ps">see attached</a>) that I think is behind this criticism, and I'd like to get your thoughts on it.  I've thought about this issue before and after the referee's critique, and I still don't see any basic flaw in using the observed data to calculate a statistic based on the posterior predictive distribution.

<p>There are some points in the paper that I find confusing:</p>

<p>1. On p. 9, the authors state that the posterior predictive p-value is not really a p-value because "for large sample sizes, [it] will essentially equal the classical p-value, which is quite non-Bayesian in character."  It seems that, on these grounds, one could claim the posterior mean of a parameter is non-Bayesian, because, for large sample sizes, it will essentially equal the classical MLE, which is non-Bayesian.</p>

<p>2. On pp. 9-10, the authors mention the "apparent double use of the data...to convert the prior [into a posterior] and then for computing the tail area."  They then say they will show an example of the unnatural behavior this double use of the data can induce.  The following example seems to me artificial, having no features of any practical problem that I can envision.  In Example 4.4, letting the absolute value of the mean of the observed values go to infinity makes no sense to me, and therefore I don't see the validity of the conclusion about the lower bound on the p-value, "which can be directly traced to the double use of the data."</p>

<p>3. The conditional predictive p-value introduced by the authors in Sec. 4.2 seems computationally impractical.  On p. 12 the authors mention the difficulty of such an approach for discrete data.  Also, the effort on p. 16 to show a connection with a classical p-value seems odd, given the authors' criticism of the posterior predictive p-value on just these grounds.</p>

<p>4. The partial posterior predictive p-value seems more computationally tractable, but is harder to compute than the posterior predictive p-value.  Also, there appear to be some errors in the math on pp. 17-18, but I need to do some more checking to be sure. </p>

<p>I very much appreciate any thoughts you have on this paper and the larger issue of using the posterior predictive distribution for model-checking.  As I've said in the past, I have become a very big fan of these kinds of checks, as described in Ch. 6 of your text.<br />
 <br />
If you think this topic is worthy of discussion on your blog, feel free to post it there.  Thanks in advance.</p></blockquote>

<p>My reply:</p>

<p>The posterior predictive statements are indeed Bayesian, as described in Chapter 6 of Bayesian Data Analysis and also in my 1996 paper with Meng and Stern: they are statements about p(y.rep|y), where y is the observed data and y.rep are hypothetical replications from the model.  These are posterior probability statements.  I agree with your statement that, just because a Bayesian calculation is close to a classical calculation, that doesn't make it non-Bayesian.</p>

<p>In addition, I disagree with the double-use claim:  as noted in the probability expression above, the posterior predictive distribution conditions on the data y exactly once.  It's ok to consider these other methods also but, to me, the posterior predictive check is a more direct approach.</p>
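
<p>For reference, the expression referred to above can be written out in the notation of Chapter 6 of Bayesian Data Analysis. The symbols θ (the model parameters) and T (the test quantity) are not used elsewhere in this post and are introduced here only for the sketch:</p>

```latex
p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta,
\qquad
p_B = \Pr\bigl( T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \,\bigm|\, y \bigr).
```

<p>The probability in p_B is taken over the joint posterior distribution of θ and y.rep, which is the sense in which it is a posterior probability statement.</p>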

<p>We further discuss some of these issues in <a href="http://www.stat.columbia.edu/~gelman/research/published/STS235A.pdf">this discussion of a forthcoming paper by Bayarri and Castellanos</a>.  (Their paper and other discussions can be found on the journal's <a href="http://www.imstat.org/sts/future_papers.html">website</a>.)</p>

<p>Hal Stern also adds, regarding the beginning of Dana's question,</p>

<blockquote>The referee's comment that the p-value is "invalid" is not very precise.  I suspect he/she is referring to the published results re the absence of uniform distribution under the null.   Posterior predictive checks are clearly valid in the sense that a small posterior predictive probability indicates that the observed data feature is extremely unlikely given the model. </blockquote>

<p>Dana then pointed me to a recent paper by Val Johnson citing the lack of asymptotic uniformity of the posterior predictive p-value as a drawback to its use in model checking.</p>

<p>I replied that I am a big fan of Val Johnson's work but that he seems to focus on analytical methods where I favor graphics.  It's a stylistic difference as much as anything--if you look at his paper carefully he does say that posterior predictive checks are ok as diagnostics.  As my colleagues and I and others have discussed in various papers, I don't see why the uniform distribution is necessary.  It's more important to me to have these p-values have direct interpretations as posterior probabilities.</p>]]>
</content>
</entry>
<entry>
<title>Skepticism about empirical studies?</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/skepticism_abou.html" />
<modified>2007-10-31T13:10:47Z</modified>
<issued>2007-10-31T12:56:10Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1376</id>
<created>2007-10-31T12:56:10Z</created>
<content><![CDATA[<p>Nick Firoozye writes,</p>

<blockquote>I [Firoozye] wanted to point your attention to <a href="http://www.econtalk.org/archives/2007/10/ayres_on_super.html">the following podcast</a> by Ian Ayres on Supercrunchers, where he shows himself an enthusiastic (if perhaps a bit naïve) proponent of the statistical method.  Entertaining, definitely. One thing though that I thought you might be interested in is Russ Roberts’ (the interviewer's) own <a href="http://cafehayek.typepad.com/hayek/2007/10/fooling-ourselv.html">skepticism over the econometric method</a>, which I think probably warrants a response. It may be that Roberts’ own view is due to his now-Austrian economics slant (i.e., somewhat anti-formalist approach) or perhaps to the fact that mainstream econometrics is a frequentist pursuit and one might question the honesty of the results as a consequence.</blockquote>

<p>I don't really have much to add here, except that the problem noted by Roberts (it's hard to know whether to believe a statistical study) is even more of a problem with <em>non-statistical</em> empirical studies (i.e., anecdotes).  I think Roberts might be overstating the problem because he is focusing on issues where he already had a strong personal opinion even before seeing data analyses.  (He mentions the examples of concealed handguns and anti-theft devices on cars.)  But there are a lot of areas where we have only weak opinions which can indeed be swayed by data (<a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2006/01/post_1.html">see here</a> for some examples).  These cases are important in their own right and also can serve as benchmarks for the success of statistical analysis, so that we can trust good analyses more when they're applied to tougher problems.  This is one way that applied statistics proceeds, by exemplary analyses of problems that might not be hugely important on their own terms but serve as useful templates.  Consider, for example, the book by Snedecor and Cochran:  it's full of examples on agricultural field trials.  Sure, these are important, but these methods have been useful in so many other fields.  This is a great example, actually:  Snedecor and his colleagues worked on agricultural trials because they cared about the results--these were not "toy examples" or thought experiments--and the resulting methods endured.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Economics</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Nick Firoozye writes,</p>

<blockquote>I [Firoozye] wanted to point your attention to <a href="http://www.econtalk.org/archives/2007/10/ayres_on_super.html">the following podcast</a> by Ian Ayres on Supercrunchers, where he shows himself an enthusiastic (if perhaps a bit naïve) proponent of the statistical method.  Entertaining, definitely. One thing though that I thought you might be interested in is Russ Roberts’ (the interviewer's) own <a href="http://cafehayek.typepad.com/hayek/2007/10/fooling-ourselv.html">skepticism over the econometric method</a>, which I think probably warrants a response. It may be that Roberts’ own view is due to his now-Austrian economics slant (i.e., somewhat anti-formalist approach) or perhaps to the fact that mainstream econometrics is a frequentist pursuit and one might question the honesty of the results as a consequence.</blockquote>

<p>I don't really have much to add here, except that the problem noted by Roberts (it's hard to know whether to believe a statistical study) is even more of a problem with <em>non-statistical</em> empirical studies (i.e., anecdotes).  I think Roberts might be overstating the problem because he is focusing on issues where he already had a strong personal opinion even before seeing data analyses.  (He mentions the examples of concealed handguns and anti-theft devices on cars.)  But there are a lot of areas where we have only weak opinions which can indeed be swayed by data (<a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2006/01/post_1.html">see here</a> for some examples).  These cases are important in their own right and also can serve as benchmarks for the success of statistical analysis, so that we can trust good analyses more when they're applied to tougher problems.  This is one way that applied statistics proceeds, by exemplary analyses of problems that might not be hugely important on their own terms but serve as useful templates.  Consider, for example, the book by Snedecor and Cochran:  it's full of examples on agricultural field trials.  Sure, these are important, but these methods have been useful in so many other fields.  This is a great example, actually:  Snedecor and his colleagues worked on agricultural trials because they cared about the results--these were not "toy examples" or thought experiments--and the resulting methods endured.</p>]]>

</content>
</entry>
<entry>
<title>Why does the press cover the horse race, not policies?</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/why_does_the_pr.html" />
<modified>2007-10-30T17:02:41Z</modified>
<issued>2007-10-30T16:35:16Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1375</id>
<created>2007-10-30T16:35:16Z</created>
<content><![CDATA[<p>Paul Krugman <a href="http://krugman.blogs.nytimes.com/2007/10/29/were-doomed/">writes</a>,</p>

<blockquote>The news media seem determined to <a href="http://www.journalism.org/node/8187">destroy the republic</a>:

<blockquote>In all, 63% of the campaign stories focused on political and tactical aspects of the campaign. That is nearly four times the number of stories about the personal backgrounds of the candidates (17%) or the candidates’ ideas and policy proposals (15%). And just 1% of stories examined the candidates’ records or past public performance, the study found.</blockquote> 

<p>And:</p>

<blockquote>The press’ focus on fundraising, tactics and polling is even more evident if one looks at how stories were framed rather than the topic of the story. Just 12% of stories examined were presented in a way that explained how citizens might be affected by the election, while nearly nine-out-of-ten stories (86%) focused on matters that largely impacted only the parties and the candidates.</blockquote></blockquote>

<p>This has always bothered me too.  One reason Gary and I did our research on <a href="http://www.stat.columbia.edu/~gelman/research/published/bjps1993.pdf">why are American Presidential election campaign polls so variable when votes are so predictable</a> was that we wanted to convince the news media to do more substantive stories and less polling.  Our point was that general elections for president are generally determined by fundamental variables, not short-term news or bandwagon effects--things are different for primary elections, which have multiple candidates and are inherently unstable--and so this horse-race coverage was a waste of time.</p>

<p><strong>Why, then?</strong></p>

<p>Nonetheless, horse-race coverage persists.  I don't know whether it's worse than before--the site linked to by Krugman does not have comparative time series data--but it's still there.  I'd also include the ridiculously-frequent polling as an example of this problem.  Anyway, why is it still happening?</p>

<p>My theory, at least for the general election, is that most of the voters have already decided who they're going to vote for--and even the ones who haven't decided are often more predictable than they realize.  Suppose, for example, that 40% have pretty much already decided they'll vote for the Democrat, 40% will vote for the Republican, and the fight is over the remaining 20%--most of whom do not follow politics closely in any case.  Now think of the audience for political news.  80% of the people don't need to know the candidates' positions--they've already decided their votes--but they're intensely interested in the horse race:  are "we" going to win or lose?  The substantive coverage that Krugman and I might want is really just for 20% of the audience.  So, from that perspective, it makes sense for the media to give people the horse race.  (Yes, survey respondents say they want more on candidates' positions on issues and less on which candidate is leading in the polls--but I don't know that I believe people when they say this.)</p>

<p>That said, when talking about the primary elections, yeah, I think it would make sense for the media to report more on where the candidates stand on issues.</p>

<p>Finally, one thing that does surprise me is that they don't run more stories on the sources of the candidates' money.  Y'know, <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/01/vin_scully_is_a.html">this sort of thing</a>.  It's fun and also could be informative about the candidates.</p>

<p>P.S.  The histogram <a href="http://www.journalism.org/node/8187">here</a> should be a horizontal dotplot.  Trying to read a colored plot with a key on the side--that's bad news.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Paul Krugman <a href="http://krugman.blogs.nytimes.com/2007/10/29/were-doomed/">writes</a>,</p>

<blockquote>The news media seem determined to <a href="http://www.journalism.org/node/8187">destroy the republic</a>:

<blockquote>In all, 63% of the campaign stories focused on political and tactical aspects of the campaign. That is nearly four times the number of stories about the personal backgrounds of the candidates (17%) or the candidates’ ideas and policy proposals (15%). And just 1% of stories examined the candidates’ records or past public performance, the study found.</blockquote> 

<p>And:</p>

<blockquote>The press’ focus on fundraising, tactics and polling is even more evident if one looks at how stories were framed rather than the topic of the story. Just 12% of stories examined were presented in a way that explained how citizens might be affected by the election, while nearly nine-out-of-ten stories (86%) focused on matters that largely impacted only the parties and the candidates.</blockquote></blockquote>

<p>This has always bothered me too.  One reason Gary and I did our research on <a href="http://www.stat.columbia.edu/~gelman/research/published/bjps1993.pdf">why are American Presidential election campaign polls so variable when votes are so predictable</a> was that we wanted to convince the news media to do more substantive stories and less polling.  Our point was that general elections for president are generally determined by fundamental variables, not short-term news or bandwagon effects--things are different for primary elections, which have multiple candidates and are inherently unstable--and so this horse-race coverage was a waste of time.</p>

<p><strong>Why, then?</strong></p>

<p>Nonetheless, horse-race coverage persists.  I don't know whether it's worse than before--the site linked to by Krugman does not have comparative time series data--but it's still there.  I'd also include the ridiculously-frequent polling as an example of this problem.  Anyway, why is it still happening?</p>

<p>My theory, at least for the general election, is that most of the voters have already decided who they're going to vote for--and even the ones who haven't decided are often more predictable than they realize.  Suppose, for example, that 40% have pretty much already decided they'll vote for the Democrat, 40% will vote for the Republican, and the fight is over the remaining 20%--most of whom do not follow politics closely in any case.  Now think of the audience for political news.  80% of the people don't need to know the candidates' positions--they've already decided their votes--but they're intensely interested in the horse race:  are "we" going to win or lose?  The substantive coverage that Krugman and I might want is really just for 20% of the audience.  So, from that perspective, it makes sense for the media to give people the horse race.  (Yes, survey respondents say they want more on candidates' positions on issues and less on which candidate is leading in the polls--but I don't know that I believe people when they say this.)</p>

<p>That said, when talking about the primary elections, yeah, I think it would make sense for the media to report more on where the candidates stand on issues.</p>

<p>Finally, one thing that does surprise me is that they don't run more stories on the sources of the candidates' money.  Y'know, <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/01/vin_scully_is_a.html">this sort of thing</a>.  It's fun and also could be informative about the candidates.</p>

<p>P.S.  The histogram <a href="http://www.journalism.org/node/8187">here</a> should be a horizontal dotplot.  Trying to read a colored plot with a key on the side--that's bad news.</p>]]>

</content>
</entry>
<entry>
<title>Have your students work in pairs:  some thoughts and self-criticism</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/have_your_stude.html" />
<modified>2007-10-30T16:23:17Z</modified>
<issued>2007-10-30T09:41:48Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1371</id>
<created>2007-10-30T09:41:48Z</created>
<content><![CDATA[<p>Seth posts <a href="http://www.blog.sethroberts.net/2007/10/28/i-learned-that-if-i-really-wanted-to-i-could-conquer-my-fear/">this account</a> by a college student who went back to her high school to give a guest lecture on depression to "Mr. Tinloy’s 3rd period psychology class."  Her feelings in preparing and delivering her lecture were pretty similar to my own feelings before doing this sort of thing, and I've been doing it for over 20 years!</p>

<p>The college student's presentation seemed to go well--the students were polite, got involved in discussion a bit, and clapped at the end, and the teacher was helpful in keeping things focused--but when she talked with some friends afterward, one said "she was fighting to stay awake, because the topic did not interest her one bit," and another said that "it was boring because she wasn’t all that interested in what I was talking about, but it got more interesting toward the end when other students started to talk. `Nobody likes guest speakers, so it’s okay.'"</p>

<p>I have a few thoughts:</p>

<p>1.  I suspect the student's presentation to the high school kids would've gone even better if she'd had them working in pairs to discuss the material.  When students are working in pairs, they seem less likely to drift off; also, with two students there is more of a chance that one of them is interested in the topic.</p>

<p>2.  It's interesting but perhaps not so surprising that depression is not an interesting topic to the high school students.  Maybe they'd be more interested if it were framed in terms of being happy or sad, or good moods and bad moods?  Even those of us who feel far from "depression" get sad or demoralized on occasion.</p>

<p>3.  My own <a href="http://www.stat.columbia.edu/~gelman/research/presentations/">lectures to outside audiences</a> seem to go well (in that people say nice things to me afterwards about the presentations) but I usually have difficulty getting people actively involved.  It often seems that my talks don't have "hooks" to grab the audience and motivate them to ask questions and think hard.  They more often sit there passively, enjoying it (I hope) but not actively engaged.  Maybe I should have them work in pairs.  I do this for college students and even grad students--it always surprises them, but they like it--but I've rarely had the nerve to try it with nonstudents.</p>

<p>4.  In the continuing theme of not practicing what we preach, I should point out that my comments above (including the title of this blog entry) are not based on any systematic research, just on my informal observations of what seems to have worked and not worked for me in the past.  (Although it does seem consistent with the literature on active learning, as I've absorbed it by reading a few books on the topic.)  What I'm missing is (a) careful experimentation (assigning treatments--different teaching methods--unconfounded with important variables such as characteristics of the class), and (b) outcome measures such as surveys of student satisfaction and performance on standardized tests.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Teaching</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Seth posts <a href="http://www.blog.sethroberts.net/2007/10/28/i-learned-that-if-i-really-wanted-to-i-could-conquer-my-fear/">this account</a> by a college student who went back to her high school to give a guest lecture on depression to "Mr. Tinloy’s 3rd period psychology class."  Her feelings in preparing and delivering her lecture were pretty similar to my own feelings before doing this sort of thing, and I've been doing it for over 20 years!</p>

<p>The college student's presentation seemed to go well--the students were polite, got involved in discussion a bit, and clapped at the end, and the teacher was helpful in keeping things focused--but when she talked with some friends afterward, one said "she was fighting to stay awake, because the topic did not interest her one bit," and another said that "it was boring because she wasn’t all that interested in what I was talking about, but it got more interesting toward the end when other students started to talk. `Nobody likes guest speakers, so it’s okay.'"</p>

<p>I have a few thoughts:</p>

<p>1.  I suspect the student's presentation to the high school kids would've gone even better if she'd had them working in pairs to discuss the material.  When students are working in pairs, they seem less likely to drift off; also, with two students there is more of a chance that one of them is interested in the topic.</p>

<p>2.  It's interesting but perhaps not so surprising that depression is not an interesting topic to the high school students.  Maybe they'd be more interested if it were framed in terms of being happy or sad, or good moods and bad moods?  Even those of us who feel far from "depression" get sad or demoralized on occasion.</p>

<p>3.  My own <a href="http://www.stat.columbia.edu/~gelman/research/presentations/">lectures to outside audiences</a> seem to go well (in that people say nice things to me afterwards about the presentations) but I usually have difficulty getting people actively involved.  It often seems that my talks don't have "hooks" to grab the audience and motivate them to ask questions and think hard.  They more often sit there passively, enjoying it (I hope) but not actively engaged.  Maybe I should have them work in pairs.  I do this for college students and even grad students--it always surprises them, but they like it--but I've rarely had the nerve to try it with nonstudents.</p>

<p>4.  In the continuing theme of not practicing what we preach, I should point out that my comments above (including the title of this blog entry) are not based on any systematic research, just on my informal observations of what seems to have worked and not worked for me in the past.  (Although it does seem consistent with the literature on active learning, as I've absorbed it by reading a few books on the topic.)  What I'm missing is (a) careful experimentation (assigning treatments--different teaching methods--unconfounded with important variables such as characteristics of the class), and (b) outcome measures such as surveys of student satisfaction and performance on standardized tests.</p>]]>

</content>
</entry>
<entry>
<title>More precision on income and voting?</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/more_precision.html" />
<modified>2007-10-30T03:08:09Z</modified>
<issued>2007-10-30T05:57:12Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1374</id>
<created>2007-10-30T05:57:12Z</created>
<content><![CDATA[<p>Jeff writes,</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Political Science</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Jeff writes,</p>]]>
<![CDATA[<blockquote>Some questions of interest, picking up from your <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/some_cool_graph.html">trio of income/state winners graphs</a>:

<p>At what income cutoff ($) would Dems and Repubs tie in electoral votes?  What national income percentile is this?</p>

<p>What if you take the top x% within each state by income?  (How low do you have to go before the Dems can win?)</p></blockquote>

<p>My reply:  I imagine the threshold would be close to the 50th percentile, just by symmetry since each party got about half the vote, and when we last looked, the electoral college had little partisan bias (i.e., the seats-votes curve goes close to 50/50; see <a href="http://www.stat.columbia.edu/~gelman/research/published/elect_college_oxford.pdf">here</a>).</p>

<p>But I'm wary of trying to get a more precise answer because this would lean too heavily on the assumed linearity of the income predictor in the model.  Linearity fits reasonably well, but still, this is the sort of modeling exercise I wouldn't trust.  I certainly wouldn't trust the linear model on theoretical grounds, given that the categories are just the arbitrary divisions used by NES.</p>]]>
</content>
</entry>
<entry>
<title>Distinguishing association from causation</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/distinguishing.html" />
<modified>2007-10-29T21:48:11Z</modified>
<issued>2007-10-29T20:56:23Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1372</id>
<created>2007-10-29T20:56:23Z</created>
<content><![CDATA[<p>I was pointed to <a href="http://www.acsh.org/publications/pubID.1629/pub_detail.asp">Distinguishing Association from Causation: A Background for Journalists</a> (there is also a <a href="http://www.acsh.org/docLib/20071029_AssociationCausation.pdf">PDF</a> version). Here is my summary of their executive summary:<br />
<ul><br />
<li>Scientific studies that show an association between a factor and a health effect do not necessarily imply that the factor causes the health effect. </li><br />
<li>Randomized trials are studies in which human volunteers are randomly assigned to receive either the agent being studied or an inactive placebo, usually under double-blind conditions.</li><br />
<li>The findings of animal experiments may not be directly applicable to the human situation because of genetic, anatomic, and physiologic differences between species and/or because of the use of unrealistically high doses.</li><br />
<li>In vitro experiments are useful for defining and isolating biologic mechanisms but are not directly applicable to humans.</li><br />
<li>The findings from observational epidemiologic studies are directly applicable to humans, but the associations detected in such studies are not necessarily causal. </li><br />
<li> Useful, time-tested criteria for determining whether an association is causal include:<br />
<ul><li> Temporality. For an association to be causal, the cause must precede the effect.</li><br />
<li>Strength. Scientists can be more confident in the causality of strong associations than weak ones.</li><br />
<li>Dose-response. Responses that increase in frequency as exposure increases are more convincingly supportive of causality than those that do not show this pattern.</li><br />
<li>Consistency. Relationships that are repeatedly observed by different investigators, in different places, circumstances, and times, are more likely to be causal.</li><br />
<li>Biological plausibility. Associations that are consistent with the scientific understanding of the biology of the disease or health effect under investigation are more likely to be causal.</li></ul></li><br />
<li>Studies that include appropriate statistical analysis and that have been published in peer-reviewed journals carry greater weight than those that lack statistical analysis and/or have been announced in other ways.</li><br />
<li>Claims of causation should never be made lightly.</li><br />
</ul></p>

<p>But all this isn't about <a href="http://poli.haifa.ac.il/~levi/inference.html">causation</a> vs. association; it's about better studies versus worse studies. Association and causation are not binary categories. Instead, there is a continuum from simple models on observational data (correlation between two variables), through more sophisticated models on observational data that include covariates (regression, structural equation models), through yet more sophisticated models on observational data that take sample selection bias into consideration (Rubin's propensity score approach), to often simple models on controlled data (randomized experiments). But the mysterious causal "truth" is still out there. If one talks to philosophers these days, they don't even consider the notion of causality powerful enough as a model of reality.</p>

<p>In the past, I've often unfairly complained about studies after having read misleading journalistic reports, so this report is a timely one. But since the report has been paid for by <a href="http://www.cspinet.org/integrity/nonprofits/american_council_on_science_and_health.html">large pharma corporations</a>, people may wonder whether there is bias or some sort of agenda in it.</p>

<p>My quick impression is that they're promoting the best practices in statistical methodology that all these companies subscribe to. But there could be greater use of cheaper observational studies with better modeling (such as employing the propensity score approach, or even just better regression modeling) compared to expensive randomized experiments, and society might be better off as a result. Moreover, there is the issue of <a href="http://www.uoregon.edu/~mgall/statistical_significance_v.htm">statistical versus practical significance</a>. What do you think?</p>]]></content>
<author>
<name>Aleks</name>
<url>http://stat.columbia.edu/~jakulin</url>
<email>jakulin@gmail.com</email>
</author>
<dc:subject>Miscellaneous Statistics</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>I was pointed to <a href="http://www.acsh.org/publications/pubID.1629/pub_detail.asp">Distinguishing Association from Causation: A Background for Journalists</a> (there is also a <a href="http://www.acsh.org/docLib/20071029_AssociationCausation.pdf">PDF</a> version). Here is my summary of their executive summary:<br />
<ul><br />
<li>Scientific studies that show an association between a factor and a health effect do not necessarily imply that the factor causes the health effect. </li><br />
<li>Randomized trials are studies in which human volunteers are randomly assigned to receive either the agent being studied or an inactive placebo, usually under double-blind conditions.</li><br />
<li>The findings of animal experiments may not be directly applicable to the human situation because of genetic, anatomic, and physiologic differences between species and/or because of the use of unrealistically high doses.</li><br />
<li>In vitro experiments are useful for defining and isolating biologic mechanisms but are not directly applicable to humans.</li><br />
<li>The findings from observational epidemiologic studies are directly applicable to humans, but the associations detected in such studies are not necessarily causal. </li><br />
<li> Useful, time-tested criteria for determining whether an association is causal include:<br />
<ul><li> Temporality. For an association to be causal, the cause must precede the effect.</li><br />
<li>Strength. Scientists can be more confident in the causality of strong associations than weak ones.</li><br />
<li>Dose-response. Responses that increase in frequency as exposure increases are more convincingly supportive of causality than those that do not show this pattern.</li><br />
<li>Consistency. Relationships that are repeatedly observed by different investigators, in different places, circumstances, and times, are more likely to be causal.</li><br />
<li>Biological plausibility. Associations that are consistent with the scientific understanding of the biology of the disease or health effect under investigation are more likely to be causal.</li></ul></li><br />
<li>Studies that include appropriate statistical analysis and that have been published in peer-reviewed journals carry greater weight than those that lack statistical analysis and/or have been announced in other ways.</li><br />
<li>Claims of causation should never be made lightly.</li><br />
</ul></p>

<p>But all this isn't about <a href="http://poli.haifa.ac.il/~levi/inference.html">causation</a> vs. association; it's about better studies versus worse studies. Association and causation are not binary categories. Instead, there is a continuum from simple models on observational data (correlation between two variables), through more sophisticated models on observational data that include covariates (regression, structural equation models), through yet more sophisticated models on observational data that take sample selection bias into consideration (Rubin's propensity score approach), to often simple models on controlled data (randomized experiments). But the mysterious causal "truth" is still out there. If one talks to philosophers these days, they don't even consider the notion of causality powerful enough as a model of reality.</p>

<p>In the past, I've often unfairly complained about studies after having read misleading journalistic reports, so this report is a timely one. But since the report has been paid for by <a href="http://www.cspinet.org/integrity/nonprofits/american_council_on_science_and_health.html">large pharma corporations</a>, people may wonder whether there is bias or some sort of agenda in it.</p>

<p>My quick impression is that they're promoting the best practices in statistical methodology that all these companies subscribe to. But there could be greater use of cheaper observational studies with better modeling (such as employing the propensity score approach, or even just better regression modeling) compared to expensive randomized experiments, and society might be better off as a result. Moreover, there is the issue of <a href="http://www.uoregon.edu/~mgall/statistical_significance_v.htm">statistical versus practical significance</a>. What do you think?</p>]]>

</content>
</entry>
<entry>
<title>Anova</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/anova.html" />
<modified>2007-10-29T02:38:37Z</modified>
<issued>2007-10-29T05:18:11Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1369</id>
<created>2007-10-29T05:18:11Z</created>
<content><![CDATA[<p>Cari Kaufman writes,</p>

<blockquote>
I am writing a paper on using Gaussian processes for Bayesian functional ANOVA, and I'd like to draw some connections to <a href="http://www.stat.columbia.edu/~gelman/research/published/AOS259.pdf">your 2005 Annals paper</a>.  In my own work I've chosen to use a 1-1 reparameterization of the cell means, that is, to constrain the levels within each factor.  But I am intrigued by your use of exchangeable levels for all factors, and I'm hoping you can take a few minutes to help me clarify your motivation for this decision.  Since not all parameters are estimable under the unconstrained model, don't you encounter problems with mixing when the sums of the levels trade off with the grand mean?  It seems in many situations it's advantageous to have an orthogonal design matrix, especially when the observed levels correspond to all possible levels in the population.  Do you have any thoughts on this you can share?

<p>I should say I found the paper very useful, especially your graphical representation of the variance components.  I also like your distinction between the superpopulation and finite population variances, which helped me clarify what happens when generalizing to functional responses.  Basically, we can share information across the domain to estimate the superpopulation variances by having a stationary Gaussian process prior, but the finite population variances can differ over the domain, which gives some nice insight into where various sources of variability are important.  (At the moment I'm working with climate modellers, who can really use maps of where various sources of variability show up in their output.)</blockquote></p>

<p>My reply:  I'm not quite sure what the question is, but I think you're pointing out the redundant parameterization issue, that if we specify all levels of a factor, and then have other crosscutting or nested factors (or even just a constant term), then the linear parameters are not all identifiable.  I would deal with this issue by fitting the large, nonidentified model and then summarizing using the relevant finite-population summaries.  We discuss this a bit in Sections 19.4-19.5 and Chapters 21-22 of <a href="http://www.stat.columbia.edu/~gelman/arm/">our new book</a>.</p>

<p>A couple notes on this:</p>

<p>1.  Mixing of the Gibbs sampler can be slow on the original, redundant parameter space but fast on the transformed space, which is what we really care about. Also, things work better with proper priors.  My new thing is <a href="http://www.stat.columbia.edu/~gelman/research/presentations/weakpriorstalk.pdf">weakly informative priors</a>, which don't include all your prior information but act to regularize your inferences and keep the algorithms in a reasonable space where they can converge faster.  The orthogonality that you want can come in this lower-dimensional summary.</p>

<p>2.  The redundant-parameter model is identified, if only weakly, as long as we use proper prior distributions on the variance parameters.  In Bayesian Data Analysis and in my 2005 Anova paper, I was using flat prior distributions on these "sigma" parameters.  But since then I've moved to proper priors, or, in the Anova context, hierarchical priors.  See <a href="http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf">this paper</a> for more information, including an example in Section 6 of the hierarchical model for the variance parameters.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Miscellaneous Statistics</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Cari Kaufman writes,</p>

<blockquote>
I am writing a paper on using Gaussian processes for Bayesian functional ANOVA, and I'd like to draw some connections to <a href="http://www.stat.columbia.edu/~gelman/research/published/AOS259.pdf">your 2005 Annals paper</a>.  In my own work I've chosen to use a 1-1 reparameterization of the cell means, that is, to constrain the levels within each factor.  But I am intrigued by your use of exchangeable levels for all factors, and I'm hoping you can take a few minutes to help me clarify your motivation for this decision.  Since not all parameters are estimable under the unconstrained model, don't you encounter problems with mixing when the sums of the levels trade off with the grand mean?  It seems in many situations it's advantageous to have an orthogonal design matrix, especially when the observed levels correspond to all possible levels in the population.  Do you have any thoughts on this you can share?

<p>I should say I found the paper very useful, especially your graphical representation of the variance components.  I also like your distinction between the superpopulation and finite population variances, which helped me clarify what happens when generalizing to functional responses.  Basically, we can share information across the domain to estimate the superpopulation variances by having a stationary Gaussian process prior, but the finite population variances can differ over the domain, which gives some nice insight into where various sources of variability are important.  (At the moment I'm working with climate modellers, who can really use maps of where various sources of variability show up in their output.)</blockquote></p>

<p>My reply:  I'm not quite sure what the question is, but I think you're pointing out the redundant parameterization issue, that if we specify all levels of a factor, and then have other crosscutting or nested factors (or even just a constant term), then the linear parameters are not all identifiable.  I would deal with this issue by fitting the large, nonidentified model and then summarizing using the relevant finite-population summaries.  We discuss this a bit in Sections 19.4-19.5 and Chapters 21-22 of <a href="http://www.stat.columbia.edu/~gelman/arm/">our new book</a>.</p>
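
<p>A minimal sketch of this fit-then-summarize idea, using made-up numbers rather than anything from the paper or the book: the draws below are simulated so that an arbitrary constant <code>c</code> trades off freely between the grand mean and the level effects, and the hypothetical helper <code>finite_population_summary</code> maps each draw to centered quantities that do not depend on <code>c</code>.</p>

```python
import random
import statistics


def finite_population_summary(mu, alpha):
    """Map one draw of redundant parameters (mu, alpha_1..alpha_J)
    to summaries that depend only on the identified sums mu + alpha_j."""
    shift = statistics.mean(alpha)
    mu_adj = mu + shift                      # adjusted grand mean
    alpha_adj = [a - shift for a in alpha]   # level effects centered at zero
    s_alpha = statistics.stdev(alpha)        # finite-population sd (J-1 denominator)
    return mu_adj, alpha_adj, s_alpha


# Assumption for illustration: fake "posterior draws" in which the cell means
# are held fixed and only the nonidentified direction (the constant c) moves.
rng = random.Random(0)
cell_means = [1.0, 2.0, 4.0]
draws = []
for _ in range(1000):
    c = rng.gauss(0.0, 10.0)
    mu = c
    alpha = [m - c for m in cell_means]
    draws.append((mu, alpha))

summaries = [finite_population_summary(mu, alpha) for mu, alpha in draws]

# Spread of the raw grand mean versus the adjusted grand mean across draws.
raw_mu_sd = statistics.stdev([mu for mu, _ in draws])
adj_mu_sd = statistics.stdev([s[0] for s in summaries])
print(raw_mu_sd, adj_mu_sd)
```

<p>In this toy setup the raw grand mean wanders with the arbitrary constant, while the adjusted grand mean and the finite-population standard deviation of the levels stay put, because only the identified combinations enter the summaries.</p>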

<p>A couple notes on this:</p>

<p>1.  Mixing of the Gibbs sampler can be slow on the original, redundant parameter space but fast on the transformed space, which is what we really care about. Also, things work better with proper priors.  My new thing is <a href="http://www.stat.columbia.edu/~gelman/research/presentations/weakpriorstalk.pdf">weakly informative priors</a>, which don't include all your prior information but act to regularize your inferences and keep the algorithms in a reasonable space where they can converge faster.  The orthogonality that you want can come in this lower-dimensional summary.</p>

<p>2.  The redundant-parameter model is identified, if only weakly, as long as we use proper prior distributions on the variance parameters.  In Bayesian Data Analysis and in my 2005 Anova paper, I was using flat prior distributions on these "sigma" parameters.  But since then I've moved to proper priors, or, in the Anova context, hierarchical priors.  See <a href="http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf">this paper</a> for more information, including an example in Section 6 of the hierarchical model for the variance parameters.</p>]]>

</content>
</entry>
<entry>
<title>More on differences in differences</title>
<link rel="alternate" type="text/html" href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/10/more_on_differe.html" />
<modified>2007-10-28T23:54:31Z</modified>
<issued>2007-10-28T23:48:22Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.1368</id>
<created>2007-10-28T23:48:22Z</created>
<content><![CDATA[<p>Bob Erikson writes,</p>

<blockquote>I was trolling the internet and came across <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/02/differenceindif.html">your debate with Jens H. from Feb 15 07 on your blog</a> about differences in differences.

<p>You might find <a href="http://www.stat.columbia.edu/~gelman/stuff_for_blog/Campbell and Clayton.pdf">the attached document</a> of interest. It is a once-influential  currently-obscure article from half a century ago on this topic.  The language is not contemporary. But note Campbell's  example of 2 ways to analyze the substantive problem and two very different interpretations.  Presumably Campbell is correct, using a difference of differences approach.</blockquote></p>

<p>Bob's office is just across from mine in the political science department, but of course we communicate via blogs and emails.  Anyway, I'll have to read the paper carefully.  Also it will be interesting to see if they noticed that <a href="http://www.stat.columbia.edu/~gelman/research/published/gelman.pdf">before-after correlations are higher for controls than for treated units</a>.</p>]]></content>
<author>
<name>Andrew</name>
<url>http://www.stat.columbia.edu/~gelman</url>
<email>gelman@stat.columbia.edu</email>
</author>
<dc:subject>Causal Inference</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.stat.columbia.edu/~cook/movabletype/mlm/">
<![CDATA[<p>Bob Erikson writes,</p>

<blockquote>I was trolling the internet and came across <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2007/02/differenceindif.html">your debate with Jens H. from Feb 15 07 on your blog</a> about differences in differences.

<p>You might find <a href="http://www.stat.columbia.edu/~gelman/stuff_for_blog/Campbell and Clayton.pdf">the attached document</a> of interest. It is a once-influential  currently-obscure article from half a century ago on this topic.  The language is not contemporary. But note Campbell's  example of 2 ways to analyze the substantive problem and two very different interpretations.  Presumably Campbell is correct, using a difference of differences approach.</blockquote></p>

<p>Bob's office is just across from mine in the political science department, but of course we communicate via blogs and emails.  Anyway, I'll have to read the paper carefully.  Also it will be interesting to see if they noticed that <a href="http://www.stat.columbia.edu/~gelman/research/published/gelman.pdf">before-after correlations are higher for controls than for treated units</a>.</p>]]>

</content>
</entry>

<entry>
<title>Please update your feed URL!</title>
<link href="http://www.stat.columbia.edu/~cook/movabletype/mlm/" />
<modified>2008-04-04T23:54:31Z</modified>
<issued>2008-04-04T23:48:22Z</issued>
<id>tag:www.stat.columbia.edu,2007:/~cook/movabletype/mlm/1.5369</id>
<created>2008-04-04T23:48:22Z</created>
<content>
We have transferred our links to FeedBurner.
</content>
<author>
<name>Aleks</name>
<url>http://www.stat.columbia.edu/~jakulin</url>
<email>jakulin@stat.columbia.edu</email>
</author>
<dc:subject>Admin</dc:subject>
</entry>

</feed>
