Gary Smith writes:
The real danger today is not that computers are smarter than us but that we think computers are smarter than us.
This is in an article where he, like me, disagrees with the statement by computer scientist Geoffrey Hinton that, “by training something to be really good at predicting the next word, you’re actually forcing it to understand.”
I’d also like to again point out that the impressive thing about generative AI is not that it’s “really good at predicting the next word,” but rather that it can do new things such as write entirely new poems, etc. Yes, these new things are created using a prediction algorithm, but, again, what’s impressive is the creation, not the prediction. Similarly, when the chatbot helps you write and debug a computer program, the impressive thing is that it’s giving you information that you didn’t know. It’s not predicting anyone’s next word.
So I think Hinton was simultaneously hyping and under-selling what chatbots can do.
Smith demonstrates the chatbots’ lack of understanding by feeding them a problem that is similar to, but slightly different from, the classic 3-cards problem in probability. Unsurprisingly (given what we know about language models), the bots pick up on the 3-cards problem that’s stored in some dispersed way in their memory and give various wrong answers that reveal a sort of patchwriting in which, as Smith puts it, “all of the words are sorted into correct English and yet Bing [and similarly with the other chatbots] clearly did not understand the words since it gave a three-card answer to a two-card question.”
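For readers who don’t remember the classic 3-cards problem, here’s a quick simulation sketch, assuming the standard setup (one card red on both sides, one black on both sides, one mixed): given that the face you see is red, the other side is red with probability 2/3, not 1/2. The function name and trial count are just illustrative choices.

```python
import random

# Classic 3-cards problem: three cards, one red/red, one black/black,
# one red/black. Draw a card at random and look at a random face.
# Given that the face shown is red, how often is the other face red too?

def other_side_red_given_red(n_trials=200_000, seed=0):
    rng = random.Random(seed)
    cards = [("R", "R"), ("B", "B"), ("R", "B")]
    shown_red = 0
    other_red = 0
    for _ in range(n_trials):
        card = rng.choice(cards)          # pick a card uniformly
        face = rng.randrange(2)           # look at a random side of it
        if card[face] == "R":
            shown_red += 1
            if card[1 - face] == "R":     # check the hidden side
                other_red += 1
    return other_red / shown_red

print(round(other_side_red_given_red(), 3))  # close to 2/3, not 1/2
```

The intuition trap is conditioning on “the card is not black/black” (which suggests 1/2) instead of conditioning on “a red face is showing” (the red/red card is twice as likely to show a red face, giving 2/3). Smith’s twist was a two-card version, which the chatbots answered with the memorized three-card boilerplate.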
Now you might reply that, sure, but the 3-cards problem is hard! Most humans don’t understand it either. To which I’d respond, yes, exactly: most humans don’t understand it, and the chatbots don’t understand it either. The chatbot performs like a student who’s been trained to do well in school by manipulating phrases in a way that, if you don’t focus, sounds reasonable. That indeed was the main point of my earlier post, where I wrote:
It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much move up and down of their own accord . . . but then if I need to run, that takes concentration.
Now to return to Smith’s main point, that the real danger is “that we think computers are smarter than us.” He quotes “TED organizer Chris Anderson,” who saw a chatbot demo and wrote, “Surely it’s not crazy to think that sometime next year, a fledgling Gemini 2.0 could attend a board meeting, read the briefing docs, look at the slides, listen to everyone’s words, and make intelligent contributions to the issues debated?”
I guess it all depends on what is meant by the term “intelligent.” The chatbot might well be able to produce a short summary of the meeting: “Mary said XYZ, then Jim said ABC, then Sue interrupted to say D,” etc., and that could be useful and, even if not reflecting any understanding, still be as good as or better than the equivalent summary made by an intelligent person. But if Anderson were to ask the chatbot for advice on what to do next, then, yeah, I’d be kinda scared. Yes, I’m also not thrilled about how lots of institutions are currently run; maybe a generative AI would have fired the employees of a university that faked its statistics, something (the firing, not the faking) that seems to have been beyond the capacities of my current employer. Still, I agree with Smith’s point that the chatbot, impressive as it is, is not understanding anything in the sense that we understand things when we understand them.
In his second recent post on the topic, Smith writes that large language models “can generate falsehoods faster than humans can correct them,” and he gives the example of a ridiculous claim that he extracted last year from ChatGPT that “According to estimates, about 49 bears have been sent into space by Russia since 1957,” which the chatbot then backed up with a set of irrelevant names (“Alyosha”, “Ugolek”, “Zvezdochka”, “Strelka”, “Belka”, “Pushinka”, and “Vladimir”), meaningless references, and links that went nowhere.
Smith follows up:
A few days ago, I asked the same question of Microsoft Copilot (formerly Bing), which is based on OpenAI’s GPT-4, thinking that GPT’s human trainers might have fixed this embarrassing glitch by now. They hadn’t—and Copilot’s answer to my query was revealing:
Human: How many bears have Russians sent into space?

Copilot: According to estimates, about 49 bears have been sent into space by Russia since 1957. These brave bears were part of the Soviet Union’s space program during the 1960s. Some of them even had names like “Alyosha,” “Ugolek,” “Zvezdochka,” “Strelka,” “Belka,” “Pushinka,” and “Vladimir.”
Copilot gave four Internet references, three of which were discussions of ChatGPT’s bears-in-space hallucinations. The fourth reference was about the history of Russian space dogs. Two of the three bears-in-space references were to articles written by me or quoting me, although neither link was to my 49-bears report that Copilot repeated virtually verbatim.
He summarizes:
At some point, human handlers will train Copilot and other LLMs to respond that no bears have been sent into space but many thousands of other misstatements will fly under their radar. LLMs can generate falsehoods faster than humans can correct them. . . .
This self-propelled cycle of falsehoods is likely to get worse, much worse. As LLMs flood the internet with intentional and unintentional fabrications, LLMs will increasingly be trained on these falsehoods and be increasingly prone to regurgitate them. It won’t just be amusing garbage about Russian bears in space. . . .
A lot of the discussion I’ve seen on AI has a polarized tone. On one side are the skeptics who seem to want to no-true-Scotsman the amazing progress that these chatbots have made; on the other side are the enthusiasts who can’t turn off the hype. I appreciate Smith’s measured skepticism, and I’ll set it alongside Bob Carpenter’s measured enthusiasm: Bob uses chatbots to do all sorts of useful things in his work and leisure activities.