I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what the data would suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I am not using model selection to select anything, but merely to quantify how much a hypothesis sticks out from the morass of the null.
The response from several social scientists has been rather unappreciative, along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING!"
Of course I'm doing data mining, and I'm proud of it. Because data mining is summarization, it's journalism, it's surveying, it's mapping. That's where one gets ideas and impressions from. Of course what data mining found isn't "true". The models underlying data mining are most definitely not "true". But a mean is informative even if the distribution isn't symmetric.
The "scientific" approach corresponds to picking The One and Only Holy Hypothesis. Then you collect the data. Then you fit the model and verify whether it works or not. Then you write a paper. The good thing about the "scientific" approach is that you don't have to think, and that you need very little common sense. But real science is curiosity and pursuit of improved understanding of the world, not mindlessly following algorithms that can be taught even to imbeciles.
Let me analyze where the problem lies. There is data D. And there are multiple models M. In confirmatory data analysis (CDA) high prior probabilities are assigned to a single model and its negative (null): so it is very easy to establish which of the two is better. In exploratory data analysis (EDA) and data mining the prior over models is relatively flat. Yes, there are models underlying EDA too: if you rotate your scatter plot in three dimensions to get a good view of the phenomenon, your parameters are the rotations and you're doing kernel density estimation with your eyes. When you see a fit, you stop and save the snapshot. The problem is that no model in particular sticks out, so it's hard to establish the best one. Yes, it's hard to establish what "truth" is. "Truth" is the domain of religion. "Model", "data" and "evidence" are the domain of science.
Many of the hypotheses generated by people from theory might be understood as deserving higher prior probability: after all, they are based on experience. In turn, a flat prior includes many models that are unlikely. For that reason, one should use a bit of common sense when interpreting EDA results: because the prior was flat, if something looks fishy, subtract a little bit from it and study it in more detail. On the other hand, if you don't see something you think you should, add a little and study it in more detail. A CDA that tells you everything you've already known doesn't deserve a paper. But it's better to just eyeball the results with an implicit prior in your mind than to try to cook up a complex prior that will do the same. And once you've found a surprise, throw all the CDA you've got at it.
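To make the contrast between the two kinds of prior concrete, here is a minimal sketch in Python. The five candidate models and their log-likelihoods are made-up numbers, and the "CDA prior" is just one hypothetical way of putting almost all the mass on a null (model 0) and a favoured hypothesis (model 4); none of this comes from a real analysis.

```python
import math

def posterior(log_likelihoods, prior):
    """Posterior model probabilities, P(M | D) proportional to P(D | M) * P(M)."""
    log_post = [ll + math.log(p) for ll, p in zip(log_likelihoods, prior)]
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log P(D | M) for five candidate models (illustrative values only).
log_lik = [-102.0, -101.0, -100.5, -101.5, -100.0]

# EDA / data mining: a flat prior over all candidates.
flat_prior = [0.2] * 5

# CDA: nearly all prior mass on the null (model 0) and the hypothesis (model 4).
cda_prior = [0.48, 0.01, 0.01, 0.02, 0.48]

flat_post = posterior(log_lik, flat_prior)
cda_post = posterior(log_lik, cda_prior)

print("flat prior:", [round(p, 2) for p in flat_post])
print("CDA prior: ", [round(p, 2) for p in cda_post])
```

With these made-up numbers the same data give the favoured model a posterior of about 0.43 under the flat prior and about 0.86 under the CDA prior. The likelihoods have not changed; only the prior has, which is why the "winner" looks so much clearer in CDA.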

Aleks,
1. You write, "I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand." I think this is too restrictive a definition, since traditionally data analysis actually has been done by hand. Even now, consider analyses of small datasets such as the bioassay data in chapter 3, and the eight schools in chapter 5.
2. Regarding implicit models in EDA, see my two papers on the topic here and here.
3. I do agree with you, though, that EDA doesn't get enough respect, at least among Bayesians. I've felt this way since I went to the Bayesian conference in 1991 and saw that: (a) almost nobody was checking their models, and, even more distressingly, (b) when I asked people about this, many felt that model checking had no place in Bayesian statistics. They seemed to have a belief that their prior distribution should already include all possible models they might consider. My frustration from these discussions led to the work underlying the above two linked papers, along with my paper in 1996 with Meng and Stern on Bayesian predictive checking.
Do I remember correctly that these debates go back to the 60s, with Tukey(?) defending what was then called "data fishing"?
So, you're asking whether it's time to move on to the next term? :)
I feel similarly. Many top-level cognitive and developmental psychologists look at you weird if you do anything other than a t-test. I think in the case of the social sciences it's fear rather than any more substantive/informed belief about how analysis should be done.
I recommend reading Breiman's paper on the two cultures. At least in the statistics community, there is increasing acceptance of data mining methods. I think the divide may have to do with explanatory versus predictive models. If one is only concerned with cause-effect, then data mining will not give one that. The shortsightedness of only looking for causal models is the problem.
Andrew, I agree about the restrictiveness of "using machines". What I had in mind and what would be a better phrasing of the idea is "by formal or mathematical methods".
Janez, I prefer confronting the criticism that arises from a lack of understanding to implicitly acknowledging it by changing the name and merely running away from the problem.
Kaiser, yes, absolutely. In addition to Breiman's paper there is also Friedman, J. H. "Data Mining and Statistics: What's the Connection?"
Kaiser and Aleks,
Breiman's paper is interesting but I think he was unduly influenced by his own experiences. He had success using his methods but incorrectly disparaged the methods of others. See here for further discussion of his point. In statistics, there are a lot of roads to Rome, and it is unfortunately all too natural for people to assume that, just because their method has worked for them, other methods are somehow inferior.
"I think in the case of the social sciences it's fear rather than any more substantive/informed belief about how analysis should be done."
Absolutely not. Social scientists, and any other scientists for that matter, attempt to falsify, confirm, or do anything else in the positivist tradition. Statisticians tend to be agnostic about the "science" and more concerned with whether the story told from the data is correct. I agree with Aleks, though: data mining can help with further intuition on theories.
"The "scientific" approach corresponds to picking The One and Only Holy Hypothesis. Then you collect the data. Then you fit the model and verify whether it works or not. Then you write a paper. The good thing about the "scientific" approach is that you don't have to think, and that you need very little common sense. But real science is curiosity and pursuit of improved understanding of the world, not mindlessly following algorithms that can be taught even to imbeciles."
Sorry, just wanted to have that repeated again for posterity...
i don't know that i agree that you don't have to think with the scientific approach. because to come up with a good hypothesis, you do have to think about the research area. you have to know the literature, what studies have been done, what are the questions still to be answered, etc. too many times, i have seen researchers not even remotely think about their research, but just collect some poorly constructed dataset (oftentimes small) and run hundreds of regression models all for the purpose of picking a p-value less than 0.05. many times they don't even have the variables necessary for what they want to examine. and oftentimes, the result they focus on does not even make sense. i realize you are advocating nothing resembling this, but you seem to denigrate the scientific process too much.
i do agree that seeing what the data tells you, mining it as it were, gets short shrift. (unfortunately, i don't have a good grasp of how to do it.) it is not something that is taught enough to us.
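The "hundreds of regression models" problem described above is easy to simulate. The sketch below is only an illustration under stated assumptions: the outcome and all 200 predictors are pure noise, each "model" is a single-predictor correlation test, and the p-value uses Fisher's z transform (a normal approximation) rather than an exact t-test. The function names are made up for this example.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for r via Fisher's z transform (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def count_false_positives(n_models=200, n_obs=30, alpha=0.05, seed=0):
    """Test many pure-noise predictors against one pure-noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_models):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        if approx_p_value(pearson_r(predictor, outcome), n_obs) < alpha:
            hits += 1
    return hits

hits = count_false_positives()
print(f"{hits} of 200 noise predictors reached p < 0.05")
```

On average about 5% of the noise predictors (roughly 10 of 200) will clear the 0.05 bar even though nothing is there. Reporting only those hits is the bad practice; reporting all 200 tests and then following up the survivors with fresh data is closer to what the post argues for.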
Jimmy, agreed that there are people doing really bad data mining. But it's not the data mining as a methodology that's bad, it's their research that's bad. Let's not throw out the baby with the bath water. Perhaps, it's worth teaching good EDA rather than dissing it and teaching only CDA. I've seen a lot of crappy CDA too.
Didn't Kepler's laws come from data mining?
hubris-
That would be an example of bad research and I think it's incorrect to view the scientific method as a singular event (e.g., fit a model). It's about the continuity of one hypothesis holding in the face of another test. But if you want to understand things, truly understand, then you want to build ideas that explain multiple events and stand up to new data. Data mining does not do that.
Data mining builds intuition, which is useful. Kepler used it (from Tycho Brahe's data), but ultimately, it was Newton using a more scientific method that derived the model that could explain more than Kepler (why planets move on a plane, I believe).
Aleks,
I think this discussion is to be put in the context of another subject mentioned in this blog: IRBs ( http://www.stat.columbia.edu/~cook/movabletype/archives/2007/08/irb_watch.html )
I imagine that a culture that produces IRBs can only be skewed against data mining. Let us take an example: researcher A does a study on a certain peak found in the results of MALDI-TOF studies (which produce a very large amount of data for one sample), finds something about a specific peak, and publishes it. Researcher B thinks that this peak is not really the right one, and that you have to look at two peaks to find something about another disease. How does he go about using researcher A's data? How does he tell his IRB that, while the initial study was done for a specific purpose, he could find something else in the data that was not thought of in the first place? It would raise all kinds of issues on the IRB's part, and I feel that he would be unlikely to get his hands on the dataset officially.
Igor.
Igor, a paper shouldn't be published without (anonymized) data accompanying it. A paper using the data without citing the data sources should constitute an act of plagiarism.
Aleks,
You do realize that, in effect, data is never published and, by extension, extremely rarely shared. I would be genuinely interested in hearing a story where you or a colleague you know has been able to get her/his hands on a large set of data after it was published.
Aside from this argument, I was really pointing out something about the culture of scientists who deal every day with human subjects. The fact that IRBs exist is a sign of an inability to consider data mining a valid scientific tool.
IRBs exist for a very good reason, but oftentimes they are a good way for institutions to cover themselves, and they yield overly conservative, de facto regulations (to the point of ridicule; see Seth's IRB post).
Igor.
Igor, you're saying that nobody is doing what I'm suggesting. I agree. That's why it's a rant. But broadening the dissatisfaction with and criticism of the current situation would increase the opportunity for the people who will drive the change.
Aleks,
I hope you are right, but having been in a situation similar to the one I mentioned, I honestly do not see how it could come about. Maybe defining a framework for what I would call "orphan data" would keep some of the situations I mentioned from occurring.
Igor.