Paul Murrell’s new book on data technologies

Paul Murrell’s new book begins as follows:

    The basic premise of this book is that scientists are required to perform many tasks with data other than statistical analyses. A lot of time and effort is usually invested in getting data ready for analysis: collecting the data, storing the data, transforming and subsetting the data, and transferring the data between different operating systems and applications.

    Many scientists acquire data management skills in an ad hoc manner, as problems arise in practice. In most cases, skills are self-taught or passed down, guild-like, from master to apprentice. This book aims to provide a more structured and more complete introduction to the skills required for managing data.

This seems like a great idea, although it makes me think the title should say “data management” rather than “data technologies.” Also, I hope that he clarifies that “data” does not simply mean “raw data.” We often spend a lot of time working with structured data, in which the structuring (multilevel structures, missing data patterns, time series and spatial structure, etc.) is an important part of the data that is often obscured in traditional computer representations of data objects. As we’ve recently discussed, even something as simple as a variable constrained to lie in the range [0,1] is not usually stored as such on the computer.
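
To give a quick R illustration of my own: a proportion gets stored as a garden-variety floating-point number, and nothing in the representation knows about the [0,1] constraint.

    p <- 0.7    # conceptually a proportion, constrained to [0,1]
    typeof(p)   # "double" -- just a generic floating-point number
    p <- p + 1  # nothing stops the "proportion" from leaving [0,1]
    p           # 1.7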

I like the book a lot and think every statistician should have a copy. I have some comments on Section 2.3.4, on Debugging Code, which I’ll place in a separate blog entry. For now, here are some little comments on various parts of the book:

– The table of contents is followed by a longer Full Contents. Tastes differ, I’m sure, but I never get anything out of this sort of thing. I’d ditch it and make a better index. I also don’t know that the list of figures is particularly helpful.

– For the example at the bottom of p.12 and the top of p.13, I’d switch to a programming example rather than an English-sentence example. To teach programming syntax, use programming. Yes, that means you have to pick a language (C, or R, or Fortran, or html, or whatever), but you’re gonna have to do this anyway, so you might as well get started.

– On page 29, Paul (implicitly) recommends indenting by 4. In my experience, this gets messy fast. I prefer indenting by 2.
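
For example, here’s a small R loop of my own written with two-space indents:

    x <- c(3, -1, 4, -1, 5)
    total <- 0
    for (i in seq_along(x)) {
      if (x[i] > 0) {
        total <- total + x[i]  # two spaces per level keeps nested code compact
      }
    }
    total  # 12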

– I completely agree with the “don’t repeat yourself” principle on page 33. I tell people this all the time, and I’m glad to know that this is a recognized principle!
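
To make the principle concrete, here’s a little R sketch of my own (with made-up data): in the repetitive version, a change to the standardization formula has to be made in two places; in the factored version, in one.

    height <- c(1.62, 1.75, 1.81, 1.68, 1.73)  # made-up heights in meters
    sex <- c("F", "M", "M", "F", "M")

    # Repetitive: the same formula typed twice, easy to edit inconsistently
    z.men <- (height[sex == "M"] - mean(height[sex == "M"])) / sd(height[sex == "M"])
    z.women <- (height[sex == "F"] - mean(height[sex == "F"])) / sd(height[sex == "F"])

    # Don't repeat yourself: the formula lives in exactly one place
    standardize <- function(x) (x - mean(x)) / sd(x)
    z.men <- standardize(height[sex == "M"])
    z.women <- standardize(height[sex == "F"])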

– I like the summary at the end of each chapter.

– When introducing html, xml, etc., I think it would be helpful to have a little bit of history: who invented these, when, and why?

– On page 27, the labels on the axes seem too small to me; also, the y-axis should go to zero, the line should go to the extreme left and right of the plot (after all, time didn’t start in 1900 or end in 2050), and the y-axis should be in billions, not millions. If it’s your own graph, you should be sure to use good practice in a book like this! I’d also recommend giving a page or so discussing the choices that are made in this sort of graph (see, for example, Appendix B of my book with Jennifer to get a sense of how this can be done).
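
To illustrate, here’s roughly what I have in mind in R, with made-up population figures standing in for the actual data in Paul’s graph:

    year <- c(1950, 1975, 2000, 2025, 2050)
    pop <- c(2.5, 4.1, 6.1, 8.0, 9.4)  # world population in billions (illustrative numbers)
    plot(year, pop, type = "l",
      ylim = c(0, 10),                 # y-axis goes all the way down to zero
      xaxs = "i",                      # the line runs to the left and right edges of the plot
      xlab = "Year", ylab = "World population (billions)",
      cex.lab = 1.2, cex.axis = 1.2)   # axis labels large enough to read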

– I’m not thrilled with the introduction to R around page 238. I think Paul is too close to this material to give a good introduction. (Just for example, page 255 is a big waste of space; if nothing else, just do a 3×3 matrix so as not to take up the whole page!) I don’t really know what can be done with this chapter as a whole, but maybe it would help to have a frank discussion of the strengths of R and also the difficulties involved in using R.

– The picture of Euler on page 267 is horrible! No history of html, no history of R, no history of C, but a picture of Leonhard Euler! Save this for the calculus books, dude! Students have tons of opportunities in college to be told how beautiful Euler’s equation is, blah, blah, blah. Now it’s time for CS and Stat to get some respect.

– If you’re going to talk about digital photography (p.283), you should speak at least briefly about how the photographic data themselves are stored!

– p.301: Again, it’s odd that space is devoted to the irrelevant history of an insect survey rather than to the history of software.

– Section 11.7.9 is a bit of a letdown: from major data sources to “a resident of Baltimore, Maryland”? Surely you can come up with a better example than this! Unless you’re trying to make a general point about how people can gather and analyze their own data, but if you want to make this point, I’d recommend doing it explicitly.

In summary

This book is wonderful. I’m so sick of people writing books that have already been written, or done better by others (for example, all the Intro to Bayes books that come across my desk, or, to use an even better example, all the interchangeable Intro to Statistics texts that people keep writing and foisting upon unsuspecting students). Paul Murrell’s book is unique and is much needed. I look forward to seeing the completed version.

5 thoughts on “Paul Murrell’s new book on data technologies”

  1. The CC license is one of the best features of the book! I didn't anticipate that I would be able to download the entire thing. Thanks for the review, as I would not have found this text otherwise. This book seems to fill a much-needed niche.

  2. Section 11.7.9 is a bit of a letdown: from major data sources to "a resident of Baltimore, Maryland"? Surely you can come up with a better example than this!

    As a resident of Knoxville, Tennessee, I have to disagree. Utility bills are a very interesting data source.

  3. I always look for something I know in a book like this. For this one, I perused the text section (p. 116 and thereabouts), and it's pretty bad.

    What I'd have liked to have seen is a reference to the International Components for Unicode (ICU) package, which is the best cross-platform, cross-language Unicode processing tool.

    I'd have also liked to have seen a mention of the Unix/cygwin command-line tool od (octal dump), which lets you display the bytes in a file and looks to be what he was using to generate the byte displays in his text. Maybe he mentions it elsewhere.

    And I'd have liked to have seen the fundamental rule of text processing: always explicitly specify the encoding! Violated with the first HTML example on p. 12 (there's an example with encoding info on p. 232, but it's not explained).
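
    For instance, in R (to stay with the book's language), you can say what the bytes mean when you open a connection instead of trusting the locale default ("notes.txt" is a made-up file name):

        con <- file("notes.txt", encoding = "UTF-8")  # state the encoding explicitly
        text <- readLines(con)
        close(con)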

    Let's look at what's wrong or misleading. The string "just testing" has exactly the same encoding in ASCII, Latin-1, and UTF-8-encoded Unicode. What the author calls "UNICODE" is actually UTF-16-encoded Unicode. The byte-order mark is optional for UTF-16.
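
    You can verify this in R (a quick sketch; iconv() with toRaw = TRUE returns the raw bytes):

        x <- "just testing"
        charToRaw(x)  # 6a 75 73 74 20 74 65 73 74 69 6e 67 -- plain ASCII
        # Latin-1 and UTF-8 give exactly the same bytes for this all-ASCII string
        identical(iconv(x, to = "latin1", toRaw = TRUE)[[1]],
                  iconv(x, to = "UTF-8", toRaw = TRUE)[[1]])  # TRUE
        iconv(x, to = "UTF-16LE", toRaw = TRUE)[[1]]  # two bytes per character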

    UTF-8 is typically presented without a byte-order mark, as there's no byte order to speak of (for UTF-16, the mark records which byte of each two-byte pair comes first, the most significant or the least significant). Notepad includes a byte-order mark for UTF-8, which is allowed but not recommended.

    There's no fundamental difference between Windows and Unix/Linux/Mac at the character processing level. Characters get processed by applications, not the OS! Statements like this one (p. 116) make no sense: "On Windows, UNICODE text will typically use two bytes per character; on Linux, the number of bytes will vary depending on which characters are stored (if the text is only ASCII it will only take one byte per character)." What the author seems to be confusing is UTF-16 on Windows vs. UTF-8 on Linux.

    The author may also be confused by the operating-system defaults in some programs. For instance, Java defaults to Windows-1252 in the U.S., whereas on Unix it defaults to Latin-1. And the big killer is line breaks, which, by default in many programs such as text editors, produce different bytes on Windows, the Mac, and Unix/Linux.
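
    The line-break difference is easy to see at the byte level, again in R:

        charToRaw("x\n")    # 78 0a    -- Unix/Linux line ending (LF)
        charToRaw("x\r\n")  # 78 0d 0a -- Windows line ending (CR LF)
        charToRaw("x\r")    # 78 0d    -- classic Mac line ending (CR)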

    What really gums up characters is cut-and-paste from editors into browsers or vice-versa.

    Finally, why try to provide reference materials for all these programs and standards? I never understand this tendency in computer science textbooks. They're always partial and always out of date.

    PS: Managing Gigabytes is a great book on the details of building static information retrieval systems, but nowadays, I'd recommend using Apache Lucene if you need to build something and reading Manning et al.'s book on IR if you want to understand the state of the art.
