Sof[t]

Joe Fruehwald writes:

I’m working with linguistic data, specifically binomial hits and misses of a certain variable for certain words (specifically whether or not the “t” sound was pronounced at the end of words like “soft”). Word frequency follows a power law, with most words appearing just once, and with some words being hyperfrequent. I’m not interested in specific word effects, but I am interested in the effect of word frequency.

A logistic model fit is going to be heavily influenced by the hyperfrequent words, each of which constitutes only a single word type. To control for the item effect, I would fit a multilevel model with a random intercept by word, but like I said, most of the words appear only once.

Is there a principled approach to this problem?

My response: It’s ok to fit a multilevel model even if most groups only have one observation each. You’ll want to throw in some word-level predictors too. Think of the multilevel model not as a substitute for the usual thoughtful steps of statistical modeling but rather as a way to account for unmodeled error at the group level.
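
To make this concrete, here is a minimal sketch of a varying-intercept logistic model in PyMC. Everything here (the simulated data, the variable names `word_idx` and `log_freq`, and the priors) is an illustrative assumption, not Joe's actual analysis; the point is just that intercepts for words seen once get shrunk toward the population mean instead of being estimated on their own.

```python
# A sketch of a varying-intercept logistic model: one row per token,
# y = 1 if the final "t" was pronounced. All data here are simulated.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
n_words, n_obs = 500, 2000
word_idx = rng.zipf(1.5, size=n_obs) % n_words   # Zipf-like word draws
log_freq = np.log1p(np.bincount(word_idx, minlength=n_words))  # word-level predictor
y = rng.integers(0, 2, size=n_obs)               # fake hit/miss outcomes

with pm.Model():
    alpha = pm.Normal("alpha", 0.0, 1.5)         # grand intercept
    b_freq = pm.Normal("b_freq", 0.0, 1.0)       # word-frequency effect
    sigma_w = pm.HalfNormal("sigma_w", 1.0)      # between-word s.d.
    # Non-centered word intercepts: singleton words are pooled
    # toward the population mean rather than dropped.
    z = pm.Normal("z", 0.0, 1.0, shape=n_words)
    a_word = sigma_w * z
    eta = alpha + b_freq * log_freq[word_idx] + a_word[word_idx]
    pm.Bernoulli("y", logit_p=eta, observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```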

4 thoughts on "Sof[t]"

  1. The distribution of vocabulary words is quite unusual: it follows a Zipfian distribution, named for Zipf's law, which describes exactly this kind of heavy tail of rare events. I've read about this for some projects I was working on, but haven't really done anything with it. A useful starting point is the book:

    Baayen, R. Harald (2001). Word Frequency Distributions (Text, Speech and Language Technology, Vol. 18). Springer. ISBN 0792370171.

    Hope this helps.

  2. @Russell: Zipf's law is hardly unusual: it's a special case of a power-law relationship, and power laws occur throughout the natural world. And for anyone still paying attention, Zipfian behavior occurs for plenty of NON-linguistic data sources, and is therefore NOT a good diagnostic of whether a data source is "language" (a quick simulation below makes the point). See:

    http://catarina.csee.ogi.edu/plotcaption.jpg
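
    To see this concretely, here is a small, purely illustrative simulation (not tied to any data discussed here): drawing a "corpus" from a Zipf distribution reproduces both facts from the original question, namely a straight line on a log-log rank-frequency plot and a large share of word types that occur exactly once.

    ```python
    # Illustrative only: simulate a Zipf-distributed "corpus" and check
    # the rank-frequency power law and the share of singleton types.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.zipf(1.2, size=100_000)       # word IDs drawn from Zipf(s=1.2)
    _, counts = np.unique(tokens, return_counts=True)
    counts = np.sort(counts)[::-1]             # frequency by rank

    print(f"share of types seen once: {np.mean(counts == 1):.2f}")

    # Power-law check: slope of log(frequency) on log(rank),
    # fit over the well-sampled head of the distribution.
    ranks = np.arange(1, len(counts) + 1)
    head = counts > 5
    slope = np.polyfit(np.log(ranks[head]), np.log(counts[head]), 1)[0]
    print(f"log-log slope in the head: {slope:.2f}")   # roughly -1.2 here
    ```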

  3. If you're trying to predict whether a certain sound is produced word-finally, won't there be large contextual effects from (a) stress/intonation, and (b) segmental (sound) context of the following word's initial syllable and its stress?

    Multilevel models are THE thing for this kind of data. It's been widely observed in everything from parsing to word sense disambiguation to language modeling for speech recognition that using all the data with word-level effects is a big win for prediction even in the face of a lot of single counts. Typically you see "additive smoothing", which in a two-outcome case is just a MAP estimate under a symmetric Beta prior with parameters greater than 1 (e.g. Beta(2,2), which gives what the language processing literature calls "Laplace" or add-one smoothing); there's a small sketch at the end of this comment.

    You might also want to build a mixture model of function words vs. content words, or be even more fine-grained and use part of speech (e.g. noun, verb, article). With enough data, though, especially with rare words, you run into problems with lots of out-of-dictionary categories or instances where an automatic part-of-speech tagger is uncertain.
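
    Here is a minimal sketch of the additive smoothing mentioned above (the function name and the example numbers are made up for illustration): the MAP estimate of a binomial proportion under a symmetric Beta(a, a) prior, which reduces to add-one smoothing when a = 2.

    ```python
    # MAP estimate of a binomial proportion under a Beta(a, a) prior, a > 1.
    # With a = 2 this is add-one ("Laplace") smoothing: (hits + 1) / (n + 2).
    def smoothed_rate(hits: int, n: int, a: float = 2.0) -> float:
        return (hits + a - 1.0) / (n + 2.0 * a - 2.0)

    # A word seen once, with one deletion, is pulled toward 0.5
    # instead of the raw estimate of 1.0:
    print(smoothed_rate(1, 1))        # 0.667 rather than 1.0
    # A hyperfrequent word is barely changed:
    print(smoothed_rate(950, 1000))   # about 0.949
    ```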

  4. @Bob Oh yes, there are a lot of contextual effects. We haven't always controlled for intonational effects in the literature, but the segmental effects have been carefully investigated. I should have been clearer in my original question. Word frequency isn't the only thing to be added to the model. In fact, it's one of the last things.

    As for morphological effects, the facts are very interesting, and I outline some of them here:
    http://www.ling.upenn.edu/~joseff/rstudy/week7.ht

    The relevant bit is:

    In English, word-final consonant clusters, especially those ending in /t/ or /d/, are likely to be simplified by deleting the final /t/ or /d/. There are at least four grammatical classes affected by this simplification:

    - Regular Past Tense: "passed", "missed"
    - Irregular Past Tense: "swept", "kept"
    - Monomorphemes: "past", "mist"
    - not-contraction: "don't", "can't"

    It's been noted for a long time that there is a grammatical effect on -t/d deletion. Ranking the classes from least to most deletion gives: Regular Past < Irregular Past < Monomorphemes < not-contraction.
