Posted at Nov 20/2005 03:57 PM by Peter Y:
On the surface, the data collection protocol for this study was relatively straightforward to orchestrate. Random samples from each of the six Harry Potter books and from the Stephen King book were taken to create the data for the study. A random number generator (Excel) was used to select 10 pages from each book.

A selection of words from each page was then taken for analysis. Words in novels, however, do not exist independently of one another; they are ordered into sentences. In light of this, it was deemed appropriate to select complete sentences for analysis. The selection of words on each randomly chosen page therefore began with the first word of the first complete sentence at the top of the page. A target of 120 words, or about 1/3 of a page, was used for each selection. This number was chosen because it is large enough to include a wide range of words without being unwieldy. A smaller sample could more easily be skewed by consisting entirely of dialogue or some other distinctive style of writing; a larger sample makes that extreme result less likely.
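The page-selection step described above was done in Excel; the following is only a rough Python equivalent, offered as a sketch. The function name pick_pages, the seed argument, and the page counts are hypothetical, and the sketch assumes the 10 pages of a book are distinct, which the description does not state explicitly.

```python
import random


def pick_pages(page_count, n_pages=10, seed=None):
    """Return n_pages distinct page numbers drawn uniformly from 1..page_count."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return sorted(rng.sample(range(1, page_count + 1), n_pages))


# Hypothetical page counts, for illustration only; not taken from the study.
books = {"Book A": 309, "Book B": 341}

selection = {
    title: pick_pages(pages, seed=index)
    for index, (title, pages) in enumerate(books.items())
}
```

Recording the seed alongside the chosen pages would let someone else reproduce the same sample later.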

The sampling process stopped at the end of the sentence that came closest to 120 words. Thus, the number of words sampled on each page varies between about 110 and 130.
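The stopping rule can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used, which was presumably carried out by hand. The function name sample_words is hypothetical, the text passed in is assumed to begin at the first complete sentence on the page, sentences are split naively on ".", "!" or "?" followed by whitespace, and a tie between two sentence boundaries goes to the shorter sample.

```python
import re


def sample_words(page_text, target=120):
    """Take whole sentences from the start of page_text, stopping at the
    sentence boundary whose running word count is closest to target."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", page_text.strip()) if s]

    best_end, best_gap, total = 0, None, 0
    for index, sentence in enumerate(sentences, start=1):
        total += len(sentence.split())
        gap = abs(total - target)
        if best_gap is None or gap < best_gap:
            best_end, best_gap = index, gap
        if total >= target:
            break  # later boundaries can only be farther from the target

    return " ".join(sentences[:best_end]).split()
```

Abbreviations such as "Mr." and quoted dialogue would defeat the naive splitter, so real page text would need a more careful sentence boundary rule.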

This was all relatively straightforward and easy to set up. More challenging was the actual categorization of each datum, or word. Every word was assigned to one of nine categories: noun, verb, adjective, adverb, pronoun, preposition, conjunction, article, or interjection. The nine categories were designed to cover any potential word in the study. The problem, of course, is that the definitions of these nine parts of speech are by no means hard and fast, and many (if not most) words can fall into more than one category depending on their usage. A word like “there”, for instance, can at times be used as a noun, an adverb, an adjective, or a pronoun, depending on its context. Thus the biggest challenge in the design of the study was to maintain a level of consistency that would allow the data analysis to be meaningful.

The nine categories are therefore briefly laid out below, along with some of the ground rules used to govern word placement:

Noun: Nouns are persons, places, objects, and ideas. This category is usually straightforward. For the sake of simplicity, full names and titled names were each counted as a single noun (e.g. “John Smith” or “Professor Zheng” counts as one noun).

Verb: Verbs are actions. They denote some level of activity taken by a subject. In this study the words of a compound verb (have slept, would be going, etc.) were counted separately, with each word counting as its own verb. Also, gerunds (verbs used as nouns) and participles (verbs used as adjectives) were counted as verbs for the purpose of the study.

Adjective: Adjectives are modifiers. They describe nouns and pronouns. The main source of confusion with this category is that possessive pronouns are counted as adjectives (e.g. in “his” car and “our” home, both are adjectives).

Adverb: Adverbs are modifiers of verbs, adjectives, and other adverbs. According to Mr. King, they are one of the most telling parts of speech in a writer’s arsenal.

Pronoun: Pronouns take the place of nouns, replacing them with placeholder words like “me”, “she”, or “they”. As noted above, possessive pronouns are counted as adjectives in this study.

Preposition: Prepositions describe the relation of objects to one another. Examples include “of”, “above”, “from”, “with”, and “toward”.

Conjunction: Conjunctions connect different parts of a sentence with words like “and”, “but”, and “because”.

Article: Articles are “a”, “an”, and “the”.

Interjection: Interjections are exclamations that carry little grammatical meaning of their own and usually appear in dialogue. Whenever someone says “Wow” or “Ouch!”, those are interjections.

In deciding how to categorize words, the Merriam-Webster English-language dictionary was used to determine each word’s usage and part of speech, and every attempt was made to be consistent.
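As a rough illustration of how the ground rules above could be applied consistently once each word has been labeled by hand, the sketch below folds a few finer-grained labels into the nine categories and computes each category’s share of a sample. The names GROUND_RULES and tally, and the finer labels “possessive pronoun”, “gerund”, and “participle”, are hypothetical conveniences rather than part of the study; a full name such as “John Smith” is assumed to be entered as a single item.

```python
from collections import Counter

CATEGORIES = (
    "noun", "verb", "adjective", "adverb", "pronoun",
    "preposition", "conjunction", "article", "interjection",
)

# Hypothetical finer labels, folded into the study's nine categories.
GROUND_RULES = {
    "possessive pronoun": "adjective",  # "his" car, "our" home
    "gerund": "verb",                   # verb used as a noun
    "participle": "verb",               # verb used as an adjective
}


def tally(tagged_words):
    """Return each category's share of a list of (word, label) pairs."""
    counts = Counter()
    for word, label in tagged_words:
        category = GROUND_RULES.get(label, label)
        if category not in CATEGORIES:
            raise ValueError(f"unknown label {label!r} for word {word!r}")
        counts[category] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: counts[category] / total for category in CATEGORIES}


sample = [("his", "possessive pronoun"), ("car", "noun"),
          ("running", "gerund"), ("the", "article")]
shares = tally(sample)
```

Keeping the rules in one table means a change of convention, such as counting possessive pronouns as pronouns instead, touches a single line and applies to every sample alike.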

