Mixed-up Mr. Darcy

Plenty of classic works of literature are available online in the public domain. These out-of-copyright books provide rich data-mining opportunities for analyst geeks like me.

This morning, I downloaded a copy of Pride and Prejudice, written by Jane Austen back in 1813.

A little pre-processing work was needed on the raw file to remove chapter headings, remove superfluous carriage-return/line-feeds and normalise punctuation (removal of double spaces and dashes, dealing with straight quotes and curly quotes). Anybody who has had to process non-structured data knows that these taxes have to be paid. (A pet peeve of mine is having to deal with data that is supposedly "well formed" and structured, but in the end turns out to be "mostly well formed". I've lost count of the number of times I've been handed a CSV or TXT file to import into SQL, only to find that quotes were not used on the text fields, so everything appears fine until a record comes along with a natural comma in the string of a text field. This, of course, is treated as a separator, offsetting everything downstream of there ... grrr ... or the same with numeric fields with thousands separators ... or rows that were too long for the export program, which just generated incomplete rows with ragged CRLF at the end. Over the years I've had to write dozens of macros and programs to clean up dirty data just to pre-process it ready for consumption and import. There's no such thing as "partially structured" data; it's either clean and well-formed, or it's not!)
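
For the curious, here is a minimal Python sketch of that kind of clean-up. It is an illustration only, not the processing actually used for this article: the function name clean_text, the chapter-heading pattern and the decision to turn dashes into spaces are all assumptions.

```python
import re

# Map curly quotes to their straight equivalents.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

def clean_text(raw: str) -> str:
    """Normalise a raw plain-text book ready for word counting (illustrative)."""
    # Convert CRLF / CR line endings to plain newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop lines that look like chapter headings, e.g. "Chapter 12" or "CHAPTER XII".
    # (Assumed heading format; real files vary.)
    text = re.sub(r"(?im)^\s*chapter\s+[0-9ivxlc]+\.?\s*$", "", text)
    # Curly quotes become straight quotes.
    text = text.translate(QUOTES)
    # Double hyphens and en/em dashes are treated as word breaks.
    text = re.sub(r"--+|[\u2013\u2014]", " ", text)
    # Collapse line breaks and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Calling clean_text on the raw file contents should give one long, single-spaced string that the later counting steps can work on.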

Simple Word Frequency

Below is a table of the most popular words in Pride and Prejudice along with their frequency. Pretty standard stuff.

There are 6,763 distinct words used in the book. The longest (non-hyphenated) words are seventeen letters in length, and these are: disinterestedness, communicativeness and misrepresentation.
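
A word-frequency count like this can be sketched in a few lines of Python. The names word_counts and longest_unhyphenated are hypothetical, and the rules (lower-casing everything, allowing apostrophes and hyphens inside a word) are assumptions, so the totals it gives may not match the figures above exactly.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Count each distinct word, ignoring case (assumed tokenisation rules)."""
    # A word is a run of letters, optionally joined by apostrophes or hyphens.
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return Counter(words)

def longest_unhyphenated(counts: Counter) -> list:
    """Return the longest words that contain no hyphen, in alphabetical order."""
    plain = [w for w in counts if "-" not in w]
    longest = max(map(len, plain), default=0)
    return sorted(w for w in plain if len(w) == longest)
```

With counts = word_counts(text), the number of distinct words is len(counts) and the top of the frequency table is counts.most_common(20).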

Word couplets

Rather than simply counting the words, or the frequency of letters used, I thought it would be fun to analyse the words in the context of their position next to others. I parsed through the text and collected the distinct words. Then, for each of these distinct words, I determined all the different words that immediately followed it, and counted them to work out a frequency for each couplet. For instance, the word after is followed by the word breakfast six times in the book, and by the word dinner four times.
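
The couplet counting described above can be sketched as a dictionary of counters, one per leading word. This is a minimal illustration, assuming the text has already been split into a list of words; follower_counts is a hypothetical name.

```python
from collections import Counter, defaultdict

def follower_counts(words):
    """Map each word to a Counter of the words that immediately follow it."""
    followers = defaultdict(Counter)
    # Pair every word with its successor: (w0, w1), (w1, w2), ...
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return followers
```

Looking up followers["after"].most_common() would then list the successors of after, most frequent first.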

"The" Long Tail

As the table above shows, the word the is the modal word in the book, occurring 4,312 times. The most frequent word succeeding it is whole with a count of 71, closely followed by same and room with a count of 69 (meaning the snippet "the whole" occurs 71 times in the novel, and the phrase "the room" occurs 69 times).

The table below, on the left, shows the top words that follow the word the in the book, and the chart on the right shows the distribution of all words that the precedes.


There are a total of 1,278 different words that follow the, and the distribution follows a classic long tail curve, with 706 of these words occurring only once (roughly 55%).
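
Given the counts of words that follow a particular word, the size of the tail is simple to measure. The sketch below assumes those counts are held in a Counter; tail_summary is a hypothetical helper name.

```python
from collections import Counter

def tail_summary(follower_counter: Counter):
    """Return (distinct followers, followers seen once, share seen once)."""
    distinct = len(follower_counter)
    once = sum(1 for n in follower_counter.values() if n == 1)
    # Guard against an empty counter to avoid dividing by zero.
    share = once / distinct if distinct else 0.0
    return distinct, once, share
```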

Most common pairings

Before running the query, I suspected that the most common pairing of words would be "Mr. Darcy", since these two words would always occur together. However, whilst this couplet does appear very high on the list, the construct "of the" gets the award, occurring 461 times, compared with "Mr. Darcy", which occurs 242 times. Other very common phrases are "to be" (441 times), "in the" (383 times), "of her" (260 times), "to the" (250 times) and "it was" (248 times).

How do you start a sentence?

The top words after a period are:

How do you end a sentence?

There are 5,045 periods in Pride and Prejudice. 9,123 commas, 499 exclamation points!, 462 question marks (really?), 1,538 semi-colons; yes it's true, and the number of colons: 133

So just how does one use a semi-colon? By far the most common word following a semi-colon in Jane Austen's work is the word and with 692 occurrences, followed by but with 311 and for with 68.
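
Counts like these fall out naturally if punctuation marks are kept as tokens in their own right. The following is a rough sketch under that assumption; the token pattern and the names tokenise, punctuation_counts and words_after are illustrative. Note that it makes no special case for abbreviations, so the period in "Mr." would be counted as an ordinary period.

```python
import re
from collections import Counter

PUNCT = set(".,;:!?")
# A token is either a word (letters, with internal apostrophes or hyphens)
# or a single punctuation mark.
TOKEN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[.,;:!?]")

def tokenise(text: str) -> list:
    """Split text into word tokens and punctuation tokens, in order."""
    return TOKEN.findall(text)

def punctuation_counts(tokens) -> Counter:
    """Count how often each punctuation mark appears."""
    return Counter(t for t in tokens if t in PUNCT)

def words_after(tokens, mark: str) -> Counter:
    """Count the (lower-cased) words that directly follow a given mark."""
    return Counter(
        nxt.lower()
        for cur, nxt in zip(tokens, tokens[1:])
        if cur == mark and nxt not in PUNCT
    )
```

Here words_after(tokens, ";") gives the words that follow a semi-colon, and words_after(tokens, ".") gives an approximation of the words that start a sentence.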

Mixed-up Mr. Darcy

Because we have the probability distribution of words that follow each other, using some random numbers we can create our own Jane Austenesque gobbledygook work of literature. Rather than use an infinite number of monkeys on typewriters, we can select words based on the distribution patterns of these words in the original text.

First I select, at random, one of the words that I know Jane Austen used to start a sentence. Then, using a weighted probability, I select, randomly, one of the words I know she followed that word with (the more popular words having a higher probability of being selected), and so on ... building sentences like playing dominoes. [GEEK NOTE] In this exercise, I also tokenised all the punctuation marks to make them appear as words, so commas, periods and semi-colons also appear based on the probability of the preceding word. [/GEEK NOTE]
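
The domino-style generation described above can be sketched as follows. This is a minimal illustration, not the program used to produce the output below. It assumes the text is already a list of tokens with punctuation included, picks the opening word uniformly from the distinct sentence starters, and stops at the first sentence-ending mark. The names build_model, generate_sentence and join_tokens, and the max_tokens cap, are all hypothetical.

```python
import random
from collections import Counter, defaultdict

SENTENCE_END = {".", "!", "?"}
PUNCT = set(".,;:!?")

def build_model(tokens):
    """Return (sorted sentence starters, follower counts) for a token list."""
    starters = set()
    followers = defaultdict(Counter)
    previous = "."  # treat the very first token as the start of a sentence
    for token in tokens:
        if previous in SENTENCE_END:
            starters.add(token)
        followers[previous][token] += 1
        previous = token
    return sorted(starters), followers

def generate_sentence(starters, followers, rng, max_tokens=60):
    """Walk the follower counts, choosing each next token by weighted chance."""
    token = rng.choice(starters)
    sentence = [token]
    while token not in SENTENCE_END and len(sentence) < max_tokens:
        options = followers.get(token)
        if not options:  # dead end: this token was never followed by anything
            break
        # More frequent followers are proportionally more likely to be picked.
        token = rng.choices(list(options), weights=list(options.values()))[0]
        sentence.append(token)
    return sentence

def join_tokens(tokens) -> str:
    """Glue tokens back together with no space before punctuation."""
    text = ""
    for t in tokens:
        text += t if (t in PUNCT or not text) else " " + t
    return text
```

A typical call would be starters, followers = build_model(tokens), followed by join_tokens(generate_sentence(starters, followers, random.Random())), repeated once per sentence.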

Some sample output:

He could be a man who was no more than I had given me. I will not be very little of her sister, she, and in the whole of it is a little of her with him to be in love; it, that she could do not be in her. But not to see how could have been a few moments, and I do the room at last.

But the same, she was now and I have heard of his sister; but she had received from her sister, said Mrs. Bennet was very good humour, that he was so much. I should have been more than she is not to him to have you will, and I am afraid you are you. He might be sure, she was to be so far from Mr. Darcy. She was very little, as she was soon as to the carriage, I should be so many of the same, as he had any young man; for it to be the whole, said: I am sure of the whole family.

She was a most of the room. His own. I have a great deal of my own family. Elizabeth, to see him to the very much. And, for her. I was to him, and, she felt for she had been so, and, and her mother, and, I had been in her to be in my dear Mr. Bingley, by his character was to be in the same time, as much.

You are very well. I should not have been in a few weeks. It is very well, and the other, that she would have the next to have no more, she could be in a man, she had given him, though this is not be so far from what I am not to be the subject, in her, which she felt it is, she was a few minutes, I do, and of his character.

It's paradoxically both easy to read, and hard to read, at the same time! Your brain seems to follow along the words, knowing that they make grammatical sense, and are logically next to each other. The sentences start to make sense as you parse them, but then somehow it gets derailed when they turn bogus. Every now and then, quite a few words are strung together that make sense and you can't help smiling as you read.

Romance not your thing?

I realise that not everyone appreciates romance novels, so I repeated the experiment with another classic, The War of the Worlds, written by H. G. Wells in 1898.

As you can imagine, the writing styles are very different, as is the distribution of words and sentence structure.

Repeating the same exercise, here are a few randomly generated mixed-up War of the Worlds sentences.

At once to see, I have fallen, the road, in the night was a man on his own house and I was, in the water, the pit was, I could be the black smoke, and the Martians. The artilleryman began to have been a little boy might be seen the cylinder. It was still, and there, and a hundred and I was a time, the curate had been the first time, and the first time I had been at the common.

I could not only to me that the houses. A moment the house and then I went into the Martian might have been at this way! said. He asked the Martians, I was the first I was a man, and the pit, a little way. A hundred yards of a moment I said.

There was a little crowd of the road. And the first I went on either flexibility or two or more. I came to me, the heat Rays right, but in the Martians. I was the night. Then I have seen on to be killed, the pit, the water, I was, I had already been the common, and then I saw the other.

A man in the Martians were, a little way! said the red weed, the heat Ray, the first, and a time. I could see it was in the Martians had been at first I was the people were watching the red weed, it is the Martian, I had been a couple of the Martians had been at the road, and I saw through the house, I saw one of the road, and in a man.

Gripping stuff!


