
Digital Shakespeare (and data cleanliness)

I was cleaning up some files today (post-recovery of a hard drive issue; long story, but thankfully, no data loss), and came upon a little project I undertook a few years ago processing the plays of Shakespeare. This was an extracurricular project, something I undertook just for entertainment, but it produced some fun graphics that I'd like to share. (And an important lesson about data cleanliness, but more of that later.)

The works of Shakespeare have long since passed out of copyright, so they are now available to download, in machine-readable format, from the internet.

The initial aim of the project was to understand which characters appeared together in which scenes. If, for instance, you were putting on a production and had to schedule rehearsals around calendar constraints, which combinations of characters could you call upon? Or, if there were personality conflicts between your actors, which roles should you cast them in to minimize their interactions?

(The tragedy of) Romeo and Juliet

A quick parse of Romeo and Juliet reveals that there are 24 scenes in the play, and 34 players. These are listed below in the order of their first appearance in the play along with the number of words that each of them speaks.

Player              # of spoken words
SAMPSON             260
GREGORY             149
ABRAHAM             24
BENVOLIO            1153
TYBALT              262
FIRST CITIZEN       43
CAPULET             2123
LADY CAPULET        869
MONTAGUE            316
LADY MONTAGUE       28
PRINCE              585
ROMEO               4668
PARIS               539
SERVANT             183
NURSE               2208
JULIET              4263
MERCUTIO            2093
FIRST SERVANT       79
SECOND SERVANT      96
SECOND CAPULET      17
FRIAR LAURENCE      2728
PETER               245
LADY  CAPULET       11
FIRST MUSICIAN      62
SECOND MUSICIAN     34
MUSICIAN            8
THIRD MUSICIAN      7
BALTHASAR           233
APOTHECARY          53
FRIAR JOHN          96
PAGE                82
FIRST WATCHMAN      135
SECOND WATCHMAN     9
THIRD WATCHMAN      26
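
For anyone wanting to reproduce a tally like this, here is a minimal sketch in Python. It assumes the play has already been parsed into (scene, speaker, text) tuples; that tuple layout, the function name, and the tiny sample are illustrative assumptions, not the format of the downloaded file or the code behind the table above. Words are counted by splitting on whitespace, which may differ from how the figures above were produced.

```python
from collections import Counter

def words_per_player(speeches):
    """Count spoken words per player, keeping first-appearance order.

    `speeches` is an iterable of (scene, speaker, text) tuples, a
    hypothetical, already-parsed form of the play.
    """
    counts = Counter()
    for _scene, speaker, text in speeches:
        # Whitespace splitting is a simple (assumed) definition of a "word".
        counts[speaker] += len(text.split())
    return counts  # dict subclasses keep insertion order in Python 3.7+

# A few opening lines, used only as sample input.
sample = [
    ("1.1", "SAMPSON", "Gregory, o' my word, we'll not carry coals."),
    ("1.1", "GREGORY", "No, for then we should be colliers."),
    ("1.1", "SAMPSON", "I mean, an we be in choler, we'll draw."),
]

for player, words in words_per_player(sample).items():
    print(player, words)
```

Because the counter is keyed on the speaker string exactly as given, any inconsistency in how a name is written produces a separate row.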

The next stage is to plot the players on a graph. Each circle below represents one of the players. The size of the circle (area wise) represents the number of words spoken. The more words the player speaks, the larger the circle.
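
Scaling by area rather than by radius means the radius grows with the square root of the word count. A minimal sketch, where `area_per_word` is a hypothetical scale factor chosen to suit the canvas:

```python
import math

def radius_for_words(words, area_per_word=1.0):
    """Radius of a circle whose AREA is proportional to `words`."""
    return math.sqrt(words * area_per_word / math.pi)

# Quadrupling the word count doubles the radius (not quadruples it).
ratio = radius_for_words(400) / radius_for_words(100)
print(round(ratio, 6))
```

If the radius were made proportional to the word count instead, the big speakers would look far more dominant than their share of the words justifies.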

Were I going to continue with this project, at this point I would invest some time in finding a more appropriate way of distributing the players over the plane. Simply placing them randomly is far from optimal; some are too close together and best use has not been made of all the white space. Probably some organic algorithm which gives them 'mass' and meshes them together with invisible 'springs', allowing them to migrate to locations of least energy, would help spread them apart whilst allowing for the fact that 'larger' (more important) objects should get more space around them to reinforce their significance.
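
One way such a spring layout might look is sketched below. This is only an illustration of the idea, not something that was built for this project: it joins every pair of circles with a spring whose rest length is `padding` times the sum of the two radii, so larger circles claim more room, and then repeatedly nudges each circle along the net spring force. The function name, the all-pairs springs, and the parameters `steps`, `padding`, `rate`, and `seed` are all assumptions.

```python
import math
import random

def relax_layout(radii, steps=500, padding=1.5, rate=0.5, seed=0):
    """Spread circles out using pairwise 'springs'.

    Each pair (i, j) has a spring with rest length
    padding * (radii[i] + radii[j]). Returns a list of (x, y) centres.
    """
    rng = random.Random(seed)
    n = len(radii)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    step = rate / max(n, 1)  # smaller steps for larger graphs keeps this stable
    for _ in range(steps):
        moves = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                rest = padding * (radii[i] + radii[j])
                # Positive when too far apart (pull together),
                # negative when too close (push apart).
                stretch = (dist - rest) / dist
                moves[i][0] += step * stretch * dx
                moves[i][1] += step * stretch * dy
                moves[j][0] -= step * stretch * dx
                moves[j][1] -= step * stretch * dy
        for p, m in zip(pos, moves):
            p[0] += m[0]
            p[1] += m[1]
    return [tuple(p) for p in pos]

# Three equal circles should settle into a triangle with sides of
# padding * (1 + 1) = 3 units.
centres = relax_layout([1.0, 1.0, 1.0])
```

With many circles of different sizes the springs cannot all reach their rest length at once, so the result is a compromise (a local minimum of the total spring energy) rather than a perfect packing. A fuller version would probably only attach springs between players who actually share scenes and use a separate repulsion term for everyone else.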

Common Scenes

Next, we connect the players on the graph if they share lines in the same scene. We'll use the thickness of the line to represent the number of scenes in common. In Romeo and Juliet, no pair of players shares more than six scenes.

Three pairs of players share six scenes in common. Interestingly, the title characters are not amongst them!

Juliet and Nurse share six scenes, as do Capulet and Lady Capulet, and finally Romeo and Benvolio.
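
The pair counts behind graphs like these can be sketched as follows. The input is assumed to be a mapping from a scene label to the set of players who speak in it; the function names and the toy cast lists are hypothetical and are not the real scene data for the play.

```python
from collections import Counter
from itertools import combinations

def shared_scene_counts(scene_casts):
    """For every pair of players, count the scenes in which both speak.

    `scene_casts` maps a scene label to the set of players with lines in it.
    """
    pairs = Counter()
    for cast in scene_casts.values():
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(cast), 2):
            pairs[(a, b)] += 1
    return pairs

def edges_at_least(pairs, threshold):
    """Pairs to draw when showing 'threshold or more' common scenes."""
    return {pair: n for pair, n in pairs.items() if n >= threshold}

# Toy data for illustration only, not the real cast lists.
toy = {
    "s1": {"JULIET", "NURSE"},
    "s2": {"JULIET", "NURSE", "ROMEO"},
    "s3": {"JULIET", "ROMEO"},
}
print(edges_at_least(shared_scene_counts(toy), 2))
```

Lowering the threshold one step at a time reproduces the sequence of progressively busier graphs, and the count itself can drive the line thickness.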

Moving to five or more common scenes

Moving from six to five (or more) common scenes adds some new players into the mix. The players that have five scenes are differentiated by having slightly thinner connecting lines.

Finally, Romeo and Juliet start to share some scenes, as do Juliet and Lady Capulet. Also appearing in the same scenes now are Capulet and Nurse, and Capulet and Romeo.

Four or more common scenes

Moving to four or more common scenes increases interactions between the players already mentioned, and adds three new players: Friar Laurence, Paris and Mercutio.

In this play, all the action is centered around the main players, who interact closely with each other. I've analysed a dozen or so plays, and in some of them, it's possible to identify a couple of parallel threads in the story. In some plays, there is rich nucleation around two (or more) separate islands of characters who are involved in many scenes with each other, but these characters do not interact with people outside of their own zones.

Three or more common scenes

Things are starting to get a little busy! We're starting to see how important Romeo is as a hub, being involved in more common scenes than any other character, and Tybalt makes his first appearance.

Two or more common scenes

One or more common scenes

Finally we're down to one or more common scenes. In this (messy) graph, any two players connected by a line share at least one scene.

Hang on, how come there are two 'Lady Capulets'?

Confession time: In preparing this article, I noticed that there were two Lady Capulets on the graph. What's going on?

Closer examination of the data and the source shows the reason. In the text I pulled from the internet, the character is written in one scene as "LADY  CAPULET" (with two spaces between the words), whereas in the rest of the document it is written as "LADY CAPULET" (with just one space). My code treats these two spellings as distinct characters. Looking back at the code, I was wise enough to trim superfluous spaces off either end of the string and to convert the whole string to upper case, but I did not think of checking for multiple spaces in the middle of the string! I've learned a valuable test case here, and have elected to leave this blog as it is (and not re-run the code for Romeo and Juliet) to, hopefully, warn others about traps like this. Especially when dealing with unstructured data (and even when dealing with normalised data), it's very important to stop, check, and have a 'warm body' look over the results to make sure no anomalies are present (especially before making any important business decisions). Remember, computers are tools, and need to be treated as such.
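
A minimal sketch of a name clean-up that would have caught this, written in Python (the function name is hypothetical, and this is not the original code): split the string on runs of whitespace, rejoin with single spaces, then upper-case.

```python
def normalise_name(raw):
    """Trim both ends, collapse internal whitespace runs, and upper-case."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace, so rejoining leaves single spaces.
    return " ".join(raw.split()).upper()

print(normalise_name("LADY  CAPULET") == normalise_name(" Lady Capulet "))
```

Both spellings now map to the same key, so the two entries would merge into a single Lady Capulet.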

Digressing from Shakespeare for a while, the above issue reminds me of a gnarly bug we uncovered back in the days I worked at Automap. The code in question compared a registration name entered by the user against a known value stored in the application, by comparing a hash of the first 32 characters of each string. Tens of thousands of users were using the application without issue, but one particular user could not get their registration name to work. Time after time they tried, each time with an error, and in frustration they called our support line. Sure enough, I was able to reproduce their problem. What was causing the error? Surely it couldn't be an error in the hash function (a standard algorithm that has stood the test of time). With help from the developer, the code was opened up and the mystery was revealed.

The function that received the input from the user was calculating the hash value like this (pseudocode):

V1 = fn_Hash(TRIM(LEFT(input,32)))

The function that calculated the hash from the stored value in the registry used this formula:

V2 = fn_Hash(LEFT(TRIM(input),32))

It just so happened that this customer had a <SPACE> character in position 32 of their registration name!
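
The effect of swapping the order of the two operations can be reproduced in a few lines of Python. This is a sketch under assumptions: the original hash algorithm is not named, so SHA-256 stands in for it, and the registration names are made up.

```python
import hashlib

def left(s, n):
    """First n characters of s, like LEFT(s, n) in the pseudocode."""
    return s[:n]

def fn_hash(s):
    # Stand-in hash; the application's real algorithm is not known here.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def hash_trim_of_left(name):
    return fn_hash(left(name, 32).strip())   # TRIM(LEFT(input, 32))

def hash_left_of_trim(name):
    return fn_hash(left(name.strip(), 32))   # LEFT(TRIM(input), 32)

# Made-up name with a space as its 32nd character.
unlucky = "A" * 31 + " " + "B" * 8
# Made-up name with no space at position 32.
typical = "Jane Doe"

print(hash_trim_of_left(unlucky) == hash_left_of_trim(unlucky))
print(hash_trim_of_left(typical) == hash_left_of_trim(typical))
```

For the unlucky name, the first route hashes 31 characters (the trailing space is trimmed after the cut) while the second hashes 32 characters ending in a space, so the two values never match. For almost every other name the two orders agree, which is why the bug stayed hidden for so long.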




"Fair is foul, and foul is fair"

What's done is done. Let's move on to regicide, and repeat the exercise with (The Tragedy of) Macbeth.

There are 41 players in Macbeth and 28 scenes.

Below, again in order of first appearance, are the players. This time there are no double spaces in the names, but there are a couple of issues that could be handled differently. As you may know, the play opens with three witches. The script labels these, appropriately, First Witch, Second Witch, and Third Witch. The script also references All (meaning all witches). Similarly, in Act III, there are lines for a First Murderer, a Second Murderer, and Both Murderers. In an ideal world, I would modify the data schema to cope with these nestings, but since it's just a couple of lines, and all feature in the same scenes as their 'child' entities, I'm going to leave them in there as distinct entities.

Player              # of spoken words
FIRST WITCH         349
SECOND WITCH        125
THIRD WITCH         129
ALL                 124
DUNCAN              472
MALCOLM             1513
SERGEANT            234
LENNOX              499
ROSS                905
MACBETH             5316
BANQUO              775
ANGUS               137
LADY MACBETH        1891
MESSENGER           176
FLEANCE             15
PORTER              305
MACDUFF             1156
DONALBAIN           60
OLD MAN             81
ATTENDANT           8
FIRST MURDERER      166
SECOND MURDERER     86
BOTH MURDERERS      8
SERVANT             21
THIRD MURDERER      42
LORDS               13
HECATE              265
LORD                156
FIRST APPARITION    13
SECOND APPARITION   23
THIRD APPARITION    32
LADY MACDUFF        291
SON                 143
DOCTOR              328
GENTLEWOMAN         181
MENTEITH            71
CAITHNESS           73
SEYTON              32
SIWARD              203
SOLDIERS            4
YOUNG SIWARD        43

Seven Scenes in common

The players with the most scenes in common are Macbeth and Lady Macbeth, who share seven scenes.

Six or more scenes in common

Reducing to six scenes adds the interactions between Malcolm and Macduff.

Five or more scenes in common

Four or more scenes in common

Three or more scenes in common

Two or more scenes in common

At least one scene in common

 


© 2009-2013 DataGenetics