Inspired by this fascinating study of vocabulary in rap lyrics, by Matt Daniels at Polygraph, my colleague Derek Greene decided to take a similar dive into our own data. Which of our 46 novels includes the widest selection of unique words?
I’ll let Derek explain this chart in his own words:
I looked at the vocabulary sizes for our novels to get a sense of their “lexical diversity” – i.e. the number of unique words vs the total number of words in each novel.
As you would expect, the number of unique words is very highly correlated with the total number of words – see the linear relationship in the attached plot. But there are a few novels that deviate from this trend somewhat, as indicated on the plot. These include both of the Phineas novels (fewer than expected unique words) and Waverley and Portrait (more than expected unique words).
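For anyone who wants to try this on their own texts, here is a minimal Python sketch of the kind of calculation Derek describes. It is not his actual code: the tokenisation rule (lowercase, letters and internal apostrophes only) and the function names `tokenize`, `vocabulary_stats`, `fit_line` and `residuals` are all assumptions made for illustration. It counts total and unique words per text, fits a straight line through those counts, and reports how far each text sits above or below that line.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a "word" is a run of ASCII letters, optionally with
    internal apostrophes (so "don't" counts as one word).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def vocabulary_stats(text):
    """Return (total word count, unique word count) for one text."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(stats):
    """Map each title to (actual unique words - unique words predicted by the line).

    `stats` is a dict of title -> (total words, unique words).
    A positive value means more unique words than the trend predicts.
    """
    titles = list(stats)
    xs = [stats[title][0] for title in titles]
    ys = [stats[title][1] for title in titles]
    slope, intercept = fit_line(xs, ys)
    return {
        title: y - (slope * x + intercept)
        for title, x, y in zip(titles, xs, ys)
    }
```

Feeding `residuals` a dictionary built with `vocabulary_stats` for each novel would flag the outliers: large positive values for works like Waverley and Portrait, large negative ones for the Phineas novels. A different tokenisation rule would shift the exact counts.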
So in case you’re just wondering what the hell the title of this post was about: Walter Scott and James Joyce’s novels, and the Wu-Tang Clan’s lyrics, are all well above the average for lexical diversity within their own cohorts. (This isn’t an entirely fair comparison, since only the first 35,000 words from each rapper were included in the Polygraph study, whereas most of our novels are considerably longer. The two works in our corpus that are closest in length to the 35,000 mark are The Sign of the Four and The Time Machine, which are more or less on par with Redman and Cypress Hill respectively, around the middle of the rap cohort. It’s worth noting, however, that rap is constrained by metrical and other formal considerations that don’t apply to prose, so perhaps a comparison with 19th-century poetry might be more appropriate!)
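One way to put the novels on the same footing as the rappers would be to apply the same cut-off, counting unique words only within the first 35,000 words of each text. The sketch below shows the idea. The function name `unique_words_in_first` and the tokenisation rule are assumptions for illustration, and Polygraph's own method may differ in its details.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (letters plus internal apostrophes).

    This tokenisation rule is an assumption, not the one used in either study.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def unique_words_in_first(text, limit=35_000):
    """Count unique words among the first `limit` word tokens of a text.

    Texts shorter than `limit` are simply counted in full.
    """
    tokens = tokenize(text)[:limit]
    return len(set(tokens))
```

Because every text is measured over the same number of words, the resulting counts can be compared directly, without the longer novels gaining an advantage simply by being longer.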
But what does this study of lexical diversity indicate about our own novels? Well, we’re still looking at developing this further, but here are a few thoughts and speculations:
The first thing that jumped out at me from this chart, I’m afraid, is that I personally processed four of the six longest (and wordiest) novels! These are Phineas Finn, Middlemarch, House by the Church-Yard and – with the largest unique word count, at 17,214 individual words – my own personal white whale, Vanity Fair.
Knowing that juggernaut of a novel like I do, I’d be willing to bet that a considerable chunk of its cohort of unique words is made up of the names of the ~600 characters that Thackeray saw fit to mention once and then immediately forget about, with complete disregard for the poor literary researchers who would have to painstakingly record and disambiguate them all a couple of centuries later. For example, in chapter 51, the reader is introduced to more than thirty new throwaway characters, including a selection of high-society party guests with names such as Champignac, Truffigny, the Earl of Portansherry, the Turkish attaché Kibob Bey, the Duchess (Dowager) of Stilton, the Duc de la Gruyere, Chevalier Tosti, and the Comte de Brie.
*cough* Anyway! Getting back to lexical diversity, I find it interesting that Waverley and Portrait of the Artist as a Young Man are both above the average point. Portrait, of course, is written by a famous linguistic innovator, James Joyce, and contains Dublin slang as well as stretches of text in Latin, which might contribute towards its count of unique words. Waverley, meanwhile, contains a large number of Scottish place-names and stretches of dialect or accented speech (both in dialogue and in quotes from poetry), such as the following from Bailie Duncan MacWheeble:
Tak care ye guide him weel, sir, for he’s aye been short in the wind since—ahem—Lord be gude to me! (in a low voice), I was gaun to come out wi’—since I rode whip and spur to fetch the Chevalier to redd Mr. Wauverley and Vich Ian Vohr; and an uncanny coup I gat for my pains. Lord forgie your honour! I might hae broken my neck; but troth it was in a venture, mae ways nor ane; but this maks amends for a’.
The same argument might apply to House by the Church-Yard, which contains considerable quantities of Irish-accented dialogue and Irish place-names. This is just a guess, but if it’s true, it might help to identify works containing large quantities of unusual linguistic features, when examining massive literary corpora, for example.
In contrast to these works, however, Bleak House, Middlemarch and both Phineas novels are interesting in that they skew below the average for lexical diversity – the two works by Trollope considerably more so than the others, despite being the fourth and fifth longest in our corpus. Why might this be the case?
Well, it’s definitely not because they’re simplistic! As Steven Pinker points out in his 1994 book The Language Instinct, there’s a general tendency to assume that vocabulary size is correlated with intelligence – that the number of words a person knows how to use, or the number of unique words used in a written text, is an indicator of how smart or at least how well-educated the speaker or writer is. Yet as Pinker argues, “people can recognise vastly more words than they have occasion to use in some fixed period of time or space”. The average person’s vocabulary is much larger than most people would generally assume; according to Pinker, a secondary school graduate typically knows about 60,000 unique words, while most children know about 13,000 by the age of six*. So while Vanity Fair’s 17,000 unique words may sound like a huge number, it shouldn’t actually pose a problem for a Leaving Cert student! (That is, assuming they don’t get bored and wander off somewhere during the many cheese-related jokes.**)
Knowing a lot of words doesn’t necessarily mean that you should use them all at once, though. Shakespeare’s works may have been astonishingly lexically diverse, but more lexical diversity in your writing doesn’t make you Shakespeare – or even indicate better writing skills. Jane Austen’s novels are some of the most enduringly popular and well-studied works in our corpus, yet (with the exception of Northanger Abbey, which is right on the average***), all of her books fall noticeably below the average mark for lexical diversity. Similarly, the study I’ve quoted above notes that the rapper DMX comes at the back of the pack for lexical diversity, at number 85, yet this isn’t a reflection of the quality of his lyrics: rather, as author Matt Daniels notes, his “raw energy and honesty were the most memorable qualities of his music”.
If you’ll forgive a brief segue into modern literary advice, the authors of How Not To Write A Novel actually caution against being too wordy, at least where it isn’t called for:
Beginning writers often believe that the true genius uses only words from the furthest reaches of the English language, the darkest recesses of the dictionary, the sort of words that cannot survive on their own in any natural environment.
Sorry; this is not writing. This is showing off, and nobody likes a show-off.
So, the text of Phineas Finn might be only a third as lexically diverse as Portrait of the Artist as a Young Man, but that isn’t necessarily a fault. It might perhaps be more repetitive on a statistical level, but from my own (obviously subjective) viewpoint, Phineas is simply a lot more fun to read****. Without unduly offending any Joyceans that might be among my readership, I just don’t know of any character in Portrait that can match Violet Effingham for sheer sass:
“Lady Baldock asked me the other day whether I was going to throw myself away on Mr. Laurence Fitzgibbon.”
“Indeed she did.”
“And what did you answer?”
“I told her that it was not quite settled; but that as I had only spoken to him once during the last two years, and then for not more than half a minute, and as I wasn’t sure whether I knew him by sight, and as I had reason to suppose he didn’t know my name, there might, perhaps, be a delay of a week or two before the thing came off. Then she flounced out of the room.”
(Phineas Finn, chapter 27)
Anyway, I’m going to suggest that at the end of the day, in novels as in rap lyrics, it’s not about the size of your vocabulary: it’s how you use it.
*Pinker, Steven, The Language Instinct: How the Mind Creates Language (HarperCollins, 2000). This discussion can be found on pages 149-151. I highly recommend this book if you’re interested in how languages work!
**Did I mention that there are sixteen characters with the name of John in Vanity Fair? SIXTEEN. I know, because I personally counted them. This has nothing to do with the topic at hand, I just thought you should know.
***Northanger Abbey is generally considered a parody of the Gothic genre, which might potentially explain why it differs from Austen’s other works. Interestingly, Pride and Prejudice – possibly the most widely beloved book of the early 19th century – has the lowest lexical diversity of all of her novels.
****What I do find interesting is that Phineas and its sequel Phineas Redux are nearly exactly the same length – there is only about 500 words’ difference between them, which is only about a third of the length of this – admittedly rambling – blog post. I wonder if that’s got something to do with Trollope’s exceptionally regular writing habits?