The 80-20 rule is amazingly useful. As someone who occasionally tries to learn new languages (human languages, in this case), I had always wondered how good language textbooks really are at guessing which words and concepts to introduce in what order. So tonight, I got tired of wondering and wrote a little program to analyze word frequencies.
My latest language learning project is Latin; I’ve tried a few times and haven’t really gotten there. Seems like it’s time to try again. Which words should I learn first, though?
To find out, I downloaded the entire New Testament of the Vulgate (the Latin version of the Bible) from Wikisource. The Vulgate is relatively straightforward, as Latin goes, and has been in active use by the Church for over 1500 years. Plus, it’s in a consistent style (for the most part), which helps the analysis.
If I restrict the program to the top 10 words, by frequency, in the book of John, I get this:
Most common 10 words
These 10 distinct words (3085 instances) account for:
0.390472471690746% of 2561 distinct words
21.9323190672544% of 14066 total words
et (898)
in (377)
non (307)
quia (258)
est (235)
me (213)
qui (207)
autem (201)
Iesus (199)
eum (190)
Those stats are pretty impressive at a glance: about 0.4% of the distinct words make up over 20% of the total words in the work. So I started wondering where the 80-20 rule cutoff is: what does it take to get 80% of the total words represented? After some playing around, I came up with this:
Most common 561 words
These 561 distinct words (11254 instances) account for:
21.9055056618508% of 2561 distinct words
80.0085312100099% of 14066 total words
et (898)
in (377)
non (307)
quia (258)
est (235)
me (213)
...
So, the 80-20 rule is amazingly close here: about 22% of the distinct words make up 80% of the word instances. But this is a pretty small sample size; what does the whole New Testament look like?
Most common 10 words
These 10 distinct words (26230 instances) account for:
0.0579777365491651% of 17248 distinct words
19.6052051333797% of 133791 total words
et (9404)
in (4449)
est (2195)
autem (2128)
non (1976)
qui (1897)
cum (1125)
ut (1095)
ad (1066)
enim (895)
In this case the top 10 words still account for about 20% of the total word instances. The 80% cutoff for instances is even more dramatic, though:
Most common 2641 words
These 2641 distinct words (107032 instances) account for:
15.3119202226345% of 17248 distinct words
79.999402052455% of 133791 total words
et (9404)
in (4449)
est (2195)
autem (2128)
non (1976)
qui (1897)
...
In this case, you only have to know about 15% of the distinct words to get to 80% of the instances, but that is a lot more words in absolute terms: 2641, versus 561 for John alone. Any way you slice it, my Latin textbook (Wheelock’s, which I understand is quite popular at the university level) is only in partial alignment with this word list at best. I wonder how much this would change if I threw Caesar or Cicero at it.
Don’t take this too seriously (Lies, Damn Lies, and Statistics), because this is bound to be wrong for a number of reasons. One excellent reason is the highly inflected nature of Latin: a word’s form often changes by just a few letters, and the program counts each form as a separate distinct word. There are something like 15 different endings for the adjective magnus, for example, depending on the word’s function in a sentence.
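To make the inflection problem concrete, here is a small Python sketch. It is only an illustration, not part of the analysis above: the ending list and the crude_stem helper are hypothetical and far too naive for real Latin (they would happily mangle a word like quia), so real work would need a proper lemmatizer.

```python
from collections import Counter

# A few inflected forms of the adjective "magnus", each a different string.
forms = ["magnus", "magna", "magnum", "magni", "magnae", "magno", "magnam", "magnos"]

# Counting raw strings, as a plain word-frequency script does.
raw = Counter(forms)

# Hypothetical, deliberately incomplete list of endings to strip.
ENDINGS = ("us", "um", "ae", "am", "os", "a", "i", "o")


def crude_stem(word):
    """Strip the first matching ending, leaving at least three letters of stem."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


# Grouping by the crude stem collapses all eight forms into one entry.
grouped = Counter(crude_stem(w) for w in forms)

print(len(raw), len(grouped))  # 8 1
```

Counted as raw strings, one adjective shows up as eight distinct words, which inflates the distinct-word total and spreads its instances thin.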
I’m curious to try this on French and maybe English, which are progressively less inflected than Latin. I’d expect the results to be even more dramatic, although I’ll have contractions to contend with.
Anyway, if you’re curious, my little Perl script is posted here.
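For anyone who would rather not read Perl, here is a minimal Python sketch of the same idea. To be clear, this is not the script linked above: the function names (count_words, coverage, cutoff), the letters-only definition of a word, and the decision to leave capitalization alone are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_words(text):
    """Count each distinct word, keeping case as-is."""
    # Assumption: a "word" is any run of letters; digits and punctuation are skipped.
    return Counter(re.findall(r"[^\W\d_]+", text))


def coverage(counts, n):
    """Return (percent of distinct words, percent of total words) for the top n words."""
    top = counts.most_common(n)
    instances = sum(c for _, c in top)
    return (100 * len(top) / len(counts), 100 * instances / sum(counts.values()))


def cutoff(counts, target=80.0):
    """Smallest n whose top-n words cover at least `target` percent of all instances."""
    total = sum(counts.values())
    running = 0
    for n, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if 100 * running >= target * total:
            return n
    return len(counts)


# Tiny sample so the sketch runs on its own; a real run would read a whole book.
sample = "in principio erat Verbum et Verbum erat apud Deum et Deus erat Verbum"
counts = count_words(sample)

print(counts.most_common(3))  # the three most frequent words with their counts
print(coverage(counts, 3))    # share of distinct words, share of total words
print(cutoff(counts))         # how many words it takes to reach 80% of instances
```

On a real text, you would read the file into a string and pass it to count_words. Then coverage(counts, 10) gives the two percentages shown in the reports above, and cutoff(counts) replaces the "playing around" needed to find the 80% point.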