Today, I was finally able to evaluate a portion of Kauchak’s Wikipedia corpus, as well as the English Online corpus, with the language model trained on the CIC-FCE corpus I mentioned previously. On the Wikipedia corpus, the perplexity was 27.5864. On the English Online corpus, it was 26.0762. And, as a baseline, the perplexity on the model’s own training set (the CIC-FCE corpus) was 24.8764. The difference between the baseline and the first two corpora makes sense: obviously, the model should not be as surprised by the text it was trained on. Furthermore, it makes sense that the Wikipedia corpus has a slightly higher perplexity than the English Online corpus. The latter was written to accommodate English learners, so the syntactic constructions it uses are less complex. This translates into a lower perplexity than that of the Wikipedia corpus, which makes no such accommodations.
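For reference, perplexity is the exponentiated average negative log-probability the model assigns to each token, so a more "surprised" model scores higher. A minimal sketch, assuming per-token log-probabilities are already available (the function and the toy probabilities here are illustrative, not from the actual evaluation):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log p_i))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Toy example: a model that assigns each token p = 0.5 is less
# surprised than one that assigns p = 0.1, so its perplexity is lower.
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.1)] * 4
print(perplexity(confident))  # ~2.0
print(perplexity(uncertain))  # ~10.0
```

This also illustrates why the training-set baseline comes out lowest: the model assigns higher probabilities to text it has already seen, which directly lowers the average negative log-probability.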
These perplexity comparisons are admittedly a little confusing. The relative differences make sense, but the absolute values do not. What is a “good” perplexity score? How low should a perplexity be before we can consider the model an accurate representation of text for English learners? Do only the relative differences matter? And if so, how much of a difference is considered satisfactory?