So, yesterday’s discussion with Prof. Medero was quite illuminating. As I mentioned in yesterday’s blog post, my ultimate goal is to create a table of perplexities using a Bash script. And so far, this is what I have for the table.
| Corpus Used to Train Language Model | Perplexity on ESOL Test Set (2k words) |
|---|---|
| General English (12M words) | 22.5623 |
| English written by ESOL students (532k words) | 23.2972 |
| General English + English written by ESOL students | ? |
| English written for ESOL students (230k words) | 20.7549 |
I’m a little confused as to why the perplexity for general English is lower than the perplexity for English written by ESOL students. That doesn’t make intuitive sense to me: I would expect the latter corpus to be simpler, and therefore less “surprised” by the ESOL test set. I’m still working out the reasoning with Prof. Medero, but I think it might have to do with the sizes of the corpora we’re using. The test set is pretty small, and so are the ESOL corpora compared to the general English corpus; a model trained on less data has simply seen fewer of the words and phrases in the test set, which can inflate its perplexity regardless of how simple the language itself is. I’m starting to look into larger ESOL corpora to see whether that has an effect on the perplexity, but tomorrow I’m going to focus on filling in the penultimate row of that table.
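For anyone following along who hasn’t computed perplexity before, here is a minimal sketch of the quantity the table is reporting. This is not my actual Bash pipeline; it’s a toy add-one-smoothed unigram model in Python (real toolkits use higher-order n-grams and better smoothing), just to make the definition concrete: perplexity is the exponential of the average negative log-probability the model assigns to the test tokens, so a model that has seen less relevant training data assigns lower probabilities and scores a higher perplexity.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on test_tokens.

    p(w) = (count(w) + 1) / (N + V), where N is the number of training
    tokens and V is the combined vocabulary size.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    n, v = len(train_tokens), len(vocab)
    # Average negative log-probability over the test set, then exponentiate.
    log_prob = sum(math.log((counts[w] + 1) / (n + v)) for w in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

train = "the cat sat on the mat".split()
test = "the cat sat".split()
print(unigram_perplexity(train, test))
```

The same intuition carries over to the table: the 532k-word ESOL corpus gives the model far fewer counts to work with than the 12M-word general English corpus, which is one plausible reason its perplexity comes out higher even though the language is simpler.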