July 28 (Week 6) – Maury

So, yesterday’s discussion with Prof. Medero was quite illuminating. As I mentioned in yesterday’s blog post, my ultimate goal is to create a table of perplexities using a Bash script. So far, this is what I have for the table:

Corpus used to train the language model            Perplexity on the ESOL test set (2k words)
General English (12M words)                        22.5623
English written by ESOL learners (532k words)      23.2972
General English + English written by ESOL          ?
English written for ESOL learners (230k words)     20.7549
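The Bash script behind a table like this might look something like the sketch below. I'm assuming the SRILM toolkit here (`ngram-count` to train a model, `ngram -ppl` to score the test set); the toolkit choice and all file names are my placeholders, not necessarily what we're actually using.

```shell
#!/usr/bin/env bash
# Sketch of a perplexity-table script. Assumes SRILM's ngram-count/ngram
# are on PATH; corpus and test-set file names are hypothetical placeholders.
set -u

TEST_SET="esol_test.txt"  # placeholder for the 2k-word ESOL test set

# Pull the "ppl=" value out of SRILM's `ngram -ppl` output, whose summary
# line looks like: "0 zeroprobs, logprob= -12045 ppl= 22.5623 ppl1= 30.1"
extract_ppl() {
  grep -o 'ppl= *[0-9.]*' | head -n 1 | awk '{print $2}'
}

# Guard so the sketch is a harmless no-op on machines without SRILM.
if command -v ngram-count >/dev/null 2>&1; then
  for corpus in general_english.txt esol_written.txt esol_written_for.txt; do
    lm="${corpus%.txt}.lm"
    # Train a trigram model with Kneser-Ney smoothing on each corpus...
    ngram-count -text "$corpus" -order 3 -lm "$lm" -kndiscount -interpolate
    # ...then score the shared ESOL test set and print one table row.
    ppl=$(ngram -lm "$lm" -ppl "$TEST_SET" | extract_ppl)
    printf '%s\t%s\n' "$corpus" "$ppl"
  done
fi
```

Each loop iteration prints one corpus name and its perplexity, which is exactly one row of the table above.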

I’m a little confused as to why the perplexity for general English is lower than the perplexity for English written by ESOL learners. That doesn’t make intuitive sense to me, because I would expect the ESOL-written corpus to be simpler, and therefore a model trained on it to be less “surprised” by the ESOL test set. I’m still working out the reasoning with Prof. Medero, but I think it might have to do with the sizes of the corpora we’re using. The test set is pretty small, and the ESOL corpora are small compared to the general English corpus. I’m starting to look into larger ESOL corpora to see whether that has an effect on the perplexity, but tomorrow I’m just going to focus on filling in the penultimate row of the table.
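To make the “surprised” intuition concrete: perplexity is the inverse geometric mean of the probabilities the model assigns to the test tokens, so a model that assigns lower probabilities (is more surprised) gets a higher perplexity. A toy calculation, with four made-up token probabilities:

```shell
# Toy perplexity: exp of the negative mean log-probability per token.
# The four probabilities below are invented purely for illustration.
probs="0.25 0.5 0.125 0.125"
ppl=$(echo "$probs" | awk '{ s = 0
  for (i = 1; i <= NF; i++) s += log($i)   # sum the log-probabilities
  printf "%.4f", exp(-s / NF) }')          # exp(-mean log-prob) = perplexity
echo "$ppl"   # 2^2.25, about 4.76 for these probabilities
```

Drop any of those probabilities and the perplexity goes up, which is why a model that fits the test set well should land in the low numbers we see in the table.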
