August 1 (Week 7) – Maury

Today, I spent time implementing the interpolation of the language models trained on the general corpus and the learner-written corpus. This gave me the completed table:

| Corpora to train language model on | Perplexity on test set of corpus for ESOLs (2k) |
|---|---|
| General English (12m words) | 22.5623 |
| English written by ESOL learners (532k words) | 23.2972 |
| General English + English written by ESOL learners (linear interpolation) | 21.7679 |
| General English + English written by ESOL learners (log-linear interpolation) | 21.5548 |
| English written for ESOL learners (230k words) | 20.7549 |
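In case it's useful later, here is a minimal sketch of the two interpolation schemes. The function names, the 0.5 mixing weight, and the toy probabilities are all placeholders rather than the actual models or weights behind the table above; the point is just the difference between mixing probabilities directly and mixing them in log space.

```python
import math

def linear_interp(p_general, p_esol, lam=0.5):
    """Linear interpolation: mix the two models' probabilities directly.
    lam is the weight on the general-English model (placeholder value)."""
    return lam * p_general + (1 - lam) * p_esol

def loglinear_interp(p_general, p_esol, lam=0.5):
    """Log-linear interpolation: mix in log space (a weighted geometric mean).
    Strictly, this should be renormalized over the vocabulary to be a proper
    distribution; it is shown per word here only for illustration."""
    return math.exp(lam * math.log(p_general) + (1 - lam) * math.log(p_esol))

def perplexity(log_probs):
    """Perplexity from a list of per-token natural-log probabilities."""
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)

# Toy example with made-up per-token probabilities from the two models:
p_gen = [0.10, 0.05, 0.20]
p_eso = [0.08, 0.09, 0.15]
mixed = [linear_interp(a, b) for a, b in zip(p_gen, p_eso)]
print(perplexity([math.log(p) for p in mixed]))
```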

Perfect! The interpolation (kind of) does what we expect it to do, which is to lower the perplexity. The next major question I have to answer is: how do we know whether the difference between two perplexities is statistically significant? Answering this will let me compare two models' perplexities with some confidence, so my next task is to figure it out. Today, I've been working on extending my Bash script to do this significance testing automatically. Hopefully, I'll have something working tomorrow. Once I do, I can start tinkering with the parameters of each language model and see how varied I can make the results.
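This isn't the script itself yet, but one standard recipe for this kind of comparison is a paired bootstrap over test sentences: resample the sentences with replacement, recompute each model's perplexity on the resample, and see how often one model fails to beat the other. A rough Python sketch, under the assumption that per-sentence log-probabilities and token counts can be dumped from the LM toolkit for both models:

```python
import math
import random

def corpus_ppl(sent_logprobs, sent_tokens):
    """Perplexity from per-sentence total log-probs (natural log) and token counts."""
    return math.exp(-sum(sent_logprobs) / sum(sent_tokens))

def paired_bootstrap(lp_a, lp_b, tokens, n_resamples=10000, seed=0):
    """Paired bootstrap test for a perplexity difference.

    lp_a, lp_b: per-sentence log-probabilities under models A and B,
                scored on the same test sentences.
    tokens:     per-sentence token counts.
    Returns the fraction of resamples in which A is NOT better than B,
    a rough p-value for the claim 'A has lower perplexity than B'."""
    rng = random.Random(seed)
    n = len(tokens)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        ppl_a = corpus_ppl([lp_a[i] for i in idx], [tokens[i] for i in idx])
        ppl_b = corpus_ppl([lp_b[i] for i in idx], [tokens[i] for i in idx])
        if ppl_a >= ppl_b:
            not_better += 1
    return not_better / n_resamples
```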
