I spoke to Prof. Medero today, and she cleared up a lot of the questions that I had! She explained the project by first supposing that we have three corpora:
1. A corpus of general English (right now, this would be the Wikipedia articles in Kauchak’s corpus)
2. A corpus of English written by second-language English learners, hereafter referred to as ESOLs (this would be the CIC-FCE corpus I have)
3. A corpus of English written for ESOLs (the English online corpus that I compiled)
The goal this summer is to use 1. and 2. to model what 3. would look like. This, of course, may raise several follow-up questions.
Why would we want to model 3.? Well, modeling text written for ESOLs is a necessary first step if we ultimately want to generate simplifications for ESOLs. We want to know what is considered difficult, so that we know what to simplify. I had originally thought that I would accomplish both of these steps (modeling and generating simplifications) over the summer, but I simply do not have the time to do that.
Well, why use other corpora to model 3.? Can’t we just use the corpus we’re trying to model? As it turns out, we can do this for ESOLs. There are plenty of corpora out there written for ESOLs. But suppose that we wanted to extend our approach to other audiences besides ESOLs (e.g., readers with dyslexia or aphasia). It turns out that finding corpora written for more specific audiences is quite difficult. Therefore, in the hope of coming up with generalizable results, we want to use corpora that we can easily find (general English and English written by specific reading audiences) to model corpora that we cannot easily find.
How are we going to combine two different corpora to model 3.? Well, this is an open question so far. Prof. Medero suggested some sort of linear interpolation of language models, or some kind of log-likelihood ratio combination. It’s up to me to investigate these approaches in the coming weeks and see what works best.
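To make the two suggestions concrete for myself, here is a minimal sketch in Python. It is not the method we have settled on: the unigram models, the add-one smoothing, the toy sentences, the weight `lam`, and all function names are my own assumptions for illustration. A real experiment would use higher-order n-gram models trained on the actual corpora.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(p_general, p_learner, lam):
    """P(w) = lam * P_general(w) + (1 - lam) * P_learner(w), with 0 <= lam <= 1."""
    return {w: lam * p_general[w] + (1 - lam) * p_learner[w] for w in p_general}

def log_likelihood_ratio(p_general, p_learner):
    """Per-word log(P_learner(w) / P_general(w)).

    A positive value means the learner model gives the word more probability.
    """
    return {w: math.log(p_learner[w] / p_general[w]) for w in p_general}

# Toy stand-ins (assumed) for corpus 1 (general English) and corpus 2 (learner English).
general = "the committee subsequently ratified the proposal".split()
learner = "the people said yes to the plan".split()
vocab = set(general) | set(learner)

p_general = train_unigram(general, vocab)
p_learner = train_unigram(learner, vocab)

# Equal weighting is an arbitrary choice here; lam would need to be tuned.
p_mixed = interpolate(p_general, p_learner, lam=0.5)
llr = log_likelihood_ratio(p_general, p_learner)
```

Because both component models are proper distributions over the same vocabulary, the interpolated model is one too, which is what makes it usable for computing perplexity later. The log-likelihood ratio, by contrast, is a score rather than a distribution, so it would have to be combined with a model in some further way that I have not worked out yet.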
How do we measure success in modeling 3.? This is also somewhat of an open question. Prof. Medero and I posited using perplexity as the measure. The ideal goal is to be able to train two models: one trained on 3., and one that combines 1. and 2. in some way. If we hold out a test set of text written for ESOLs, and get similar perplexities on it from each model, then we will have achieved our goal. Obviously, there are some nuances to this approach, but this is the general idea.
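For reference, perplexity is the exponential of the average negative log-probability that a model assigns to each token of the test set, so lower is better. Here is a minimal sketch of that calculation, assuming a model represented as a plain dictionary of word probabilities with a hypothetical `<unk>` entry for unseen words. Both the representation and the names are my own placeholders, not the toolkit I will end up using.

```python
import math

def perplexity(probs, tokens, unk="<unk>"):
    """exp(-(1/N) * sum(log P(w_i))) over the N tokens of a test set.

    Words missing from `probs` fall back to the probability stored under `unk`.
    """
    log_prob = sum(math.log(probs.get(tok, probs[unk])) for tok in tokens)
    return math.exp(-log_prob / len(tokens))

# Toy model (assumed): a uniform distribution over four entries.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "<unk>": 0.25}
test_tokens = ["a", "b", "c", "a"]

pp = perplexity(uniform, test_tokens)
print(f"perplexity = {pp:.3f}")
```

A uniform model over four outcomes should come out at a perplexity of about 4, which is a handy sanity check. In the actual comparison, the same function would be called once with the model trained on 3. and once with the combined 1. + 2. model, both on the same held-out test set.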
Armed with all of this information, I’m setting out to create a table that resembles the following:
| Corpora to Train Language Model On | Perplexity on Test Set of Corpus for ESOLs |
|---|---|
| 1. + 2. | … |
The goal is to have the penultimate and last rows of the table give similar enough perplexities. If that’s the case, then we will know we have achieved our goal. To that end, I have begun writing a Bash script that generates this table, so that we can regenerate it quickly and easily.