July 15 (Week 4) – Maury

Yesterday, I set up Stanford CoreNLP to parse Kauchak’s Wikipedia corpus, and that job is still running. Today, my focus is on finding another corpus of language written for English learners. Prof. Medero proposes that if we interpolate a language model trained on text for English learners with a model trained on text at a standard reading level (Kauchak’s Wikipedia corpus), we can create a language model that estimates the difficulty of text for English learners. As I’m writing this, I am not sure why we need interpolation to achieve this, when we could find text written for English learners and base the model solely on that. But I’ll wait until I can ask her.
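To make the interpolation idea concrete, here is a minimal sketch of linear interpolation between two language models. It assumes simple add-one-smoothed unigram models and a single mixing weight `lam`; the toy token lists, the function names, and the choice of unigrams are all illustrative assumptions, not the models actually planned for this project.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed, shared vocabulary.

    Assumes every token in `tokens` is also in `vocab`.
    """
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(p_learner, p_standard, lam):
    """Linear interpolation: lam * P_learner(w) + (1 - lam) * P_standard(w)."""
    return {w: lam * p_learner[w] + (1 - lam) * p_standard[w] for w in p_learner}

def avg_log_prob(model, tokens):
    """Average log-probability per token; higher suggests more 'expected' text.

    Raises KeyError for tokens outside the vocabulary (no OOV handling here).
    """
    return sum(math.log(model[t]) for t in tokens) / len(tokens)

# Toy stand-ins for the two corpora (made-up sentences, not real data).
learner_tokens = "the cat sat on the mat".split()
standard_tokens = "the feline reposed upon the ornate rug".split()
vocab = set(learner_tokens) | set(standard_tokens)

p_learner = train_unigram(learner_tokens, vocab)
p_standard = train_unigram(standard_tokens, vocab)
mixed = interpolate(p_learner, p_standard, lam=0.5)
```

Because both component models are proper distributions over the same vocabulary, the mixture is one too. In practice the weight `lam` would normally be tuned on held-out text rather than fixed at 0.5.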

I found this site, English Online, that contains articles written specifically for learners of English. As a starting point, I wrote a script that downloads all of the articles and extracts the sentences from them. I then started parsing them with Stanford CoreNLP, and I am hoping the parses will finish over the weekend.
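The sentence-extraction step of a script like that could look roughly like the sketch below. It is not the actual script: it assumes the article body lives in `<p>` tags, uses a naive punctuation-based sentence splitter, and runs on a made-up HTML snippet instead of a downloaded page. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (assumed to hold the article body)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)

def extract_sentences(html):
    """Return a flat list of sentences from the <p> text of an HTML page."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    sentences = []
    for paragraph in parser.paragraphs:
        # Naive split: break after ., ! or ? when followed by whitespace.
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s)
    return sentences

# Made-up page content for illustration only.
sample = (
    "<html><body><h1>Title</h1>"
    "<p>Volcanoes are mountains. Some of them erupt!</p>"
    "<p>Is lava hot? Yes.</p></body></html>"
)
sentences = extract_sentences(sample)
```

A splitter this simple will break on abbreviations such as "Dr." or "U.S.", so a real pipeline might leave sentence splitting to the parser's own tokenizer instead.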

