These past two days, I have spent time thinking about the direction of the project, and I have some questions that I really want answered.
- If we have corpora of text written by English learners, why don’t we just use those corpora to model learner-written text directly? Why should we instead get that model by interpolating a model of ordinary English text with a model of text written for English learners?
- I know that we can measure the success of the interpolation using perplexity. But when is a difference in perplexity actually significant?
- Are we realistically going to get to the next part of this project, which is actually generating simplifications?
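To make the first question concrete, here is a minimal sketch of what the interpolation looks like, assuming simple unigram models (the function and variable names are my own, purely for illustration): the interpolated model assigns each word a weighted average of its probability under the two component models.

```python
from collections import Counter

def train_unigram(tokens):
    """Estimate a unigram model as relative frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(word, p_normal, p_learner, lam=0.5):
    """P(w) = lam * P_normal(w) + (1 - lam) * P_learner(w)."""
    return lam * p_normal.get(word, 0.0) + (1 - lam) * p_learner.get(word, 0.0)

# Toy corpora standing in for the real ones.
p_normal = train_unigram(["the", "cat", "sat"])
p_learner = train_unigram(["the", "dog"])
print(interpolated_prob("the", p_normal, p_learner))  # 0.5*(1/3) + 0.5*(1/2)
```

The question above amounts to asking why we should tune `lam` at all rather than training on the learner corpus alone.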
While thinking about those issues, I found an article that explores an improvement to part-of-speech language modeling. After skimming it, I realized that instead of the “sentences” passed into the language models being pre-order traversals of syntax trees, they should probably just be the actual words of the sentence replaced with their parts of speech. I tweaked a Python script to do that, and the results I report from now on will use this approach until I figure out a more meaningful way to use the parse trees.
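The tweak itself is simple; here is a sketch of the word-to-tag replacement, using a toy lookup lexicon as a stand-in for a real POS tagger (in practice one would use an actual tagger such as NLTK’s; the lexicon and function names here are hypothetical):

```python
# Toy tag lexicon standing in for a real POS tagger.
TOY_LEXICON = {
    "the": "DT", "a": "DT",
    "cat": "NN", "mat": "NN",
    "sat": "VBD",
    "on": "IN",
}

def to_pos_sentence(words, lexicon=TOY_LEXICON):
    """Replace each word with its POS tag, so the 'sentence' fed to the
    language model is a flat tag sequence, not a tree traversal.
    Unknown words fall back to a generic noun tag."""
    return [lexicon.get(w.lower(), "NN") for w in words]

print(to_pos_sentence("the cat sat on the mat".split()))
# ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN']
```

The point is that the tag sequence preserves the sentence’s word order, which a pre-order tree traversal does not.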