Now that I have nailed down my reading audience, the next step is figuring out how to simplify text for them. To do that, I need corpora that I can use to identify the types of simplifications targeted toward second-language English speakers. There are a couple that I investigated today:
- I investigated the corpus (complex sentences paired with 7 possible simplifications) that Xu et al. created in their paper “Optimizing Statistical Machine Translation for Text Simplification” to test their SMT system. Unfortunately, the data will not be released until early August, so that’s out of the question. It also has the disadvantage that we don’t know what audience the simplifications were targeted toward.
- I could use Wikipedia edit history to identify simplifications. However, that data tends to be noisy, and it has the same disadvantage as above.
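To make the Wikipedia idea concrete, here is a minimal sketch of how one might mine candidate simplification pairs from two revisions of an article. Everything here is my own illustration, not an established pipeline: I assume the revisions have already been split into sentences, I use character-level similarity (`difflib.SequenceMatcher`) to pair each old sentence with its closest rewrite, and I use "the rewrite is shorter" as a very rough proxy for "the edit was a simplification." The `min_ratio` threshold is an arbitrary guess that would need tuning.

```python
import difflib

def candidate_simplification_pairs(old_sentences, new_sentences, min_ratio=0.6):
    """Pair each sentence in the old revision with its most similar sentence
    in the new revision; keep pairs where the new sentence changed, is similar
    enough to be a rewrite (not an unrelated sentence), and is shorter
    (a crude proxy for simplification)."""
    pairs = []
    for old in old_sentences:
        best, best_ratio = None, 0.0
        for new in new_sentences:
            ratio = difflib.SequenceMatcher(None, old, new).ratio()
            if ratio > best_ratio:
                best, best_ratio = new, ratio
        if best is not None and best != old and best_ratio >= min_ratio and len(best) < len(old):
            pairs.append((old, best))
    return pairs

# Toy revisions (invented examples): one sentence was reworded, one untouched.
old_rev = ["He utilized the apparatus to measure the temperature.",
           "The weather was pleasant."]
new_rev = ["He used the apparatus to measure the temperature.",
           "The weather was pleasant."]

print(candidate_simplification_pairs(old_rev, new_rev))  # one candidate pair
```

Even this toy version hints at why the data is noisy: many shortening edits are deletions of content or vandalism reverts rather than simplifications, so real use would need heavier filtering.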
- Sarah Petersen’s dissertation examines a simplification corpus and identifies the simplifications made for second-language English speakers. But I believe Prof. Medero said that, since she is no longer working with Petersen, she does not have access to that data.
- A paper released at NAACL’s workshop on Building Educational Applications, “Predicting the Spelling Difficulty of Words for Language Learners” by Beinborn et al., discusses a few corpora written by second-language English learners. One avenue that Prof. Medero suggested I explore was using this data to learn which grammatical constructs second-language English learners find difficult. While I think this makes sense, I imagine it would be extremely difficult to do.
Over the weekend, I am going to think about these corpora and decide which one is worth investigating next week.