So, last week, I nailed down how I wanted to simplify text for second-language English speakers. I also worked out that I need to find samples of written text illustrating what those speakers find difficult.
To be honest, for a lot of today I was bogged down thinking about how exactly we are going to make the syntactic simplifications. A lot of other papers, such as Zhu et al. in their paper and Aluisio et al. in theirs, relied on elaborate, hand-written rules for simplification. And in other work, such as this paper, the rules were learned from Wikipedia corpora. I honestly got overwhelmed for much of the day, and it took me a while to realize that I cannot think about that stage right now. First, I have to identify exactly which syntactic simplifications we want to make. Only then can I worry about how we are going to perform them.
That being said, I have identified a few corpora that can help me find out more about those syntactic simplifications. They are all corpora of text written by or targeted toward second-language English speakers:
- ETS Corpus of Non-Native Written English, from the Linguistic Data Consortium
- EF-Cambridge Open Language Database of essays written by second-language English learners (which I first discovered from this paper)
- Reading passages from the five main suite Cambridge English exams, targeted toward (not written by) L2 English learners.
Tomorrow I am going to continue looking for tools that I can use to analyze them. I will start by trying to find a parser (such as Stanford’s sentence parser) and getting a basic sentence parse working on my computer. After that, I will work on retrieving the datasets.
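As a warm-up before wiring up a full parser like Stanford’s, a toy constituency parse can be produced with NLTK’s chart parser. This is just a sketch to show what a parse tree looks like; the grammar and sentence below are made up for illustration and are not from any of the corpora above.

```python
import nltk

# A tiny hand-written CFG, purely illustrative -- a real setup would use
# a trained parser (e.g. Stanford's) rather than a toy grammar.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'student' | 'essay'
V -> 'wrote'
""")

parser = nltk.ChartParser(grammar)
sentence = "the student wrote the essay".split()

# Collect every parse licensed by the grammar (here, exactly one).
trees = list(parser.parse(sentence))
for tree in trees:
    print(tree)
```

Inspecting trees like this should make it easier to pin down which syntactic structures (relative clauses, passives, long coordinations, etc.) the eventual simplification rules need to target.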