Today, I decided to abandon the whole “examine the errors in the corpus” approach that I was trying earlier, and I’m going to go back to examining parse trees. I spoke to Prof. Medero today, and she suggested that I first create a parse tree for every sentence in the corpus. Then I grab the Penn Treebank annotation for each word in the sentence and build a language model from the frequencies of each annotation. After that, I can estimate how likely a text’s syntactic structure is to appear, relative to the text in the corpus. And that, in turn, might help me judge how difficult a text is for second-language English speakers.
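As a rough sketch of the idea, the tag-frequency model could look something like the following. The tag sequences here are hand-written toy examples standing in for parser output, and the flooring of unseen tags is just one simple smoothing choice, not necessarily what the final model will use.

```python
import math
from collections import Counter

def train_tag_model(tagged_corpus):
    """Estimate Penn Treebank tag probabilities from a corpus of tag sequences."""
    counts = Counter(tag for sent in tagged_corpus for tag in sent)
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}

def sequence_log_prob(model, tags, floor=1e-6):
    """Sum of log tag probabilities; unseen tags get a small floor probability."""
    return sum(math.log(model.get(t, floor)) for t in tags)

# Toy tag sequences standing in for parser output (illustrative only).
corpus = [
    ["DT", "NN", "VBZ", "JJ"],
    ["PRP", "VBD", "DT", "NN"],
    ["DT", "JJ", "NN", "VBZ", "RB"],
]
model = train_tag_model(corpus)
common = sequence_log_prob(model, ["DT", "NN", "VBZ"])
rare = sequence_log_prob(model, ["UH", "SYM", "FW"])  # tags unseen in the corpus
print(common > rare)  # more typical syntactic structures score higher
```

A unigram model over tags is the simplest version; moving to bigrams or trigrams of tags would capture more of the sentence structure, at the cost of needing more data.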
So far, I’ve extracted all of the sentences. But I’ve hit a problem with Stanford CoreNLP: it keeps timing out when I give it a large number of sentences to process. That was the roadblock at the end of the day, so I decided to give up and try again on Tuesday.
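One workaround I may try is splitting the sentences into small batches so that no single request to the annotator runs long enough to time out. This is only a sketch of the batching logic; the `annotate` call mentioned in the comment is a hypothetical placeholder for whatever CoreNLP interface I end up using.

```python
def batches(items, size):
    """Yield successive fixed-size slices so each request stays small."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

sentences = [f"sentence {n}" for n in range(10)]
chunks = list(batches(sentences, 4))
print([len(c) for c in chunks])  # e.g. [4, 4, 2]

# Each chunk would then be sent to the parser as its own request, roughly:
#   for chunk in chunks:
#       results.extend(annotate(chunk))  # annotate() is hypothetical
```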