Today, I made significant progress with Stanford CoreNLP and SRILM. For Stanford CoreNLP, I was able to get what I wanted by using the Java library directly, as opposed to running CoreNLP as a service or through the command line. I’m not sure why, but the latter two options kept giving me heap space errors, even with a maximum heap size of 8GB… But thankfully, I no longer have to worry about that. I was able to get sentence parses for the first 5,000 sentences in the corpus in a reasonable amount of time, and I expect to have parses for all ~30k sentences written to a text file by tomorrow morning.
As for SRILM, this morning I was able to get it installed on Knuth. It turns out that SRILM’s build script auto-detected Knuth’s machine type incorrectly: it thought Knuth was a 32-bit machine. After some research online, I was able to manually override the autodetection, and SRILM compiled successfully! I finally have it as a command-line tool.
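For the record, the override amounts to passing the machine type to make by hand instead of letting the build guess it. This is a sketch assuming a stock SRILM source tree on an x86-64 box; the exact type string may differ by SRILM version:

```shell
# Run from the top of the SRILM source tree. SRILM normally runs
# sbin/machine-type to guess the platform; passing MACHINE_TYPE
# explicitly on the command line skips the (mis)detection.
make SRILM=$PWD MACHINE_TYPE=i686-m64 World
```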
All that remains at this point is to format the parses in a way that SRILM will accept, and I already have work for that completed from earlier in the week. I hope to finally train a language model in SRILM tomorrow. I don’t know exactly what I’m going to do with the language model yet, but I think I’m going to want to compute the model’s perplexity over various corpora. The goal is for the perplexity to be low for texts written by/for English learners, but high for texts written for English speakers in general.
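As a sketch of what that formatting step looks like (assuming the parses come out as Penn-style bracketed strings, and that SRILM just wants plain tokenized text with one sentence per line; the function names here are my own):

```python
def parse_leaves(parse):
    """Extract the leaf tokens (the words) from a Penn-style
    bracketed parse string, e.g. "(NN cat)" yields "cat"."""
    # Pad the parentheses so they split off as their own tokens.
    tokens = parse.replace("(", " ( ").replace(")", " ) ").split()
    leaves = []
    # A word is any token immediately followed by ")" that is not
    # itself a parenthesis.
    for prev, tok in zip(tokens, tokens[1:]):
        if tok == ")" and prev not in ("(", ")"):
            leaves.append(prev)
    return leaves

def parses_to_srilm_text(parses):
    """Flatten a list of bracketed parses into one tokenized
    sentence per line, the plain-text format SRILM reads."""
    return "\n".join(" ".join(parse_leaves(p)) for p in parses)
```

From the resulting text file, training and evaluation should then be a matter of SRILM’s command-line tools, something along the lines of `ngram-count -text sents.txt -order 3 -lm model.lm` to train and `ngram -lm model.lm -ppl heldout.txt` to get perplexity (exact flags to be confirmed once I actually run it).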