Yesterday, I identified three corpora that I want to analyze. The last corpus is the only one I could download directly, so that is the one I will be working with. It is not ideal, since it was not written by second-language English speakers, but it is a starting point.
Now, the next step is to identify the syntactic parser I want to use. There were several routes to consider here. I could stick with my language of choice (Python) and use a production-grade Python parser, spaCy. However, I worried about performance constraints and about whether spaCy has everything I need (which I admittedly have not investigated). I also thought about Google’s SyntaxNet, a parser released in May 2016 that reportedly achieves higher accuracy than other publicly available parsers. Specifically, I would want Parsey McParseface, the pre-trained English model that ships with SyntaxNet. The performance gains were tempting, but I was somewhat dissuaded by the likely complexity of setting it up.
I instead decided on the Stanford CoreNLP software package. It seems to be the industry standard, my initial investigation suggests it has many of the features I want, and it gives me the flexibility of working from either Python (via a wrapper or its HTTP server) or Java (natively). That way, if I ever want to turn my research code into a software package, I could easily port it to Java and still use CoreNLP.
So, I downloaded the Stanford CoreNLP release dated 12-09-2015, and I am playing around with the server. Tomorrow, I will hopefully have some syntactic parses for the corpus I am looking at.
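For reference, here is a rough sketch of how I expect to pull dependency parses out of the server's JSON responses from Python. This is untested scaffolding, not working code: the hardcoded sample below is a hand-written stand-in for a server response, written against the shape I believe CoreNLP returns (a `sentences` list containing a `basicDependencies` list), and the commented-out request shows how I would fetch the real thing once the server is running on the default port 9000.

```python
import json

# In practice the JSON would come from POSTing text to the running server,
# something like (untested, requires the `requests` package and a live server):
#   requests.post("http://localhost:9000",
#                 params={"properties": json.dumps(
#                     {"annotators": "tokenize,ssplit,pos,depparse",
#                      "outputFormat": "json"})},
#                 data="The dog runs.".encode("utf-8")).json()
# The sample below is an illustrative stand-in with the assumed response shape.
SAMPLE_RESPONSE = json.loads("""
{
  "sentences": [
    {
      "index": 0,
      "basicDependencies": [
        {"dep": "ROOT",  "governor": 0, "governorGloss": "ROOT",
         "dependent": 3, "dependentGloss": "runs"},
        {"dep": "det",   "governor": 2, "governorGloss": "dog",
         "dependent": 1, "dependentGloss": "The"},
        {"dep": "nsubj", "governor": 3, "governorGloss": "runs",
         "dependent": 2, "dependentGloss": "dog"}
      ]
    }
  ]
}
""")

def dependency_triples(response):
    """Flatten a CoreNLP-style JSON response into (relation, governor, dependent) triples."""
    triples = []
    for sentence in response.get("sentences", []):
        for edge in sentence.get("basicDependencies", []):
            triples.append((edge["dep"], edge["governorGloss"], edge["dependentGloss"]))
    return triples

print(dependency_triples(SAMPLE_RESPONSE))
```

If the response shape holds up, a flat triple list like this should be a convenient unit for whatever corpus-level syntactic analysis comes next.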