Still working on examining the FCE dataset, and I’m honestly having a lot of trouble coming up with hypotheses about which syntactic structures might be difficult. I talked to Michelle about it, and she suggested that I examine the errors that are in the corpus. The errors might be indicative of structures that the L2 English learners have not yet mastered, possibly because they find those structures difficult to interpret and thus to produce in writing. So I’ve worked on that, and I’ve summarized the most common mistakes made by L2 English learners, grouped by their native language. But I have yet to figure out how I am going to extract meaning from this.
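To make that summary concrete: tallying error codes per native language is a small scripting job once the annotations are parsed. Here is a minimal sketch, assuming an annotation scheme where an `<NS type="...">` element wraps each error, with the learner’s text in `<i>` and the correction in `<c>` (the tag names, error codes, and the inline sample below are my own stand-ins and may differ from the released dataset):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tally_errors(xml_text):
    """Count the error codes in one annotated script."""
    root = ET.fromstring(xml_text)
    return Counter(ns.get("type") for ns in root.iter("NS"))

# Toy stand-in for one annotated answer (error codes are hypothetical).
sample = (
    '<answer><p>I look forward to '
    '<NS type="RV"><i>hear</i><c>hearing</c></NS> from you '
    '<NS type="TV"><i>send</i><c>sent</c></NS> soon.</p></answer>'
)

by_l1 = {}  # native language -> Counter of error codes
by_l1["French"] = tally_errors(sample)
```

Aggregating these counters across all scripts for each L1 would give exactly the per-language summary described above.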
I posted my problem on Stack Overflow – no answers yet. I do hope there is a good way to do this. In the meantime, I messed around with Firebase. The online tutorials were either outdated or not very thorough, so thanks to Michael’s help I got data uploading to the database in a few simple lines. Firebase also offers remote settings, so that’ll be exciting if I can get it working, but it’s not as straightforward as the data sending I did today. Hopefully I’ll get a few answers on my S.O. thread tonight so I can start looking into that tomorrow.
So, today, I have been coding and do not have much to talk about. Originally, I had planned to use the third corpus that I mentioned in yesterday’s post, but I did not read the page correctly, and that data has not been released yet. So, instead, I looked at this paper and found this corpus, the CIC-FCE dataset. The majority of the day has been spent just examining this corpus and trying to massage it into a form that I can easily feed into the Stanford CoreNLP parser. No hard results yet, but I hope that changes tomorrow.
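Part of that massaging is flattening the annotated XML into plain text that a parser can consume. A minimal sketch of one plausible approach, assuming errors are marked by `<NS>` elements containing the learner’s text in `<i>` and a correction in `<c>` (my reading of the format, which may differ from the actual dataset), keeping the corrected form:

```python
import xml.etree.ElementTree as ET

def corrected_text(xml_text):
    """Flatten an annotated answer to plain text, preferring each
    correction (<c>) over the original error (<i>)."""
    root = ET.fromstring(xml_text)
    # Drop the <i> elements so only corrected text remains.
    for ns in root.iter("NS"):
        for i in ns.findall("i"):
            ns.remove(i)
    # Join the remaining text and normalize the whitespace.
    return " ".join("".join(root.itertext()).split())

sample = ('<p>I look forward to '
          '<NS type="RV"><i>hear</i><c>hearing</c></NS> from you.</p>')
print(corrected_text(sample))  # I look forward to hearing from you.
```

Running each script through something like this would yield clean sentences, one per line, that the parser can take as-is.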
I finally got text-to-bitmap conversion working this afternoon, but as soon as the bitmap image’s width exceeds 2000, it becomes a black rectangle, just like what SKLabelNode was doing. Breaking the text up into multiple bitmap images would really complicate the code, so I’d rather avoid that. I also thought about exporting the image, but when I generated a long image externally and put it in the app, it too came out as a black rectangle. So it seems that iOS can’t render images that far exceed the size of the screen, and I may need some alternative method to hook text up to SpriteKit physics. I am planning to post on Stack Overflow tomorrow to see if anyone has a solution for this extremely specific problem.
Yesterday, I identified three corpora that I want to analyze. The last corpus is the only one that I could download directly, so that is the one I will be working with. It is not ideal, because it was not written by second-language English speakers, but it is a starting point.
Now, the next step is to identify the syntactic parsers that I want to use. There were several routes to consider here. I could stick to my language of choice (Python) and use a production-level Python parser, spaCy. However, I worried about performance constraints and about whether spaCy will have everything I want it to have (which I admittedly have not investigated). Then, I thought about using Google’s SyntaxNet, a parser released in May 2016 that purportedly has higher accuracy than any other parser out there. I would actually want to use Parsey McParseface, a pre-trained English model of SyntaxNet. I was tempted by the accuracy gains, but was somewhat dissuaded by the likely complexity of setting it up.
I instead decided on the Stanford CoreNLP software package. It seems like the industry standard, it has many features from what my initial investigation tells me, and it gives me the flexibility of using either Python or Java. That way, if I ever want to turn my research code into a software package, I could easily port it to Java and still use CoreNLP.
So, I downloaded Stanford CoreNLP released on 12-09-2015, and I am playing around with the server. Tomorrow, I will hopefully have some syntactic parses for the corpus I am looking at.
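For reference, the CoreNLP server accepts raw text over HTTP, with the annotator pipeline passed as a URL-encoded JSON `properties` parameter. A minimal sketch, assuming the server is running locally on its default port 9000 (the host and the annotator list here are just my current guesses at a sensible setup):

```python
import json
import urllib.parse
import urllib.request

# Annotator pipeline: tokenize, split sentences, then constituency-parse.
PROPS = json.dumps({"annotators": "tokenize,ssplit,parse",
                    "outputFormat": "json"})

def parse(text, host="http://localhost:9000"):
    """Send raw text to a running CoreNLP server and return its JSON reply."""
    url = host + "/?properties=" + urllib.parse.quote(PROPS)
    req = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# With the server up, the parse tree for a sentence would be e.g.:
# parse("The cat sat on the mat.")["sentences"][0]["parse"]
```

The JSON reply carries one entry per sentence, so pulling out parse trees for a whole corpus should just be a loop over this function.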
So, last week, I nailed down how I wanted to simplify text for second-language English speakers. I also determined that I need to find samples of written text that illustrate what they find difficult.
To be honest, for a lot of today, I was bogged down thinking about how exactly we are going to make the syntactic simplifications. It seems like a lot of other papers, such as Zhu et al. in their paper and Aluisio et al. in theirs, relied on elaborate, hand-written rules for simplification. And in other cases, such as in this paper, the rules were learned from Wikipedia corpora. I honestly got overwhelmed for much of the day. And it took me a while to realize that I cannot think about that stage right now. First, I have to identify exactly which syntactic simplifications we want to make. Only then can I worry about how we are going to perform them.
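To keep myself grounded, I wrote out what one of those hand-written rules might look like in miniature. This is purely my own toy illustration, not a rule from any of the papers above: real systems match patterns over parse trees, while this string-level version just splits off a non-restrictive “which” clause:

```python
import re

def split_which_clause(sentence):
    """Toy rule: 'NP, which CLAUSE, REST' -> two simpler sentences.
    String-level only; a real rule would operate on a parse tree."""
    m = re.match(r"(.+?), which (.+?), (.+)", sentence)
    if not m:
        return sentence  # rule does not apply
    np, clause, rest = m.groups()
    return f"{np} {rest} {np} {clause}."

print(split_which_clause("The bridge, which was built in 1900, is closed."))
# The bridge is closed. The bridge was built in 1900.
```

Even this toy version shows why the rule sets in those papers get elaborate: the pattern must find the clause boundaries and reuse the right noun phrase as the new subject, which is exactly what a parse tree makes reliable.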
That being said, I have identified a few corpora that can help me find out more about those syntactic simplifications. They are all corpora of text written by, or targeted toward, second-language English speakers. They are:
- ETS Corpus of Non-Native Written English, from the Linguistic Data Consortium
- EF-Cambridge Open Language Database of essays written by second-language English learners (which I first discovered from this paper)
- Reading passages from the five main suite Cambridge English exams, targeted toward (not written by) L2 English learners.
Tomorrow I am going to continue looking for tools that I can use to analyze them. I will start by trying to find a parser (such as Stanford’s sentence parser) and getting a basic sentence parse working on my computer. After that, I will work on retrieving the dataset.
The SKLabelNode that I got tilting smoothly last Friday can’t handle large quantities of text, which is what I need for TextScroll. If I overload it, the label turns into a black bar, a bug that hasn’t been fixed yet. I tried connecting a UILabel to the GameScene, but Stack Overflow advised me to keep UIKit stuff and GameScene stuff separate. Then I came across a suggestion to convert the text to a bitmap image and then use that image as an SKSpriteNode. This would let me apply SpriteKit physics directly to the text (no need for the ball!), and it works well with Adam’s data collection system. The person claimed that it wasn’t difficult to do, but the code they provided was a brief summary in Objective-C that I couldn’t readily implement, so I tried finding other tutorials in Swift. There were complete code samples for drawing text into larger images, but all I want is a function that takes some text and gives back a bitmap version of it in a narrow rectangle. I tried modifying the code I found online to do this, but no rectangle or text would appear when I ran my code on the iPad. Prof. Medero suggested that I color the rectangle in to see whether it was the rectangle or the text that wasn’t getting drawn, but I was at a loss for how to do this with a CGRect. While I have experience with UIKit objects in the storyboard editor and with SpriteKit objects, I hadn’t realized that I had left Core Graphics unexplored, which is why I couldn’t make fully informed decisions about modifying the code I found. I am currently following a tutorial on Core Graphics to remedy that. It’s also teaching me a lot about how to make unique buttons and animations, so that’s exciting as well.
Once I get this string-to-bitmap conversion working, I hope that SpriteKit and Core Graphics will make implementing everything else much easier than if I had tried to modify the old Scrolling Label.