Now that we finally have our baseline feature functions, Prof. Medero and I talked this week about where to go next. We have two possible avenues: how do we create feature functions that represent difficulty, and once we have them, how do we choose the weights for those feature functions?
Creating feature functions that represent difficulty is where the real innovation of my project comes in. It is the problem researchers are currently tackling, and the part that excites me most about this project. Prof. Medero had invaluable advice for approaching it. She first asked me to research what people actually care about when looking for simplified text. For example, perhaps people value translations that avoid “rare” words, for some definition of “rare.” Or perhaps people value translations with short sentences. I need to answer questions like these first. Once I have, Prof. Medero suggested, the next step is where “the math” comes in: how do I create feature functions (using math or machine learning!) that quantify these characteristics? Again, this is hard. But it is the challenging part of the project; much of what I have done so far has been preparation for it.
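To make the idea concrete, here is a minimal sketch of what two such feature functions might look like. Everything here is an assumption for illustration: the tiny frequency table stands in for real counts from a large corpus, and the threshold and normalization constants are placeholders, not values from the project.

```python
# Hypothetical word frequencies (a real system would use corpus counts).
WORD_FREQUENCIES = {
    "the": 1_000_000, "cat": 50_000, "sat": 40_000, "on": 900_000,
    "mat": 8_000, "feline": 300, "reposed": 25,
}

def rare_word_feature(sentence, threshold=1000):
    """Fraction of words rarer than `threshold` (higher = more difficult)."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    rare = sum(1 for w in words if WORD_FREQUENCIES.get(w, 0) < threshold)
    return rare / len(words)

def sentence_length_feature(sentence, max_len=40):
    """Sentence length in words, normalized to [0, 1] (longer = more difficult)."""
    return min(len(sentence.split()) / max_len, 1.0)

# "feline" and "reposed" fall below the frequency threshold, so this
# sentence scores as harder than "the cat sat on the mat" would.
print(rare_word_feature("the feline reposed on the mat"))
```

Both functions map a candidate sentence to a number, which is exactly the shape a feature function needs in order to be weighted and summed later.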
The next avenue Prof. Medero and I talked about was: assuming we have the feature functions, how do we choose the weights for them? Based on her thesis, Prof. Medero thought it might be best to take a user-centered approach. That is, instead of deciding on weights that improve simplification quality from my perspective, we present the user of our simplification system with translation options that let them choose the weights indirectly. For example, say we have feature functions for vocabulary difficulty and sentence length. To pick the weights, we ask users to make a trade-off with a slider, where one end means the user prefers preserving the text's meaning, and the other end means they want the system to make the input text as simple as possible.
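The slider idea can be sketched as a simple linear weighting. The feature values below are made-up stand-ins (real values would come from feature functions like the ones above), and the two-feature setup is an assumption; the point is only how one slider position determines all the weights.

```python
def score(candidate_features, slider):
    """Score a candidate simplification.

    slider = 0.0 -> user cares only about preserving meaning;
    slider = 1.0 -> user cares only about simplicity.
    """
    f_meaning, f_simplicity = candidate_features
    w_meaning = 1.0 - slider
    w_simplicity = slider
    return w_meaning * f_meaning + w_simplicity * f_simplicity

# Two hypothetical candidates, as (meaning preserved, simplicity) in [0, 1]:
faithful = (0.9, 0.3)    # keeps the meaning but stays complex
aggressive = (0.5, 0.9)  # very simple but loses some meaning

for slider in (0.0, 0.5, 1.0):
    best = max([("faithful", faithful), ("aggressive", aggressive)],
               key=lambda kv: score(kv[1], slider))
    print(slider, "->", best[0])
```

Sliding from 0 toward 1 flips the winner from the faithful candidate to the aggressive one, which is exactly the indirect weight selection Prof. Medero described.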
I find both topic areas extremely interesting, but for the immediate future I would prefer to explore ideas for feature functions. Answering the first question before the second feels like the more natural progression.
In light of this, Prof. Medero suggested a challenge that will push me toward answering the first question: SemEval-2016, the International Workshop on Semantic Evaluation. It is an annual workshop that poses a series of tasks relating to the semantic evaluation of text. In particular, Prof. Medero pointed me to Task 11: Complex Word Identification. In a nutshell, the task asks us to predict whether at least one of several annotators labeled a word in a sentence as “complex” for non-native English speakers. This is a particularly fitting challenge for me, as it gives me a concrete end goal (i.e., classify as many “complex” words correctly as possible) whose solution I could potentially use as a feature function, or at least as inspiration for one.
I decided to take on the task, and much of my work this week has been exploring the dataset. The training set has roughly 2.2k examples, each of which contains: the sentence the word appears in, the word itself, the position of that word in the sentence, and whether that word is labeled as complex. Roughly 30 percent of the training examples are labeled as complex. My first inclination is to use machine learning. That is where I am this week, and hopefully I will have concrete results to share in the near future.
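Before reaching for a full machine-learning model, it helps to have a trivial baseline to beat. The rows below are invented examples that only mimic the four fields of the real dataset (sentence, target word, position, binary label); the length cutoff is an arbitrary assumption, not a tuned value.

```python
# Made-up examples in the dataset's shape: (sentence, word, position, is_complex).
examples = [
    ("the cat sat on the mat", "cat", 1, 0),
    ("the panacea proved elusive", "panacea", 1, 1),
    ("she walked home quickly", "walked", 1, 0),
    ("an egregious violation occurred", "egregious", 1, 1),
]

def predict_complex(word, length_cutoff=7):
    """Trivial baseline: call a word complex if it is at least `length_cutoff`
    characters long. A learned classifier should improve on this."""
    return 1 if len(word) >= length_cutoff else 0

correct = sum(1 for _, word, _, label in examples
              if predict_complex(word) == label)
print(f"baseline accuracy: {correct}/{len(examples)}")
```

Since about 30 percent of the real examples are positive, a classifier also has to beat the even more trivial baseline of always predicting "not complex," which would already be roughly 70 percent accurate; that class imbalance is worth keeping in mind when evaluating any model.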