Week 14, December 7-11 – Maury

If you recall from last week’s post, I am now taking a break from Moses and instead concentrating my efforts on creating a machine learning model that classifies words as complex (Task 11 of SemEval 2016). After doing some preliminary research, I threw together a machine learning pipeline, from dataset preprocessing to metrics calculation, just to get something working.

Recall some information about the dataset from last week:

 So I have roughly 2.2k examples, each of which has: the sentence the word appears in, the word itself, the position of that word in the sentence, and whether that word is labeled as complex.

I used the following as features for each training example:

  • Word length — I included this with the guess that longer words tend to be labeled as complex more frequently.
  • Word part-of-speech — I included this hoping to capture a possible hidden relationship between part-of-speech (and possibly other features) and its classification as “complex.”
  • Unigram word count — This is how many times a word appears in a corpus. The idea here is that words labeled as complex appear less frequently in the corpus. I used the counts provided to me by the Word frequency data’s free database of the “top 5,000 words/lemmas from the 450 million word Corpus of Contemporary American English,” which is “the only large and balanced corpus of American English.” I expect this to be the most indicative feature.
  • Unigram word count of the preceding and following word — Again, perhaps a hidden relationship here?
  • Bigram word count, one where the word in question is the first word in the bigram, and one where the word in question is the second word in the bigram — Again, perhaps a hidden relationship.
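Putting the list above together, the feature vector for one example might look like the following sketch. The `unigram_counts` and `bigram_counts` dicts are hypothetical stand-ins for the COCA-based frequency lists described above, and the part-of-speech tag is assumed to be precomputed (the post does not say which tagger was used):

```python
# Sketch of the feature vector for one training example.
# unigram_counts / bigram_counts are hypothetical frequency dicts;
# pos_tag is assumed to come from some external POS tagger.

def extract_features(tokens, index, pos_tag, unigram_counts, bigram_counts):
    word = tokens[index].lower()
    prev_word = tokens[index - 1].lower() if index > 0 else None
    next_word = tokens[index + 1].lower() if index + 1 < len(tokens) else None
    return {
        "word_length": len(word),
        "pos": pos_tag,
        "unigram_count": unigram_counts.get(word, 0),
        "prev_unigram_count": unigram_counts.get(prev_word, 0),
        "next_unigram_count": unigram_counts.get(next_word, 0),
        # Bigram counts: target word first, then target word second.
        "bigram_count_first": bigram_counts.get((word, next_word), 0),
        "bigram_count_second": bigram_counts.get((prev_word, word), 0),
    }
```

Missing neighbors (at sentence boundaries) and out-of-vocabulary words simply fall back to a count of zero here; that is one plausible choice, not necessarily what the real pipeline does.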

And I used the following machine learning models:

  • Majority class classifier, where all training examples are predicted to be labeled as “not complex.”
  • Decision tree, tuned for the best depth using K-Folds cross-validation (with 10 folds). I used scikit-learn’s implementation here.
  • Perceptron, with no special parameters. I used scikit-learn’s implementation here.
  • Support Vector Machine (SVM) with C = 5 using scikit-learn‘s implementation.
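A sketch of how these four models could be set up in scikit-learn. The depth-search range and random seeds are my assumptions; the post only says the tree depth was tuned with 10-fold cross-validation:

```python
# Sketch of the four classifiers above. X is the feature matrix built
# from the features listed earlier, y the complex/not-complex labels.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_models(X, y):
    # Tune the tree depth with 10-fold cross-validation, as in the post.
    # The candidate depth range (1..20) is an assumption.
    kfold = KFold(n_splits=10, shuffle=True, random_state=0)
    best_depth = max(
        range(1, 21),
        key=lambda d: cross_val_score(
            DecisionTreeClassifier(max_depth=d, random_state=0),
            X, y, cv=kfold,
        ).mean(),
    )
    return {
        "majority": DummyClassifier(strategy="most_frequent"),
        "tree": DecisionTreeClassifier(max_depth=best_depth, random_state=0),
        "perceptron": Perceptron(),
        "svm": SVC(C=5),
    }
```

The majority-class baseline is just scikit-learn’s `DummyClassifier` with `strategy="most_frequent"`, which predicts “not complex” for every example.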

The results with these features/models are not great. Here is a summary of their performance metrics, computed and averaged using K-Folds cross-validation with k = 10.

Classifier                 Accuracy   Precision   Recall
Majority Class Baseline    0.6844     N/A         0.0000
Decision Tree (depth 5)    0.7349     0.6283      0.4092
Perceptron                 0.6884     0.5398      0.1211
Support Vector Machine     0.5168     0.4019      0.5630
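For reference, metrics like these can be computed per fold and averaged over the ten folds; the sketch below shows one way to do it with scikit-learn (the function name and the zero-division handling, which mirrors the baseline’s undefined precision, are my choices):

```python
# Compute accuracy, precision, and recall per fold, then average.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import KFold

def evaluate(model, X, y, k=10):
    accs, precs, recs = [], [], []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        # zero_division=0: a fold with no predicted positives gets
        # precision 0 instead of raising a warning.
        precs.append(precision_score(y[test_idx], pred, zero_division=0))
        recs.append(recall_score(y[test_idx], pred, zero_division=0))
    return np.mean(accs), np.mean(precs), np.mean(recs)
```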

The major take-away from this table is that none of the models performs significantly better than the majority class baseline, at least in terms of accuracy. This is not good, and is likely due to the features selected above: by themselves, they are simply not informative enough. If I want the non-trivial classifiers to perform better, the key is to devise more informative features until I reach the performance I want. In the end, I would like at least 80 percent accuracy.

Unfortunately, my work for the near future stops here. It has been a fantastic learning experience for me, and I am excited to return to blogging about my progress on both this and my automatic text simplification work when my next (and final!) semester kicks off (January 2016).

Week 13, November 30-December 4 – Maury

Now that we have our baseline feature functions, which we have wanted for so long, Prof. Medero and I talked a little this week about where to go next. We have two possible avenues: How do we create feature functions that represent difficulty? And once we have them, how do we choose the weights for those feature functions?

Creating feature functions that represent difficulty is where the real innovation of my project comes in. This is the problem researchers are currently tackling, and the part that excites me the most. Prof. Medero offered invaluable advice for tackling it. She first asked me to research what people care about when looking for simplified text. For example, perhaps people value translations that do not have “rare” words, for some definition of “rare.” Or perhaps people value translations where the sentences are short. I need to answer questions like these first, with some research. Once I have, Prof. Medero suggested, the next step is where “the math” comes in: how do I create feature functions (using math or machine learning!) that quantify these characteristics? Again, this is hard. But it is the challenging part of the project; much of what I have done so far has been preparation for it.

The next avenue that Prof. Medero and I talked about was: assuming we have the feature functions, how do we choose their weights? Prof. Medero, based on her thesis, thought it might be best to take a user-centered approach. That is, instead of deciding on the weights that improve simplification quality from my perspective, we present the user of our simplification system with translation options that let them indirectly choose the weights. For example, say that we have feature functions for vocabulary difficulty and sentence length. Then, to pick the weights, we ask the user to make a trade-off (with a slider) where one end means that the user prefers preserving the text’s meaning, and the other end means they want the system to make the input text as simple as possible.
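The slider idea can be sketched very simply: one value t in [0, 1] interpolates between weighting meaning preservation and weighting simplicity. The feature names, the linear interpolation, and the scoring form below are illustrative assumptions on my part, not the actual system:

```python
# Hypothetical sketch: map a single slider position to feature weights,
# then score candidate translations as a weighted sum.

def slider_weights(t):
    """t = 0 -> preserve meaning; t = 1 -> simplify as much as possible."""
    return {
        "meaning_preservation": 1.0 - t,
        "vocabulary_simplicity": t,
        "sentence_brevity": t,
    }

def score(feature_values, weights):
    # A candidate translation's score is the weighted sum of its
    # feature-function values; higher scores are preferred.
    return sum(weights[name] * value for name, value in feature_values.items())
```

Under this sketch, sliding toward one end ranks faithful-but-complex candidates higher, and sliding toward the other end ranks heavily simplified candidates higher.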

I find both topic areas to be extremely interesting. But for the immediate future, I would prefer exploring ideas for feature functions. I feel like answering the first question before the second is a more natural progression of things.

In light of this, Prof. Medero has suggested to me a challenge that will push me in the direction of answering the first question. She told me about the International Workshop on Semantic Evaluation (SemEval-2016), an annual workshop that poses a series of tasks relating to semantic evaluation of text. In particular, Prof. Medero mentioned to me Task 11: Complex Word Identification. The task, in a nutshell, asks us to identify whether at least one of several annotators labeled a word in a sentence as “complex” for non-native English speakers. This is a particularly fitting challenge for me, as it gives me a concrete end goal (i.e., classify as many “complex” words correctly as possible) that I could potentially use as a feature function or for inspiration.

I decided to take on the task, and much of my work this week involved exploring its dataset. So I have roughly 2.2k examples, each of which has: the sentence the word appears in, the word itself, the position of that word in the sentence, and whether that word is labeled as complex. Approximately 30 percent of the training examples are labeled as complex. My first inclination is to use machine learning. That is where I am this week, and hopefully I will have some concrete results to deliver in the near future.