Week 1, January 19-22 – Maury

Hey again, y’all!

I am back for another semester of research with Prof. Medero. We kicked it off by continuing last December’s work on the Complex Word Identification task (Task 11 of SemEval 2016). As a reminder, the task is to predict whether an annotator labeled a word in a sentence as “complex” for non-native English speakers.

I spent this last week understanding the changes that Prof. Medero made to the codebase we had. Firstly, she proposed a bucketload of new features:

  • Number of syllables in a word, computed by getting the first pronunciation of the word and counting all of the phonemes that contain a vowel.
  • Number of different pronunciations of the word, according to nltk’s version of the CMU pronunciation dictionary.
  • Length of the stemmed word, computed using nltk’s PorterStemmer.
  • *Length of the lemmatized word, computed using nltk’s WordNetLemmatizer.
  • *Number of times that the lemmatized word appears in a version of the Word frequency data’s free database where all words are lemmatized. This feature was repeated for the words that precede and follow the target word in the sentence.
  • Number of times that the stemmed word appears in a version of the Word frequency data’s free database where all words are stemmed. This feature was repeated for the words that precede and follow the target word in the sentence.
  • Number of synonyms according to WordNet.

The starred (*) features were the ones that she ended up using.

Prof. Medero also made a big change based on one extremely important observation about the competition, one that I did not think about until now. It is this detail, from the Task website:

The training set is composed by the judgments of 20 distinct annotators over a set of 200 sentences, while the test set is composed by the judgments made over 9,000 sentences by only one annotator.

We are training a system on one distribution of labels (i.e., the binary labels of whether at least 1 in 20 annotators found a word complex), but we are testing it on another distribution of labels (i.e., the binary labels of whether the single test-set annotator found a word complex).

As a side note: we thought that this detail puts all systems in this competition at a pretty big disadvantage. Machine learning algorithms generally rely on the training and test labels being drawn from the same distribution. When that assumption breaks, as it does here, the problem becomes more difficult…

Anyway, this detail is notable because up until this point, the metrics that we were getting from cross-validation were not indicative of our system’s true performance: they were computed with the label distribution of the training set, not that of the testing set. This prompted Prof. Medero to suggest one important change to our model development. She suggested that we train on the labels that correspond to whether at least one of 19 annotators found a word complex, but perform cross-validation on the labels that correspond to whether the 20th annotator found the word complex. That way, the cross-validation metrics we obtain are more representative of how our system will actually perform on the single-annotator test set.
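A sketch of that split, assuming the per-word judgments are stored as a 0/1 matrix with one column per annotator (the variable names here are hypothetical, not from our codebase):

```python
import numpy as np

def split_labels(judgments, holdout_col):
    """judgments: (n_words, n_annotators) array of 0/1 complexity judgments.

    Training labels: did at least one of the *other* annotators mark the
    word complex? Validation labels: the held-out annotator alone, which
    mimics the single-annotator test set.
    """
    others = np.delete(judgments, holdout_col, axis=1)
    y_train = (others.sum(axis=1) >= 1).astype(int)
    y_val = judgments[:, holdout_col]
    return y_train, y_val
```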

Prof. Medero then suggested one last change. Instead of using binary labels during model training, she suggested using continuous labels. Now, for a given word, its label would be the percentage of all annotators who marked the word as complex. Firstly, this allows models to see more clearly the relationship between our features and word complexity. Secondly, this allows us to play with a threshold of the minimum percentage we will use to label a word as complex.
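Concretely, with continuous labels a regression model can be trained on the complex-vote fraction and thresholded only at prediction time. Here is a sketch on synthetic data (the feature matrix and label values are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # made-up feature matrix
y_frac = rng.random(200) * 0.25     # fraction of annotators voting "complex"

# Train on the continuous vote fractions rather than binary labels.
reg = DecisionTreeRegressor(max_depth=4).fit(X, y_frac)

# Sweep the minimum fraction required to call a word "complex".
for threshold in (0.02, 0.05, 0.10):
    y_pred = (reg.predict(X) >= threshold).astype(int)
```

The threshold becomes a tunable knob, separate from the model itself.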

This table shows the cross-validation metrics before Prof. Medero’s suggested changes. It is worth noting that these metrics were computed using the label-distribution change that I mentioned above (this is why they’re different from the ones here).

Classifier                      Accuracy   Precision   Recall
Majority Class Baseline         0.9522     N/A         0.0000
Decision Tree (depth 5)         0.7917     0.1590      0.7974
Perceptron                      0.9133     0.2202      0.2930
Support Vector Machine (C=1)    0.4140     0.0648      0.5584

Prof. Medero’s changes yielded these results. Note that all classifiers are scikit-learn implementations and that, during training, each training example is assigned a weight inversely proportional to how frequently its label appears in the training set (so the rare “complex” class is not drowned out).

Classifier                                       Accuracy   Precision   Recall
Regression Tree (max depth 4, threshold 0.05)    0.8194     0.1773      0.7736
Decision Tree (max depth 4, threshold 0.10)      0.7868     0.1589      0.7960
Perceptron (threshold 0.05)                      0.5081     0.1117      0.5723
Support Vector Machine (threshold 0.02, C=1)     0.5967     0.1040      0.5826

As you can see, the Perceptron’s precision drops compared with before the changes, though recall improves for both it and the Support Vector Machine. Either way, both classifiers are still outperformed by the Regression Tree and Decision Tree, so for the competition submission we will likely test with those two models.
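As an aside, the per-example weighting I mentioned above can be sketched with scikit-learn’s “balanced” option, which (under my reading of our setup) weights each example inversely to its label’s frequency:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced toy labels: only 1 of 20 examples is "complex" (label 1),
# roughly mirroring the rarity of complex words in our data.
y = np.array([0] * 19 + [1])
X = np.arange(20, dtype=float).reshape(-1, 1)  # made-up single feature

# "balanced" assigns each example n_samples / (n_classes * count(label)),
# so the rare "complex" class counts for much more during training.
weights = compute_sample_weight("balanced", y)

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y, sample_weight=weights)
```

Here the lone positive example gets weight 20 / (2 × 1) = 10, versus about 0.53 for each negative example.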

Next week I will work on adding some new features and submitting our entry to the SemEval Task creators.