If you recall from last week’s post, I am now taking a break from Moses and instead concentrating my efforts on creating a machine learning model that classifies words as complex (Task 11 of SemEval 2016). After doing some preliminary research, I threw together a machine learning pipeline, from dataset preprocessing to metrics calculation, just to get something working.
First, a quick recap of the dataset from last week’s post.
I have roughly 2.2k examples, each of which has: the sentence the word appears in, the word itself, the word’s position in that sentence, and whether the word is labeled as complex.
I used the following as features for each training example:
- Word length — I included this on the guess that longer words tend to be labeled as complex more frequently.
- Word part-of-speech — I included this hoping to capture a possible hidden relationship between a word’s part of speech (possibly in combination with other features) and its classification as “complex.”
- Unigram word count — This is how many times a word appears in a corpus. The idea here is that words labeled as complex appear less frequently in the corpus. I used the counts provided by the Word frequency data’s free database of the “top 5,000 words/lemmas from the 450 million word Corpus of Contemporary American English,” which is “the only large and balanced corpus of American English.” I expect this to be the most indicative feature.
- Unigram word count of the preceding and following word — Again, perhaps a hidden relationship here?
- Bigram word count, one where the word in question is the first word in the bigram, and one where the word in question is the second word in the bigram — Again, perhaps a hidden relationship.
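As a rough sketch, the features above could be assembled into one vector per example like this. Here `counts`, `bigram_counts`, and `pos_of` are hypothetical lookup tables standing in for the COCA frequency data and a POS tagger’s output; they are not part of the original pipeline.

```python
def extract_features(sentence, index, counts, bigram_counts, pos_of):
    """Build one feature vector: word length, POS (as an integer code),
    the word's unigram count, the neighbours' unigram counts, and the
    two bigram counts (word-first and word-second)."""
    words = sentence.lower().split()
    word = words[index]
    prev_w = words[index - 1] if index > 0 else None
    next_w = words[index + 1] if index < len(words) - 1 else None
    return [
        len(word),                             # word length
        pos_of.get(word, 0),                   # part-of-speech code
        counts.get(word, 0),                   # unigram count
        counts.get(prev_w, 0),                 # preceding word's count
        counts.get(next_w, 0),                 # following word's count
        bigram_counts.get((prev_w, word), 0),  # bigram: word is second
        bigram_counts.get((word, next_w), 0),  # bigram: word is first
    ]

counts = {"the": 1000, "cat": 50, "perambulated": 1}
bigram_counts = {("cat", "perambulated"): 1}
fv = extract_features("the cat perambulated", 2, counts, bigram_counts, {})
print(fv)  # [12, 0, 1, 50, 0, 1, 0]
```

Unknown words fall back to a count of zero, which conveniently aligns with the intuition that rare (out-of-list) words are more likely to be complex.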
And I used the following machine learning models:
- Majority class classifier, where all training examples are predicted to be labeled as “not complex.”
- Decision tree, tuned for the best depth using K-Folds cross-validation (with 10 folds). I used scikit-learn’s implementation here.
- Perceptron, with no special parameters. I used scikit-learn’s implementation here.
- Support Vector Machine (SVM) with C = 5, using scikit-learn’s implementation.
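The four models above could be set up with scikit-learn roughly as follows. This is only a sketch: the data here is random filler standing in for the real feature vectors, and the depth grid (1 through 10) is an illustrative choice, not necessarily the range I actually searched.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

# Random stand-in data; the real pipeline would use the ~2.2k feature vectors.
rng = np.random.RandomState(0)
X = rng.rand(200, 7)
y = rng.randint(0, 2, 200)

models = {
    "majority": DummyClassifier(strategy="most_frequent"),
    # Tree depth tuned with 10-fold cross-validation, as described above.
    "tree": GridSearchCV(DecisionTreeClassifier(random_state=0),
                         {"max_depth": list(range(1, 11))}, cv=10),
    "perceptron": Perceptron(),
    "svm": SVC(C=5),
}

# Mean 10-fold cross-validated accuracy for each model.
mean_acc = {name: cross_val_score(m, X, y, cv=10).mean()
            for name, m in models.items()}
print(mean_acc)
```

Wrapping the decision tree in `GridSearchCV` keeps the depth tuning inside each outer fold, so the reported score never sees the data used to pick the depth.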
The results with these features/models are not great. Here is a summary of their performance metrics, computed and averaged using K-Folds cross-validation with k = 10.
| Model | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Majority Class Baseline | 0.6844 | N/A | 0 |
| Decision Tree (depth 5) | 0.7349 | 0.6283 | 0.4092 |
| Support Vector Machine | 0.5168 | 0.4019 | 0.5630 |
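The per-fold averaging behind a table like this might look like the following sketch, again on random stand-in data, using a fixed depth-5 tree as in the table above.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Random stand-in features/labels; the real data has ~2.2k examples.
rng = np.random.RandomState(0)
X = rng.rand(200, 7)
y = rng.randint(0, 2, 200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
accs, precs, recs = [], [], []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    # zero_division=0 avoids errors on folds with no positive predictions,
    # mirroring the "N/A" precision of the majority baseline.
    precs.append(precision_score(y[test_idx], pred, zero_division=0))
    recs.append(recall_score(y[test_idx], pred, zero_division=0))

print(np.mean(accs), np.mean(precs), np.mean(recs))
```

Averaging the metric over folds (rather than pooling all predictions first) is one common convention; either way, each example is scored exactly once as held-out data.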
The major takeaway from this table is that none of the models performs significantly better than the majority class baseline, at least in terms of accuracy. This is not good, and it is likely due to the features selected above: by themselves, they are simply not informative enough. If I want the non-trivial classifiers to perform better, the key is to devise more informative features until I reach the performance I want. In the end, I would like at least 80 percent accuracy.
Unfortunately, my work for the near future stops here. It has been a fantastic learning experience for me, and I am excited to return to blogging about my progress on both this and my automatic text simplification work when my next (and final!) semester kicks off (January 2016).