This week was spent building off of Prof. Medero’s changes to the systems we are submitting for the Complex Word Identification task (Task 11 of SemEval 2016). Most of my time this week was spent on engineering the following features for the task:
- Unigram count of how often the word appears in a certain corpus, according to a 100,000 word list of the most frequent words in the English language. This is the paid version of the Word frequency database that we used previously.
- Lemma count according to the same 100,000 word list.
- A weighted combination of how likely we were to see that word according to its number of characters and its unigram count. This feature was created by first fitting two normal distributions: one over all character counts that we see in the training set, and one over all unigram counts that we see. We then use these distributions to compute how likely we are to see this word with its particular number of characters, along with its unigram count. I added this feature in the hope of striking a fair balance between word length and unigram count. When I was looking at the examples we were getting wrong during cross-validation, I saw that many of the words were short words that were not well represented in the 100,000 word list that I was using. Since short words were less likely to be marked as complex anyway, I wanted to “give these words a chance” to be marked as simple even though their unigram counts might be low. So I combined the two probabilities.
- Average age-of-acquisition. Prof. Medero and I discussed this idea briefly. We postulated that there might be a positive correlation between a word’s perceived difficulty and its average age-of-acquisition, so I added this feature. I ended up using this list of over 30,000 English words, with their predicted age-of-acquisition, from this site.
- Average “concreteness,” on a scale of 1-5. This feature was really based on just a spark of insight I had. I was thumbing through the training examples that we got wrong, and I noticed that many of the words carried abstract meanings in addition to concrete ones. I hypothesized that words are more likely to be marked as complex if they are rated as more concrete. So I used a compiled list of ~40,000 English words, with their predicted concreteness ratings, from this website.
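To make the length-and-count feature above a little more concrete, here is a minimal sketch of one way it could be computed. It is not the code we actually ran: the function name `length_count_feature`, the `weight` parameter, the weighted-sum form of the combination, and the toy word counts are all assumptions for illustration.

```python
from statistics import NormalDist

# Toy stand-ins for the training words and the frequency list (hypothetical data).
training_words = ["the", "cat", "house", "ubiquitous", "ephemeral"]
unigram_counts = {"the": 5000, "cat": 300, "house": 800,
                  "ubiquitous": 5, "ephemeral": 2}

# Fit one normal distribution over character counts and one over unigram counts.
length_dist = NormalDist.from_samples([len(w) for w in training_words])
count_dist = NormalDist.from_samples([unigram_counts[w] for w in training_words])


def length_count_feature(word, counts, length_dist, count_dist, weight=0.5):
    """Weighted sum of two normal densities: one for the word's length,
    one for its unigram count. Words missing from the list get a count of 0.
    The weighted-sum form and the default weight are assumptions."""
    p_length = length_dist.pdf(len(word))
    p_count = count_dist.pdf(counts.get(word, 0))
    return weight * p_length + (1 - weight) * p_count


# A short word that is missing from the frequency list still gets some
# credit from the length term.
feature_short = length_count_feature("ox", unigram_counts, length_dist, count_dist)
```

One thing to watch with a sketch like this: on raw counts the two densities can sit on very different scales, so the choice of `weight` (or some rescaling) decides how much each term actually contributes.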
Here are some of the metrics that I computed using cross-validation. I used all of the features above, along with word lemma length and the lemma count of how often that lemma appears in Word frequency data’s free database.
|Decision Tree (max depth 3, threshold 0.25)||0.8462||0.1999||0.7518|
|Decision Tree (max depth 4, threshold 0.25)||0.8614||0.2109||0.6523|
|Decision Tree (max depth 5, threshold 0.25)||0.8632||0.2158||0.7067|
|Regression Tree (max depth 3, threshold 0.05)||0.8122||0.1739||0.7633|
|Regression Tree (max depth 4, threshold 0.05)||0.8007||0.1728||0.7680|
|Regression Tree (max depth 3, threshold 0.05)||0.8351||0.1949||0.6975|
Compared to last week’s results, we bought ourselves a bit more accuracy and precision. Awesome! The two bolded classifiers were the ones that we submitted to the Task. So, we’ll see how they perform next week!
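For anyone wondering how a threshold like the ones in the table interacts with the metrics, here is a small sketch. It assumes the threshold is a cut-off on the tree’s predicted score for the “complex” class, and that a word is labeled complex when its score is at or above that cut-off. The helper names and the toy labels and scores are made up for illustration and are not our real data.

```python
def apply_threshold(scores, threshold):
    """Label a word complex (1) when its score is at or above the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def evaluate(gold, predicted):
    """Compute accuracy, precision, and recall for binary labels (1 = complex)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical gold labels and predicted scores for eight words.
gold = [1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.60, 0.30, 0.10, 0.20, 0.05, 0.40, 0.00, 0.90]

metrics_low = evaluate(gold, apply_threshold(scores, 0.25))
metrics_high = evaluate(gold, apply_threshold(scores, 0.50))
```

On this toy data, the lower cut-off labels more words as complex, which costs precision, while the higher cut-off is stricter and keeps precision up. That is the trade-off we are tuning when we pick a threshold for each classifier.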