This week continues my work on tuning the baseline feature functions for Moses.
My first action item was finishing the creation of the baseline feature functions. In a previous meeting, Prof. Medero suggested creating two: one feature function that penalizes the number of characters in each word, and another that penalizes the length of the current hypothesis. I have since implemented both, in this Github commit. Depending on the weight I give these feature functions, the output changes (which is good!).
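The actual implementation lives in Moses's C++ feature-function framework, but the scoring logic itself is simple enough to sketch in Python (the function names here are illustrative, not the real Moses API):

```python
# Illustrative sketch of the two baseline penalties, applied to a
# tokenized hypothesis. Moses expects feature scores it can weight
# and sum, so both are expressed as negative values (penalties).

def word_length_penalty(hypothesis_tokens):
    """Penalize long words: one negative unit per character."""
    return -sum(len(token) for token in hypothesis_tokens)

def hypothesis_length_penalty(hypothesis_tokens):
    """Penalize long hypotheses: one negative unit per token."""
    return -len(hypothesis_tokens)

tokens = "the first and broadest sense of art".split()
print(word_length_penalty(tokens))        # -29
print(hypothesis_length_penalty(tokens))  # -7
```

Both penalties grow with output length, which is why cranking up their weights pushes the decoder toward shorter words and shorter hypotheses, as shown below.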
Recall this sample input sentence from last week:
the first and broadest sense of art is the one that has remained closest to the older latin meaning , which roughly translates to " skill " or " craft , " and also from an indo-european root meaning " arrangement " or " to arrange " .
Ideally, I want this simplified output:
the first and broadest sense of " art " means " arrangement " or " to arrange . "
And without the feature functions (i.e., with just the translation/language model), Moses gives me:
the first and broader sense of art is the one that has remained closer to the older latin meaning , which roughly translates|UNK|UNK|UNK " to skill|UNK|UNK|UNK or " craft , and also from an indo-european root meaning " arrangement " or " to arrange|UNK|UNK|UNK " . " " to roughly|UNK|UNK|UNK
where the differences from the input are in red, missing words are italicized, and extraneous words are bolded. Besides the changes to broader/closer, not much has changed: the output is still the same length, and no words have been simplified. Introducing the feature functions with the same weight as the translation/language model features changes the output. I instead get:
the first and broader sense of art is the one that has remained closer to the older latin meaning , which roughly translates to skill|UNK|UNK|UNK or craft ; and also from an indo-european root " or " to arrange|UNK|UNK|UNK " . meaning " arrangement or to arrange " " " " roughly|UNK|UNK|UNK latin meaning which translates|UNK|UNK|UNK indo-european
Clearly the feature functions have an effect here. If we place significantly more weight on them (e.g., weighting the feature functions 100 times more than the other features), we get completely nonsensical output such as this:
, , . to or an or to to the of art is the one and the has and also from root that older latin which skill|UNK|UNK|UNK craft first widest sense " " " " " " " " arrange|UNK|UNK|UNK meaning roughly|UNK|UNK|UNK meaning closest remained translates|UNK|UNK|UNK indo-european arrangement
which I did not bother differentiating from the input, because the two are simply too different. Notice the preference for short words, especially at the beginning.
Clearly, the weights will need some tweaking. This will matter even more once I have more meaningful feature functions; perhaps the weights could become a way to tell Moses how simple I want a translation to be. I definitely plan to take this on after I create those feature functions, and it is a major task of mine for the rest of the semester (which ends in mid-December).
The next task I took on this past week was expanding the Moses phrase table. I may have mentioned previously that I am using the XS (eXtra-Small) PPDB datasets: Lexical Paraphrases, Lexical Identities, Phrasal Paraphrases, and Phrasal Identities. I decided that I am now at a point where I want to expand the phrase table, so I downloaded the equivalent PPDB datasets of size L (Large) instead. Even though preprocessing and Moses phrase-table loading took much longer than with the XS size, I hope the gains in performance will justify switching datasets. In the future I expect to look at optimizations to speed up Moses phrase-table loading as much as I can.
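The preprocessing is essentially a format conversion from PPDB entries to Moses phrase-table lines. As a hedged sketch (the exact PPDB field layout varies by release, so verify against the actual files; the translation score below is a placeholder, not a real probability):

```python
# Hypothetical converter: one PPDB entry -> one Moses phrase-table line.
# Assumes the PPDB field order  [LHS] ||| source ||| target ||| features ||| ...
# and emits the minimal Moses format  source ||| target ||| score(s).

def ppdb_to_moses(line, score="1.0"):
    fields = [f.strip() for f in line.split("|||")]
    source, target = fields[1], fields[2]   # fields[0] is the syntactic label
    return f"{source} ||| {target} ||| {score}"

entry = "[NN] ||| craft ||| skill ||| p(e|f)=0.2 ||| 0-0"
print(ppdb_to_moses(entry))  # craft ||| skill ||| 1.0
```

A real conversion would map PPDB's paraphrase features into proper Moses translation probabilities rather than a constant, but the line-by-line shape of the job is the same, which is why the L dataset takes so much longer than XS.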
Lastly, I was tasked with investigating building a new language model using SRILM, but did not get to it. For that I would need a corpus to build the model from, which I do not have yet, so I will put this task on the back-burner.
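When I do get a corpus, training should come down to a single SRILM command; something along these lines (the corpus filename, n-gram order, and smoothing flags are placeholders to check against the SRILM documentation):

```shell
# Hypothetical invocation: train a trigram language model with modified
# Kneser-Ney smoothing from a tokenized, one-sentence-per-line corpus.
ngram-count -text corpus.txt \
            -order 3 \
            -kndiscount -interpolate \
            -lm simple-english.lm
```

The resulting ARPA-format model file could then be pointed to from the Moses configuration in place of the current language model.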