This week was definitely a coding and exploratory week for me. After I finished the literature review last week, I set my sights back on Moses. I first focused on exploring the parts of Moses that I would need to change in order to incorporate a measure of text difficulty. This involved several hours of digging through the codebase and following dependencies until I got where I needed to be. By then, I had gathered enough information from all that scavenging to generate the diagram below. Note that, in the diagram, the ScoreComponentCollection includes a series of classes that inherit from FeatureFunction. My objective for the next two weeks is to create a class that inherits from FeatureFunction to quantify text difficulty. Then I can add it to the ScoreComponentCollection so that Moses can consider the measure when comparing translation hypotheses.
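To make the plan concrete for myself, here is a rough Python sketch of the idea. It is not Moses code: Moses is written in C++, and the class and method names here (evaluate, ScoreCollection, the hard_words lexicon) are stand-ins I made up for illustration. It only shows the general shape, where each feature scores a hypothesis and the scores are combined as a weighted sum.

```python
from abc import ABC, abstractmethod


class FeatureFunction(ABC):
    """Toy stand-in for a feature function; NOT the real Moses C++ interface."""

    @abstractmethod
    def evaluate(self, hypothesis: str) -> float:
        """Return a score for the hypothesis (higher is better)."""


class WordPenalty(FeatureFunction):
    """Penalize each word in the hypothesis."""

    def evaluate(self, hypothesis: str) -> float:
        return -float(len(hypothesis.split()))


class TextDifficulty(FeatureFunction):
    """Hypothetical feature: penalize words found in a 'hard word' lexicon."""

    def __init__(self, hard_words):
        self.hard_words = set(hard_words)

    def evaluate(self, hypothesis: str) -> float:
        return -float(sum(w in self.hard_words for w in hypothesis.split()))


class ScoreCollection:
    """Toy collection that combines feature scores as a weighted sum."""

    def __init__(self):
        self._entries = []

    def add(self, feature: FeatureFunction, weight: float) -> None:
        self._entries.append((feature, weight))

    def total(self, hypothesis: str) -> float:
        return sum(w * f.evaluate(hypothesis) for f, w in self._entries)


# Made-up weights and lexicon, purely for illustration.
collection = ScoreCollection()
collection.add(WordPenalty(), 0.1)
collection.add(TextDifficulty({"laudable"}), 1.0)

candidates = ["this is great", "this is laudable"]
best = max(candidates, key=collection.total)
print(best)
```

With the difficulty feature in the mix, the hypothesis that keeps the simpler word wins, which is the behavior I am hoping to get out of Moses.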
After discovering this, I turned my attention back to generating the phrase table from PPDB. In my post a few weeks ago, I was using the “Small” PPDB database with “All” types of paraphrases (lexical, one-to-many, phrasal, syntactical). It turns out that this database does not have identity paraphrases of any kind. This essentially means that Moses, with the phrase table it gets, will try to change every word of a sentence, even the words that are already simple! For example, Moses might “simplify” the sentence “This is great” to “This is laudable.” We don’t want this behavior. So what I decided to do is combine the PPDB databases that do contain identity paraphrases (the databases of lexical and phrasal paraphrases). This is not ideal, because it makes sentence splits and deletions very unlikely. But I thought this would be a good stopgap to get a baseline working. Eventually, though, I will have to revisit this problem, because I want to include one-to-many and syntactical paraphrases as well! Perhaps I will have to generate probabilities of my own for the phrases that do not already have identity paraphrases in PPDB.
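As a sanity check for this kind of problem, something like the following sketch could list which source phrases in a phrase table have no identity entry. It assumes the usual Moses phrase-table layout of fields separated by "|||" (source, target, scores, ...). The function names and the sample entries, including their scores, are made up for illustration and are not taken from PPDB.

```python
def parse_phrase_table(lines):
    """Yield (source, target) pairs from 'source ||| target ||| scores ...' lines."""
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        # Skip blank or malformed lines that lack a source and a target.
        if len(fields) >= 2 and fields[0]:
            yield fields[0], fields[1]


def sources_without_identity(lines):
    """Return the sorted source phrases that never map to themselves."""
    seen = set()
    has_identity = set()
    for source, target in parse_phrase_table(lines):
        seen.add(source)
        if source == target:
            has_identity.add(source)
    return sorted(seen - has_identity)


# Made-up entries and scores, for illustration only.
sample = [
    "great ||| laudable ||| 0.2",
    "great ||| great ||| 0.7",
    "originally ||| initially ||| 0.4",
]

missing = sources_without_identity(sample)
print(missing)
```

Any phrase this reports is one that Moses would be forced to rewrite, so those are the phrases that would need an identity entry with a probability from somewhere.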
I do not yet have Moses working with the new identity paraphrases, but I will update this post with some sample translations once I do. In the meantime, next week I will dig further into the “FeatureFunction” class and see if I can get Moses to at least recognize a baseline text difficulty measure (such as a measure based on the length of a word).
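For reference, a word-length baseline could be as simple as the sketch below. This is only my assumption of what such a measure might look like (the function name average_word_length is hypothetical), written in Python rather than as an actual Moses feature: it averages the length of the tokens that contain at least one letter, so punctuation and bare numbers are ignored.

```python
def average_word_length(sentence: str) -> float:
    """Average length of whitespace-separated tokens that contain a letter.

    Tokens with no letters (punctuation such as ',' or numbers such as '735')
    are skipped. Returns 0.0 when no word tokens are found.
    """
    words = [tok for tok in sentence.split() if any(ch.isalpha() for ch in tok)]
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


simple = average_word_length("this is great")
harder = average_word_length("this is laudable")
print(simple < harder)
```

A higher average suggests a harder sentence under this crude measure, so a feature built on it would reward hypotheses with a lower value.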
[UPDATE 11/10/2015] I finally got Moses working with the new identity paraphrases! I gave it this sentence as input (which you have seen before here):
this month was originally named sextilis in latin , because it was the sixth month in the ancient roman calendar , which started in march about 735 bc under romulus .
And I got this output “simplification”:
this month was initially named , because it was the sixth ancient roman timetable , which began in march about 735 bc under romulus . in the months in latin sextilis|UNK|UNK|UNK
It looks much more similar to the input! Yay!