As with last week, this week consisted of a lot of coding to get a primitive end-to-end Moses setup working, and I have reaped the fruits of my labor! I now have a working (albeit poorly performing) version of Moses that uses the PPDB paraphrase tables. But I had to jump through a few more hoops to get there.
If you recall from last week’s post, I was working on massaging the paraphrase table into a form that Moses understands. I wrapped that task up this week and packaged my work into a nice little GitHub repository so that those interested can use it if they ever need to. The only problem is that the output paraphrase table does not seem to contain all of the paraphrases in the input file. This might be a Python memory problem (not being able to hold the whole table in memory), but I am not entirely sure. I will come back to this next week.
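If memory does turn out to be the culprit, one way around it is to stream the table line by line instead of loading it all at once. Here is a minimal sketch of that idea, not the code in my repository. It assumes each PPDB line has the form "LHS ||| phrase ||| paraphrase ||| name=value ... ||| alignment", that the chosen feature is stored as a negative log probability, and that a single score per entry is enough for Moses. The function names and the default feature are made up for illustration.

```python
import math
from typing import Iterable, Iterator, Tuple


def ppdb_to_moses(lines: Iterable[str], feature: str = "p(e|f)") -> Iterator[str]:
    """Yield one Moses-style phrase-table line per usable PPDB line.

    Assumed input layout (an assumption, check it against your PPDB file):
        LHS ||| phrase ||| paraphrase ||| name=value name=value ... ||| alignment
    The chosen feature is assumed to be a negative log probability.
    """
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 4:
            continue  # skip malformed lines rather than guessing
        _, source, target, feats = fields[:4]
        values = dict(kv.split("=", 1) for kv in feats.split() if "=" in kv)
        if feature not in values:
            continue  # no score available for this entry
        prob = math.exp(-float(values[feature]))
        yield f"{source} ||| {target} ||| {prob:.6g}"


def convert_file(in_path: str, out_path: str) -> Tuple[int, int]:
    """Convert a file one line at a time; return (lines read, lines written)."""
    read = written = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # only one line is held in memory at a time
            read += 1
            for entry in ppdb_to_moses([line]):
                dst.write(entry + "\n")
                written += 1
    return read, written
```

Comparing the two counts that convert_file returns is a quick way to see how many paraphrases get dropped during conversion, and whether the loss comes from skipped lines rather than from memory.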
After making the paraphrase table, I next had to find a language model to use with Moses. Right now, I am using the language model included in the sample-models package on Moses’s website. This model is based on an 11MB Europarl corpus, so I am not expecting good results from it. Since the goal is just to get an end-to-end solution, it will work for now.
I next had to find some input data to give Moses. My literature research led me to immediately think of Zhu et al.’s PWKP dataset of parallel complex-simple sentences. So, I downloaded the dataset, massaged it into the form that I needed (see this GitHub repository), took the first 500 parallel sentence pairs, and fed the complex sentences into Moses. The output was not pretty. After a quick glance, it seemed like all Moses did was mangle the input and introduce grammatical errors. For example, one “complex” English sentence is:
This month was originally named Sextilis in Latin, because it was the sixth month in the ancient Roman calendar, which started in March about 735 BC under Romulus.
And the corresponding output “simple” English sentence is:
This months has already been mentioned here as it was vi during the former Roman calendar, which began in March concerning 735 BC of Romulus. Latin, Sextilis
The reference “simple” English sentence is:
This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC after Romulus.
When I compared Moses’s output with the reference output for the first 500 sentences, I got an embarrassingly low BLEU score of 0.0534. This result makes sense for several reasons. First, I have not modified Moses to prefer simplified translations. Second, I do not yet understand how Moses deals with words that are not in its vocabulary (but I will soon). Third, I do not know whether the paraphrase table that Moses uses accounts for identity paraphrases, that is, entries that map a phrase to itself so that Moses can leave it unchanged. And finally, my language model is not specific to the text simplification task that I am up against. All in all, though, I am very happy with the barebones Moses configuration that I have. Now I can say that I have a baseline to compare my improvements against.
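For anyone unfamiliar with BLEU, here is a rough sketch of how a corpus-level score can be computed. This is not the scoring script behind the 0.0534 figure. It assumes whitespace tokenization, one reference per sentence, uniform weights over 1- to 4-grams, and no smoothing, so its numbers will not necessarily match those of a standard tool.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count every n-gram (as a tuple) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with a single reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

Calling corpus_bleu(outputs, references) with two equal-length lists of sentences returns a value between 0 and 1. Because the score is a geometric mean of the n-gram precisions, output that mangles word order loses most of its longer n-gram matches, which is consistent with a score as low as the one I got.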
Next week will certainly be more theory-based. Now that I have a Moses testbed, I can focus on finishing my literature review and investigating how Moses works, so that I can see which modifications might improve its performance.