Unlike previous weeks, this past week has been very coding-oriented! Instead of doing research, I worked on putting together a proof-of-concept phrase-based SMT system that performs text simplification. Unfortunately, I still did not have access to the Box instance as of 10/12, so I ended up creating a new AWS instance and compiling Moses from source. Getting Moses to compile and work with sample data was a bit of a pain: it took a long time to build, and I kept having to resolve dependencies. Because I will be making changes to Moses to accommodate some new functionality, I decided to compile and run a forked version of Moses that lives on my GitHub. But thankfully, I was able to get it built and running successfully!
After installing Moses came the task of actually supplying the data (i.e., phrase translation probabilities) that Moses needs to make text simplifications. Here's where my approach is going to be different. Usually, to generate these translation probabilities, people use a sentence-aligned simplification corpus such as Zhu et al.'s Parallel Wikipedia Simplification corpus. But the problem is that those corpora tend to be noisy. Furthermore, the level of simplification varies widely from sentence pair to sentence pair. Therefore, Prof. Julie and I propose an alternative. Instead of relying on a simplification corpus, we use an English paraphrasing database (PPDB), and, when performing the decoding step of SMT, compute how much simpler each paraphrase really is. We get two big wins this way. First, we have a much, much larger dataset and vocabulary available to us (sentence simplification corpora are comparatively small). Second, our measure of difficulty is standardized across all possible phrase translations.
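To make the "how much simpler is this paraphrase?" idea concrete, here is a toy sketch. It assumes (purely for illustration) that phrase difficulty can be approximated by average word frequency, so that rarer words score as harder; the actual difficulty measure we end up using is still an open design question, and the `FREQ` table below is made up.

```python
# Toy sketch: score how much simpler a paraphrase is than its source.
# ASSUMPTION: difficulty ~ inverse of average word frequency. The
# frequencies below (per million words) are invented for this example.
FREQ = {
    "utilize": 20.0, "use": 900.0,
    "commence": 5.0, "begin": 300.0, "start": 700.0,
}

def difficulty(phrase, freq=FREQ, unseen=1.0):
    """Lower average frequency -> rarer words -> higher difficulty."""
    words = phrase.split()
    avg = sum(freq.get(w, unseen) for w in words) / len(words)
    return 1.0 / avg

def simplification_gain(source, paraphrase):
    """Positive when the paraphrase is easier than the source."""
    return difficulty(source) - difficulty(paraphrase)

print(simplification_gain("utilize", "use"))  # positive: "use" is easier
print(simplification_gain("use", "utilize"))  # negative: got harder
```

A score like this could, in principle, be folded into the decoder's scoring so that Moses prefers translations with a large simplification gain, which is one way the standardized-difficulty win could play out in practice.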
Now I have to consider how to incorporate PPDB into Moses. Because PPDB already includes English paraphrases along with their probabilities, my task is simpler: I do not have to worry about training Moses's translation model on a corpus at all! All I have to do is wrestle PPDB's data into a phrase translation table that Moses likes, and then hand it to Moses. This is the part I'm still working through: I need to do some pre-processing on PPDB's dataset, including normalizing the probabilities and interpreting all of the different features attached to each phrase translation (this resource has been really helpful). Once I have that, I can start building a baseline system to compare against. Over the following week, this will happen in parallel with more literature review and a closer read of Moses's documentation.
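As a rough sketch of the wrangling step, here is what the conversion might look like. I'm assuming a PPDB-style plain-text line of `|||`-separated fields (`LHS ||| source ||| target ||| features ||| alignment`) whose probability features are stored as negative log values, and a Moses text phrase table that wants four plain probabilities per line; the exact feature names in `SCORE_KEYS` are hypothetical and need to be checked against PPDB's actual documentation.

```python
import math

# Hypothetical feature names to pull from a PPDB entry, in the order a
# Moses phrase table expects its four translation scores. Verify these
# against the real PPDB feature list before relying on them.
SCORE_KEYS = ["p(f|e)", "Lex(f|e)", "p(e|f)", "Lex(e|f)"]

def ppdb_to_moses(line):
    """Convert one PPDB-style line to a Moses phrase-table line.

    ASSUMPTION: feature values are negative log probabilities, so we
    map them back to probabilities with exp(-value).
    """
    parts = [p.strip() for p in line.split("|||")]
    _lhs, source, target, feats = parts[0], parts[1], parts[2], parts[3]
    featmap = {}
    for feat in feats.split():
        key, _, val = feat.partition("=")
        featmap[key] = float(val)
    scores = [math.exp(-featmap.get(k, 0.0)) for k in SCORE_KEYS]
    return "{} ||| {} ||| {}".format(
        source, target, " ".join("%.6f" % s for s in scores))

sample = ("[VP] ||| commence ||| begin ||| "
          "p(f|e)=2.3 Lex(f|e)=1.9 p(e|f)=0.7 Lex(e|f)=0.5 ||| 0-0")
print(ppdb_to_moses(sample))
```

Each output line would then go into a gzipped table that `moses.ini` points at. The real pre-processing will be messier than this, since PPDB carries many more features than the four shown here and I still need to decide how to normalize them.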