Week 7, October 19-October 23 – Maury

As I mentioned in last week’s post, this week was more of a reading and writing week for me. In particular, my main task was to assemble a literature review on what’s been done in text simplification, and how I am improving on existing approaches. I mostly focused on works that achieved simplification with phrase-based machine translation techniques (including Zhu et al. (2010), Coster and Kauchak (2011), Wubben et al. (2012), and Specia (2010)). I discussed how many of these techniques rely on Simple Wikipedia as the basis for the simplification corpus, and I pointed out how existing works (like Xu et al. (2015)) critique this reliance on Wikipedia and call for using other datasets for simplification. This is where my work comes in. As I mentioned in the first post, I am going to avoid using simplification corpora for phrase-based translation. Instead, I plan to use an extensive English paraphrase database, and I will use quantitative difficulty measures to prioritize paraphrases that perform simplifications.

After finishing the review, I did a little investigating into which parts of Moses I will need to modify to add the quantitative difficulty measures. At a preliminary glance, I reckon that the majority of my changes will happen in the translation model folder. I plan to make this idea more concrete over the next week. Once I finish with that, I plan on modifying the paraphrase database that I created in my previous post to ensure that I am getting the best possible paraphrases.

Works Cited

  • Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1353–1361. Association for Computational Linguistics.
  • William Coster and David Kauchak. 2011. Learning to simplify sentences using Wikipedia. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, MTTG ’11, pages 1–9.
  • Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 1015–1024. Association for Computational Linguistics.
  • Lucia Specia. 2010. Translating from complex to simplified sentences. In Computational Processing of the Portuguese Language, pages 30–39. Springer.
  • Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.

Week 6, October 12-October 16 – Maury

Similar to last week, this week consisted of a lot of coding to get a primitive end-to-end Moses pipeline working. And I have reaped the fruits of my labor! I have a working (albeit poorly performing) version of Moses that uses the PPDB paraphrase tables. But I had to jump through a few more hoops.

If you recall from last week’s post, I talked about how I was massaging the paraphrase table into a form that Moses understands. I wrapped that task up this week, and I packaged my work into a nice little GitHub repository so that those interested can use it if they ever need to. The only problem is that the output paraphrase table does not seem to be representative of all paraphrases in the input file. This might be a Python memory problem (not being able to hold it all in memory), but I am not entirely sure. I will come back to this in the following week.
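If the memory hypothesis is right, the fix is to stream the file rather than load it all at once. Here is a minimal sketch of that approach; the “ ||| ”-separated field layout (LHS tag, source, target, features) is my assumption about the dump I am working with, not a verified spec:

```python
def ppdb_to_moses(lines):
    """Lazily translate PPDB-style lines into Moses phrase-table lines.

    Because this is a generator over any iterable of lines, a multi-GB
    file never has to sit in memory at once -- the suspected cause of
    my incomplete output table.  The ' ||| ' field layout is assumed.
    """
    for line in lines:
        fields = line.rstrip("\n").split(" ||| ")
        if len(fields) < 4:
            continue  # tolerate malformed lines instead of crashing
        _, source, target, features = fields[:4]
        yield f"{source} ||| {target} ||| {features}\n"

# Usage sketch: file objects iterate lazily, one line at a time.
# with open("ppdb.txt") as src, open("phrase-table", "w") as dst:
#     dst.writelines(ppdb_to_moses(src))
```

The generator keeps memory flat no matter how large the input file is, since each line is written out before the next is read.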

After making the paraphrase table, I next had to find a language model to use with Moses. Right now, I am using the language model included in the sample-models package on Moses’s website. This model is based on an 11MB Europarl corpus, so I am not expecting good results from it. Since the goal is to get an end-to-end solution, it will work for now.

I next had to find some input data to give Moses. My literature research immediately led me to Zhu et al.’s PWKP dataset of parallel complex-simple sentences. So I downloaded the dataset, massaged it into the form that I needed (see this GitHub repository), took the first 500 parallel sentence pairs, and fed them into Moses. The output was not pretty. At a quick glance, it seemed like all Moses did was mangle the input and introduce grammatical errors. For example, one “complex” English sentence is:

This month was originally named Sextilis in Latin, because it was the sixth month in the ancient Roman calendar, which started in March about 735 BC under Romulus.

And the corresponding output “simple” English sentence is:

This months has already been mentioned here as it was vi during the former Roman calendar, which began in March concerning 735 BC of Romulus. Latin, Sextilis

The reference “simple” English sentence is:

This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC after Romulus.

When I compared Moses’s output with the reference output for the first 500 sentences, I got an embarrassingly low BLEU score of 0.0534. This result makes sense for several reasons. First, I have not yet modified Moses to prefer simplified translations. Second, I do not yet know how Moses deals with out-of-vocabulary words (but I will soon). Third, I do not know if the paraphrase table that Moses uses accounts for identity paraphrases. And finally, my language model is not specific to the text simplification task at hand. All in all, though, I am very happy with the barebones Moses configuration that I have. Now I can say that I have a baseline to compare my improvements against.
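For readers unfamiliar with the metric, BLEU is simple enough to sketch from scratch. This Python (purely illustrative, not the scorer I actually ran) computes corpus-level BLEU with one reference per hypothesis: clipped n-gram precision up to 4-grams, combined geometrically and scaled by a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis."""
    clipped = [0] * max_n   # matched n-grams, clipped by reference counts
    totals = [0] * max_n    # total n-grams in the hypotheses
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in clipped:
        return 0.0  # some n-gram order never matched at all
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))  # penalize short output
    return brevity * math.exp(log_precision)
```

A score of 1.0 means the output matches the references exactly, so 0.0534 really is as bad as it sounds.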

Next week will certainly be more theory based. Now that I have a Moses testbed, I can focus on finishing my literature review and investigating how Moses works so that I can see what modifications I can make to improve the performance.

Week 5, October 5-October 9 – Maury

Unlike previous weeks, this past week has been very coding-oriented! Instead of doing research, I worked on putting together a proof-of-concept phrase-based SMT system that performs text simplification. Unfortunately, I still did not have access to the Box instance as of 10/12, so I ended up having to create a new AWS instance and compile Moses from source. Getting Moses to compile and work with sample data was a bit of a pain: it took a long time to compile, and I kept having to resolve dependencies. Because I will be making changes to Moses to accommodate some new functionality, I decided to compile and run a forked version of Moses that lives on my GitHub. But thankfully, I got it done, and I was able to run it successfully!

After installing Moses came the task of actually supplying the data (i.e., phrase translation probabilities) that Moses needs to make text simplifications. Now, here’s where my approach differs. Usually, to generate these translation probabilities, people use a sentence-aligned simplification corpus such as Zhu et al.’s Parallel Wikipedia Simplification corpus. But the problem is that those corpora tend to be noisy. Furthermore, the level of simplification varies widely from sentence pair to sentence pair. Therefore, Prof. Julie and I propose an alternative. Instead of relying on a simplification corpus, we use an English paraphrasing database (PPDB) and, when performing the decoding step of SMT, compute how much less difficult each paraphrase really is. We get two big wins this way. First, we have a much, much larger dataset and vocabulary available to us (sentence simplification corpora are relatively small). Second, our measure of difficulty is standardized across all possible phrase translations.
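To make “how much less difficult” concrete, here is one toy proxy I could start from: average word length plus a word-frequency-rank penalty. The actual measure we wire into the decoder is still an open question, and `freq_rank` is a hypothetical lookup table, so treat this strictly as an illustration of a standardized, corpus-free difficulty score:

```python
def difficulty(phrase, freq_rank=None):
    """Crude difficulty proxy: longer, rarer words score higher.

    freq_rank maps a word to its frequency rank (1 = most common);
    unknown words get a high default rank.  Both the weighting and
    the default are placeholder choices, not tuned values.
    """
    freq_rank = freq_rank or {}
    words = phrase.lower().split()
    if not words:
        return 0.0
    length_term = sum(len(w) for w in words) / len(words)
    rarity_term = sum(freq_rank.get(w, 10000) for w in words) / len(words)
    return length_term + rarity_term / 1000.0

def simplification_gain(source, paraphrase, freq_rank=None):
    """Positive when the paraphrase is easier than the source phrase."""
    return difficulty(source, freq_rank) - difficulty(paraphrase, freq_rank)
```

The point is that a score like this applies uniformly to every candidate in the paraphrase table, which is exactly what a noisy sentence-aligned corpus cannot give us.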

Now I have to consider how to incorporate the PPDB into Moses. Because PPDB includes English paraphrases along with their probabilities, my task is simplified; I do not have to worry about training Moses’s translation model on a corpus! All I have to do is wrestle PPDB’s data into a phrase translation table that Moses likes, and then hand it to Moses. This is the part I am currently working through and still having trouble with: I need to perform some pre-processing on PPDB’s dataset, including normalizing the probabilities and interpreting all of the different features that a phrase translation has (this resource has been really helpful). But once I have that, I can start creating a baseline system to compare against. Over the following week, this will happen in parallel with more literature review and perusal of Moses’s documentation.
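As a concrete illustration of the normalization step, here is a sketch based on my current reading (which I still need to verify against the PPDB documentation) that feature fields look like `p(e|f)=2.54` and that the values are negative log-10 probabilities, while Moses wants plain probabilities. The feature names in `keys` are assumptions, not confirmed PPDB field names:

```python
def parse_features(feature_str):
    """Parse a PPDB-style feature field like 'p(e|f)=2.54 p(f|e)=1.71'
    into a name -> value dict.  The NAME=VALUE layout is assumed."""
    feats = {}
    for item in feature_str.split():
        name, _, value = item.partition("=")
        feats[name] = float(value)
    return feats

def to_moses_probs(feats, keys=("p(f|e)", "p(e|f)")):
    """Convert assumed negative log-10 scores into the plain
    probabilities a Moses phrase table expects, in a fixed order."""
    return [10 ** -feats[k] for k in keys if k in feats]
```

If the scores turn out to be natural-log or already-normalized, only the `10 ** -x` line would need to change, so this structure should survive whatever the docs say.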

Week 4, September 28-October 2 – Maury

This past week has been spent delving into research related to the automatic text simplification task. I took Prof. Julie’s advice and started by looking at papers accepted to the Association for Computational Linguistics conferences in 2013, 2014, and 2015. Across those three years, there were unfortunately not many papers on the simplification task, but they did give me some good background on the field. Here are some papers that I found as a result of that research:

  • A Deeper Exploration of the Standard PB-SMT Approach to Text Simplification and Its Evaluation by Štajner et al.: This article emphasizes the importance of the quality of a corpus, as opposed to the number of sentences a corpus has, when trying to improve Statistical Machine Translation (SMT) system performance. By a corpus, I mean a collection of sentence pairs, where each pair consists of a “normal English” sentence and a “simplified English” equivalent.
  • Simplifying Lexical Simplification: Do We Need Simplified Corpora? by Glavaš et al.: This article challenges the notion that we even need a corpus to perform simplification. The authors use an unsupervised learning approach and present results indicating that their systems are just as effective as systems that use corpora.
  • Hybrid Simplification Using Deep Semantics and Machine Translation by Narayan et al.: This article presents a hybrid approach to automatic text simplification. The authors combine deep semantics with SMT: deep semantics are used as heuristics to drop phrases and split complex sentences into several simpler ones, while SMT handles word reordering and word substitution.
  • Improving Text Simplification Language Modeling Using Unsimplified Text Data by David Kauchak: This article questions the common assumption that SMT systems used for the simplification task must train their language models only on simplified English. Kauchak argues for the effectiveness of a “diverse” language model trained on both “simplified English” and “normal English.”

All of this was great insight into the current work on text simplification. But I realized that I had not really gotten to research what I wanted to contribute: whether or how we could incorporate measures of text difficulty into SMT, rather than simply being told what normal/simplified English is through a corpus. After realizing this, I shifted my focus to finding measures of text difficulty, and I unearthed several papers that mentioned or proposed ways of quantifying it.

This literature research has prompted some important questions about my research. Again, if you remember from my first blog post, I want to build a phrase-based SMT system in which “normal English” is the foreign language and “simplified English” is the target language. But reading more has revealed to me that I have a lot of thinking to do:

  • What is my language model of simplified English going to be based on? Will I use the oft-used Simple English Wikipedia corpus introduced by Zhu et al. (2010)? Could I incorporate David Kauchak’s findings and create a corpus of simplified English with interspersed normal English?
  • What success metrics will I use to evaluate my SMT system? What previous work, if any, should I compare my system to? And will I use metrics that incorporate a human element?

Next week, I’m going to work with Prof. Julie to answer these questions. I am also going to continue my research and look at other ways to quantify text difficulty. And finally, I am going to keep trying to initialize Box, a machine translation research platform available on AWS. I got a message from Box’s creator, who told me it was something that Amazon needed to fix.