This past week has been spent delving into research related to the automatic text simplification task. I took Prof. Julie’s advice and started by looking at papers accepted to the Association for Computational Linguistics (ACL) conferences in 2013, 2014, and 2015. Across those three years there were unfortunately not many papers relating to the simplification task, but the search did give me some good background in the field. Here are some of the papers I found as a result of that research:
- A Deeper Exploration of the Standard PB-SMT Approach to Text Simplification and Its Evaluation by Štajner et al.: This article emphasizes the importance of the quality of a corpus, as opposed to the number of sentences a corpus has, when trying to improve the performance of a Statistical Machine Translation (SMT) system. By a corpus, I mean a collection of sentence pairs, where each pair consists of a “normal English” sentence and a “simplified English” equivalent.
- Simplifying Lexical Simplification: Do We Need Simplified Corpora? by Glavaš et al.: This article challenges the notion that we even need a corpus to perform simplification. The authors use an unsupervised learning approach and present results indicating that their system is just as effective as systems that rely on corpora.
- Hybrid Simplification Using Deep Semantics and Machine Translation by Narayan et al: This article presents a hybrid approach to automatic text simplification. The authors combine deep semantics and SMT. Deep semantics are used as heuristics to drop phrases and split complex sentences into several simpler ones. SMT is used to handle word reordering and word substitution.
- Improving Text Simplification Language Modeling Using Unsimplified Text Data by Kauchak et al.: This article questions the common assumption that SMT systems used for the simplification task must train their language models only on simplified English. Kauchak argues for the effectiveness of a “diverse” language model trained on both “simplified English” and “normal English.”
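To make the “diverse” language model idea concrete: Kauchak’s approach combines simplified and unsimplified training data, but a rough sketch of the same intuition is to linearly interpolate two unigram language models. Everything below (the toy corpora, the `lam` weight, the function names) is my own hypothetical illustration, not code from the paper:

```python
from collections import Counter

def unigram_probs(sentences):
    """Maximum-likelihood unigram probabilities from tokenized sentences."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(word, p_simple, p_normal, lam=0.7):
    """Linear interpolation of a simplified-English model with a
    normal-English model; lam weights the simplified model."""
    return lam * p_simple.get(word, 0.0) + (1 - lam) * p_normal.get(word, 0.0)

# Toy corpora (hypothetical):
simple = [["the", "cat", "sat"], ["the", "dog", "ran"]]
normal = [["the", "feline", "reclined"], ["the", "canine", "sprinted"]]

ps = unigram_probs(simple)
pn = unigram_probs(normal)

# "feline" never appears in the simplified corpus, but the normal-English
# model still gives it some probability mass:
print(interpolated_prob("feline", ps, pn))
```

The payoff is exactly the one Kauchak points at: words absent from the (small) simplified corpus still get nonzero probability from the larger normal-English side.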
All of this was great insight into the current work on text simplification. But I realized that I had not really gotten to research what I wanted to contribute: whether, and how, we could incorporate measures of text difficulty into SMT, rather than the system simply being told what normal/simplified English is through a corpus. After realizing this, I shifted my focus to finding measures of text difficulty. I unearthed several papers that proposed or discussed ways of quantifying it:
- What Can Readability Measures Really Tell Us About Text Difficulty? by Štajner et al.: This article explores the viability of the Flesch Reading Ease metric, the Flesch-Kincaid readability formula, the Fog Index, and the SMOG formula in predicting the complexity of a text.
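As a feel for what these metrics actually compute, here is a minimal sketch of Flesch Reading Ease (206.835 − 1.015 × words/sentence − 84.6 × syllables/word). The syllable counter is a crude vowel-group approximation of my own; serious implementations use a pronunciation dictionary:

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count vowel groups, with a small
    correction for a trailing silent 'e'. An approximation only."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    """Higher scores mean easier text; ~90+ is very easy, ~30 is difficult."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) \
                   - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat."))
print(flesch_reading_ease(
    "Extraordinary circumstances necessitated immediate evacuation."))
```

Short sentences of one-syllable words score very high, while long polysyllabic sentences score low (even negative), which is what makes formulas like this tempting as a difficulty signal inside an SMT pipeline.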
This literature research has prompted some important questions about my own work. If you remember from my first blog post, I want to build a version of a phrase-based SMT system in which “normal English” is the foreign language and “simplified English” is the target language. But reading more has revealed that I have a lot of thinking to do:
- What is my language model of simplified English going to be based on? Will I use the oft-used corpus of Simple English Wikipedia introduced by Zhu et al. (2010)? Could I incorporate David Kauchak’s findings and create a corpus of simplified English interspersed with normal English?
- What success metrics will I use to evaluate my SMT system? What previous work, if any, should I compare my system to? And will I use metrics that incorporate a human element?
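The phrase-based setup I described above follows the standard noisy-channel formulation: pick the simplification s that maximizes log P(s) + log P(n | s), where P(s) is the simplified-English language model and P(n | s) is the translation model. A toy sketch (all probability tables and function names here are hypothetical, hand-picked numbers, not a real decoder):

```python
import math

def lm_prob(simplified):
    """P(s): how fluent/likely the candidate is as simplified English."""
    table = {"the dog ran fast": 0.02, "the canine sprinted": 0.001}
    return table.get(simplified, 1e-9)

def tm_prob(normal, simplified):
    """P(n | s): how well the candidate explains the original sentence."""
    table = {("the canine sprinted", "the dog ran fast"): 0.4,
             ("the canine sprinted", "the canine sprinted"): 0.9}
    return table.get((normal, simplified), 1e-9)

def best_simplification(normal, candidates):
    """argmax over candidates of log P(s) + log P(n | s)."""
    return max(candidates,
               key=lambda s: math.log(lm_prob(s))
                           + math.log(tm_prob(normal, s)))

print(best_simplification("the canine sprinted",
                          ["the dog ran fast", "the canine sprinted"]))
# prints "the dog ran fast"
```

Notice that leaving the sentence unchanged has the higher translation probability, but the language model’s preference for simpler wording wins out; that tension between the two models is exactly where a text-difficulty measure might plug in.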
Next week, I’m going to work with Prof. Julie to answer these questions. I am also going to continue my research and look at other ways to quantify text difficulty. And finally, I am going to keep trying to initialize Box, a machine translation research platform available on AWS. I got a message from Box’s creator, who told me the issue was something Amazon needed to fix.