So, from yesterday’s post, it’s clear that I want to create a text simplification framework that is tailored to specific reading audiences. But yesterday evening, I got to thinking: what reading audiences do I want to consider, and how do I quantify them? And once I have that, does it make sense to perform only lexical simplifications for each reading audience?
Today, I take a stab at answering these two questions.
As for the first question, I could take a variety of approaches. For one, I could do in-depth research into several reading audiences (e.g., learners of English as a second language, people with intellectual disabilities, or people with cognitive impairment caused by Alzheimer’s disease), and then create and tune a simplification system for each audience. I could have rough “tiers” of reading levels (e.g., beginner, intermediate, and advanced, similar to Specia et al.’s article on “Readability Assessment for Text Simplification”). I could also do it by grade level (similar to Petersen and Ostendorf’s paper “A machine learning approach to reading level assessment”).
I have not had much time to think about the second question. But for every reading audience that I mentioned previously, doesn’t any significant text simplification rely on more than just lexical simplification? For example, for people who read at a 2nd grade level, it may make more sense to simplify the sentence “[S] [V], and [S] [V]” by performing a syntactic simplification and transforming it into “[S] [V]. [S] [V].” And even if there were cases where lexical simplification would improve text readability, are the simplification corpora that we have distinct enough for each audience?
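To make the “[S] [V], and [S] [V]” example concrete, here is a minimal sketch of that one syntactic rewrite. The function name `split_coordinated_clauses` is hypothetical, and the rule is deliberately naive: it assumes that “, and” always joins two full clauses, which a real system would need a parser to confirm.

```python
import re
from typing import List


def split_coordinated_clauses(sentence: str) -> List[str]:
    """Naively split "[S] [V], and [S] [V]." into two sentences.

    Assumption: the text on both sides of ", and" is a full clause.
    Nothing here checks that, so lists like "apples, and pears" would be
    split incorrectly.
    """
    # Drop surrounding whitespace and the final period before splitting.
    body = sentence.strip().rstrip(".")
    # Split at the first ", and" only.
    parts = re.split(r",\s+and\s+", body, maxsplit=1)
    # Capitalize each piece and give it its own period.
    return [p[0].upper() + p[1:] + "." for p in parts if p]


print(split_coordinated_clauses("The dog barked, and the cat ran."))
```

A sentence without “, and” comes back unchanged as a single-item list, so the rule can be applied to every sentence in a text.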
For example, in PPDB (the paraphrasing database I’ve been looking at over the past year), there are several rewordings for the phrase “incorporate in.” They are:
- include them in
- be present in
- introduce in
- be part of
- integrate into
- be contained in
- be integrated with
Is there a significant enough correlation between the rewordings and a specific reading audience? Could, for example, “be part of” be a better rewording for a reader at a 5th grade level than “introduce in” because it’s shorter? And if so, does the better rewording significantly improve the understanding of the text? Right now, my theory is that lexical simplifications alone do not allow for the text to vary significantly from audience to audience. Maybe a combination of lexical and syntactic simplifications is the way to go? I hope to have a pow-wow with Professor Medero tomorrow, to see what she thinks about these questions.
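As a quick way to poke at the “shorter is better” idea, here is a rough sketch that ranks the seven rewordings above by length. The scoring rule (longest word first, then total characters) is purely my assumption of what “shorter” might mean; it is not a validated readability measure, and nothing in PPDB ties these scores to a reading audience.

```python
from typing import List, Tuple

# The rewordings of "incorporate in" listed above.
CANDIDATES = [
    "include them in",
    "be present in",
    "introduce in",
    "be part of",
    "integrate into",
    "be contained in",
    "be integrated with",
]


def length_score(phrase: str) -> Tuple[int, int]:
    """Hypothetical difficulty proxy: (longest word length, total characters)."""
    words = phrase.split()
    return (max(len(w) for w in words), len(phrase))


def rank_by_length(phrases: List[str]) -> List[str]:
    """Order phrases from 'shortest' to 'longest' under length_score."""
    return sorted(phrases, key=length_score)


for phrase in rank_by_length(CANDIDATES):
    print(length_score(phrase), phrase)
```

Under this proxy “be part of” does come out first and “introduce in” lands in the middle, but that only says the phrases differ in length, not that the difference matters to a 5th grade reader.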