TextScroll: Week 8, July 28-29 – Michelle

On Thursday I looked at last year’s papers/data analysis programs to get ideas and see if I could transfer anything over. I outlined some ideas for functions that could display some meaningful stats, then headed over to Prof. Medero to run my ideas by her and ask what she thought would be the best way to implement them. I asked about what combination of stats to generate and display, how to present these stats, etc., and she said that I should create a wide variety of stat generators/displayers that a user can mix and match to find patterns in the data. For implementation, she recommended matplotlib for generating graphs in Python and LaTeX for generating tables, so next steps are to learn those. The rest of the day was spent learning how to import .json files into dictionaries and trying to install matplotlib on both my Windows laptop and my Mac work computer. The Windows installation involved installing a long list of other libraries first, while “pip install” on the Mac kept hitting a permission error even though I was running it with “sudo”. In the end, I installed Enthought’s Canopy on both machines, which comes with matplotlib, numpy, and everything else I’ll need already installed.

Today involved brainstorming, learning matplotlib, and writing a handful of functions. I thought a lot about how to develop incrementally so that my functions could be readily mixed and matched by a future researcher. It was a bit tricky to extract the specific values I needed, since they were often embedded within lists of dictionaries, dictionaries with lists, a y value in a tuple in a list in a dictionary in a list, etc., but hopefully this will be a lot simpler to work with than last year’s suple dictionary. The functions I have so far can take a bunch of plots and graph them either on the same or different graphs, letting you customize color, x increment, and graph size through optional arguments. I also wrote a simple .txt converter whose output can be brought into LaTeX/Excel easily. Next week I’ll need to figure out how to automatically fetch .json files from Firebase and fine-tune TextScroll’s raw data collection so it works better with the graphing functions I’m planning to make.
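As a rough sketch of the kind of nested extraction and .txt conversion described above: the field names `user`, `trials`, and `points`, and both helper functions, are made up for illustration and are not TextScroll's actual schema or code. The data is inlined as a JSON string here; reading from a file would use `json.load` instead.

```python
import json

# Hypothetical raw data shaped like the nesting described above:
# a list of sessions, each a dictionary holding a list of trials,
# each trial holding a list of (x, y) points.
raw = '''
[
  {"user": "a", "trials": [{"speed": 1, "points": [[0, 2.5], [1, 3.0]]}]},
  {"user": "b", "trials": [{"speed": 2, "points": [[0, 1.0], [1, 4.5]]}]}
]
'''

def extract_y_values(sessions):
    """Collect every y value per user from the nested structure."""
    result = {}
    for session in sessions:
        ys = []
        for trial in session["trials"]:
            for _x, y in trial["points"]:
                ys.append(y)
        result[session["user"]] = ys
    return result

def to_tab_separated(series):
    """Render {name: [values]} as tab-separated lines, one row per name."""
    lines = []
    for name, values in series.items():
        lines.append("\t".join([name] + [str(v) for v in values]))
    return "\n".join(lines)

sessions = json.loads(raw)            # JSON text -> Python lists/dicts
y_by_user = extract_y_values(sessions)
table_text = to_tab_separated(y_by_user)
print(table_text)
```

Tab-separated text pastes directly into Excel, and swapping the tab for ` & ` would give rows for a LaTeX tabular.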

July 29 (Week 6) – Maury

Today has brought me to searching for ways to combine different language models. It’s been mostly a reading day, as I’ve been delving into research about different ways to do this, such as linear interpolation and log-linear interpolation. I haven’t gotten any results yet, but I have realized that SRILM supports linear interpolation as a command-line flag (“-mix-lm”). So, first thing Monday morning, this is what I’m going to try.
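To make the two ideas concrete, here is a minimal sketch on toy unigram distributions. The distributions, the function names, and the weight of 0.5 are all assumptions for illustration; this is not how SRILM implements mixing internally. Linear interpolation averages probabilities directly, while log-linear interpolation averages log-probabilities and then has to renormalize.

```python
import math

def linear_interpolate(p1, p2, lam):
    """Mix two probabilities: lam * p1 + (1 - lam) * p2."""
    return lam * p1 + (1.0 - lam) * p2

def log_linear_interpolate(dist1, dist2, lam):
    """Combine two distributions over the same vocabulary as
    p1^lam * p2^(1 - lam), then renormalize so they sum to 1."""
    unnormalized = {
        w: math.exp(lam * math.log(dist1[w]) + (1.0 - lam) * math.log(dist2[w]))
        for w in dist1
    }
    total = sum(unnormalized.values())
    return {w: v / total for w, v in unnormalized.items()}

# Toy distributions standing in for two trained language models.
general = {"the": 0.5, "cat": 0.3, "sat": 0.2}
learner = {"the": 0.6, "cat": 0.2, "sat": 0.2}

mixed = {w: linear_interpolate(general[w], learner[w], 0.5) for w in general}
log_mixed = log_linear_interpolate(general, learner, 0.5)
print(mixed)
print(log_mixed)
```

A linear mixture of two valid distributions is already a valid distribution, which is one reason it is the simpler option to try first.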

July 28 (Week 6) – Maury

So, yesterday’s discussion with Prof. Medero was quite illuminating. As I mentioned in yesterday’s blog post, my ultimate goal is to create a table of perplexities using a Bash script. And so far, this is what I have for the table.

Corpora to Train Language Model On | Perplexity on Test Set of Corpus for ESOLs (2k)
General English (12m words) | 22.5623
English written by ESOL (532k words) | 23.2972
General English + English Written by ESOL | ?
English written for ESOL (230k) | 20.7549

I’m a little confused as to why the perplexity for general English is lower than the perplexity for English written by ESOL. That doesn’t make intuitive sense to me, because I would expect the latter corpus to be simpler, and a model trained on it to be less “surprised” by the test set of text written for ESOLs. I’m still working out the reasoning with Prof. Medero, but I think that it might have to do with the sizes of the corpora that we’re using. The test set is pretty small, as are the ESOL corpora in comparison to the general English corpus. I’m starting to look into finding larger ESOL corpora to see if that might have an effect on the perplexity, but for tomorrow, I am going to focus just on filling in the penultimate row in that table.

TextScroll: Week 8, July 27 – Michelle

Messed around with Firebase remote config, and it looks like the app takes a couple restarts before the new configurations register on the app (start->home->start->updated). This still happens even when I put the “fetch data” commands on the start button, so it is fetching each time I tap the start button. I also started writing up some notes/resources for future researchers who’ll be working on TextScroll.

Prof. Medero gave me some pointers and goals for the data analysis program I’ll be writing to process the raw data from Firebase, so tomorrow I’ll take a closer look at what last year’s researchers did and translate that into my version of the project.

July 27 (Week 6) – Maury

I spoke to Prof. Medero today, and she cleared up a lot of the questions that I had! She explained it by first supposing that we have three corpora:

  1. A corpus of general English (this, right now, would be the Wikipedia articles in Kauchak’s corpus)
  2. A corpus of English written by second-language English learners (hereafter referred to as ESOL) (this would be the CIC-FCE corpus I have)
  3. A corpus of English written for ESOLs (the English online corpus that I compiled)

The goal this summer is to use 1. and 2. to model what 3. would look like. This, of course, may raise several follow-up questions.

Why would we want to model 3.? Well, modeling text written for ESOLs is a necessary first step if we ultimately want to generate simplifications for ESOLs. We want to know what is considered difficult, so that we know what to simplify.  I thought that I would accomplish both of these steps throughout the summer. But I simply do not have the time to do that.

Well, why use other corpora to model 3.? Can’t we just use the corpora we’re trying to model? As it turns out, we can do this for ESOLs. There are plenty of corpora out there written for ESOLs. But suppose that we wanted to extend our approach to other audiences besides ESOLs (e.g., dyslexics, aphasiacs). It turns out that finding corpora written for more specific audiences is quite difficult. Therefore, in the hope of coming up with generalizable results, we want to use corpora that we can easily find (general English and English written by specific reading audiences) to model corpora that we cannot easily find.

How are we going to combine two different corpora to model 3.? Well, this is an open question, so far. Prof. Medero suggested some sort of linear interpolation of language models, or some kind of log-likelihood ratio combination. It’s up to me to investigate these approaches in the coming weeks, and see what works best.

How do we measure success in modeling 3.? This is also somewhat of an open question. Prof. Medero and I posited using the measure of perplexity. The ideal goal is to be able to train two models: one based off of 3., and one that combines 1. and 2. in some way. If we generate a test set of corpora written for ESOLs, and get the same perplexities using each model, then we will have achieved our goal. Obviously, there are some nuances to this approach, but this is the general idea.
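For a concrete picture of the measure, here is a toy sketch of perplexity: the exponential of the average negative log-probability the model assigns to each test token, so lower means less “surprised”. The add-one-smoothed unigram model, the tiny corpora, and the function names are all assumptions for illustration; the actual experiments use SRILM n-gram models.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(model, test_tokens):
    """exp of the average negative log-probability per test token."""
    log_prob = sum(math.log(model[w]) for w in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

vocab = {"the", "cat", "sat", "dog"}
model_a = train_unigram("the cat sat the cat".split(), vocab)
model_b = train_unigram("the dog sat the dog".split(), vocab)

test = "the cat sat".split()
ppl_a = perplexity(model_a, test)
ppl_b = perplexity(model_b, test)
print(ppl_a, ppl_b)  # model_a saw "cat" in training, so it is less surprised
```

Under this framing, the goal above amounts to the combined model and the model trained on 3. producing perplexities that are close to each other on the same test set.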

Armed with all of this information, I’m setting out to create a table that resembles the following:

Corpora to Train Language Model On | Perplexity on Test Set of Corpus for ESOLs
1. + 2. |

The goal is to have the penultimate and last rows of the table give similar enough perplexities. If that’s the case, then we know we will have achieved our goal. To that end, I have begun creating a Bash script that generates this table quickly and easily.

TextScroll: Week 8, July 26 – Michelle

Still no luck with GPUImage. In the meantime I played around with parsing and added some variables to Firebase’s remote config. The problem where the value fetched from the database is always 0 is happening again, but will probably work properly tomorrow. However, if Firebase doesn’t reliably update the app, I might have to take this feature out. I’ll mess around tomorrow to see when remote config works/doesn’t work.

July 25-26 (Week 6) – Maury

These past two days, I have spent time thinking about the direction of the project, and I have some questions that I really want answered.

  • If we have corpora of text written by English learners, why don’t we just use that to model text written by English learners? Why should we get that model by interpolating models of text written for normal English and text written for English learners?
  • I know that we can measure success using interpolation. But when will the difference in perplexity be significant?
  • Are we realistically going to get to the next part of this project, which is actually generating simplifications?

While thinking about those issues, I found an article that delves into an improvement of part-of-speech language modeling. I skimmed through it, and I realized that it would probably make more sense to change what gets passed into the language models: instead of the “sentences” being pre-order traversals of syntax trees, they should be the actual sentences with each word replaced by its part of speech. I tweaked a Python script to do that, and the results that I report from now on will use that approach until I figure out an even more meaningful way to use the parse trees.
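The difference between the two kinds of “sentences” can be sketched on a toy parse tree. The tuple-based tree format, the Penn Treebank-style labels, and both function names are assumptions for illustration, not my actual script, and whether the original traversal also included the words is left out here.

```python
# Hypothetical parse tree as nested tuples: (label, child, child, ...).
# A node whose only child is a string is a part-of-speech tag over a word.
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "cat")),
        ("VP", ("VBD", "sat")))

def preorder_labels(node):
    """All node labels in pre-order (parent before children), skipping words."""
    if isinstance(node, str):
        return []
    label, *children = node
    labels = [label]
    for child in children:
        labels.extend(preorder_labels(child))
    return labels

def pos_tags(node):
    """Only the part-of-speech tags, in sentence order."""
    if isinstance(node, str):
        return []
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [label]
    tags = []
    for child in children:
        tags.extend(pos_tags(child))
    return tags

preorder_sentence = " ".join(preorder_labels(tree))
pos_sentence = " ".join(pos_tags(tree))
print(preorder_sentence)
print(pos_sentence)
```

The second form keeps one token per original word, so n-gram statistics over it line up with the word order of the sentence.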