Today, I finished getting my custom linear interpolation script to work with higher-order n-grams. I still need to figure out how to interpolate the back-off weights, but I have put that on the back burner for now.
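For reference, here is a minimal sketch of what linear interpolation of two n-gram models looks like. It is not my actual script: the dictionary representation, the names `interpolate_models`, `model_a`, `model_b`, and `lam`, and the choice to treat a missing n-gram as probability 0 are all assumptions made for illustration. It deliberately ignores back-off weights, which is exactly the part I have not worked out yet.

```python
def interpolate_models(model_a, model_b, lam):
    """Linearly interpolate two models: lam * P_a + (1 - lam) * P_b.

    Each model is a dict mapping an n-gram tuple to a probability.
    Assumption for this sketch: an n-gram missing from a model gets
    probability 0.0 (a real system would back off instead).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    ngrams = set(model_a) | set(model_b)
    return {
        ng: lam * model_a.get(ng, 0.0) + (1.0 - lam) * model_b.get(ng, 0.0)
        for ng in ngrams
    }


# Toy bigram distributions for the history "the" (made-up numbers).
model_a = {("the", "cat"): 0.4, ("the", "dog"): 0.6}
model_b = {("the", "cat"): 0.2, ("the", "bird"): 0.8}
mixed = interpolate_models(model_a, model_b, lam=0.5)
```

Because each input sums to 1 over the same history, the interpolated probabilities also sum to 1, which is the main appeal of mixing the models this way.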
My attention has now turned to researching a concept that Prof. Medero pointed me to, called “domain adaptation.” The idea is to train a language model on one distribution of data (for us, the general corpus and the by-learners corpus) and have it perform well when tested on another distribution (the for-learners corpus). The research I have done so far seems promising, and I ended the day with two key takeaways:
- I should set another baseline for my system: a language model trained on the general corpus and the by-learners corpus simply concatenated together. This was inspired by the paper “Experiments in Domain Adaptation for Statistical Machine Translation” by Koehn and Schroeder.
- I should look into modifying the corpus that I am training on! One common domain adaptation method involves “selecting, joining, or weighting the datasets upon which the models (and by extension, systems) are trained” (from this paper by Axelrod et al.). One idea I came up with is to train a model on the by-learners corpus and use it to compute a sentence-level perplexity for every sentence in the general corpus. Sentences with low perplexity under that model are the ones that look most like by-learners text, so I can filter out every sentence that scores above a certain perplexity threshold. Finally, I can train a new model on the sentences that remain and test its performance. I am still working on implementing this, but I will hopefully have results by tomorrow.
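
The filtering idea in the second takeaway can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses an add-one-smoothed unigram model as a stand-in for a real n-gram language model, it tokenizes on whitespace, and the function names, toy sentences, and threshold of 10 are all hypothetical rather than anything from my actual setup.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens (stand-in for real LM training)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def sentence_perplexity(sentence, counts, total):
    """Perplexity of one sentence under an add-one-smoothed unigram model."""
    tokens = sentence.split()
    if not tokens:
        return float("inf")  # nothing to score, so never keep it
    vocab_size = len(counts) + 1  # +1 reserves mass for unseen tokens
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(sentences, counts, total, threshold):
    """Keep only sentences whose perplexity is at or below the threshold."""
    return [
        s for s in sentences
        if sentence_perplexity(s, counts, total) <= threshold
    ]


# Toy data: the in-domain model comes from the (hypothetical) by-learners text.
by_learners = ["i like my school", "i like my friend"]
general = ["i like my dog", "quarterly fiscal derivatives plummeted"]

counts, total = train_unigram(by_learners)
kept = filter_by_perplexity(general, counts, total, threshold=10.0)
```

In this toy run, the sentence that shares vocabulary with the by-learners text survives the filter, while the one made up entirely of unseen words is dropped. The sentences in `kept` would then be the training data for the new model; choosing the threshold is the part that needs actual experimentation.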