Today, I completed making my custom linear interpolation script work with higher order n-grams. I still need to figure out how to interpolate the back-off weights, but I have placed that on the back burner for now.
Now, my attention has been focused on researching a topic that Prof. Medero pointed me to, called “domain adaptation.” This is basically the problem of training a language model on one distribution of data (for us, it’d be the general corpus and by-learners corpus), and then having it perform well when we test it on another distribution of data (the for-learners corpus). The research I have done so far seems promising. I ended the day with two key takeaways:
- I should set another baseline for my system, and that is training a language model that simply concatenates the general corpus and by-learners corpus together. This was inspired by the paper “Experiments in Domain Adaptation for Statistical Machine Translation” by Koehn and Schroeder.
- I should look into modifying the corpus that I am training on! One common domain adaptation method involves “selecting, joining, or weighting the datasets upon which the models (and by extension, systems) are trained” (from this paper by Axelrod et al.). So one idea I came up with is training a model on the by-learners corpus, and using it to compute sentence-level perplexities for all of the sentences in the general corpus. I can then filter out all of the sentences that score above a certain perplexity threshold. Finally, I can train a new model on the sentences that remain, and use that to test performance. I am still working on implementing this, but I will hopefully have results by tomorrow.
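To make that idea concrete, here is a minimal sketch of the filtering step, using toy unigram models with add-one smoothing as stand-ins for the real SRILM-trained models (the function names and threshold are hypothetical, not what my script actually uses):

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Unigram model with add-one smoothing (toy stand-in for an SRILM model)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def sentence_perplexity(prob, sentence):
    """Per-word perplexity of one sentence under a word-probability function."""
    toks = sentence.split()
    logp = sum(math.log(prob(t)) for t in toks)
    return math.exp(-logp / len(toks))

def filter_by_perplexity(in_domain, general, threshold):
    """Keep only general-corpus sentences the in-domain model finds unsurprising."""
    prob = train_unigram(in_domain)
    return [s for s in general if sentence_perplexity(prob, s) <= threshold]
```

For example, with a toy by-learners corpus, an in-domain-looking general sentence scores a low perplexity and survives the cut, while a sentence of all-unseen words scores high and gets dropped.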
Today, I finished performing my own linear interpolation of the “general” corpus and “by-learners” corpus. I know it works because its perplexity across the test set is exactly the same as the perplexity given by the SRILM-interpolated model across the same test set. The only catch is that the linear interpolation script does not work with higher order n-grams. So, today I am going to spend time making that change. With that, I also need to research how the back-off weights are interpolated for higher order n-grams.
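For reference, the heart of linear interpolation is just a weighted mix of the two models’ probabilities. A minimal unigram sketch (the real script works over full ARPA models; the probability functions and weight here are illustrative):

```python
import math

def interpolate(p_general, p_learner, lam):
    """Weighted mix of two word-probability functions:
    lam * p_general(w) + (1 - lam) * p_learner(w)."""
    return lambda tok: lam * p_general(tok) + (1 - lam) * p_learner(tok)

def perplexity(prob, sentences):
    """Corpus perplexity under a word-probability function
    (base 10, to match SRILM's convention)."""
    toks = [t for s in sentences for t in s.split()]
    logp = sum(math.log10(prob(t)) for t in toks)
    return 10 ** (-logp / len(toks))
```

With lam = 0.5 and two toy unigram tables that disagree symmetrically, the mixture assigns 0.5 to each word; note that in practice SRILM estimates the weights rather than fixing them by hand, so lam here is purely illustrative.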
Today, I completed work on modifying the Bash script. It can now take in a configuration file that sets the parameters of the language models I want to create. It then creates all of the language models, computes their perplexity across a common test set, and prints out a table that compares the perplexities of each language model. Whew! Now that I have that done, I can turn my attention to somehow combining the “general” corpus and the “by-learners” corpus (whether it be interpolation or something else) to see if I can lower the perplexity even more. For now, as an exercise, I am going to start by recreating SRILM’s linear interpolation. For that, I have begun looking at the Python module arpa, which lets me parse the language models of the corpora. Then, I can hopefully linearly interpolate the two language models. I should have an update on that tomorrow.
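As a rough picture of what the arpa module has to parse, here is a stdlib sketch that reads just the unigram section of an ARPA file, where each line holds a log10 probability, a word, and an optional back-off weight (an illustration only, not the parser I plan to use):

```python
def parse_arpa_unigrams(text):
    """Collect {word: (log10 prob, back-off weight)} from an ARPA file's
    \\1-grams: section. Back-off defaults to 0.0 when the column is absent."""
    unigrams = {}
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_section = True
            continue
        if not in_section:
            continue
        if line.startswith("\\"):  # next section (or \end\) closes ours
            break
        if not line:
            continue
        fields = line.split()
        logp, word = float(fields[0]), fields[1]
        backoff = float(fields[2]) if len(fields) > 2 else 0.0
        unigrams[word] = (logp, backoff)
    return unigrams
```

The arpa module exposes the same information (for all n-gram orders) more conveniently, which is why I am starting there.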
Today, I’m starting to work on re-factoring the Bash script code so that it can read in a configuration file and train/test the language model based on those parameters. This will be helpful for generating the table that I mentioned previously for different configurations of language models (e.g., for bigrams or trigrams). So, I’ve spent most of the day looking up the best Bash approach to passing in the configuration file. Hopefully, I’ll have this running by next week so that I can start experimenting with ways to interpolate the corpora that I have.
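For what it’s worth, one common Bash pattern is a plain KEY=VALUE config file parsed with a while/read loop, so the config file never gets executed as shell. A sketch with hypothetical parameter names (not necessarily the format I’ll end up with):

```shell
#!/usr/bin/env bash
# Sketch: read KEY=VALUE pairs from a config file into shell variables.
parse_config() {
    while IFS='=' read -r key value; do
        case "$key" in
            ''|\#*) continue ;;        # skip blank lines and comments
            ORDER)  order="$value" ;;  # n-gram order, e.g. 2 or 3
            CORPUS) corpus="$value" ;; # path to the training corpus
        esac
    done < "$1"
}

# Demo with a temporary config file.
tmpconf="$(mktemp)"
printf '# LM settings\nORDER=3\nCORPUS=general.txt\n' > "$tmpconf"
parse_config "$tmpconf"
echo "order=$order corpus=$corpus"
rm -f "$tmpconf"
```

The alternative is to simply `source` the config file, which is shorter but runs whatever the file contains.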
Today, I came to terms with the results from the significance tests that I performed yesterday. Prof. Medero and I tried the Sign test as well, and it also reported that the differences in perplexity were statistically significant. So, they are statistically significant. But why, if the overall perplexities are so close in value? I feel like there could be a few explanations. The best one I came up with is that the overall perplexity statistic that we’re using (in the table) is not directly related to an average of the sentence-level perplexities given by each model. According to the SRILM website, it is computed by the expression
10^(-logprob / (words - OOVs + sentences))
where logprob is the (base-10) log-probability of seeing all of the tokens in the test set. So, it could be that this expression is understating the differences between the sentence-level perplexities, which is likely given that the test set, in number of sentences, is small.
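A quick toy calculation with that expression shows how it can hide sentence-level spread: two test sets with the same total log-probability get the identical overall perplexity, no matter how differently their individual sentences score (all numbers below are made up for illustration):

```python
def srilm_ppl(logprob, words, oovs, sentences):
    """SRILM's overall perplexity: logprob is the total log10 probability
    of the test set, and the denominator counts the scored tokens
    (words minus OOVs, plus one end-of-sentence token per sentence)."""
    return 10 ** (-logprob / (words - oovs + sentences))

# Two hypothetical 2-sentence, 10-word test sets with no OOVs:
# set A's sentences score log10 probs of -5 and -5; set B's score -2 and -8.
ppl_a = srilm_ppl(-5 + -5, 10, 0, 2)
ppl_b = srilm_ppl(-2 + -8, 10, 0, 2)
# The totals match, so the overall perplexities are identical even though
# set B's sentence-level perplexities are far more spread out.
```

That is exactly why the paired, sentence-level significance tests can disagree with the single overall number.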
Now that we have established significance testing, the goal is to explore ways to change the language model parameters and see how that affects the distributions of perplexities. Some suggestions that Prof. Medero had for me were changing the order of the language models we’re using (using bi- or tri-grams), as well as building unigram language models based on the non-terminals and parts-of-speech in the syntax trees of the corpora. I’m going to need to make changes to the Bash script to make exploring those easier, so that’s going to be my focus for today.
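As a concrete sketch of the parts-of-speech idea, a unigram model over tag sequences is just tag counts plus smoothing. A toy add-one version (the tags and data are made up; the real models would come from SRILM):

```python
import math
from collections import Counter

def pos_unigram_ppl(train_tags, test_tags):
    """Perplexity of an add-one-smoothed unigram model over POS-tag
    sequences (each corpus is a list of tag lists)."""
    counts = Counter(t for seq in train_tags for t in seq)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tags
    logp, n = 0.0, 0
    for seq in test_tags:
        for tag in seq:
            logp += math.log10((counts[tag] + 1) / (total + vocab))
            n += 1
    return 10 ** (-logp / n)
```

The non-terminal version would be identical, just fed sequences of non-terminal labels read off the syntax trees instead of tags.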
Significance testing is done! For the table that I generated in Monday’s blog post, I performed the Wilcoxon signed-rank test, comparing the sentence-level perplexities associated with the baseline (the model with the worst perplexity that I encountered) with the sentence-level perplexities associated with every other model. The weird thing that I am encountering now is that the tests indicate that each model’s difference in performance is statistically significant. Prof. Medero and I find this particularly weird, because the overall perplexities are so close in numeric value (differing by less than 1.000). Why would the models perform significantly differently from each other if they have similar overall perplexity scores? So, I’m currently doing some digging into how these overall perplexity scores are computed, to see if I can gather some insight into why the significance testing is giving me the answers that it does.
Today, I spent the majority of the day looking at significance testing and writing a Python script that my Bash script can invoke to perform significance testing on the perplexity results. The general idea that Prof. Medero has is that I have a test set of several sentences, and I compute the perplexity of every language model across that same test set. I first find a baseline, the model with the highest perplexity that I encounter in my table. I am then going to compare the sentence-level perplexities of that baseline with the sentence-level perplexities of all other language models. Per Andreas Stolcke’s suggestion, I am going to run the Wilcoxon signed-rank test in my Python script to see whether the sentence-level perplexities of each pair of models differ significantly.
So far, the Python script is almost done. All I have to do is make sure that I format the SRILM output properly in the Bash script so that it can be passed to the Python script. I should have significance testing done by tomorrow.
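For completeness, the mechanics of the test itself are simple enough to sketch. In practice scipy.stats.wilcoxon is the sensible choice for a script like this; the stdlib version below, with an exact p-value by enumeration, is only meant to illustrate what the test computes (and is only practical for small samples):

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.
    Returns (W, p), where W = min(W+, W-). Illustration only: enumerating
    all 2^n sign patterns is exact but feasible only for small n."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_pos, w_neg)
    # Exact p-value: fraction of sign assignments at least this extreme.
    hits = 0
    for signs in product((1, -1), repeat=n):
        s_pos = sum(r for s, r in zip(signs, ranks) if s > 0)
        if min(s_pos, sum(ranks) - s_pos) <= w:
            hits += 1
    return w, hits / 2 ** n
```

Feeding it two paired lists of sentence-level perplexities gives the statistic and p-value that scipy would report for small, tie-free samples.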