Today, I came to terms with the results from the significance tests that I performed yesterday. Prof. Medero and I tried the sign test as well, and it also reported that the differences in perplexity were statistically significant. So, they are statistically significant. But why, if the overall perplexities are so close in value? I can think of a few explanations. The best one I came up with is that the overall perplexity statistic we're using (in the table) is not simply an average of the sentence-level perplexities given by each model. According to the SRILM website, it is computed by the expression
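For reference, the sign test itself is simple enough to sketch by hand: pair up the two models' sentence-level perplexities, count how often each model wins, drop ties, and compare the win count to a Binomial(n, 0.5) null. This is a minimal sketch under those assumptions (the function name and the example perplexity values are mine, not from our actual data):

```python
from math import comb

def sign_test(ppl_a, ppl_b):
    """Two-sided exact sign test on paired sentence-level perplexities.

    Returns the p-value for the null hypothesis that neither model
    is more likely than the other to win on a given sentence.
    """
    diffs = [a - b for a, b in zip(ppl_a, ppl_b) if a != b]  # drop ties
    n = len(diffs)
    wins = sum(1 for d in diffs if d < 0)  # sentences where model A wins
    k = max(wins, n - wins)
    # P(X >= k) under Binomial(n, 0.5), doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical example: model A beats model B on all five sentences.
p = sign_test([10, 11, 12, 13, 14], [12, 13, 14, 15, 16])
```

Note that even a clean sweep over only five sentences gives p = 0.0625, which is one reason a small test set (in sentences) makes these comparisons delicate.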
10^(-logprob / (words - OOVs + sentences))
where logprob is the log-probability of seeing all of the tokens in the test set. So, it could be that this expression is understating the differences between the sentence-level perplexities, which is likely given that the test set (in number of sentences) is small.
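To make the mismatch concrete, here is a minimal sketch of that SRILM expression next to per-sentence perplexities, assuming base-10 log-probabilities, no OOVs, and one end-of-sentence token per sentence (that is what the `+ sentences` term accounts for). The numbers in the example are made up:

```python
def srilm_ppl(sent_logprobs, sent_lengths):
    """Corpus-level perplexity in the style of the SRILM expression:
    10^(-logprob / (words - OOVs + sentences)), with OOVs assumed zero."""
    total_logprob = sum(sent_logprobs)             # log10 prob of whole test set
    denom = sum(sent_lengths) + len(sent_lengths)  # words + sentences
    return 10 ** (-total_logprob / denom)

def sentence_ppls(sent_logprobs, sent_lengths):
    """Per-sentence perplexities, each normalized by its own length."""
    return [10 ** (-lp / (n + 1)) for lp, n in zip(sent_logprobs, sent_lengths)]

# Hypothetical: one long, poorly modeled sentence and one short, easy one.
logprobs, lengths = [-20.0, -2.0], [9, 1]
corpus = srilm_ppl(logprobs, lengths)        # pooled over all tokens
per_sent = sentence_ppls(logprobs, lengths)  # [100.0, 10.0]
```

Here the corpus-level figure (about 68) is well below the arithmetic mean of the sentence perplexities (55 would be wrong too; the mean is 55, the pooled value is roughly 68.1): the pooled statistic is a length-weighted geometric mean, so it can sit far from any simple average of the sentence-level values, and two models' tables can look nearly identical while their per-sentence distributions differ.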
Now that we have established significance testing, the goal is to explore ways to change the language model parameters and see how that affects the distributions of perplexities. Some suggestions Prof. Medero had for me were changing the order of the language models we're using (trying bi- or tri-grams), as well as building unigram language models based on the non-terminals and parts-of-speech in the syntax trees of the corpora. I'm going to need to make changes to the Bash script to make exploring those options easier, so that's my focus for today.
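The parts-of-speech idea boils down to pulling the preterminal labels out of each parse and counting them as unigrams. A minimal sketch, assuming Penn-Treebank-style bracketed trees (the helper names and the example tree are mine):

```python
import re
from collections import Counter

def preterminal_tags(tree_str):
    """Extract preterminal (POS) labels from a bracketed parse string.

    Assumes leaves look like '(TAG word)', as in Penn Treebank output.
    """
    return re.findall(r'\(([^\s()]+)\s+[^\s()]+\)', tree_str)

def unigram_counts(trees):
    """Unigram counts over POS tags, the raw material for a unigram LM."""
    counts = Counter()
    for tree in trees:
        counts.update(preterminal_tags(tree))
    return counts

# Hypothetical parse of a three-word sentence.
tree = "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"
tags = preterminal_tags(tree)  # ['DT', 'NN', 'VBZ']
```

The same regex idea would not recover the internal non-terminals (S, NP, VP); those would need an actual tree walk, so this is only the POS half of the plan.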