Today, I spent the majority of the day looking at significance testing and writing a Python script that my Bash script can invoke to perform significance testing on the perplexity results. The general idea that Prof. Medero has is this: I have a test set of several sentences, and I compute the perplexity of every language model across that same test set. I first find a baseline, the language model with the highest perplexity in my table, and then I compare the sentence-level perplexities of that baseline with the sentence-level perplexities of every other language model. Per Andreas Stolcke’s suggestion, I am going to run the Wilcoxon signed-rank test in my Python script to see whether the sentence-level perplexities differ significantly between each pair of models.
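The core of that comparison can be sketched with `scipy.stats.wilcoxon`, which runs the signed-rank test on paired samples. This is just a minimal illustration, not the actual script: the function name `compare_to_baseline` and the perplexity values are made up for the example, and it assumes SciPy is available.

```python
from scipy.stats import wilcoxon

def compare_to_baseline(baseline_ppls, model_ppls, alpha=0.05):
    """Paired Wilcoxon signed-rank test on sentence-level perplexities.

    Both lists must be aligned: position i in each list is the
    perplexity of the same test sentence under the two models.
    """
    stat, p_value = wilcoxon(baseline_ppls, model_ppls)
    return stat, p_value, p_value < alpha

# Hypothetical sentence-level perplexities, purely for illustration.
baseline = [250.1, 310.4, 198.7, 402.3, 275.0, 330.9, 289.5, 215.2]
model    = [180.3, 260.8, 150.2, 355.1, 230.4, 290.7, 240.6, 170.9]

stat, p, significant = compare_to_baseline(baseline, model)
print(f"W={stat:.1f}, p={p:.4f}, significant={significant}")
```

Because the model's perplexity is lower than the baseline's on every one of these eight sentences, the exact two-sided p-value comes out at 2/2^8 ≈ 0.008, so the toy difference is significant at the 0.05 level.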
So far, the Python script is almost done. All I have to do is make sure that I format the SRILM output properly in the Bash script so that it can be passed to the Python script. I should have significance testing done by tomorrow.
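For the hand-off between the two scripts, one option is to have the Python side pull the per-sentence perplexities straight out of SRILM's `ngram -ppl ... -debug 1` output, which prints a `ppl=` line for each sentence. The snippet below is a sketch under that assumption; the sample output text and the helper name `sentence_perplexities` are invented for illustration, and the regex would need a guard against the file-level summary line (which also contains `ppl=`) in real output.

```python
import re

# Illustrative excerpt in the shape of SRILM `ngram -ppl -debug 1`
# per-sentence output (format assumed; numbers are made up).
srilm_output = """\
this is a test sentence
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -12.3456 ppl= 289.74 ppl1= 412.08

another example here
1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.1234 ppl= 305.12 ppl1= 510.44
"""

def sentence_perplexities(text):
    """Extract the per-sentence ppl= values (not ppl1=) from SRILM output."""
    return [float(m) for m in re.findall(r"\bppl=\s*([0-9.eE+-]+)", text)]

print(sentence_perplexities(srilm_output))  # → [289.74, 305.12]
```

Parsing in Python keeps the Bash side trivial: the Bash script only has to capture SRILM's stdout into a file and pass the filename along.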