Significance testing is done! For the table that I generated in Monday’s blog post, I performed the Wilcoxon signed-rank test, comparing the sentence-level perplexities associated with the baseline (the model with the worst perplexity that I encountered) against the sentence-level perplexities associated with every other model. The weird thing that I’m encountering now is that the tests indicate that each model’s difference in performance from the baseline is statistically significant. Prof. Medero and I find this particularly weird because the perplexities are so close in numeric value (the differences are less than 1.000). Why would the models perform significantly differently from each other if they have similar overall perplexity scores? So, I’m currently digging into how these overall perplexity scores are computed, to see if I can gather some insight into why the significance testing is giving me the answers that it does.
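For reference, here’s a minimal sketch of this kind of paired test using `scipy.stats.wilcoxon`. The perplexity values and the eight-sentence sample size are made up for illustration, not my actual data:

```python
from scipy.stats import wilcoxon

# Hypothetical sentence-level perplexities from the baseline model.
baseline = [112.4, 98.7, 105.1, 120.3, 99.8, 110.2, 101.5, 108.9]

# Another model whose per-sentence perplexity is consistently a bit lower,
# by well under 1.0 each time (offsets are made up and all distinct).
offsets = [0.31, 0.62, 0.44, 0.58, 0.27, 0.49, 0.36, 0.53]
candidate = [b - o for b, o in zip(baseline, offsets)]

# Paired test on the per-sentence differences.
stat, p_value = wilcoxon(baseline, candidate)
print(f"W = {stat}, p = {p_value:.4f}")  # tiny but consistent shifts still give p < 0.05
```

One thing worth noting about the test itself: because it ranks the signed per-sentence differences, a shift that is tiny but consistent in the same direction across sentences can still come out significant, even when it barely moves the overall perplexity number.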