August 3 (Week 7) – Maury

Significance testing is done! For the table I generated in Monday's blog post, I performed the Wilcoxon signed-rank test, comparing the sentence-level perplexities associated with the baseline (the model with the worst perplexity I encountered) against the sentence-level perplexities associated with every other model. The weird thing I am encountering now is that the tests indicate that the differences in model performance are statistically significant. Prof. Medero and I find this particularly weird, because the perplexities are so close in numeric value (differing by less than 1.000). Why would the models perform significantly differently from each other when they have such similar overall perplexity scores? So I'm currently digging into how these overall perplexity scores are computed, to see if I can gain some insight into why the significance testing is giving me the answers that it does.
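For readers curious what the paired test looks like in practice, here is a minimal sketch using SciPy's `wilcoxon`, assuming the sentence-level perplexities for two models are stored as parallel arrays (the numbers below are made up for illustration, not from the actual experiment):

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative sentence-level perplexities (NOT the real data):
# one value per sentence, paired across the two models.
baseline = np.array([112.4, 98.7, 133.1, 87.5, 120.9, 101.3, 95.8, 110.2])
other    = np.array([111.9, 98.1, 132.6, 87.0, 120.3, 100.8, 95.2, 109.8])

# Wilcoxon signed-rank test on the paired per-sentence differences.
stat, p_value = wilcoxon(baseline, other)
print(f"W = {stat}, p = {p_value:.4f}")
```

Note that the test ranks the signed per-sentence differences rather than their magnitudes, so many tiny differences that all point the same direction (as in the toy data above, where the baseline is higher on every sentence) can still come out significant even when the aggregate perplexities are nearly identical.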
