Continuing from yesterday, I added illustrations drawn with Photoshop to the poster and ran it by Prof. Medero, who suggested some small changes. After adding some finishing touches to the poster, I continued updating documentation and making sure all my files are readily accessible for future use.
Yesterday, I worked a bit on the Python data analysis program so that I could leave it at a place where the basic building blocks are all there. I also fixed some GitHub upload issues with Prof. Medero and Michael and updated the documentation accordingly.
Today I started making the poster I spent part of yesterday planning. Below is its current state. I’m planning to add a few illustrations demonstrating how the app works in the white space at the top right. I hope it’ll be ready for printing tomorrow after some final touches.
Today, I got my custom linear interpolation script working with higher-order n-grams. I still need to figure out how to interpolate the back-off weights, but I have placed that on the back burner for now.
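As a minimal sketch of the idea (not my actual script, which operates on full ARPA-format models), linear interpolation of two higher-order models mixes their conditional probabilities word by word. The tables and numbers below are toy values for illustration only:

```python
# Sketch of linear interpolation for higher-order n-grams.
# The two "models" here are toy trigram conditional-probability tables;
# the real script works on whole ARPA-format language models instead.

def interpolate(p_general, p_learners, lam):
    """Return an interpolated conditional probability function:
    p(w | h) = lam * p_general(w | h) + (1 - lam) * p_learners(w | h)
    """
    def p(word, history):
        return lam * p_general(word, history) + (1 - lam) * p_learners(word, history)
    return p

# Toy conditional distributions (hypothetical numbers)
general = {("the", "cat"): {"sat": 0.6, "ran": 0.4}}
learners = {("the", "cat"): {"sat": 0.2, "ran": 0.8}}

p_gen = lambda w, h: general.get(h, {}).get(w, 0.0)
p_lrn = lambda w, h: learners.get(h, {}).get(w, 0.0)

p_mix = interpolate(p_gen, p_lrn, lam=0.5)
print(p_mix("sat", ("the", "cat")))  # 0.5*0.6 + 0.5*0.2 = 0.4
```

The open question about back-off weights doesn’t show up at this level: mixing the explicit conditional probabilities is straightforward, but a real ARPA model also needs consistent back-off weights for the interpolated entries.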
My attention has now turned to a topic Prof. Medero pointed me to called “domain adaptation”: training a language model on one distribution of data (for us, the general corpus and by-learners corpus) and having it perform well when tested on another distribution (the for-learners corpus). The research I have done so far seems promising, and I ended the day with two key takeaways:
- I should set another baseline for my system: a language model trained on the simple concatenation of the general corpus and the by-learners corpus. This was inspired by the paper “Experiments in Domain Adaptation for Statistical Machine Translation” by Koehn and Schroeder.
- I should look into modifying the corpus that I am training on! One common domain adaptation method involves “selecting, joining, or weighting the datasets upon which the models (and by extension, systems) are trained” (from this paper by Axelrod et al.). So one idea I came up with is to train a model on the by-learners corpus and use it to compute sentence-level perplexities for all of the sentences in the general corpus. I can then filter out all of the sentences that score above a certain perplexity threshold, train a new model on the sentences that remain, and use that model to test performance. I am still implementing this, but I will hopefully have results by tomorrow.
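A minimal sketch of that filtering idea, with a smoothed toy unigram model standing in for the real by-learners language model (the sentences, threshold, and model are all illustrative, not my actual pipeline):

```python
import math
from collections import Counter

# Toy stand-in for the by-learners language model: an add-one-smoothed
# unigram model. The real pipeline would use an SRILM n-gram model.
by_learners = ["the cat sat", "the cat ran", "a dog ran"]
counts = Counter(w for s in by_learners for w in s.split())
total = sum(counts.values())
vocab = len(counts) + 1  # +1 reserves mass for unseen words

def sentence_perplexity(sentence):
    words = sentence.split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

# Score every sentence of the general corpus and keep the in-domain-looking
# (low-perplexity) ones for training the new model.
general = ["the cat sat", "quantum flux capacitors hum", "a dog ran"]
threshold = 10.0  # hypothetical cutoff
kept = [s for s in general if sentence_perplexity(s) <= threshold]
print(kept)
```

Here the out-of-domain sentence scores well above the cutoff and is dropped, which is exactly the behavior I want on the general corpus.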
Today, I finished performing my own linear interpolation of the “general” corpus and “by-learners” corpus. I know it works because its perplexity across the test set exactly matches the perplexity of the SRILM-interpolated model on the same test set. The only catch is that the linear interpolation script does not work with higher-order n-grams, so that is the change I am going to spend today making. Along with that, I need to research how the back-off weights are interpolated for higher-order n-grams.
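For reference, the check itself is simple: test-set perplexity is computed from the total log-probability, so two models match exactly when their probabilities agree. A toy unigram sketch of the comparison (the tables, weight, and “reference” values are hypothetical; the real check compared full ARPA models):

```python
import math

# Two "models" that should be identical: a hand-interpolated unigram table
# and a reference table standing in for the SRILM-interpolated output.
general  = {"the": 0.5, "cat": 0.3, "sat": 0.2}
learners = {"the": 0.4, "cat": 0.2, "sat": 0.4}
lam = 0.7
mine = {w: lam * general[w] + (1 - lam) * learners[w] for w in general}
reference = {"the": 0.47, "cat": 0.27, "sat": 0.26}  # hypothetical SRILM output

def perplexity(model, test_words):
    log10_prob = sum(math.log10(model[w]) for w in test_words)
    return 10 ** (-log10_prob / len(test_words))  # the quantity SRILM reports

test_set = "the cat sat the cat".split()
print(perplexity(mine, test_set), perplexity(reference, test_set))
```

When the two perplexities agree on the whole test set, the interpolation was done correctly; any per-word mismatch would show up as a difference here.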
Last Friday and today were spent troubleshooting the strange visual bugs and nil errors that appeared in TestViewController after I transferred the working code over from TutorialViewController. After rearranging the code (and occasionally making things worse), I fixed the nil exception errors, but then found out that GPUImage currently can’t handle images larger than 2048×2048, and the sample texts in the main experiment have a width of around 34,000. For the time being, the blur filter stops working on these text samples, and I’m searching for a fix. I’ve tried swapping in other third-party extensions I found, but none of them worked, and uninstalling/reinstalling frameworks was quite a headache.
Goals for the end of this week are:
- Make a poster
- Leave TextScroll and documentation in a place where a new researcher can continue
- Get basic data analysis working so that new researchers can easily build on that
Today, I completed my modifications to the Bash script. It can now take in a configuration file that sets the parameters of the language models I want to create. It then builds all of the language models, computes their perplexity across a common test set, and prints out a table that compares the perplexities of each model. Whew! With that done, I can turn my attention to combining the “general” corpus and the “by-learners” corpus (whether by interpolation or something else) to see if I can lower the perplexity even more. As an exercise, I am going to start by recreating SRILM’s linear interpolation. For that, I have begun looking at the Python module arpa, which lets me parse the language models of the two corpora; then I can hopefully linearly interpolate them. I should have an update on that tomorrow.
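The workflow the script implements can be sketched in Python with toy, unsmoothed maximum-likelihood models (the config, corpora, and model code here are illustrative stand-ins; the real script drives SRILM):

```python
import math
from collections import Counter

# Hypothetical configuration, standing in for the file the Bash script reads.
config = {
    "orders": [1, 2],
    "train": "the cat sat . the cat ran .",
    "test": "the cat sat .",
}

def train(order, text):
    """Maximum-likelihood n-gram counts (no smoothing; toy model only)."""
    words = text.split()
    ngrams = Counter(tuple(words[i:i + order]) for i in range(len(words) - order + 1))
    contexts = Counter(tuple(words[i:i + order - 1]) for i in range(len(words) - order + 1))
    return ngrams, contexts

def perplexity(order, model, text):
    """Perplexity of a toy MLE model; assumes every test n-gram was seen in training."""
    ngrams, contexts = model
    words = text.split()
    grams = [tuple(words[i:i + order]) for i in range(len(words) - order + 1)]
    log_p = sum(math.log(ngrams[g] / contexts[g[:-1]]) for g in grams)
    return math.exp(-log_p / len(grams))

# Build every configured model, score the shared test set, print the table.
print(f"{'order':>5} {'perplexity':>10}")
for order in config["orders"]:
    ppl = perplexity(order, train(order, config["train"]), config["test"])
    print(f"{order:>5} {ppl:>10.3f}")
```

This mirrors the script’s shape: one config, a loop over model configurations, a shared test set, and a final comparison table.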
Today, I started refactoring the Bash script so that it can read in a configuration file and train/test the language model based on those parameters. This will be helpful for generating the table I mentioned previously for different configurations of language models (e.g., bigrams or trigrams), so I’ve spent most of the day looking up the best Bash approach to passing in the configuration file. Hopefully, I’ll have this running by next week so that I can start experimenting with ways to interpolate the corpora that I have.