Written Questions
I already played around with the interactive demo in Phase 8, but I want to play with it MORE!
You're in luck!
In this week's written questions, you'll be asked to more systematically explore the machine learning demo that you got working in Phase 8. In the process, you'll build some more intuition about how it works.
There'll also be some more "traditional" questions that ask you to look at and interpret graphs. After all, this assignment is still (ostensibly) about building data structures...
Exploration: What Measure is a Pokemon?
I think I found a bug! I entered "valgrind" into the interactive mode and the AI said it sounds like a Pokemon, even though I know for a fact that Valgrind is software!
That's not a bug...but it's a great catch!
Remember that 69% accuracy you got if you implemented everything correctly? That number isn't 100%, which means that the model makes mistakes (classifies some Pokemon as software and vice versa). But why?
Oh, I think I know the answer! It's because the things in the test set aren't in the training set, right? So it can't "remember" the right label for them, because it's never seen them?
That's part of it, but there's more to it than that.
MORE? Music to my ears!
There's a common misconception that machine learning models simply memorize their training data and rearrange it, like making a collage out of things you found in your room. But what's actually happening is more subtle than that. Every machine learning model—from the simple one you built to the most cutting-edge LLMs—has a finite number of parameters (roughly, variables in the math whose values can be trained). During training, the model tries to encode as much information as it can about the very information-rich training data into this finite number of parameters. This requires some degree of generalization...and, consequently, some loss of information.
This is why you sometimes hear people say that machine learning is like a form of lossy compression.
Squawk! Generative AI models are sometimes nicknamed "stochastic parrots" for similar reasons! Squawk!
Today's most cutting-edge models have billions of parameters. By contrast, our simple model has mere thousands of parameters—two for each feature (one representing its count in the Pokemon class, and the other in the software class). This makes our model much less capable than something like ChatGPT. But on the flip side, it also makes our model much easier to analyze...and to manipulate!
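To make "two parameters per feature" concrete, here's a minimal sketch of how such a model could tally character-trigram features, keeping one count per feature per class. This is an illustration in Python, not the assignment's actual code: the function names and toy training lists below are made up (the real names live in dataset-train.txt).

```python
from collections import Counter

def trigrams(name):
    """Extract overlapping 3-character features from a name."""
    name = name.lower()
    return [name[i:i+3] for i in range(len(name) - 2)]

def train(names_by_class):
    """Tally each trigram feature once per class.
    The result stores exactly two numbers per feature:
    its count among Pokemon names and among software names."""
    counts = {label: Counter() for label in names_by_class}
    for label, names in names_by_class.items():
        for name in names:
            counts[label].update(trigrams(name))
    return counts

# Toy training data for illustration only
model = train({
    "pokemon":  ["pikachu", "charizard", "bulbasaur"],
    "software": ["valgrind", "git", "docker"],
})
```

With thousands of training names, these per-class counts are where all of the model's "knowledge" lives—hence mere thousands of parameters.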
In fact, the interactive demo already implements some model analysis. When you input "valgrind" you should see something like the following:
That sounds more like a pokemon!
Here's how each feature contributed to my prediction:
val: +0.297027
alg: +1.10796
gri: +0.00934445
rin: +1.61878
ind: -0.278338
The numbers indicate how much more or less likely each feature is to be seen in the predicted class "pokemon". For example, "alg" is much more likely to be seen in Pokemon names, and "ind" is slightly less likely to be seen in Pokemon names (i.e., more likely to be seen in software names).
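For the curious, one plausible way to produce numbers like these is as log-likelihood ratios of each feature's smoothed frequency in the two classes. The sketch below uses add-one smoothing; the demo's exact formula and the parameter names here are assumptions, not the assignment's actual code.

```python
import math

def contribution(feature, counts, totals, vocab_size, alpha=1.0):
    """How much this feature shifts the prediction toward "pokemon":
    the log ratio of its smoothed frequency in each class.
    Positive favors Pokemon, negative favors software.
    (Sketch with add-one smoothing; the demo may differ.)"""
    p_poke = (counts["pokemon"].get(feature, 0) + alpha) / (totals["pokemon"] + alpha * vocab_size)
    p_soft = (counts["software"].get(feature, 0) + alpha) / (totals["software"] + alpha * vocab_size)
    return math.log(p_poke / p_soft)

def predict(name, counts, totals, vocab_size):
    """Sum the per-feature contributions over a name's trigrams;
    a positive total means "pokemon"."""
    feats = [name[i:i+3] for i in range(len(name) - 2)]
    score = sum(contribution(f, counts, totals, vocab_size) for f in feats)
    return "pokemon" if score > 0 else "software"
```

Under this view, flipping a prediction just means changing the string enough that the sum of contributions crosses zero.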
Here is your exploration task: Can you come up with a simple change to the string "valgrind" that will cause the model to predict "software" instead of "pokemon"? "Simple" is subjective of course, but try to keep the string mostly recognizable as "valgrind". You can add new characters (e.g., "valgrind.com") or substitute individual characters for similar-looking ones (e.g., 'a' to '@'), but try to avoid deleting too many characters.
Feel free to play around and find a solution you're happy with—the more whimsical, the better! To guide your exploration, you can use the feature explanations in the interactive demo, lean on your own intuitions about what software and/or pokemon names often look like, or scan through the training data (dataset-train.txt) to look for common patterns. When you're ready, tell us what you did in the written questions:
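If you enjoy automating your whimsy, you could even script part of the search. Here's an optional sketch of a helper that tries appending software-flavored suffixes until a classifier changes its mind; the suffix list is a made-up starting point, and classify stands in for whatever prediction function you have on hand.

```python
def find_flip(name, classify, suffixes=(".com", ".exe", "-cli", "2.0")):
    """Append candidate suffixes (a hypothetical list) to a name and
    return the first variant that the classifier labels differently
    from the original, or None if no suffix flips it."""
    original = classify(name)
    for s in suffixes:
        candidate = name + s
        if classify(candidate) != original:
            return candidate
    return None
```

Of course, hand-crafted edits guided by the demo's feature explanations are often more fun (and more whimsical) than brute force.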
Questions
Tell us about your exploration in the written questions on Gradescope!