URLs.YELP_REVIEWS.
Alternatively, a slightly modified version of the data is available (with positive/negative labels already created from the star ratings) from fastai at URLs.YELP_REVIEWS_POLARITY.
The training and validation datasets are huge. I suggest you use at most 10% of the data for training/validation because otherwise training will take forever.
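As a hedged sketch of one way to subsample, assuming the polarity archive unpacks to a headerless train.csv with (label, text) columns — check the actual file names after untarring:

```python
from fastai.text.all import *
import pandas as pd

# Download and unpack the polarity version of the dataset
path = untar_data(URLs.YELP_REVIEWS_POLARITY)

# Assumption: the archive contains a train.csv with (label, text) columns and no header row
df = pd.read_csv(path/'train.csv', header=None, names=['label', 'text'])

# Keep a random 10% so training does not take forever
df_small = df.sample(frac=0.1, random_state=42)

# Build classification dataloaders from the reduced dataframe
dls = TextDataLoaders.from_df(df_small, text_col='text', label_col='label', valid_pct=0.2)
```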
For parts 2 and 3, you'll need a dictionary of unique words (which can be obtained from the dataloader). From then on, you can use the index into this dictionary rather than the string itself.
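For example, assuming `dls` was built as in the snippet above, the token vocabulary can be read straight off the dataloader and turned into a word-to-index mapping:

```python
# For a classification dataloader the token vocabulary comes first and the
# class labels second; the lab refers to it as dls.train.vocab[0]
vocab = dls.train.vocab[0]

# Index into this dictionary instead of carrying the strings around
word2idx = {w: i for i, w in enumerate(vocab)}

print(len(vocab))            # number of unique words in the dictionary
print(word2idx.get('good'))  # integer used in place of the string 'good'
```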
A good way to approach this is to use a standard linear neural network with 1 or 2 hidden layers. How many outputs from the last layer?
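A minimal PyTorch sketch of one such network follows; the hidden sizes (256 and 64) are arbitrary assumptions, and the last layer is sized to one output per class (two for a positive/negative setup). Its input is assumed to be the fixed-length 0/1 vector described in the next step.

```python
import torch.nn as nn

vocab_sz = len(vocab)   # size of the word dictionary from the previous step
n_classes = 2           # assumption: one output per class (e.g. positive/negative)

# Plain fully connected network with two hidden layers; the input is assumed
# to be the length-vocab_sz 0/1 vector described in the next step
model = nn.Sequential(
    nn.Linear(vocab_sz, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, n_classes),   # raw scores, to be paired with nn.CrossEntropyLoss
)
```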
You'll need a new layer at the beginning of the neural network that converts from the format the dataloader provides, a tensor of word numbers, one for each word in the review:
[3, 5, 1, 9, 57, ..., 12]
to a tensor of length equal to the number of words in the dictionary (dls.train.vocab[0]), with a 1 at each location where a word is present:
[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, ...]
(note that in this example the entries at indices 3, 5, 1, 9, ..., and 12 are set to 1; all others are 0).
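One hedged way to implement that conversion as a first layer, assuming padded batches of word indices from the dataloader (note that padding tokens also get marked with a 1, which you may want to mask out):

```python
import torch
import torch.nn as nn

class MultiHot(nn.Module):
    "Turn a batch of word-index tensors into 0/1 vectors of length vocab_sz."
    def __init__(self, vocab_sz):
        super().__init__()
        self.vocab_sz = vocab_sz

    def forward(self, word_idxs):
        # word_idxs: (batch, seq_len) integer tensor of word numbers
        out = torch.zeros(word_idxs.shape[0], self.vocab_sz, device=word_idxs.device)
        # put a 1 at every index that occurs in the review (padding included)
        return out.scatter_(1, word_idxs, 1.0)

# Prepend the conversion layer to the linear network sketched above
model = nn.Sequential(MultiHot(vocab_sz), *model)
```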
Make sure you are using the same embedding matrix for each word.
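As a hedged illustration of what "the same embedding matrix for each word" means, here is a sketch with a single nn.Embedding looked up at every position and then mean-pooled; the 100-dimensional embedding size is an arbitrary assumption:

```python
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    "One shared embedding matrix, looked up for every word, then mean-pooled."
    def __init__(self, vocab_sz, emb_sz=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)   # the single shared matrix
        self.head = nn.Sequential(
            nn.Linear(emb_sz, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, word_idxs):
        # word_idxs: (batch, seq_len); every position is looked up in the same self.emb
        vecs = self.emb(word_idxs)      # (batch, seq_len, emb_sz)
        pooled = vecs.mean(dim=1)       # average the word vectors for the review
        return self.head(pooled)
```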
Challenge 1 Use a pretrained embedding like GloVe or word2vec for step 3 (a sketch of loading GloVe vectors follows this list of challenges).
Challenge 2 Use embeddings in conjunction with an LSTM for step 3.
Challenge 3 Fine-tune the underlying language model for step 1 before fine-tuning the classifier based on it.
Challenge 4 Train a regression model rather than a classification model for the three steps.
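For Challenge 1, a hedged sketch of copying pretrained GloVe vectors into the shared embedding layer from the earlier sketch; the filename glove.6B.100d.txt is a placeholder and assumes you have downloaded the vectors locally:

```python
import numpy as np
import torch

def load_glove_into(emb, vocab, path='glove.6B.100d.txt'):
    "Overwrite rows of an nn.Embedding with pretrained GloVe vectors where available."
    glove = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, *vals = line.rstrip().split(' ')
            glove[word] = np.asarray(vals, dtype='float32')
    with torch.no_grad():
        for i, w in enumerate(vocab):
            if w in glove:
                emb.weight[i] = torch.from_numpy(glove[w])
    return emb

# Words not found in GloVe simply keep their randomly initialised vectors
model3 = EmbeddingClassifier(len(vocab), emb_sz=100)
load_glove_into(model3.emb, vocab)
```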
This completes the lab. See the submit instructions for handing in your work.