URLs.YELP_REVIEWS.
Alternatively, a slightly modified version of the data is available (with positive/negative labels already created from the star ratings) from fastai at URLs.YELP_REVIEWS_POLARITY.
The training and validation datasets are huge. I suggest you use at most 10% of the data for training/validation because otherwise training will take forever.
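As a hedged sketch of one way to subsample, assuming the polarity archive unpacks to a headerless train.csv with (label, text) columns — check the actual file names after untarring:

```python
from fastai.text.all import *
import pandas as pd

# Download and unpack the polarity version of the dataset
path = untar_data(URLs.YELP_REVIEWS_POLARITY)

# Assumption: the archive contains a train.csv with (label, text) columns and no header row
df = pd.read_csv(path/'train.csv', header=None, names=['label', 'text'])

# Keep a random 10% so training does not take forever
df_small = df.sample(frac=0.1, random_state=42)

# Build classification dataloaders from the reduced dataframe
dls = TextDataLoaders.from_df(df_small, text_col='text', label_col='label', valid_pct=0.2)
```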
For parts 2 and 3, you'll need a dictionary of unique words (which can be obtained from the dataloader). From then on, you can use the index into this dictionary rather than the string itself.
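For example, assuming `dls` was built as in the snippet above, the token vocabulary can be read straight off the dataloader and turned into a word-to-index mapping:

```python
# For a classification dataloader the token vocabulary comes first and the
# class labels second; the lab refers to it as dls.train.vocab[0]
vocab = dls.train.vocab[0]

# Index into this dictionary instead of carrying the strings around
word2idx = {w: i for i, w in enumerate(vocab)}

print(len(vocab))            # number of unique words in the dictionary
print(word2idx.get('good'))  # integer used in place of the string 'good'
```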
A good way to approach this is to use a standard linear neural network with 1 or 2 hidden layers. How many outputs from the last layer?
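A minimal PyTorch sketch of one such network follows; the hidden sizes (256 and 64) are arbitrary assumptions, and the last layer is sized to one output per class (two for a positive/negative setup). Its input is assumed to be the fixed-length 0/1 vector described in the next step.

```python
import torch.nn as nn

vocab_sz = len(vocab)   # size of the word dictionary from the previous step
n_classes = 2           # assumption: one output per class (e.g. positive/negative)

# Plain fully connected network with two hidden layers; the input is assumed
# to be the length-vocab_sz 0/1 vector described in the next step
model = nn.Sequential(
    nn.Linear(vocab_sz, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, n_classes),   # raw scores, to be paired with nn.CrossEntropyLoss
)
```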
You'll need a new layer at the beginning of the neural network that converts from the format the dataloader provides, a tensor of word numbers, one for each word in the review:
[3, 5, 1, 9, 57, ..., 12]
to a tensor of length equal to the number of words in the dictionary (dls.train.vocab[0]), with a 1 at each location where a word is present:
[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, ...]
(note that in this example the entries at indices 3, 5, 1, 9, ..., and 12 are set to 1; all others are 0).
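One hedged way to implement that conversion as a first layer, assuming padded batches of word indices from the dataloader (note that padding tokens also get marked with a 1, which you may want to mask out):

```python
import torch
import torch.nn as nn

class MultiHot(nn.Module):
    "Turn a batch of word-index tensors into 0/1 vectors of length vocab_sz."
    def __init__(self, vocab_sz):
        super().__init__()
        self.vocab_sz = vocab_sz

    def forward(self, word_idxs):
        # word_idxs: (batch, seq_len) integer tensor of word numbers
        out = torch.zeros(word_idxs.shape[0], self.vocab_sz, device=word_idxs.device)
        # put a 1 at every index that occurs in the review (padding included)
        return out.scatter_(1, word_idxs, 1.0)

# Prepend the conversion layer to the linear network sketched above
model = nn.Sequential(MultiHot(vocab_sz), *model)
```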
Make sure you are using the same embedding matrix for each word.
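As a hedged illustration of what "the same embedding matrix for each word" means, here is a sketch with a single nn.Embedding looked up at every position and then mean-pooled; the 100-dimensional embedding size is an arbitrary assumption:

```python
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    "One shared embedding matrix, looked up for every word, then mean-pooled."
    def __init__(self, vocab_sz, emb_sz=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)   # the single shared matrix
        self.head = nn.Sequential(
            nn.Linear(emb_sz, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, word_idxs):
        # word_idxs: (batch, seq_len); every position is looked up in the same self.emb
        vecs = self.emb(word_idxs)      # (batch, seq_len, emb_sz)
        pooled = vecs.mean(dim=1)       # average the word vectors for the review
        return self.head(pooled)
```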
Challenge 1 Use a pretrained embedding like GloVe or word2vec for step 3 (a sketch of loading GloVe vectors follows this list of challenges).
Challenge 2 Use embeddings in conjunction with an LSTM for step 3.
Challenge 3 Fine-tune the underlying language model for step 1 before fine-tuning the classifier based on it.
Challenge 4 Train a regression model rather than a classification model for the three steps.
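For Challenge 1, a hedged sketch of copying pretrained GloVe vectors into the shared embedding layer from the earlier sketch; the filename glove.6B.100d.txt is a placeholder and assumes you have downloaded the vectors locally:

```python
import numpy as np
import torch

def load_glove_into(emb, vocab, path='glove.6B.100d.txt'):
    "Overwrite rows of an nn.Embedding with pretrained GloVe vectors where available."
    glove = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, *vals = line.rstrip().split(' ')
            glove[word] = np.asarray(vals, dtype='float32')
    with torch.no_grad():
        for i, w in enumerate(vocab):
            if w in glove:
                emb.weight[i] = torch.from_numpy(glove[w])
    return emb

# Words not found in GloVe simply keep their randomly initialised vectors
model3 = EmbeddingClassifier(len(vocab), emb_sz=100)
load_glove_into(model3.emb, vocab)
```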
This completes the lab. See the submit instructions for handing in your work.