Lab 6: Text Generation using RNNs
Updated 4/7/21: 5:03 PM
Introduction
In this lab, you'll train a language model on a corpus of text. Then, you'll
use the trained model to generate text based on a primed text string.
For example, you'll train a language model on Shakespeare's plays, and then, given a prompt like:
What is the temptation
you'll generate follow-on text that continues the prompt in a Shakespearean style.
You'll find useful information in chapters 10 and 11 of the fastai book.
The two corpora you'll be using are:
- Shakespeare's plays
- the Linux kernel source code
Parts
Part 1: Shakespeare using word tokenization
Steps
Follow the steps in Chapter 10, fairly closely:
- Create a language model data loader (see the Language Model Using DataBlock section of Chapter 10).
- Show the batch and ensure it looks proper (same section).
- Create a non-pretrained language model learner with language_model_learner (see the Fine-Tuning the Language Model section of Chapter 10).
- Train the model for one or two epochs.
- Generate some fake Shakespeare: at least two examples of at least 50 words, each starting with the text prompt "What is the temptation" (see the Text Generation section of Chapter 10).
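Put together, the steps above might look roughly like the sketch below. This is a configuration-style sketch, not a tested solution: `get_shakespeare_texts` is a hypothetical helper for loading your corpus into a DataFrame with a `text` column, and the exact arguments follow the fastai v2 API as used in Chapter 10, so adapt as needed.

```python
from fastai.text.all import *

# Hypothetical helper: returns a DataFrame with one 'text' column
# containing the Shakespeare corpus split into documents.
df = get_shakespeare_texts()

dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1)
).dataloaders(df, bs=64, seq_len=72)

dls.show_batch(max_n=2)  # sanity-check that the batch looks right

# pretrained=False: we train from scratch rather than fine-tuning
learn = language_model_learner(dls, AWD_LSTM, pretrained=False,
                               metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(2, 3e-3)

# Two 50-word samples from the required prompt
preds = [learn.predict("What is the temptation", 50, temperature=0.75)
         for _ in range(2)]
```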
Hints
Part 2: Shakespeare using character tokenization
In this part, you'll switch from feeding in words and predicting subsequent words to instead feeding in characters and predicting subsequent characters.
You'll want to read about tokenization in Chapter 10, from the Tokenization section through the Numericalization with fastai section.
You'll need to:
- Create a custom tokenizer.
- Decide what rules you'll want, if any.
- Pass in a vocabulary to the TextBlock.
Otherwise, the steps will be fairly similar to Part 1.
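A character tokenizer can be very small. Here is a minimal sketch (the class name is my own; fastai expects a tokenizer to be a callable that maps a batch of texts to lists of tokens, and this follows that convention):

```python
class CharTokenizer:
    """Split each text into a list of single-character tokens.

    No rules are applied here; you may decide to lowercase,
    collapse whitespace, etc., depending on your corpus.
    """
    def __call__(self, items):
        # items: an iterable of strings; yield one token list per string
        for text in items:
            yield list(text)

tok = CharTokenizer()
tokens = list(tok(["To be"]))
# tokens[0] is ['T', 'o', ' ', 'b', 'e']

# One way to build the vocabulary to pass to TextBlock:
vocab = sorted({ch for text in ["To be"] for ch in text})
```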
Part 3: Linux source code using character tokenization
In this section you'll:
- Train on Linux source code rather than Shakespeare.
- Write your own greedy_predict routine that will:
- Take a prompt and a length. You'll output the language model's prediction for the most likely output of the given length that starts with the given prompt.
- Repeatedly call the language model. At each step, you'll get a probability distribution over characters, and you'll greedily choose the next character (the one with the highest probability).
- Write your own random_predict routine that will:
- Take a prompt and a length. You'll output a sampled output of the given length that starts with the given prompt.
- Repeatedly call the language model. At each step, you'll get a probability distribution over characters, and you'll sample the next character from that distribution.
- Write your own beam_predict routine that will:
- Take a prompt, a length, and a beam width k. You'll output the language model's prediction for the most likely output of the given length that starts with the given prompt.
- Do a beam search (an improvement on greedy prediction) to compute the most likely output.
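One way to structure all three routines is sketched below against a stand-in `next_char_probs(prefix)` function. In the real lab that function would call your trained character language model and return a probability for each character in the vocabulary; here it is a toy distribution so the sketch runs on its own. I've also assumed `length` means the total output length including the prompt, which you should adjust to match your own convention.

```python
import math
import random

VOCAB = ['a', 'b', 'c']

def next_char_probs(prefix):
    """Stand-in for the language model: returns {char: probability}.
    This toy version just prefers repeating the last character."""
    last = prefix[-1] if prefix else 'a'
    probs = {c: 0.2 for c in VOCAB}
    probs[last] = 0.6
    return probs  # 0.6 + 0.2 + 0.2 already sums to 1

def greedy_predict(prompt, length):
    out = prompt
    while len(out) < length:
        probs = next_char_probs(out)
        out += max(probs, key=probs.get)  # highest-probability char
    return out

def random_predict(prompt, length, rng=random):
    out = prompt
    while len(out) < length:
        probs = next_char_probs(out)
        chars, weights = zip(*probs.items())
        out += rng.choices(chars, weights=weights)[0]  # sample one char
    return out

def beam_predict(prompt, length, k):
    # Each beam is (log-probability, text); keep the k best each step.
    beams = [(0.0, prompt)]
    while len(beams[0][1]) < length:
        candidates = []
        for logp, text in beams:
            for c, p in next_char_probs(text).items():
                candidates.append((logp + math.log(p), text + c))
        beams = sorted(candidates, reverse=True)[:k]
    return beams[0][1]
```

Note that beam search works in log-probabilities: multiplying many small probabilities underflows, while summing their logs does not.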
Challenges
We've looked at word tokenization, sub-word tokenization, and character tokenization. Let's take it to the extreme and do bit tokenization.
Train a language model on the linux kernel corpus using bit tokenization and generate output using beam search.
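As a starting point, here is one way bit tokenization might work (a sketch under my own assumptions: each byte of the UTF-8 encoded text becomes eight '0'/'1' tokens, which shrinks the vocabulary to just two symbols):

```python
def text_to_bits(text):
    """Tokenize: each byte becomes eight '0'/'1' character tokens."""
    return [bit for byte in text.encode('utf-8')
                for bit in format(byte, '08b')]

def bits_to_text(bits):
    """Detokenize: regroup bits into bytes, then decode."""
    byts = bytes(int(''.join(bits[i:i + 8]), 2)
                 for i in range(0, len(bits) - len(bits) % 8, 8))
    return byts.decode('utf-8', errors='replace')

bits = text_to_bits('if')
# 'i' is 0x69 -> '01101001', 'f' is 0x66 -> '01100110'
assert bits_to_text(bits) == 'if'
```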
Rewrite random_predict and beam_predict so that they can take multiple prompts. Put the multiple prompts into a batch and calculate the predictions in parallel. For beam prediction, do all of the predictions in parallel using the batching mechanism.
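The batched version can keep one growing string per prompt and advance all of them together at each step. Here is a sketch using a stand-in `batch_next_char_probs` that scores a whole batch in one call; in the lab, that batching would happen inside a single forward pass of your network rather than a Python loop.

```python
import random

def batch_next_char_probs(prefixes):
    """Stand-in for a batched model call: one distribution per prefix.
    A real model would run all prefixes through one forward pass."""
    return [{'x': 0.5, 'y': 0.5} for _ in prefixes]

def random_predict_batch(prompts, length, rng=random):
    outs = list(prompts)
    while min(len(o) for o in outs) < length:
        dists = batch_next_char_probs(outs)  # one "model call" per step
        for i, probs in enumerate(dists):
            if len(outs[i]) < length:        # this prompt still growing
                chars, weights = zip(*probs.items())
                outs[i] += rng.choices(chars, weights=weights)[0]
    return outs

results = random_predict_batch(['x', 'yy'], 5)
```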
This completes the lab.
Submit instructions
- Make sure that the output of all cells is up-to-date.
- Rename your notebook:
- Click on notebook name at the top of the window.
- Rename to "CS152Sp21Lab6 FirstName1/FirstName2" (using the correct lab number, along with your two first names).
I need this naming so I can easily navigate through the large number of shared docs I will have by the end of the semester.
- Choose File/Save
- Share your notebook with me:
- Click on the Share button at the top-right of your notebook.
- Enter rhodes@g.hmc.edu as the email address.
- Click the pencil icon and select Can comment.
- Click on Done.
- Enter the URL of your colab notebook in this submittal form.
Do not copy the URL from the address bar (which may contain an authuser parameter and which I will not be able to open).
Instead, click Share and Copy link to obtain the correct link.
Enter your names in alphabetical order.
- At this point, you and I will go back and forth until the lab is approved.
- I will provide inline comments as I evaluate the submission (Google should notify you of these comments via email).
- You will then need to address those comments. Please do not resolve or delete the comments. I will use them as a record of our conversation. You can respond to them ("Fixed" perhaps).
- Once you have addressed all the comments in this round, fill out the submittal form again.
- Once I am completely satisfied with your lab, I will add an LGTM (Looks Good to Me) comment.
- At that point, set up an office-hour appointment with me. I'll meet with you and your partner and we'll have a short discussion about the lab. Both of you should be able to answer questions about any part of the lab.