
TF by DL.AI - Course 3: NLP in TF

Anh-Thi Dinh
Deep Learning, MOOC, DeepLearning.AI, NLP, mooc-tf

Tokenizing + padding

📙 Notebook: Tokenizer basic examples.
📙 Notebook: Sarcasm detection.
  • A common, simple character encoding is ASCII, but encoding at the character level doesn't capture the meaning of a word.
  • We can encode each word as a number (a token) using Tokenizer.
  • Tokenize the words → build a word index from the whole corpus → turn your sentences into lists of values based on these tokens → manipulate these lists (e.g. pad them to the same length).
👉 tf.keras.preprocessing.text.Tokenizer
👉 tf.keras.preprocessing.sequence.pad_sequences
👉 Sarcasm detection dataset.

Word embeddings

👉 Embedding projector - visualization of high-dimensional data
👉 Large Movie Review Dataset

IMDB review dataset

📙 Notebook: Train on the IMDB review dataset.
👉 Video explaining the code.
  • Word embeddings = the idea of representing words and associated words as vectors clustered in a multi-dimensional space, so that words with similar meanings get similar representations.
  • The meaning of the words can come from the labeling of the dataset.
    • Example: "dull" and "boring" show up a lot in negative reviews → they carry similar sentiment and often appear near each other → their vectors end up similar → the NN trains on these vectors and associates them with the labels to come up with what's called an embedding.
  • The embedding dimension is the number of dimensions of the vector representing each word.
If you want to visualize the result (in 3D), use the Embedding projector.

Sarcasm dataset

📙 Notebook: Train on the Sarcasm dataset.
  • With text data, it often happens that accuracy increases over the training epochs while the validation loss also increases sharply. We can "play" with the hyperparameters to see their effect (see the plotting sketch below).
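A minimal sketch of how to look at that effect, assuming `history` is the object returned by `model.fit(..., validation_data=...)` on the sarcasm model (the `plot_graphs` helper name is just illustrative):

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
    # plot the training and validation curves for one metric
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])
    plt.show()

plot_graphs(history, "accuracy")  # training accuracy keeps climbing...
plot_graphs(history, "loss")      # ...while validation loss also climbs (overfitting)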

Pre-tokenized datasets

👉 datasets/imdb_reviews.md at master · tensorflow/datasets
👉 tfds.features.text.SubwordTextEncoder | TensorFlow Datasets
📙 Notebook: Pre-tokenizer example.
👉 Video explaining the code.
  • Someone has already done the work (tokenization) for you.
  • Try the IMDB dataset that has been pre-tokenized.
  • The tokenization is done on subwords!
  • The sequence of words can be just as important as their existence.
  • The code runs quite long (about 4 minutes per epoch with a GPU on Colab) because there are a lot of hyperparameters and sub-words.
  • Result: ~50% accuracy, and the loss decreases only very slowly.
    • Because we are using sub-words, not full words → sub-words are often nonsensical on their own → they only carry meaning when we put them together in sequences → learning from sequences would be a great way forward → RNNs (Recurrent Neural Networks).

Sequence models

  • The relative ordering, the sequence of words, matters for the meaning of the sentence.
  • For a NN to take the ordering of the words into account: RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory).
  • Why LSTM rather than a plain RNN? With an RNN, the context is preserved from timestamp to timestamp, BUT it may get lost in longer sentences → an LSTM does better because it has a cell state.
  • Example of using an LSTM: "I grew up in Ireland, I went to school and at school, they made me learn how to speak..." → "speak" needs context from the very beginning ("Ireland"), so the next words could be "learn how to speak Gaelic"!

RNN idea

👉 Notes on the Sequence Models course.
  • A usual NN, something like "f(data, labels) = rules", cannot take sequences into account.
  • An example of using sequences: the Fibonacci sequence → the result of the current step becomes an input of the next step, and so on (see the recurrence sketch below).
RNN basic idea (source).
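A minimal sketch of that recurrence idea (illustrative only, with made-up sizes; not the course's code): the state at each step is a function of the current input and the previous state, which is how context is carried along a sequence.

import numpy as np

Wx = np.random.randn(3, 4) * 0.1   # input -> state weights (hypothetical sizes)
Wh = np.random.randn(4, 4) * 0.1   # previous state -> state weights
b = np.zeros(4)

h = np.zeros(4)                    # initial state
for x in np.random.randn(5, 3):    # 5 timesteps of 3-dimensional inputs
    h = np.tanh(x @ Wx + h @ Wh + b)   # new state mixes current input + previous state
print(h)                           # the final state summarizes the whole sequence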

LSTM idea

👉 (Video) Illustrated Guide to LSTM's and GRU's: A step by step explanation & its article.
  • Sometimes the needed context is far back in the sequence and can get lost, as in the "Ireland" / "Gaelic" example above.
  • An LSTM has an additional pipeline called the Cell State. It passes through the network, influencing it and helping to keep context from earlier tokens relevant.
LSTM basic idea (image from the course).
📙 Notebook: IMDB Subwords 8K with Single Layer LSTM
📙 Notebook: IMDB Subwords 8K with Multi Layer LSTM
1-layer vs 2-layer LSTM accuracy after 50 epochs (image from the course). The 2-layer model is better (smoother curves), which makes us more confident about it. The validation accuracy is stuck around 80% because we used 8,000 sub-words taken from the training set, so many tokens from the test set may be out of vocabulary.
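Before training these LSTM models, the notebooks also batch the subwords8k dataset. A minimal sketch, assuming `model` is one of the LSTM models defined in the code at the end of this post and `train_data`/`test_data` come from the `tfds.load("imdb_reviews/subwords8k", ...)` call there (batch size and epochs are illustrative; older TF versions may also need explicit padded_shapes in padded_batch):

BUFFER_SIZE = 10000  # illustrative shuffle buffer
BATCH_SIZE = 64      # illustrative batch size

# shuffle and pad each batch so sentences of different lengths can be stacked
train_dataset = train_data.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
test_dataset = test_data.padded_batch(BATCH_SIZE)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10, validation_data=test_dataset)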

With vs without LSTM

With vs without LSTM (image from the course). With the LSTM it is clearly better, but there is still overfitting.

Using a ConvNet

👉 Video explaining the dimensions.
📙 Notebook: IMDB Subwords 8K with 1D Convolutional Layer.
Using a convolutional network (image from the course). It's clearly better, but there is still overfitting.

IMDB dataset

📙 Notebook: IMDB Reviews with GRU (and optional LSTM and Conv1D).
👉 Video comparing the results.
Trying different model choices:
  • Simple NN: 5s/epoch, 170K params, nice accuracy, overfitting.
  • LSTM: 43s/epoch, 30K params, better accuracy, overfitting.
  • GRU (Gated Recurrent Unit layer, a different type of RNN): 20s/epoch, 169K params, very good accuracy, overfitting (see the sketch below).
  • Conv1D: 6s/epoch, 171K params, good accuracy, overfitting.
Remark: With text you'll probably get a bit more overfitting than you would with images, because there are out-of-vocabulary words in the validation data.
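A minimal sketch of the GRU variant (not the notebook's exact code), assuming the `vocab_size`, `embedding_dim` and `max_length` variables defined in the IMDB preparation code at the end of this post: swap the recurrent layer for a Bidirectional GRU.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),  # GRU instead of LSTM
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])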

Sequence models and literature

One application of sequence models: read some text, then generate new look-alike text.
📙 Notebook 1 & explanatory video.
  • How do they predict a new word in the notebook? → Check this video.
  • Using more words will help.
📙 Notebook 3 (more data)
A few changes from the previous notebook:
  • Different convergences can create different poetry.
  • If we use one-hot encoding on a very big corpus → it takes a lot of RAM → use character-based prediction instead → the number of unique characters is far smaller than the number of unique words (see the sketch below) → notebook "Text generation with an RNN".
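A minimal sketch of that comparison, assuming `corpus` is the list of lines loaded from the lyrics file in the code at the end of this post (`char_level` is a real Tokenizer option): a character-level vocabulary stays tiny compared with a word-level one.

from tensorflow.keras.preprocessing.text import Tokenizer

word_tok = Tokenizer()
word_tok.fit_on_texts(corpus)

char_tok = Tokenizer(char_level=True)  # treat every character as a token
char_tok.fit_on_texts(corpus)

print(len(word_tok.word_index))  # thousands of unique words
print(len(char_tok.word_index))  # only a few dozen unique characters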
📙 Notebook Using LSTMs, see if you can write Shakespeare!
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog so much!'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
            # num_words: max number of words to keep, i.e.
            #   only the most common 100 words are tokenized.
            # More words: more accuracy, but more time to train.
            # oov_token: replace unseen words by "<OOV>"
tokenizer.fit_on_texts(sentences) # build the word index from the texts
# indexing words
word_index = tokenizer.word_index
print(word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'so': 8, 'much': 9}
# "!", ",", capitalization, ... are removed
# encode sentences
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# [[4, 2, 3, 5],
#  [4, 2, 3, 6],
#  [7, 2, 3, 5, 8, 9]]
# if a word is not in the word index (and no oov_token is set),
# it will be lost in texts_to_sequences()
# make encoded sentences the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences, value=-1,
                       maxlen=5, padding="post", truncating="post")
         # maxlen: max length of an encoded sentence
         # value: value to be filled in (default 0)
         # padding: add missing values at the beginning or the end of a sentence?
         # truncating: if longer than maxlen, cut at the beginning or the end?
print(padded)
# [[ 4  2  3  5 -1]
#  [ 4  2  3  6 -1]
#  [ 7  2  3  5  8]]
# read json text
import json
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
import tensorflow as tf
print(tf.__version__) # check the version of tensorflow

# If you are using tf1, you need the line below
tf.enable_eager_execution()
# IMDB reviews dataset
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
for s, l in train_data: # "s" for sentences, "l" for labels
    # The values of "s" and "l" are tensors,
    # so we need to extract their values
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())
# Prepare for the NN
vocab_size = 10000
embedding_dim = 16 # embedding to dim 16
max_length = 120 # of each sentence
trunc_type='post' # cut the last words
oov_tok = "<OOV>" # replace not-encoded words by this

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
    # encoding the words
word_index = tokenizer.word_index
    # list of word index (built based on training set)
    # there may be many oov_tok in test set
sequences = tokenizer.texts_to_sequences(training_sentences)
    # apply on sentences
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)
    # padding the sentences

# apply to the test set
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)
# Simple NN
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
                              # The result of embedding will be a 2D array:
                              # length of sentence x embedding_dim
    tf.keras.layers.Flatten(),
    # Alternatively (a little diff on speed and accuracy):
    # tf.keras.layers.GlobalAveragePooling1D()
    #   average across the vectors to flatten it out
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
# Training
model.fit(padded, training_labels_final, epochs=10, validation_data=(testing_padded, testing_labels_final))
# the result
e = model.layers[0] # get the embedding layer
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
import io

# reverse_word_index maps an index back to its word
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

# download the files when running on Google Colab
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')
# Run this to ensure TensorFlow 2.x is used
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
# load the imdb dataset from tensorflow
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)

# extract train/test sets
train_data, test_data = imdb['train'], imdb['test']

# take the tokenizer
tokenizer = info.features['text'].encoder

print(tokenizer.subwords)
# ['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_',...
sample_string = 'TensorFlow, from basics to mastery'

tokenized_string = tokenizer.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
# Tokenized string is [6307, 2327, 4043, 2120, 2, 48, 4249, 4429, 7, 2652, 8050]

original_string = tokenizer.decode(tokenized_string)
print ('The original string: {}'.format(original_string))
# The original string: TensorFlow, from basics to mastery
# take a look at the tokenized string
# case sensitive + punctuation maintained
for ts in tokenized_string:
  print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))

# 6307 ----> Ten
# 2327 ----> sor
# 4043 ----> Fl
# ...
# SINGLE LAYER LSTM
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
      # 64: number of outputs desired (but the actual output shape
      #   may differ, e.g. Bidirectional doubles it)
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# MULTI LAYER LSTM
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
      # return_sequences=True: required if we want to feed this LSTM into another one.
      # It ensures that the outputs of this LSTM match the desired inputs of the next one.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# WITHOUT LSTM (like the previous section)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=max_length),
    #
    tf.keras.layers.Flatten(),
    # or, instead of Flatten:
    # tf.keras.layers.GlobalMaxPooling1D(),
    #
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# WITH LSTM
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=max_length),
    #
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    #
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    #
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    #
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
input_sequences = []
for line in corpus:
	# convert each sentence to a list of numbers
	token_list = tokenizer.texts_to_sequences([line])[0]
	# convert each list to n-gram sequences
	# eg. from [1,2,3,4,5]
	# 		to [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)

# pad sequences to the maximum length of all sentences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and labels
# [0,0,1,2] -> 2 is the label
# [0,1,2,3] -> 3 is the label
# [1,2,3,4] -> 4 is the label
xs, labels = input_sequences[:,:-1], input_sequences[:,-1]

# one-hot encode the labels (classification problem)
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20))) # take only 20 units (bi-directional) to train
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(xs, ys, epochs=500, verbose=1)
seed_text = "Laurence went to dublin"
next_words = 100

for _ in range(next_words):
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	# "went to dublin" -> [134, 13, 59]
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	#  [0, 0, 0, 0, 0, 0, 0, 134, 13, 59]
	predicted = model.predict_classes(token_list, verbose=0)
	output_word = ""
	# revert an index back to the word
	for word, index in tokenizer.word_index.items():
		if index == predicted:
			output_word = word
			break
	# add the predicted word to the seed text and make another prediction
	seed_text += " " + output_word
print(seed_text)
# all the words are predicted based on probability
# each next one will be less certain than the previous
# -> less meaningful
# read from a file
tokenizer = Tokenizer()
data = open('/tmp/irish-lyrics-eof.txt').read()
corpus = data.lower().split("\n")
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(lr=0.01) # customized optimizer
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)