- A common simple character encoding is ASCII.
- We can encode each word as a number (token) → Tokenizer.
- Tokenize words → build a word index over the whole corpus → turn your sentences into lists of values based on these tokens → manipulate these lists (pad them to the same length, for example).
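A minimal sketch of that pipeline with Keras' `Tokenizer` and `pad_sequences` (the example sentences and the `num_words`/`maxlen` values are made up for illustration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    "I love my dog",
    "I love my cat",
    "Do you think my dog is amazing?",
]

# Build the word index over the corpus; <OOV> stands in for unseen words later.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# Turn sentences into lists of token values, then pad them to the same length.
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=10, padding="post")

print(tokenizer.word_index)
print(padded)
```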
- Word embeddings = the idea that words and associated words are clustered as vectors in a multi-dimensional space, so that words with similar meaning get a similar representation.
- The meaning of the words can come from the labeling of the dataset.
- Example: "dull" and "boring" show up a lot in negative reviews → they have similar sentiments → they appear close to each other in sentences → thus their vectors will be similar → the NN trains and learns these vectors, associating them with the labels, to come up with what's called an embedding.
- The embedding dimension is the number of dimensions of the vector representing each word encoding.
You can visualize the result (in 3D) with the Embedding Projector.
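A hedged sketch of a classifier whose `Embedding` layer learns those word vectors; `vocab_size`, `embedding_dim` and the layer sizes are assumed values, and the learned vectors can then be exported for the Embedding Projector:

```python
import tensorflow as tf

vocab_size, embedding_dim = 10000, 16  # assumed values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),  # one vector per token
    tf.keras.layers.GlobalAveragePooling1D(),               # average the word vectors in a sentence
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),         # e.g. positive / negative label
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# After training, the learned vectors are the Embedding layer's weights;
# writing them to vecs.tsv / meta.tsv lets you load them into the Embedding Projector.
embedding_weights = model.layers[0].get_weights()[0]        # shape: (vocab_size, embedding_dim)
```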
👉 Notebook: Train on the Sarcasm dataset.
- With text data, it often happens that the accuracy increases over the training epochs but the loss also increases sharply. We can play with the hyperparameters to see the effect.
👉 datasets/imdb_reviews.md at master · tensorflow/datasets
👉 tfds.features.text.SubwordTextEncoder | TensorFlow Datasets
👉 Notebook: Pre-tokenizer example.
👉 Video explaining the code.
- Someone has already done the work (tokenization) for you.
- Try it on the IMDB dataset that has been pre-tokenized.
- The tokenization is done on subwords!
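A sketch of loading that pre-tokenized dataset via `tfds` (assuming the `imdb_reviews/subwords8k` config from the links above is still available; it is deprecated in newer TFDS versions):

```python
import tensorflow_datasets as tfds

(train_data, test_data), info = tfds.load(
    "imdb_reviews/subwords8k",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

encoder = info.features["text"].encoder    # a SubwordTextEncoder
print(encoder.vocab_size)                  # ~8k subword tokens

ids = encoder.encode("TensorFlow is fun")  # text -> subword ids
print([encoder.decode([i]) for i in ids])  # the individual (often nonsensical) subwords
print(encoder.decode(ids))                 # back to the original text
```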
- The sequence of words can be just as important as their existence.
- The code runs for quite a long time (about 4 minutes per epoch even with a GPU on Colab) because there are a lot of hyperparameters and sub-words.
- Result: ~50% accuracy, and the loss decreases but only very slightly.
- Because we are using sub-words, not full words → the sub-words are nonsensical on their own → they only make sense when we put them together in sequences → learning from sequences would be a great way forward → RNNs (Recurrent Neural Networks).
- The relative ordering, the sequence of words, matters for the meaning of the sentence.
- For a NN to take the ordering of the words into account: RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory networks).
- Why an LSTM rather than a plain RNN? With an RNN, the context is preserved from timestep to timestep, BUT it may get lost in longer sentences → an LSTM does better because it has a cell state.
- Example of using an LSTM: "I grew up in Ireland, I went to school and at school, they made me learn how to speak..." → "speak" gives the context, and we have to go back to the beginning to catch "Ireland"; then the continuation could be "learn how to speak Gaelic"!
- A usual NN, something like "f(data, labels) = rules", cannot take sequences into account.
- An example of using sequences: the Fibonacci sequence → the result of the current step is an input to the next step, and so on.
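A tiny sketch of that recurrence, where the state carried between steps plays the role an RNN's hidden state plays between timesteps:

```python
# The Fibonacci recurrence: each output is fed back in as input to the next step,
# much like an RNN passes its state from one timestep to the next.
def fibonacci(n):
    a, b = 0, 1            # the "state" carried between steps
    for _ in range(n):
        a, b = b, a + b    # the current output becomes part of the next input
    return a

print([fibonacci(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```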
- Sometimes the context that matters appears much earlier in the sequence and can get lost, like in the "Ireland" and "Gaelic" example above.
- An LSTM has an additional pipeline called the Cell State. It can pass through the network to impact it and helps keep context from earlier tokens relevant.
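In the spirit of the linked notebooks, a minimal single-layer bidirectional LSTM classifier (reusing `encoder` from the `tfds` sketch above; the layer sizes are assumed). For a multi-layer version, every LSTM except the last needs `return_sequences=True` so it passes the full sequence on to the next LSTM.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # cell state carries context across timesteps
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```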
👉 Notebook: IMDB Subwords 8K with Single Layer LSTM
👉 Notebook: IMDB Subwords 8K with Multi Layer LSTM
👉 Notebook: IMDB Reviews with GRU (and optional LSTM and Conv1D).
👉 Video compares the results.
Try with 4 different choices (a sketch follows this comparison):
- Simple NN: 5s/epoch, 170K params, nice acc, overfitting.
- LSTM: 43s/epoch, 30K params, better acc, overfitting.
- GRU (Gated Recurrent Unit layer, a different type of RNN): 20s/epoch, 169K params, very good acc, overfitting.
- Conv1D: 6s/epoch, 171K params, good acc, overfitting.
Remark: With texts, you'll probably get a bit more overfitting than you would with images, because there are out-of-vocabulary words in the validation data.
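The compared models only differ in their middle layer(s); a hedged sketch with assumed sizes (not the exact notebook hyperparameters):

```python
import tensorflow as tf

vocab_size, embedding_dim = 10000, 16  # assumed values


def build_model(middle_layers):
    """Same Embedding front and Dense head; only the middle layer(s) differ."""
    return tf.keras.Sequential(
        [tf.keras.layers.Embedding(vocab_size, embedding_dim)]
        + middle_layers
        + [tf.keras.layers.Dense(24, activation="relu"),
           tf.keras.layers.Dense(1, activation="sigmoid")]
    )


simple_nn = build_model([tf.keras.layers.GlobalAveragePooling1D()])
lstm      = build_model([tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))])
gru       = build_model([tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32))])
conv1d    = build_model([tf.keras.layers.Conv1D(128, 5, activation="relu"),
                         tf.keras.layers.GlobalAveragePooling1D()])
```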
One application of sequence models: read a text, then generate another, look-alike text.
👉 Notebook 1 & explanatory video.
- How do they predict a new word in the notebook? → Check this video.
- Using more words (a bigger corpus) will help.
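A hedged sketch of the prediction loop, assuming a trained `model`, a fitted `tokenizer` and a `max_sequence_len` as in the linked notebook (the seed text is made up):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

seed_text = "I went to dublin"   # made-up seed
for _ in range(10):              # generate 10 more words
    # Turn the current text into padded token ids, same shape as the training data.
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding="pre")
    # Pick the most likely next token and map it back to its word.
    predicted_id = int(np.argmax(model.predict(token_list), axis=-1)[0])
    next_word = tokenizer.index_word.get(predicted_id, "")
    seed_text += " " + next_word

print(seed_text)
```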
A few changes from the previous notebook:
- Different convergences can create different poetry.
- If we use one-hot encoding for a very big corpus → it takes a lot of RAM → use character-based prediction instead → the number of unique characters is far smaller than the number of unique words → notebook "Text generation with RNN".
👉 Notebook: Using LSTMs, see if you can write Shakespeare!