Last modified on 21 Sep 2020.

These are my notes for the 3rd course of the TensorFlow in Practice Specialization, taught by Laurence Moroney on Coursera.

👉 Check the code on my GitHub.
👉 Official notebooks on GitHub.

👉 Go to course 1 - Intro to TensorFlow for AI, ML, DL.
👉 Go to course 2 - CNN in TensorFlow.
👉 Go to course 4 - Sequences, Time Series and Prediction.

Tokenizing + padding

👉 Notebook: Tokenizer basic examples.
👉 Notebook: Sarcasm detection.

  • A common simple character encoding is ASCII, but it encodes single characters, not words.
  • We can encode each word as a number (token) – Tokenizer.
  • Tokenize words > build all the words to make a corpus > turn your sentences into lists of values based on these tokens > manipulate these lists (make them the same length, for example).
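The pipeline in the last bullet can be sketched in plain Python to see what happens under the hood. This is only a rough imitation, not the Keras implementation: the helper names `build_word_index` and `encode` are made up for this note, and the regex is a simplified stand-in for the real punctuation filter.

```python
import re
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Map each word to an integer, most frequent words first."""
    counts = Counter()
    for sentence in sentences:
        # lowercase, then keep only runs of letters/apostrophes
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    # index 0 stays free for padding, index 1 goes to the OOV token
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def encode(sentence, word_index, oov_token="<OOV>"):
    """Turn one sentence into a list of token ids."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [word_index.get(w, word_index[oov_token]) for w in words]

sentences = ['i love my dog', 'I, love my cat', 'You love my dog so much!']
word_index = build_word_index(sentences)
sequences = [encode(s, word_index) for s in sentences]
print(word_index)
print(sequences)  # [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5, 8, 9]]
```

The Keras Tokenizer used below does the same job with more options.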
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog so much!'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
            # num_words: keep only the most frequent words
            # (more words, more accuracy, more time to train)
            # oov_token: replace unseen words by "<OOV>"
tokenizer.fit_on_texts(sentences) # build the word index from the texts
# indexing words
word_index = tokenizer.word_index
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'so': 8, 'much': 9}
# punctuation ("!", ",", ...) is removed and words are lowercased

👉 tf.keras.preprocessing.text.Tokenizer

# encode sentences
sequences = tokenizer.texts_to_sequences(sentences)
# [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5, 8, 9]]
# without oov_token, a word that is not in the word index is dropped by texts_to_sequences();
# with oov_token (as here), it is replaced by the index of "<OOV>" (1)
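To make the handling of unseen words concrete, here is a small plain-Python sketch that reuses the word index printed above. The function `to_sequence` and the test sentence are made up for illustration; it only mimics how unseen words are either dropped or mapped to the OOV index.

```python
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5,
              'cat': 6, 'you': 7, 'so': 8, 'much': 9}

def to_sequence(sentence, word_index, oov_token=None):
    """Encode a lowercase, whitespace-separated sentence."""
    ids = []
    for word in sentence.split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_token is not None:
            ids.append(word_index[oov_token])  # unseen word -> OOV id
        # otherwise the unseen word is silently dropped
    return ids

test_sentence = 'i really love my dog'  # "really" was never seen
with_oov = to_sequence(test_sentence, word_index, oov_token='<OOV>')
without_oov = to_sequence(test_sentence, word_index)
print(with_oov)     # [4, 1, 2, 3, 5]
print(without_oov)  # [4, 2, 3, 5]
```

With the OOV token the encoded sentence keeps its original length, which matters once sentences are padded and fed to a network.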

👉 tf.keras.preprocessing.sequence.pad_sequences

# make encoded sentences the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences, value=-1,
                       maxlen=5, padding="post", truncating="post")
         # maxlen: max len of encoded sentence
         # value: value to be filled (default 0)
         # padding: add missing values at beginning ("pre", default) or ending ("post") of sentence
         # truncating: longer than maxlen? cut at beginning ("pre", default) or ending ("post")
# [[ 4  2  3  5 -1]
#  [ 4  2  3  6 -1]
#  [ 7  2  3  5  8]]
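The effect of `padding` and `truncating` is easier to see in a few lines of plain Python. This is a minimal sketch with a made-up helper `pad` that returns nested lists (not a NumPy array) and assumes `maxlen` is at least 1; the defaults mirror the Keras ones ("pre", "pre", 0).

```python
def pad(sequences, maxlen, value=0, padding="pre", truncating="pre"):
    """Pad or cut every sequence to exactly `maxlen` items."""
    result = []
    for seq in sequences:
        if len(seq) > maxlen:
            # "pre" drops tokens from the front, "post" from the back
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the filler before the tokens, "post" after them
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result

sequences = [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5, 8, 9]]
post = pad(sequences, maxlen=5, value=-1, padding="post", truncating="post")
pre = pad(sequences, maxlen=5)  # default behaviour
print(post)  # [[4, 2, 3, 5, -1], [4, 2, 3, 6, -1], [7, 2, 3, 5, 8]]
print(pre)   # [[0, 4, 2, 3, 5], [0, 4, 2, 3, 6], [2, 3, 5, 8, 9]]
```

Note how the default setting cuts the first word of the long sentence, while "post" cuts the last one.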

👉 Sarcasm detection dataset.

# read json text
import json
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
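for item in datastore:
    # field names as used in the course's sarcasm dataset
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])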

Word embeddings

👉 Embedding projector - visualization of high-dimensional data
👉 Large Movie Review Dataset

IMDB review dataset

👉 Notebook: Train IMDB review dataset.
👉 Video explaining the code.

  • Word embeddings = the idea that words and associated words are clustered as vectors in a multi-dimensional space. This allows words with similar meanings to have similar representations.
  • The meaning of the words can come from the labeling of the dataset.
    • Ex: “dull” and “boring” show up a lot in negative reviews => they have similar sentiments => they are close to each other in the sentence => thus their vectors will be similar => the NN trains and learns these vectors, associating them with the labels, to come up with what’s called an embedding.
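"Similar vectors" can be measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real embeddings are learned and have more dimensions, 16 in this note), so the numbers only illustrate the idea.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up embeddings, for illustration only
embeddings = {
    "dull":   [-0.9, 0.1, 0.2],
    "boring": [-0.8, 0.2, 0.1],
    "fun":    [0.9, 0.1, -0.1],
}

sim_close = cosine_similarity(embeddings["dull"], embeddings["boring"])
sim_far = cosine_similarity(embeddings["dull"], embeddings["fun"])
print(round(sim_close, 2), round(sim_far, 2))
```

Words with the same sentiment point in almost the same direction (similarity near 1), while words with opposite sentiment point away from each other (negative similarity).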
import tensorflow as tf
print(tf.__version__) # check version of tensorflow

# If you are using tf1, you need the line below
# tf.enable_eager_execution()

# IMDB reviews dataset
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

train_data, test_data = imdb['train'], imdb['test']

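training_sentences, training_labels = [], []
testing_sentences, testing_labels = [], []
for s,l in train_data: # "s" for sentences "l" for labels
    # The values for "s" and "l" are tensors
    # so we need to extract their values
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())
for s,l in test_data:
    testing_sentences.append(s.numpy().decode('utf8'))
    testing_labels.append(l.numpy())

import numpy as np
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
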
# Prepare for the NN
vocab_size = 10000
embedding_dim = 16 # embedding to dim 16
max_length = 120 # of each sentence
trunc_type='post' # cut the last words
oov_tok = "<OOV>" # replace not-encoded words by this

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
    # encoding the words
word_index = tokenizer.word_index
    # list of word index (built based on training set)
    # there may be many oov_tok in test set
sequences = tokenizer.texts_to_sequences(training_sentences)
    # apply on sentences
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)
    # padding the sentences

# apply to the test set
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)
# Simple NN
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
                              # The result of embedding will be a 2D array:
                              # length of sentence x embedding_dim
    tf.keras.layers.Flatten(),
    # Alternatively (a little diff on speed and accuracy):
    # tf.keras.layers.GlobalAveragePooling1D()
    #   average across the vectors to flatten it out
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Training
model.fit(padded, training_labels_final, epochs=10, validation_data=(testing_padded, testing_labels_final))
# the result
e = model.layers[0] # get the embedding layer
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
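The shapes in the comments above can be checked without TensorFlow. The sketch below treats the embedding as a plain row lookup in a (vocab_size, embedding_dim) table, using random numbers as a stand-in for trained weights and a made-up padded sentence. It assumes either a Flatten or a GlobalAveragePooling1D layer sits between the embedding and the first Dense layer, and counts the parameters for both cases.

```python
import random

vocab_size, embedding_dim, max_length = 10000, 16, 120

# stand-in for trained weights: one row of embedding_dim floats per token id
random.seed(0)
weights = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]

# a made-up padded sentence of max_length token ids
padded_sentence = [random.randrange(vocab_size) for _ in range(max_length)]

# embedding = row lookup: (max_length,) -> (max_length, embedding_dim)
embedded = [weights[token_id] for token_id in padded_sentence]

# Flatten: concatenate all rows -> max_length * embedding_dim values
flattened = [x for row in embedded for x in row]

# GlobalAveragePooling1D: average over the sentence axis -> embedding_dim values
averaged = [sum(col) / max_length for col in zip(*embedded)]

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

embedding_params = vocab_size * embedding_dim
params_flatten = embedding_params + dense_params(len(flattened), 6) + dense_params(6, 1)
params_pooling = embedding_params + dense_params(len(averaged), 6) + dense_params(6, 1)
print(len(flattened), len(averaged))   # 1920 16
print(params_flatten, params_pooling)  # 171533 160109
```

Almost all parameters live in the embedding table (160,000); pooling shrinks the first Dense layer a lot, which is where the small difference in speed comes from.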

To visualize the result (in 3D) with the Embedding Projector, export the vectors and their words to two TSV files:

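import io

# map token index back to word
reverse_word_index = {index: word for word, index in word_index.items()}

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

# download the files when running on Google Colab
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')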
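The two files have a simple layout: one tab-separated vector per line in the vectors file, and the matching word on the same line number in the metadata file. A tiny sketch with made-up words and vectors, writing to in-memory buffers instead of real files:

```python
import io

# made-up vocabulary and 3-dimensional vectors, for illustration only
words = ["dull", "boring", "fun"]
vectors = [[-0.9, 0.1, 0.2], [-0.8, 0.2, 0.1], [0.9, 0.1, -0.1]]

out_v = io.StringIO()  # stands in for vecs.tsv
out_m = io.StringIO()  # stands in for meta.tsv
for word, vector in zip(words, vectors):
    out_m.write(word + "\n")
    out_v.write("\t".join(str(x) for x in vector) + "\n")

vec_lines = out_v.getvalue().splitlines()
meta_lines = out_m.getvalue().splitlines()
for label, line in zip(meta_lines, vec_lines):
    print(label, line.split("\t"))
```

Line i of the metadata labels line i of the vectors, so both files must be written in the same order.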

Sarcasm dataset
