This is my note for the 3rd course, Natural Language Processing in TensorFlow, of the TensorFlow in Practice Specialization given by deeplearning.ai and taught by Laurence Moroney on Coursera.

π Check the codes on my Github.
π Official notebooks on Github.

π Go to course 1 - Intro to TensorFlow for AI, ML, DL.
π Go to course 2 - CNN in TensorFlow.
π Go to course 4 - Sequences, Time Series and Prediction.

π Notebook: Tokenizer basic examples.
π Notebook: Sarcasm detection.

• A common simple character encoding is ASCII, but it encodes characters, not the meaning of words.
• We can encode each word as a number (token) → `Tokenizer`.
• Tokenize words → build a corpus from all the words → turn your sentences into lists of values based on these tokens → manipulate these lists (e.g. pad them to the same length).
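A quick sketch of why character encodings alone don't capture meaning: with ASCII, anagrams such as "LISTEN" and "SILENT" contain exactly the same codes in a different order, even though the words mean different things.

```python
# ASCII encodes characters, not meaning:
# anagrams share the same codes in a different order
print([ord(c) for c in "LISTEN"])  # [76, 73, 83, 84, 69, 78]
print([ord(c) for c in "SILENT"])  # [83, 73, 76, 69, 78, 84]
print(sorted(ord(c) for c in "LISTEN") == sorted(ord(c) for c in "SILENT"))  # True
```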
```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog so much!'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
# num_words: keep only the most frequent words
#   (more words, more accuracy, more time to train)
# oov_token: replace unseen words by "<OOV>"
tokenizer.fit_on_texts(sentences)  # build the word index from the texts
```
```python
# indexing words
word_index = tokenizer.word_index
print(word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'so': 8, 'much': 9}
# punctuation ("!", ",") is stripped and words are lowercased
```
```python
# encode sentences
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5, 8, 9]]
# words not in the word index are mapped to the oov_token;
# without an oov_token, they are simply dropped by texts_to_sequences()
```
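To see the `oov_token` in action, encode a sentence containing words the tokenizer never saw during fitting (a self-contained sketch, reusing the same toy sentences as above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'I, love my cat', 'You love my dog so much!']
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# 'really' and 'manatee' were never seen -> mapped to '<OOV>' (index 1)
print(tokenizer.texts_to_sequences(['i really love my manatee']))
# [[4, 1, 2, 3, 1]]
```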
```python
# make encoded sentences the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences, maxlen=5, value=-1,
                       padding='post', truncating='post')
# maxlen: max length of each encoded sentence
# value: padding value (default 0)
# padding: 'pre' (default) or 'post' -> pad at the beginning or the end
# truncating: longer than maxlen? cut at the beginning ('pre') or the end ('post')
print(padded)
# [[ 4  2  3  5 -1]
#  [ 4  2  3  6 -1]
#  [ 7  2  3  5  8]]
```
```python
# read json text
import json

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
```

## Word embeddings

### IMDB review dataset

π Notebook: Train IMDB review dataset.
π Video explain the code.

• Word embeddings = the idea in which words and associated words are clustered as vectors in a multi-dimensional space. That allows words with similar meaning to have a similar representation.
• The meaning of the words can come from the labeling of the dataset.
• Ex: "dull" and "boring" show up a lot in negative reviews => they carry similar sentiment and often appear close to each other in sentences => thus their vectors will be similar => as the NN trains, it learns these vectors and associates them with the labels to come up with what's called an embedding.
```python
import tensorflow as tf
print(tf.__version__)  # check the version of tensorflow

# If you are using tf1, you need the line below
tf.enable_eager_execution()
```
```python
# IMDB reviews dataset
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
for s, l in train_data:  # "s" for sentences, "l" for labels
    # the values of "s" and "l" are tensors,
    # so we need to extract their values
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())
```
```python
# Prepare for the NN
vocab_size = 10000
embedding_dim = 16   # embed each word into a 16-dim vector
max_length = 120     # max length of each encoded sentence
trunc_type = 'post'  # cut the last words of sentences longer than max_length
oov_tok = "<OOV>"    # replace not-encoded words by this token

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
# word index built on the training set only,
# so there may be many oov_tok in the test set
word_index = tokenizer.word_index

# encode the sentences and pad/truncate them to max_length
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

# apply the same tokenizer to the test set
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
```
```python
# Simple NN
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # the result of the embedding is a 2D array:
    # length of sentence x embedding_dim
    tf.keras.layers.Flatten(),
    # Alternatively (a little difference in speed and accuracy):
    # tf.keras.layers.GlobalAveragePooling1D()
    #   averages across the vectors to flatten them out
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
```
```python
# Training (assumes padded, testing_padded and the label lists from above)
import numpy as np
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded, np.array(training_labels), epochs=10,
          validation_data=(testing_padded, np.array(testing_labels)))
```
```python
# the result
e = model.layers[0]  # get the embedding layer
weights = e.get_weights()[0]
print(weights.shape)  # shape: (vocab_size, embedding_dim)
```

If you want to visualize the result (in 3D) with the Embedding Projector (projector.tensorflow.org), write the vectors and their words to TSV files,

```python
import io

# invert the word index: token -> word
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

# download the files when running on Google Colab
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')
```