X and/or Y can be sequences, of different or equal lengths across examples,...
Notations
t: index of the position of a word in the sequence (x^<t>, y^<t>)
T_x, T_y: lengths of the input / output sequences
(i): index of the training example, e.g. x^(i)<t>
Representing words → based on a Vocabulary (built from the words occurring in the sequences, or some already-built online vocabs) → one common vector space for all words
Each word is represented by a one-hot vector based on the vocabulary vector
If some words are not in the vocab, we use "<UNK>" (Unknown)
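A minimal sketch (with a toy vocabulary of my own, not from the course) of mapping a word to its one-hot vector, falling back to "<UNK>" for out-of-vocabulary words:

```python
# Minimal sketch of a one-hot representation with an "<UNK>" token
# for out-of-vocabulary words (toy vocabulary, not course code).
import numpy as np

vocab = ["a", "and", "harry", "potter", "<UNK>"]        # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return the one-hot vector of `word`; unknown words map to <UNK>."""
    vec = np.zeros(len(word_to_index))
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

print(one_hot("Harry", word_to_index))   # [0. 0. 1. 0. 0.]
print(one_hot("durian", word_to_index))  # falls back to <UNK>: [0. 0. 0. 0. 1.]
```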
RNN Model
Why not use a standard network?
Input and output can have different lengths in different examples (T_x, T_y vary). Even if you pad everything to the max length of all texts, it's not a good representation!
It doesn't share features learned across different positions of the text (ex: the word "Harry" appearing in one position or another gives the same info about a person's name)
Like in CNNs, something learned from 1 part of the image can be generalized quickly to other parts of the image.
Reduce #params in model ← we don't want very large input layer (with one-hot vector)
RNN (Unidirectional)
At time step 2, it uses not only the input x^<2> but also the info from time step 1 (the activation a^<1>)
- The right version is the "rolled" one; it means the same thing as the unrolled left one (it appears in some textbooks but is unclear/difficult to implement; Andrew doesn't use it in the course)
- This is a "Unidirectional RNN", which means we can only use the info of the previous words!!! → not very strong, because (ex:)
  - He said, "Teddy Roosevelt was a great President" → Teddy is the name of a person
  - He said, "Teddy bears are on sale!" → Teddy is not the name of a person!
- We use the notations W_ax, W_aa, W_ya (and biases b_a, b_y) to indicate the params
Forward propagation
Use a^<t-1> and x^<t> to compute a^<t> and ŷ^<t>:
a^<t> = g(W_aa a^<t-1> + W_ax x^<t> + b_a) (g is often tanh)
ŷ^<t> = g(W_ya a^<t> + b_y) (g is a sigmoid/softmax depending on the output)
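A minimal numpy sketch of one forward step of a basic RNN cell following these formulas (function name and parameter shapes are my own assumptions, not course code):

```python
# Minimal sketch of one forward step of a basic RNN cell.
import numpy as np

def rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """a<t> = tanh(W_aa a<t-1> + W_ax x<t> + b_a); y_hat<t> = softmax(W_ya a<t> + b_y)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    z = W_ya @ a_t + b_y
    y_hat_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax
    return a_t, y_hat_t
```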
Backpropagation through time (red arrows in the fig below)
Going backward in time
Different types of RNNs
Language model and sequence generation
Speech recognition system → probability(sequence of words)
Output a 10K-way softmax (10K is the number of words in the dictionary, i.e. the vocabulary) → prob of each word; whichever is highest → that one is the word the user said!
Sampling novel sequences
After training, we have the learned activations a^<t>; we then use them to sample a "novel" sequence. → word-level RNN (based on the vocabulary)
An important aspect to be explored, once a Language Model has been trained, is how well it can generate new or novel sequences.
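A minimal sketch of word-level sampling from a trained RNN language model; `step_fn` stands in for one forward step of the trained network (e.g. the `rnn_cell_forward` above closed over learned weights), and `eos_index` / `max_len` are assumptions:

```python
# Minimal sketch of sampling a novel sequence from a trained RNN language model.
import numpy as np

def sample_sequence(step_fn, a0, vocab_size, eos_index, max_len=50):
    a_prev, x_t = a0, np.zeros(vocab_size)           # first input is the zero vector
    indices = []
    for _ in range(max_len):
        a_prev, y_hat = step_fn(x_t, a_prev)          # y_hat = distribution over the vocab
        idx = np.random.choice(vocab_size, p=y_hat)   # sample, don't take the argmax
        indices.append(idx)
        if idx == eos_index:                          # stop at <EOS>
            break
        x_t = np.zeros(vocab_size)                    # feed the sampled word back in
        x_t[idx] = 1.0
    return indices
```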
Character-level language model → not used too much today
Pros: Don't worry about unknown words (words that don't appear in your vocabulary)
Cons: much longer sequences!! → more computationally expensive
Vanishing gradients with RNNs
Language can have very long-term dependencies, where a word from much earlier can affect what needs to come much later in the sentence. Ex:
The cat, which already ate...., was full.
The cats, which ...................., were full.
→ The basic RNN is not good at capturing very long-term dependencies ← because of vanishing gradients
→ the basic RNN model's outputs are mainly affected by local (nearby) influences
There is also the problem of "exploding gradients" (gradients grow with the depth of the NN) → you see many NaN values in the output! ← solution: gradient clipping (rescale the gradient vectors when they exceed some threshold). The vanishing gradient problem is much harder to solve!
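A minimal sketch of gradient clipping by value (the fix mentioned above); the threshold 5.0 is an arbitrary assumption:

```python
# Minimal sketch of gradient clipping by value to combat exploding gradients.
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip every gradient array element-wise into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}
```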
Gated Recurrent Unit (GRU) → solution for the "vanishing gradient" problem → captures much longer-range dependencies
Compared with an RNN unit
Notations: c^<t> = memory cell; in the GRU, c^<t> = a^<t> (the activation), but in the LSTM they're different!
Intuition: the update gate Γ_u is (almost always) either 0 or 1 (by using a sigmoid). "u" stands for "update".
With Γ_u ≈ 0 → c^<t> ≈ c^<t-1> → the memory cell value is maintained through a very long sequence → helps to solve the vanishing gradient problem! (green color in the fig above)
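A minimal numpy sketch of one full GRU step in the course's notation (Γ_u = update gate, Γ_r = relevance gate); parameter names and shapes are my own assumptions:

```python
# Minimal sketch of one GRU step (c<t> plays the role of a<t> in a GRU).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, W_u, W_r, W_c, b_u, b_r, b_c):
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(W_u @ concat + b_u)                 # update gate
    gamma_r = sigmoid(W_r @ concat + b_r)                 # relevance gate
    c_tilde = np.tanh(W_c @ np.concatenate([gamma_r * c_prev, x_t]) + b_c)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev    # gate near 0 -> keep the old memory
    return c_t                                            # in a GRU, a<t> = c<t>
```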
Long Short Term Memory (LSTM)
It's even more powerful (more general) than the GRU. Historically, though, the LSTM came first.
LSTM is default first thing to try.
The paper is really difficult to read
We no longer have a^<t> = c^<t> as in the GRU
Gates: update (Γ_u), forget (Γ_f) and output (Γ_o) → the LSTM has 3 gates instead of 2 in the GRU.
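For reference, a sketch of the LSTM equations in the course's notation (reconstructed from memory of the lecture, so double-check against the slides):

```latex
\begin{aligned}
\tilde{c}^{\langle t\rangle} &= \tanh\big(W_c[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_c\big)\\
\Gamma_u &= \sigma\big(W_u[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_u\big) && \text{(update gate)}\\
\Gamma_f &= \sigma\big(W_f[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_f\big) && \text{(forget gate)}\\
\Gamma_o &= \sigma\big(W_o[a^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_o\big) && \text{(output gate)}\\
c^{\langle t\rangle} &= \Gamma_u \odot \tilde{c}^{\langle t\rangle} + \Gamma_f \odot c^{\langle t-1\rangle}\\
a^{\langle t\rangle} &= \Gamma_o \odot \tanh\big(c^{\langle t\rangle}\big)
\end{aligned}
```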
Unidirectional
Bidirectional RNN (BRNN)
At any point in time, it takes info from both earlier and later in the sequence.
Come back to the example of Teddy:
- He said, "Teddy Roosevelt was a great President" → Teddy is the name of a person
- He said, "Teddy bears are on sale!" → Teddy is not the name of a person!
Acyclic graph: forward prop contains 2 directions (violet and green in the fig below)
Ex: To get ŷ^<3> from both sides:
From the forward activation a_forward^<3> (violet way)
From the backward activation a_backward^<3> (yellow way)
→ ŷ^<3> = g(W_y [a_forward^<3>, a_backward^<3>] + b_y)
Cons: We DO need the entire sequence of data before we can make a prediction anywhere.
Ex: speech recognition → wait for the person to stop talking (so that we have the entire sentence) and only then can we make the prediction → not so good in real time
→ use other techniques!
Deep RNNs
Stack multiple layers of RNNs together!
Notation: a^[l]<t> = activation of layer l at time step t
For normal NN → many layers means deep
For RNN → 3 layers is a lot!
Sometimes, instead of outputting ŷ^<t> directly, we connect a^[l]<t> to some normal (non-recurrent) NN layers, like in the fig below.
→ We don't see very deep RNNs often because of their computational cost!
Week 2 - NLP & Word Embeddings
Introduction to Word Embeddings
Word Representation → word embedding
1 of the most important ideas in NLP
If we use one-hot vectors, it's difficult for the ML algo to generalize across words because of their representation in the vocabulary, e.g. Apple (456) is very far from Orange (6257) → it cannot generalize from "I want a glass of orange juice" to "I want a glass of apple juice" ← because the inner product between any 2 one-hot vectors is 0.
→ We use Featurized representation instead!
300 dimensional vector or 300 dimensional embedding
We can embed 300 D → 2 D for visualizing ← using t-SNE or UMAP
Using word embeddings
There are already embeddings "pre-trained" on very large text corpora from the internet (~1B words, up to 100B words) → apply them to your task with a much smaller training set of ~100K words → this allows you to carry out Transfer Learning (learn from 1B, transfer to 100K) → (optional) continue to fine-tune the word embeddings with your new data → use a BRNN (bidirectional RNN) instead of a simple RNN.
Word embeddings are related to face encodings
The words "embedding" and "encoding" are used interchangeably
Properties of word embeddings
It can help with analogy reasoning
"sim" means "similarity"
Do the analogy reasoning in the original 300-D space, not after t-SNE. After using something like t-SNE to embed into a lower dimension (a non-linear mapping), the similarity/parallelogram relationships are no longer guaranteed to hold.
Cosine similarity sim(u, v) = u·v / (||u|| ||v||) → the common way to measure the similarity between 2 word embeddings
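A minimal sketch of cosine similarity and analogy reasoning ("man is to woman as king is to ?"); `emb` is a hypothetical dict of word → 300-D vectors:

```python
# Minimal sketch of cosine similarity and word-analogy reasoning.
import numpy as np

def cosine_sim(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, emb):
    """Find w maximizing sim(e_w, e_b - e_a + e_c), e.g. man:woman ~ king:?"""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_sim(emb[w], target))
```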
Embedding matrix
When you implement an algo to learn word embeddings → you end up with an embedding matrix E (e.g. 300 × 10,000); multiplying it by a one-hot vector o_j gives the embedding of word j: e_j = E·o_j
In practice, we use an "Embedding layer" (a direct column lookup) instead of the matrix multiplication above! ← more efficient!
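A tiny numpy sketch showing why the Embedding-layer lookup and the multiplication E·o_j give the same vector (sizes follow the course's 300 × 10,000 example; the values are random):

```python
# Minimal sketch: multiplying E by a one-hot vector just selects one column,
# which is why an "Embedding layer" does a lookup instead.
import numpy as np

E = np.random.randn(300, 10000)   # hypothetical embedding matrix (300 x vocab size)
j = 6257                          # e.g. index of "orange"
o_j = np.zeros(10000); o_j[j] = 1.0

e_j_slow = E @ o_j                # matrix-vector product (wasteful)
e_j_fast = E[:, j]                # direct lookup (what an Embedding layer does)
assert np.allclose(e_j_slow, e_j_fast)
```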
Learning Word Embeddings → some concrete algos: Word2Vec & GloVe
Learning word embeddings (go from more complicated but more intuitive → simpler)
Word2Vec Skip-gram model → the context is any 1 nearby word, e.g. "orange" or "glass" or "my",...
Take 1 context word, skip some words in between → predict the target word
Cons: computational cost of the softmax ← because of the sum over the whole vocabulary in the denominator of p(t|c) = exp(θ_t^T e_c) / Σ_{j=1..10,000} exp(θ_j^T e_c) → use a hierarchical softmax
In practice, we don't sample the context word c uniformly at random; we use different heuristics to balance very common words (the, of, a, and, to,...) against less common (but more important) words (orange, apple, durian,...)
Negative Sampling → a modified learning problem → allows training efficiently on much bigger training sets
Positive example: orange (context) - juice (word) → 1 (target); negative examples: pair the same context with k random words from the dictionary → 0 (target)
Sample based on "how often words appear in the corpus" (the empirical frequency f(w_i)) → cons: very high representation of "the, of, and,..."
Use the uniform distribution 1/|vocab| → not representative of how English words are actually distributed
Usually use something in between: P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)
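A minimal numpy sketch of sampling negative words with the f(w_i)^(3/4) heuristic above; the frequency values are made up for illustration:

```python
# Minimal sketch of the P(w_i) ∝ f(w_i)^(3/4) sampling heuristic for negative words.
import numpy as np

freqs = {"the": 0.05, "of": 0.03, "orange": 0.0004, "durian": 0.00001}  # toy frequencies
words = list(freqs)
p = np.array([freqs[w] ** 0.75 for w in words])
p /= p.sum()                                       # P(w_i) = f(w_i)^0.75 / sum_j f(w_j)^0.75

negatives = np.random.choice(words, size=4, p=p)   # sample k negative words
```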
GloVe word vectors (= Global Vectors for word representation)
Has some momentum in the NLP community; not used as much as the word2vec / skip-gram models, but it has its enthusiasts
X_ij → how often words i and j appear close to each other (i, j play the role of c, t ↔ context and target words)
What GloVe does is minimize Σ_i Σ_j f(X_ij) (θ_i^T e_j + b_i + b'_j − log X_ij)^2, where the weighting term f(X_ij) = 0 when X_ij = 0 (so we never take log of 0) and also balances very frequent vs. rare word pairs.
Applications using Word Embeddings
Sentiment Classification → take a piece of text and tell whether someone likes/dislikes the thing they're talking about
Challenge → you may not have a huge labeled training set for it ← word embeddings help with this
Simple sentiment classification model
Cons: ignores word order, e.g. the sentence in the fig is negative even though it contains many "good"s → use an RNN instead!
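A minimal sketch of that simple averaging model (ignoring word order); the embedding dict `emb` and the softmax weights `W`, `b` are hypothetical:

```python
# Minimal sketch of the simple sentiment model: average the embeddings of the
# words in the review, then apply a softmax over the star ratings (1-5).
import numpy as np

def simple_sentiment(words, emb, W, b):
    avg = np.mean([emb[w] for w in words], axis=0)            # ignores word order
    z = W @ avg + b
    return np.exp(z - z.max()) / np.exp(z - z.max()).sum()    # P(1..5 stars)
```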
RNN for sentiment classification
Debiasing word embeddings
"bias" → not the "bias" in "bias variant" → means "gender, ethnicity, sexual...bias
The problem of bias in word embeddings:
Addressing bias in word embeddings
More explanation:
- step 1: identify the bias direction and the non-bias direction
- step 2: neutral words must be projected to remove the bias ("babysitter" and "doctor" are projected onto the non-bias axis (Oy) to remove the bias "babysitter"→female, "doctor"→male)
- step 3: equalize the distances: e.g. "babysitter" ends up closer to "grandmother" → not right → make it equidistant between "grandmother" and "grandfather"
How do you decide which words to neutralize? (step 2)
Train a classifier → figure out which words are definitional or not ← a linear classifier can tell you
Most words in English are not definitional (like babysitter and doctor)
Week 3 - Sequence models & Attention mechanism
Various sequence to sequence architectures
Basic Sequence to sequence model ← translation
Encoder network: (RNN / GRU / LSTM) takes the input French words → 1 word at a time → outputs a vector that represents the input sentence
Decoder network: takes the output of the encoder → generates the translation 1 word at a time → outputs the English sentence
This model works: given enough pairs of French-English sentences → it works well
An architecture very similar to the above also works for image captioning ← describing an image
Use a ConvNet (e.g. a pretrained AlexNet in the fig) as the encoder → instead of the softmax, we feed its output into an RNN
Picking the most likely sentence
The similarity between sequence-to-sequence and language model (week 1)
Consider machine translation as building a conditional language model
A language model gives the probability of a sentence, and can generate novel sentences.
Machine translation model: 2 parts - encoder (green), decoder (violet), where the decoder looks just like a language model
When you use this model for machine translation, you do not try to sample at random from this distribution! → instead, you want to maximize P(y^<1>, ..., y^<T_y> | x) ← using Beam Search!!!
Beam Search
Why not Greedy Search?
It maximizes each P(y^<t> | ...) one at a time, one after another, instead of the probability of the whole sentence jointly.
For the French sentence "Jane visite l'Afrique en Septembre" → Beam Search gives the upper (better) translation; Greedy Search gives the lower one → not very accurate!
Another reason: the number of possible word combinations is huge → searching exactly word by word is not feasible → using an approximate search is better!
Step 1: Choose a "beam width" (e.g. B=3) → for the 1st word of the translation (given the French input) → pick the B most likely words
Step 2: For each of the above 3 words → consider the most likely choices for the 2nd word and keep the B pairs with the highest P(y^<1>, y^<2> | x)
If B=1 → Beam Search becomes Greedy Search!
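A minimal sketch of beam search over a hypothetical `next_log_probs(x, prefix)` function (returning log P(word | x, prefix) for every word); not the course implementation:

```python
# Minimal sketch of beam search with beam width B over a toy scoring function.
def beam_search(x, next_log_probs, B=3, max_len=10, eos="<EOS>"):
    beams = [([], 0.0)]                               # (prefix, sum of log probs)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:          # finished hypothesis: keep as-is
                candidates.append((prefix, score))
                continue
            for word, logp in next_log_probs(x, prefix).items():
                candidates.append((prefix + [word], score + logp))
        # keep only the B most likely partial translations
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return max(beams, key=lambda c: c[1])[0]
```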
Refinements to Beam Search
Length normalization
Sometimes maximizing a product of probabilities A = Π_t P(y^<t> | x, y^<1>,...,y^<t-1>) underflows numerically because A is very small → we maximize log(A) = Σ_t log P(y^<t> | x, y^<1>,...,y^<t-1>) instead!!!
arg max log(A) gives the same result as arg max A (log is monotonically increasing). For length normalization, divide the sum by T_y^α (α ≈ 0.7 in practice) so the search doesn't unfairly prefer very short translations.
Beam width B → the larger the width, the more possibilities you consider and the better the result → but the more computationally expensive your algo is
Try B = 1 → 3 → 10; values like 100, 1000, 3000 appear in research systems → be careful about the cost in production / commercial settings
Unlike BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster but is not guaranteed to find the exact maximum.
Error analysis in beam search ← what if beam search makes a mistake?
What fraction of errors is due to Beam Search vs. the RNN model? Compare P(y*|x) (the human translation y*) with P(ŷ|x) (the algorithm's output ŷ): if P(y*|x) > P(ŷ|x), Beam Search is at fault; if P(y*|x) ≤ P(ŷ|x), the RNN model is at fault.
If Beam Search? → increase the beam width
If RNN? → deeper error analysis, regularization, more training data, different network architecture,...
Bleu Score ← what if multiple English translations are equally good for the same French sentence? How do we evaluate?
The Bleu score gives you an automatic way to evaluate your algo → speeds up development
"bleu" = bilingual evaluation understudy
paper is readable
Unigram
Bigrams → (general) n-grams
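A minimal sketch of the clipped ("modified") n-gram precision at the heart of the Bleu score; the clipping against reference counts follows the lecture, everything else is my own toy code:

```python
# Minimal sketch of modified (clipped) n-gram precision used in the Bleu score.
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    # clip each candidate count by the max count over the reference translations
    clipped = {
        g: min(c, max(Counter(ngrams(ref, n))[g] for ref in references))
        for g, c in cand_counts.items()
    }
    return sum(clipped.values()) / max(sum(cand_counts.values()), 1)
```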
Attention model intuition → look at one part of the sentence at a time
A modification of the encoder-decoder → the attention model makes all of this work much better → 1 of the most influential ideas in deep learning
(with the plain encoder-decoder) the longer the sentence → the lower the Bleu score ⇒ because it's difficult for the NN to memorize the whole sentence
α^<t,t'> = how much attention you should pay to a piece of the sentence (input word t') when generating output word t
It tells you, when you're trying to generate the t-th English word, how much attention to pay to each of the French words ⇒ this allows the model, on every time step, to look only within a local window of the French sentence when generating a specific English word.
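A minimal numpy sketch of turning alignment energies e^<t,t'> into attention weights α^<t,t'> (a softmax over t') and a context vector; the energies are assumed to come from a small network taking s^<t-1> and a^<t'> (not shown), and the shapes are assumptions:

```python
# Minimal sketch of attention weights and the context vector for one output step.
import numpy as np

def attention_context(energies, a):
    """energies: (Tx,) scores e<t,t'>; a: (Tx, n_a) encoder activations."""
    alphas = np.exp(energies - energies.max())
    alphas /= alphas.sum()              # softmax -> sum over t' of alpha<t,t'> = 1
    context = alphas @ a                # weighted sum of the encoder activations
    return alphas, context
```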
Speech recognition ← how sequence-to-sequence models are applied to audio data
audio (Ox=time, Oy=air pressure) → spectrogram (Ox=time, Oy=frequency, different colors=amount of energy) ← a common preprocessing step.
Speech recognition used to be built on "phonemes"; with end-to-end deep learning, phonemes are no longer needed.
phonemes = In linguistics, the smallest unit of speech that distinguishes one word sound from another. Phonemes are the elements on which computer speech is based.
Datasets → academic (300h, 3,000h), commercial (100,000h)
Using Attention model
Using CTC cost (Connectionist temporal classification)
The number of input time steps is really large! (e.g. 10s of audio at 100 Hz → 1,000 inputs) → #inputs is large → but the output (the transcript) doesn't have that many characters! → the CTC cost lets the network output repeated characters and "blank" tokens at every input step, which are then collapsed (see the sketch below).
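A minimal sketch of the CTC "collapse" rule from the lecture (collapse repeated characters not separated by "blank", then remove the blanks):

```python
# Minimal sketch of the CTC collapse rule.
# Example: t t t _ h _ e e e _ _ _  ->  "the"
def ctc_collapse(outputs, blank="_"):
    collapsed, prev = [], None
    for ch in outputs:
        if ch != prev and ch != blank:   # keep a char only when it changes and isn't blank
            collapsed.append(ch)
        prev = ch
    return "".join(collapsed)

print(ctc_collapse(list("ttt_h_eee___ ___qqq__")))  # -> "the q"
```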
Trigger word detection systems (like Alexa, Google Home, Apple Siri, Baidu DuerOS)