What & Why?
- ML models need vectors (arrays of numbers) as input, so texts must be converted to arrays of numbers (vectorized).
- One-hot encoding is inefficient: almost all elements of the resulting sparse matrix are zeros.
- Encoding each word with a unique number is more efficient than one-hot encoding, but it does not capture the relationships between words.
- That's why we use word embeddings, which give us:
  - A dense representation.
  - Relationships between similar words.
- Word embedding = the idea that words and their associated words are clustered as vectors in a multi-dimensional space, so that words with similar meanings have similar representations.
  - Each word is a dense vector of floating-point values.
  - Similar words have similar encodings.
  - It works like a "lookup table": each word index maps to its vector.
  - The values are trainable parameters.
  - Embedding dimension for a small dataset: commonly 8.
  - Embedding dimension for a big dataset: commonly up to 1024.
- The meaning of the words can come from the labeling of the dataset.
  - Example: "dull" and "boring" show up a lot in negative reviews → they have similar sentiments and appear close to each other in sentences → their vectors will be similar → the NN trains and learns these vectors, associating them with the labels to come up with what's called an embedding.
- The embedding dimension is the number of dimensions of the vector that represents each word's encoding.
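To make the comparison of the three encodings concrete, here is a minimal pure-Python sketch. The vocabulary, the helper names (`one_hot`, `cosine`) and the 2-dimensional vectors are all hypothetical and picked by hand for illustration; they are not the output of any trained model.

```python
import math

# Toy vocabulary; the words and their order are arbitrary choices.
vocab = ["dull", "boring", "great", "fun"]

# Integer encoding: one unique number per word.
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of len(vocab) with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked dense vectors standing in for learned embeddings
# (hypothetical values, chosen so that similar words sit close together).
embedding = {
    "dull": [-0.9, 0.1],
    "boring": [-0.8, 0.2],
    "great": [0.9, 0.3],
    "fun": [0.8, 0.4],
}

print(one_hot("dull"))                              # [1, 0, 0, 0]
print(word_to_id["boring"])                         # 1
# Two different one-hot vectors never overlap, so their similarity is 0.
print(cosine(one_hot("dull"), one_hot("boring")))   # 0.0
# Dense vectors can place "dull" nearer to "boring" than to "great".
print(cosine(embedding["dull"], embedding["boring"])
      > cosine(embedding["dull"], embedding["great"]))  # True
```

The one-hot vectors grow with the vocabulary and say nothing about similarity, and the integer IDs are arbitrary. Only the dense vectors can express that "dull" and "boring" are related.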
An example of a 4-dimensional embedding. Source of the idea.
```python
import tensorflow as tf

# Embed a 1,000-word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)
```
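The "lookup table" view of the layer above can be sketched in plain Python. This is only a conceptual stand-in under stated assumptions, not how TensorFlow implements the layer: the names `table` and `embed`, the seed, and the initial value range are all chosen here for illustration.

```python
import random

VOCAB_SIZE = 1000    # number of distinct word IDs, matching the layer above
EMBEDDING_DIM = 5    # length of each word vector

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One row per word ID. The rows start as small random values (an arbitrary
# range picked for this sketch); in a real model they are the trainable
# parameters that get updated during training.
table = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(batch):
    """Map a batch of integer-encoded sentences to their word vectors."""
    return [[table[word_id] for word_id in sentence] for sentence in batch]

# Two sentences of three word IDs each.
batch = [[0, 1, 2], [3, 4, 5]]
vectors = embed(batch)

# Each word ID was replaced by a 5-dimensional vector, so the result has
# one extra axis: (sentences, words, embedding dimension).
print(len(vectors), len(vectors[0]), len(vectors[0][0]))  # 2 3 5
```

Looking up an ID simply returns the corresponding row, which is why the output gains a trailing axis of size 5 and why training only has to adjust the rows of the table.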