Word Embedding

29-12-2020

What & Why? #

  • ML models need vectors (arrays of numbers) as input ⇒ texts have to be converted to numbers first (i.e. the texts are vectorized).
    • One-hot encodings ⇒ inefficient (almost all elements of the resulting sparse vectors are zeros).
    • Encoding each word with a unique integer ⇒ more compact than one-hot, but it captures no relationship between words.
    • ⇒ That's why we use word embeddings (see the sketch after this list), which give:
      • A dense representation.
      • Relationships between similar words.
  • Word embedding = representing words (and associated words) as vectors clustered in a multi-dimensional space, so that words with similar meanings have similar representations.
    • a dense vector of floating point values.
    • similar words have a similar encoding.
    • like a "lookup table"
    • values are trainable parameters.
  • How many dimensions?
    • Small dataset: commonly 8.
    • Big dataset: commonly up to 1024.
  • The meaning of the words can come from the labels of the dataset.
    • Example: "dull" and "boring" show up a lot in negative reviews ⇒ they carry a similar sentiment ⇒ their vectors end up close to each other in the embedding space ⇒ the NN trains and learns these vectors, associating them with the labels, to come up with what's called an embedding (see the cosine-similarity sketch below the figure).
  • The embedding dimension is the number of dimensions of the vector that represents each word.
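Below is a minimal sketch contrasting the three representations above: one-hot vectors, plain integer ids, and a dense, trainable embedding lookup. The vocabulary, the word ids, and the embedding size of 4 are made-up values for illustration.

import tensorflow as tf

# Toy vocabulary and two integer-encoded words ("cat", "boring").
vocab = ["the", "cat", "sat", "dull", "boring"]
word_ids = tf.constant([1, 4])

# One-hot: length-5 vectors that are almost all zeros (sparse, inefficient).
one_hot_vectors = tf.one_hot(word_ids, depth=len(vocab))

# Integer encoding alone: compact, but id 1 vs id 4 encodes no relationship.

# Embedding: a dense, trainable vector per word, used like a lookup table.
embedding = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=4)
dense_vectors = embedding(word_ids)  # shape (2, 4), values are trainable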

Figure: word embedding examples, showing a 4-dimensional embedding for each word. Source of the idea.
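As a rough illustration of what "similar vectors" means, the sketch below compares hypothetical 4-dimensional word vectors with cosine similarity; the numbers are invented for the example and do not come from a trained model.

import numpy as np

# Hypothetical trained vectors (made-up values, 4 dimensions each).
v_dull   = np.array([ 0.8, -0.2,  0.5, 0.1])
v_boring = np.array([ 0.7, -0.1,  0.6, 0.2])
v_fun    = np.array([-0.6,  0.9, -0.3, 0.4])

def cosine(a, b):
    # Cosine similarity: close to 1 = same direction, close to -1 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v_dull, v_boring))  # ≈ 0.98: "dull" and "boring" are close
print(cosine(v_dull, v_fun))     # ≈ -0.67: dissimilar words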

How? #

TensorFlow #

👉 Word embeddings | TensorFlow Core
👉 Notes about word embeddings from the deeplearning.ai course.
👉 Embedding projector -- visualization of high-dimensional data

import tensorflow as tf

# Embed a 1,000-word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)
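A small usage sketch, assuming the embedding_layer defined above: calling the layer on a batch of integer word ids returns one dense vector per id (the ids 1, 2, 3 are arbitrary).

# Look up the vectors for three arbitrary word ids with the layer above.
result = embedding_layer(tf.constant([1, 2, 3]))
print(result.shape)  # (3, 5): one 5-dimensional vector per word id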

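To connect this with the "labels give meaning" idea above, here is a hedged sketch of a tiny sentiment classifier in which the Embedding weights are trained jointly with the labels; the vocabulary size (1000), embedding dimension (5), and the layers after the embedding are assumptions for illustration, not a fixed recipe.

import tensorflow as tf

# Illustrative sentiment model: the embedding vectors are trainable weights
# updated together with the classifier, so words appearing in similar
# contexts/labels (e.g. "dull", "boring") end up with similar vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=5),  # assumed sizes
    tf.keras.layers.GlobalAveragePooling1D(),                 # average the word vectors
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),           # positive / negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=10)  # hypothetical integer-encoded data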