# Word Embedding

2021-07-05

## What & Why?

- ML models need vectors (arrays of numbers) as input $\Rightarrow$ texts have to be converted to arrays of numbers first (i.e., the texts are vectorized).
- One-hot encoding $\Rightarrow$ inefficient (the vectors are sparse: almost all elements are zeros).
- Encoding each word with a unique integer $\Rightarrow$ more efficient than one-hot, but it doesn't capture the relationships between words.
- $\Rightarrow$ That's why we use word embeddings, which give us:
  - A dense representation.
  - Relationships between similar words.
- Word embedding = the idea that words and associated words are clustered as vectors in a multi-dimensional space, so that words with similar meanings have similar representations.
  - A dense vector of floating-point values.
  - Similar words have similar encodings.
  - Works like a "lookup table" (see the sketch after this list).
  - The values are trainable parameters.
- How many dimensions?
  - Small datasets: commonly 8.
  - Large datasets: commonly up to 1024.
- The meaning of the words can come from the labels of the dataset.
  - Example: "dull" and "boring" show up a lot in negative reviews $\Rightarrow$ they carry similar sentiment $\Rightarrow$ their vectors end up close to each other in the embedding space $\Rightarrow$ the network trains these vectors while associating them with the labels, and the result is what's called an embedding (see the training sketch in the *How?* section below).
- The embedding dimension is the number of dimensions of the vector representing each word's encoding.
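
A minimal sketch of the three representations above, using a toy 4-word vocabulary and made-up 3-dimensional embedding values (the words and numbers are purely illustrative):

```python
import numpy as np

# Toy vocabulary of 4 words (hypothetical example).
vocab = ["the", "movie", "dull", "boring"]

# 1) One-hot: one 4-dim sparse vector per word, almost all zeros.
one_hot = np.eye(len(vocab))  # shape (4, 4)

# 2) Integer encoding: compact, but ids 2 and 3 being adjacent
#    says nothing about "dull" and "boring" being similar.
word_to_id = {w: i for i, w in enumerate(vocab)}

# 3) Embedding: a dense lookup table (trainable in practice;
#    values made up here). Similar words get similar rows.
embedding_table = np.array([
    [0.1, 0.0, 0.2],  # the
    [0.3, 0.5, 0.1],  # movie
    [0.9, 0.8, 0.7],  # dull
    [0.8, 0.9, 0.6],  # boring (close to "dull")
])

print(embedding_table[word_to_id["dull"]])  # the dense vector for "dull"
```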

*Figure: an example of a 4-dimensional embedding.*

## How?

### TensorFlow

👉 Word embeddings | TensorFlow Core
👉 Note about Word embeddings from the deeplearning.ai course.
👉 Embedding projector -- visualization of high-dimensional data

```python
import tensorflow as tf

# Embed a 1,000-word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)
```
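
For illustration, the layer behaves like the "lookup table" described above: passing it integer word indices returns the corresponding dense vectors, which start out randomly initialized and get adjusted during training:

```python
# Look up the vectors for 3 word indices (weights start random).
result = embedding_layer(tf.constant([1, 2, 3]))
print(result.shape)  # (3, 5) -- one 5-dim vector per word index
```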
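
And a minimal sketch of the "meaning from labels" idea from the previous section: a tiny sentiment classifier whose embedding weights are trained jointly with the classifier. Layer sizes and data names are illustrative, assuming the reviews are already integer-encoded and padded:

```python
import tensorflow as tf

# Tiny sentiment model: the Embedding layer is trained jointly with
# the classifier, so words that co-occur with the same label
# (e.g. "dull", "boring" in negative reviews) end up with similar vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 5),              # 1,000-word vocab -> 5-dim vectors
    tf.keras.layers.GlobalAveragePooling1D(),        # average the word vectors per review
    tf.keras.layers.Dense(1, activation="sigmoid"),  # negative (0) / positive (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(padded_reviews, labels, epochs=10)  # hypothetical, pre-padded data
```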

This is a draft note.