Introduction to Word Embeddings
Word Representation
\(V = [a, aaron, ..., zulu, <UNK>]\)
1-hot representation
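A minimal sketch of the 1-hot representation, assuming a toy 10-word vocabulary in place of the full one (the word list and the helper names `one_hot` and `dot` are illustrative, not from the course). It shows the main weakness of this representation: the inner product of any two different 1-hot vectors is 0, so "apple" is no closer to "orange" than to "king".

```python
# Toy vocabulary (assumption); the last entry stands in for unknown words.
vocab = ["a", "aaron", "apple", "king", "man", "orange", "queen", "woman", "zulu", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index; unknown words map to <UNK>."""
    index = word_to_index.get(word, word_to_index["<UNK>"])
    vector = [0] * len(vocab)
    vector[index] = 1
    return vector

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

print(dot(one_hot("apple"), one_hot("orange")))  # 0
print(dot(one_hot("apple"), one_hot("king")))    # 0
```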
Featurized representation: word embedding
Feature | Man(5391) | Woman(9853) | King(4914) | Queen(7157) | Apple(456) | Orange(6257) |
Gender | -1 | 1 | -0.95 | 0.97 | -0.01 | 0.00 |
Royal | 0.01 | 0.02 | 0.93 | 0.95 | -0.01 | 0.00 |
Age | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | -0.02 |
Food | 0.04 | 0.01 | 0.02 | 0.01 | 0.95 | 0.97 |
Size | … | … | … | … | … | … |
Cost | … | … | … | … | … | … |
Alive | … | … | … | … | … | … |
Verb | … | … | … | … | … | … |
Notation | \(e_{5391}\) | \(e_{9853}\) | … | … | … | … |
Visualizing word embeddings
t-SNE algorithm
Using word embeddings
Named entity recognition example
Sally Johnson is an orange farmer.
Robert Lin is an apple farmer.
xxx xxx is a durian cultivator.
Transfer learning and word embeddings
- Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
- Transfer the embeddings to a new task with a smaller training set (say, 100k words).
- Optional: continue to fine-tune the word embeddings with the new data.
Word embeddings tend to make the biggest difference when the task has a relatively smaller training set.
Useful for: named entity recognition, text summarization, co-reference resolution, parsing, etc.
Less useful for: language modeling, machine translation, especially when those tasks already have a lot of data.
Relation to face encoding
Properties of word embeddings
Feature | Man(5391) | Woman(9853) | King(4914) | Queen(7157) | Apple(456) | Orange(6257) |
Gender | -1 | 1 | -0.95 | 0.97 | 0.00 | 0.01 |
Royal | 0.01 | 0.02 | 0.93 | 0.95 | -0.01 | 0.00 |
Age | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | -0.02 |
Food | 0.09 | 0.01 | 0.02 | 0.01 | 0.95 | 0.97 |
Notation | \(e_{5391}\) (\(e_{man}\)) | \(e_{woman}\) | \(e_{king}\) | \(e_{queen}\) | \(e_{apple}\) | \(e_{orange}\) |
Man -> Woman as King -> ?
\[e_{man} - e_{woman} \approx \left[\begin{array}{c}-2\\0\\0\\0\end{array}\right]\]
\[e_{king} - e_{queen} \approx \left[\begin{array}{c}-2\\0\\0\\0\end{array}\right]\]
Analogies using word vectors
\[e_{man} - e_{woman} \approx e_{king} - e_{w}\]
Find word w:
\[\arg\max_w \; sim(e_{w}, e_{king} - e_{man} + e_{woman})\]
Cosine similarity
\[sim(e_{w}, e_{king} - e_{man} + e_{woman})\]
\[sim(u, v) = \frac{u^{T}v}{||u||_{2}||v||_{2}}\]
Man:Woman as Boy:Girl
Ottawa:Canada as Nairobi:Kenya
Big:Bigger as Tall:Taller
Yen:Japan as Ruble:Russia
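The analogy search can be sketched in a few lines, using the four feature values (Gender, Royal, Age, Food) from the table earlier in this section as stand-in embeddings. The function names are illustrative, and excluding the three input words from the candidates is an assumption of this sketch (a common practical choice, not something the notes state).

```python
import math

# Four illustrative features (gender, royal, age, food) copied from the table.
embeddings = {
    "man":    [-1.00, 0.01, 0.03, 0.09],
    "woman":  [1.00, 0.02, 0.02, 0.01],
    "king":   [-0.95, 0.93, 0.70, 0.02],
    "queen":  [0.97, 0.95, 0.69, 0.01],
    "apple":  [0.00, -0.01, 0.03, 0.95],
    "orange": [0.01, 0.00, -0.02, 0.97],
}

def cosine_similarity(u, v):
    """sim(u, v) = u^T v / (||u||_2 ||v||_2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def complete_analogy(a, b, c):
    """Return the word w that maximizes sim(e_w, e_c - e_a + e_b)."""
    target = [ec - ea + eb
              for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the three input words are not allowed as answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(complete_analogy("man", "woman", "king"))  # queen
```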
Embedding matrix
Embedding matrix
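A minimal sketch of what the embedding matrix does, assuming a toy 3 x 5 matrix with made-up values (the helper names are illustrative). Multiplying E by the 1-hot vector for word j picks out column j, which is the embedding of word j; reading the column directly gives the same result without the multiplications by zero.

```python
# Toy sizes: 3-dimensional embeddings for a 5-word vocabulary (assumption).
# E is stored with one row per embedding dimension and one column per word.
E = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [-1.0, -2.0, -3.0, -4.0, -5.0],
]

def one_hot(j, size):
    """1-hot vector of the given size with a 1 at position j."""
    return [1 if i == j else 0 for i in range(size)]

def matvec(matrix, vector):
    """Matrix-vector product: one dot product per row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lookup(matrix, j):
    """Read column j directly instead of multiplying by a 1-hot vector."""
    return [row[j] for row in matrix]

j = 3
print(matvec(E, one_hot(j, 5)))  # [0.4, 4.0, -4.0]
print(lookup(E, j))              # [0.4, 4.0, -4.0]
```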
\[E \cdot o_{6257} = e_{6257}\]
\[E \cdot o_{j} = e_{j} = \text{embedding for word } j\]
Learning Word Embeddings: Word2vec & GloVe
Learning word embeddings
Neural language model
I | want | a | glass | of | orange | ___. |
4343 | 9665 | 1 | 3852 | 6163 | 6257 |
I | o4343 -> E -> e4343 |
want | o9665 -> E -> e9665 |
a | o1 -> E -> e1 |
glass | o3852 -> E -> e3852 |
of | o6163 -> E -> e6163 |
orange | o6257 -> E -> e6257 |
The six embeddings feed a hidden layer, followed by a softmax over the 10000-word vocabulary.
Other context/target pairs
I want a glass of orange juice to go along with my cereal.
Last 4 words: a glass of orange ?
4 words on left & right: a glass of orange ? to go along with my cereal
Last 1 word: orange ?
Nearby 1 word (skip-gram): glass … ?
Vocab size = 10,000
x -> y
Context c (“orange”) 6257 -> Target t (“juice”) 4834
\(o_c\) -> E -> \(e_c\) -> softmax -> \(\hat{y}\)
\(e_c = E o_c\)
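A minimal sketch of the forward pass above, assuming the usual skip-gram softmax \(p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j} e^{\theta_j^T e_c}}\), where \(\theta_t\) is the output weight vector for word t. The toy sizes, random initial values and function names are assumptions made for illustration; in training, both E and the \(\theta\) vectors would be learned.

```python
import math
import random

random.seed(0)
vocab_size, embedding_dim = 8, 4  # toy sizes (assumption)

# Hypothetical randomly initialized parameters, stored with one row per word:
# E[w] is the embedding e_w and theta[w] is the softmax weight vector of w.
E = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in range(vocab_size)]
theta = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in range(vocab_size)]

def predict_target_distribution(context_index):
    """Return p(t | c) for every word t in the vocabulary."""
    e_c = E[context_index]  # row lookup; plays the role of e_c = E o_c
    scores = [sum(th * e for th, e in zip(theta_t, e_c)) for theta_t in theta]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)  # this sum runs over the whole vocabulary
    return [x / total for x in exps]

def cross_entropy(y_hat, target_index):
    """Loss for a single (context, target) pair."""
    return -math.log(y_hat[target_index])

y_hat = predict_target_distribution(context_index=2)
print(len(y_hat), round(sum(y_hat), 6))  # 8 1.0
```

The `total` line is where the cost comes from: with a 10,000-word vocabulary, every training example needs 10,000 scores.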
Problems with softmax classification
Expensive calculation: the softmax denominator sums over the whole vocabulary. One remedy is a hierarchical softmax classifier (a tree of binary classifiers with common words in the upper levels).
How to sample the context c?
Sampling c uniformly at random from the corpus mostly yields common words such as “the”, “of”, “a”, …
In practice, heuristics are used to balance the sampling between common words and less common words.
Negative Sampling
Defining a new learning problem
Context | Word | Target |
orange | juice | 1 |
orange | king | 0 |
orange | book | 0 |
orange | the | 0 |
orange | of | 0 |
For each positive pair, generate k pairs with target 0, where k = 5 to 20 for smaller data sets and 2 to 5 for large data sets.
Input x is the (context, word) pair; output y is the target.
Logistic regression model: instead of a 10000-way softmax, each step trains only k + 1 binary classifiers.
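The learning problem above can be sketched as follows, assuming the binary model \(P(y = 1 \mid c, t) = \sigma(\theta_t^T e_c)\). The toy vocabulary, the random initial parameters, the function names and the uniform choice of negatives are all assumptions made to keep the example short.

```python
import math
import random

random.seed(1)
vocab = ["orange", "juice", "king", "book", "the", "of", "glass", "a"]
embedding_dim = 4

# Hypothetical randomly initialized parameters (learned during training).
e = {w: [random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for w in vocab}
theta = {w: [random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for w in vocab}

def make_examples(context, true_word, k):
    """One positive pair plus k randomly drawn negative pairs."""
    examples = [(context, true_word, 1)]
    candidates = [w for w in vocab if w not in (context, true_word)]
    for _ in range(k):
        # Uniform draw for simplicity (assumption); a negative word can repeat.
        examples.append((context, random.choice(candidates), 0))
    return examples

def predict(context, word):
    """P(y = 1 | c, t) = sigmoid(theta_t^T e_c)."""
    score = sum(t * x for t, x in zip(theta[word], e[context]))
    return 1.0 / (1.0 + math.exp(-score))

def loss(examples):
    """Sum of k + 1 binary cross-entropy terms."""
    total = 0.0
    for context, word, label in examples:
        p = predict(context, word)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total

examples = make_examples("orange", "juice", k=4)
print(len(examples))  # 5
```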
Selecting negative examples
Sample from a distribution somewhere between two extremes: the uniform distribution over the vocabulary, and the observed word frequencies in the training set.
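One concrete choice, sketched below, is to sample each word with probability proportional to its count raised to a power between 0 and 1. The exponent 0.75 is the value reported in the word2vec paper rather than something stated in these notes, and the word counts are made up for illustration.

```python
# Hypothetical word counts from a training corpus.
counts = {"the": 5000, "of": 3000, "orange": 40, "juice": 25, "durian": 2}

def sampling_distribution(counts, power=0.75):
    """P(w) proportional to count(w) ** power.

    power = 1.0 reproduces the observed frequencies and power = 0.0 gives the
    uniform distribution; 0.75 sits in between (value from the word2vec paper).
    """
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

observed = sampling_distribution(counts, power=1.0)
smoothed = sampling_distribution(counts, power=0.75)
uniform = sampling_distribution(counts, power=0.0)

# Rare words are sampled more often than their raw frequency suggests,
# and very common words less often.
print(smoothed["durian"] > observed["durian"])  # True
print(smoothed["the"] < observed["the"])        # True
```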
GloVe word vectors
GloVe (global vectors for word representation)
c, t
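\(X_{ij}\) = number of times i appears in the context of j
where i is t, j is c
\(X_{ij} = X_{ji}\) when the context is defined as a symmetric window around the word
minimize \(\sum_{i=1}^{10,000}\sum_{j=1}^{10,000}f(X_{ij})(\theta_i^Te_j + b_i + b_j^{'} - \log X_{ij})^2\)
where \(f(X_{ij})\) is the weighting function, with \(f(X_{ij}) = 0\) when \(X_{ij} = 0\)
\(\theta_i\) and \(e_j\) play symmetric roles in this objective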
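A minimal sketch of evaluating this objective on a tiny co-occurrence matrix. The notes do not specify the weighting function; the form used here, \((x / x_{max})^{\alpha}\) capped at 1 with \(x_{max} = 100\) and \(\alpha = 0.75\), is the one from the GloVe paper and is an assumption relative to these notes. The counts and parameter values are made up.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): 0 for unseen pairs, capped at 1 for frequent pairs."""
    if x <= 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)

def glove_loss(X, theta, e, b, b_prime):
    """Sum of f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2 over all pairs."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij == 0:
                continue  # f(0) = 0, so the undefined log(0) term is skipped
            dot = sum(t * v for t, v in zip(theta[i], e[j]))
            error = dot + b[i] + b_prime[j] - math.log(x_ij)
            total += weight(x_ij) * error ** 2
    return total

# Toy symmetric co-occurrence counts for a 3-word vocabulary (made-up numbers).
X = [[0, 4, 1],
     [4, 0, 2],
     [1, 2, 0]]
theta = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
e = [[0.2, 0.1], [-0.1, 0.4], [0.0, 0.3]]
b = [0.0, 0.0, 0.0]
b_prime = [0.0, 0.0, 0.0]

print(glove_loss(X, theta, e, b, b_prime) > 0)  # True
```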
A note on the featurization view of word embeddings
Word & Property | Man (5391) | Woman (9853) | King (4914) | Queen (7157) |
Gender | -1 | 1 | -0.95 | 0.97 |
Royal | 0.01 | 0.02 | 0.93 | 0.95 |
Age | 0.03 | 0.02 | 0.70 | 0.69 |
Food | 0.09 | 0.01 | 0.02 | 0.01 |
minimize \(\sum_{i=1}^{10,000}\sum_{j=1}^{10,000}f(X_{ij})(\theta_i^Te_j + b_i + b_j^{'} - \log X_{ij})^2\)
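One way to see why the learned dimensions need not line up with readable features such as Gender or Royal: for any invertible matrix \(A\) (an arbitrary matrix, introduced here only for illustration), transforming the parameters leaves every inner product in the objective unchanged.

\[(A\theta_i)^T(A^{-T}e_j) = \theta_i^T A^T A^{-T} e_j = \theta_i^T e_j\]

So the objective cannot tell the axes in the table apart from any invertible linear mix of them, although relations between vectors, such as the analogy differences, are still meaningful.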
Applications using Word Embeddings
Sentiment Classification
Sentiment classification problem
x | y |
The dessert is excellent. | 4 stars |
Service was quite slow. | 2 stars |
Good for a quick meal, but nothing special. | 3 stars |
Simple sentiment classification model
The | dessert | is | excellent | 4 stars |
8928 | 2468 | 4694 | 3180 |
The | o8928 -> E | e8928 |
dessert | o2468 -> E | e2468 |
is | o4694 -> E | e4694 |
excellent | o3180 -> E | e3180 |
Average (or sum) the embeddings, then feed the result to a softmax over the star ratings.
Disadvantage: ignores word order
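A minimal sketch of the averaging model, assuming made-up 3-dimensional embeddings and an untrained weight matrix `W` with one row per star rating (both hypothetical). It also shows the disadvantage directly: any reordering of the words produces the same averaged features, so the prediction cannot depend on word order.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones would be columns of E.
embeddings = {
    "the": [0.0, 0.25, 0.0],
    "dessert": [0.25, 0.0, 0.5],
    "is": [0.0, 0.0, 0.25],
    "excellent": [1.0, 0.75, 0.25],
}

def average_embedding(words):
    """Element-wise mean of the word embeddings."""
    vectors = [embeddings[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def softmax(scores):
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Hypothetical untrained weights: one row per star rating (1 to 5).
W = [[-1.0, -1.0, 0.0], [-0.5, -0.5, 0.0], [0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0], [1.0, 1.0, 0.0]]

def predict_stars(words):
    """Return the most probable star rating for a list of words."""
    features = average_embedding(words)
    scores = [sum(w * f for w, f in zip(row, features)) for row in W]
    probabilities = softmax(scores)
    return probabilities.index(max(probabilities)) + 1

sentence = ["the", "dessert", "is", "excellent"]
shuffled = ["excellent", "is", "the", "dessert"]
same = all(math.isclose(a, b)
           for a, b in zip(average_embedding(sentence), average_embedding(shuffled)))
print(same)  # True
```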
RNN for sentiment classification
Use many-to-one RNN
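A minimal sketch of the many-to-one idea, assuming a plain tanh recurrence \(h_t = \tanh(W_x x_t + W_h h_{t-1})\) with fixed, untrained toy weights (all values and names here are hypothetical). Only the last hidden state would be passed to the softmax, and unlike the averaging model, that state changes when the word order changes.

```python
import math

def rnn_final_state(inputs, W_x, W_h):
    """Run h_t = tanh(W_x x_t + W_h h_{t-1}) and return the last hidden state."""
    h = [0.0] * len(W_h)
    for x in inputs:
        # The comprehension reads the previous h before h is reassigned.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_x[r], x))
                + sum(w * hi for w, hi in zip(W_h[r], h))
            )
            for r in range(len(W_h))
        ]
    return h

# Hypothetical 2-dimensional embeddings for two words, and toy weights.
x_first, x_second = [1.0, 0.0], [0.0, 1.0]
W_x = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]

h_forward = rnn_final_state([x_first, x_second], W_x, W_h)
h_reversed = rnn_final_state([x_second, x_first], W_x, W_h)
print(h_forward != h_reversed)  # True
```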
Debiasing word embeddings
The problem of bias in word embeddings
Man:Woman as King:Queen
Man:Computer_Programmer as Woman:Homemaker
Father:Doctor as Mother:Nurse
Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model.
Addressing bias in word embeddings
- Identify bias direction: average difference vectors such as \(e_{he} - e_{she}\) and \(e_{male} - e_{female}\).
- Neutralize: for every word that is not definitional, project to get rid of the component along the bias direction.
- Equalize pairs.
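The first two steps can be sketched as follows, assuming made-up 3-dimensional embeddings and a bias direction obtained by simple averaging of difference vectors. The function names are illustrative, and the equalize step is not implemented here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def bias_direction(pairs, embeddings):
    """Average the difference vectors of definitional pairs such as (he, she)."""
    diffs = [subtract(embeddings[a], embeddings[b]) for a, b in pairs]
    return [sum(column) / len(diffs) for column in zip(*diffs)]

def neutralize(vector, g):
    """Remove the component of vector that lies along the bias direction g."""
    scale = dot(vector, g) / dot(g, g)
    return [v - scale * gi for v, gi in zip(vector, g)]

# Hypothetical 3-dimensional embeddings, made up for illustration.
embeddings = {
    "he": [-1.0, 0.2, 0.1],
    "she": [1.0, 0.2, 0.1],
    "male": [-0.9, 0.0, 0.3],
    "female": [0.9, 0.0, 0.3],
    "doctor": [-0.3, 0.7, 0.5],
}

g = bias_direction([("he", "she"), ("male", "female")], embeddings)
debiased_doctor = neutralize(embeddings["doctor"], g)

# After neutralizing, the vector has no component along the bias direction.
print(abs(dot(debiased_doctor, g)) < 1e-12)  # True
```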