openboundlabs.com


Goal
Simplest embedding problem using Tensorflow

Inspiried from this: http://matpalm.com/blog/2015/03/28/theano_word_embeddings/

https://colah.github.io/posts/2015-01-Visualizing-Representations/
https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

I first worked with embedding vectors in machine learning / neural
networks many years ago where we called them context vectors and used
them to do things like automatic text summarization, named entity
recognition, and semantic machine translation.

Embedding is powerful: word2vec, glove, neural image captioning, ...
Hinton started calling them thought vectors.

The basic idea is to embed objects that might not have much direct
relation to each other into a space where higher semantic meaning
might be more easily seen using simple operations like distance. An
example is English words. Looking in the dictionary the words "happen"
and "happy" occur right next to each other. However their meanings are
quite different; alphabetic distance isn't too useful when dealing
with word meanings. Wouldn't it be nice to embed those words in a
space where words with similar meanings are close together (viz:
synonyms happen==occur / happy==cheerful)....

================================

The Setup

- Predict whether two objects (o and p) are "compatible" / should occur together or not.

- Embed N arbitrary objects identified by integers into 2d (x,y) (so I can plot them)

- Try learning a simple linear ordering "compatibility" function:

- Make adjacent pairs compatible Prob(o,p)=1.0 : (o,p) = (0,1), (1,2), (4,5), ...

- Make all others incompatible Prob(o,p)=0.0 : (o,p) = (0,2), (3,5), ...

- Assume pairs are ordered so (o,p) -> p>o

- Try a very simple learning network:

Input: (o, p) (object integers)

Embed: embedding(o) = 2d

Computation: f(o,p) = single linear layer.
input = [embedding(x), embedding(y)]
output= logit for [1-P, P] where P is probability o,p are compatible

Output: softmax of network output

The key here is that when learning, gradients can be pushed back to
the embedding vectors so they can be updated. It is _automatic_ in
frameworks like Tensorflow.

Source Code: https://github.com/michaelbrownid/simplest-embedding
or simply simplest-embedding.python.txt

The relevant parts are quite simple:

# input batchSize by two integer object numbers
inputData = tf.placeholder(tf.int32, [None,2])

# target is batchSize probability of compatible: [1-Prob(compat), Prob(compat)]
target = tf.placeholder(tf.float32, [None,2])

# this is the matrix of embedding vectors. rows=objects. cols=embedding. random [-1,1]
embedding = tf.Variable(tf.random_uniform([numObjects, embeddingSize], -1.0, 1.0))

# pull out the embedding vectors from integer ids
emap = tf.nn.embedding_lookup(embedding, inputData)

# reshape to feed pair of embedding points as single vector per datapoint
inputs = tf.reshape(emap , [-1, 2*embeddingSize])

# Set model to linear function to [1-p,p] 2d prediction vector
softmaxW = tf.Variable(tf.random_normal([2*embeddingSize,2],stddev=0.1), name="softmaxW")
softmaxB = tf.Variable(tf.random_normal([1,2], stddev=0.1), name="softmaxB")
logits = tf.matmul(inputs, softmaxW ) + softmaxB
probs = tf.nn.softmax(logits)

# cross entropy loss = \sum( -log(P prediction) of correct label)/N. Jensen's E(logX) < = logE(X).
cross = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( logits, target))

# optimize


Note this is interesting because we will estimate both the embedding
and the linear discriminant function on the embedding simultaneously.

TODO: eliminate the linear function and use dot product with no
learned parameters.

================================

RESULT
Animate the embeddings and the decision boundries for the first 4
points (colors=black, red, green, blue) every 3 epochs of optimization
for 1000 epochs total.

What experiment do you want to see?
Coverged1
NotQuite
Converged2
current  / 333

frame step
frame delay  Smaller for faster animation!
Animate!

================================

The network can mostly learn the simple linear ordering! (A couple of
errors when not converged fully)

Here is a plot of the cross entropy objective by epoch and the bound
relation mean cross entropy to mean accuracy:

Jensen's Inequality relates the mean accuracy to mean cross-entropy.
Here $P$ is the probability assigned to the correct label and the
brackets are the expectation operator:

$$cross = < -\log P>$$
$$-cross = (< \log P>) \leq (\log < P>)$$
$$(< P>) \geq (\exp(-cross))$$

This isn't too exciting but it is interesting to see the nice linear
chain that is learned from a number of pairwise examples. It reminds
me of the Hammersley-Clifford theorem where a joint density can be
factorized over cliques (Hammersley-Clifford theorem)

It seems the ordering happens fairly quickly then the decision boundry
tightens up.

It is simple but interesting. Next it might be interesting to take ZIP
codes and embed them. First simply try to "learn"
ZIP->(latitude,longitude). Then try to pull out additional covariates
like median income, total population, ..., from the embedded vectors.

================================