Identifiability
[Figure: latent-space plots of the learned representation (axes from -5 to 5).]
In theory, the theory should have worked; in practice...
We have constructed a set of tools that allow us to engage with
learned representations in models with a deterministic decoder.
Let's have a look at how it works in practice...
As a test, we construct a simple dataset of points lying near
a deformed circle embedded in 1000 dimensions.
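A minimal sketch of how such a test set can be generated (the deformation, noise level and random embedding below are illustrative assumptions, not the exact choices behind the plots):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample angles and a 'deformed' radius: a unit circle with a
# smooth angular perturbation (illustrative deformation).
n = 500
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
radius = 1.0 + 0.3 * np.sin(3.0 * theta)           # deformed circle
circle_2d = np.stack([radius * np.cos(theta),
                      radius * np.sin(theta)], axis=1)

# Add a little noise so the points lie *near* the curve, not on it.
circle_2d += 0.01 * rng.standard_normal(circle_2d.shape)

# Embed the 2-D curve in 1000 dimensions via a random linear map
# (any smooth injective embedding would do for this kind of test).
embedding = rng.standard_normal((2, 1000))
X = circle_2d @ embedding                           # (n, 1000) data matrix
```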
So, geodesic distances are identifiable,
but the underlying curves do not seem
to represent what's actually happening
in the data.
So, the theory is great, but in practice,
it's perhaps not so great...
Let's do the math...
Since we see failures around a hole in the manifold, let's analyse what happens 'away from data'
Let's do the math...
Consider the mean of a posterior Gaussian process as our 'decoder'
The associated metric (Jacobian-transposed times Jacobian) is then $M(z) = J_\mu(z)^\top J_\mu(z)$,
where I assume no observation noise, i.e. $\sigma^2 = 0$.
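In standard GP-regression notation (my reconstruction of the formulas the slide refers to; $k$ is the kernel, $Z$ the latent training inputs, $Y$ the training outputs and $K = k(Z, Z)$ the noise-free Gram matrix):

```latex
\begin{align*}
  % posterior mean decoder (no observation noise, so K = k(Z, Z)):
  \mu(z)   &= k(z, Z)\, K^{-1} Y, \\
  % its Jacobian, obtained by differentiating the kernel w.r.t. the latent point:
  J_\mu(z) &= \partial_z k(z, Z)\, K^{-1} Y, \\
  % and the induced (pull-back) metric:
  M(z)     &= J_\mu(z)^\top J_\mu(z).
\end{align*}
```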
First observation: if we have infinite (noise-free) data and use a universal kernel,
then we will learn the true decoder where we have data, i.e. $\mu(z) = f_{\text{true}}(z)$ wherever we have observations.
Let's do the math...
The behavior 'away' from data changes with the choice of covariance function. Let's first consider one that decays to zero away from the data.
This extrapolates to zero, so the Jacobian also becomes zero, i.e. $M(z) \to 0$ away from data.
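Spelling this out (a sketch in the same notation as above, assuming a stationary covariance that decays to zero far from the data, such as a squared exponential):

```latex
\begin{align*}
  % far from the training latents Z a decaying covariance gives
  k(z, Z) \to 0, \qquad \partial_z k(z, Z) \to 0, \\
  % so the Jacobian of the posterior mean, and hence the metric, vanish:
  J_\mu(z) = \partial_z k(z, Z)\, K^{-1} Y \to 0
  \quad\Longrightarrow\quad
  M(z) = J_\mu(z)^\top J_\mu(z) \to 0.
\end{align*}
```

A vanishing metric makes the region without data free to cross, which is why geodesics cut through the hole instead of following the data.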
Let's do the math...
Okay, that didn't work; let's try linear extrapolation
This extrapolates, well, linearly, so the Jacobian becomes constant away from data
(i.e. the metric away from data becomes Euclidean).
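The corresponding sketch for a covariance whose posterior mean extrapolates linearly (again a reconstruction, not the slide's exact formula):

```latex
\begin{align*}
  % an affine posterior mean far from data gives a constant Jacobian:
  \mu(z) \approx A z + b
  \quad\Longrightarrow\quad
  J_\mu(z) \approx A
  \quad\Longrightarrow\quad
  M(z) \approx A^\top A = \text{const}.
\end{align*}
```

So the metric is flat (Euclidean up to a linear change of coordinates) and again carries no information about where the data lies.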
Intermediate summary
Geometry is nice in theory, but fails in practice because of smooth extrapolation.
A 'fix' is to extrapolate to 'wiggliness', but then we give up on stable learning, so that won't work.
Doesn't uncertainty influence 'away from data'?
GPs inform us as to when they extrapolate
(through predictive uncertainty).
Perhaps we should look at GPs?
(also we're at GPSS)
SuperGauss
Geometry GP-style
Gaussians are closed under linear operations, such that the derivative of a GP is another GP
(is that clear?)
[Figure: a GP and its derivative process.]
The decoder Jacobian must also follow a Gaussian process
Implication:
(assuming conditionally independent output dimensions)
(assuming same covariance across output dimensions)
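Written out in standard notation (a reconstruction of the implication under the two assumptions above, reusing $k$, $Z$, $K$ from before and writing $y_d$ for the $d$-th output dimension of the training data), each row of the Jacobian is Gaussian with a shared covariance:

```latex
\begin{align*}
  % posterior over the gradient of the d-th output dimension:
  J_d(z) &\sim \mathcal{N}\!\big( \partial_z k(z, Z)\, K^{-1} y_d,\; \Sigma_J(z) \big),
           \qquad d = 1, \dots, D, \\
  % with a covariance shared across output dimensions:
  \Sigma_J(z) &= \partial_z \partial_{z'} k(z, z')\big|_{z' = z}
    - \partial_z k(z, Z)\, K^{-1} \partial_{z'} k(Z, z')\big|_{z' = z}.
\end{align*}
```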
The metric is now a random variable and the geometry stuff no
longer applies:
we're screwed!
Geometry GP-style (lower your expectations)
With a Gaussian Jacobian
the metric follows a non-central Wishart distribution
(it's okay if you don't know what that is; we won't use it for anything)
The moments of this distribution are
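At least the first moment is easy to write down (a sketch under the assumptions above; only the mean is needed in what follows). Writing $J_d(z)$ for the gradient of the $d$-th output dimension:

```latex
\begin{align*}
  M(z) &= J(z)^\top J(z) = \sum_{d=1}^{D} J_d(z)\, J_d(z)^\top, \\
  % taking expectations adds the shared covariance once per output dimension:
  \mathbb{E}\big[ M(z) \big]
       &= \mathbb{E}\big[ J(z) \big]^\top \mathbb{E}\big[ J(z) \big] + D\, \Sigma_J(z).
\end{align*}
```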
Suggests that perhaps it's not too terrible to replace the stochastic
metric with the
expected metric!
In theory, the theory should have worked; in practice...
Remember how things used to look?
Let's do the math...
Assuming noise-free infinite data and prior covariance
Recall that when disregarding uncertainty
With uncertainty...
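A reconstruction of the comparison this slide builds up (assuming a stationary prior covariance with signal variance $\sigma_f^2$ and lengthscale $\lambda$, e.g. a squared exponential; the slide's exact kernel is not reproduced here):

```latex
\begin{align*}
  % disregarding uncertainty: away from data the mean Jacobian vanishes,
  M(z) &= J_\mu(z)^\top J_\mu(z) \;\longrightarrow\; 0, \\
  % with uncertainty: the Jacobian covariance reverts to its prior value,
  \mathbb{E}\big[ M(z) \big]
       &= J_\mu(z)^\top J_\mu(z) + D\, \Sigma_J(z)
         \;\longrightarrow\; D\, \frac{\sigma_f^2}{\lambda^2}\, I.
\end{align*}
```

With many output dimensions ($D = 1000$ in the toy example) the second term dominates away from data, so crossing the hole stays expensive and geodesics are pulled back towards the data.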
this, my friends, is
called 'hope'
Intermediate summary
Geometry is nice in theory, but it fails when disregarding uncertainty. With uncertainty there is hope!
What does all of this imply?
If we want geodesics to stay close to data, then the metric must penalize
moving away from data, i.e. it must be large away from data.
The metric is determined by the Jacobian, so we want large
Jacobians away from data.
All common regularizers do the opposite: to make geometry work we would need to regularize
towards 'wiggly' functions, i.e. we would have to give up on stable learning :-(
I'll come crawling back to the distributional
stuff and we'll see that geometry works
after all!
(GPSS for the win!)
Fitting a GP-LVM is hard