Identifiability

[Figure: a learned representation; latent axes run from -5 to 5.]

In theory, the theory should have worked; in practice...

We have constructed a set of tools that allow us to engage with learned representations in models with a deterministic decoder. Let's have a look at how it works in practice.

As a test, we construct a simple dataset of points lying near a deformed circle embedded in 1000 dimensions.
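As a rough illustration, here is a minimal sketch of how such a dataset could be built. The slide does not spell out the construction, so the radial deformation, the noise level, and the random linear embedding into 1000 dimensions below are all assumptions of mine.

import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 1000                                # number of points, ambient dimension

# Points near a deformed circle in 2-D (the particular deformation is assumed).
theta = rng.uniform(0.0, 2.0 * np.pi, size=N)
radius = 1.0 + 0.3 * np.sin(3.0 * theta)         # wobbly radius -> "deformed" circle
circle = np.stack([radius * np.cos(theta),
                   radius * np.sin(theta)], axis=1)
circle += 0.01 * rng.standard_normal(circle.shape)   # points lying *near* the circle

# Embed the 2-D curve in 1000 dimensions with a fixed random linear map (assumed).
A = rng.standard_normal((2, D)) / np.sqrt(D)
X = circle @ A                                   # observed data, shape (1000, 1000)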
On this data, geodesic distances are indeed identifiable, but the underlying curves do not seem to represent what's actually happening in the data. So, the theory is great, but in practice it's perhaps not so great...

Let's do the math...

Since we see failures around a hole in the manifold, let's analyse what happens 'away from data'.

Consider the mean of a posterior Gaussian process as our 'decoder', f(z) = E[f | data](z). The associated metric (Jacobian-transposed times Jacobian) is then

    M(z) = J(z)^T J(z),   with J(z) = df(z)/dz,

where I assume no observation noise, i.e. sigma^2 = 0.

First observation: if we have infinite (noise-free) data and use a universal kernel, then we will learn the true decoder where we have data, i.e. the posterior mean coincides with the true decoder on the support of the data.

The behaviour 'away' from data changes with the choice of covariance function. Consider first a covariance function whose posterior mean extrapolates to zero (reverting to the prior mean). Then the Jacobian also becomes zero away from data, i.e. M(z) -> 0 there: crossing the hole costs next to nothing, so geodesics happily cut through it.

Okay, that didn't work; let's try linear extrapolation. This extrapolates, well, linearly, i.e. the metric away from data becomes Euclidean (constant), so geodesics away from data are straight lines, which again cut across the hole.

What does all of this imply?

If we want geodesics to stay close to data, then we must penalize the metric away from data. The metric is determined by the Jacobian, so we want large Jacobians away from data. All common regularizers do the opposite, i.e. we would need to regularize towards 'wiggly' functions if we want geometry to work, i.e. we would have to give up on stable learning :-( I'll come crawling back to the distributional stuff, and we'll see that geometry works after all! (GPSS for the win!)

Intermediate summary

Geometry is nice in theory, but fails in practice because of smooth extrapolation. A 'fix' is to extrapolate to 'wiggliness', but then we give up on stable learning, so that won't work.

Doesn't uncertainty influence 'away from data'?

GPs inform us as to when they extrapolate (through predictive uncertainty). Perhaps we should look at GPs? (Also, we're at GPSS.)

Geometry GP-style

Gaussians are closed under linear operations, so the derivative of a GP is another GP (is that clear?). The decoder Jacobian must therefore also follow a Gaussian process.

Implication: assuming conditionally independent output dimensions with the same covariance across output dimensions, the rows of the Jacobian at a latent point z are Gaussian,

    J_d(z) ~ N(mu_d(z), Sigma(z)),   d = 1, ..., D.

The metric is now a random variable and the geometry stuff no longer applies: we're screwed!

Geometry GP-style (lower your expectations)

With a Gaussian Jacobian, the metric M(z) = J(z)^T J(z) follows a non-central Wishart distribution (it's okay if you don't know what that is; we won't use it for anything). The moments of this distribution are available in closed form; in particular,

    E[M(z)] = E[J(z)]^T E[J(z)] + D * Sigma(z).

This suggests that perhaps it's not too terrible to replace the stochastic metric with the expected metric!
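Here is a minimal sketch of that replacement under the assumptions above (rows of the Jacobian i.i.d. Gaussian with a shared covariance); the concrete dimensions and numbers are made up, and the Monte Carlo check is only a sanity test of the moment formula.

import numpy as np

def expected_metric(mean_jac, jac_cov):
    """E[J^T J] for a Jacobian with D i.i.d. Gaussian rows N(mu_d, Sigma).

    mean_jac: (D, q) array, E[J(z)];  jac_cov: (q, q) array, Sigma(z).
    """
    D = mean_jac.shape[0]
    return mean_jac.T @ mean_jac + D * jac_cov

# Monte Carlo sanity check of the formula with arbitrary (made-up) numbers.
rng = np.random.default_rng(1)
D, q = 1000, 2
mu_J = rng.standard_normal((D, q))               # E[J(z)]
L = rng.standard_normal((q, q))
Sigma = L @ L.T + 1e-6 * np.eye(q)               # shared row covariance Sigma(z)
L_chol = np.linalg.cholesky(Sigma)

samples = [mu_J + rng.standard_normal((D, q)) @ L_chol.T for _ in range(2000)]
mc_mean = np.mean([J.T @ J for J in samples], axis=0)

err = np.linalg.norm(mc_mean - expected_metric(mu_J, Sigma)) / np.linalg.norm(mc_mean)
print(f"relative error of the moment formula: {err:.4f}")   # prints a small number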
In theory, the theory should have worked; in practice...

Remember how things used to look?

Let's do the math...

Assume noise-free infinite data and the same prior covariance as before. Recall that, when disregarding uncertainty, the metric collapses (or turns Euclidean) away from data. With uncertainty, the D * Sigma(z) term takes over: the predictive uncertainty of the Jacobian grows away from data, so the expected metric becomes large there and geodesics are pulled back towards the data. This, my friends, is called 'hope'.

Intermediate summary

Geometry is nice in theory, but it fails when disregarding uncertainty. With uncertainty there is hope!
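To see the 'hope' concretely, here is a small one-dimensional illustration of my own (not the slide's experiment): with a squared-exponential kernel, the metric built from the posterior mean alone collapses away from the data, whereas the expected metric picks up the growing predictive variance of the derivative and becomes large there. The toy decoder, the kernel parameters, and the number of output dimensions are all assumptions.

import numpy as np

rng = np.random.default_rng(2)
ell, s2 = 0.5, 1.0                               # SE kernel lengthscale and signal variance

def k(a, b):                                     # squared-exponential kernel matrix
    return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# Noise-free training data: 1-D latent inputs mapped through a smooth toy decoder.
Z = np.linspace(-2.0, 2.0, 25)
D = 50                                           # number of output dimensions (assumed)
W = rng.standard_normal((1, D))
Y = np.sin(Z)[:, None] @ W                       # (25, D) training outputs

K = k(Z, Z) + 1e-6 * np.eye(len(Z))              # jitter in place of observation noise
Kinv = np.linalg.inv(K)
alpha = Kinv @ Y

def metrics(z_star):
    """Deterministic and expected pullback metric at one latent point."""
    kz = k(np.array([z_star]), Z)[0]             # k(z*, Z)
    dk = -(z_star - Z) / ell ** 2 * kz           # d k(z*, Z) / d z*
    mean_jac = dk @ alpha                        # E[J(z*)], shape (D,)
    var_jac = s2 / ell ** 2 - dk @ Kinv @ dk     # predictive variance of the derivative
    M_det = mean_jac @ mean_jac                  # metric from the posterior mean only
    M_exp = M_det + D * max(var_jac, 0.0)        # expected metric
    return M_det, M_exp

print("near data:", metrics(0.5))                # both moderate; uncertainty adds little
print("far away :", metrics(10.0))               # mean-only metric ~ 0, expected metric large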