Variational Bayesian Monte Carlo
- Variational Inference + Bayesian Quadrature (Bayesian Monte Carlo)
- framework for approximate Bayesian inference for models with expensive likelihoods
- the basic version is axis-aligned, so it cannot deal well with highly correlated posteriors
Result
- Nonparametric analytical approximation of the posterior distribution over unobserved variables (parameters + latent variables), to do statistical inference over them.
- Approximate lower bound on the model evidence (marginal likelihood; ratios of marginal likelihoods give Bayes factors) of the observed data, which can be used for model selection. The idea is that the higher the marginal likelihood of a given model, the better that model fits the data, and hence the more probable it is that this model generated the data.
We can also use sampling-based methods (MCMC, e.g. Gibbs sampling) to approximate the intractable solution of an inference problem.
Why not MCMC?
- May take a huge amount of time to reach the correct solution, but will always do so given enough time (asymptotically exact).
- Requires choosing an appropriate sampling technique beforehand.
Variational Inference (Variational Bayesian Methods, Variational Bayes, VI)
- statistical tool for approximate inference
- to tackle intractable integrals
- considers inference as an optimisation problem
- almost never finds the global optima
- Diff b/w EM and VI : VI makes no distinction between latent variables and parameters, whereas EM treats them differently (distributions over latent variables, point estimates of parameters)
- approximate the intractable distribution p by a distribution q belonging to a tractable family Q
Kullback-Leibler Divergence
- measures the difference in information contained within two distributions q and p
- >= 0 for all q, p
- = 0 iff q = p
- not symmetric
KL(q || p) = \sum_x q(x) log( q(x) / p(x) )
Optimisation objective - J(q) - captures similarity b/w q and p
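A minimal numeric sketch of the KL divergence above (Python; the two distributions are illustrative):
```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL(q || p) = sum_x q(x) * log(q(x) / p(x))."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(x) = 0 contribute 0 by convention
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(kl_divergence(q, p), kl_divergence(p, q))  # both >= 0, and the two values differ (not symmetric)
```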
Bayesian Quadrature (BQ)
- quadrature : numerically computing an area/integral; here, the solution is expressed as an integral
- to obtain Bayesian estimates of mean and variance of non-analytic integrals in variational objective
- model-based integration - allows active learning about a single fixed integral and minimizing its variance; requires fewer samples than MCMC
- usually a Gaussian (SE) kernel is used in the GP prior, otherwise the integrals are intractable
- works well in low-dim (< 10)
- Acq func eg 1 - expected entropy : minimize expected entropy after adding x to training set
- Acq func eg 2 - uncertainty sampling strategy : maximize variance of integrand at x
\langle f \rangle = \int f(x) \pi(x) dx
Here, f(x) has a GP prior and \pi(x) is a known probability distribution.
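A toy 1-D sketch of Bayesian quadrature for this integral, assuming an SE-kernel GP prior on f and a standard normal \pi(x), for which the kernel integrals are analytic (illustrative only, not the paper's implementation):
```python
import numpy as np

ell, s2 = 0.8, 1.0                       # SE kernel length scale and output variance
x = np.linspace(-2.0, 2.0, 7)            # quadrature nodes
f = np.sin(x) + 0.5 * x**2               # "expensive" integrand evaluated at the nodes

K = s2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2) + 1e-8 * np.eye(len(x))

# Kernel mean embedding z_i = \int k(x, x_i) N(x; 0, 1) dx (closed form for the SE kernel)
z = s2 * np.sqrt(ell**2 / (ell**2 + 1)) * np.exp(-0.5 * x**2 / (ell**2 + 1))

mean_integral = z @ np.linalg.solve(K, f)                                        # BQ posterior mean
var_integral = s2 * np.sqrt(ell**2 / (ell**2 + 2)) - z @ np.linalg.solve(K, z)   # BQ posterior variance
print(mean_integral, var_integral)       # Bayesian estimates of mean and variance of the integral
```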
Gaussian Process
- inf dim Gaussian r.v.
- stochastic process - a system randomly changing over time; the process can evolve in many different directions (realizations); eg. Brownian motion
- kernel based prob distri
- to specify prior over unknown functions in Bayesian inference
- real-valued mean function m and PSD covar/kernel function K
- Why PSD? K should be such that for any set of points x1, . . . , xm \in X, the resulting covariance matrix is a valid covariance matrix of some MVN. This happens exactly when K is PSD, i.e. K satisfies Mercer's condition.
- any valid kernel function can be used as covar func.
- any finite subcollection of random variables has a multivariate Gaussian distribution.
- smoothness of the prior comes from covar
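A small sketch of drawing functions from a GP prior with an SE covariance; the smoothness of the samples comes entirely from the kernel choice (hyperparameters are illustrative):
```python
import numpy as np

def se_kernel(x1, x2, ell=1.0, s2=1.0):
    """Squared-exponential (PSD) covariance function."""
    return s2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

xs = np.linspace(-3, 3, 100)
K = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # small jitter for numerical stability
# Any finite subcollection of a GP is multivariate Gaussian, so sample from N(0, K)
samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(xs)), K, size=3)
```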
Active Sampling
- used when the variance of the posterior over the integral depends on the function values, so the evaluation points cannot be fixed in advance
- acquisition function [a : X -> R] - determines how we search for new points, by solving the proxy optimisation problem x_new = argmax_x a(x)
- for a GP prior, the acq. func. is a function of the mean of f(x*), the std of f(x*), and the best value y_best seen so far during optimisation.
VBMC Algo
In each iteration t,
- sequentially sample a batch of n_active new points that maximise the acquisition func. a(\theta), and evaluate the log joint f at each point
- train a GP surrogate model of the log joint f; the training set is all points evaluated so far
- update variational posterior approx. by optimising surrogate ELBO calculated via Bayesian quadrature.
VI using a GP as surrogate model f for the expensive log posterior; the GP is kept up to date via active sampling.
In each iteration except the first, VBMC samples n_active (=5) points. Each point is selected sequentially by optimising the acquisition func., applying fast rank-one updates of the GP posterior after each acquisition.
No active sampling in the first iteration, so that the variational posterior can first adapt.
The algo works in an unconstrained inference space R^D, but parameters with bound constraints can be handled via a nonlinear remapping of the input space through a shifted and rescaled logit transform, with a Jacobian correction of the log probability density. Solutions are mapped back to the original space via a matched inverse transform, e.g. a shifted and rescaled logistic function for bounded parameters (see the sketch below).
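A minimal sketch of this kind of remapping for a parameter bounded in (a, b), with the Jacobian correction of the log density (illustrative, not the toolbox's exact parameterization):
```python
import numpy as np

def to_unconstrained(x, a, b):
    """Shifted & rescaled logit: maps x in (a, b) to u in R."""
    z = (x - a) / (b - a)
    return np.log(z) - np.log1p(-z)

def to_constrained(u, a, b):
    """Matched inverse (shifted & rescaled logistic): maps u in R back to (a, b)."""
    return a + (b - a) / (1.0 + np.exp(-u))

def log_density_unconstrained(u, a, b, log_density_x):
    """Log density in the unconstrained space = original log density + log |dx/du|."""
    x = to_constrained(u, a, b)
    s = 1.0 / (1.0 + np.exp(-u))                        # logistic sigmoid
    log_jacobian = np.log(b - a) + np.log(s) + np.log1p(-s)
    return log_density_x(x) + log_jacobian
```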
Variational Posterior q(\theta)
- from a non-parametric (flexible) family
- mixture of K Gaussians with a shared diagonal covariance matrix \Sigma = diag(\lambda^2), rescaled per component by a scaling factor ; w_k = weight of kth comp ; \mu_k = mean of kth comp ; \sigma_k = scale of kth comp
- \phi = (w1,...,wK,\mu1,...,\muK,\sigma1,...,\sigmaK,\lambda) ; \phi has K + DK + K + D parameters
- K is set adaptively (K=2 initially)
- Variational parameters at the t-th iteration: φ_t
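A sketch of evaluating this variational posterior, assuming the form q(\theta) = \sum_k w_k N(\theta; \mu_k, \sigma_k^2 diag(\lambda^2)) (values below are illustrative):
```python
import numpy as np
from scipy.stats import multivariate_normal

def q_density(theta, w, mu, sigma, lam):
    """Mixture of K Gaussians with shared diagonal covariance diag(lam^2),
    rescaled per component by sigma_k^2."""
    dens = 0.0
    for k in range(len(w)):
        cov = np.diag((sigma[k] * lam) ** 2)
        dens += w[k] * multivariate_normal.pdf(theta, mean=mu[k], cov=cov)
    return dens

# Example: K = 2 components in D = 2 dimensions
w = np.array([0.6, 0.4])                   # mixture weights
mu = np.array([[0.0, 0.0], [1.0, -1.0]])   # component means
sigma = np.array([1.0, 0.5])               # per-component scales
lam = np.array([1.0, 2.0])                 # shared per-dimension scales
print(q_density(np.array([0.5, 0.0]), w, mu, sigma, lam))
```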
ELBO (Evidence Lower Bound, negative free energy)
- needs to be approximated
- in 2 ways:
- approximate the log joint prob f with a GP with a (rescaled) SE kernel, a Gaussian likelihood with obs. noise σ_obs > 0 (for numerical stability), and a neg. quadratic mean func m(x) to ensure finiteness of the variational objective
- approximate the entropy of the variational posterior q(θ) and its gradient via Monte Carlo sampling & the reparametrization trick (see the sketch below).
- using the mean expected log joint E_φ[f], the entropy, and their gradients, optimise the negative mean ELBO via stochastic gradient descent (SGD)
GP f : SE kernel, Gaussian likelihood, negative quadratic mean function. GP hyperparams are estimated via MCMC while there is large uncertainty about the GP, and later via MAP estimates using gradient-based optimisation.
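A sketch of the entropy part (Monte Carlo with the reparametrization trick), using the same mixture form as in the q(θ) sketch above; only the value is computed here, but the reparametrization θ = μ_k + σ_k λ ∘ ε is what makes the estimator differentiable w.r.t. the variational parameters:
```python
import numpy as np

def entropy_mc(w, mu, sigma, lam, n_samples=1000, seed=0):
    """Monte Carlo estimate of the entropy H[q] = -E_q[log q(theta)]."""
    rng = np.random.default_rng(seed)
    K, D = mu.shape
    ks = rng.choice(K, size=n_samples, p=w)                  # sample mixture components
    eps = rng.standard_normal((n_samples, D))                # base noise
    theta = mu[ks] + sigma[ks, None] * lam[None, :] * eps    # reparametrized samples
    # log q(theta) for each sample via log-sum-exp over components
    log_comp = np.empty((n_samples, K))
    for k in range(K):
        var = (sigma[k] * lam) ** 2
        log_comp[:, k] = (np.log(w[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * var))
                          - 0.5 * np.sum((theta - mu[k]) ** 2 / var, axis=1))
    log_q = np.logaddexp.reduce(log_comp, axis=1)
    return -np.mean(log_q)
```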
ELCBO (Evidence Lower Confidence Bound)
- first 2 terms : the ELBO, estimated via Bayesian Quadrature
- 3rd term : the uncertainty in the computation of the expected log joint, multiplied by a risk-sensitivity parameter \beta_LCB
- probabilistic lower bound on ELBO
- assess improvement of variational solution
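Explicitly (as in the paper, with \beta_LCB the risk-sensitivity parameter): ELCBO(\phi, f) = E[ E_\phi[f] ] + H[q(\theta)] - \beta_LCB * sqrt( Var[ E_\phi[f] ] )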
In the VBMC algo, do active sampling sequentially to find a sequence of integrals across iterations 1, ..., T s.t.
- the sequence of variational parameters converges to a variational posterior that minimizes the KL divergence with the true posterior
- the variance of the final estimate of the ELBO is minimized
2 acquisition functions for VBMC based on uncertainty sampling (operate pointwise on posterior density)
- Vanilla uncertainty sampling a_us : for exploitation; maximizes the variance of the integrand under the current variational parameters
- Prospective uncertainty sampling a_pro(\theta) : for exploration; reduces uncertainty of the variational objective both for the current posterior and at prospective locations where the variational posterior might move; selects points from regions of high probability density; the variational posterior acts as a regularizer, preventing active sampling from following too eagerly the fluctuations of the GP mean (see the sketch below)
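A sketch of the two acquisition functions, assuming the forms a_us(θ) = q(θ)² V[f(θ)] and a_pro(θ) = q(θ) exp(m(θ)) V[f(θ)], where m and V are the GP posterior mean and variance of the log joint (the exact forms and names here are assumptions based on the description above):
```python
import numpy as np

def a_us(theta, gp_var, q):
    """Vanilla uncertainty sampling: variance of the integrand q(theta) * f(theta)
    under the current variational posterior q."""
    return q(theta) ** 2 * gp_var(theta)

def a_pro(theta, gp_mean, gp_var, q):
    """Prospective uncertainty sampling: also rewards regions where the GP posterior mean
    of the log joint is high, i.e. where the variational posterior might move next."""
    return q(theta) * np.exp(gp_mean(theta)) * gp_var(theta)
```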
Adaptive treatment of GP hyperparam
- the GP has 3D+3 hyperparameters
- empirical Bayes prior based on current training set
- in each iteration, collect n_gp samples.
- Using these samples, compute the expected mean and variance of any r.v. X that depends on the GP hyperparameters.
- The algo switches to MAP estimation of the hyperparameters via gradient-based opt. when the variability of the expected log joint across hyperparameter samples falls below a threshold
Initialization
- x_0 = starting point from region of high posterior prob mass.
- PLB, PUB = vectors (plausible lower/upper bounds) identifying a region of high posterior prob mass in param space
- n_init = 10 ; sample 10 points uniformly at random in the plausible box, Uniform[PLB, PUB]
- Plausible box sets reference scale for each var.
Warm up
- to converge faster to regions of high posterior prob
- initialize variational posterior with K=2 comp
- ends when the ELCBO improves by less than 1 for 3 consecutive iters.
- then trim the training set (removing points with very low values of the log joint)
Adaptive num of mixture comps - K
- the variational sol. is improving if the ELCBO of the last iter is higher than the ELCBO of each of the last n_recent (=4) iters
- in each iter, increment K by 1 if the sol. is improving (improvement in the ELBO)
- to speed up adaptation, add 2 extra comps if the sol. is stable. Each new comp is created by splitting and jittering a random existing comp.
- At the end of each variational opt., prune random comps k with w_k < w_min
Termination
- assign reliability index to current sol
- terminate when the solution has been stable for n_stable (=8) iters, or when reaching n_max func evals
- return the estimated mean and std of the ELBO (a lower bound on the marginal likelihood) & the variational posterior
Future Work
- use the plausible box to inform other aspects of the algo (the plausible box sets a reference scale for each variable).
Variational whitening
- technique to deal with non-axis-aligned posteriors, which are problematic for VBMC, especially in the presence of noise.
- W transforms the inference space linearly (rotation and rescaling) s.t. q(\theta) has a unit diagonal covariance matrix C_\phi
- Find W by doing an SVD of C_\phi (see the sketch below)
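A minimal sketch of computing a whitening transform W from the covariance C_\phi of the current variational posterior (illustrative values):
```python
import numpy as np

def whitening_transform(C):
    """Return a linear map W (rotation + rescaling) such that W C W^T = I."""
    U, s, _ = np.linalg.svd(C)          # C is symmetric PSD, so SVD = eigendecomposition
    return np.diag(1.0 / np.sqrt(s)) @ U.T

# Example: a strongly correlated 2-D posterior covariance
C = np.array([[2.0, 1.8],
              [1.8, 2.0]])
W = whitening_transform(C)
print(np.round(W @ C @ W.T, 6))         # ~ identity matrix after whitening
```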
Acq func:
- a_npro (noisy prospective uncertainty sampling)
- Global Acq func:
- driven by uncertainty in posterior mass
- account for non-local changes in GP model when making new obs
EIG (Expected Information Gain)
- sample points that maximize the EIG about the integral G (eqn 2 in the paper)
- choose next location θ* that maximizes mutual information I[G;y*] ; G=expected log joint, y* = new obs
IMIQR/VIQR
- IQR (interquantile range) : estimate of uncertainty of unnormalized posterior
- the integral is intractable, so it is approximated using MCMC and importance sampling
Applications
Application of Kriging and Variational Bayesian Monte Carlo method for improved prediction of doped UO2 fission gas release