Variational Bayesian Monte Carlo
- Variational Inference + Bayesian Quadrature (Bayesian Monte Carlo)
- framework for approximate Bayesian inference for models with expensive likelihoods
- the basic version is axis-aligned, so it cannot deal well with highly correlated posteriors
Result
- Nonparametric analytical approximation of the posterior distribution over unobserved variables (parameters + latent variables), to do statistical inference over them.
- Approximate lower bound on the model evidence (marginal likelihood; ratios of marginal likelihoods give Bayes factors) of the observed data, which can be used for model selection. The idea is that the higher the marginal likelihood of a given model, the better that model fits the data, and hence the more probable it is that this model generated the data.
We can also use sampling-based methods (MCMC, e.g. Gibbs sampling) to approximate the intractable solution of an inference problem.
Why not MCMC?
- May take a huge amount of time to reach the correct solution, but will always do so given enough time (asymptotically exact).
- Requires choosing an appropriate sampling technique beforehand.
Variational Inference (Variational Bayesian Methods, Variational Bayes, VI)
- statistical tool for approximate inference
- to tackle intractable integrals
- considers inference as an optimisation problem
- almost never finds the global optima
- Diff b/w EM and VI : VI makes no distinction between latent variables and parameters, whereas EM treats them differently (distributions over latent variables, point estimates of parameters)
- approximate the intractable distribution p by a distribution q belonging to a tractable family Q
Kullback-Leibler Divergence
- measures the difference in information contained within two distributions q and p
- >= 0 for all q, p
- = 0 iff q = p
- not symmetric
KL(q || p) = \sum_x q(x) log( q(x) / p(x) )
Optimisation objective - J(q) - captures similarity b/w q and p
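A minimal numeric sketch of the KL divergence above (Python; the two distributions are illustrative):
```python
import numpy as np

def kl_divergence(q, p):
    """Discrete KL(q || p) = sum_x q(x) * log(q(x) / p(x))."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(x) = 0 contribute 0 by convention
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(kl_divergence(q, p), kl_divergence(p, q))  # both >= 0, and the two values differ (not symmetric)
```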
Bayesian Quadrature (BQ)
- quadrature : numerically computing an area/integral; here, the solution is expressed as an integral
- to obtain Bayesian estimates of mean and variance of non-analytic integrals in variational objective
- model-based integration - allows active learning about a single fixed integral and minimizing its variance; requires fewer samples than MCMC
- usually a Gaussian (SE) kernel is used in the GP prior, otherwise the integrals are intractable
- works well in low-dim (< 10)
- Acq func eg 1 - expected entropy : minimize expected entropy after adding x to training set
- Acq func eg 2 - uncertainty sampling strategy : maximize variance of integrand at x
\langle f \rangle = \int f(x) \pi(x) dx
Here, f(x) has a GP prior and \pi(x) is a known probability distribution.
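A toy 1-D sketch of Bayesian quadrature for this integral, assuming an SE-kernel GP prior on f and a standard normal \pi(x), for which the kernel integrals are analytic (illustrative only, not the paper's implementation):
```python
import numpy as np

ell, s2 = 0.8, 1.0                       # SE kernel length scale and output variance
x = np.linspace(-2.0, 2.0, 7)            # quadrature nodes
f = np.sin(x) + 0.5 * x**2               # "expensive" integrand evaluated at the nodes

K = s2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2) + 1e-8 * np.eye(len(x))

# Kernel mean embedding z_i = \int k(x, x_i) N(x; 0, 1) dx (closed form for the SE kernel)
z = s2 * np.sqrt(ell**2 / (ell**2 + 1)) * np.exp(-0.5 * x**2 / (ell**2 + 1))

mean_integral = z @ np.linalg.solve(K, f)                                        # BQ posterior mean
var_integral = s2 * np.sqrt(ell**2 / (ell**2 + 2)) - z @ np.linalg.solve(K, z)   # BQ posterior variance
print(mean_integral, var_integral)       # Bayesian estimates of mean and variance of the integral
```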
Gaussian Process
- inf dim Gaussian r.v.
- stochastic process - a system randomly changing over time; the process can evolve in many different directions (realizations); eg. Brownian motion
- kernel based prob distri
- to specify prior over unknown functions in Bayesian inference
- real-valued mean function m and PSD covar/kernel function K
- Why PSD? K should be such that for any set of points x1, . . . , xm \in X, the resulting covariance matrix is a valid covariance matrix of some MVN. This happens exactly when K is PSD, i.e. K satisfies Mercer's condition.
- any valid kernel function can be used as covar func.
- any finite subcollection of random variables has a multivariate Gaussian distribution.
- smoothness of the prior comes from covar
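A small sketch of drawing functions from a GP prior with an SE covariance; the smoothness of the samples comes entirely from the kernel choice (hyperparameters are illustrative):
```python
import numpy as np

def se_kernel(x1, x2, ell=1.0, s2=1.0):
    """Squared-exponential (PSD) covariance function."""
    return s2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

xs = np.linspace(-3, 3, 100)
K = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # small jitter for numerical stability
# Any finite subcollection of a GP is multivariate Gaussian, so sample from N(0, K)
samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(xs)), K, size=3)
```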
Active Sampling
- used when the variance of the posterior over the integral depends on the function values, so the evaluation points cannot be fixed in advance
- acquisition function [a : X -> R] - determines how we search for new points, by solving the proxy optimisation problem x_new = argmax_x a(x)
- for a GP prior, the acq. func. is a function of the mean of f(x*), the std of f(x*), and the best value y_best seen so far during optimisation.
VBMC Algo
In each iteration t,
- sequentially sample a batch of n_active new points that maximise the acquisition func. a(\theta), and evaluate the log joint f at each point
- train a GP surrogate model of the log joint f; the training set is all points evaluated so far
- update variational posterior approx. by optimising surrogate ELBO calculated via Bayesian quadrature.
VI using a GP as surrogate model f for the expensive log posterior; the GP is kept up to date via active sampling.
In each iteration except the first, VBMC samples n_active (=5) points. Each point is selected sequentially by optimising the acquisition func., applying fast rank-one updates of the GP posterior after each acquisition.
No active sampling in the first iteration, so that the variational posterior can first adapt.
The algo works in an unconstrained inference space R^D, but parameters with bound constraints can be handled via a nonlinear remapping of the input space through a shifted and rescaled logit transform, with a Jacobian correction of the log probability density. Solutions are mapped back to the original space via a matched inverse transform, e.g. a shifted and rescaled logistic function for bounded parameters (see the sketch below).
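A minimal sketch of this kind of remapping for a parameter bounded in (a, b), with the Jacobian correction of the log density (illustrative, not the toolbox's exact parameterization):
```python
import numpy as np

def to_unconstrained(x, a, b):
    """Shifted & rescaled logit: maps x in (a, b) to u in R."""
    z = (x - a) / (b - a)
    return np.log(z) - np.log1p(-z)

def to_constrained(u, a, b):
    """Matched inverse (shifted & rescaled logistic): maps u in R back to (a, b)."""
    return a + (b - a) / (1.0 + np.exp(-u))

def log_density_unconstrained(u, a, b, log_density_x):
    """Log density in the unconstrained space = original log density + log |dx/du|."""
    x = to_constrained(u, a, b)
    s = 1.0 / (1.0 + np.exp(-u))                        # logistic sigmoid
    log_jacobian = np.log(b - a) + np.log(s) + np.log1p(-s)
    return log_density_x(x) + log_jacobian
```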
Variational Posterior q(\theta)
- from a non-parametric (flexible) family
- mixture of K Gaussians with a shared diagonal covariance matrix \Sigma = diag(\lambda^2), rescaled per component by a scaling factor ; w_k = weight of kth comp ; \mu_k = mean of kth comp ; \sigma_k = scale of kth comp
- \phi = (w1,...,wK,\mu1,...,\muK,\sigma1,...,\sigmaK,\lambda) ; \phi has K + DK + K + D parameters
- K is set adaptively (K=2 initially)
- Variational parameters at the t-th iteration: φ_t
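A sketch of evaluating this variational posterior, assuming the form q(\theta) = \sum_k w_k N(\theta; \mu_k, \sigma_k^2 diag(\lambda^2)) (values below are illustrative):
```python
import numpy as np
from scipy.stats import multivariate_normal

def q_density(theta, w, mu, sigma, lam):
    """Mixture of K Gaussians with shared diagonal covariance diag(lam^2),
    rescaled per component by sigma_k^2."""
    dens = 0.0
    for k in range(len(w)):
        cov = np.diag((sigma[k] * lam) ** 2)
        dens += w[k] * multivariate_normal.pdf(theta, mean=mu[k], cov=cov)
    return dens

# Example: K = 2 components in D = 2 dimensions
w = np.array([0.6, 0.4])                   # mixture weights
mu = np.array([[0.0, 0.0], [1.0, -1.0]])   # component means
sigma = np.array([1.0, 0.5])               # per-component scales
lam = np.array([1.0, 2.0])                 # shared per-dimension scales
print(q_density(np.array([0.5, 0.0]), w, mu, sigma, lam))
```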
ELBO (Evidence Lower Bound, negative free energy)
- needs to be approximated
- in 2 ways:
- approximate the log joint prob f with a GP with a (rescaled) SE kernel, a Gaussian likelihood with obs. noise σ_obs > 0 (for numerical stability), and a neg. quadratic mean func m(x) to ensure finiteness of the variational objective
- approximate the entropy of the variational posterior q(θ) and its gradient via Monte Carlo sampling & the reparametrization trick (see the sketch below).
- using the mean expected log joint E_φ[f], the entropy, and their gradients, optimise the negative mean ELBO via stochastic gradient descent (SGD)
GP f : SE kernel, Gaussian likelihood, negative quadratic mean function. GP hyperparams are estimated via MCMC while there is large uncertainty about the GP, and later via MAP estimates using gradient-based optimisation.
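A sketch of the entropy part (Monte Carlo with the reparametrization trick), using the same mixture form as in the q(θ) sketch above; only the value is computed here, but the reparametrization θ = μ_k + σ_k λ ∘ ε is what makes the estimator differentiable w.r.t. the variational parameters:
```python
import numpy as np

def entropy_mc(w, mu, sigma, lam, n_samples=1000, seed=0):
    """Monte Carlo estimate of the entropy H[q] = -E_q[log q(theta)]."""
    rng = np.random.default_rng(seed)
    K, D = mu.shape
    ks = rng.choice(K, size=n_samples, p=w)                  # sample mixture components
    eps = rng.standard_normal((n_samples, D))                # base noise
    theta = mu[ks] + sigma[ks, None] * lam[None, :] * eps    # reparametrized samples
    # log q(theta) for each sample via log-sum-exp over components
    log_comp = np.empty((n_samples, K))
    for k in range(K):
        var = (sigma[k] * lam) ** 2
        log_comp[:, k] = (np.log(w[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * var))
                          - 0.5 * np.sum((theta - mu[k]) ** 2 / var, axis=1))
    log_q = np.logaddexp.reduce(log_comp, axis=1)
    return -np.mean(log_q)
```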
ELCBO (Evidence Lower Confidence Bound)
- first 2 terms : the ELBO, estimated via Bayesian Quadrature
- 3rd term : the uncertainty in the computation of the expected log joint, multiplied by a risk-sensitivity parameter \beta_LCB
- probabilistic lower bound on ELBO
- assess improvement of variational solution
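Explicitly (as in the paper, with \beta_LCB the risk-sensitivity parameter): ELCBO(\phi, f) = E[ E_\phi[f] ] + H[q(\theta)] - \beta_LCB * sqrt( Var[ E_\phi[f] ] )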
In the VBMC algo, do active sampling sequentially to find a sequence of integrals across iterations 1, ..., T s.t.
- the sequence of variational parameters converges to a variational posterior that minimizes the KL divergence with the true posterior
- the variance of the final estimate of the ELBO is minimized
2 acquisition functions for VBMC based on uncertainty sampling (operate pointwise on posterior density)
- Vanilla uncertainty sampling a_us : for exploitation; maximizes the variance of the integrand under the current variational parameters
- Prospective uncertainty sampling a_pro(\theta) : for exploration; reduces uncertainty of the variational objective both for the current posterior and at prospective locations where the variational posterior might move; selects points from regions of high probability density; the variational posterior acts as a regularizer, preventing active sampling from following too eagerly the fluctuations of the GP mean (see the sketch below)
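A sketch of the two acquisition functions, assuming the forms a_us(θ) = q(θ)² V[f(θ)] and a_pro(θ) = q(θ) exp(m(θ)) V[f(θ)], where m and V are the GP posterior mean and variance of the log joint (the exact forms and names here are assumptions based on the description above):
```python
import numpy as np

def a_us(theta, gp_var, q):
    """Vanilla uncertainty sampling: variance of the integrand q(theta) * f(theta)
    under the current variational posterior q."""
    return q(theta) ** 2 * gp_var(theta)

def a_pro(theta, gp_mean, gp_var, q):
    """Prospective uncertainty sampling: also rewards regions where the GP posterior mean
    of the log joint is high, i.e. where the variational posterior might move next."""
    return q(theta) * np.exp(gp_mean(theta)) * gp_var(theta)
```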
Adaptive treatment of GP hyperparam
- the GP has 3D+3 hyperparameters
- empirical Bayes prior based on current training set
- in each iteration, collect n_gp samples.
- Using these samples, compute the expected mean and variance of any r.v. X that depends on the GP hyperparameters.
- The algo switches to MAP estimation of the hyperparameters via gradient-based opt. when the variability of the expected log joint across hyperparameter samples falls below a threshold
Initialization
- x_0 = starting point from region of high posterior prob mass.
- PLB, PUB = vectors (plausible lower/upper bounds) identifying a region of high posterior prob mass in param space
- n_init = 10 ; sample 10 points uniformly at random in the plausible box, Uniform[PLB, PUB]
- Plausible box sets reference scale for each var.
Warm up
- to converge faster to regions of high posterior prob
- initialize variational posterior with K=2 comp
- ends when the ELCBO improves by less than 1 for 3 consecutive iters.
- then trim the training set (removing points with very low values of the log joint)
Adaptive num of mixture comps - K
- the variational sol. is improving if the ELCBO of the last iter is higher than the ELCBO of each of the last n_recent (=4) iters
- in each iter, increment K by 1 if the sol. is improving (improvement in the ELBO)
- to speed up adaptation, add 2 extra comps if the sol. is stable. Each new comp is created by splitting and jittering a random existing comp.
- At the end of each variational opt., prune random comps k with w_k < w_min
Termination
- assign reliability index to current sol
- terminate when the solution has been stable for n_stable (=8) iters, or when reaching n_max func evals
- return the estimated mean and std of the ELBO (a lower bound on the marginal likelihood) & the variational posterior
Future Work
- use the plausible box to inform other aspects of the algo (the plausible box sets a reference scale for each variable).
Variational whitening
- technique to deal with non-axis-aligned posteriors, which are problematic for VBMC, especially in the presence of noise.
- W transforms the inference space linearly (rotation and rescaling) s.t. q(\theta) has a unit diagonal covariance matrix C_\phi
- Find W by doing an SVD of C_\phi (see the sketch below)
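A minimal sketch of computing a whitening transform W from the covariance C_\phi of the current variational posterior (illustrative values):
```python
import numpy as np

def whitening_transform(C):
    """Return a linear map W (rotation + rescaling) such that W C W^T = I."""
    U, s, _ = np.linalg.svd(C)          # C is symmetric PSD, so SVD = eigendecomposition
    return np.diag(1.0 / np.sqrt(s)) @ U.T

# Example: a strongly correlated 2-D posterior covariance
C = np.array([[2.0, 1.8],
              [1.8, 2.0]])
W = whitening_transform(C)
print(np.round(W @ C @ W.T, 6))         # ~ identity matrix after whitening
```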
Acq func:
- a_npro (noisy prospective uncertainty sampling)
- Global Acq func:
- driven by uncertainty in posterior mass
- account for non-local changes in GP model when making new obs
EIG (Expected Information Gain)
- sample points that maximize the EIG about the integral G (eqn 2 in the paper)
- choose next location θ* that maximizes mutual information I[G;y*] ; G=expected log joint, y* = new obs
IMIQR/VIQR
- IQR (interquantile range) : estimate of uncertainty of unnormalized posterior
- the integral is intractable, so it is approximated using MCMC and importance sampling
Applications
Application of Kriging and Variational Bayesian Monte Carlo method for improved prediction of doped UO2 fission gas release