Paper : arXiv

Bayesian Batch Active Learning as Sparse Subset Approximation (NeurIPS, 2019)

Large pools of unlabelled data can be used to improve supervised models. Probabilistic active learning (AL) methods greedily select the most informative data points to be labelled. However, greedy one-point-at-a-time selection is computationally infeasible at scale, so only a subset (batch) of points D' can be selected at once, and the model does not change while the batch is being built, which leads to redundant, correlated queries.
The paper therefore proposes a Bayesian batch active learning approach that selects the batch whose induced posterior best approximates the complete data posterior over the model parameters. The resulting algorithm produces diverse batches that enable efficient active learning.

AL

No access to the labels of the pool set before querying.

Approach


Initial Labelled Dataset : D_0
Unlabelled Poolset : X_p
Corresponding labels : Y_p

Batch Construction as Sparse Approximation

Choose the batch that best approximates \sum_m L_m(θ), since the first term in Eqn. 4 depends only on D_0. Batches are constructed by solving w* = argmin_w ||L − L(w)||^2 in the Hilbert space induced by an inner product ⟨L_n, L_m⟩. The choice of inner product introduces a notion of directionality into the optimization procedure; thus, query batches are constructed while also accounting for the similarity between selected points.
On the probit regression task: for BALD, a batch of the b = 10 most informative points is chosen. The BALD acquisition function does not change during batch construction, whereas the ACS acquisition function rotates after each selected data point. As a result, ACS-FW is able to spread the batch across the data space, avoiding the strongly correlated queries that BALD produces.

Q) What does this quantity mean? How is it used to select a batch D' of inputs via the weights w?
We re-cast the batch construction process as a sparse approximation to the complete data log posterior defined in Eqn. 4. Since the first term in Eqn. 4 depends only on D_0, we select a batch D' that best approximates the second term of Eqn. 4, i.e. \sum_m L_m(θ). Here, the L_m are treated as vectors in function space. A binary weight vector w ∈ {0, 1}^M indicates which points are included in the batch D', so we write L(w) = \sum_m w_m L_m. The sparse subset approximation problem then becomes solving for the optimal weights w* = argmin_w ||L − L(w)||^2. Minimizing this objective makes the posterior induced by the selected batch lie as close as possible to the expected complete data log posterior, i.e. what we would obtain if we had observed labels for the complete pool.
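To make the objective concrete, here is a minimal Python sketch (not from the paper; all names are illustrative). It represents each L_m by an explicit finite-dimensional vector, as one would obtain e.g. from the random feature projection described further below, and checks that ||L − L(w)||^2 can be computed purely from the Gram matrix of inner products:

import numpy as np

rng = np.random.default_rng(0)
M, J = 6, 50                              # M pool points, J-dimensional stand-ins for the L_m
L_feats = rng.normal(size=(M, J))         # row m plays the role of L_m

L_full = L_feats.sum(axis=0)              # L = sum_m L_m (complete-pool term)
w = np.array([1., 0., 1., 0., 0., 1.])    # binary weights: which points enter D'
L_w = w @ L_feats                         # L(w) = sum_m w_m L_m

direct = np.sum((L_full - L_w) ** 2)      # ||L - L(w)||^2 computed directly
K = L_feats @ L_feats.T                   # Gram matrix K[n, m] = <L_n, L_m>
via_K = (1 - w) @ K @ (1 - w)             # same objective written as (1 - w)^T K (1 - w)
print(np.isclose(direct, via_K))          # True: the two forms agree

This identity is why only the pairwise inner products ⟨L_n, L_m⟩ are needed for batch construction, which leads to the relaxed objective below.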

The relaxed objective that is minimized is w* = argmin_w (1−w)^T K (1−w), where K is the Gram matrix with entries K_nm = ⟨L_n, L_m⟩. It is obtained by relaxing the weight vector w from binary to non-negative and by replacing the cardinality constraint with a polytope constraint in the Hilbert space. This objective is solved with the Frank-Wolfe algorithm, run for b iterations to obtain the optimal w*. The linearized objective is optimized at a vertex of the polytope, so in each iteration the weights are updated towards the chosen vertex f of the polytope. This means at most one data point is added to the batch in every iteration.
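A minimal sketch of this Frank-Wolfe batch construction, assuming the Gram matrix K is precomputed (e.g. from the projections L̂_n described below). The vertex selection and closed-form line search are written out for the relaxed objective above, but the function and variable names are illustrative, not the authors' reference implementation:

import numpy as np

def frank_wolfe_batch(K, b):
    # K: (M, M) Gram matrix with K[n, m] = <L_n, L_m>; b: number of iterations (max batch size).
    M = K.shape[0]
    sigmas = np.sqrt(np.diag(K))               # sigma_n = ||L_n||
    sigma = sigmas.sum()
    w = np.zeros(M)
    for _ in range(b):
        r = 1.0 - w                            # residual coefficients: L - L(w) = sum_m r_m L_m
        f = int(np.argmax((K @ r) / sigmas))   # vertex whose L_f aligns best with the residual
        s = np.zeros(M)
        s[f] = sigma / sigmas[f]               # f-th vertex of the polytope {w >= 0, sum_n w_n sigma_n = sigma}
        d = s - w                              # Frank-Wolfe direction
        den = d @ K @ d
        if den <= 0.0:                         # already at the chosen vertex; nothing to update
            break
        gamma = np.clip((r @ K @ d) / den, 0.0, 1.0)   # closed-form line search on (1 - w)^T K (1 - w)
        w = (1.0 - gamma) * w + gamma * s      # at most one new point gains non-zero weight per iteration
    return w, np.flatnonzero(w > 0)            # final weights and indices of the selected batch D'

The support of the returned w gives the query batch D', of size at most b.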

Bayesian linear regression and probit regression: in the paper, the weighted Fisher inner product is used to obtain closed-form expressions for these models. For general models, where the acquisition function may not be available in closed form, they suggest using random feature projections, which allow the key quantities needed during batch construction to be approximated. This method scales linearly in the pool-set size |P|. The projection is L̂_n = (1/√J) [L_n(θ_1), ..., L_n(θ_J)]^T with θ_j ∼ π̂, i.e. L̂_n is the J-dimensional projection of L_n into Euclidean space.
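A sketch of how such a projection could be computed by Monte Carlo, assuming access to J samples θ_j from the approximate posterior π̂ and a function that evaluates L_n(θ) for a pool point; log_lik_term, pool_X and posterior_samples are hypothetical names used only for illustration:

import numpy as np

def project_pool(log_lik_term, pool_X, posterior_samples):
    # L_hat[n, j] = L_n(theta_j) / sqrt(J): J-dimensional Euclidean projection of L_n.
    J = len(posterior_samples)
    L_hat = np.array([[log_lik_term(x, theta) for theta in posterior_samples]
                      for x in pool_X]) / np.sqrt(J)
    return L_hat                               # shape (|P|, J)

# The inner products needed for batch construction are then approximated by
#   K[n, m] = (L_hat @ L_hat.T)[n, m] = (1/J) * sum_j L_n(theta_j) * L_m(theta_j),
# a Monte Carlo estimate of <L_n, L_m>.

In practice K need not be formed explicitly: products such as K @ r can be evaluated as L_hat @ (L_hat.T @ r), which keeps the per-iteration cost linear in the pool size |P|, consistent with the linear-scaling claim above.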