Prerequisites for Multi-Task Bayesian Optimization (arXiv:1904.01049)

Target paper: Practical Multi-Task Scalable Bayesian Optimization

Goal: Read the paper deeply (follow derivations, kernel construction, and theoretical results)

Profile: Solid multivariable calculus + linear algebra; introductory probability; surface-level GP and BO familiarity

Estimated pace: ~10 hrs/week → 4–5 weeks to paper readiness


Dependency Graph

Multivariate Gaussian conditioning
        ↓
Bayesian inference (formally)
        ↓
GP regression (derive predictive posterior)
        ↓                    ↓
Kernel theory           Marginal likelihood + LOO-CV + hyperparameter learning
        ↓                    ↓
        └─────────┬──────────┘
                  ↓
      Bayesian optimization (EI / UCB derivations)
                  ↓
      Multi-task GP kernels (LMC / ICM)
                  ↓
             📄 The Paper

Week 1: Multivariate Gaussians & Bayesian Foundations (~10 hrs)

🎯 Goal: Derive the Gaussian conditional and marginal from scratch; understand Bayesian inference as posterior computation.

The GP posterior is a Gaussian conditional. Everything downstream depends on internalizing this one formula.

📚 Resources

| Resource | Sections | Format | Time |
| --- | --- | --- | --- |
| Bishop, PRML | §2.3 (Gaussian Distribution) | Textbook | 3 hrs |
| Murphy, Probabilistic ML: An Introduction | §2.3 (Gaussians), §4.2 (Bayesian inference for Gaussians) | Textbook | 3 hrs |
| Petersen & Pedersen, The Matrix Cookbook | §8 (block matrix inversion, Schur complement) | Reference | 1 hr |

✏️ Exercises

Exercise 1.1: Derive the Gaussian conditional. Given a joint Gaussian \(\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right)\), derive \(p(\mathbf{x} \mid \mathbf{y})\) by completing the square. Identify the Schur complement \(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\) as the conditional covariance.
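Once you have the formula on paper, a numerical cross-check is cheap. The sketch below is one way to do it, not a prescribed solution: it evaluates the conditional mean and the Schur-complement covariance, then compares them with the precision-matrix form of the same result. The function name `gaussian_conditional`, the random test covariance, and the index split are all assumptions made for illustration.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_x, idx_y, y_obs):
    """Mean and covariance of x | y = y_obs for a joint Gaussian N(mu, Sigma)."""
    mu_x, mu_y = mu[idx_x], mu[idx_y]
    S_xx = Sigma[np.ix_(idx_x, idx_x)]
    S_xy = Sigma[np.ix_(idx_x, idx_y)]
    S_yy = Sigma[np.ix_(idx_y, idx_y)]
    # Solve a linear system instead of forming S_yy^{-1} explicitly.
    gain = np.linalg.solve(S_yy, S_xy.T).T        # equals S_xy S_yy^{-1}
    cond_mean = mu_x + gain @ (y_obs - mu_y)      # linear in y_obs
    cond_cov = S_xx - gain @ S_xy.T               # Schur complement of S_yy
    return cond_mean, cond_cov

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)   # arbitrary symmetric positive-definite covariance
mu = rng.normal(size=5)
idx_x, idx_y = [0, 1], [2, 3, 4]
y_obs = rng.normal(size=3)

cond_mean, cond_cov = gaussian_conditional(mu, Sigma, idx_x, idx_y, y_obs)

# Cross-check via the precision matrix Lambda = Sigma^{-1}:
# Cov(x | y) = Lambda_xx^{-1}, E[x | y] = mu_x - Lambda_xx^{-1} Lambda_xy (y - mu_y).
Lambda = np.linalg.inv(Sigma)
cov_from_precision = np.linalg.inv(Lambda[np.ix_(idx_x, idx_x)])
mean_from_precision = mu[idx_x] - cov_from_precision @ Lambda[np.ix_(idx_x, idx_y)] @ (y_obs - mu[idx_y])
```

The two routes agreeing is exactly the block-inversion identity you use when completing the square.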

Exercise 1.2: Bayesian linear regression. Place a Gaussian prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}I)\) on weights and a Gaussian likelihood \(y \mid \mathbf{w} \sim \mathcal{N}(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \beta^{-1})\). Derive the posterior \(p(\mathbf{w} \mid \mathcal{D})\) in closed form. This is the finite-dimensional analogue of GP regression; knowing this derivation cold makes the Week 2 derivation nearly immediate.
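To check your closed form for Exercise 1.2 numerically, a minimal sketch follows. It assumes the standard result \(S_N^{-1} = \alpha I + \beta \Phi^\top \Phi\), \(\mathbf{m}_N = \beta S_N \Phi^\top \mathbf{y}\), where \(\Phi\) stacks the feature vectors row-wise. The polynomial features, the synthetic data, and the name `blr_posterior` are illustrative assumptions.

```python
import numpy as np

def blr_posterior(Phi, y, alpha, beta):
    """Posterior N(m_N, S_N) over weights: prior N(0, alpha^{-1} I), noise precision beta."""
    n_features = Phi.shape[1]
    S_N_inv = alpha * np.eye(n_features) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ y
    return m_N, S_N

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x, x**2])   # hypothetical polynomial features
w_true = np.array([0.5, -1.0, 2.0])
alpha, beta = 1.0, 25.0                              # beta = 25 means noise std 0.2
y = Phi @ w_true + rng.normal(scale=beta**-0.5, size=x.size)

m_N, S_N = blr_posterior(Phi, y, alpha, beta)

# The posterior mean coincides with ridge regression with penalty alpha / beta.
ridge = np.linalg.solve(Phi.T @ Phi + (alpha / beta) * np.eye(3), Phi.T @ y)
```

The ridge equivalence is a useful sanity check on the algebra: the prior precision \(\alpha\) shows up only through the ratio \(\alpha/\beta\) in the mean.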

Exercise 1.3: Marginal likelihood. From Exercise 1.2, compute the marginal likelihood \(p(\mathbf{y} \mid \mathbf{X})\) by integrating out \(\mathbf{w}\). Identify which term rewards data fit and which penalizes model complexity.
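A numerical sketch can make the split into terms concrete. It assumes the marginal \(\mathbf{y} \sim \mathcal{N}(\mathbf{0}, C)\) with \(C = \beta^{-1} I + \alpha^{-1} \Phi \Phi^\top\), and returns the quadratic term and the log-determinant term separately. The function name, the two feature sets, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def blr_log_marginal_likelihood(Phi, y, alpha, beta):
    """Return (log p(y | X), quadratic data-fit term, log-determinant term)."""
    n = y.size
    C = np.eye(n) / beta + Phi @ Phi.T / alpha
    L = np.linalg.cholesky(C)                  # C = L L^T
    z = np.linalg.solve(L, y)                  # z @ z equals y^T C^{-1} y
    data_fit = -0.5 * z @ z
    complexity = -np.sum(np.log(np.diag(L)))   # equals -0.5 * log|C|
    const = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + const, data_fit, complexity

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
y = 0.5 - x + rng.normal(scale=0.2, size=x.size)    # data from a linear trend

Phi_linear = np.column_stack([np.ones_like(x), x])
Phi_poly9 = np.column_stack([x**p for p in range(10)])  # nests the linear model

lml_linear, fit_linear, comp_linear = blr_log_marginal_likelihood(Phi_linear, y, 1.0, 25.0)
lml_poly9, fit_poly9, comp_poly9 = blr_log_marginal_likelihood(Phi_poly9, y, 1.0, 25.0)
```

Because the degree-9 features contain the linear ones, its \(C\) dominates the linear model's \(C\) in the PSD order. Its data-fit term can therefore only improve and its log-determinant term can only get more negative. Which model wins overall depends on the data.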

✅ Checkpoint

You should be able to: (a) state the Gaussian conditioning formula without reference, (b) explain why the conditional mean is a linear function of the conditioning variable, and (c) derive the posterior in Bayesian linear regression in under 10 minutes.


Week 2: Gaussian Process Regression (~12 hrs)

🎯 Goal: Understand GPs as distributions over functions. Derive the GP predictive posterior and connect it to Week 1's conditioning formula.

📚 Resources

| Resource | Sections | Format | Time |
| --- | --- | --- | --- |
| Rasmussen & Williams, Gaussian Processes for Machine Learning | Ch. 1 (intro), Ch. 2 (GP regression; derive everything) | Textbook (free) | 5 hrs |
| R&W GPML | Ch. 4 §4.1–4.3 (covariance functions) | Textbook | 3 hrs |
| R&W GPML | §5.4.2 (LOO-CV for GPs, closed-form predictive) | Textbook | 1 hr |
| Neil Lawrence, GP Summer School lectures | Intro GP lecture + lab | Video + notebook | 2 hrs |

✏️ Exercises

Exercise 2.1: GP predictive posterior from scratch. Suppose you observe noisy data \(\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\varepsilon}\) where \(f \sim \mathcal{GP}(0, k)\) and \(\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)\). Using only the Gaussian conditioning formula from Week 1, derive the posterior predictive \(p(f_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y})\). Write out mean and variance explicitly. Verify this matches R&W eqs. (2.23)–(2.24).

Exercise 2.2: Implement GP regression from scratch. In Python (NumPy only, no GPy/GPflow), implement:

  - SE kernel: \(k(x, x') = \sigma_f^2 \exp\!\left(-\frac{(x-x')^2}{2\ell^2}\right)\)
  - GP posterior mean and variance
  - Posterior sample paths via Cholesky

Plot the posterior with 3, 10, and 50 training points on a 1D function of your choice. Verify that the posterior variance of \(f\) at observed points shrinks toward zero as \(\sigma_n \to 0\); with \(\sigma_n > 0\) it stays strictly positive.
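If you want something to compare your Exercise 2.2 implementation against, the sketch below follows the Cholesky-based recipe (in the spirit of R&W Algorithm 2.1). It is a reference under stated assumptions, not the only reasonable design: hyperparameters are fixed by hand rather than learned, the prior mean is zero, the test function is \(\sin(x)\), and the names `se_kernel`, `gp_posterior`, and `sample_paths` are hypothetical. Plotting is left out so the block needs only NumPy.

```python
import numpy as np

def se_kernel(xa, xb, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel matrix between 1D input arrays xa and xb."""
    diff = xa[:, None] - xb[None, :]
    return sigma_f**2 * np.exp(-0.5 * diff**2 / ell**2)

def gp_posterior(x_train, y_train, x_test, sigma_n=0.1):
    """Posterior mean and covariance of f at x_test under a zero-mean GP prior."""
    K = se_kernel(x_train, x_train) + sigma_n**2 * np.eye(x_train.size)
    K_s = se_kernel(x_train, x_test)
    K_ss = se_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)                                   # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    v = np.linalg.solve(L, K_s)                                 # L^{-1} K_s
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

def sample_paths(mean, cov, n_samples, rng, jitter=1e-6):
    """Draw joint samples mean + L z, with jitter added for numerical stability."""
    L = np.linalg.cholesky(cov + jitter * np.eye(mean.size))
    return mean[:, None] + L @ rng.standard_normal((mean.size, n_samples))

rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=10)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(x_train.size)
x_test = np.linspace(-3.0, 3.0, 50)

mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # clip tiny negative round-off
samples = sample_paths(mean, cov, n_samples=3, rng=rng)
```

Calling `gp_posterior(x_train, y_train, x_train, sigma_n=1e-3)` is a quick way to see the variance at observed inputs shrink as the noise level goes to zero.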

Exercise 2.3: Kernel composition. Prove: if \(k_1\) and \(k_2\) are PSD kernels, then \(k_1 + k_2\) and \(k_1 \cdot k_2\) are PSD. Use this to construct a kernel modeling a trend plus periodic component.
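A numerical check is no substitute for the proof, but it catches mistakes in a kernel you construct. The sketch below builds Gram matrices for a sum kernel (linear trend plus periodic) and a product kernel (SE times periodic), then inspects their smallest eigenvalues. The specific kernels, hyperparameters, and function names are assumptions chosen for illustration.

```python
import numpy as np

def se_kernel(xa, xb, ell=1.0):
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * diff**2 / ell**2)

def linear_kernel(xa, xb):
    return xa[:, None] * xb[None, :]

def periodic_kernel(xa, xb, period=1.0, ell=1.0):
    dist = np.abs(xa[:, None] - xb[None, :])
    return np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / ell**2)

x = np.linspace(0.0, 4.0, 40)

# Sum of kernels: linear trend plus a periodic component.
K_sum = linear_kernel(x, x) + periodic_kernel(x, x)
# Product of kernels: a periodic pattern whose shape drifts slowly.
K_prod = se_kernel(x, x, ell=2.0) * periodic_kernel(x, x)

# PSD Gram matrices have no eigenvalue below zero, up to floating-point round-off.
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

The elementwise product in `K_prod` is the Hadamard product of two Gram matrices, which is the object your proof for \(k_1 \cdot k_2\) needs to handle.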

Exercise 2.4: Mercer's theorem (conceptual). State Mercer's theorem and explain in one paragraph why it licenses thinking of \(k(x, x')\) as a dot product in a (possibly infinite-dimensional) feature space. Why does this matter for GP regression?

Exercise 2.5: GP LOO-CV in closed form. For a GP fit to \(n\) training points with kernel matrix \(K\) (the covariance of the noisy targets, so it includes \(\sigma_n^2 I\)), the leave-one-out predictive mean and variance for point \(i\) are (R&W eq. 5.12): \[\mu_{-i} = y_i - \frac{[K^{-1}\mathbf{y}]_i}{[K^{-1}]_{ii}}, \qquad \sigma_{-i}^2 = \frac{1}{[K^{-1}]_{ii}}\] Derive these from the Gaussian conditioning formula applied to the joint \(p(\mathbf{y})\) by partitioning into \((y_i, \mathbf{y}_{-i})\). Then explain why this requires only one Cholesky factorization rather than \(n\) separate GP refits.
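The closed-form leave-one-out expressions for GPs (R&W eq. 5.12) are easy to validate against brute force. The sketch below computes the LOO predictive mean and variance both ways: once from a single inverse of the noisy covariance matrix, and once by refitting on each held-out fold. The data, kernel hyperparameters, and function names are illustrative assumptions. Here `K_y` denotes the kernel matrix with \(\sigma_n^2 I\) already added.

```python
import numpy as np

def se_kernel(xa, xb, ell=1.0):
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * diff**2 / ell**2)

def loo_closed_form(K_y, y):
    """LOO predictive mean and variance from one O(n^3) inverse shared by all folds."""
    K_inv = np.linalg.inv(K_y)
    diag = np.diag(K_inv)
    mu = y - (K_inv @ y) / diag
    var = 1.0 / diag
    return mu, var

def loo_brute_force(K_y, y):
    """Same quantities via n separate conditionings, O(n^4) overall."""
    n = y.size
    mu, var = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        K_rest = K_y[np.ix_(keep, keep)]
        k_i = K_y[keep, i]                  # covariance between y_i and the rest
        w = np.linalg.solve(K_rest, k_i)
        mu[i] = w @ y[keep]
        var[i] = K_y[i, i] - w @ k_i
    return mu, var

rng = np.random.default_rng(3)
x = np.linspace(0.0, 5.0, 15)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
K_y = se_kernel(x, x) + 0.1**2 * np.eye(x.size)

mu_fast, var_fast = loo_closed_form(K_y, y)
mu_slow, var_slow = loo_brute_force(K_y, y)
```

In practice you would reuse the Cholesky factor from training instead of calling `np.linalg.inv`; the explicit inverse is used here only to keep the correspondence with the formula obvious.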

✅ Checkpoint

You should be able to: (a) derive the GP posterior mean and covariance in 5 minutes given the conditioning formula, (b) explain what "distribution over functions" means precisely (not just intuitively), (c) construct a novel kernel for a given function class and justify its PSD property, and (d) state the closed-form LOO predictive and explain why it's \(O(n^3)\) total rather than \(O(n^4)\).


Week 3: Bayesian Optimization (~10 hrs)

🎯 Goal: Understand the BO loop, derive the main acquisition functions, and see how GP uncertainty drives exploration.

📚 Resources

| Resource | Sections | Format | Time |
| --- | --- | --- | --- |
| Frazier, A Tutorial on Bayesian Optimization | Full (short, 22pp) | Paper | 2 hrs |
| Garnett, Bayesian Optimization | Ch. 1–5 (foundations through acquisition functions) | Textbook (free) | 4 hrs |
| Shahriari et al., Taking the Human Out of the Loop | §III (acquisition functions) only | Survey | 1.5 hrs |

✏️ Exercises

Exercise 3.1: Derive Expected Improvement. Given a GP surrogate \(f \sim \mathcal{GP}\) with posterior mean \(\mu(\mathbf{x})\) and variance \(\sigma^2(\mathbf{x})\), derive the EI acquisition function: \[\text{EI}(\mathbf{x}) = \mathbb{E}[\max(f(\mathbf{x}) - f^+, 0)]\] in closed form, where \(f^+\) is the best objective value observed so far (maximization convention). You will need the formula \(\mathbb{E}[\max(Z - \tau, 0)]\) for \(Z \sim \mathcal{N}(\mu, \sigma^2)\); derive this too using integration by parts.
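To confirm your derivation, compare it against a Monte Carlo estimate. The sketch below assumes the standard closed form \(\text{EI} = (\mu - f^+)\,\Phi(z) + \sigma\,\phi(z)\) with \(z = (\mu - f^+)/\sigma\) under the maximization convention. It uses only the Python standard library. The function names and the test values are illustrative assumptions.

```python
import math
import random

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2), maximization convention."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)       # no uncertainty left at this point
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - f_best) * cdf + sigma * pdf

def expected_improvement_mc(mu, sigma, f_best, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[max(Z - f_best, 0)] with Z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(rng.gauss(mu, sigma) - f_best, 0.0)
    return total / n_samples

ei_exact = expected_improvement(mu=0.3, sigma=0.8, f_best=0.5)
ei_mc = expected_improvement_mc(mu=0.3, sigma=0.8, f_best=0.5)
```

Note that EI stays positive even when \(\mu < f^+\), as in this example: the \(\sigma\,\phi(z)\) term is what rewards exploring uncertain regions.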

Exercise 3.2: BO from scratch. Implement a minimal BO loop:

  1. GP surrogate with SE kernel
  2. EI acquisition maximized by grid search
  3. Sequential evaluation loop

Test on \(f(x) = -x^2 + \sin(3x)\) on \([-3, 3]\). Plot the surrogate, EI, and observed points at each iteration.
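The three steps of the minimal BO loop (GP surrogate, EI by grid search, sequential evaluation) can be sketched as below for maximizing \(f(x) = -x^2 + \sin(3x)\) on \([-3, 3]\). This is one possible skeleton, and several choices in it are assumptions made for illustration: the hyperparameters are hand-picked rather than learned, the GP uses a constant mean set to the average of the observations, a small noise term is added for numerical stability, and the initial design is three fixed grid points. Plotting is omitted.

```python
import math
import numpy as np

def objective(x):
    return -x**2 + np.sin(3.0 * x)

def se_kernel(xa, xb, sigma_f=3.0, ell=0.7):
    diff = xa[:, None] - xb[None, :]
    return sigma_f**2 * np.exp(-0.5 * diff**2 / ell**2)

def gp_posterior(x_train, y_train, x_test, sigma_n=1e-2):
    """Posterior mean and standard deviation on x_test (constant mean = average of y)."""
    y_mean = y_train.mean()
    K = se_kernel(x_train, x_train) + sigma_n**2 * np.eye(x_train.size)
    K_s = se_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    v = np.linalg.solve(L, K_s)
    mean = y_mean + K_s.T @ alpha
    prior_var = np.diag(se_kernel(x_test, x_test))
    var = np.clip(prior_var - np.sum(v**2, axis=0), 1e-12, None)
    return mean, np.sqrt(var)

def expected_improvement(mean, std, f_best):
    z = (mean - f_best) / std
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    cdf = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))
    return (mean - f_best) * cdf + std * pdf

grid = np.linspace(-3.0, 3.0, 301)        # candidate set for the grid search
x_obs = grid[[20, 150, 280]].copy()       # three initial design points
y_obs = objective(x_obs)
n_iter = 12

for _ in range(n_iter):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mean, std, y_obs.max())
    x_next = grid[np.argmax(ei)]          # acquisition maximized over the grid
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
best_y = y_obs.max()
```

Storing `mean`, `std`, and `ei` at each iteration gives you everything needed for the per-iteration plots the exercise asks for.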

Exercise 3.3: Regret analysis (conceptual). Read Garnett Ch. 5 on convergence. In your own words: what is cumulative regret and what does a sublinear regret bound imply about BO's performance as \(T \to \infty\)?
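For reference while reading, the standard definitions can be written as follows, assuming the maximization convention with \(\mathbf{x}^\star\) a global maximizer and \(\mathbf{x}_t\) the point evaluated at step \(t\). The notation \(r_t\), \(R_T\) is a common choice, not necessarily the one your source uses.

```latex
\[
r_t = f(\mathbf{x}^\star) - f(\mathbf{x}_t) \ge 0, \qquad
R_T = \sum_{t=1}^{T} r_t .
\]
\[
R_T = o(T)
\;\Longrightarrow\;
\frac{R_T}{T} \to 0
\;\Longrightarrow\;
\min_{t \le T} r_t \;\le\; \frac{R_T}{T} \;\to\; 0 .
\]
```

The last inequality holds because the minimum of nonnegative terms cannot exceed their average; it is the step that turns a bound on cumulative regret into a statement about the best point found.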

✅ Checkpoint

You should be able to: (a) derive the closed form of EI without reference, (b) explain the exploration-exploitation tradeoff in terms of the GP posterior variance, and (c) identify one assumption in the standard BO convergence analysis that the paper under study relaxes.


Week 4: Multi-Task GPs & Kernel Coregionalization (~10 hrs)

🎯 Goal: Understand how GP regression extends to vector-valued outputs (tasks), then read the paper.

This is the direct prerequisite for the paper's kernel construction.

📚 Resources

| Resource | Sections | Format | Time |
| --- | --- | --- | --- |
| R&W GPML | §9.1 (multiple outputs, coregionalization) | Textbook | 2 hrs |
| Bonilla et al., Multi-Task GP Prediction | Full | Paper | 2 hrs |
| Álvarez et al., Kernels for Vector-Valued Functions | §2–3 (LMC and ICM) | Survey | 2 hrs |

✏️ Exercises

Exercise 4.1: Linear Model of Coregionalization. In the LMC, each output is modeled as \(f_d(\mathbf{x}) = \sum_{q=1}^Q a_{d,q} u_q(\mathbf{x})\) for independent GPs \(u_q \sim \mathcal{GP}(0, k_q)\). Derive the cross-covariance \(\text{Cov}(f_d(\mathbf{x}), f_{d'}(\mathbf{x}'))\) and show the multi-task covariance matrix factorizes as a Kronecker product in the ICM special case, where all latent processes share a single kernel \(k_q = k\).
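A numerical sketch of the LMC structure may help fix the indexing before you write the derivation. It assumes the cross-covariance \(\sum_q a_{d,q}\, a_{d',q}\, k_q(\mathbf{x}, \mathbf{x}')\) and a task-major ordering of the stacked outputs. With a shared kernel it checks that the full covariance equals \(B \otimes K\) with \(B = A A^\top\). The sizes, lengthscales, and function names are illustrative assumptions.

```python
import numpy as np

def se_kernel(xa, xb, ell):
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * diff**2 / ell**2)

def lmc_covariance(A, gram_matrices):
    """Sum over q of (a_q a_q^T) kron K_q, with outputs stacked task by task."""
    n_tasks, n_points = A.shape[0], gram_matrices[0].shape[0]
    cov = np.zeros((n_tasks * n_points, n_tasks * n_points))
    for q, K_q in enumerate(gram_matrices):
        B_q = np.outer(A[:, q], A[:, q])   # rank-one coregionalization matrix
        cov += np.kron(B_q, K_q)
    return cov

rng = np.random.default_rng(0)
n_tasks, n_latent, n_points = 2, 3, 5
A = rng.normal(size=(n_tasks, n_latent))   # mixing weights a_{d,q}
x = np.linspace(0.0, 1.0, n_points)

# General LMC: each latent process has its own kernel (here, its own lengthscale).
grams = [se_kernel(x, x, ell) for ell in (0.1, 0.3, 1.0)]
cov_lmc = lmc_covariance(A, grams)

# ICM special case: all latent processes share one kernel.
K_shared = se_kernel(x, x, 0.3)
cov_icm = lmc_covariance(A, [K_shared] * n_latent)
B = A @ A.T                                 # task covariance matrix
cov_kron = np.kron(B, K_shared)
```

With distinct kernels the sum of Kronecker products does not collapse to a single one, which is why the general LMC is more expensive to work with than the ICM.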

Exercise 4.2: Bias-variance in multi-task settings. Suppose task 2 provides a biased but low-noise version of task 1. Sketch (analytically or geometrically) how the optimal weight on task-2 data in the multi-task posterior trades off between the bias it introduces and the variance it reduces.
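One way to build intuition for the bias-variance tradeoff is a scalar toy model. This model is a deliberately simplified assumption for illustration, not the paper's result. Take an unbiased estimate \(y_1\) of a quantity \(\theta\) with variance `var1`, an independent estimate \(y_2\) with variance `var2` and fixed bias `bias`, and combine them as \(w\,y_2 + (1-w)\,y_1\). The mean squared error is \((1-w)^2\,\text{var1} + w^2(\text{var2} + \text{bias}^2)\), which is minimized at \(w^\star = \text{var1} / (\text{var1} + \text{var2} + \text{bias}^2)\).

```python
def combined_mse(w, var1, var2, bias):
    """MSE of w * y2 + (1 - w) * y1 in the hypothetical scalar toy model."""
    return (1.0 - w) ** 2 * var1 + w**2 * (var2 + bias**2)

def optimal_weight(var1, var2, bias):
    """Weight on the biased estimate that minimizes combined_mse."""
    return var1 / (var1 + var2 + bias**2)

var1, var2 = 1.0, 0.05          # task 1 is noisy, task 2 is low-noise
biases = [0.0, 0.5, 1.0, 3.0]   # increasing bias of the task-2 estimate

weights = [optimal_weight(var1, var2, b) for b in biases]
mses = [combined_mse(w, var1, var2, b) for w, b in zip(weights, biases)]

# Grid search as an independent check of the analytic minimizer for bias = 0.5.
candidates = [k / 1000.0 for k in range(1001)]
w_grid = min(candidates, key=lambda w: combined_mse(w, var1, var2, 0.5))
```

In this toy model the weight on task 2 shrinks as the bias grows, yet the combined error never exceeds `var1`, the error of ignoring task 2 entirely. How much of this carries over to the GP setting is what the exercise asks you to work out.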

Exercise 4.3: Read the paper. After completing 4.1–4.2, read arXiv:1904.01049 with these questions in mind:

  - How does the paper's kernel for the biased offline/online task pair differ from standard ICM?
  - What theoretical result characterizes the optimal weight on biased data?
  - What assumption about the offline simulator does the kernel encode?

✅ Checkpoint

You can reconstruct the paper's multi-task kernel from the LMC framework and explain (in the language of Section 3) what the kernel assumes about the relationship between the online and offline tasks.


Week 5: Deep Reading of the Paper (~6–8 hrs)

🎯 Goal: Work through every derivation in the paper, annotate assumptions, and identify what's novel vs. standard.

Strategy

  1. Pass 1 (2 hrs): Read straight through, noting every symbol and equation you don't immediately recognize. Flag them.
  2. Pass 2 (3 hrs): For each flagged item, trace it back to the prerequisite concept. Derive or look up.
  3. Pass 3 (1.5 hrs): Re-read the theory section (§3–4) and reproduce the main result on paper without looking at the proof.

Questions to answer during deep read


Reference Table

| Reference | Brief Summary | Link |
| --- | --- | --- |
| Bishop, PRML Ch. 2 | Gaussians, conjugate priors, Bayesian inference | PDF |
| Murphy, PML Book 1 | Modern probabilistic ML foundations | Free online |
| Rasmussen & Williams, GPML | The GP bible; derivations of everything | Free online |
| Frazier, BO Tutorial | Best short intro to BO; EI derivation | arXiv:1807.02811 |
| Garnett, Bayesian Optimization | Full rigorous textbook; convergence theory | Free online |
| Shahriari et al. 2016 | Comprehensive BO survey | IEEE |
| Bonilla et al. 2007 | Multi-task GP prediction; LMC derivation | NeurIPS |
| Álvarez et al., Vector-Valued Kernels | Survey of LMC, ICM, convolution kernels | Foundations and Trends in Machine Learning |
| Petersen & Pedersen, Matrix Cookbook | Quick reference for matrix identities | PDF |