Prerequisites for Multi-Task Bayesian Optimization (arXiv:1904.01049)
Target paper: Practical Multi-Task Scalable Bayesian Optimization
Goal: Read the paper deeply, following the derivations, kernel construction, and theoretical results
Profile: Solid multivariable calculus + linear algebra; introductory probability; surface-level GP and BO familiarity
Estimated pace: ~10 hrs/week, 4-5 weeks to paper readiness
Dependency Graph
Multivariate Gaussian conditioning
↓
Bayesian inference (formally)
↓
GP regression (derive predictive posterior)
↓                     ↓
Kernel theory         Marginal likelihood + LOO-CV + hyperparameter learning
↓                     ↓
└──────────┬──────────┘
↓
Bayesian optimization (EI / UCB derivations)
↓
Multi-task GP kernels (LMC / ICM)
↓
The Paper
Week 1: Multivariate Gaussians & Bayesian Foundations (~10 hrs)
Goal: Derive the Gaussian conditional and marginal from scratch; understand Bayesian inference as posterior computation.
The GP posterior is a Gaussian conditional. Everything downstream depends on internalizing this one formula.
Resources
| Resource | Sections | Format | Time |
|---|---|---|---|
| Bishop, PRML | §2.3 (Gaussian Distribution) | Textbook | 3 hrs |
| Murphy, Probabilistic ML: An Introduction | §2.3 (Gaussians), §4.2 (Bayesian inference for Gaussians) | Textbook | 3 hrs |
| Petersen & Pedersen, The Matrix Cookbook | §8 (block matrix inversion, Schur complement) | Reference | 1 hr |
Exercises
Exercise 1.1: Derive the Gaussian conditional. Given a joint Gaussian \(\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right)\), derive \(p(\mathbf{x} \mid \mathbf{y})\) by completing the square. Identify the Schur complement \(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\).
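Once the derivation is done, a numerical sanity check can catch sign or transpose mistakes. The sketch below is one possible check, not a replacement for the derivation: the function name `gaussian_condition`, the random 3-D covariance, and the observed value are all made up for illustration. It compares the Schur-complement covariance against the inverse of the corresponding block of the joint precision matrix.

```python
import numpy as np

def gaussian_condition(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for a joint Gaussian."""
    # Solve against S_yy instead of forming its inverse explicitly.
    gain = np.linalg.solve(S_yy, S_xy.T).T          # S_xy S_yy^{-1}
    cond_mean = mu_x + gain @ (y_obs - mu_y)
    cond_cov = S_xx - gain @ S_xy.T                 # Schur complement
    return cond_mean, cond_cov

# Hypothetical 3-D example: x is the first two coordinates, y the last one.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)                   # symmetric positive definite
mu = np.array([0.5, -1.0, 2.0])
y_obs = np.array([1.2])

cond_mean, cond_cov = gaussian_condition(
    mu[:2], mu[2:], Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:], y_obs
)

# Cross-check: the conditional covariance is the inverse of the xx block
# of the joint precision matrix.
Lambda = np.linalg.inv(Sigma)
cov_from_precision = np.linalg.inv(Lambda[:2, :2])
```

Agreement between `cond_cov` and `cov_from_precision` is exactly the block-inversion identity that the completing-the-square argument relies on.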
Exercise 1.2: Bayesian linear regression. Place a Gaussian prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}I)\) on weights and a Gaussian likelihood \(y \mid \mathbf{w} \sim \mathcal{N}(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \beta^{-1})\). Derive the posterior \(p(\mathbf{w} \mid \mathcal{D})\) in closed form. This is the finite-dimensional analogue of GP regression; understanding this derivation cold makes Week 2 immediate.
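To make the link to GP regression concrete, the following sketch computes the weight posterior in its standard closed form and then recomputes the posterior mean in a "kernel" form. Everything specific here is an assumption for illustration: the cubic polynomial features, the values of `alpha` and `beta`, and the helper name `blr_posterior`.

```python
import numpy as np

def blr_posterior(Phi, y, alpha, beta):
    """Posterior N(m_N, S_N) over weights for prior N(0, alpha^{-1} I)
    and Gaussian noise with precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ y
    return m_N, S_N

# Hypothetical data: cubic polynomial features on 20 random inputs.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.vander(x, N=4, increasing=True)            # columns 1, x, x^2, x^3
w_true = np.array([0.3, -1.0, 0.5, 2.0])
alpha, beta = 2.0, 25.0
y = Phi @ w_true + rng.normal(scale=beta ** -0.5, size=x.shape)

m_N, S_N = blr_posterior(Phi, y, alpha, beta)

# Same posterior mean in "kernel" form, with k(x, x') = phi(x)^T phi(x') / alpha.
K = Phi @ Phi.T / alpha
m_dual = Phi.T @ np.linalg.solve(K + np.eye(len(x)) / beta, y) / alpha
```

The two means agree because of the push-through identity \((\alpha I + \beta\Phi^\top\Phi)^{-1}\Phi^\top = \Phi^\top(\alpha I + \beta\Phi\Phi^\top)^{-1}\); the second form only touches the data through the Gram matrix, which is the view GP regression takes.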
Exercise 1.3: Marginal likelihood. From Exercise 1.2, compute the marginal likelihood \(p(\mathbf{y} \mid \mathbf{X})\) by integrating out \(\mathbf{w}\). Identify which terms penalize model complexity.
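As a target to check your answer against, here is the form the result should take, assuming \(\Phi\) denotes the design matrix whose \(n\)-th row is \(\boldsymbol{\phi}(\mathbf{x}_n)^\top\) and \(N\) is the number of observations (both symbols are introduced here, not in the exercise). Because \(\mathbf{y} = \Phi\mathbf{w} + \boldsymbol{\varepsilon}\) is a linear map of independent Gaussians, \(\mathbf{y}\) is itself Gaussian:

\[
\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, C), \qquad C = \beta^{-1} I + \alpha^{-1}\Phi\Phi^\top,
\]
\[
\log p(\mathbf{y} \mid \mathbf{X}) = -\tfrac{1}{2}\,\mathbf{y}^\top C^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\lvert C\rvert \;-\; \tfrac{N}{2}\log 2\pi .
\]

The first term measures data fit; the log-determinant term grows as the model is able to explain more varied data sets, which is the usual reading of it as a complexity penalty.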
Checkpoint
You should be able to: (a) state the Gaussian conditioning formula without reference, (b) explain why the conditional mean is a linear function of the conditioning variable, and (c) derive the posterior in Bayesian linear regression in under 10 minutes.
Week 2: Gaussian Process Regression (~12 hrs)
Goal: Understand GPs as distributions over functions. Derive the GP predictive posterior and connect it to Week 1's conditioning formula.
Resources
| Resource | Sections | Format | Time |
|---|---|---|---|
| Rasmussen & Williams, Gaussian Processes for Machine Learning | Ch. 1 (intro), Ch. 2 (GP regression; derive everything) | Textbook (free) | 5 hrs |
| R&W GPML | Ch. 4, §4.1-4.3 (covariance functions) | Textbook | 3 hrs |
| R&W GPML | §5.4.2 (LOO-CV for GPs, closed-form predictive) | Textbook | 1 hr |
| Neil Lawrence, GP Summer School lectures | Intro GP lecture + lab | Video + notebook | 2 hrs |
Exercises
Exercise 2.1: GP predictive posterior from scratch. Suppose you observe noisy data \(\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\varepsilon}\) where \(f \sim \mathcal{GP}(0, k)\) and \(\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)\). Using only the Gaussian conditioning formula from Week 1, derive the posterior predictive \(p(f_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y})\). Write out mean and variance explicitly. Verify this matches R&W eqs. (2.23)-(2.24).
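For reference while checking your work, the derivation starts from the joint prior over observations and test values and ends at the standard predictive equations. The shorthand \(K(\cdot,\cdot)\) for kernel matrices follows R&W; \(\bar{\mathbf{f}}_*\) is the predictive mean.

\[
\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0}, \begin{pmatrix} K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I & K(\mathbf{X},\mathbf{X}_*) \\ K(\mathbf{X}_*,\mathbf{X}) & K(\mathbf{X}_*,\mathbf{X}_*) \end{pmatrix}\right)
\]
\[
\bar{\mathbf{f}}_* = K(\mathbf{X}_*,\mathbf{X})\,[K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I]^{-1}\mathbf{y},
\]
\[
\operatorname{cov}(\mathbf{f}_*) = K(\mathbf{X}_*,\mathbf{X}_*) - K(\mathbf{X}_*,\mathbf{X})\,[K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I]^{-1}K(\mathbf{X},\mathbf{X}_*).
\]

The covariance is the Schur complement from Exercise 1.1 with \(\mathbf{x} \to \mathbf{f}_*\) and \(\mathbf{y} \to \mathbf{y}\).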
Exercise 2.2: Implement GP regression from scratch. In Python (NumPy only, no GPy/GPflow), implement:
- SE kernel: \(k(x, x') = \sigma_f^2 \exp\!\left(-\frac{(x-x')^2}{2\ell^2}\right)\)
- GP posterior mean and variance
- Posterior sample paths via Cholesky
Plot the posterior with 3, 10, and 50 training points on a 1D function of your choice. Verify that, in the noise-free limit \(\sigma_n^2 \to 0\), the posterior variance collapses to zero at observed points (with \(\sigma_n^2 > 0\) it stays strictly positive).
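If you get stuck, the following is one minimal way to structure the NumPy implementation; treat it as a sketch to compare against rather than a reference solution. The hyperparameter defaults, the `sin` target, the jitter value, and all function names are arbitrary choices made here, and plotting is left out.

```python
import numpy as np

def se_kernel(a, b, sigma_f=1.0, ell=0.5):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / ell ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and covariance of f at x_test (zero prior mean)."""
    K = se_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    v = np.linalg.solve(L, K_s)                                 # L^{-1} K_s
    mean = K_s.T @ alpha
    cov = se_kernel(x_test, x_test) - v.T @ v
    return mean, cov

def sample_paths(mean, cov, n_samples, rng, jitter=1e-6):
    """Draw posterior sample paths using a Cholesky factor of the covariance."""
    L = np.linalg.cholesky(cov + jitter * np.eye(len(mean)))
    return mean[:, None] + L @ rng.normal(size=(len(mean), n_samples))

rng = np.random.default_rng(0)
x_train = np.linspace(-3.0, 3.0, 10)
y_train = np.sin(x_train)                       # hypothetical target function
x_test = np.linspace(-3.0, 3.0, 50)

mean, cov = gp_posterior(x_train, y_train, x_test, noise_var=1e-6)
var = np.clip(np.diag(cov), 0.0, None)          # guard against round-off
paths = sample_paths(mean, cov, n_samples=3, rng=rng)
```

The jitter added before the second Cholesky factorization is a numerical convenience: the posterior covariance is only positive semidefinite, so a tiny diagonal term keeps the factorization from failing on round-off.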
Exercise 2.3: Kernel composition. Prove: if \(k_1\) and \(k_2\) are PSD kernels, then \(k_1 + k_2\) and \(k_1 \cdot k_2\) are PSD. Use this to construct a kernel modeling a trend plus periodic component.
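A numerical check is no substitute for the proof, but it is a quick way to catch a badly constructed kernel. The sketch below builds a sum kernel (linear trend plus periodic) and a product kernel (SE times periodic) and inspects the smallest eigenvalue of each Gram matrix. The specific kernels, parameter values, and input grid are illustrative assumptions.

```python
import numpy as np

def se_kernel(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def periodic_kernel(a, b, period=1.0, ell=1.0):
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def linear_kernel(a, b):
    return a[:, None] * b[None, :]

x = np.linspace(0.0, 4.0, 30)
K_sum = linear_kernel(x, x) + periodic_kernel(x, x)        # trend + periodic
K_prod = se_kernel(x, x, ell=2.0) * periodic_kernel(x, x)  # locally periodic

# Smallest eigenvalues should be non-negative up to floating-point round-off.
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

The product case is the one worth proving carefully: the elementwise product of two PSD matrices is PSD by the Schur product theorem.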
Exercise 2.4: Mercer's theorem (conceptual). State Mercer's theorem and explain in one paragraph why it licenses thinking of \(k(x, x')\) as a dot product in a (possibly infinite-dimensional) feature space. Why does this matter for GP regression?
Exercise 2.5: GP LOO-CV in closed form. For a GP fit to \(n\) training points, let \(K\) denote the covariance matrix of the noisy observations (the kernel matrix plus \(\sigma_n^2 I\)). The leave-one-out predictive mean and variance for point \(i\) are (R&W eq. 5.12): \[\mu_{-i} = y_i - \frac{[K^{-1}\mathbf{y}]_i}{[K^{-1}]_{ii}}, \qquad \sigma_{-i}^2 = \frac{1}{[K^{-1}]_{ii}}\] Derive these from the Gaussian conditioning formula applied to the joint \(p(\mathbf{y})\) by partitioning into \((y_i, \mathbf{y}_{-i})\). Then explain why this requires only one Cholesky factorization rather than \(n\) separate GP refits.
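The closed-form leave-one-out expressions from the LOO-CV exercise can be checked against a brute-force loop that conditions each \(y_i\) on the remaining observations. This is only a verification sketch: the SE kernel, the noise level 0.1, the synthetic data, and the function names are assumptions made here.

```python
import numpy as np

def se_kernel(a, b, ell=0.7):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def loo_closed_form(K_y, y):
    """LOO predictive means and variances from a single inverse of K_y."""
    K_inv = np.linalg.inv(K_y)
    diag = np.diag(K_inv)
    return y - (K_inv @ y) / diag, 1.0 / diag

def loo_brute_force(K_y, y):
    """Same quantities by conditioning y_i on y_{-i}, one solve per point."""
    n = len(y)
    mu, var = np.empty(n), np.empty(n)
    for i in range(n):
        rest = np.delete(np.arange(n), i)
        k_i = K_y[rest, i]
        w = np.linalg.solve(K_y[np.ix_(rest, rest)], k_i)
        mu[i] = w @ y[rest]
        var[i] = K_y[i, i] - w @ k_i
    return mu, var

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3.0, 3.0, size=15))
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)
K_y = se_kernel(x, x) + 0.1 ** 2 * np.eye(len(x))   # kernel matrix plus noise

mu_fast, var_fast = loo_closed_form(K_y, y)
mu_slow, var_slow = loo_brute_force(K_y, y)
```

The brute-force version performs \(n\) solves of size \(n-1\), which is where the \(O(n^4)\) cost comes from; the closed form reuses one \(O(n^3)\) inverse for all \(n\) points.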
Checkpoint
You should be able to: (a) derive the GP posterior mean and covariance in 5 minutes given the conditioning formula, (b) explain what "distribution over functions" means precisely (not just intuitively), (c) construct a novel kernel for a given function class and justify its PSD property, and (d) state the closed-form LOO predictive and explain why it's \(O(n^3)\) total rather than \(O(n^4)\).
Week 3: Bayesian Optimization (~10 hrs)
Goal: Understand the BO loop, derive the main acquisition functions, and see how GP uncertainty drives exploration.
Resources
| Resource | Sections | Format | Time |
|---|---|---|---|
| Frazier, A Tutorial on Bayesian Optimization | Full (short, 22pp) | Paper | 2 hrs |
| Garnett, Bayesian Optimization | Ch. 1-5 (foundations through acquisition functions) | Textbook (free) | 4 hrs |
| Shahriari et al., Taking the Human Out of the Loop | §III (acquisition functions) only | Survey | 1.5 hrs |
Exercises
Exercise 3.1: Derive Expected Improvement. Given a GP surrogate \(f \sim \mathcal{GP}\) with posterior mean \(\mu(\mathbf{x})\) and variance \(\sigma^2(\mathbf{x})\), derive the EI acquisition function: \[\text{EI}(\mathbf{x}) = \mathbb{E}[\max(f(\mathbf{x}) - f^+, 0)]\] in closed form, where \(f^+\) is the best objective value observed so far (maximization convention). You will need the formula \(\mathbb{E}[\max(Z - \tau, 0)]\) for \(Z \sim \mathcal{N}(\mu, \sigma^2)\); derive this too using integration by parts.
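After deriving the closed form, a Monte Carlo estimate is an easy way to confirm it. The sketch below assumes the maximization convention, under which the standard result is \((\mu - f^+)\Phi(z) + \sigma\varphi(z)\) with \(z = (\mu - f^+)/\sigma\). The numeric values of `mu`, `sigma`, and `f_best` are arbitrary, and the standard normal CDF is built from `math.erf` to avoid a SciPy dependency.

```python
import math
import numpy as np

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization with a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)            # no uncertainty left
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - f_best) * cdf + sigma * pdf

# Monte Carlo cross-check on hypothetical values of mu, sigma and f_best.
rng = np.random.default_rng(3)
mu, sigma, f_best = 0.2, 0.5, 0.4
samples = rng.normal(mu, sigma, size=1_000_000)
ei_mc = np.maximum(samples - f_best, 0.0).mean()
ei_closed = expected_improvement(mu, sigma, f_best)
```

The two terms have a direct reading: the first rewards a posterior mean above the incumbent (exploitation), the second rewards posterior spread (exploration).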
Exercise 3.2: BO from scratch. Implement a minimal BO loop:
1. GP surrogate with SE kernel
2. EI acquisition maximized by grid search
3. Sequential evaluation loop
Test on \(f(x) = -x^2 + \sin(3x)\) on \([-3, 3]\). Plot the surrogate, EI, and observed points at each iteration.
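The three components of the loop can be wired together roughly as follows. This is a bare-bones sketch under several assumptions that are not part of the exercise: fixed, hand-picked hyperparameters (`SIGMA_F`, `ELL`, `NOISE_VAR`) with no marginal-likelihood fitting, a zero prior mean, an arbitrary three-point initial design, 15 iterations, and no plotting.

```python
import math
import numpy as np

SIGMA_F, ELL, NOISE_VAR = 2.0, 0.5, 1e-6        # hand-picked, not tuned

def se_kernel(a, b):
    return SIGMA_F ** 2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ELL ** 2)

def gp_posterior(x_train, y_train, x_test):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = se_kernel(x_train, x_train) + NOISE_VAR * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test)
    solved = np.linalg.solve(K, K_s)            # K^{-1} K_s
    mean = solved.T @ y_train
    var = SIGMA_F ** 2 - np.sum(K_s * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, f_best):
    z = (mean - f_best) / std
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return (mean - f_best) * cdf + std * pdf

def objective(x):
    return -x ** 2 + np.sin(3.0 * x)

grid = np.linspace(-3.0, 3.0, 601)
x_obs = np.array([-2.5, -0.5, 2.0])             # arbitrary initial design
y_obs = objective(x_obs)
y_init_best = y_obs.max()

for _ in range(15):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    ei = expected_improvement(mean, std, y_obs.max())
    x_next = grid[np.argmax(ei)]                # grid-search maximizer of EI
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x, best_y = x_obs[np.argmax(y_obs)], y_obs.max()
```

Storing `mean`, `std`, and `ei` at each iteration is enough to produce the plots the exercise asks for. In a fuller implementation the kernel hyperparameters would typically be re-fit as data arrive.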
Exercise 3.3: Regret analysis (conceptual). Read the convergence material in Garnett (Ch. 10, Theoretical Analysis). In your own words: what is cumulative regret, and what does a sublinear regret bound imply about BO's performance as \(T \to \infty\)?
Checkpoint
You should be able to: (a) derive EI closed form without reference, (b) explain the exploration-exploitation tradeoff in terms of the GP posterior variance, and (c) identify one assumption in the standard BO convergence analysis that the paper under study relaxes.
Week 4: Multi-Task GPs & Kernel Coregionalization (~10 hrs)
Goal: Understand how GP regression extends to vector-valued outputs (tasks), then read the paper.
This is the direct prerequisite for the paper's kernel construction.
Resources
| Resource | Sections | Format | Time |
|---|---|---|---|
| R&W GPML | §9.1 (multiple outputs) | Textbook | 2 hrs |
| Bonilla et al., Multi-Task GP Prediction | Full | Paper | 2 hrs |
| Álvarez et al., Kernels for Vector-Valued Functions | §2-3 (LMC and ICM) | Survey | 2 hrs |
Exercises
Exercise 4.1: Linear Model of Coregionalization. In the LMC, each output is modeled as \(f_d(\mathbf{x}) = \sum_{q=1}^Q a_{d,q} u_q(\mathbf{x})\) for independent GPs \(u_q \sim \mathcal{GP}(0, k_q)\). Derive the cross-covariance \(\text{Cov}(f_d(\mathbf{x}), f_{d'}(\mathbf{x}'))\) and show the multi-task covariance matrix factorizes as a Kronecker product in the ICM special case.
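The Kronecker factorization can be confirmed numerically once the cross-covariance \(\sum_q a_{d,q}a_{d',q}k_q(\mathbf{x},\mathbf{x}')\) has been derived. The sketch below assembles the LMC covariance term by term and compares it with \(B \otimes K\), \(B = AA^\top\), when all latent GPs share one kernel. The mixing matrix, kernel choice, task-major stacking order, and function names are assumptions for illustration.

```python
import numpy as np

def se_kernel(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def lmc_covariance(A, kernels, x):
    """Covariance of the stacked vector (f_1(x), ..., f_D(x)) under the LMC.

    A has shape (D, Q); kernels is a list of Q functions (a, b) -> Gram matrix.
    Outputs are stacked task by task (task-major ordering)."""
    return sum(np.kron(np.outer(A[:, q], A[:, q]), k(x, x))
               for q, k in enumerate(kernels))

rng = np.random.default_rng(4)
D, Q = 2, 3
A = rng.normal(size=(D, Q))                     # hypothetical mixing weights
x = np.linspace(0.0, 1.0, 5)

# ICM special case: every latent GP shares the same kernel.
shared = lambda a, b: se_kernel(a, b, ell=0.3)
cov_lmc = lmc_covariance(A, [shared] * Q, x)
cov_icm = np.kron(A @ A.T, shared(x, x))        # B kron K with B = A A^T
```

With distinct kernels per latent GP the sum no longer collapses to a single Kronecker product, which is the structural difference between the general LMC and the ICM.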
Exercise 4.2: Bias-variance in multi-task settings. Suppose task 2 provides a biased but low-noise version of task 1. Sketch (analytically or geometrically) how the optimal weight on task-2 data in the multi-task posterior trades off between the bias it introduces and the variance it reduces.
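One way to get started is a scalar toy model, which is an assumption introduced here and not the paper's setting. Let \(y_1 = \theta + \varepsilon_1\) be an unbiased task-1 observation with noise variance \(\sigma_1^2\), and \(y_2 = \theta + b + \varepsilon_2\) a task-2 observation with fixed bias \(b\), noise variance \(\sigma_2^2\), and noise independent of \(\varepsilon_1\). For the combined estimator \(\hat\theta(w) = (1-w)\,y_1 + w\,y_2\):

\[
\operatorname{MSE}(w) = (1-w)^2\sigma_1^2 + w^2\left(\sigma_2^2 + b^2\right),
\qquad
w^\star = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2 + b^2}.
\]

Setting \(b = 0\) recovers inverse-variance weighting, and \(w^\star \to 0\) as \(\lvert b\rvert \to \infty\): the squared bias enters the weight exactly as extra variance on task 2 would.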
Exercise 4.3: Read the paper. After completing 4.1-4.2, read arXiv:1904.01049 with these questions in mind:
- How does the paper's kernel for the biased offline/online task pair differ from standard ICM?
- What theoretical result characterizes the optimal weight on biased data?
- What assumption about the offline simulator does the kernel encode?
Checkpoint
You should be able to reconstruct the paper's multi-task kernel from the LMC framework and explain (in the language of Section 3) what the kernel assumes about the relationship between the online and offline tasks.
Week 5: Deep Reading of the Paper (~6-8 hrs)
Goal: Work through every derivation in the paper, annotate assumptions, and identify what's novel vs. standard.
Strategy
- Pass 1 (2 hrs): Read straight through, noting every symbol and equation you don't immediately recognize. Flag them.
- Pass 2 (3 hrs): For each flagged item, trace it back to the prerequisite concept. Derive or look up.
- Pass 3 (1.5 hrs): Re-read the theory section (§3-4) and reproduce the main result on paper without looking at the proof.
Questions to answer during deep read
Reference Table
| Reference | Brief Summary | Link |
|---|---|---|
| Bishop, PRML Ch. 2 | Gaussians, conjugate priors, Bayesian inference | |
| Murphy, PML Book 1 | Modern probabilistic ML foundations | Free online |
| Rasmussen & Williams, GPML | The GP bible; derivations of everything | Free online |
| Frazier, BO Tutorial | Best short intro to BO; EI derivation | arXiv:1807.02811 |
| Garnett, Bayesian Optimization | Full rigorous textbook; convergence theory | Free online |
| Shahriari et al. 2016 | Comprehensive BO survey | IEEE |
| Bonilla et al. 2007 | Multi-task GP prediction; task-covariance (ICM-style) kernel | NeurIPS |
| Álvarez et al., Vector-Valued Kernels | Survey of LMC, ICM, convolution kernels | Foundations and Trends in Machine Learning |
| Petersen & Pedersen, Matrix Cookbook | Quick reference for matrix identities | |