Online learning and FoReL

Posted on August 16, 2023 by Calvin

Note: This blog post is still a rough draft. Read on with caution.

In part 2 of our unthemed dive into the reinforcement learning literature, we will be taking a look at online (convex) optimization and some reinforcement learning algorithms that came out of it, applied to imperfect-information zero-sum games.

The last post focused on counterfactual regret minimization, which is also an online algorithm for choosing optimal strategies for an agent. The success of counterfactual regret minimization came from its strong theoretical guarantee of sublinear regret growth, along with its generality. As such, it seems fitting to start with a general overview of online optimization and see what other ideas have come out of it that could be fruitful for future AIs.

online learning

In machine learning, online learning is the process of continuously adapting and making decisions from a stream of information: at each time \(t\), an online learning algorithm is given an informational signal \(x_t\) from a space \(\mathcal{X}\) and decides on an action \(a_t\in\mathcal{A}\) to perform. After the decision, the environment/opponent chooses a loss function \(\ell^t\) and the agent suffers a loss \(\ell^t(x_t, a_t)\). The algorithm learns from this loss and updates its decision rule for the next round.

The goal of the learner is to minimize their regret

\[ R^T = \sum_{t=1}^T\ell^t(x_t, a_t) - \min_{a^*\in\mathcal{A}}\left\{\sum_{t=1}^T\ell^t(x_t, a^*)\right\} \]

i.e. the gap between the loss the learner actually suffered and the loss of the best fixed action in hindsight.

We call such an online learning setting learnable if we can achieve sublinear regret in \(T\).

Let us give a vibe for the field with an example. Consider the \(n\)-expert opinion setting, where at each time step we must choose a binary action, i.e. \(a_t\in\mathcal{A}=\{0, 1\}\). To inform our choice, we listen to \(n\) “experts”, whose opinions form a vector of 0’s and 1’s \(x_t\in\mathcal{X}=\{0,1\}^n\). After the learner takes its binary action, the true answer in \(\{0,1\}\) is revealed and the loss is given by the 0-1 loss

\[ \begin{equation} \ell^t(x_t, a_t) = \begin{cases} 0 & \text{if } a_t\text{ is the correct answer}\\ 1 & \text{otherwise} \end{cases} \end{equation} \]

We can then see that the learner’s cumulative loss is merely the number of mistakes it makes over \(T\) attempts, so in this setting we will measure performance by mistake counts.

In this paper of Littlestone-Warmuth, a simple algorithm called the weighted majority algorithm is introduced whose number of mistakes stays within a constant factor of the number of mistakes of the best expert (plus a logarithmic term). We maintain a list of weights \(w_1,...,w_n\), one for each expert, and we vote based on a weighted majority of the experts: that is, for the expert opinions \((x_1,...,x_n)\in\{0,1\}^n\), we vote 1 if

\[ \sum_{i:x_i=1} w_i \ge \sum_{i:x_i=0} w_i \]

and 0 otherwise.

Once we receive the correct answer, we penalize each incorrect expert by multiplying their weight by 0.5. In code:
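Here is a minimal Python sketch of one round; the names weights, opinions, and truth are illustrative:

    # Weighted majority: vote with the weighted majority of the expert opinions,
    # then halve the weight of every expert that turned out to be wrong.
    def weighted_majority_round(weights, opinions, truth):
        weight_for_1 = sum(w for w, x in zip(weights, opinions) if x == 1)
        weight_for_0 = sum(w for w, x in zip(weights, opinions) if x == 0)
        action = 1 if weight_for_1 >= weight_for_0 else 0

        # Penalize every expert whose opinion differed from the revealed answer.
        new_weights = [w * 0.5 if x != truth else w
                       for w, x in zip(weights, opinions)]
        return action, new_weights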

The main theorem of Littlestone-Warmuth is a mistake-bound analysis of this learning algorithm: the number of mistakes \(M\) made by the weighted majority learner is bounded above by

\[ M \le 2.41(m + \log_2{n}) \]

where \(m\) is the number of mistakes made by the best expert so far.

Proof: This is fairly straightforward. Let \(W\) be the total weight of all experts (so initially \(W=n\)). If the learner makes a mistake, then at least half of the total weight was on the wrong experts, and that chunk gets halved; as a consequence, we lose at least a quarter of the total weight. So

\[ W \le n(3/4)^M \]

where \(M\) is the total number of mistakes made by the learner.

On the other hand, if our best expert has made \(m\) mistakes, its weight is \((1/2)^m\), and so \(W\ge 1/2^m\). Combining the two gives

\[ 1/2^m \le n(3/4)^M \]

Taking logarithms and rearranging (using \(1/\log_2(4/3)\approx 2.41\)) gives the claimed bound \(M\le 2.41(m+\log_2{n})\). \(\square\)

Many algorithms in computer science are enhanced by introducing randomness, and this one is no exception: randomizing miraculously gives a better bound. Instead of taking a weighted majority vote, we normalize the weights into probabilities and follow the opinion of a randomly chosen expert. We then multiply the weights of all wrong experts by \(\beta\), where \(\beta\in(0,1)\) is a hyperparameter we can tune.

Via a similar argument, we can bound the expected number of mistakes:

\[ \mathbf{E}[M] \le \frac{m\log(1/\beta) + \log n}{1-\beta} \]

where, again, \(m\) is the number of mistakes made by the best expert.
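A sketch of the randomized variant, again with illustrative names, where beta is the penalty factor in \((0,1)\):

    import random

    # Randomized weighted majority: sample an expert with probability proportional
    # to its weight, follow its opinion, then multiply every wrong expert's weight
    # by beta.
    def randomized_wm_round(weights, opinions, truth, beta=0.5):
        expert = random.choices(range(len(weights)), weights=weights)[0]
        action = opinions[expert]
        new_weights = [w * beta if x != truth else w
                       for w, x in zip(weights, opinions)]
        return action, new_weights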

convex optimization

A special case that we will focus on is the setting of online convex optimization. Here, we receive no signals \(x_t\) from the environment, and instead our “actions” will be points in a convex domain \(a_t\in\mathcal{K}\). The loss here will be given by an arbitrary convex function \(f_t\), and so the goal of our convex optimizer is to minimize the regret term

\[ R^T = \sum_{t=1}^T f_t(a_t) - \min_{a^*\in\mathcal{K}}\left\{\sum_{t=1}^T f_t(a^*)\right\} \]

Our goal in this post is to introduce and derive some important algorithms for solving the online convex optimization problem, and apply these algorithms to game-theoretic solutions in modern machine learning.

Online convex optimization in pseudocode is given by:
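    for t = 1, 2, ..., T:
        learner chooses a point x_t in the convex set K
        environment reveals a convex loss function f_t
        learner suffers loss f_t(x_t)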

which is similar to the general online learning situation above. In OCO we are trying to compete with the offline optimum

\[ \min_{x\in\mathcal{K}}\sum_{t=1}^T f_t(x) \]

(that is, to achieve sublinear regret against it), where our \(f_t\) come from a potentially constrained space of functions \(\mathcal{F}\). The goal is to obtain, in a generic way, a regret bound of the form

\[ R^T \le \mathcal{O}_{\mathcal{K},\mathcal{F}}(\sqrt{T}) \]

As in the last section, we will first describe an online convex optimization problem inspired by the expert opinion problem: here we have \(n\) experts, and the goal is to make the best use of their opinions in order to minimize a linear loss.

At time \(t=1,2,...\) we need to choose \(x_t\in\Delta^n\) (where \(\Delta^n\) is the probability simplex; see the last post), a vector of probabilities for following each expert’s advice, i.e.

\[ x_{t,i}=\text{probability of listening to expert }i\text{ at time }t \]

Let \(\ell^t\) be the loss vector where

\[ \ell^t_i=\text{loss of listening to expert }i\text{ at time }t \]

Our expected loss is then

\[ \sum_{i=1}^n x_{t,i}\ell^t_i = \langle x_t, \ell^t\rangle \]

This is the case of online convex optimization where at each time \(t\), the generated point is \(x_t\in\Delta^n\) (note that the probability simplex is convex!) and the linear convex loss is given by \(f_t(x)=\langle x, \ell^t\rangle\).

The learning algorithm we will use for this problem is known as multiplicative weights.

multiplicative weights

We give here a statistical physics approach to a no-regret algorithm for the expert advice problem.

Focusing on a single expert \(i\), call the cumulative loss incurred by this expert up to time \(t-1\) the energy

\[ E_t(i) = \sum_{k=1}^{t-1}\ell_i^k \]

So at time \(t\), the learner knows \(E_t(i)\) for each \(i\). Since the loss is linear over the simplex \(\Delta^n\), the offline optimum of our problem is attained at a vertex: it is the energy of the lowest-energy (i.e. ground state) expert,

\[ \min_{i\in\{1,...,n\}} E_t(i) \]

For any inverse temperature parameter \(\beta>0\), we trivially have \(e^{-\beta\min_i E_t(i)}\le\sum_{i=1}^n e^{-\beta E_t(i)}\) and so

\[ \min_{i\in\{1,...,n\}} E_t(i) \ge -\frac{1}{\beta}\log{\sum_{i=1}^n e^{-\beta E_t(i)}} \]

where \(\Phi_t = -\frac{1}{\beta}\log{\sum_{i=1}^n e^{-\beta E_t(i)}}\) is the free energy at temperature \(1/\beta\) at time \(t\). We will use this to establish a regret bound.

Given the energies of each expert, how do we decide on a distribution to sample them from? Statistical physics says to form a Boltzmann-Gibbs distribution

\[ x_{t,i}=\frac{1}{Z_t}e^{-\beta E_t(i)} \]

where \(Z_t=\sum_{i=1}^n e^{-\beta E_t(i)}\) is the partition function at time \(t\). This makes sense: we want to heed more often the experts that have incurred the least loss.

This choice of \(x_t\in\Delta^n\) is the multiplicative weights algorithm:

    maintain weights w_t = [w_t(1), w_t(2), ..., w_t(n)]
    update as
        w_t[i] <- w_{t-1}[i] * exp(-beta * loss_{t-1}[i])
    generate strategy x_t as
        x_t[i] <- w_t[i] / sum(w_t)

which in Python can be sketched as
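    import numpy as np

    # Multiplicative weights: keep one weight per expert, decay each weight
    # exponentially in that expert's loss, and play the normalized weights.
    def multiplicative_weights(losses, beta):
        # losses: array of shape (T, n); losses[t, i] = loss of expert i at time t
        T, n = losses.shape
        weights = np.ones(n)
        strategies = []
        for t in range(T):
            strategies.append(weights / weights.sum())       # x_t
            weights = weights * np.exp(-beta * losses[t])    # w_{t+1}
        return np.array(strategies)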

Let us now determine the regret bound for this learning algorithm. Note that

\[ \begin{align*} \Phi_{t+1} - \Phi_t &= -\frac{1}{\beta}\log{\sum_{i=1}^n e^{-\beta E_{t+1}(i)}} + \frac{1}{\beta}\log{\sum_{i=1}^n e^{-\beta E_t(i)}} \\ &= -\frac{1}{\beta}\log{\frac{Z_{t+1}}{Z_t}} \end{align*} \]

As \(E_{t+1}(i)=E_t(i)+\ell_i^t\) and \(Z_{t+1}=\sum_{i=1}^n e^{-\beta E_t(i)}e^{-\beta\ell_i^t}\), we have

\[ \begin{align*} \frac{Z_{t+1}}{Z_t} &= \sum_{i=1}^n\left(\frac{e^{-\beta E_t(i)}}{Z_t}\right)e^{-\beta\ell_i^t} \\ &= \sum_{i=1}^n x_{t,i} e^{-\beta\ell_i^t} \end{align*} \]

Assume, without loss of generality (we can rescale if necessary), that \(\|\ell^t\|_\infty\le 1\) and \(0<\beta<\frac{1}{2}\). Taylor expanding the exponentials, we have the bound

\[ e^{-\beta\ell_i^t} \le 1 - \beta\ell^t_i + \beta^2\ell_i^{t,2} \]

where \(\ell^{t,2}\) is the pointwise square of the vector \(\ell^t\). Substituting this bound into the equality above gives the inequality

\[ \begin{align*} \frac{Z_{t+1}}{Z_t} &\le \sum_{i=1}^n x_{t,i}\cdot(1 - \beta\ell^t_i + \beta^2\ell_i^{t,2}) \\ &= 1 - \beta\langle x_t,\ell^t\rangle + \beta^2\langle x_t, \ell^{t,2}\rangle \end{align*} \]

Since \(1-z\le e^{-z}\) for all \(z\), we get

\[ \frac{Z_{t+1}}{Z_t} \le \exp(- \beta\langle x_t,\ell^t\rangle + \beta^2\langle x_t, \ell^{t,2}\rangle) \]

so

\[ \begin{align*} \Phi_{t+1}-\Phi_t &= -\frac{1}{\beta}\log\frac{Z_{t+1}}{Z_t} \\ &\ge -\frac{1}{\beta}\cdot\left(-\beta\langle x_t,\ell^t\rangle + \beta^2\langle x_t, \ell^{t,2}\rangle\right) \\ &= \langle x_t,\ell^t\rangle - \beta\langle x_t, \ell^{t,2}\rangle \end{align*} \]

Rearranging, we have

\[ \langle x_t,\ell^t \rangle \le \Phi_{t+1}-\Phi_t+\beta\langle x_t, \ell^{t,2}\rangle \]

Summing over \(t\) and realizing we have a telescoping sum, we get

\[ \sum_{t=1}^T\langle x_t,\ell^t\rangle \le \Phi_{T+1}-\Phi_1 + \beta\sum_{t=1}^T\langle x_t,\ell^{t,2}\rangle \]

As \(\Phi_1=-\frac{1}{\beta}\log{n}\) and \(\Phi_{T+1}\le\min_{j\in\{1,...,n\}}\sum_{t=1}^T \ell_j^t\), we combine the above to get

\[ \begin{align*} R^T &= \sum_{t=1}^T\langle x_t,\ell^t\rangle - \min_{j\in\{1,...,n\}}\sum_{t=1}^T\ell_j^t \\ &\le \frac{\log{n}}{\beta} + \beta\sum_{t=1}^T\langle x_t,\ell^{t,2}\rangle \\ &\le \frac{\log{n}}{\beta} + \beta T \end{align*} \]

Taking \(T \ge 4\log{n}\) and \(\beta=\sqrt{\frac{\log{n}}{T}}\), we get a regret bound

\[ R^T \le 2\sqrt{T\log{n}}. \]

follow the regularized leader

Multiplicative weights is a nice algorithm, but it isn’t clear how to extend it to convex sets \(\mathcal{K}\) other than the probability simplex.

So instead, we start from scratch and think more generally: for the moment we assume nothing about \(\mathcal{K}\) or \(\mathcal{F}\). What should guide us in choosing the next \(x_t\in\mathcal{K}\) in our online convex optimization setting?

At time \(t\), all we have are our loss functions \(f_1, f_2,..., f_{t-1}\). If some solutions worked well in past time steps, why not just capitalize on those in a greedy fashion? The follow the leader (FoL) algorithm does exactly this: we take

\[ x_t = \operatorname*{argmin}_{x\in\mathcal{K}}\sum_{k=1}^{t-1} f_k(x) \]

When does this do badly? Intuitively, FoL can overfit to past history: if we get sequences of conflicting loss functions, the algorithm can continually change its mind about the “best point” \(x_t\).
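A concrete (standard) instance of this failure, sketched here with \(\mathcal{K}=[-1,1]\) and linear losses \(f_1(x)=x/2\), \(f_2(x)=-x\), \(f_3(x)=x\), \(f_4(x)=-x\), and so on: FoL keeps jumping between the endpoints and lands on the wrong one every round, so its regret grows linearly in \(T\).

    # Follow the leader on K = [-1, 1] with alternating linear losses
    # f_1(x) = 0.5*x, f_2(x) = -x, f_3(x) = x, f_4(x) = -x, ...
    # For linear losses, argmin over [-1, 1] of the cumulative loss is an
    # endpoint: -1 if the cumulative gradient is positive, +1 if negative.
    T = 100
    grads = [0.5] + [(-1.0) ** t for t in range(1, T)]
    x, cum_grad, fol_loss = 0.0, 0.0, 0.0     # x_1 = 0 is an arbitrary first play
    for g in grads:
        fol_loss += g * x                     # suffer f_t(x_t)
        cum_grad += g
        x = -1.0 if cum_grad > 0 else 1.0     # next FoL iterate
    best_fixed = min(sum(grads) * x_star for x_star in (-1.0, 1.0, 0.0))
    print(fol_loss - best_fixed)              # regret grows linearly in T (here 99.5)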

Formally, the following regret bound shows that this instability is the algorithm’s main failure mode:

\[ R^T \le \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}) \]

To see this, note by definition that

\[ R^T = \sum_{t=1}^T f_t(x_t) - \min_{x\in\mathcal{K}}\sum_{t=1}^T f_t(x) \]

So the regret bound above is equivalent to showing

\[ \min_{x\in\mathcal{K}}\sum_{t=1}^T f_t(x) \ge \sum_{t=1}^T f_t(x_{t+1}) \]

We show this by induction: in the base case (\(T=1\)), we have \(\min_{x\in\mathcal{K}}f_1(x) = f_1(x_2)\), by definition of the FoL algorithm. Now assume this is true for \(T\). We decompose

\[ \sum_{t=1}^{T+1} f_t(x_{t+1}) = \left(\sum_{t=1}^T f_t(x_{t+1})\right) + f_{T+1}(x_{T+2}) \]

Note that, by the induction hypothesis, \(\sum_{t=1}^T f_t(x_{t+1})\le\min_{x\in\mathcal{K}}\sum_{t=1}^T f_t(x)\), and that \(\min_{x\in\mathcal{K}}\sum_{t=1}^T f_t(x)\le\sum_{t=1}^T f_t(x_{T+2})\). Hence we have

\[ \begin{align*} \sum_{t=1}^{T+1} f_t(x_{t+1}) &= \left(\sum_{t=1}^T f_t(x_{t+1})\right) + f_{T+1}(x_{T+2}) \\ &\le \left(\min_{x\in\mathcal{K}}\sum_{t=1}^T f_t(x)\right) + f_{T+1}(x_{T+2}) \\ &\le \sum_{t=1}^T f_t(x_{T+2}) + f_{T+1}(x_{T+2}) \\ &= \sum_{t=1}^{T+1} f_t(x_{T+2}) \\ &= \min_{x\in\mathcal{K}}\sum_{t=1}^{T+1} f_t(x) \end{align*} \]

\(\square\).

This suggests that we should take smaller steps, balancing the old information from past cost functions against the new information coming from the environment.

We do this by introducing a regularizer \(R\) and taking at each time \(t\) a regularized step

\[ x_t = \operatorname*{argmin}_{x\in\mathcal{K}}\left\{R(x) + \sum_{k=1}^{t-1} f_k(x)\right\} \]

This algorithm is follow the regularized leader, or FoReL.

What is the regret bound generically for FoReL? Let \(x_1,..., x_T\) be the sequence generated by FoReL on \(\mathcal{K}\) with cost functions \(f_1,..., f_T\).

Theorem: Let \(R^T(x)=\sum_{t=1}^T\left(f_t(x_t) - f_t(x)\right)\). Then

\[ R^T(x) \le R(x) - R(x_1) + \sum_{t=1}^T\left(f_t(x_t) - f_t(x_{t+1})\right). \]

Proof: Consider FoReL as an FoL process run on the loss sequence \(R, f_1, f_2,...\), with the regularizer acting as a zeroth loss. Starting with \(x_1\), FoL then generates \(x_1, x_1, x_2,...,x_T\), and the previous regret bound for FoL gives the bound above. \(\square\)

FoReL encompasses a very general online optimization procedure, and leads to many other popular algorithms in use. For example, we will now use FoReL to rediscover the multiplicative weights algorithm. If we were to run FoL on the expert advice problem, we would find that it tends to concentrate probability on a single expert. To “spread out” the distribution, we would want to increase the entropy (i.e. information spread) of the resulting distribution. Since we’re minimizing cost, we will use negentropy (i.e. negative of entropy) as the regularizer

\[ R(x) = c \cdot \sum_{i=1}^n x_i\log{x_i} \]

Consider FoReL on \(\Delta^n\) with negentropic regularization:

\[ x_t = \operatorname*{argmin}_{x\in\Delta^n}\left\{\sum_{k=1}^{t-1} \langle\ell^k, x\rangle + c\cdot\sum_{i=1}^n x_i\log{x_i}\right\} \]

Solving by Lagrange multipliers, we look for critical points of the Lagrangian

\[ f(x,\lambda)=\sum_{k=1}^{t-1}\langle\ell^k, x\rangle + c\cdot\sum_{i=1}^n x_i\log{x_i} + \lambda\left(\sum_{i=1}^n x_i-1\right) \]

Taking derivatives,

\[ \partial_{x_i}f(x,\lambda) = \sum_{k=1}^{t-1}\ell_i^k+c(1+\log{x_i})+\lambda \]

and setting to \(0\) at the minima, gives

\[ x_i = \exp\left(-1-\frac{\lambda}{c}-\frac{1}{c}\sum_{k=1}^{t-1}\ell_i^k\right) \]

Since \(\sum_{i=1}^n x_i = 1\) as \(x\in\Delta^n\), we can arrange

\[ \begin{align*} 1 = \sum_{i=1}^n x_i &= \sum_{i=1}^n e^{-1-\frac{\lambda}{c}}\exp\left(-\frac{1}{c}\sum_{k=1}^{t-1}\ell_i^k\right) \\ &= e^{-1-\frac{\lambda}{c}}\sum_{i=1}^n \exp\left(-\frac{1}{c}\sum_{k=1}^{t-1}\ell_i^k\right) \end{align*} \]

which shows that

\[ e^{1+\frac{\lambda}{c}} = \sum_{i=1}^n \exp\left(-\frac{1}{c}\sum_{k=1}^{t-1}\ell_i^k\right) \]

and hence, solving for \(\lambda\):

\[ \lambda = c\left(-1 + \log\sum_{i=1}^n\exp\left(-\frac{1}{c}\sum_{k=1}^{t-1}\ell_i^k\right)\right) \]

Plugging this multiplier in gives

\[ x_i = \frac{\exp\left(-\frac{1}{c}\sum_{k=1}^{t-1}\ell_i^k\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{c}\sum_{k=1}^{t-1}\ell_j^k\right)} \]

If we choose \(c=\frac{1}{\beta}\), we get the multiplicative weights algorithm!

computing FoReL

In this section, we will take a look at computing the FoReL dynamics. Considering it as a discrete dynamical system, we want to iteratively compute

\[ x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{K}}\left\{R(x) + \sum_{k=1}^t f_k(x)\right\} \]

for each time step \(t\). However, computing this requires taking an argmin over an entire convex space, which is itself a (constrained) optimization problem. Is there a way to reduce this to a sequence of easier optimization problems?

Consider the case where the regularization function \(R(x)\) is a quadratic norm \(R(x)=\frac{1}{2}\|x\|^2\) and we scale the losses by a step size \(\eta>0\), that is,

\[ x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{K}}\left\{\eta\sum_{s=1}^t f_s(x) + \frac{1}{2}\|x\|^2\right\} \]

We start by looking at the case where the convex loss functions \(f_s\) are linear, that is \(f_s(x) = \nabla{f_s}\cdot x\) for constant gradient vectors \(\nabla f_s\). In this case (ignoring the constraint for the moment, i.e. taking \(\mathcal{K}=\mathbf{R}^d\) so that minimizers are critical points), the problem above simplifies: by definition of \(x_{t+1}\) and \(x_t\),

\[ \nabla \left\{\eta\sum_{s=1}^t f_s(x) + \frac{1}{2}\|x\|^2\right\}(x_{t+1}) = \eta\sum_{s=1}^t\nabla{f_s} + x_{t+1} = 0 \]

and

\[ \nabla \left\{\eta\sum_{s=1}^{t-1} f_s(x) + \frac{1}{2}\|x\|^2\right\}(x_t) = \eta\sum_{s=1}^{t-1}\nabla{f_s} + x_t = 0 \]

Hence, combining the two we see that

\[ x_{t+1} = x_t - \eta\nabla{f_t}(x_t) \]

which is online gradient descent. In particular, the FoReL dynamics here is solvable in a “closed form”, and only involves the most recent loss function at that time step. Indeed, this is the main computational gain: we turn a minimization problem involving all losses \(f_s\) up to time \(t\) into a problem involving just one of them, \(f_t\).
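A sketch of this update in code (assuming unconstrained iterates in \(\mathbf{R}^d\), with grad_fs a list of callables returning \(\nabla f_t\) at a point):

    import numpy as np

    # Online gradient descent: with a quadratic regularizer and linear losses,
    # each FoReL step reduces to one gradient step on the most recent loss.
    def online_gradient_descent(grad_fs, x0, eta):
        x = np.asarray(x0, dtype=float)
        iterates = [x.copy()]
        for grad_f in grad_fs:
            x = x - eta * grad_f(x)       # x_{t+1} = x_t - eta * grad f_t(x_t)
            iterates.append(x.copy())
        return iterates

For the linear losses above, each grad_fs[t] would simply return the constant vector \(\nabla f_t\).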

This argument can be extended, but in order to do so, we need to modify our notion of “distance to a solution”. Indeed, we will effectively turn our constrained optimization problem into an unconstrained one, but in a “different metric space” of sorts.

Consider the regularized minimization problem

\[ x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{K}}\left\{R(x) + \eta\sum_{k=1}^t f_k(x)\right\} \]

and iteratively define

\[ \Phi_0 = R \\ \Phi_t = \Phi_{t-1} + \eta f_t \]

Then we can write \(x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{K}}\Phi_t(x)\). For \(\Phi:\mathbf{R}^d\to\mathbf{R}\), define the Bregman divergence with respect to \(\Phi\) as

\[ D_\Phi(a, b) = \Phi(a) - (\Phi(b)+\nabla{\Phi}(b)\cdot(a-b)) \]

which is the difference between \(\Phi(a)\) and the first-order approximation of \(\Phi(a)\) centered at \(b\).

Here are two basic examples: First, let \(\Phi(a)=\frac{1}{2}\|a\|^2\). Then the Bregman divergence corresponding to this function is given by

\[ \begin{align*} D_\Phi(a, b) &= \frac{1}{2}\|a\|^2-\frac{1}{2}\|b\|^2-b\cdot(a-b) \\ &= \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2 - a^\top b \\ &= \frac{1}{2}\|a-b\|^2 \end{align*} \]

For a second example, let \(\Phi(a)=\sum_i a_i\log{a_i}\), the negative entropy function. Computing the Bregman divergence:

\[ \begin{align*} D_\Phi(a, b) &= \sum_i a_i\log{a_i}-\sum_i b_i\log{b_i} - \sum_i (\log{b_i}+1)(a_i-b_i) \\ &= \sum_i a_i\log{\frac{a_i}{b_i}} \end{align*} \]

as \(\sum_i a_i = 1 = \sum_i b_i\). Here we see that the Bregman divergence in this case is the Kullback-Leibler divergence.
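A quick numerical check of these two examples (a sketch; the function names are illustrative):

    import numpy as np

    # D_phi(a, b) = phi(a) - phi(b) - <grad phi(b), a - b>
    def bregman(phi, grad_phi, a, b):
        return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

    a = np.array([0.2, 0.3, 0.5])
    b = np.array([0.4, 0.4, 0.2])

    # Quadratic case: matches half the squared Euclidean distance.
    print(bregman(lambda x: 0.5 * np.dot(x, x), lambda x: x, a, b),
          0.5 * np.sum((a - b) ** 2))

    # Negative-entropy case: matches the KL divergence, since a and b sum to 1.
    print(bregman(lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0, a, b),
          np.sum(a * np.log(a / b)))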

For now, consider the case of linear losses. In the unconstrained optimization situation (\(\mathcal{K}=\mathbf{R}^d\)), we are trying to iteratively solve the dynamics

\[ \tilde{x}_{t+1} = \operatorname*{argmin}_{x\in\mathbf{R}^d}\left\{\eta\sum_{s=1}^t g_s\cdot x + R(x)\right\} \]

where the tilde denotes the unconstrained dynamics. As in the argument before, the first-order optimality conditions give

\[ \nabla R(\tilde{x}_{t+1})=-\eta\sum_{s=1}^t g_s \\ \nabla R(\tilde{x}_t)=-\eta\sum_{s=1}^{t-1} g_s \]

by definition of \(\tilde{x}_{t+1}\) and \(\tilde{x}_t\). Hence \(\nabla R(\tilde{x}_{t+1})=\nabla R(\tilde{x}_t)-\eta g_t\), and so (assuming \(\nabla R\) is invertible)

\[ \tilde{x}_{t+1} = \nabla R^{-1}(\nabla R(\tilde{x}_t)-\eta g_t) \]

which is a generalization of online gradient descent called mirror descent.
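For instance, with the negative-entropy regularizer \(R(x)=\sum_i x_i\log{x_i}\) we have \(\nabla R(x)=\log{x}+1\) and \(\nabla R^{-1}(y)=e^{y-1}\), so the unconstrained mirror descent step is a multiplicative update. A sketch:

    import numpy as np

    # Entropic mirror descent step: grad R(x) = log(x) + 1 has inverse
    # y -> exp(y - 1), so (grad R)^{-1}(grad R(x) - eta * g) = x * exp(-eta * g).
    def entropic_mirror_step(x, g, eta):
        return x * np.exp(-eta * g)   # unconstrained (not yet normalized) update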

How do we deal with the constrained optimization case? It turns out that all we need to do is solve the unconstrained case and then “project” the answer down onto the constraints, i.e. the convex set \(\mathcal{K}\) we’re optimizing over. Formalizing this, let

\[ \Pi^\Phi_\mathcal{K}(b) = \operatorname*{argmin}_{a\in\mathcal{K}} D_\Phi(a, b) \]

be the Bregman projection of \(b\in\mathbf{R}^d\) onto \(\mathcal{K}\subset\mathbf{R}^d\) via the divergence \(D_\Phi\).

Theorem: Let \(x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{K}}\Phi_t(x)\) be the constrained optimization iterate, and \(\tilde{x}_{t+1} = \operatorname*{argmin}_{x\in\mathbf{R}^d}\Phi_t(x)\) the corresponding unconstrained iterate. Then they are related by the Bregman projection:

\[ x_{t+1} = \Pi_\mathcal{K}^{\Phi_t}(\tilde{x}_{t+1}) \]

Proof: Let \(x'_{t+1}=\Pi_\mathcal{K}^{\Phi_t}(\tilde{x}_{t+1})\). By definition,

\[ \Phi_t(x_{t+1}) \le \Phi_t(x'_{t+1}) \]

Conversely, \(D_{\Phi_t}(x'_{t+1},\tilde{x}_{t+1})\le D_{\Phi_t}(x_{t+1},\tilde{x}_{t+1})\) by definition of the Bregman projection. As \(\nabla\Phi_t(\tilde{x}_{t+1})=0\), we have

\[ D_{\Phi_t}(x,\tilde{x}_{t+1})=\Phi_t(x)-\Phi_t(\tilde{x}_{t+1}) \]

and so \(\Phi_t(x'_{t+1})\le\Phi_t(x_{t+1})\). By uniqueness of the minimizer (for strictly convex \(\Phi_t\)), this implies \(x_{t+1}=x'_{t+1}\). \(\square\)

Hence, with linear losses, the constrained optimization solution becomes

\[ x_{t+1} = \Pi^{\Phi_t}_\mathcal{K}\left(\nabla R^{-1}(\nabla R(\tilde{x}_t)-\eta g_t)\right) \]
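Continuing the entropic example: for linear losses the Bregman (KL) projection of a positive vector onto the simplex \(\Delta^n\) is simply normalization, so the constrained step above is exactly the multiplicative weights update from earlier. A sketch:

    import numpy as np

    # Constrained FoReL step with linear losses and the negative-entropy
    # regularizer: unconstrained mirror step, then Bregman (KL) projection
    # onto the simplex, which amounts to normalizing.
    def forel_entropic_step(x, g, eta):
        x_tilde = x * np.exp(-eta * g)     # unconstrained mirror descent step
        return x_tilde / x_tilde.sum()     # KL projection onto the simplex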

In general, we won’t have linear losses. What do we take as our unconstrained minimizers in this case? Let

\[ \tilde{x}_{t+1} = \operatorname*{argmin}_{x\in\mathbf{R}^d}\left\{\eta f_t(x) + D_{\Phi_{t-1}}(x, \tilde{x}_t)\right\} \]

We can prove this is a minimizer of \(\Phi_t\): by definition of \(\Phi_t\),

\[ \eta f_t(x) + D_{\Phi_{t-1}}(x, \tilde{x}_t) = \Phi_t(x) - \Phi_{t-1}(x) + D_{\Phi_{t-1}}(x, \tilde{x}_t) \]

Taking gradients with respect to \(x\),

\[ \nabla \Phi_t(x) - \nabla \Phi_{t-1}(x) + \nabla_x D_{\Phi_{t-1}}(x,\tilde{x}_t) \]

and noting that \(\nabla_x D_{\Phi_{t-1}}(x,\tilde{x}_t)=\nabla\Phi_{t-1}(x)-\nabla\Phi_{t-1}(\tilde{x}_t)\), setting to zero gives

\[ \nabla\Phi_t(\tilde{x}_{t+1})=\nabla\Phi_{t-1}(\tilde{x}_t) \]

which, by induction (with base case \(\nabla\Phi_0(\tilde{x}_1)=\nabla R(\tilde{x}_1)=0\)), is zero. So \(\tilde{x}_{t+1}\) minimizes \(\Phi_t\). As before, the constrained dynamics are then given by Bregman projection.