📐 Linear Algebra Prerequisites for SSL Theory

A problem set covering the exact linear algebra needed to read the SSL theory note — rank, eigenvalues, covariance matrices, SVD, and matrix calculus

How to use this note

Each section gives a brief definition block, then one or more exercises. Work through the exercises before expanding the solution. The goal is not completeness — it is to build fluency with the specific objects that appear in SSL Theory and SSL Vision.

🔢 Rank and the Column Space
🔬 Eigenvalues and the Spectral Theorem
✅ Positive Semidefinite Matrices
📊 Covariance Matrices
✂️ Singular Value Decomposition
📏 Trace, Frobenius Norm, and Stable Rank
🔄 Orthogonal Matrices and the Group O(d)
🧮 Matrix Calculus
📐 Cross-Covariance and Gaussian Conditioning
🎯 Rayleigh Quotient and Variational Characterization
🔗 Synthesis Exercises
📚 References

🔢 Rank and the Column Space

Definition. The column space of $A \in \mathbb{R}^{m \times n}$ is $\mathrm{col}(A) = \{Ax : x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$. The rank is $\mathrm{rank}(A) = \dim \mathrm{col}(A)$.

Key facts. - $\mathrm{rank}(A) \leq \min(m, n)$. If equality holds, $A$ is full rank. - $\mathrm{rank}(A) = \mathrm{rank}(A^\top)$ — row rank equals column rank. - $\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$. - $A \in \mathbb{R}^{n \times n}$ is invertible if and only if $\mathrm{rank}(A) = n$.

Why it matters for SSL. In SSL, we form an embedding matrix $Z \in \mathbb{R}^{N \times d}$ (rows = embeddings for $N$ images, columns = $d$ embedding dimensions). Representational collapse is exactly $\mathrm{rank}(Z) \ll d$ — the embeddings lie in a low-dimensional subspace and carry limited information.

Exercise 1: Rank of an Outer Product

The outer product of two vectors is the simplest rank-1 matrix — and the building block of low-rank structure.

Prerequisites: Rank and the Column Space

Let $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$ be nonzero vectors. Show that $A = uv^\top \in \mathbb{R}^{m \times n}$ has rank exactly 1. Then show that if all $N$ embedding vectors are identical — $z_1 = z_2 = \cdots = z_N = c$ — the centered embedding matrix $\tilde{Z} = Z - \bar{z}\mathbf{1}^\top$ has rank 0. What is the rank of the uncentered $Z$ in this case?

Solution to Exercise 1

Key insight: Any matrix whose rows all point in the same direction has rank 1; subtracting the mean when all rows are equal gives the zero matrix (rank 0).

Sketch: $A = uv^\top$: every column of $A$ is $u \cdot v_i$ (a scalar multiple of $u$), so $\mathrm{col}(A) = \mathrm{span}\{u\}$, which has dimension 1. For the collapse case: $Z = \mathbf{1} c^\top$ (all rows equal $c^\top$). Then $\bar{z} = c$, so $\tilde{Z} = Z - \mathbf{1}c^\top = 0$ — rank 0. The uncentered $Z = \mathbf{1}c^\top$ has rank 1 (since $\mathbf{1}$ and $c$ are nonzero). This is why the SSL literature says “full collapse has $\mathrm{rank}(Z) = 1$”: technically the uncentered matrix has rank 1, but there is zero variance in the embeddings — no information is encoded.

Exercise 2: Rank and Information Capacity

Low rank directly limits how many linearly separable classes a representation can support.

Prerequisites: Rank and the Column Space

Suppose a linear classifier uses the embedding $z \in \mathbb{R}^d$ to predict one of $K$ classes: $\hat{y} = \mathrm{argmax}_k\, w_k^\top z$. If the embedding matrix $Z \in \mathbb{R}^{N \times d}$ has rank $r$, what is the maximum number of classes $K$ that can be linearly separated by any classifier of this form? Relate this to dimensional collapse in SSL (where $r \ll d$).

Solution to Exercise 2

Key insight: A rank-$r$ embedding lives in an $r$-dimensional subspace, so at most $r+1$ classes can be linearly separated — regardless of $d$.

Sketch: All embeddings lie in a subspace $V \subseteq \mathbb{R}^d$ with $\dim V = r$. The linear classifier $w_k^\top z$ restricted to $V$ is equivalent to a linear classifier in $\mathbb{R}^r$. By the VC dimension of linear classifiers in $\mathbb{R}^r$, at most $r + 1$ points in general position can be shattered. So at most $r + 1$ classes can be linearly separated. If $r \ll d$, then dimensional collapse wastes most of the embedding capacity — a 2048-dimensional embedding with rank 10 cannot linearly separate more classes than a 10-dimensional embedding.

🔬 Eigenvalues and the Spectral Theorem

Definition. A scalar $\lambda \in \mathbb{R}$ is an eigenvalue of $A \in \mathbb{R}^{n \times n}$ with eigenvector $v \neq 0$ if $Av = \lambda v$.

The spectral theorem. If $A \in \mathbb{R}^{n \times n}$ is symmetric ($A = A^\top$), then: 1. All eigenvalues of $A$ are real. 2. Eigenvectors for distinct eigenvalues are orthogonal. 3. $A$ has an eigendecomposition: $A = Q \Lambda Q^\top$ where $Q \in \mathbb{R}^{n \times n}$ has orthonormal columns (the eigenvectors) and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

Why it matters. The covariance matrix $\Sigma = \mathrm{Cov}(z)$ is always symmetric. Its eigenvectors are the principal directions of the data distribution; its eigenvalues measure the variance in each direction. $\mathrm{rank}(\Sigma) < d$ exactly when some eigenvalue is zero — the embedding has collapsed in that direction.

Exercise 3: Eigenvalues and Matrix Functions

Functions of a symmetric matrix act on eigenvalues, not on the matrix entries.

Prerequisites: Eigenvalues and the Spectral Theorem

Let $A = Q\Lambda Q^\top$ be the eigendecomposition of a symmetric matrix $A$ with eigenvalues $\lambda_1, \ldots, \lambda_n > 0$. Show that $A^{-1} = Q\Lambda^{-1}Q^\top$ and $A^{1/2} = Q\Lambda^{1/2}Q^\top$ (where $\Lambda^{-1} = \mathrm{diag}(1/\lambda_i)$ and $\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_i})$). Use this to compute $\mathrm{tr}(A^{-1})$ in terms of the eigenvalues.

Solution to Exercise 3

Key insight: Any reasonable function $f$ of a symmetric matrix is computed by applying $f$ to the diagonal entries of $\Lambda$: $f(A) = Q f(\Lambda) Q^\top$. This makes matrix functions tractable whenever the eigendecomposition is known.

Sketch: $A^{-1} = (Q\Lambda Q^\top)^{-1} = (Q^\top)^{-1}\Lambda^{-1}Q^{-1} = Q\Lambda^{-1}Q^\top$ (using $Q^{-1} = Q^\top$ for orthogonal $Q$). Similarly $A^{1/2} \cdot A^{1/2} = Q\Lambda^{1/2}Q^\top Q\Lambda^{1/2}Q^\top = Q\Lambda Q^\top = A$ ✓. For the trace: $\mathrm{tr}(A^{-1}) = \mathrm{tr}(Q\Lambda^{-1}Q^\top) = \mathrm{tr}(\Lambda^{-1}Q^\top Q) = \mathrm{tr}(\Lambda^{-1}) = \sum_i 1/\lambda_i$ (using the cyclic property $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$).

Exercise 4: Spectral Characterization of Rank

The rank of a symmetric matrix equals the number of nonzero eigenvalues.

Prerequisites: Eigenvalues and the Spectral Theorem

Let $A = Q\Lambda Q^\top$ be symmetric with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r > \lambda_{r+1} = \cdots = \lambda_n = 0$. Show that $\mathrm{rank}(A) = r$. (Hint: use $A = \sum_{i=1}^n \lambda_i q_i q_i^\top$.) Why does this mean $\Sigma \succ 0$ (all eigenvalues positive) is equivalent to $\mathrm{rank}(\Sigma) = d$?

Solution to Exercise 4

Key insight: The spectral decomposition writes $A$ as a sum of rank-1 outer products — the number of nonzero terms is exactly the rank.

Sketch: From $A = Q\Lambda Q^\top$: expand as $A = \sum_{i=1}^n \lambda_i q_i q_i^\top$. The nonzero terms are $i = 1, \ldots, r$ (since $\lambda_i = 0$ for $i > r$). Each $\lambda_i q_i q_i^\top$ has rank 1 and column space $\mathrm{span}\{q_i\}$. Since $q_1, \ldots, q_r$ are orthonormal (hence linearly independent), $\mathrm{col}(A) = \mathrm{span}\{q_1, \ldots, q_r\}$ has dimension $r$. So $\mathrm{rank}(A) = r$. The equivalence $\Sigma \succ 0 \Leftrightarrow \mathrm{rank}(\Sigma) = d$ follows immediately: $\Sigma \succ 0$ means all $\lambda_i > 0$ (by definition), which means all $n$ eigenvalues are nonzero, so $r = n = d$.

✅ Positive Semidefinite Matrices

Definition. $A \in \mathbb{R}^{n \times n}$ is positive semidefinite (PSD), written $A \succeq 0$, if $x^\top A x \geq 0$ for all $x \in \mathbb{R}^n$. It is positive definite (PD), written $A \succ 0$, if $x^\top Ax > 0$ for all $x \neq 0$.

Equivalent characterizations (for symmetric $A$): - $A \succeq 0 \Leftrightarrow$ all eigenvalues $\lambda_i \geq 0$. - $A \succ 0 \Leftrightarrow$ all eigenvalues $\lambda_i > 0 \Leftrightarrow A$ is invertible. - $A \succeq 0 \Leftrightarrow A = B^\top B$ for some $B$ (the square root factorization).

SSL connection. The anti-collapse condition in SSL Theory is $\Sigma \succ 0$ — the embedding covariance must be positive definite. VICReg’s variance term explicitly enforces $\lambda_j \geq \gamma^2 > 0$ for all $j$, making $\Sigma \succ 0$ a hard constraint.

Exercise 5: Every Covariance Matrix Is PSD

This is why eigenvalue analysis is always valid for covariance matrices — their eigenvalues are guaranteed non-negative.

Prerequisites: Positive Semidefinite Matrices

Let $Z \in \mathbb{R}^{N \times d}$ be any matrix and $\hat\Sigma = Z^\top Z / N$. Show that $\hat\Sigma \succeq 0$ directly from the definition (i.e. show $x^\top \hat\Sigma x \geq 0$ for all $x \in \mathbb{R}^d$). Under what condition on $Z$ is $\hat\Sigma \succ 0$ (positive definite)?

Solution to Exercise 5

Key insight: $x^\top (Z^\top Z) x = \|Zx\|^2 \geq 0$ always — PSD follows from the norm being nonneg. PD requires $Zx \neq 0$ for all $x \neq 0$, i.e. $Z$ has full column rank.

Sketch: $x^\top \hat\Sigma x = x^\top (Z^\top Z/N) x = \|Zx\|^2/N \geq 0$ for all $x$ since norms are nonnegative. So $\hat\Sigma \succeq 0$. For $\hat\Sigma \succ 0$: need $\|Zx\|^2 > 0$ for all $x \neq 0$, i.e. $Zx \neq 0$ for all $x \neq 0$, i.e. the null space of $Z$ is $\{0\}$, i.e. $Z$ has full column rank ($\mathrm{rank}(Z) = d$). In SSL terms: $\hat\Sigma \succ 0 \Leftrightarrow$ no dimensional collapse in $Z$.

📊 Covariance Matrices

Definition. For a random vector $z \in \mathbb{R}^d$ with mean $\mu = \mathbb{E}[z]$, the covariance matrix is: \[\Sigma = \mathbb{E}[(z - \mu)(z - \mu)^\top] \in \mathbb{R}^{d \times d}.\]

The $(i,j)$ entry is $\Sigma_{ij} = \mathrm{Cov}(z_i, z_j)$. The diagonal $\Sigma_{jj} = \mathrm{Var}(z_j)$.

Eigendecomposition of $\Sigma$. Since $\Sigma \succeq 0$ and symmetric (by Exercise 5), $\Sigma = Q\Lambda Q^\top$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_i \geq 0$. The eigenvectors $q_i$ (columns of $Q$) are the principal directions — the directions of maximum/minimum variance. The eigenvalue $\lambda_i = \mathrm{Var}(q_i^\top z)$ is the variance of $z$ projected onto $q_i$.

SSL connection. The alignment and uniformity framework (SSL Theory §Alignment and Uniformity) requires the embedding covariance to be “spread” — all eigenvalues significant. The effective rank measures how spread the eigenvalue distribution is.

Exercise 6: Variance in a Rotated Basis

Eigenvalues are the variances in the eigenvector directions — the most natural coordinate system for a covariance matrix.

Prerequisites: Covariance Matrices

Let $\Sigma = Q\Lambda Q^\top$ be the covariance of $z$. Define the whitened variable $\tilde{z} = \Lambda^{-1/2}Q^\top z$ (assuming $\Sigma \succ 0$). Compute $\mathrm{Cov}(\tilde{z})$ and show it equals $I_d$. Why is whitening useful for the Barlow Twins objective (which targets $C = I_d$)?

Solution to Exercise 6

Key insight: Whitening simultaneously normalizes all variances to 1 and removes all correlations — it maps any full-rank covariance to the identity. Barlow Twins’ $C = I_d$ target is exactly the whitened condition.

Sketch: $\mathrm{Cov}(\tilde{z}) = \Lambda^{-1/2}Q^\top \mathrm{Cov}(z) Q\Lambda^{-1/2} = \Lambda^{-1/2}Q^\top Q\Lambda Q^\top Q\Lambda^{-1/2} = \Lambda^{-1/2}\Lambda\Lambda^{-1/2} = I_d$. The whitening transform makes all dimensions unit-variance and pairwise uncorrelated. Barlow Twins drives the cross-correlation matrix $C = (Z^A)^\top Z^B / N$ toward $I_d$ — this is cross-whitening: requiring the joint embedding $(z^A, z^B)$ to satisfy $\mathrm{Cov}(z^A_i, z^B_j) = \delta_{ij}$, i.e. each augmented view’s embeddings are whitened and the two views are maximally aligned dimension-by-dimension.

✂️ Singular Value Decomposition

Theorem (SVD). Every matrix $A \in \mathbb{R}^{m \times n}$ can be written as: \[A = U \Sigma V^\top\] where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0)$ with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$. The $\sigma_i$ are the singular values, $r = \mathrm{rank}(A)$.

Connection to eigendecomposition. $A^\top A = V\Sigma^\top U^\top U\Sigma V^\top = V\mathrm{diag}(\sigma_i^2)V^\top$ — so the right singular vectors of $A$ are the eigenvectors of $A^\top A$, with eigenvalues $\sigma_i^2$.

Eckart–Young theorem (best low-rank approximation). The best rank-$k$ approximation to $A$ in Frobenius norm is: \[A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top, \quad \|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2.\]

SSL connection. Tian et al.’s theorem says that SSL converges to the top-$d$ eigenvectors of the augmentation-averaged covariance $F$. Since $F = \mathbb{E}[z_1 z_2^\top]$ is symmetric PSD, its eigenvectors = its right singular vectors. SSL converges to the rank-$d$ SVD truncation of $F$ — the PCA of augmentation-invariant covariance.

Exercise 7: SVD and PCA

PCA is the SVD of the centered data matrix — the connection that underlies Tian et al.’s convergence theorem.

Prerequisites: Singular Value Decomposition

Let $X \in \mathbb{R}^{N \times n}$ be a centered data matrix (zero mean columns). Show that the top-$d$ principal components of $X$ (the directions of maximum variance) are exactly the top-$d$ right singular vectors of $X$. Then argue why minimizing the prediction error $\mathbb{E}[\|h^*(z_1) - z_2\|^2]$ over linear encoders $z = Wx$ with $W \in \mathbb{R}^{d \times n}$ is equivalent to computing the top-$d$ PCA directions of $\mathbb{E}[x_1 x_2^\top]$.

Solution to Exercise 7

Key insight: PCA = truncated SVD of the data matrix. SSL in the linear regime finds the PCA directions of the cross-view covariance — the maximally augmentation-consistent subspace.

Sketch: The sample covariance is $\hat\Sigma_X = X^\top X / N$. PCA finds the $d$ eigenvectors of $\hat\Sigma_X$ with largest eigenvalues. Since $X = U\Sigma V^\top$: $\hat\Sigma_X = V(\Sigma^\top\Sigma/N)V^\top$ — the right singular vectors $V$ are the eigenvectors of $\hat\Sigma_X$, so the top-$d$ PC directions = top-$d$ right singular vectors of $X$. ✓ For the SSL claim: the optimal linear predictor is $h^*(z_1) = \Sigma_{21}\Sigma_{11}^{-1}z_1$ (Exercise 3 in ssl-theory). Substituting $z_i = Wx_i$ and minimizing over $W$: the problem reduces to finding $W$ that maximizes $\mathrm{tr}(W F W^\top)$ subject to $WW^\top = I$ (orthonormality), where $F = \mathbb{E}[x_1 x_2^\top]$. By the variational characterization of eigenvalues, the optimum is $W = $ top-$d$ eigenvectors of $F$ — i.e., the PCA of the cross-view covariance.

📏 Trace, Frobenius Norm, and Stable Rank

Definitions.

Quantity	Formula	Properties
Trace	$\mathrm{tr}(A) = \sum_i A_{ii}$	$\mathrm{tr}(AB) = \mathrm{tr}(BA)$; $\mathrm{tr}(A) = \sum_i \lambda_i$
Frobenius norm	$\\|A\\|_F = \sqrt{\sum_{i,j}A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}$	$\\|A\\|_F^2 = \sum_i \sigma_i^2$
Spectral norm	$\\|A\\|_2 = \sigma_1$ (largest singular value of $A$)	$\\|A\\|_2 \leq \\|A\\|_F \leq \sqrt{\mathrm{rank}(A)}\,\\|A\\|_2$
Stable rank	$\mathrm{srank}(\Sigma) = (\mathrm{tr}\,\Sigma)^2 / \\|\Sigma\\|_F^2 = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$	$\in [1, d]$ for $\Sigma \succeq 0$

Two stable rank conventions

The definition above — $(\mathrm{tr}\,\Sigma)^2/\|\Sigma\|_F^2$ — is the convention used in SSL Theory and Exercise 9 below. A second convention (Vershynin, random matrix theory) defines $\mathrm{srank}(A) = \|A\|_F^2/\|A\|_2^2 = \sum_i \sigma_i^2/\sigma_1^2$. Both equal $d$ for $\Sigma = \sigma^2 I_d$ and $1$ for rank-1 matrices; they differ for intermediate eigenvalue spectra. This note uses the first convention throughout.

The cyclic trace property $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ is used constantly in matrix calculus derivations.

Stable rank for covariance matrices. For $\Sigma$ with eigenvalues $\lambda_i$: $\mathrm{tr}(\Sigma) = \sum_i \lambda_i$, $\|\Sigma\|_F^2 = \sum_i \lambda_i^2$, and $\mathrm{srank}(\Sigma) = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$. This equals $d$ when all $\lambda_i$ are equal, and 1 when only one $\lambda_i$ is nonzero.

Exercise 8: Trace and the Barlow Twins Loss

The Barlow Twins loss has a natural expression in terms of the Frobenius norm of the cross-correlation matrix.

Prerequisites: Trace, Frobenius Norm, and Stable Rank

Let $C \in \mathbb{R}^{d \times d}$ be the cross-correlation matrix from Barlow Twins. Show that: \[\mathcal{L}_{\mathrm{BT}} = \|C - I_d\|_F^2 + (\lambda - 1)\sum_i (1 - C_{ii})^2.\] For the special case $\lambda = 1$, the loss simplifies to $\|C - I_d\|_F^2$. Interpret: what does minimizing the Frobenius distance $\|C - I_d\|_F$ mean geometrically?

Solution to Exercise 8

Key insight: The Barlow Twins loss (at $\lambda=1$) is exactly the squared Frobenius distance between the cross-correlation matrix and the identity — it directly minimizes the “distance to whitened” condition.

Sketch: $\|C - I_d\|_F^2 = \sum_{i,j}(C_{ij} - \delta_{ij})^2 = \sum_i(C_{ii}-1)^2 + \sum_{i\neq j}C_{ij}^2$. The Barlow Twins loss is $\mathcal{L}_\text{BT} = \sum_i(1-C_{ii})^2 + \lambda\sum_{i\neq j}C_{ij}^2$. Comparing: $\mathcal{L}_\text{BT} = \|C-I\|_F^2 + (\lambda-1)\sum_{i\neq j}C_{ij}^2$. At $\lambda=1$: $\mathcal{L}_\text{BT} = \|C-I_d\|_F^2$. Geometrically, this is the squared distance (in the space of $d\times d$ matrices with the Frobenius inner product) between $C$ and the identity. Minimizing it finds the $C$ closest to $I_d$ — i.e., drives the embedding toward the whitened cross-covariance condition.

Exercise 9: Stable Rank Bounds

The stable rank lies between 1 and the true rank — it detects near-collapse even when no eigenvalue is exactly zero.

Prerequisites: Trace, Frobenius Norm, and Stable Rank

Let $\Sigma$ have eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d \geq 0$. Prove the bounds $1 \leq \mathrm{srank}(\Sigma) \leq \mathrm{rank}(\Sigma)$ using the Cauchy–Schwarz inequality. Then compute $\mathrm{srank}(\Sigma)$ for: (a) $\Sigma = \sigma^2 I_d$, (b) $\Sigma = \mathrm{diag}(1, 0, \ldots, 0)$, (c) $\Sigma = \mathrm{diag}(1, \varepsilon, \ldots, \varepsilon)$ with small $\varepsilon > 0$.

Solution to Exercise 9

Key insight: Cauchy–Schwarz gives both bounds simultaneously; the three cases illustrate that stable rank correctly identifies full rank, collapse, and near-collapse.

Sketch: Let $r = \mathrm{rank}(\Sigma)$ and $\lambda_1, \ldots, \lambda_r > 0$ (nonzero eigenvalues). Upper bound: Cauchy–Schwarz gives $(\sum_{i=1}^r \lambda_i)^2 = (\sum_{i=1}^r \lambda_i \cdot 1)^2 \leq (\sum_{i=1}^r \lambda_i^2)(\sum_{i=1}^r 1^2) = r\sum\lambda_i^2$. So $\mathrm{srank} = (\sum\lambda_i)^2/\sum\lambda_i^2 \leq r = \mathrm{rank}(\Sigma)$. Lower bound: $(\sum\lambda_i)^2 \geq \max_i\lambda_i \cdot \sum\lambda_i \geq \sum\lambda_i^2$ (since $\max_i\lambda_i \geq \lambda_i$ for each $i$), so $\mathrm{srank} \geq 1$. Cases: (a) $\Sigma = \sigma^2 I$: $\mathrm{srank} = (d\sigma^2)^2/(d\sigma^4) = d$ ✓ full. (b) $\Sigma = \mathrm{diag}(1,0,\ldots,0)$: $\mathrm{srank} = 1^2/1^2 = 1$ ✓ full collapse. (c) $\Sigma = \mathrm{diag}(1,\varepsilon,\ldots,\varepsilon)$: $\mathrm{srank} = (1+(d-1)\varepsilon)^2/(1+(d-1)\varepsilon^2)$. As $\varepsilon\to 0$: $\to 1$; as $\varepsilon \to 1$: $\to d$. For small $\varepsilon$, $\mathrm{srank} \approx 1 + 2(d-1)\varepsilon$ — near-collapse is detected even though all eigenvalues are strictly positive.

🔄 Orthogonal Matrices and the Group O(d)

Definition. $R \in \mathbb{R}^{d \times d}$ is orthogonal if $R^\top R = RR^\top = I_d$. Equivalently, $R^{-1} = R^\top$, and $R$ preserves inner products: $\langle Ru, Rv\rangle = u^\top R^\top R v = u^\top v$.

The orthogonal group. $O(d) = \{R \in \mathbb{R}^{d \times d} : R^\top R = I\}$ is a group under matrix multiplication. It has two connected components: rotations ($\det R = +1$, forming $SO(d)$) and reflections ($\det R = -1$).

SSL connection. In SSL Vision §Exercise 6 (Barlow Twins), the solution set of $\mathcal{L}_\text{BT} = 0$ forms an orbit under $O(d)$: if $Z$ is a solution, so is $ZR$ for any $R \in O(d)$. This means the SSL objective does not prefer any particular orientation of the embedding space — all rotations are equivalent. This is the rotational symmetry of SSL.

Exercise 10: Orthogonal Invariance of the Covariance

Rotating the embedding space does not change the spectrum of the covariance — only the eigenvectors rotate.

Prerequisites: Orthogonal Matrices and the Group O(d)

Let $Z \in \mathbb{R}^{N \times d}$ have covariance $\hat\Sigma = Z^\top Z / N$ (assuming zero mean). Let $\tilde Z = ZR$ for $R \in O(d)$. Compute the covariance $\tilde\Sigma$ of $\tilde Z$ and show it has the same eigenvalues as $\hat\Sigma$ (but generally different eigenvectors). What does this imply about $\mathrm{srank}(\tilde\Sigma)$ and $\mathrm{rank}(\tilde\Sigma)$?

Solution to Exercise 10

Key insight: Orthogonal transformations rotate the eigenvectors but leave eigenvalues unchanged — so rank, stable rank, and all spectral properties are $O(d)$-invariant.

Sketch: $\tilde\Sigma = \tilde Z^\top \tilde Z / N = (ZR)^\top(ZR)/N = R^\top(Z^\top Z/N)R = R^\top\hat\Sigma R$. If $\hat\Sigma = Q\Lambda Q^\top$, then $\tilde\Sigma = R^\top Q\Lambda Q^\top R = (Q^\top R)^\top \Lambda (Q^\top R)$ — so $\tilde\Sigma$ has the same eigenvalues $\Lambda$ but eigenvectors $Q^\top R$ (a rotation of $Q$). Therefore $\mathrm{srank}(\tilde\Sigma) = \mathrm{srank}(\hat\Sigma)$ and $\mathrm{rank}(\tilde\Sigma) = \mathrm{rank}(\hat\Sigma)$. The rotational ambiguity in SSL ($ZR$ is equally valid as $Z$) does not affect any spectral diagnostic — tracking $\mathrm{srank}$ during training is invariant to the arbitrary rotation of the learned basis.

🧮 Matrix Calculus

Two gradient rules appear repeatedly in SSL loss derivations.

Gradient of trace. For differentiable $f : \mathbb{R}^{d \times d} \to \mathbb{R}$: \[\frac{\partial}{\partial A}\,\mathrm{tr}(AB) = B^\top, \qquad \frac{\partial}{\partial A}\,\mathrm{tr}(A^\top A) = 2A.\]

Gradient of the Frobenius norm. \[\frac{\partial}{\partial A}\,\|A - B\|_F^2 = 2(A - B).\]

Gradient of a quadratic form. For symmetric $\Sigma$: \[\frac{\partial}{\partial x}\,(x^\top \Sigma x) = 2\Sigma x.\]

Jacobian of matrix products. When $A$ depends on parameter $\theta$ via $A(\theta) = f(\theta)g(\theta)^\top$, the chain rule applies componentwise: $\partial \ell / \partial \theta = (\partial \ell / \partial A) \cdot (\partial A / \partial \theta)$, using the Frobenius inner product $\langle \cdot, \cdot \rangle_F$.

Exercise 11: Gradient of the Barlow Twins Loss

Computing the gradient of $\mathcal{L}_\text{BT}$ with respect to the embeddings reveals the update rule that drives $C \to I$.

Prerequisites: Matrix Calculus

For simplicity, take $\lambda = 1$ so $\mathcal{L}_\text{BT} = \|C - I\|_F^2$ where $C = (Z^A)^\top Z^B / N$ and $Z^A, Z^B$ are batch-normalized (treat them as free variables). Compute $\partial \mathcal{L}_\text{BT} / \partial Z^A$ and show it equals $2(C - I)(Z^B)^\top / N$. Interpret: when $C = I$ (target reached), the gradient vanishes — confirming $C = I$ is a fixed point.

Solution to Exercise 11

Key insight: The gradient of the Barlow Twins loss w.r.t. $Z^A$ is proportional to the “error matrix” $(C - I)$ projected back through $Z^B$ — it pushes $Z^A$ to reduce the deviation of each cross-correlation from its target.

Sketch: Write $\mathcal{L}_\text{BT} = \mathrm{tr}((C-I)^\top(C-I))$ and $C = (Z^A)^\top Z^B/N$. Then: \[\frac{\partial\mathcal{L}}{\partial C} = 2(C - I).\] By the chain rule, $\partial \mathcal{L}/\partial Z^A = (\partial \mathcal{L}/\partial C)(\partial C/\partial Z^A)$. Since $C = (Z^A)^\top Z^B/N$, we have $\partial C_{ij}/\partial Z^A_{bk} = Z^B_{bj}\delta_{ik}/N$, giving: \[\frac{\partial\mathcal{L}}{\partial Z^A} = \frac{2}{N}(C-I)(Z^B)^\top \cdot \frac{1}{N} \cdot N = \frac{2}{N} Z^B (C - I)^\top.\] At $C = I$: gradient $= 0$ ✓. When $C_{ij} > \delta_{ij}$ (dimensions over-correlated), the gradient pushes $Z^A$ in the direction that reduces $C_{ij}$ toward $\delta_{ij}$.

📐 Cross-Covariance and Gaussian Conditioning

When two random vectors $z_1, z_2 \in \mathbb{R}^d$ are jointly distributed with means $\mu_1, \mu_2$, the cross-covariance matrix is:

\[\Sigma_{21} = \mathbb{E}[(z_2 - \mu_2)(z_1 - \mu_1)^\top] \in \mathbb{R}^{d \times d}, \qquad \Sigma_{12} = \Sigma_{21}^\top.\]

Unlike the within-variable covariance $\Sigma_{11} = \mathrm{Cov}(z_1) \succeq 0$, the cross-covariance $\Sigma_{21}$ is not symmetric and need not be PSD.

Gaussian conditioning formula. If $(z_1, z_2)$ are jointly Gaussian with zero means and block covariance:

\[\mathrm{Cov}\!\begin{pmatrix}z_1 \\ z_2\end{pmatrix} = \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix},\]

then the conditional distribution is:

\[z_2 \mid z_1 \sim \mathcal{N}\!\bigl(\Sigma_{21}\Sigma_{11}^{-1}z_1,\;\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\bigr).\]

The conditional mean $\mathbb{E}[z_2 \mid z_1] = \Sigma_{21}\Sigma_{11}^{-1}z_1$ is the linear least-squares predictor of $z_2$ given $z_1$. The conditional covariance $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ is the Schur complement of $\Sigma_{11}$ — it measures residual uncertainty in $z_2$ after observing $z_1$.

SSL connection. In the SimSiam EM derivation (see SSL Theory §SimSiam as EM), the E-step optimal predictor is $h^*(z_1) = \mathbb{E}[z_2 \mid z_1]$. For jointly Gaussian zero-mean embeddings this equals $\Sigma_{21}\Sigma_{11}^{-1}z_1$. The M-step then minimizes $\mathrm{tr}(\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})$ — the total conditional variance — with respect to the encoder weights.

Exercise 12: Deriving the Optimal Linear Predictor

The normal equations for least-squares regression give $A^* = \Sigma_{21}\Sigma_{11}^{-1}$ — the same coefficient as Gaussian conditioning.

Prerequisites: Cross-Covariance and Gaussian Conditioning, Matrix Calculus

Let $z_1, z_2 \in \mathbb{R}^d$ have zero means with $\mathbb{E}[z_1 z_1^\top] = \Sigma_{11}$ and $\mathbb{E}[z_2 z_1^\top] = \Sigma_{21}$. Find $A^* = \operatorname{argmin}_{A \in \mathbb{R}^{d \times d}}\, \mathbb{E}[\|Az_1 - z_2\|^2]$ by expanding the objective, computing $\partial/\partial A$, and setting to zero. Show $A^* = \Sigma_{21}\Sigma_{11}^{-1}$. What does $A^*$ reduce to when (a) $z_1 = z_2$ (trivial augmentations) and (b) $\Sigma_{21} = 0$ (views are uncorrelated)?

Solution to Exercise 12

Key insight: The OLS normal equations and the Gaussian conditioning formula give the same linear map — both minimize expected squared prediction error.

Sketch: Expand: $\mathbb{E}[\|Az_1 - z_2\|^2] = \mathrm{tr}(A\Sigma_{11}A^\top - 2A\Sigma_{12} + \Sigma_{22})$. Taking $\partial/\partial A$ (using $\partial\,\mathrm{tr}(ABA^\top)/\partial A = 2AB$ for symmetric $B$ and $\partial\,\mathrm{tr}(AC)/\partial A = C^\top$): derivative $= 2A\Sigma_{11} - 2\Sigma_{21} = 0$, giving $A^* = \Sigma_{21}\Sigma_{11}^{-1}$. (a) $z_1 = z_2$: $\Sigma_{21} = \Sigma_{11}$, so $A^* = I$ — the predictor is the identity, gradient vanishes, nothing is learned (trivial augmentations provide no self-supervised signal). (b) $\Sigma_{21} = 0$: $A^* = 0$ — the predictor outputs zero regardless of input; the M-step gradient pushes the encoder toward collapse since both views carry no shared information.

🎯 Rayleigh Quotient and Variational Characterization

For a symmetric matrix $A \in \mathbb{R}^{d \times d}$, the Rayleigh quotient of a nonzero vector $x$ is:

\[R(A, x) = \frac{x^\top A x}{\|x\|^2}.\]

Variational characterization of eigenvalues (Courant–Fischer). If $A = Q\Lambda Q^\top$ with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$, then:

\[\lambda_1 = \max_{\|x\|=1} x^\top A x, \qquad \lambda_d = \min_{\|x\|=1} x^\top A x.\]

The maximum is attained at $x = q_1$ (leading eigenvector); the minimum at $x = q_d$. More generally, the $k$-th eigenvalue equals the maximum Rayleigh quotient over all unit vectors orthogonal to $q_1, \ldots, q_{k-1}$.

Multi-dimensional generalization. For $W \in \mathbb{R}^{k \times d}$ with $WW^\top = I_k$ (orthonormal rows):

\[\max_{WW^\top = I_k}\; \mathrm{tr}(WAW^\top) = \sum_{i=1}^k \lambda_i,\]

achieved when the rows of $W$ are $q_1^\top, \ldots, q_k^\top$ (top-$k$ eigenvectors). This identity is proved by writing $W = \tilde{Q}R$ with $R \in O(k)$ and reducing to the 1D Rayleigh quotient applied to each column of $\tilde{Q}$.

SSL connection. Tian et al.’s convergence theorem (see SSL Theory §What Do SSL Representations Converge To?) states that the optimal linear encoder solves $\max_{WW^\top = I_d} \mathrm{tr}(WFW^\top)$. By the Courant–Fischer identity above, the solution is $W^* = [q_1, \ldots, q_d]^\top$ — the top-$d$ eigenvectors of $F$. SSL in the linear regime is PCA of the augmentation-averaged cross-covariance.

Exercise 13: Maximizing Projected Variance

The variational characterization proves that SSL in the linear regime finds the principal components of the augmentation-averaged covariance.

Prerequisites: Rayleigh Quotient and Variational Characterization, Eigenvalues and the Spectral Theorem

Let $F \in \mathbb{R}^{n \times n}$ be symmetric PSD with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n \geq 0$ and eigenvectors $q_1, \ldots, q_n$. For $W \in \mathbb{R}^{d \times n}$ with $WW^\top = I_d$, prove that $\mathrm{tr}(WFW^\top) \leq \sum_{i=1}^d \lambda_i$ and that equality is achieved by $W^* = [q_1, \ldots, q_d]^\top$. Conclude: if SSL in the linear regime maximizes $\mathrm{tr}(WFW^\top)$, what representation does it learn?

Solution to Exercise 13

Key insight: The orthonormality constraint forces each row of $W$ to “pick” a direction in $\mathbb{R}^n$; total projected variance is maximized by greedily picking the directions of largest variance — i.e., the top eigenvectors.

Sketch: Let $w_1, \ldots, w_d$ be the rows of $W$ (orthonormal). Then $\mathrm{tr}(WFW^\top) = \sum_{i=1}^d w_i^\top F w_i = \sum_{i=1}^d R(F, w_i)$. Since $w_1, \ldots, w_d$ are orthonormal, by the Courant–Fischer theorem applied iteratively: $\sum_{i=1}^d R(F, w_i) \leq \sum_{i=1}^d \lambda_i$, with equality when $w_i = q_i$ (the $i$-th eigenvector of $F$). Hence $W^* = [q_1, \ldots, q_d]^\top$. Conclusion: SSL in the linear regime learns the projection onto the subspace spanned by the top-$d$ eigenvectors of $F$ — i.e., PCA of $F$. For additive-noise augmentations, $F = \Sigma_x$ (data covariance), so SSL recovers standard PCA. With richer augmentations, $F$ encodes only augmentation-consistent structure, and SSL recovers PCA of that filtered covariance.

🔗 Synthesis Exercises

The following exercises each combine two or more prior sections. They directly underpin the three SSL theory results with no dedicated LA prereq above: the NT-Xent dimensional collapse bound, VICReg’s explicit eigenvalue constraint, and the SimSiam M-step conditional variance interpretation.

Exercise 14: Gram Matrix Rank and Dimensional Collapse

The Gram matrix $ZZ^\top$ and the covariance $Z^\top Z$ share all nonzero eigenvalues — this bounds the rank of any batch of embeddings by the embedding dimension $d$, not the batch size $N$.

Prerequisites: Rank and the Column Space, Singular Value Decomposition

Let $Z \in \mathbb{R}^{N \times d}$ with $N > d$. Let $G = ZZ^\top \in \mathbb{R}^{N \times N}$ be the Gram matrix and $\hat\Sigma = Z^\top Z / N \in \mathbb{R}^{d \times d}$. (a) Show that $G$ and $\hat\Sigma$ have the same nonzero eigenvalues (up to the $1/N$ scaling factor), and conclude $\mathrm{rank}(G) = \mathrm{rank}(Z) \leq d$. (b) The NT-Xent loss depends only on pairwise cosine similarities, which are entries of the normalized Gram matrix. Explain why this means the NT-Xent loss landscape is insensitive to the $(N - d)$ “extra” directions in $\mathbb{R}^N$ — and why this implies dimensional collapse to rank $\leq d$ carries no penalty from the loss.

Solution to Exercise 14

Key insight: The Gram matrix and covariance share nonzero eigenvalues because $ZZ^\top$ and $Z^\top Z$ are related by the same SVD; the embedding dimension $d < N$ caps the rank of $G$, making NT-Xent blind to most of the batch’s representation space.

Sketch: (a) Let $Z = U\Sigma V^\top$ be the SVD. Then $G = ZZ^\top = U\Sigma^2 U^\top$ and $Z^\top Z = V\Sigma^2 V^\top$. Both have the same diagonal $\Sigma^2$, so the same nonzero eigenvalues (those of $\hat\Sigma$ are $\sigma_i^2/N$). The rank is $r = $ number of nonzero singular values $\leq \min(N, d) = d$. (b) NT-Xent for a batch of $N$ embeddings is a function of $G_{ij} = z_i^\top z_j / (\|z_i\|\|z_j\|)$. Since $\mathrm{rank}(G) \leq d$, all $N$ embeddings lie in a $d$-dimensional subspace. If $d \ll N$, there are $N - d$ directions in $\mathbb{R}^N$ along which the Gram matrix is zero — the loss is constant along these directions. Gradient descent therefore has no incentive to use more than $d$ dimensions, even when $d \ll N$. This is the root cause of dimensional collapse in contrastive learning: with $N$ negatives and embedding dimension $d \ll N$, the NT-Xent minimizer can live in a rank-$d$ subspace.

Exercise 15: VICReg Variance Term and Eigenvalue Lower Bound

The variance hinge directly lower-bounds each diagonal of the within-branch covariance; combined with the covariance regularizer, this implies $\hat{C}(Z) \succeq (γ^2 - \varepsilon)I_d$.

Prerequisites: Positive Semidefinite Matrices, Covariance Matrices

VICReg’s variance term is $v(Z) = \frac{1}{d}\sum_{j=1}^d \max(0,\, \gamma - \sqrt{\mathrm{Var}_b(z^j) + \varepsilon})$. (a) Show that $v(Z) = 0$ implies $\hat{C}(Z)_{jj} \geq \gamma^2 - \varepsilon$ for every dimension $j$. (b) Now suppose additionally that the covariance regularizer $c(Z) = 0$ (all off-diagonal entries of $\hat{C}(Z)$ are zero, i.e. $\hat{C}(Z)$ is diagonal). Conclude that $\hat{C}(Z) \succeq (\gamma^2 - \varepsilon)I_d$ — the minimum eigenvalue is lower-bounded by $\gamma^2 - \varepsilon > 0$. Why does this mean VICReg explicitly prevents collapse via an eigenvalue constraint?

Solution to Exercise 15

Key insight: The variance hinge is an eigenvalue lower bound in disguise — it directly controls the diagonal of $\hat{C}(Z)$, and the covariance regularizer then ensures the diagonal entries are the eigenvalues.

Sketch: (a) $v(Z) = 0$ means every hinge is inactive: $\sqrt{\mathrm{Var}_b(z^j) + \varepsilon} \geq \gamma$ for all $j$, so $\mathrm{Var}_b(z^j) \geq \gamma^2 - \varepsilon$. Since $\hat{C}(Z)_{jj} = \mathrm{Var}_b(z^j)$, we have $\hat{C}(Z)_{jj} \geq \gamma^2 - \varepsilon$ for each $j$. (b) If $\hat{C}(Z)$ is diagonal, its eigenvalues are exactly its diagonal entries. All of these are $\geq \gamma^2 - \varepsilon > 0$ (for $\varepsilon < \gamma^2$), so $\hat{C}(Z) - (\gamma^2 - \varepsilon)I_d \succeq 0$, i.e. $\hat{C}(Z) \succeq (\gamma^2 - \varepsilon)I_d$. This directly implies $\hat{C}(Z) \succ 0$ — no collapse. Unlike BYOL or SimSiam (which prevent collapse implicitly via architecture), VICReg makes the eigenvalue constraint a hard condition in the loss: $v(Z) = 0$ is a necessary condition for any optimizer to declare convergence.

Exercise 16: Schur Complement, Conditional Variance, and the SimSiam M-Step

The Schur complement of a block covariance matrix is the conditional covariance of $z_2$ given $z_1$ — and minimizing it is the M-step objective in SimSiam.

Prerequisites: Cross-Covariance and Gaussian Conditioning, Positive Semidefinite Matrices

Let $(z_1, z_2)$ be jointly Gaussian with zero means and block covariance as in §Cross-Covariance. The Schur complement of $\Sigma_{11}$ in the joint covariance is $S = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$. (a) Show $S \succeq 0$ by writing $S$ as a conditional variance. (b) Show $\mathrm{tr}(S) = \mathrm{tr}(\Sigma_{22}) - \mathrm{tr}(\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})$ and interpret this as total variance minus explained variance. (c) The SimSiam M-step updates the encoder to minimize $\mathrm{tr}(S)$ (total conditional variance). Using the expression from (b), explain why this is equivalent to maximizing the explained variance of $z_2$ by $z_1$.

Solution to Exercise 16

Key insight: The Schur complement is a conditional covariance and is always PSD; minimizing its trace is the same as maximizing $R^2$ — the fraction of variance in $z_2$ explained by $z_1$.

Sketch: (a) For any $x \in \mathbb{R}^d$: $x^\top S x = x^\top \Sigma_{22} x - x^\top \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} x = \mathrm{Var}(x^\top z_2) - \mathrm{Var}(\mathbb{E}[x^\top z_2 \mid z_1]) = \mathrm{Var}(x^\top z_2 \mid z_1) \geq 0$ (by the law of total variance). So $S \succeq 0$ — it equals the conditional covariance of $z_2$ given $z_1$. (b) $\mathrm{tr}(S) = \sum_j \mathrm{Var}(z_{2,j} \mid z_1)$ (sum of conditional variances per dimension). Since $\mathrm{tr}(\Sigma_{22}) = \sum_j \mathrm{Var}(z_{2,j})$ is the total variance: $\mathrm{tr}(S) = \mathrm{tr}(\Sigma_{22}) - \underbrace{\mathrm{tr}(\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})}_{\text{explained variance}}$. (c) Since $\mathrm{tr}(\Sigma_{22})$ is fixed (depends only on the data, not the encoder), minimizing $\mathrm{tr}(S)$ is equivalent to maximizing $\mathrm{tr}(\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})$. This is the encoder objective: make the views $z_1, z_2$ as mutually predictable as possible — i.e., maximize the total explained variance (multivariate $R^2$). The stop-gradient ensures $\Sigma_{21}$ and $\Sigma_{11}$ are treated as fixed during the M-step, exactly as in EM.

📚 References

Reference Name	Brief Summary	Link
Strang, Introduction to Linear Algebra (5th ed.)	Standard reference for rank, eigenvalues, SVD, and PCA with geometric intuition	MIT OpenCourseWare
Horn & Johnson, Matrix Analysis	Rigorous treatment of spectral theory, PSD matrices, and matrix norms	Cambridge University Press
Petersen & Pedersen, The Matrix Cookbook	Dense reference for matrix calculus identities (trace gradients, Frobenius norm derivatives)	matrixcookbook.com
SSL Theory	Uses all concepts above: stable rank, PSD covariance, SVD/PCA convergence, matrix calculus for loss gradients	local

Quantity	Formula	Properties
Trace	\(\mathrm{tr}(A) = \sum_i A_{ii}\)	\(\mathrm{tr}(AB) = \mathrm{tr}(BA)\); \(\mathrm{tr}(A) = \sum_i \lambda_i\)
Frobenius norm	\(\\|A\\|_F = \sqrt{\sum_{i,j}A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}\)	\(\\|A\\|_F^2 = \sum_i \sigma_i^2\)
Spectral norm	\(\\|A\\|_2 = \sigma_1\) (largest singular value of \(A\))	\(\\|A\\|_2 \leq \\|A\\|_F \leq \sqrt{\mathrm{rank}(A)}\,\\|A\\|_2\)
Stable rank	\(\mathrm{srank}(\Sigma) = (\mathrm{tr}\,\Sigma)^2 / \\|\Sigma\\|_F^2 = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\)	\(\in [1, d]\) for \(\Sigma \succeq 0\)