An Interactive Introduction to Model-Agnostic Meta-Learning

Exploring the world of model-agnostic meta-learning and its variants.

This page is part of a multi-part series on Model-Agnostic Meta-Learning. If you are already familiar with the topic, use the menu on the right side to jump straight to the part that is of interest to you. Otherwise, we suggest you start at the beginning.

First-Order MAML (FOMAML)

FOMAML was suggested by Finn et al. in the very paper that introduces MAML. It is a straightforward heuristic to get rid of the second-order terms (which we introduced on the last page): setting them to zero! As a result, \[\nabla_\theta U_{\tau_i}(\theta) = I\] and the overall meta-loss gradient reduces to \[ \nabla_\theta \mathcal{L}(\theta) = \sum_{i} \nabla_{U_{\tau_i}(\theta)} \mathcal{L}_{\tau_i, \text{test}} .\] Simple, right? Maybe a bit too simple.

Let us take a detailed look at the term we are discarding, namely \[ \nabla^2_\theta \mathcal{L}_{\tau_i, \text{train}}(\theta). \] This term is the Hessian of the loss function \(\mathcal{L}_{\tau_i, \text{train}}\), which describes the local curvature of the loss at \(\theta\). Setting it to zero yields a first-order approximation of the meta-gradient, which is the more accurate the closer the loss is to linear around the meta-parameter \(\theta\).

To see what the Hessian actually entails, let us expand it under the assumption of an MSE loss, a neural net \(M\) and a dataset \(\mathcal{D} := (x, y)\). We omit some of the subscripts to make the formulae more readable and write \[ \nabla^2 \mathcal{L}(\theta) = \nabla^2 \frac{1}{2} (y - M(x; \theta))^T (y - M(x; \theta)) \] \[ = \nabla M(x; \theta)\nabla M(x; \theta)^T - (y - M(x; \theta))^T \nabla^2 M(x; \theta). \] The first term contains only first derivatives of \(M\), so the only second-order term in the Hessian of the loss function is the Hessian of the neural net \(M\).

While there is empirical evidence that the local curvature of neural nets is near zero after training (and near-zero local curvature would easily justify dropping the Hessian in the MAML meta-update altogether), the same study also indicates that this is not necessarily the case for randomly initialized weights. On the other hand, the authors of MAML hypothesize that the nearly linear nature of neural nets, which is often there by design (especially in nets with ReLU layers, such as the ones they use), might explain the success of FOMAML, since nearly linear functions have nearly zero Hessians.
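To make the difference between the two meta-gradients concrete, here is a minimal sketch in plain Python. It rests on assumptions that are not taken from this page: a single inner gradient step \(U(\theta) = \theta - \alpha \nabla \mathcal{L}_{\text{train}}(\theta)\) with a hypothetical step size `ALPHA`, and made-up quadratic task losses whose Hessian is a constant symmetric matrix. Under these assumptions the Jacobian of the inner update is \(I - \alpha \nabla^2 \mathcal{L}_{\text{train}}(\theta)\), and FOMAML replaces it with \(I\).

```python
# Toy comparison of the full MAML meta-gradient and its FOMAML approximation.
# Assumptions (illustrative only): one inner gradient step with step size ALPHA,
# and quadratic task losses L(theta) = 0.5 * (theta - c)^T A (theta - c),
# whose Hessian is the constant symmetric matrix A.

ALPHA = 0.1  # hypothetical inner-loop step size


def matvec(A, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def loss(A, c, theta):
    d = [t - ci for t, ci in zip(theta, c)]
    return 0.5 * sum(di * hi for di, hi in zip(d, matvec(A, d)))


def grad(A, c, theta):
    # Gradient of the quadratic loss; valid because A is symmetric.
    d = [t - ci for t, ci in zip(theta, c)]
    return matvec(A, d)


def inner_update(A_train, c_train, theta):
    """One gradient step on the training loss: U(theta)."""
    g = grad(A_train, c_train, theta)
    return [t - ALPHA * gi for t, gi in zip(theta, g)]


def fomaml_grad(task, theta):
    """Test-loss gradient at the adapted parameters, Jacobian of U set to I."""
    A_tr, c_tr, A_te, c_te = task
    return grad(A_te, c_te, inner_update(A_tr, c_tr, theta))


def maml_grad(task, theta):
    """Chain rule with the true Jacobian (I - ALPHA * A_train) of U."""
    A_tr = task[0]
    g = fomaml_grad(task, theta)
    Hg = matvec(A_tr, g)  # Hessian-vector product, the term FOMAML drops
    return [gi - ALPHA * hgi for gi, hgi in zip(g, Hg)]


def numeric_meta_grad(task, theta, eps=1e-6):
    """Central finite differences of theta -> L_test(U(theta)) as a check."""
    A_tr, c_tr, A_te, c_te = task
    out = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        f_up = loss(A_te, c_te, inner_update(A_tr, c_tr, up))
        f_down = loss(A_te, c_te, inner_update(A_tr, c_tr, down))
        out.append((f_up - f_down) / (2 * eps))
    return out


# A made-up task: (A_train, c_train, A_test, c_test).
TASK = ([[2.0, 0.5], [0.5, 1.0]], [1.0, -1.0],
        [[1.5, 0.0], [0.0, 0.5]], [0.8, -1.2])
THETA = [0.0, 0.0]

print("MAML    :", maml_grad(TASK, THETA))
print("FOMAML  :", fomaml_grad(TASK, THETA))
print("numeric :", numeric_meta_grad(TASK, THETA))
```

In this sketch the two gradients differ exactly by `ALPHA` times a Hessian-vector product, so they coincide whenever the training Hessian is zero and drift apart as curvature or step size grows.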

And successful it is! If you compare, e.g., Table 1 in the MAML paper, you will find that FOMAML easily keeps up with its second-order counterpart in terms of classification performance. So depending on your personal taste in theoretical rigor, this explanation might be more or less satisfactory. If you are nonetheless interested in how local curvature affects a function space, take a look at the following figure. Here we prepared a very simple function space, namely the space of \[ f(x) := \frac{1}{2} (x - \frac{1}{2})^T C (x - \frac{1}{2}) + g^T x, \] with Hessian \(C \in \mathbb{R}^{2 \times 2}\), constant \(g \in \mathbb{R}^2\) and gradient \(C(x - \frac{1}{2}) + g\), where we assume that \(C\) is a symmetric matrix, i.e. that it has the form \[ C = \begin{bmatrix} a & b \\ b & c \end{bmatrix}. \] Changing \(a, b, c \) lets you observe the effect of curvature on the form of the function space. As you should be able to verify, non-zero values for the Hessian curve the space and the more curvature we introduce, the poorer the first-order approximation \(\nabla f_{C=0}(x) \) to the gradient becomes.
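For readers without access to the interactive figure, the same effect can be reproduced numerically. The sketch below evaluates the gradient \(C(x - \frac{1}{2}) + g\) and compares it with the first-order approximation \(\nabla f_{C=0}(x) = g\). It assumes that \(\frac{1}{2}\) stands for the vector with both entries equal to \(\frac{1}{2}\); the values chosen for \(g\), \(x\) and the entries of \(C\) are arbitrary illustrations.

```python
import math

# f(x) = 0.5 * (x - h)^T C (x - h) + g^T x, with h = (0.5, 0.5) assumed.


def make_C(a, b, c):
    """Symmetric 2x2 Hessian [[a, b], [b, c]]."""
    return [[a, b], [b, c]]


def gradient(C, g, x):
    """Exact gradient C (x - h) + g."""
    d = [x[0] - 0.5, x[1] - 0.5]
    return [C[0][0] * d[0] + C[0][1] * d[1] + g[0],
            C[1][0] * d[0] + C[1][1] * d[1] + g[1]]


def first_order_error(C, g, x):
    """Euclidean distance between the exact gradient and the C = 0 gradient g."""
    exact = gradient(C, g, x)
    return math.hypot(exact[0] - g[0], exact[1] - g[1])


G = [1.0, -0.5]  # arbitrary constant g
X = [1.0, 0.0]   # arbitrary evaluation point

# Scale a fixed curvature pattern up and watch the approximation degrade.
for scale in (0.0, 0.5, 1.0, 2.0):
    C = make_C(1.0 * scale, 0.3 * scale, 0.5 * scale)
    print(f"scale={scale}: error={first_order_error(C, G, X):.4f}")
```

The error is zero when \(C = 0\) and grows linearly with the scale of \(C\) at a fixed point \(x\), which is the behavior the sliders for \(a, b, c\) are meant to show.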

Hopefully, you have gained some understanding of how FOMAML works and what effect second-order terms (encoding local curvature) can have on the loss space, as well as arguments for and against linear approximations of the meta-gradient.

The fact that FOMAML can compete so easily with MAML suggests that the information necessary to learn across tasks is contained, for the most part, not in any Hessian, but within the linear parts of the meta-gradient. Following up on this narrative, we will next study Reptile, another prominent first-order method, with a slightly different approach.