This page is part of a multi-part series on Model-Agnostic Meta-Learning. If you are already familiar with the topic, use the menu on the right side to jump straight to the part that is of interest to you. Otherwise, we suggest you start at the beginning.
First-order MAML (FOMAML) was suggested by Finn et al. (the authors of MAML, in the very paper that introduces MAML) and is a straightforward heuristic to get rid of the second-order terms (which we introduced on the last page): setting them to zero! As a result,
\[\nabla_\theta U_{\tau_i}(\theta) = I\]
and the overall meta loss gradient reduces to
\[ \nabla_\theta \mathcal{L}(\theta) = \sum_{i} \nabla_{U_{\tau_i}(\theta)} \mathcal{L}_{\tau_i, \text{test}}. \]
Simple, right? Maybe a bit too simple. Let us take a closer look at the term we are discarding, namely
\[ \nabla^2_\theta \mathcal{L}_{\tau_i, \text{train}}(\theta). \]
This term is known as the Hessian of the loss function \(\mathcal{L}_{\tau_i, \text{train}}\), which describes its local curvature as a function of \(\theta\).
And successful it is! If you compare, e.g., Table 1 in the MAML paper, you will find that FOMAML easily keeps up with its second-order counterpart in terms of classification performance. So depending on your personal taste in theoretical rigor, this explanation might be more or less satisfactory.
If you are nonetheless interested in how local curvature affects a function space, take a look at the following figure. Here we prepared a very simple function space, namely the space of
\[ f(x) := \frac{1}{2} (x - \frac{1}{2})^T C (x - \frac{1}{2}) + g^T x, \]
with Hessian \(C \in \mathbb{R}^{2 \times 2}\), constant \(g \in \mathbb{R}^2\) and gradient \(C(x - \frac{1}{2}) + g\), where we assume that \(C\) is a symmetric matrix, i.e. that it has the form
\[ C = \begin{bmatrix} a & b \\ b & c \end{bmatrix}. \]
Changing \(a, b, c\) lets you observe the effect of curvature on the form of the function space. As you should be able to verify, non-zero values for the Hessian curve the space and the more curvature we introduce, the poorer the first-order approximation \(\nabla f_{C=0}(x)\) to the gradient becomes.
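The same observation can be checked numerically. The following sketch, in plain Python, evaluates the quadratic \(f\) from above and measures how far the first-order approximation \(\nabla f_{C=0}(x) = g\) is from the true gradient \(C(x - \frac{1}{2}) + g\) as the entries of \(C\) are scaled up. The evaluation point, the vector \(g\), the base values of \(a, b, c\) and all function names are arbitrary choices made for illustration.

```python
import math


def symmetric(a, b, c):
    """Build the symmetric 2x2 matrix [[a, b], [b, c]] as nested tuples."""
    return ((a, b), (b, c))


def f(x, C, g):
    """f(x) = 0.5 * (x - 1/2)^T C (x - 1/2) + g^T x for x in R^2."""
    d = (x[0] - 0.5, x[1] - 0.5)
    quad = (C[0][0] * d[0] * d[0] + 2.0 * C[0][1] * d[0] * d[1]
            + C[1][1] * d[1] * d[1])
    return 0.5 * quad + g[0] * x[0] + g[1] * x[1]


def grad_f(x, C, g):
    """Analytic gradient C (x - 1/2) + g."""
    d = (x[0] - 0.5, x[1] - 0.5)
    return (C[0][0] * d[0] + C[0][1] * d[1] + g[0],
            C[1][0] * d[0] + C[1][1] * d[1] + g[1])


def first_order_error(x, C, g):
    """Euclidean distance between the true gradient and the C = 0 gradient g."""
    gx = grad_f(x, C, g)
    return math.hypot(gx[0] - g[0], gx[1] - g[1])


# Illustrative values (assumptions, not taken from the figure).
x = (1.0, 0.0)
g = (0.3, -0.2)
scales = (0.0, 0.5, 1.0, 2.0)

# Scale a fixed symmetric matrix to introduce more and more curvature.
errors = [first_order_error(x, symmetric(1.0 * s, 0.5 * s, 2.0 * s), g)
          for s in scales]

for s, err in zip(scales, errors):
    print(f"curvature scale {s}: first-order gradient error {err:.4f}")
```

Because the error equals \(\lVert C(x - \frac{1}{2}) \rVert\), it vanishes for \(C = 0\) and grows linearly with the scale of the Hessian at any fixed \(x\) where \(C(x - \frac{1}{2}) \neq 0\).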
Hopefully, you have gained some understanding of how FOMAML works and what effect second-order terms (encoding local curvature) can have on the loss space, as well as arguments for and against linear approximations of the meta-gradient.
FOMAML, and the fact that it can compete so easily with MAML, teach us that the information necessary to learn across tasks is contained, for the most part, not in any Hessian, but within the linear parts of the meta-gradient. Following up on this narrative, we will next study Reptile, another prominent first-order method, with a slightly different approach.