Gradient approximations in derivative-free optimization
Most of modern ML is heavily reliant on gradient descent and its variants. We have some loss function we want to minimize; if the loss function we actually care about is not differentiable, we modify it until we have something that is. Then we minimize it; if the loss function is non-convex, we don't worry too much about it. Typically, we do something like stochastic gradient descent (SGD) on a loss function that corresponds to empirical risk minimization (ERM). At each optimization step, we have a noisy approximation of the loss function and its gradient, computed from a random sample of our dataset, and we hope that this noise helps us overcome non-convexity issues. Actually, we typically don't do vanilla SGD; we use techniques like momentum and Adam that (crudely) adapt to the curvature of the loss function, and they do so by reusing our gradient estimates from previous optimization steps. On the rarer occasions when we don't have lots of samples, we skip the “stochastic” part of SGD, and we often use approximate Newton methods like L-BFGS, which account for curvature (the Hessian), again using gradients from previous steps.
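To make this concrete, here's a minimal numpy sketch of that loop: mini-batch gradient estimates on a toy ERM problem (least-squares regression), plus a momentum buffer that folds in gradient estimates from previous steps. The problem, the hyperparameters, and the variable names are all just illustrative, not anything canonical.

```python
import numpy as np

# Sketch of SGD with momentum on a toy empirical-risk-minimization problem
# (least-squares regression, purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                    # dataset inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                    # parameters
velocity = np.zeros(5)             # momentum buffer: running history of gradient estimates
lr, beta, batch_size = 0.05, 0.9, 32

for step in range(500):
    # Noisy estimate of the loss gradient from a random mini-batch.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size

    # Momentum: blend the new gradient estimate with the running history.
    velocity = beta * velocity + grad
    w -= lr * velocity

print("estimated weights:", np.round(w, 2))
```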
But there's a subfield of ML that basically defines itself by the fact that it doesn't use gradients: black-box optimization. Why not use gradients? First, maybe the inputs to your loss function are discrete objects that can't readily be represented as real-valued vectors. Second, maybe you don't have a training dataset to perform ERM over; after each optimization step, you actually run an experiment and acquire data. Unlike in ERM, your data isn't i.i.d., so you have to worry about collecting inputs that reasonably cover the possible space of inputs. This is the active learning / reinforcement learning scenario. Third, maybe you're actually worried that your loss function is so terribly non-convex that even the “stochastic” in SGD will leave you stuck in a sub-optimal local minimum. These examples basically cover the reasonable reasons to do black-box optimization: either (first case) you can't have a gradient, or (second and third cases) you need more than the gradient (and its history). So you need a model of what your loss function might look like over the whole input space.
But you might have unreasonable reasons. Maybe your loss function is not differentiable, and you don't want to bother modifying it to make it so. Or maybe your loss function is differentiable, but you don't want to do the algebra to figure out the gradient. This was actually a common and reasonable excuse before the invention of auto-differentiation, and it's a non-trivial part of why researchers started studying derivative-free optimization in the first place. Now, though, there is autodiff. (But maybe you're just really unlucky and you have something that PyTorch's autodiff can't differentiate, or is just really slow at computing. This is possible, but unlikely.) Note that if your reason is unreasonable, you probably could have a gradient if you tried harder, and it would be all you need ™. Also note that we approximate the non-smooth 0-1 loss not only because it is non-smooth, but also because it's not a very good classification loss; there's a statistical reason we use cross-entropy loss rather than 0-1 loss, even for early stopping in deep learning. So research into derivative-free optimization in such cases is also unreasonable: a relic of the bad old days before autodiff.
Or is it? In a recent paper, A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization (2022), a bunch of very reasonable people (Berahas, Cao, Choromanski, & Scheinberg) studied this very unreasonable problem. Their analysis is groundbreaking, and IMO it will likely have huge implications for the aforementioned reasonable problems.
To explain why, I first need to justify my claim that derivative-free optimization on unreasonable problems is actually unreasonable. Let's say you're minimizing a smooth function $f : \mathbb{R}^n \to \mathbb{R}$ with (approximate) gradient descent,

$$x_{k+1} = x_k - \alpha \, \hat{\nabla} f(x_k),$$

where the hat-symbol denotes our approximation of the gradient. This is not nice for derivative-free optimization. Typically, with derivative-free optimization, you would use some variant of finite-difference descent (FDD), estimating each coordinate of the gradient as

$$[\hat{\nabla} f(x)]_i = \frac{f(x + h e_i) - f(x)}{h}.$$

Finite differences have error roughly

$$\|\hat{\nabla} f(x) - \nabla f(x)\| \le \frac{\sqrt{n} \, L h}{2},$$

where $L$ is the Lipschitz constant of the gradient; and if the function evaluations themselves are noisy at level $\epsilon_f$, you pick up an extra term like $2\sqrt{n}\,\epsilon_f / h$, so you can't just shrink $h$ to zero.
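As a sanity check, here's a small numpy sketch of that forward-difference estimator on a toy quadratic (everything here is illustrative, not code from the paper). Note the $n + 1$ function evaluations per gradient estimate, and the error shrinking roughly linearly in $h$.

```python
import numpy as np

def fd_gradient(f, x, h=1e-5):
    """Forward finite-difference gradient estimate.

    Costs n + 1 function evaluations: f(x), plus one perturbed
    evaluation per coordinate.
    """
    n = len(x)
    fx = f(x)
    g = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

# Illustration on a simple quadratic, where the true gradient is known.
A = np.diag(np.arange(1.0, 6.0))
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.ones(5)
for h in [1e-1, 1e-3, 1e-5]:
    err = np.linalg.norm(fd_gradient(f, x, h) - grad_f(x))
    print(f"h={h:.0e}  error={err:.2e}")   # error shrinks roughly like O(h)
```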
So what does this paper accomplish? Recall that FDD has a cost per iteration of $n + 1$ function evaluations: the base point plus one perturbation per coordinate, which hurts when $n$ is large or evaluations are expensive. Roughly speaking, and as I read it, the paper derives error bounds for a broader family of gradient approximations, including ones built from function evaluations you already have (such as linear interpolation over a set of previously evaluated points, and randomized smoothing estimates), and characterizes when those approximations are accurate enough to guarantee descent.
The proposed approach has some nice benefits for an unreasonable problem, but I think the main long-term impact of this paper will be on reasonable problems. In particular, I expect the proposed method to have major implications for reasonable problems where you do have the gradient, but need more than the gradient (and its history), because you care about the global input space. In these settings, black-box optimization methods basically don't use gradient information at all, because they need more than the gradient. This is incredibly suboptimal! But black-box optimization methods do retain a history of function evaluations, and the proposed method allows you to estimate an approximate gradient from those evaluations. In the future, I expect black-box optimizers to use this approximate gradient to obtain major improvements in convergence.
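To give a flavor of what that could look like, here's a small numpy sketch of one such estimator: fit a linear model, by least squares, to function values you've already collected near the current point, and use its slope as the approximate gradient. This is only an illustration of the general idea of recycling an evaluation history, not the specific estimators analyzed in the paper; the helper name and the toy function are made up.

```python
import numpy as np

def gradient_from_history(points, values, x0, f0):
    """Least-squares linear fit to previously evaluated points.

    Solves  min_g  sum_i ( g . (x_i - x0) - (f(x_i) - f0) )^2,
    i.e. treats the evaluation history around x0 as an implicit
    finite-difference stencil. A sketch of the general idea, not the
    exact estimator from the paper.
    """
    D = points - x0            # displacements from the current iterate
    r = values - f0            # observed function changes
    g, *_ = np.linalg.lstsq(D, r, rcond=None)
    return g

# Toy check: a history of random evaluations of a known smooth function.
rng = np.random.default_rng(1)
f = lambda x: np.sin(x[0]) + x[1] ** 2 + 0.5 * x[2]
x0 = np.array([0.3, -1.0, 2.0])
history_x = x0 + 0.05 * rng.normal(size=(20, 3))   # nearby past evaluations
history_f = np.array([f(x) for x in history_x])

g_hat = gradient_from_history(history_x, history_f, x0, f(x0))
g_true = np.array([np.cos(x0[0]), 2 * x0[1], 0.5])
print("approx:", np.round(g_hat, 3), " true:", np.round(g_true, 3))
```

With a reasonably spread-out set of nearby points, the fitted slope tracks the true gradient up to a curvature-sized error, which is the kind of approximation error the paper is in the business of quantifying.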
Link to paper: https://link.springer.com/article/10.1007/s10208-021-09513-z