# Nonsmooth Optimization (NSO)

"Classification of mathematical problems as linear and nonlinear is like classification of the Universe as bananas and non-bananas."
- Anonymous

## Introduction to Nonsmooth Optimization

Nonsmooth optimization (NSO) refers to the general problem of minimizing (or maximizing) functions that are typically not differentiable at their minimizers (maximizers). Since the classical theory of optimization presumes certain differentiability and strong regularity assumptions on the functions to be optimized, it cannot be directly utilized. However, due to the complexity of the real world, functions involved in practical applications are often nonsmooth; that is, they are not necessarily differentiable. In what follows, we briefly introduce the basic concepts of nonsmooth analysis and optimization. For more details we refer to [1,2,3,4,5,6] and the references therein.

Let us consider the NSO problem of the form

$$\min_{x} \ f(x) \quad \text{subject to} \quad x \in S,$$

where the objective function $f : \mathbb{R}^n \to \mathbb{R}$ is supposed to be locally Lipschitz continuous on the feasible set $S \subseteq \mathbb{R}^n$. Note that no differentiability or convexity assumptions are made.

The simplest example of a nonsmooth function is the absolute-value function $f(x) = |x|$ on the reals.

#### Example — Absolute-value function

The gradient of the function $f(x) = |x|$ is

$$\nabla f(x) = \begin{cases} 1, & \text{if } x > 0, \\ -1, & \text{if } x < 0. \end{cases}$$

The function is not differentiable at $x = 0$, since the one-sided derivatives $-1$ and $1$ do not coincide there.

NSO problems arise in many fields of application, for example in

• image denoising,
• optimal control,
• neural network training,
• data mining,
• economics and
• computational chemistry and physics.

Moreover, using certain important methodologies for solving difficult smooth (continuously differentiable) problems leads directly to the need to solve nonsmooth problems, which are either smaller in dimension or simpler in structure. This is the case, for instance, in

• decompositions,
• dual formulations and
• exact penalty functions.

Finally, there exist so-called stiff problems that are analytically smooth but numerically nonsmooth. This means that the gradient varies too rapidly and, thus, these problems behave like nonsmooth problems.

There are several approaches to solve NSO problems. The direct application of smooth gradient-based methods to nonsmooth problems is a simple approach, but it may lead to a failure in convergence, in the optimality conditions, or in the gradient approximation. All these difficulties arise from the fact that the objective function fails to have a derivative for some values of the variables. The following table contrasts the smooth and nonsmooth cases; a small numerical sketch after the table illustrates the convergence failure.

#### Difficulties caused by nonsmoothness

| Smooth problem | Nonsmooth problem |
| --- | --- |
| A descent direction is obtained as the opposite direction of the gradient. | The gradient does not exist at every point, leading to difficulties in defining a descent direction. |
| The necessary optimality condition is $\nabla f(x^*) = 0$. | The gradient usually does not exist at the optimal point. |
| A difference approximation can be used to approximate the gradient. | A difference approximation is not useful and may lead to serious failures. |
| | The (smooth) algorithm does not converge, or it converges to a non-optimal point. |
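As a concrete illustration of the first and last rows of the table, here is a minimal sketch (the test function $f(x) = |x|$, the starting point, and the fixed step size are choices made only for this illustration) showing how a plain fixed-step gradient iteration fails to converge on a nonsmooth function:

```python
# Minimal sketch: fixed-step "steepest descent" on the nonsmooth f(x) = |x|.
# Away from 0 the derivative is +1 or -1, so with a fixed step the iterates
# eventually jump back and forth across the minimizer x* = 0 and never converge.

def f(x):
    return abs(x)

def grad_f(x):
    # classical gradient, defined only for x != 0
    return 1.0 if x > 0 else -1.0

x, step = 0.8, 0.3            # arbitrary starting point and step size
for k in range(8):
    x = x - step * grad_f(x)
    print(f"iter {k}: x = {x: .2f}, f(x) = {f(x):.2f}")
# the iterates settle into oscillation between 0.2 and -0.1,
# so f(x) stalls at a positive value instead of reaching f(0) = 0
```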

On the other hand, using some derivative-free method may be another approach, but standard derivative-free methods such as genetic algorithms or Powell's method may be unreliable and become inefficient as the dimension of the problem increases. Moreover, the convergence of such methods has been proved only for smooth functions. In addition, different kinds of smoothing and regularization techniques may give satisfactory results in some cases but are not, in general, as efficient as the direct nonsmooth approach. Thus, special tools for solving NSO problems are needed.

Methods for solving NSO problems include subgradient methods, bundle methods, and gradient sampling methods. All of them are based on the assumption that only the objective function value and one arbitrary subgradient (generalized gradient) are available at each point.

The basic idea behind the subgradient methods is to generalize smooth methods by replacing the gradient with an arbitrary subgradient. Due to this simple structure, they are widely used NSO methods, although they may suffer from some serious drawbacks (this is true especially for the simplest versions of subgradient methods). An extensive overview of various subgradient methods can be found in [6].
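A rough sketch of this idea in code (the test function $f(x) = |x|$, its hand-coded subgradient, and the diminishing step rule $1/k$ are simplifying choices made here for illustration; actual implementations are considerably more involved):

```python
# Minimal subgradient-method sketch for the convex nonsmooth function f(x) = |x|.
# The gradient of a smooth method is replaced by an arbitrary subgradient,
# and a diminishing step size is used.

def f(x):
    return abs(x)

def subgradient(x):
    # one element of the subdifferential: {-1} for x < 0, {1} for x > 0, [-1, 1] at 0
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0                         # 0 is a valid subgradient at x = 0

x = 5.0
best = f(x)
for k in range(1, 200):
    step = 1.0 / k                     # diminishing step size
    x = x - step * subgradient(x)
    best = min(best, f(x))             # subgradient steps are not descent steps,
                                       # so the best value found so far is tracked
print(best)                            # close to the minimal value f(0) = 0
```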

At the moment, bundle methods are regarded as the most effective and reliable methods for NSO. They are based on the subdifferential theory developed by Rockafellar [5] and Clarke [2], where the classical differential theory is generalized for convex and locally Lipschitz continuous functions, respectively. The basic idea of bundle methods is to approximate the subdifferential (that is, the set of subgradients) of the objective function by gathering subgradients from previous iterations into a bundle. In this way, more information about the local behavior of the function is obtained than an individual arbitrary subgradient can yield (cf. subgradient methods).
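The bundling idea can be sketched in a few lines: each stored subgradient defines a linearization of a convex $f$, and their pointwise maximum is the piecewise-linear (cutting-plane) model that a bundle method works with. The test function $f(x) = |x|$ and the chosen bundle points are illustrative assumptions, and the sketch omits the proximal term and the serious/null step logic of an actual bundle method:

```python
# Sketch of the cutting-plane model used inside bundle methods (convex case).
# A bundle element (y_i, f(y_i), xi_i), with xi_i a subgradient at y_i, gives the
# linearization f(y_i) + xi_i * (x - y_i); the pointwise maximum of the bundled
# linearizations is a piecewise-linear lower approximation of f.

def f(x):
    return abs(x)

def subgradient(x):
    return 1.0 if x >= 0 else -1.0     # one arbitrary subgradient

bundle = [(y, f(y), subgradient(y)) for y in (-2.0, 0.5, 1.5)]

def model(x):
    # pointwise maximum of the bundled linearizations
    return max(fy + xi * (x - y) for (y, fy, xi) in bundle)

for x in (-1.0, 0.0, 0.25, 1.0):
    print(x, model(x), f(x))           # model(x) <= f(x) for every x (lower model)
```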

The newest approach is to use gradient sampling algorithms developed by Burke, Lewis and Overton. The gradient sampling method minimizes an objective function that is locally Lipschitz continuous and smooth on an open dense subset of $\mathbb{R}^n$; the objective may be nonsmooth and/or nonconvex. Gradient sampling methods may be considered stabilized steepest descent algorithms. The central idea behind these techniques is to approximate the subdifferential of the objective function through random sampling of gradients near the current iteration point. The ongoing progress in the development of gradient sampling algorithms suggests that they have the potential to rival bundle methods both in theory and in practical performance.
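A single step of this scheme can be sketched as follows; the test function, the sampling radius, the sample count, and the use of a general-purpose solver for the small quadratic program are all illustrative choices made here:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of one gradient sampling step: sample gradients near the current point
# and take the minimum-norm element of their convex hull as an approximation of
# the direction of steepest descent.

def grad(x):
    # gradient of f(x1, x2) = |x1| + x2**2, defined wherever x1 != 0
    return np.array([np.sign(x[0]), 2.0 * x[1]])

def sample_gradients(x, radius=0.1, m=20, rng=np.random.default_rng(0)):
    # sample points in a small box of half-width `radius` around x
    points = x + radius * rng.uniform(-1.0, 1.0, size=(m, 2))
    return np.array([grad(p) for p in points])

def min_norm_element(G):
    # minimize ||G^T w||^2 subject to w >= 0 and sum(w) = 1 (a small convex QP)
    m = G.shape[0]
    w0 = np.full(m, 1.0 / m)
    res = minimize(lambda w: np.dot(w @ G, w @ G), w0,
                   bounds=[(0.0, 1.0)] * m,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x @ G

x = np.array([1e-3, 0.5])              # close to the nonsmooth ridge x1 = 0
g = min_norm_element(sample_gradients(x))
print(g)   # the first component is near zero although every sampled gradient has
           # |g_1| = 1: the sampled convex hull "sees" the kink along x1 = 0
```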

Note that NSO techniques can be successfully applied to smooth problems, but not vice versa; thus, we can say that NSO deals with a broader class of problems than smooth optimization. Although using a smooth method may be desirable when all the functions involved are known to be smooth, it is often hard to confirm the smoothness in practical applications (e.g. if function values are calculated via simulation). Moreover, as already mentioned, a problem may be analytically smooth but still behave numerically nonsmoothly, in which case an NSO method is needed.

Some NSO software packages, and links to further NSO software, are available online. For more details on different NSO methods, see [1, Part III] and the references therein.

## Nonsmooth Analysis

The theory of nonsmooth analysis is based on convex analysis. Thus, we start by giving some definitions and results for convex (not necessarily differentiable) functions. We define the subgradient and the subdifferential of a convex function as they are defined in [5]. Then we generalize these results to nonconvex locally Lipschitz continuous functions.

### Convex Analysis

Definition. The subdifferential of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$ is the set

$$\partial_c f(x) = \{\, \xi \in \mathbb{R}^n : f(y) \ge f(x) + \xi^T (y - x) \ \text{for all } y \in \mathbb{R}^n \,\}.$$

Each vector $\xi \in \partial_c f(x)$ is called a subgradient of $f$ at $x$.

#### Example — Absolute-value function

The function $f(x) = |x|$ is clearly convex and differentiable when $x \ne 0$. By the definition of the subdifferential at $x = 0$,

$$\partial_c f(0) = \{\, \xi \in \mathbb{R} : |y| \ge \xi y \ \text{for all } y \in \mathbb{R} \,\}.$$

Thus, $\partial_c f(0) = [-1, 1]$.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then the classical directional derivative

$$f'(x; d) = \lim_{t \downarrow 0} \frac{f(x + t d) - f(x)}{t}$$

exists in every direction $d \in \mathbb{R}^n$, and it satisfies

$$f'(x; d) = \inf_{t > 0} \frac{f(x + t d) - f(x)}{t}.$$

The next theorem shows the relationship between the subdifferential and the directional derivative. It turns out that knowing $f'(x; \cdot)$ is equivalent to knowing $\partial_c f(x)$.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then for all $x \in \mathbb{R}^n$

1. $f'(x; d) = \max\{\, \xi^T d : \xi \in \partial_c f(x) \,\}$ for all $d \in \mathbb{R}^n$, and
2. $\partial_c f(x) = \{\, \xi \in \mathbb{R}^n : f'(x; d) \ge \xi^T d \ \text{for all } d \in \mathbb{R}^n \,\}$.

#### Example — Absolute-value function

By the previous theorem we have

$$\partial_c f(0) = \{\, \xi \in \mathbb{R} : f'(0; d) \ge \xi d \ \text{for all } d \in \mathbb{R} \,\}.$$

Now

$$f'(0; d) = \lim_{t \downarrow 0} \frac{|0 + t d| - |0|}{t} = |d|,$$

and, thus,

$$\partial_c f(0) = \{\, \xi \in \mathbb{R} : |d| \ge \xi d \ \text{for all } d \in \mathbb{R} \,\} = [-1, 1].$$

### Nonconvex Analysis

Since classical directional derivatives do not necessarily exist for locally Lipschitz continuous functions, we first define a generalized directional derivative. We then generalize the subdifferential for nonconvex locally Lipschitz continuous functions.

Definition (Clarke). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at $x \in \mathbb{R}^n$. The generalized directional derivative of $f$ at $x$ in the direction $d \in \mathbb{R}^n$ is defined by

$$f^{\circ}(x; d) = \limsup_{\substack{y \to x \\ t \downarrow 0}} \frac{f(y + t d) - f(y)}{t}.$$
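The limsup can be explored numerically. The sketch below approximates $f^{\circ}(0; 1)$ for $f(x) = |x|$ by maximizing difference quotients over sampled points $y$ near $0$ and small step sizes $t > 0$; the sampling scheme is an arbitrary choice made here, and a finite maximum is of course only an approximation of the limsup:

```python
import numpy as np

# Rough numerical sketch: approximate the generalized directional derivative
#   f°(x; d) = limsup_{y -> x, t -> 0+} (f(y + t d) - f(y)) / t
# for f(x) = |x| at x = 0 in the direction d = 1 by taking the maximum of
# difference quotients over many sampled y near 0 and small t > 0.

f = abs
x, d = 0.0, 1.0
rng = np.random.default_rng(0)

quotients = []
for eps in (1e-2, 1e-3, 1e-4):
    for _ in range(200):
        y = x + eps * rng.uniform(-1.0, 1.0)
        t = eps * rng.uniform(1e-3, 1.0)
        quotients.append((f(y + t * d) - f(y)) / t)

print(max(quotients))   # approximately 1 = |d|, the value of f°(0; d)
```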

Note that this generalized directional derivative always exists for locally Lipschitz continuous functions and, as a function of $d$, it is sublinear. Therefore, we can now define the subdifferential for nonconvex locally Lipschitz continuous functions analogously to part 2 of the previous theorem, with the directional derivative replaced by the generalized directional derivative.

Definition (Clarke). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at a point $x \in \mathbb{R}^n$. Then the subdifferential of $f$ at $x$ is the set

$$\partial f(x) = \{\, \xi \in \mathbb{R}^n : f^{\circ}(x; d) \ge \xi^T d \ \text{for all } d \in \mathbb{R}^n \,\}.$$

Each vector $\xi \in \partial f(x)$ is called a subgradient of $f$ at $x$.

Theorem (Rademacher). Let $U \subseteq \mathbb{R}^n$ be an open set. A function $f : U \to \mathbb{R}$ that is locally Lipschitz continuous on $U$ is differentiable almost everywhere on $U$.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at a point $x \in \mathbb{R}^n$. Then

$$\partial f(x) = \operatorname{conv}\{\, \xi \in \mathbb{R}^n : \text{there exists a sequence } (x_i) \text{ such that } x_i \to x,\ \nabla f(x_i) \text{ exists, and } \nabla f(x_i) \to \xi \,\},$$

where $\operatorname{conv}$ denotes the convex hull of a set.

#### Example — Absolute-value function

Approaching $x = 0$ from the left, the gradients $\nabla f(x_i) = -1$ converge to $-1$, and approaching from the right they converge to $1$. Hence the subdifferential of the absolute-value function at $x = 0$ is given by

$$\partial f(0) = \operatorname{conv}\{-1, 1\} = [-1, 1].$$
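The same computation can be mirrored numerically; the particular sequences and the finite-difference approximation of the derivative are choices made here only for illustration:

```python
# Sketch: recover the limiting gradients of f(x) = |x| at x = 0 numerically.
# f is differentiable away from 0 (cf. Rademacher's theorem), so the derivative
# is evaluated along sequences approaching 0 from the left and from the right.

def derivative(x, h=1e-8):
    # central difference; adequate here since |x| >> h away from the kink
    return (abs(x + h) - abs(x - h)) / (2.0 * h)

left  = [derivative(-10.0 ** (-k)) for k in range(1, 6)]   # x_i -> 0, x_i < 0
right = [derivative(+10.0 ** (-k)) for k in range(1, 6)]   # x_i -> 0, x_i > 0

print(left)    # each value is approximately -1.0
print(right)   # each value is approximately +1.0
# the limiting gradients are -1 and +1, and conv{-1, 1} = [-1, 1] = ∂f(0)
```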

The next list summarizes some properties of the subdifferential both in convex and nonconvex cases:

• The subdifferential $\partial_c f(x)$ of a convex function $f$ is a nonempty, convex, and compact set such that $\partial_c f(x) \subseteq \bar{B}(0; K)$, where $\bar{B}(x; r)$ denotes the closed ball with center $x$ and radius $r$, and $K > 0$ is the Lipschitz constant of $f$ at $x$.
• The subdifferential $\partial f(x)$ of a locally Lipschitz continuous function $f$ is a nonempty, convex, and compact set such that $\partial f(x) \subseteq \bar{B}(0; K)$, where $K > 0$ is the Lipschitz constant of $f$ at $x$. Moreover, $f^{\circ}(x; d) = \max\{\, \xi^T d : \xi \in \partial f(x) \,\}$ for all $d \in \mathbb{R}^n$.
• The subdifferential for locally Lipschitz continuous functions is a generalization of the subdifferential for convex functions: if $f$ is a convex function, then $\partial f(x) = \partial_c f(x)$ and $f^{\circ}(x; d) = f'(x; d)$ for all $x, d \in \mathbb{R}^n$.
• The subdifferential for locally Lipschitz continuous functions is a generalization of the classical derivative: if $f$ is both locally Lipschitz continuous and differentiable at $x$, then $\nabla f(x) \in \partial f(x)$. If, in addition, $f$ is continuously differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

With the last point in mind, we can finally write down the whole subdifferential of the absolute-value function.

#### Example — Absolute-value function

The subdifferential of the absolute-value function $f(x) = |x|$ is given by

$$\partial f(x) = \begin{cases} \{-1\}, & \text{if } x < 0, \\ [-1, 1], & \text{if } x = 0, \\ \{1\}, & \text{if } x > 0. \end{cases}$$

## Nonsmooth Optimization: Theory

Finally, we present some results that connect the theories of nonsmooth analysis and optimization. The necessary conditions for a locally Lipschitz continuous function to attain its local minimum in the unconstrained case are given in the next theorem. For convex functions these conditions are also sufficient and the minimum is global.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at $x^* \in \mathbb{R}^n$. If $f$ attains its local minimal value at $x^*$, then

1. $0 \in \partial f(x^*)$, and
2. $f^{\circ}(x^*; d) \ge 0$ for all $d \in \mathbb{R}^n$.

Theorem. If $f : \mathbb{R}^n \to \mathbb{R}$ is a convex function, then the following conditions are equivalent:

1. the function $f$ attains its global minimal value at $x^*$,
2. $0 \in \partial_c f(x^*)$, and
3. $f'(x^*; d) \ge 0$ for all $d \in \mathbb{R}^n$.

Definition. A point $x^*$ satisfying $0 \in \partial f(x^*)$ is called a critical or a stationary point of $f$.

#### Example — Absolute-value function

The function $f(x) = |x|$ is convex, and $0 \in \partial_c f(0) = [-1, 1]$. Thus $x^* = 0$ is a critical point of $f$, and by the previous theorem $f$ attains its global minimal value at $x^* = 0$.
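A minimal sketch of this criticality test in code; representing the one-dimensional subdifferential as an interval $[\text{lo}, \text{hi}]$ is a convention adopted here only for this illustration:

```python
# Sketch: check the criticality condition 0 ∈ ∂f(x) for f(x) = |x|,
# with the subdifferential represented as an interval [lo, hi].

def subdifferential_abs(x):
    if x < 0:
        return (-1.0, -1.0)   # the singleton {-1}
    if x > 0:
        return (1.0, 1.0)     # the singleton {+1}
    return (-1.0, 1.0)        # the interval [-1, 1] at the kink

def is_critical(x):
    lo, hi = subdifferential_abs(x)
    return lo <= 0.0 <= hi

print([x for x in (-1.0, -0.5, 0.0, 0.5, 1.0) if is_critical(x)])
# only x = 0 satisfies 0 ∈ ∂f(x); since f is convex, x* = 0 is the global minimizer
```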

## References

1. A. Bagirov, N. Karmitsa and M.M. Mäkelä: "Introduction to Nonsmooth Optimization: Theory, Practice and Software", Springer, 2014.
2. F.H. Clarke: "Optimization and Nonsmooth Analysis", Wiley-Interscience, New York, 1983.
3. C. Lemaréchal: "Nondifferentiable Optimization", in Optimization (G.L. Nemhauser, A.H.G. Rinnooy Kan and M.J. Todd, eds.), pp. 529-572, Elsevier North-Holland, New York, 1989.
4. M.M. Mäkelä and P. Neittaanmäki: "Nonsmooth Optimization: Analysis and Algorithms with Applications to Optimal Control", World Scientific Publishing Co., Singapore, 1992.
5. R.T. Rockafellar: "Convex Analysis", Princeton University Press, Princeton, New Jersey, 1970.
6. N.Z. Shor: "Minimization Methods for Non-Differentiable Functions", Springer-Verlag, Berlin, 1985.