# Nonsmooth Optimization (NSO)

"Classification of mathematical problems as linear and nonlinear is like classification of the Universe as bananas and non-bananas."
- Anonymous

## Introduction to Nonsmooth Optimization

Nonsmooth optimization (NSO) refers to the general problem of minimizing (or maximizing) functions that are typically not differentiable at their minimizers (maximizers). Since the classical theory of optimization presumes certain differentiability and strong regularity assumptions on the functions to be optimized, it cannot be directly utilized. However, due to the complexity of the real world, functions involved in practical applications are often nonsmooth; that is, they are not necessarily differentiable. In what follows, we briefly introduce the basic concepts of nonsmooth analysis and optimization. For more details we refer to [1,2,3,4,5,6] and the references therein.

Let us consider the NSO problem of the form

$$\min_{x} \ f(x) \quad \text{subject to} \quad x \in S,$$

where the objective function $f : \mathbb{R}^n \to \mathbb{R}$ is supposed to be locally Lipschitz continuous on the feasible set $S \subseteq \mathbb{R}^n$. Note that no differentiability or convexity assumptions are made.

The simplest example of a nonsmooth function is the absolute-value function $f(x) = |x|$ on the reals.

#### Example — Absolute-value function

The gradient of the function $f(x) = |x|$ is

$$\nabla f(x) = \begin{cases} 1, & \text{if } x > 0, \\ -1, & \text{if } x < 0. \end{cases}$$

The function is not differentiable at $x = 0$, since the one-sided derivatives $-1$ and $1$ do not coincide there.

NSO problems arise in many fields of application, for example in

• image denoising,
• optimal control,
• neural network training,
• data mining,
• economics and
• computational chemistry and physics.

Moreover, using certain important methodologies for solving difficult smooth (continuously differentiable) problems leads directly to the need to solve nonsmooth problems, which are either smaller in dimension or simpler in structure. This is the case, for instance, in

• decompositions,
• dual formulations and
• exact penalty functions.

Finally, there exist so-called stiff problems that are analytically smooth but numerically nonsmooth. This means that the gradient varies too rapidly and, thus, these problems behave like nonsmooth problems.

There are several approaches to solve NSO problems. The direct application of smooth gradient-based methods to nonsmooth problems is a simple approach, but it may lead to a failure in convergence, in the optimality conditions, or in the gradient approximation. All these difficulties arise from the fact that the objective function fails to have a derivative for some values of the variables. The following table contrasts the smooth and nonsmooth cases; a small numerical sketch after the table illustrates the convergence failure.

#### Difficulties caused by nonsmoothness

| Smooth problem | Nonsmooth problem |
| --- | --- |
| A descent direction is obtained as the opposite direction of the gradient. | The gradient does not exist at every point, leading to difficulties in defining a descent direction. |
| The necessary optimality condition is $\nabla f(x^*) = 0$. | The gradient usually does not exist at the optimal point. |
| A difference approximation can be used to approximate the gradient. | A difference approximation is not useful and may lead to serious failures. |
| | The (smooth) algorithm does not converge, or it converges to a non-optimal point. |
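As a concrete illustration of the first and last rows of the table, here is a minimal sketch (the test function $f(x) = |x|$, the starting point, and the fixed step size are choices made only for this illustration) showing how a plain fixed-step gradient iteration fails to converge on a nonsmooth function:

```python
# Minimal sketch: fixed-step "steepest descent" on the nonsmooth f(x) = |x|.
# Away from 0 the derivative is +1 or -1, so with a fixed step the iterates
# eventually jump back and forth across the minimizer x* = 0 and never converge.

def f(x):
    return abs(x)

def grad_f(x):
    # classical gradient, defined only for x != 0
    return 1.0 if x > 0 else -1.0

x, step = 0.8, 0.3            # arbitrary starting point and step size
for k in range(8):
    x = x - step * grad_f(x)
    print(f"iter {k}: x = {x: .2f}, f(x) = {f(x):.2f}")
# the iterates settle into oscillation between 0.2 and -0.1,
# so f(x) stalls at a positive value instead of reaching f(0) = 0
```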

On the other hand, using some derivative-free method may be another approach, but standard derivative-free methods such as genetic algorithms or Powell's method may be unreliable and become inefficient as the dimension of the problem increases. Moreover, the convergence of such methods has been proved only for smooth functions. In addition, different kinds of smoothing and regularization techniques may give satisfactory results in some cases but are not, in general, as efficient as the direct nonsmooth approach. Thus, special tools for solving NSO problems are needed.

Methods for solving NSO problems include subgradient methods, bundle methods, and gradient sampling methods. All of them are based on the assumption that only the objective function value and one arbitrary subgradient (generalized gradient) are available at each point.

The basic idea behind the subgradient methods is to generalize smooth methods by replacing the gradient with an arbitrary subgradient. Due to this simple structure, they are widely used NSO methods, although they may suffer from some serious drawbacks (this is true especially for the simplest versions of subgradient methods). An extensive overview of various subgradient methods can be found in [6].
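A rough sketch of this idea in code (the test function $f(x) = |x|$, its hand-coded subgradient, and the diminishing step rule $1/k$ are simplifying choices made here for illustration; actual implementations are considerably more involved):

```python
# Minimal subgradient-method sketch for the convex nonsmooth function f(x) = |x|.
# The gradient of a smooth method is replaced by an arbitrary subgradient,
# and a diminishing step size is used.

def f(x):
    return abs(x)

def subgradient(x):
    # one element of the subdifferential: {-1} for x < 0, {1} for x > 0, [-1, 1] at 0
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0                         # 0 is a valid subgradient at x = 0

x = 5.0
best = f(x)
for k in range(1, 200):
    step = 1.0 / k                     # diminishing step size
    x = x - step * subgradient(x)
    best = min(best, f(x))             # subgradient steps are not descent steps,
                                       # so the best value found so far is tracked
print(best)                            # close to the minimal value f(0) = 0
```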

At the moment, bundle methods are regarded as the most effective and reliable methods for NSO. They are based on the subdifferential theory developed by Rockafellar [5] and Clarke [2], where the classical differential theory is generalized for convex and locally Lipschitz continuous functions, respectively. The basic idea of bundle methods is to approximate the subdifferential (that is, the set of subgradients) of the objective function by gathering subgradients from previous iterations into a bundle. In this way, more information about the local behavior of the function is obtained than an individual arbitrary subgradient can yield (cf. subgradient methods).
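The bundling idea can be sketched in a few lines: each stored subgradient defines a linearization of a convex $f$, and their pointwise maximum is the piecewise-linear (cutting-plane) model that a bundle method works with. The test function $f(x) = |x|$ and the chosen bundle points are illustrative assumptions, and the sketch omits the proximal term and the serious/null step logic of an actual bundle method:

```python
# Sketch of the cutting-plane model used inside bundle methods (convex case).
# A bundle element (y_i, f(y_i), xi_i), with xi_i a subgradient at y_i, gives the
# linearization f(y_i) + xi_i * (x - y_i); the pointwise maximum of the bundled
# linearizations is a piecewise-linear lower approximation of f.

def f(x):
    return abs(x)

def subgradient(x):
    return 1.0 if x >= 0 else -1.0     # one arbitrary subgradient

bundle = [(y, f(y), subgradient(y)) for y in (-2.0, 0.5, 1.5)]

def model(x):
    # pointwise maximum of the bundled linearizations
    return max(fy + xi * (x - y) for (y, fy, xi) in bundle)

for x in (-1.0, 0.0, 0.25, 1.0):
    print(x, model(x), f(x))           # model(x) <= f(x) for every x (lower model)
```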

The newest approach is to use gradient sampling algorithms developed by Burke, Lewis and Overton. The gradient sampling method minimizes an objective function that is locally Lipschitz continuous and smooth on an open dense subset of $\mathbb{R}^n$; the objective may be nonsmooth and/or nonconvex. Gradient sampling methods may be considered stabilized steepest descent algorithms. The central idea behind these techniques is to approximate the subdifferential of the objective function through random sampling of gradients near the current iteration point. The ongoing progress in the development of gradient sampling algorithms suggests that they have the potential to rival bundle methods both in theory and in practical performance.
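A single step of this scheme can be sketched as follows; the test function, the sampling radius, the sample count, and the use of a general-purpose solver for the small quadratic program are all illustrative choices made here:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of one gradient sampling step: sample gradients near the current point
# and take the minimum-norm element of their convex hull as an approximation of
# the direction of steepest descent.

def grad(x):
    # gradient of f(x1, x2) = |x1| + x2**2, defined wherever x1 != 0
    return np.array([np.sign(x[0]), 2.0 * x[1]])

def sample_gradients(x, radius=0.1, m=20, rng=np.random.default_rng(0)):
    # sample points in a small box of half-width `radius` around x
    points = x + radius * rng.uniform(-1.0, 1.0, size=(m, 2))
    return np.array([grad(p) for p in points])

def min_norm_element(G):
    # minimize ||G^T w||^2 subject to w >= 0 and sum(w) = 1 (a small convex QP)
    m = G.shape[0]
    w0 = np.full(m, 1.0 / m)
    res = minimize(lambda w: np.dot(w @ G, w @ G), w0,
                   bounds=[(0.0, 1.0)] * m,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x @ G

x = np.array([1e-3, 0.5])              # close to the nonsmooth ridge x1 = 0
g = min_norm_element(sample_gradients(x))
print(g)   # the first component is near zero although every sampled gradient has
           # |g_1| = 1: the sampled convex hull "sees" the kink along x1 = 0
```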

Note that NSO techniques can be successfully applied to smooth problems, but not vice versa; thus, we can say that NSO deals with a broader class of problems than smooth optimization. Although using a smooth method may be desirable when all the functions involved are known to be smooth, it is often hard to confirm the smoothness in practical applications (e.g. if function values are calculated via simulation). Moreover, as already mentioned, a problem may be analytically smooth but still behave numerically nonsmoothly, in which case an NSO method is needed.

Some NSO software packages, and links to further NSO software, are available online. For more details on different NSO methods, see [1, Part III] and the references therein.

## Nonsmooth Analysis

The theory of nonsmooth analysis is based on convex analysis. Thus, we start by giving some definitions and results for convex (not necessarily differentiable) functions. We define the subgradient and the subdifferential of a convex function as they are defined in [5]. Then we generalize these results to nonconvex locally Lipschitz continuous functions.

### Convex Analysis

Definition. The subdifferential of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$ is the set

$$\partial_c f(x) = \{\, \xi \in \mathbb{R}^n : f(y) \ge f(x) + \xi^T (y - x) \ \text{for all } y \in \mathbb{R}^n \,\}.$$

Each vector $\xi \in \partial_c f(x)$ is called a subgradient of $f$ at $x$.

#### Example — Absolute-value function

The function $f(x) = |x|$ is clearly convex and differentiable when $x \ne 0$. By the definition of the subdifferential at $x = 0$,

$$\partial_c f(0) = \{\, \xi \in \mathbb{R} : |y| \ge \xi y \ \text{for all } y \in \mathbb{R} \,\}.$$

Thus, $\partial_c f(0) = [-1, 1]$.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then the classical directional derivative

$$f'(x; d) = \lim_{t \downarrow 0} \frac{f(x + t d) - f(x)}{t}$$

exists in every direction $d \in \mathbb{R}^n$, and it satisfies

$$f'(x; d) = \inf_{t > 0} \frac{f(x + t d) - f(x)}{t}.$$

The next theorem shows the relationship between the subdifferential and the directional derivative. It turns out that knowing $f'(x; \cdot)$ is equivalent to knowing $\partial_c f(x)$.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then for all $x \in \mathbb{R}^n$

1. $f'(x; d) = \max\{\, \xi^T d : \xi \in \partial_c f(x) \,\}$ for all $d \in \mathbb{R}^n$, and
2. $\partial_c f(x) = \{\, \xi \in \mathbb{R}^n : f'(x; d) \ge \xi^T d \ \text{for all } d \in \mathbb{R}^n \,\}$.

#### Example — Absolute-value function

By the previous theorem we have

$$\partial_c f(0) = \{\, \xi \in \mathbb{R} : f'(0; d) \ge \xi d \ \text{for all } d \in \mathbb{R} \,\}.$$

Now

$$f'(0; d) = \lim_{t \downarrow 0} \frac{|0 + t d| - |0|}{t} = |d|,$$

and, thus,

$$\partial_c f(0) = \{\, \xi \in \mathbb{R} : |d| \ge \xi d \ \text{for all } d \in \mathbb{R} \,\} = [-1, 1].$$

### Nonconvex Analysis

Since classical directional derivatives do not necessarily exist for locally Lipschitz continuous functions, we first define a generalized directional derivative. We then generalize the subdifferential for nonconvex locally Lipschitz continuous functions.

Definition (Clarke). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at $x \in \mathbb{R}^n$. The generalized directional derivative of $f$ at $x$ in the direction $d \in \mathbb{R}^n$ is defined by

$$f^{\circ}(x; d) = \limsup_{\substack{y \to x \\ t \downarrow 0}} \frac{f(y + t d) - f(y)}{t}.$$
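The limsup can be explored numerically. The sketch below approximates $f^{\circ}(0; 1)$ for $f(x) = |x|$ by maximizing difference quotients over sampled points $y$ near $0$ and small step sizes $t > 0$; the sampling scheme is an arbitrary choice made here, and a finite maximum is of course only an approximation of the limsup:

```python
import numpy as np

# Rough numerical sketch: approximate the generalized directional derivative
#   f°(x; d) = limsup_{y -> x, t -> 0+} (f(y + t d) - f(y)) / t
# for f(x) = |x| at x = 0 in the direction d = 1 by taking the maximum of
# difference quotients over many sampled y near 0 and small t > 0.

f = abs
x, d = 0.0, 1.0
rng = np.random.default_rng(0)

quotients = []
for eps in (1e-2, 1e-3, 1e-4):
    for _ in range(200):
        y = x + eps * rng.uniform(-1.0, 1.0)
        t = eps * rng.uniform(1e-3, 1.0)
        quotients.append((f(y + t * d) - f(y)) / t)

print(max(quotients))   # approximately 1 = |d|, the value of f°(0; d)
```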

Note that this generalized directional derivative always exists for locally Lipschitz continuous functions and, as a function of $d$, it is sublinear. Therefore, we can now define the subdifferential for nonconvex locally Lipschitz continuous functions analogously to part 2 of the previous theorem, with the directional derivative replaced by the generalized directional derivative.

Definition (Clarke). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at a point $x \in \mathbb{R}^n$. Then the subdifferential of $f$ at $x$ is the set

$$\partial f(x) = \{\, \xi \in \mathbb{R}^n : f^{\circ}(x; d) \ge \xi^T d \ \text{for all } d \in \mathbb{R}^n \,\}.$$

Each vector $\xi \in \partial f(x)$ is called a subgradient of $f$ at $x$.

Theorem (Rademacher). Let $U \subseteq \mathbb{R}^n$ be an open set. A function $f : U \to \mathbb{R}$ that is locally Lipschitz continuous on $U$ is differentiable almost everywhere on $U$.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at a point $x \in \mathbb{R}^n$. Then

$$\partial f(x) = \operatorname{conv}\{\, \xi \in \mathbb{R}^n : \text{there exists a sequence } (x_i) \text{ such that } x_i \to x,\ \nabla f(x_i) \text{ exists, and } \nabla f(x_i) \to \xi \,\},$$

where $\operatorname{conv}$ denotes the convex hull of a set.

#### Example — Absolute-value function

Approaching $x = 0$ from the left, the gradients $\nabla f(x_i) = -1$ converge to $-1$, and approaching from the right they converge to $1$. Hence the subdifferential of the absolute-value function at $x = 0$ is given by

$$\partial f(0) = \operatorname{conv}\{-1, 1\} = [-1, 1].$$
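The same computation can be mirrored numerically; the particular sequences and the finite-difference approximation of the derivative are choices made here only for illustration:

```python
# Sketch: recover the limiting gradients of f(x) = |x| at x = 0 numerically.
# f is differentiable away from 0 (cf. Rademacher's theorem), so the derivative
# is evaluated along sequences approaching 0 from the left and from the right.

def derivative(x, h=1e-8):
    # central difference; adequate here since |x| >> h away from the kink
    return (abs(x + h) - abs(x - h)) / (2.0 * h)

left  = [derivative(-10.0 ** (-k)) for k in range(1, 6)]   # x_i -> 0, x_i < 0
right = [derivative(+10.0 ** (-k)) for k in range(1, 6)]   # x_i -> 0, x_i > 0

print(left)    # each value is approximately -1.0
print(right)   # each value is approximately +1.0
# the limiting gradients are -1 and +1, and conv{-1, 1} = [-1, 1] = ∂f(0)
```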

The next list summarizes some properties of the subdifferential both in convex and nonconvex cases:

• The subdifferential $\partial_c f(x)$ of a convex function $f$ is a nonempty, convex, and compact set such that $\partial_c f(x) \subseteq \bar{B}(0; K)$, where $\bar{B}(x; r)$ denotes the closed ball with center $x$ and radius $r$, and $K > 0$ is the Lipschitz constant of $f$ at $x$.
• The subdifferential $\partial f(x)$ of a locally Lipschitz continuous function $f$ is a nonempty, convex, and compact set such that $\partial f(x) \subseteq \bar{B}(0; K)$, where $K > 0$ is the Lipschitz constant of $f$ at $x$. Moreover, $f^{\circ}(x; d) = \max\{\, \xi^T d : \xi \in \partial f(x) \,\}$ for all $d \in \mathbb{R}^n$.
• The subdifferential for locally Lipschitz continuous functions is a generalization of the subdifferential for convex functions: if $f$ is a convex function, then $\partial f(x) = \partial_c f(x)$ and $f^{\circ}(x; d) = f'(x; d)$ for all $x, d \in \mathbb{R}^n$.
• The subdifferential for locally Lipschitz continuous functions is a generalization of the classical derivative: if $f$ is both locally Lipschitz continuous and differentiable at $x$, then $\nabla f(x) \in \partial f(x)$. If, in addition, $f$ is continuously differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

With the last point in mind, we can finally write down the whole subdifferential of the absolute-value function.

#### Example — Absolute-value function

The subdifferential of the absolute-value function $f(x) = |x|$ is given by

$$\partial f(x) = \begin{cases} \{-1\}, & \text{if } x < 0, \\ [-1, 1], & \text{if } x = 0, \\ \{1\}, & \text{if } x > 0. \end{cases}$$

## Nonsmooth Optimization: Theory

Finally, we present some results that connect the theories of nonsmooth analysis and optimization. The necessary conditions for a locally Lipschitz continuous function to attain its local minimum in the unconstrained case are given in the next theorem. For convex functions these conditions are also sufficient and the minimum is global.

Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a locally Lipschitz continuous function at $x^* \in \mathbb{R}^n$. If $f$ attains its local minimal value at $x^*$, then

1. $0 \in \partial f(x^*)$, and
2. $f^{\circ}(x^*; d) \ge 0$ for all $d \in \mathbb{R}^n$.

Theorem. If $f : \mathbb{R}^n \to \mathbb{R}$ is a convex function, then the following conditions are equivalent:

1. the function $f$ attains its global minimal value at $x^*$,
2. $0 \in \partial_c f(x^*)$, and
3. $f'(x^*; d) \ge 0$ for all $d \in \mathbb{R}^n$.

Definition. A point $x^*$ satisfying $0 \in \partial f(x^*)$ is called a critical or a stationary point of $f$.

#### Example — Absolute-value function

The function $f(x) = |x|$ is convex, and $0 \in \partial_c f(0) = [-1, 1]$. Thus $x^* = 0$ is a critical point of $f$, and by the previous theorem $f$ attains its global minimal value at $x^* = 0$.
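A minimal sketch of this criticality test in code; representing the one-dimensional subdifferential as an interval $[\text{lo}, \text{hi}]$ is a convention adopted here only for this illustration:

```python
# Sketch: check the criticality condition 0 ∈ ∂f(x) for f(x) = |x|,
# with the subdifferential represented as an interval [lo, hi].

def subdifferential_abs(x):
    if x < 0:
        return (-1.0, -1.0)   # the singleton {-1}
    if x > 0:
        return (1.0, 1.0)     # the singleton {+1}
    return (-1.0, 1.0)        # the interval [-1, 1] at the kink

def is_critical(x):
    lo, hi = subdifferential_abs(x)
    return lo <= 0.0 <= hi

print([x for x in (-1.0, -0.5, 0.0, 0.5, 1.0) if is_critical(x)])
# only x = 0 satisfies 0 ∈ ∂f(x); since f is convex, x* = 0 is the global minimizer
```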

## References

1. A. Bagirov, N. Karmitsa and M.M. Mäkelä: "Introduction to Nonsmooth Optimization: Theory, Practice and Software", Springer, 2014.
2. F.H. Clarke: "Optimization and Nonsmooth Analysis", Wiley-Interscience, New York, 1983.
3. C. Lemaréchal: "Nondifferentiable Optimization", in Optimization (G.L. Nemhauser, A.H.G. Rinnooy Kan and M.J. Todd, eds.), pp. 529-572, Elsevier North-Holland, New York, 1989.
4. M.M. Mäkelä and P. Neittaanmäki: "Nonsmooth Optimization: Analysis and Algorithms with Applications to Optimal Control", World Scientific Publishing Co., Singapore, 1992.
5. R.T. Rockafellar: "Convex Analysis", Princeton University Press, Princeton, New Jersey, 1970.
6. N.Z. Shor: "Minimization Methods for Non-Differentiable Functions", Springer-Verlag, Berlin, 1985.