Convex vs. Non-Convex Functions

The ease of training a machine learning model is determined by the geometric shape of its loss landscape. Functions are broadly divided into two categories: convex and non-convex. Convex functions are well-behaved with a single valley, making optimization straightforward and guaranteed. Non-convex functions are rugged, containing multiple peaks, valleys, and flat regions, which presents a significant challenge for optimization algorithms.

Convex Functions: The Ideal Landscape

A function is convex if a line segment drawn between any two points on its graph lies above or on the graph. Visually, a convex function looks like a bowl, with a single, clear bottom.

The Convexity Test

Mathematically, a function $f$ is convex if for any two points $x_1, x_2$ and any parameter $t \in [0, 1]$: $$f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)$$ This inequality guarantees that the function curve never bulges upward between any two points.

Global Minimum Guarantee

The defining characteristic of convex functions is that any local minimum is guaranteed to be the global minimum. If an optimizer finds a point where the slope is zero, it has found the absolute best solution. Linear regression, logistic regression, and Support Vector Machines (SVMs) utilize convex loss functions, making their training extremely reliable.

Non-Convex Functions: The Deep Learning Reality

Deep neural networks do not have convex loss landscapes. Due to the interaction of multiple hidden layers and non-linear activation functions, the loss landscape becomes highly non-convex.

The Non-Convex Shape

A non-convex function is one that fails the convexity test. Its landscape is rugged, with multiple local minima (suboptimal valleys), local maxima (peaks), and saddle points (flat areas).

Navigating Non-Convexity

When training a neural network, standard gradient descent is not guaranteed to find the global minimum. The optimizer can get trapped in a local minimum or stall on a flat region. To combat this, researchers use stochastic gradient descent (SGD), momentum, and adaptive learning rates (like Adam) to escape suboptimal valleys.