Function minimization#

Function minimization is one of the most common tasks in physics and engineering. It appears in many different contexts, such as energy minimization in statistical physics and parameter optimization in model fitting or machine learning.

Function minimization has a close relationship with the root finding problem. In particular, it can be reduced to finding the root of the first derivative of the function.

Let us consider different methods for function minimization. As a working example we will use the function \(f(x) = \sin(x)\) on the interval \([0, 2\pi]\).

(Figure: the function \(f(x) = \sin(x)\) on \([0, 2\pi]\) and its minimum at \(x = 3\pi/2\).)

Newton method#

The extremum (minimum or maximum) of a function \(f(x)\) is located at a point where its derivative equals zero: \(f'(x) = 0\). The Newton-Raphson method can be applied to find these critical points by treating the derivative itself as the function whose root we need to find.

Algorithm#

  1. Start with an initial guess \(x_0\)

  2. Apply the Newton-Raphson iteration formula for the root of \(f'(x)\):

    \[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\]
  3. Repeat until convergence (when \(|x_{n+1} - x_n|\) is sufficiently small)

Technically, the method converges to a stationary point, which is not necessarily a minimum: it may be a maximum or an inflection point. To determine the type of the stationary point, we can examine the second derivative:

  • If \(f''(x) > 0\), the point is a minimum

  • If \(f''(x) < 0\), the point is a maximum

  • If \(f''(x) = 0\), the test is inconclusive (need higher-order derivatives)
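For example, for \(f(x) = \sin(x)\) we have \(f''(x) = -\sin(x)\). At the stationary point \(x = 3\pi/2\) the second derivative equals \(-\sin(3\pi/2) = 1 > 0\), so it is a minimum, while at \(x = \pi/2\) it equals \(-1 < 0\), identifying a maximum.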

Advantages and Limitations#

Advantages:

  • Quadratic convergence when close to the solution

  • Highly efficient for well-behaved functions

Limitations:

  • Requires calculation of both first and second derivatives

  • May diverge if the initial guess is poor

  • Can fail if \(f''(x_n) = 0\) or is very close to zero at any iteration

  • May converge to a local extremum rather than the global one

Implementation: the routine below requires the first and second derivatives, an initial guess, and the desired accuracy of the solution. We also limit the number of iterations to avoid infinite loops.

def newton_extremum(df, d2f, x0, accuracy=1e-7, max_iterations=100):
    """Find a stationary point of f by applying Newton-Raphson to f'(x) = 0."""
    xprev = xnew = x0
    for i in range(max_iterations):
        # Newton-Raphson step for the root of f'
        xnew = xprev - df(xprev) / d2f(xprev)

        # Stop when successive iterates are close enough
        if abs(xnew - xprev) < accuracy:
            return xnew

        xprev = xnew

    print("Newton method failed to converge to the required precision in " + str(max_iterations) + " iterations")

    return xnew

Let us test the implementation of the Newton method for finding extrema on our example function \(f(x) = \sin(x)\).

import numpy as np

def f(x):
    return np.sin(x)

def df(x):
    return np.cos(x)

def d2f(x):
    return -np.sin(x)

x0 = 5.
print("An extremum of sin(x) using Newton's method starting from x0 = ",x0," is at x =",newton_extremum(df,d2f, x0, 1.e-10))
An extremum of sin(x) using Newton's method starting from x0 =  5.0  is at x = 4.71238898038469
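We can combine this with the second-derivative test to classify the point that the method returns. Below is a minimal sketch reusing the d2f defined above (the variable names are illustrative):

# Classify the stationary point found by Newton's method
x_ext = newton_extremum(df, d2f, x0, 1.e-10)
if d2f(x_ext) > 0:
    kind = "minimum"
elif d2f(x_ext) < 0:
    kind = "maximum"
else:
    kind = "inconclusive (f'' = 0)"
print("The point x =", x_ext, "is a", kind)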

Gradient Descent#

Algorithm#

Gradient descent is an iterative optimization algorithm for finding a local minimum of a differentiable function. It works by taking steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

For a one-dimensional function \(f(x)\), the gradient descent method can be viewed as a modification of Newton’s method where the second derivative is replaced by a descent factor \(1/\gamma_n\):

\[x_{n+1} = x_n - \gamma_n f'(x_n)\]

Where:

  • \(\gamma_n > 0\) for finding a minimum

  • \(\gamma_n < 0\) for finding a maximum

  • \(\gamma_n\) is called the learning rate or step size

Implementation#

def gradient_descent(df, x0, gam = 0.01, accuracy=1e-7, max_iterations=100):
    """Find a local minimum of f by gradient descent with a fixed learning rate gam."""
    xprev = x0
    for i in range(max_iterations):
        # Step in the direction of the negative gradient
        xnew = xprev - gam * df(xprev)

        # Stop when successive iterates are close enough
        if abs(xnew - xprev) < accuracy:
            return xnew

        xprev = xnew

    print("Gradient descent method failed to converge to the required precision in " + str(max_iterations) + " iterations")

    return xnew

Let us test it on our example function \(f(x) = \sin(x)\).

def f(x):
    return np.sin(x)

def df(x):
    return np.cos(x)

x0 = 5.
gam = 0.1
print("An extremum of sin(x) using gradient descent (gam = ", gam, ") starting from x0 = ",x0," is at x =",gradient_descent(df,x0, gam, 1.e-6))
An extremum of sin(x) using gradient descent (gam =  0.1 ) starting from x0 =  5.0  is at x = 4.712397537576874
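Note that, with a similar stopping tolerance, the result is noticeably less accurate than the one obtained with Newton's method above: gradient descent converges only linearly near the minimum at \(x = 3\pi/2 \approx 4.71238898\), whereas Newton's method converges quadratically.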

Learning rate#

The choice of the learning rate \(\gamma_n\) is critical for the performance of gradient descent. If it is too small, the algorithm converges slowly; if it is too large, the iterates overshoot the minimum and the algorithm may oscillate or diverge.
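As an illustration, here is a sketch using the gradient_descent routine above; the particular values of gam are chosen only to exaggerate the two failure modes:

# Too small: steps of order gam*|f'(x)| ~ 1e-4 cannot satisfy
# the 1e-6 tolerance within the default 100 iterations
print(gradient_descent(df, x0, gam=0.001, accuracy=1.e-6))

# Too large: near the minimum convergence requires |1 - gam*f''(x)| < 1,
# i.e. gam < 2 for sin(x), so gam = 3 overshoots and fails to settle
print(gradient_descent(df, x0, gam=3.0, accuracy=1.e-6))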

It can be worthwhile to have a learning rate schedule, where the learning rate is adjusted during the optimization process, for example, decreasing it over time:

\[\gamma_n = \frac{\gamma_0}{1 + a n},\]

where \(\gamma_0\) is the initial learning rate and \(a > 0\) controls how quickly it decays.
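A minimal sketch of how such a schedule could be incorporated into the gradient descent routine above (the function name and the values of gam0 and a are illustrative):

def gradient_descent_schedule(df, x0, gam0=0.1, a=0.01, accuracy=1e-7, max_iterations=100):
    """Gradient descent with a decaying learning rate gam_n = gam0 / (1 + a*n)."""
    xprev = x0
    for n in range(max_iterations):
        # Learning rate decreases with the iteration number n
        gam_n = gam0 / (1. + a * n)
        xnew = xprev - gam_n * df(xprev)

        if abs(xnew - xprev) < accuracy:
            return xnew

        xprev = xnew

    print("Gradient descent with schedule failed to converge in " + str(max_iterations) + " iterations")

    return xnew

print(gradient_descent_schedule(df, x0=5., gam0=0.1, a=0.01, accuracy=1.e-6, max_iterations=1000))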

Multidimensional case#

Gradient descent extends straightforwardly to multivariate functions by replacing the scalar derivative with the gradient vector. The update rule becomes:

\[\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla f(\mathbf{x}_n)\]
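A minimal self-contained sketch of the multidimensional update is given below; the test function \(f(\mathbf{x}) = (x_0 - 1)^2 + 2 (x_1 + 2)^2\) and its gradient are chosen purely for illustration.

import numpy as np

def gradient_descent_nd(grad_f, x0, gam=0.1, accuracy=1e-7, max_iterations=1000):
    """Multidimensional gradient descent: x_{n+1} = x_n - gam * grad f(x_n)."""
    xprev = np.asarray(x0, dtype=float)
    for i in range(max_iterations):
        xnew = xprev - gam * grad_f(xprev)

        # Stop when the step becomes small in the Euclidean norm
        if np.linalg.norm(xnew - xprev) < accuracy:
            return xnew

        xprev = xnew

    print("Multidimensional gradient descent failed to converge in " + str(max_iterations) + " iterations")

    return xnew

# Gradient of the illustrative test function, whose minimum is at (1, -2)
def grad_f(x):
    return np.array([2. * (x[0] - 1.), 4. * (x[1] + 2.)])

print(gradient_descent_nd(grad_f, x0=[0., 0.], gam=0.1))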