
Taylor decomposition for unconstrained optimization

As is clear from my earlier post, the Taylor decomposition is the principal tool for deriving the rule for finding extrema of a function. In higher dimensions we need this decomposition even more. We derive it for a function of two arguments by reducing the 2D case to the one-dimensional one.

In the one-dimensional case, the Taylor approximation with three terms is
(1) f(x+h)\approx f(x)+\frac{f'(x)}{1!}h+\frac{f''(x)}{2!}h^2.
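
Before generalizing, it may help to see (1) on numbers. Here is a minimal Python sketch; the test function exp (whose first and second derivatives are again exp) and the step h = 0.1 are my assumed choices for illustration:

```python
import math

# Three-term Taylor approximation, as in (1).
def taylor3(f, df, d2f, x, h):
    return f(x) + df(x) * h + d2f(x) * h ** 2 / 2

# Assumed example: f(x) = exp(x), so f' = f'' = exp.
x, h = 1.0, 0.1
exact = math.exp(x + h)
approx = taylor3(math.exp, math.exp, math.exp, x, h)
print(exact, approx)  # ~3.004166 vs ~3.003701; the error shrinks like h^3
```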

We want to generalize this to the case when x=(x_{1},x_{2}) is a 2D argument. To this end, take any real number t\in\mathbb{R} and consider points of the form x+th, where h=(h_{1},h_{2}). As t changes, the points th run over the straight line through the origin in the direction of h, while the points x+th run over the straight line through the point x parallel to h. Consider the auxiliary function \phi defined by \phi (t)=f(x+th). This is a function of a real argument. If we obtain a Taylor approximation for it, letting t=1 will give the desired generalization of (1).

Technical tools

Struggling with the chain rule. The chain rule for functions of a real argument says that [f(x(t))]'=f'(x(t))x'(t). Here we need its generalization for the 2D case:

(2) \frac{d}{dt}\left[f(x_1(t),x_2(t))\right]=\frac{\partial f(x_1(t),x_2(t))}{\partial x_1}x'_1(t)+\frac{\partial f(x_1(t),x_2(t))}{\partial x_2}x'_2(t).

In our case x_1(t)=x_1+th_1, x_2(t)=x_2+th_2. Hence, (2) gives:

(3) \frac{d\phi(t)}{dt}=\frac{d}{dt}\left[f(x+th)\right]=\frac{\partial f(x+th)}{\partial x_1}h_1+\frac{\partial f(x+th)}{\partial x_2}h_2.
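
Formula (3) is easy to check numerically. The sketch below, with an assumed polynomial f and arbitrarily chosen x, h and t, compares a central finite difference of \phi with the right-hand side of (3):

```python
import numpy as np

# Assumed example: f(x1, x2) = x1**2 * x2, with partials 2*x1*x2 and x1**2.
f = lambda x1, x2: x1 ** 2 * x2

x = np.array([1.0, 2.0])   # the point x
h = np.array([0.3, -0.5])  # the direction h
t, eps = 0.7, 1e-6

phi = lambda s: f(*(x + s * h))

# Left side of (3): dphi/dt by a central finite difference.
lhs = (phi(t + eps) - phi(t - eps)) / (2 * eps)

# Right side of (3): partials of f at x + th times h1, h2.
x1, x2 = x + t * h
rhs = 2 * x1 * x2 * h[0] + x1 ** 2 * h[1]
print(lhs, rhs)  # both ~0.46585
```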

Differentiating (3) we get

(4) \frac{d^2\phi (t)}{dt^2}=\frac{d}{dt}\frac{\partial f(x+th)}{\partial x_1}h_1+\frac{d}{dt}\frac{\partial f(x+th)}{\partial x_2}h_2

(applying the chain rule (2) to the first term) =\left[\frac{\partial^2f(x+th)}{\partial x_1^2}h_1^2+\frac{\partial^2f(x+th)}{\partial x_1\partial x_2}h_1h_2\right]

(applying the chain rule (2) to the second term) +\left[\frac{\partial^2f(x+th)}{\partial x_1\partial x_2}h_1h_2+\frac{\partial^2f(x+th)}{\partial x_2^2}h_2^2\right].

Here we used the equality of mixed partials \frac{\partial^2f}{\partial x_1\partial x_2}=\frac{\partial^2f}{\partial x_2\partial x_1}, which holds when f is twice continuously differentiable.

Struggling with matrix algebra. Expressions (3) and (4) are unwieldy. We apply matrix algebra to make them shorter. Denote the gradient Df(x)=\left(\frac{\partial f(x)}{\partial x_1},\frac{\partial f(x)}{\partial x_2}\right) (a row vector) and h=\left(\begin{array}{c}h_1\\h_2\end{array}\right) (a column vector). The multiplication rule for matrices (rows of the first matrix are multiplied by columns of the second) allows us to rewrite (3) as

(5) \frac{d\phi (t)}{dt}=Df(x+th)h.

Further, denote
D^2f(x)=\left(\begin{array}{cc}\frac{\partial^2f(x)}{\partial x_1^2} &\frac{\partial^2f(x)}{\partial x_1\partial x_2}\\\frac{\partial^2f(x)}{\partial x_1\partial x_2}&\frac{\partial^2f(x)}{\partial x_2^2}\end{array}\right).
This matrix is called the Hessian. The expression h^TD^2f(x)h is called the quadratic form of the Hessian (T stands for transposition, so h^T is a row vector). Using the multiplication rule for matrices we see that

(6) h^TD^2f(x)h=h^T\left(\begin{array}{cc}\frac{\partial^2f(x)}{\partial x_1^2}&\frac{\partial^2f(x)}{\partial x_1\partial x_2} \\ \frac{\partial^2f(x)}{\partial x_1\partial x_2}&\frac{\partial^2f(x)}{\partial x_2^2}\end{array}\right)\left(\begin{array}{c}h_1\\h_2\end{array}\right)

=\frac{\partial^2f(x)}{\partial x_1^2}h_1^2+\frac{\partial^2f(x)}{\partial x_1\partial x_2}h_1h_2 +\frac{\partial^2f(x)}{\partial x_1\partial x_2}h_1h_2+\frac{\partial^2f(x)}{\partial x_2^2}h_2^2.
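
As a quick sanity check of (6), one can compute the quadratic form both ways, as a matrix product and as the expanded sum; the symmetric matrix below is an assumed stand-in for the Hessian:

```python
import numpy as np

# An arbitrary symmetric matrix standing in for D^2 f(x).
A = np.array([[3.0, 1.5], [1.5, -2.0]])
h = np.array([0.3, -0.5])

matrix_form = h @ A @ h  # h^T A h as a matrix product
expanded = (A[0, 0] * h[0] ** 2 + A[0, 1] * h[0] * h[1]
            + A[1, 0] * h[0] * h[1] + A[1, 1] * h[1] ** 2)
print(matrix_form, expanded)  # identical: both -0.68
```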

Combining (4) and (6) (with x in (6) replaced by x+th) we have
(7) \frac{d^2\phi (t)}{dt^2}=h^TD^2f(x+th)h.
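
Identity (7) can also be verified on numbers. The sketch below uses the same assumed example f(x_1,x_2)=x_1^2x_2 as before, whose Hessian is easy to write by hand, and compares a second finite difference of \phi with the quadratic form at x+th:

```python
import numpy as np

# Assumed example f(x1, x2) = x1**2 * x2 has Hessian [[2*x2, 2*x1], [2*x1, 0]].
f = lambda x1, x2: x1 ** 2 * x2
hess = lambda x1, x2: np.array([[2 * x2, 2 * x1], [2 * x1, 0.0]])

x = np.array([1.0, 2.0])
h = np.array([0.3, -0.5])
t, eps = 0.7, 1e-4

phi = lambda s: f(*(x + s * h))
lhs = (phi(t + eps) - 2 * phi(t) + phi(t - eps)) / eps ** 2  # d^2 phi / dt^2
rhs = h @ hess(*(x + t * h)) @ h                             # h^T D^2 f(x+th) h
print(lhs, rhs)  # both ~ -0.429
```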

Finally, here is the result

Applying (1) to \phi at the point 0 with step t gives

\phi(t)\approx\phi(0)+\frac{\phi'(0)}{1!}t+\frac{\phi''(0)}{2!}t^2.

Here \phi(0)=f(x) and \phi(1)=f(x+h), while (5) and (7) at t=0 give \phi'(0)=Df(x)h and \phi''(0)=h^TD^2f(x)h. Letting t=1, we get the 2D analog of (1):
f(x+h)\approx f(x)+Df(x)h+\frac{1}{2}h^TD^2f(x)h.
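
To close the loop, here is the final approximation tested on numbers, with an assumed example f(x_1,x_2)=e^{x_1}x_2^2 and a small step h:

```python
import numpy as np

# Assumed example: f(x1, x2) = exp(x1) * x2**2, with gradient and Hessian
# written out by hand.
f = lambda x: np.exp(x[0]) * x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) * x[1] ** 2,
                           2 * np.exp(x[0]) * x[1]])            # Df(x)
hess = lambda x: np.array([[np.exp(x[0]) * x[1] ** 2, 2 * np.exp(x[0]) * x[1]],
                           [2 * np.exp(x[0]) * x[1],  2 * np.exp(x[0])]])

x = np.array([0.5, 1.0])
h = np.array([0.05, -0.1])

exact = f(x + h)
approx = f(x) + grad(x) @ h + 0.5 * h @ hess(x) @ h
print(exact, approx)  # ~1.40393 vs ~1.40347; the gap is of order |h|^3
```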
