Dec 19

Review of mt3042 Optimization Guide by M. Baltovic

by Kairat Mynbaev, professor, ISE, KBTU, Almaty, Kazakhstan

This is a one-semester course in Optimization. It covers the following topics:

  • Introduction and review of relevant parts from real analysis, with emphasis on higher dimensions.
  • Weierstrass' Theorem on continuous functions on compact sets.
  • Review with added rigour of unconstrained optimization of differentiable functions.
  • Lagrange's Theorem on equality-constrained optimization.
  • The Kuhn-Tucker Theorem on inequality-constrained optimization.
  • Finite and infinite horizon dynamic programming.

The course is based mainly on the book by Sundaram, R.K., A First Course in Optimization Theory (Cambridge University Press, 1996).

The course is given to fourth-year students. I evaluate the guide on two points: how well it expands the students’ theoretical horizon and how well it prepares them for the challenges they may face while applying their knowledge in their careers, whether in industry or in academia.

  1. The exposition is overly formal. Experience shows that most students don’t understand the definition of continuity in terms of epsilon-delta. Right after giving it on p.18, the author gives an alternative definition in terms of sequences. I think it’s better to omit the former definition altogether and rephrase everything in terms of sequences. After all, the compactness notion relies on sequences.
  2. The differentiability definition 2.21 on p.19 is simply indigestible. It is in fact the Fréchet derivative applicable in the infinite-dimensional case. Who needs it here? Why not define the matrix as a matrix of partial derivatives, which students know from previous courses?
  3. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity. Therefore, a maximum should be looked for in a bounded closed set. Since such a set is compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don’t need anything else. Remarks similar to this apply to many solutions in the guide. As a result, the students don’t see the forest for the trees.
  4. Regarding the theoretical conditions, the author refers to Sundaram without explaining why they are imposed. Explanations in Sundaram are far from intuitive (see the first of them on p.107). In all cases for n=2 it is possible to give relatively simple intuitive explanations. Specifically,
    1. For first-order conditions see
    2. For second-order conditions see
    3. For the constraint qualification condition, see
      The explanation on pp.58-60 is hard to understand because of dimensionality.
    4. For the Lagrange method see
      (necessary conditions),
      (sufficient conditions),
      (case of many constraints) and
      (multiplier interpretation).
    5. In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis.
      The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page.
  5. In solutions that rely on the Kuhn-Tucker theorem, the author suggests checking the constraint qualification condition for all possible combinations of constraints. Not only is this time-consuming, but it is also misleading, since it is often possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem.
  6. Minor remark: In Example 5.2, the function f(x,1-x) is obviously nonnegative.
  7. There are many optimization methods not covered in Sundaram’s book. One of them, Pontryagin’s maximum principle, is more general than the Bellman approach, see
    It may be too advanced for this course. However, convexity is covered by Sundaram and omitted by Baltovic, while it can be successfully applied to solve some problems from the guide, see
  8. Another important area that is covered neither in the book nor in the guide is numerical optimization. Numerical optimization is as important as the Maximum Likelihood method in Statistics, because any software implementation of ML employs numerical optimization. Students need to learn at least one numerical optimization method to understand the error messages produced by the computer.
  9. Speaking of applications, I would eliminate all big exercises that have nothing to do with Economics or Finance, such as Exercise 6.2.
  10. In the solution to Exercise 6.1, the author produces a 3-page analysis but then in one line discovers that all that was for nothing because the profit is infinite. What’s the pedagogical value of such an exposition?
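To make point 8 concrete, here is a minimal sketch of one such method, plain gradient descent, applied to a smooth function of my own choosing (the function, step size, and tolerance are illustrative assumptions, not material from the guide):

```python
# Minimal gradient descent: one illustrative numerical method.
# Minimize f(x, y) = (x - 1)^2 + 10*(y + 2)^2 (a hypothetical smooth function).

def grad(x, y):
    """Gradient of f, computed by hand."""
    return (2.0 * (x - 1.0), 20.0 * (y + 2.0))

def gradient_descent(x, y, step=0.04, tol=1e-10, max_iter=10_000):
    for _ in range(max_iter):
        gx, gy = grad(x, y)
        if gx * gx + gy * gy < tol:  # stop when the gradient is nearly zero
            break
        x, y = x - step * gx, y - step * gy
    return x, y

x_min, y_min = gradient_descent(0.0, 0.0)
print(x_min, y_min)  # close to the true minimizer (1, -2)
```

If the step size were too large (say 0.11, so that |1 - 20*step| > 1), the iteration would diverge; this is exactly the kind of behavior behind software error messages about non-convergence.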

Conclusion. This is a badly written guide for a badly designed course. I suggest removing it from our program.

Dec 17

How to study mt3042 Optimisation: a guide to a guide

Section and example numbering follows that of the mt3042 Optimization Guide by M. Baltovic.

Main idea: look at geometry in the two-dimensional case

Here is an example. The norm of a vector x\in R^n is defined by \left\Vert x\right\Vert =\sqrt{\sum_{i=1}^nx_i^2}. The combination of squaring and extracting a square root often makes it difficult to understand how this construction works. Here is a simple inequality that allows one to do without this norm (or, to put it differently, to replace it with another norm). Take n=2.

\max \{|x_1|,|x_2|\}=\max \{\sqrt{x_1^2},\sqrt{x_2^2}\}\leq\max\{\sqrt{x_1^2+x_2^2},\sqrt{x_1^2+x_2^2}\}=\left\Vert  x\right\Vert =\sqrt{x_1^2+x_2^2}\leq\sqrt{\max\{|x_1|,|x_2|\}^2+\max \{|x_1|,|x_2|\}^2}=\sqrt{2}\max\{|x_1|,|x_2|\}.

We have proved that \max \{|x_1|,|x_2|\}\leq\left\Vert x\right\Vert\leq\sqrt{2}\max\{|x_1|,|x_2|\}. This easily generalizes to R^{n}:

(1) \max \{|x_1|,...,|x_n|\}\leq\left\Vert x\right\Vert\leq\sqrt{n}\max\{|x_1|,...,|x_n|\}.

Application. The set A\subset R^n is called bounded if there is a constant C such that \left\Vert x\right\Vert \leq C for all x\in A. (1) implies an equivalent definition: the set A\subset R^n is called bounded if there is a constant C such that \max\{|x_1|,...,|x_n|\}\leq C for all x\in A. See p.35 of Baltovic's guide, where the inequality y_{i}\leq \frac{f(\hat{x})}{p_{i}} is sufficient for proving boundedness of the set Y.
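As a quick numerical sanity check of inequality (1), the following sketch (my own illustration, with randomly generated vectors) compares the two norms directly:

```python
# Numerical check of inequality (1): max|x_i| <= ||x|| <= sqrt(n) * max|x_i|.
import math
import random

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 10)
    x = [random.uniform(-100, 100) for _ in range(n)]
    sup = max(abs(c) for c in x)             # the max-norm
    euc = math.sqrt(sum(c * c for c in x))   # the Euclidean norm
    assert sup <= euc + 1e-9                 # left half of (1)
    assert euc <= math.sqrt(n) * sup + 1e-9  # right half of (1)
print("inequality (1) holds on all samples")
```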

Theorem 2.2 (The Cauchy-Schwarz Inequality). This inequality does not have serious applications in the guide. For a nontrivial application of the Cauchy-Schwarz inequality see my post.

2.1.8. Avoid using the definition of continuity in terms of \varepsilon-\delta (Definition 2.18). Use Definition 2.19 in terms of sequences instead.

2.6.2. Definition 2.21 is indigestible for many students. Just say that the matrix A consists of the partial derivatives of the components of f=(f_1,...,f_m):

A=\left(\begin{array}{ccc}\frac{\partial f_1}{\partial x_1} &...& \frac{\partial f_1}{\partial x_n} \\ ... &...& ... \\ \frac{\partial f_m}{\partial x_1} &...& \frac{\partial f_m}{\partial x_n}\end{array}\right).

Theorem 2.11. The proof is really simple in the one-dimensional case. By the definition of the derivative, \frac{f(x_n)-f(x)}{x_n-x}\rightarrow f^{\prime }(x) for any sequence x_n\rightarrow x, x_n\neq x. Multiplying this ratio by x_n-x\rightarrow 0 we get f(x_n)-f(x)=\frac{f(x_n)-f(x)}{x_n-x}(x_n-x)\rightarrow f^{\prime }(x)\cdot 0=0, which proves continuity of f at x.

3.3.1. There is the Math that happens on paper (formulas) and the Math that happens in the head (logic). Many students see the formulas and miss the logic. Carefully read this section and see if the logic happens in your head.

3.4. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity \{x:|x|>N\}. Therefore, a maximum should be looked for in the set \{x:|x|\leq N\}. Since this set is compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don't need anything else.

4.2 First-order conditions for optima. See the proof.

4.4 Second-order conditions for optima. See explanation using the Taylor decomposition.

5.3 The Theorem of Lagrange. For the Lagrange method see necessary conditions, sufficient conditions, and the case of many constraints.

5.4 Proof of Lagrange's Theorem. See a simple explanation of the constraint qualification condition. The explanation on pp.58-60 is hard to understand because of dimensionality.

5.6 The Lagrangian multipliers. See simpler derivation.

6.4 Proof of the Kuhn-Tucker Theorem. In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis. The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page.

Example 6.4. In solutions that rely on the Kuhn-Tucker theorem, the author suggests checking the constraint qualification condition for all possible combinations of constraints. Not only is this time-consuming, but it is also misleading, since it is often possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem or, alternatively, to use the Kuhn-Tucker theorem to eliminate simple cases. The same problem can be solved using convexity theory.

Example 6.5. Here Baltovic conducts a controversial experiment: he shows what happens if we go the wrong way (expectedly, bad things happen) without providing the correct solution.

Solution to Exercise 6.1. In this exercise, the revenue is homogeneous of degree 2 and the cost is homogeneous of degree 1, which indicates that the profit is infinite. No need to do a three-page analysis!
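A sketch of the homogeneity argument in code (the revenue and cost functions below are hypothetical stand-ins of degrees 2 and 1, not the actual functions of Exercise 6.1):

```python
# If revenue is homogeneous of degree 2 and cost of degree 1, then along a ray
# profit(t*x) = t^2*R(x) - t*C(x), which grows without bound whenever R(x) > 0.
# R and C below are hypothetical examples, not the functions from Exercise 6.1.

def revenue(x1, x2):
    return x1 * x2              # homogeneous of degree 2

def cost(x1, x2):
    return 3.0 * x1 + 2.0 * x2  # homogeneous of degree 1

def profit(x1, x2):
    return revenue(x1, x2) - cost(x1, x2)

base = (1.0, 1.0)
profits = [profit(t * base[0], t * base[1]) for t in (1, 10, 100, 1000)]
print(profits)  # grows roughly like t^2: unbounded along the ray t*(1, 1)
```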

7.6 The Bellman equations. There are many optimization methods not covered in Sundaram's book. One of them, Pontryagin's maximum principle, is more general than the Bellman approach.

p. 172. The bound \sum \delta ^{t}|r(s_{t},a_{t})|\leq K\sum \delta ^{t} is obvious and does not require the Cauchy-Schwarz inequality.

Example 8.1. See the solution of this example using the Cauchy-Schwarz inequality.

Dec 17

The economical way to use the Kuhn-Tucker theorem

M. Baltovic in his guide to Optimization Theory blindly checks the constraint qualification condition for all possible combinations of constraints. This is not rational because a) the work can be substantially reduced by properly using the Kuhn-Tucker conditions and b) some constraint combinations may not make sense.

We illustrate this point using Example 6.4:

maximize f(x_1,x_2)=p_1x_1+p_2x_2 subject to x_1\geq 0, x_2\geq 0, x_1^2+x_2^2\leq 1.

p_1,p_2 are assumed to be positive. (The solution can also be found using convexity.)

The key is to eliminate simple cases one by one

Case 1. Are internal solutions possible? In case of internal solutions, the constraints are irrelevant and FOC's can be applied to the objective function alone. Equations \frac{\partial f}{\partial x_1}=0,\frac{\partial f}{\partial x_2}=0 give p_1=p_2=0, which is impossible.

Denote the boundaries b_1=\{x_1=0\}, b_2=\{x_2=0\}, b_3=\{x_1^2+x_2^2=1\}. The Kuhn-Tucker conditions are (the numbering follows that in Baltovic):

(6.3) \lambda _1\geq 0, x_1\geq 0, \lambda _1x_1=0

(6.4) \lambda _2\geq 0, x_2\geq 0, \lambda _2x_2=0

(6.5) \lambda _3\geq 0, 1\geq x_1^2+x_2^2, \lambda_3(1-x_1^2-x_2^2)=0

(6.6) p_1+\lambda_1-2\lambda_3x_1=0

(6.7) p_2+\lambda_2-2\lambda_3x_2=0

Remember that these conditions work only when the maximizing point satisfies the constraint qualification condition. It is obviously satisfied for b_1,b_2,b_3 considered separately.

Case 2. Suppose the maximum belongs to only b_1 (and not to b_1\cap b_2 or b_1\cap b_3). Then from (6.6) p_1=-\lambda_1\leq 0, which is impossible.

Case 3. Similarly, if the maximum belongs to only b_2, then p_2=-\lambda_2\leq 0, which is also impossible.

Case 4. If the maximum belongs only to b_3, then x_1,x_2>0 and from (6.3), (6.4) we see that \lambda_1=\lambda_2=0. From (6.5)-(6.7) then

x_1^2+x_2^2=1, p_1=2\lambda _3x_1, p_2=2\lambda_3x_2.

This system can be easily solved to give the maximizing point

x_{\max}=\left(\frac{p_1}{\sqrt{p_1^2+p_2^2}},\frac{p_2}{\sqrt{p_1^2+p_2^2}}\right).
The value of the objective function at this point is

f(x_{\max})=p_1x_1+p_2x_2=\sqrt{p_1^2+p_2^2}.
Case 5. The only possibilities that are left are x^{(1)}=(0,0)\in b_1\cap b_2, x^{(2)}=(0,1)\in b_1\cap b_3, x^{(3)}=(1,0)\in b_2\cap b_3. Don't bother checking the constraint qualification for these points because a) it may fail, in which case the Kuhn-Tucker theorem does not apply, even though any of these points can be a maximum point, and b) even if it holds, none of these points may be a maximum point (the Kuhn-Tucker theorem provides just a necessary condition). Just check directly that f at these points takes values lower than at x_{\max}:

f(x^{(1)})=0, f(x^{(2)})=p_2, f(x^{(3)})=p_1.

It's easy to see that f(x_{\max})=\sqrt{p_1^2+p_2^2}>\max\{p_1,p_2\}, since both prices are positive.

Case 6. The intersection b_1\cap b_2\cap b_3 is empty, so checking the constraint qualification condition for it is a waste of time.

The solution in Baltovic takes more than three pages.
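As a sanity check, the Case 4 solution can be verified directly against conditions (6.3)-(6.7) for the sample prices (p_1,p_2)=(3,4), a choice of mine for illustration (any positive prices work the same way):

```python
# Direct check of the Case 4 solution of Example 6.4 for sample prices.
import math

p1, p2 = 3.0, 4.0
norm = math.sqrt(p1**2 + p2**2)
x1, x2 = p1 / norm, p2 / norm   # candidate maximizer on b_3
lam3 = norm / 2.0               # from p_i = 2*lam3*x_i

# The point lies exactly on the circle x1^2 + x2^2 = 1.
assert abs(x1**2 + x2**2 - 1.0) < 1e-12

# Conditions (6.6)-(6.7) hold with lam1 = lam2 = 0.
assert abs(p1 - 2.0 * lam3 * x1) < 1e-12
assert abs(p2 - 2.0 * lam3 * x2) < 1e-12

# f at the candidate beats the three corner points of Case 5.
f_max = p1 * x1 + p2 * x2       # equals sqrt(p1^2 + p2^2)
corners = [0.0, p2, p1]         # f at (0,0), (0,1), (1,0)
assert f_max >= max(corners)
print(f_max)  # 5.0
```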

Nov 17

Using convexity in optimization

Using graphical analysis

Here we look at the solution of Example 6.4 from the mt3042 Optimization Guide by M. Baltovic:

maximize f(x_1,x_2)=p_1x_1+p_2x_2

subject to h_1(x_1,x_2)=x_1\geq 0, h_2(x_1,x_2)=x_2\geq0, h_3(x_1,x_2)=x_1^2+x_2^2\leq 1.

p_1,p_2 are assumed to be positive. This could be about maximizing revenue from selling two goods, and those familiar with graphical analysis in Economics can immediately say that the maximum is achieved where a level line p_1x_1+p_2x_2=\text{const} is tangent to the circle x_1^2+x_2^2=1. The first two constraints are not binding.

The proof of the Kuhn-Tucker theorem shows that we can then use the Lagrangian L=p_1x_1+p_2x_2+\lambda(x_1^2+x_2^2-1), which doesn't involve the nonbinding constraints. The implicit function existence condition necessary for application of the Lagrange theorem is \left(\frac{\partial h_3}{\partial x_1},\frac{\partial h_3}{\partial x_2}\right)=\left(2x_1,2x_2\right)\neq 0, and it is satisfied everywhere except at the origin. The FOC's are

\frac{\partial L}{\partial x_1}=p_1+2\lambda x_1=0, \frac{\partial L}{\partial x_2}=p_2+2\lambda x_2=0,

\frac{\partial L}{\partial\lambda }=x_1^2+x_2^2-1=0.

I leave it to the reader to find the solution:

(x_1,x_2)=\left(\frac{p_1}{\sqrt{p_1^2+p_2^2}},\frac{p_2}{\sqrt{p_1^2+p_2^2}}\right) .

Graphical analysis in a more educated way

A set is called convex if, together with any two points belonging to it, it contains the whole segment connecting those two points. A square and a disk are convex. A heart is not convex (all sets are considered with their interiors and boundaries).

For a function f, the set \text{epi}f=\left\{(x,y):y\geq f(x)\right\} of points lying on and above the graph y=f(x) is called an epigraph. f is called convex if its epigraph is convex. Examples of convex functions: a linear function y=p_1x_1+p_2x_2, a paraboloid y=x_1^2+x_2^2, an exponential function y=e^x. Examples of functions that are not convex: y=\log x, a parabola y=ax^2+bx+c with a<0.
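A midpoint check is a quick numerical test of convexity (a necessary condition, not a full proof): a convex f satisfies f((u+v)/2) \leq (f(u)+f(v))/2. The sketch below, my own illustration with p=(2,3) chosen arbitrarily, runs this test on the examples above:

```python
# Midpoint check of convexity for the examples in the text.
import math
import random

random.seed(1)

def check_midpoint_convex(f, dim, lo=-3.0, hi=3.0, trials=2000):
    """Return False if a random pair violates f((u+v)/2) <= (f(u)+f(v))/2."""
    for _ in range(trials):
        u = [random.uniform(lo, hi) for _ in range(dim)]
        v = [random.uniform(lo, hi) for _ in range(dim)]
        m = [(a + b) / 2.0 for a, b in zip(u, v)]
        if f(m) > (f(u) + f(v)) / 2.0 + 1e-9:
            return False
    return True

linear = lambda x: 2.0 * x[0] + 3.0 * x[1]   # p1*x1 + p2*x2 with p = (2, 3)
paraboloid = lambda x: x[0]**2 + x[1]**2
exponential = lambda x: math.exp(x[0])
concave_parabola = lambda x: -x[0]**2        # a = -1 < 0, not convex

print(check_midpoint_convex(linear, 2))            # True
print(check_midpoint_convex(paraboloid, 2))        # True
print(check_midpoint_convex(exponential, 1))       # True
print(check_midpoint_convex(concave_parabola, 1))  # False: a violation is found
```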

Maximum principle. A convex function on a compact convex domain attains its maximum on the boundary.

Geometric interpretation of the gradient: let y=f(x_1,x_2) be differentiable. Then the gradient \left(\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2}\right), at any given point (x_1,x_2), shows the direction in which the growth of the function is the fastest.

In our case both the function and the domain are convex, so by the maximum principle the maximum is attained on the boundary. Since the gradient is (p_1,p_2), which points into the positive quadrant, the function cannot be maximized on the coordinate axes. Thus, only the third constraint can be binding. The rest of the solution employs the Lagrangian as above.

Compare the above solutions with the three-page solution from Baltovic (pp.93-96).

Main lessons: (1) Don't pedantically look through all possible combinations of constraints. This can be good only in a multidimensional problem where you have no geometric guidance. (2) Use geometry, convexity or any other means to figure out which constraints are binding. Then the proof of the Kuhn-Tucker theorem tells you that you can work with the Lagrangian with only binding constraints included. Check the constraint qualification (which I call the implicit function existence condition) only for binding constraints.

Oct 17

The Kuhn-Tucker theorem ins and outs

We capitalize on previous posts on constrained optimization to explain the format of the Kuhn-Tucker theorem. So far all proofs for one constraint have been done in full. Here we indicate how they can be generalized to the case of many constraints, without going into technicalities.

Problem statement

Consider the maximization problem with several inequality constraints

(1) maximize f(x) subject to h_1(x)\geq 0,...,h_l(x)\geq 0.

Suppose x is a maximizing point and k out of l constraints are binding. Renumbering the constraints, if necessary, we can assume that it is the first k constraints that are binding:

(2) h_1(x)=0,...,h_k(x)=0,\ h_{k+1}(x)>0,...,h_l(x)>0.

The full Lagrangian is defined by L(x,\lambda)=f(x)+\lambda H(x) where

H(x)=\left(\begin{array}{c}h_1(x) \\... \\h_l(x) \end{array}\right) ,\ \lambda =(\lambda _1,...,\lambda_l).

The purpose is to formulate conditions which the Lagrangian, Lagrangian multipliers and constraints should satisfy.

The implicit function existence

Let us separate the equality constraints by including them in

H_k(x)=\left(\begin{array}{c}h_1(x) \\... \\h_k(x)\end{array}\right).

Equality constraints are supposed to determine some of the variables as implicit functions of the others. The number of the implicit functions should be equal to the number of constraints. This results in the condition that the matrix

\frac{\partial H_k}{\partial x}=\left(\begin{array}{ccc}\frac{\partial h_1}{\partial x_1} &... & \frac{\partial h_1}{\partial x_n} \\ ...&...&... \\ \frac{\partial h_k}{\partial x_1} &... &\frac{\partial h_k}{\partial x_n} \end{array}\right)

should have rank k. The number of "the other variables" should be positive, which leads to the condition k<n.
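A small illustration of the rank condition (the two constraints below are my own toy example, not from the text): for h_1(x)=x_1^2+x_2^2+x_3^2-1 and h_2(x)=x_3, i.e. the unit circle in the plane x_3=0, the Jacobian \frac{\partial H_k}{\partial x} is 2\times 3, has rank k=2 on the constraint set, and drops rank at the origin:

```python
# Rank of the 2x3 Jacobian of H_k for a toy pair of constraints in R^3.

def jacobian(x1, x2, x3):
    return [[2 * x1, 2 * x2, 2 * x3],  # gradient of h1 = x1^2 + x2^2 + x3^2 - 1
            [0.0, 0.0, 1.0]]           # gradient of h2 = x3

def rank2(j):
    """Rank of a 2x3 matrix via its three 2x2 minors."""
    (a, b, c), (d, e, f) = j
    minors = (a * e - b * d, a * f - c * d, b * f - c * e)
    if any(abs(m) > 1e-12 for m in minors):
        return 2
    return 1 if any(abs(t) > 1e-12 for row in j for t in row) else 0

# At a point of the constraint set, e.g. (1, 0, 0), the rank is 2 = k.
print(rank2(jacobian(1.0, 0.0, 0.0)))  # 2
# At the origin the gradient of h1 vanishes and the rank drops below k.
print(rank2(jacobian(0.0, 0.0, 0.0)))  # 1
```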

Reduction of inequality-constrained problem to equality-constrained one

If you compare my posts with the book by Sundaram, R.K. (A First Course in Optimisation Theory, Cambridge: Cambridge University Press, 1996) or the MT3042 Optimisation Theory Guide by M. Baltovic, you will notice that they assume the domain of f to be an open set, while I everywhere implicitly assumed it to be the whole R^n. This difference doesn't matter, as long as the extreme point is an internal point of the domain, which I also implicitly assumed. Another assumption that I have been skipping is that the functions have continuous derivatives of appropriate orders (one or two, depending on the context). Now I have to explicitly make and use such assumptions.

Smoothness conditions. The objective and constraint functions have continuous first derivatives.

Differentiable functions are known to be continuous. Therefore the set U=\{x\in R^n:h_{k+1}(x)>0,...,h_l(x)>0\} determined by the nonbinding constraints is open. Since by assumption the maximizing point satisfies (2), it belongs to U, and the problem will not change if we replace the original domain of f, whatever it was, by U. What we have achieved by this trick is that the nonbinding constraints have been subsumed in the domain and can be dropped from the optimization problem. The resulting problem

(3) maximize f(x) subject to h_1(x)=0,...,h_k(x)=0

has only equality constraints and admits application of the result we already have.

The Kuhn-Tucker theorem derived from constrained optimization

For (3) we form a partial Lagrangian L_k(x,\lambda^{(k)})=f(x)+\lambda^{(k)}H_k(x) with \lambda^{(k)}=(\lambda_1,...,\lambda_k). The first-order conditions are

(4) \frac{\partial L_k}{\partial x}=0, \frac{\partial L_k}{\partial\lambda^{(k)}}=H_k(x)=0.

Let us set the remaining components of \lambda to zero:

(5) \lambda_{k+1}=...=\lambda _l=0.

With this choice the partial Lagrangian is the same as the full one, L_k(x,\lambda^{(k)})=f(x)+\lambda^{(k)}H_{k}(x)=f(x)+\lambda H(x), and the first equation from (4) is the same as \frac{\partial L}{\partial x}=0 or, in terms of gradients,

(A) Df(x)+\lambda_1Dh_1(x)+...+\lambda_lDh_l(x)=0.

From (1), obviously,

(B) h_1(x)\geq 0,...,h_l(x)\geq 0.

Next we need to establish

(C) \lambda _1h_1(x)=0,...,\lambda_lh_l(x)=0.

The first k equations follow from (2) and the last l-k follow from (5). Finally, the Kuhn-Tucker theorem states that

(D) \lambda_1\geq 0,...,\lambda _l\geq 0.

The last l-k inequalities follow from (5). The first k inequalities actually express the property: if a constraint is efficient, then the Lagrange multiplier is nonnegative. This property was established by perturbing one equality constraint. Since in the present case each equality constraint can be perturbed independently of the others, the same proof goes through.

Kuhn-Tucker theorem. If the implicit function existence and smoothness conditions are satisfied, then an internal maximum x must satisfy (A)-(D).

Remark. Karush, in his master's thesis, discovered this theorem 12 years before Kuhn and Tucker published it.

Oct 17

The Kuhn-Tucker theorem: a first look

In its simplest form, this theorem addresses optimization with an inequality constraint: maximize the objective function f(x,y) subject to the inequality constraint g(x,y)\geq 0.

If the solution (x^\ast,y^\ast) satisfies

(1) g(x^\ast,y^\ast)=0,

the constraint is called efficient (binding), otherwise it is called slack (or nonbinding). Here we answer the question: what can be said when the constraint is efficient? More precisely, what can be said if the maximum is attained at a point such that (1) is true, as opposed to g(x^\ast,y^\ast)>0? Note that when the maximum in fact satisfies (1), we can use the Lagrangian if the implicit function existence condition holds:

(2) \left(\frac{\partial g}{\partial x},\frac{\partial g}{\partial y}\right)  \neq 0.

Constrained set as a union of level sets

Consider the curve (level set) C_\eta=\{(x,y):g(x,y)=\eta\}, \eta\geq 0. Obviously, the constrained set \{(x,y):g(x,y)\geq 0\} is a union of these level sets. We need to compare the values of the objective function on C_0 with those on C_\eta for \eta>0.

Applying the Lagrangian multiplier interpretation

Assuming (2), let \lambda be the Lagrangian multiplier for the equality constrained problem: maximize f(x,y) subject to g(x,y)=0. Writing g(x,y)=\eta as g(x,y)+c=0 with c=-\eta we can apply the equation

(3) \frac{dF}{dc}|_{c=0}=\lambda

where F(c)=f(x,y(x,c)) is the maximized value on the curve C_{c}. To simplify the notation, the maximizing point on the curve C_\eta is denoted (x^\ast(\eta),y^\ast(\eta )) and the maximum value F(\eta)=f(x^\ast(\eta),y^\ast(\eta)). In terms of the original variable \eta equation (3) becomes

\frac{dF}{d\eta }|_{\eta =0}=-\lambda .

This equation explains everything. If the value F(0)=f(x^\ast(0),y^\ast(0)) is larger than or equal to F(\eta)=f(x^\ast(\eta ),y^\ast(\eta )), \eta>0, then the derivative \frac{dF}{d\eta}|_{\eta=0} is nonpositive, which translates into nonnegativity of \lambda.

Conclusion. If a constraint is efficient, then the Lagrange multiplier is nonnegative.
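The conclusion can be illustrated numerically on a toy problem of my own (not from the guide): for f(x,y)=x+y and g(x,y)=1-x^2-y^2, the maximized value on C_\eta is F(\eta)=\sqrt{2(1-\eta)} (computed by hand) and the multiplier at the maximizer (1/\sqrt{2},1/\sqrt{2}) is \lambda=\sqrt{2}/2, so a finite difference should reproduce \frac{dF}{d\eta}|_{\eta=0}=-\lambda:

```python
# Finite-difference check of dF/d(eta)|_0 = -lambda on a toy problem.
import math

def F(eta):
    """Max of x + y on the level set x^2 + y^2 = 1 - eta (closed form)."""
    return math.sqrt(2.0 * (1.0 - eta))

lam = math.sqrt(2.0) / 2.0     # Lagrange multiplier at (1/sqrt(2), 1/sqrt(2))

eta = 1e-6
fd = (F(eta) - F(0.0)) / eta   # one-sided finite difference at eta = 0
print(fd, -lam)                # both approximately -0.7071
```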