Dec 19

Review of mt3042 Optimization Guide by M. Baltovic

Review of mt3042 Optimization Guide by M. Baltovic

by Kairat Mynbaev, professor, ISE, KBTU, Almaty, Kazakhstan

This is a one-semester course in Optimization. It covers the following topics:

  • Introduction and review of relevant parts from real analysis, with emphasis on higher dimensions.
  • Weierstrass' Theorem on continuous functions on compact sets.
  • Review with added rigour of unconstrained optimization of differentiable functions.
  • Lagrange's Theorem on equality-constrained optimization.
  • The Kuhn-Tucker Theorem on inequality-constrained optimization.
  • Finite and infinite horizon dynamic programming.

The course is based mainly on the book by Sundaram, L.R. A First Course in Optimization Theory. (Cambridge University Press, 1996).

The course is given to fourth year students. I evaluate the guide on two points: how well it expands the students’ theoretical horizon and how well it prepares them to the challenges they may face while applying their knowledge in their careers, whether in industry or in academia.

  1. The exposition is overly formal. Experience shows that most students don’t understand the definition of continuity in terms of epsilon-delta. Right after giving it on p.18, the author gives an alternative definition in terms of sequences. I think it’s better to omit the former definition altogether and rephrase everything in terms of sequences. After all, the compactness notion relies on sequences.
  2. The differentiability definition 2.21 on p.19 is simply indigestible. It is in fact the Fréchet derivative applicable in the infinite-dimensional case. Who needs it here? Why not define the matrix as a matrix of partial derivatives, which students know from previous courses?
  3. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity. Therefore, a maximum should be looked for in a bounded closed set. Since this is a compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don’t need anything else. Remarks similar to this apply to many solutions in the guide. As a result, the students don’t see the forest behind the trees.
  4. Regarding the theoretical conditions, the author refers to Sundaram without explaining why they are imposed. Explanations in Sundaram are far from intuitive (see the first of them on p.107). In all cases for n=2 it is possible to give relatively simple intuitive explanations. Specifically,
    1. For first-order conditions see
    2. For second-order conditions see
    3. For the constraint qualification condition, see
      The explanation on pp.58-60 is hard to understand because of dimensionality.
    4. For the Lagrange method see
      (necessary conditions),
      (sufficient conditions),
      (case of many constraints) and
      (multiplier interpretation).
    5. In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis
      The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page:
  5. In solutions that rely on the Kuhn-Tucker theorem, the author suggests to check the constraint qualification condition for all possible combinations of constraints. Not only is this time consuming, but this is also misleading, given the fact that often it is possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem.
  6. Minor remark: In Example 5.2, the function f(x,1-x) is obviously nonnegative.
  7. There are many optimization methods not covered in Sundaram’s book. One of them, Pontryagin’s maximum principle, is more general than the Bellman approach, see
    It may be too advanced for this course. However, convexity is covered by Sundaram and omitted by Baltovic, while it can be successfully applied to solve some problems from the guide, see
  8. Another important area that is covered neither in the book nor in the guide is numerical optimization. Numerical optimization is as important as the Maximum Likelihood method in Statistics because ML realization in any software employs numerical optimization. People need to learn at least one method in numerical optimization to understand error messages produced on the computer.
  9. Speaking of applications, I would eliminate all big exercises that have nothing to do with Economics or Finance, such as Exercise 6.2.
  10. In the solution to Exercise 6.1, the author produces a 3-page analysis but then in one line discovers that all that was for nothing because the profit is infinite. What’s the pedagogical value of such an exposition?

Conclusion. This is a badly written guide for a badly designed course. I suggest removing it from our program.

Aug 19

Elementary transformations

Elementary transformations

Here we look at matrix representations of transformations called elementary.

Exercise 1. Let e_{i} denote the i-th unit row vector and let A be an arbitrary matrix. Then

a) premultiplication of A by e_{j} cuts out of A the j-th row A_{j}.

b) Premultiplication of A by e_{i}^{T}e_{j} puts the row A_{j} into the i-th row of the null matrix.

c) Premultiplication of A by I+ce_{i}^{T}e_{j} adds row A_{j} multiplied by c to row A_{i}, without changing the other rows of A.

d) The matrix I+ce_{i}^{T}e_{j} has determinant 1.

Proof. a) It's easy to see that

e_{j}A=\left( 0...0~1~0...0\right) \left(\begin{array}{ccc}a_{11} & ... & a_{1n} \\  ... & ... & ... \\a_{j1} & ... & a_{jn} \\... & ... & ... \\a_{n1} & ... & a_{nn}\end{array}  \right) =\left(\begin{array}{ccc}a_{j1} & ... & a_{jn}\end{array}\right) =A_{j}.

b) Obviously,

e_{i}^{T}e_{j}A=\left(\begin{array}{c}0 \\... \\0 \\1 \\0 \\... \\0\end{array}  \right) \left(\begin{array}{ccc}a_{j1} & ... & a_{jn}\end{array}  \right) =\left(\begin{array}{ccc}0 & ... & 0 \\... & ... & ... \\a_{j1} & ... & a_{jn} \\... & ... & ... \\  0 & ... & 0\end{array}  \right) =\left(\begin{array}{c}\Theta \\A_{j} \\\Theta\end{array}  \right)

(A_{j} in the i-th row, \Theta denotes null matrices of conformable dimensions)

c) (I+ce_{i}^{T}e_{j})A=A+ce_{i}^{T}e_{j}A=\left(\begin{array}{c}  A_{1} \\... \\A_{i} \\... \\A_{n}\end{array}  \right) +\left(\begin{array}{c}\Theta \\... \\cA_{j} \\... \\\Theta\end{array}  \right) =\left(\begin{array}{c}A_{1} \\... \\A_{i}+cA_{j} \\  ... \\A_{n}\end{array}\right) .

d) The matrix A=I+ce_{i}^{T}e_{j} has ones on the main diagonal and only one nonzero element a_{ij}=c outside it. By the Leibniz formula it's determinant is 1.

The reader can easily solve the next

Exercise 2. a) Postmultiplication of A by e_{j}^{T} cuts out of A the j-th column A^{(j)}.

b) Postmultiplication of A by e_{j}^{T}e_{i} puts the column A^{(j)} into the i-th column of the null matrix.

c) Postmultiplication of A by I+ce_{j}^{T}e_{i} adds column A^{(j)} multiplied by c to column A^{(i)}, without changing the other columns of A.

d) The matrix I+ce_{j}^{T}e_{i}=(I+ce_{i}^{T}e_{j})^{T} has determinant 1.

Exercise 3. a) Premultiplication of A by

(1) \left(\begin{array}{c}e_{1} \\... \\e_{j} \\... \\e_{i} \\... \\e_{n}\end{array}  \right)

permutes rows A_{i},A_{j}.

b) Postmultiplication of A by the transpose of (1) permutes columns A^{(i)},A^{(j)}.

This is a general property of permutation matrices. Recall also that their determinants can be only \pm 1.

Definition. 1) Adding some row multiplied by a constant to another row or 2) adding some column multiplied by a constant to another column or 3) permuting rows or columns is called an elementary operation. Accordingly, matrices that realize them are called elementary matrices.

Aug 18

Law and order in the set of matrices

Law and order in the set of matrices

The law is to feel by touch every little fact. The order is discussed below.

Why do complex numbers come right before this topic?

The analog of a conjugate number is the transpose A^T. How can you tell this? Using Exercise 2 we see that (cx)\cdot {y}=x\cdot (\bar{c}y). This looks similar to the identity (Ax)\cdot y=x\cdot(A^Ty) (see Exercise 4). Therefore the mapping c\rightarrow \bar{c} is similar to A\rightarrow A^{T}.

Once you know this, it's easy to come up with a couple of ideas.

Idea 1. From the characterization of real numbers (see Equation (4)) we see that matrices that satisfy A=A^T (symmetric matrices) correspond to real numbers and should be better in some way than asymmetric ones.

Idea 2. From Equation (3) we see that the matrix A^TA should be symmetric and non-negative.

What is a non-negative matrix?

The set of real numbers is ordered in the sense that for any two real numbers a,b we can say that either a\geq b or a<b is true. The most important property that we used in my class is this: if a\geq b and c>0, then ac\geq bc (the sign of an inequality is preserved if the inequality is multiplied by a positive number). Since any two numbers can be compared like that, it is a complete order.

One way in which symmetric matrices are better than more general ones is that for symmetric matrices one can define order. The limitation caused by dimensionality is that this order is not complete (some symmetric matrices are not comparable).

Exercise 1. For the matrix A=\left(\begin{array}{cc}a_{11}&a_{12}\\a_{12}&a_{22}\end{array}\right) and vector x\in R^2 find the expression Q_A(x)=x^TAx. What is the value of this expression at x=0?

SolutionQ_A(x)=x_1^2+2a_{12}x_1x_2+a_{22}x_2^2 and Q_A(0)=0.

Definition 1. The function Q_A(x)=x^TAx is called a quadratic form of the matrix A. Here A is symmetric of size n\times n and x\in R^n.

Discussion. 1) The facts that A is in the subscript and the argument is x mean that A is fixed and x is changing.

2) While the argument is a vector, the value of this function is a real number: Q_A acts from R^n to R.

3) Q_A(x) does not contain constant or linear terms (of type c and ax_i). It contains only quadratic terms (write x_1x_2=x_1^1x_2^1 to see that the total power is 2), that's why it is called a quadratic form and not a quadratic function.

Definition 2. We say that A is positive if Q_A(x)>0 for all nonzero x\in R^n and non-negative if Q_A(x)\geq 0 for all x\in R^n. (Most sources say positive definite instead of just positive and non-negative definite instead of just non-negative. I prefer a shorter terminology. If you don't understand why in the definition of positivity we require nonzero x, go back to Exercise 1). As with numbers, for two symmetric matrices A,B of the same size, we write A>B or A\geq B if A-B is positive or non-negative, respectively. Continuing this idea, we can say that A is negative if -A is positive.

More on motivation. A legitimate definition of order A\geq B would obtain if we compared the two matrices element-wise. Definition 2 is motivated by the fact that quadratic forms arise in the multivariate Taylor decomposition.

Sylvester's criterion is the only practical tool for determining positivity or non-negativity. However, in one case this is simple.

Exercise 2. Show that A^TA is symmetric and non-negative.

Solution. The symmetry is straightforward and has been shown before. Non-negativity is not difficult either: Q_{A^TA}(x)=x^TA^TAx=(Ax)^TAx=\|Ax\|^2\ge 0.


The graph of a quadratic form in good cases is an elliptic paraboloid and has various other names in worse cases. Geometrically, the definition of the inequality A\geq B means that the graph of A is everywhere above the graph of B (at the origin they always coincide). In particular, A\geq 0 means that the graph of A is everywhere above the horizontal plane.

Examples. All examples are matrices of size 2\times 2.

Quadratic form of identity (elliptic paraboloid)

Figure 1. Quadratic form of identity (elliptic paraboloid)

1) The identity matrix I is positive because Q_I(x)=x_1^2+x_2^2, see Figure 1.

Parabolic cylinder

Figure 2. Parabolic cylinder

2) The matrix A=\left(\begin{array}{cc}1&0\\0&0\end{array}\right) is non-negative. Its quadratic form Q_A(x)=x_1^2 grows when |x_1| grows and stays flat when x_2 changes and x_1 is fixed, see Figure 2.

Hyperbolic paraboloid

Figure 3. Hyperbolic paraboloid

3) The matrix B=\left(\begin{array}{cc}1&0\\0&-1\end{array}\right) is not positive or non-negative or negative or non-positive. Its quadratic form Q_B(x)=x_1^2-x_2^2 is a parabola with branches looking upward when the second argument is fixed and a parabola with branches looking downward when the first argument is fixed, see Figure 3. When a surface behaves like that around some point, that point is called a saddle point.

Aug 18

Summary and questions for repetition

Summary and questions for repetition

I was planning to cover the Moore-Penrose inverse which allows one to solve the equation Ax=y for any A (not necessarily square). Now my feeling is that it would be too much for a standard linear algebra course. This is the most easily accessible sourse.

  1. Give the definition and example of an orthonormal system. Prove that elements of such a system are linearly independent.

  2. "To know how a matrix acts on vectors, it is enough to know how it acts on the elements of an orthonormal basis." Explain.

  3. How can you reveal the elements of a matrix A from (Ax)\cdot y?

  4. A linear mapping from one Euclidean space to another generates a matrix. Prove.

  5. Prove that an inverse of a linear mapping is linear.

  6. What is a linear mapping in the one-dimensional case?

  7. In case of a square matrix, what are the four equivalent conditions for the equation Ax=y to be good (uniquely solvable for all y)?

  8. Give two equivalent definitions of linear independence.

  9. List the simple facts about linear dependence that students need to learn first.

  10. Prove the criterion of linear independence.

  11. Let the vectors x^{(1)},...,x^{(k)} be linearly dependent and consider the regression model y=\beta_1x^{(1)}+...+\beta_kx^{(k)}+u. Show that here the coefficients \beta_1,...,\beta_k cannot be uniquely determined (this is a purely algebraic fact, you don't need to know anything about multiple regression).

  12. Define a basis. Prove that if x^{(1)},...,x^{(k)} is a basis and x is decomposed as x=a_1x^{(1)}+...+a_kx^{(k)}, then the coefficients a_1,...,a_k are unique. Prove further that they are linear functions of x.

  13. Prove that the terms in the orthogonal sum of two subspaces have intersection zero.

  14. Prove dimension additivity.

  15. Prove that a matrix and its transpose have the same rank.

  16. Prove the rank-nullity theorem.

  17. Prove the upper bound on the matrix rank in terms of the matrix dimensions.

  18. Vectors are linearly independent if and only if one of them can be expressed as a linear combination of the others.

  19. What can be said about linear (in)dependence if some vectors are added to or removed from the system of vectors?

  20. Prove that if the number of vectors in a system is larger than the space dimension, then such a system is linearly dependent.

  21. Give a list of all properties of rank that you've learned so far.

  22. Prove the inequality \text{rank}(AB)\le \min\{\text{rank}(A),\text{rank}(B)\} and use it to give an alternative proof of the rank-nullity theorem.

  23. Prove that the definition of the space dimension does not depend on the basis used.

Aug 18

Rank of a matrix and the rank-nullity theorem

Rank of a matrix and the rank-nullity theorem

If you find proofs shorter than mine, shoot me an email.

Rank and its properties

Definition 1. The dimension of the image \text{Img}(A) is called a rank of the matrix A and denoted \text{rank}(A).

Exercise 1. For any matrix A

(1) \text{rank}(A)=\text{rank}(A^{T}).

Proof. Suppose A is of size n\times k. We know that

(2) R^k=N(A)\oplus\text{Img}(A^T).

By (2) any x\in R^k can be represented as

(3) x=y+z, with y\in N(A), z\in\text{Img}(A^T) and y\perp z.

From (3) Ax=Ay+Az=Az. By Definition 1, \text{Img}(A^T) is spanned by some linearly independent vectors z^{(1)},...,z^{(m)} with m=\dim\text{Img}(A^T). Hence z=\sum_{i=1}^ma_iz^{(i)} with some a_i depending on z and Ax=\sum_{i=1}^ma_iAz^{(i)}. This shows that \text{Img}(A) is spanned by Az^{(1)},...,Az^{(m)}. If we prove linear independence of these vectors, (1) will follow.

Suppose \sum_{i=1}^mb_iAz^{(i)}=0 with some b_{i}. Then A\left(\sum_{i=1}^mb_iz^{(i)}\right)=0. This tells us that \sum_{i=1}^mb_iz^{(i)}\in N(A). Since the vector \sum_{i=1}^mb_iz^{(i)} also belongs to \text{Img}(A^T), by Exercise 3 it is zero. By linear independence, \sum_{i=1}^mb_iz^{(i)}=0 is possible only when b_1=...=b_m=0. This proves the linear independence of Az^{(1)},...,Az^{(m)} and the statement.

Exercise 2 (rank-nullity theorem) If A is n\times k, then \dim N(A)+\dim \text{Img}(A)=k.

Proof. Exercise 4 and equation (2) imply k=\dim N(A)+\dim\text{Img}(A^T). But we know from Exercise 1 that \dim\text{Img}(A^T)=\dim\text{Img}(A).

Exercise 3. Let A be of size n\times k. Then \text{rank}(A)\leq\min\{n,k\}.

Proof. From Exercise 2 we see that \text{rank}(A)=\dim\text{Img}(A)\leq k. Applying this inequality to A^T we get \text{rank}(A)=\text{rank}(A^T)=\dim\text{Img}(A^T)\leq n.

Aug 18

Basis and dimension

Basis and dimension

Definition 1. We say that vectors x^{(1)},...,x^{(m)} form a basis in a subspace L if 1) it is spanned by x^{(1)},...,x^{(m)} and 2) these vectors are linearly independent. The number of vectors in the basis is called a dimension of L and the notation is \dim L.

An orthogonal basis is a special type of a basis, when in addition to the above conditions 1)-2) the basis vectors are orthonormal. For the dimension definition to be correct, the number of vectors in any basis should be the same. We prove correctness in a separate post.

Exercise 1. In R^n the unit vectors are linearly independent. Prove this fact 1) directly and 2) using the properties of an orthonormal system.

Direct proof. Any x\in R^n can be represented as

(1) x=x_1e_1+...+x_ne_n.

If the right side is zero, then x=0 and all x_i are zero.

Proof using orthonormality.  If the right side in (1) is zero, then x_i=x\cdot e_i=0 for all i.

Exercise 2. \dim R^{n}=n.

Proof. (1) shows that R^n is spanned by e_1,...,e_n. Besides, they are linearly independent by Exercise 1.

Definition 2. Let L_1,\ L_2 be two subspaces such that any element of one is orthogonal to any element of the other. Then the set \{x_1+x_2:x_1\in L_1,\ x_2\in L_2\} is called an orthogonal sum of L_1,\ L_2 and denoted L=L_1\oplus L_2.

Exercise 3. If a vector x belongs to both terms in the orthogonal sum of two subspaces L=L_1\oplus L_2, then it is zero. This means that L_1\cap L_2=\{0\}.

Proof. This is because any element of L_1 is orthogonal to any element of L_2, so x is orthogonal to itself, 0=x\cdot x=\|x\|^2 and x=0.

Exercise 4 (dimension additivity) Let L=L_1\oplus L_2 be an orthogonal sum of two subspaces. Then \dim L=\dim L_1+\dim L_2.

Proof. Let l_i=\dim L_i, i=1,2. By definition, L_1 is spanned by some linearly independent vectors y^{(1)},...,y^{(l_{1})} and L_2 is spanned by some linearly independent vectors z^{(1)},...,z^{(l_{2})}. Any x\in L can be decomposed as x=y+z, y\in L_1, z\in L_2. Since y,z can be further decomposed as y=\sum_{i=1}^{l_1}a_iy^{(i)}, z=\sum_{i=1}^{l_2}b_iz^{(i)}, the system y^{(1)},...,y^{(l_1)}, z^{(1)},...,z^{(l_2)} spans L.

Moreover, this system is linearly independent. If



L_1\ni\sum_{i=1}^{l_1}a_iy^{(i)}=-\sum_{i=1}^{l_2}b_iz^{(i)}\in L_2.

By Exercise 3 then \sum_{i=1}^{l_1}a_iy^{(i)}=0, \sum_{i=1}^{l_2}b_iz^{(i)}=0. By linear independence of the vectors in the two systems all coefficients a_i,b_i must be zero.

The conclusion is that \dim L=l_1+l_2.

Jul 18

Linear dependence of vectors: definition and principal result

Linear dependence of vectors: definition and principal result

This is a topic most students have trouble with. It's because there is a lot of logic involved, so skimming it is not a good idea. We start with the most common definition.

Definition 1. Let x^{(1)},...,x^{(k)}\in R^n be some vectors. They are called linearly dependent if there exist numbers a_1,...,a_k, not all of which are zero, such that

(1) a_1x^{(1)}+...+a_kx^{(k)}=0.

The sheer length of this definition scares some people and all they remember is equation (1). Stating just (1) instead of the whole definition is like skinning an animal.

Seizing the bull by the horns

The first matter of business is to shorten the definition and relate it to what we already know.

It's better to join the set of numbers a_1,...,a_k into a vector a=(a_1,...,a_k). Then the requirement that "not all of a_1,...,a_k are zero" is equivalent to a single equation a\neq 0. Further, let us write the vectors x^{(1)},...,x^{(k)} as columns and put them into a matrix X=\left( x^{(1)},...,x^{(k)}\right) . Using multiplication of partitioned matrices we see that (1) is equivalent to

(2) Xa=0

and therefore Definition 1 is equivalent to the following.

Definition 2. Vectors x^{(1)},...,x^{(k)}\in R^n are called linearly dependent if the homogeneous equation (2) has nonzero solutions a\in R^k, that is, the null space of X is not \{0\}.

Negating Definition 2 gives the definition of linear independence. Negating Definition 2 is easier than negating Definition 1.

Definition 3. Vectors x^{(1)},...,x^{(k)}\in R^{n} are called linearly independent if N(X)=\{0\} (for any nonzero a\in R^k we have Xa\neq 0 or, alternatively, Xa=0 only for a=0).

Exercise 1. For any matrix X, one has

(3) N(X)=N(X^TX).

Proof. 1) Proving that N(X)\subset N(X^TX). If a\in N(X), then Xa=0 and X^TXa=0, so a\in N(X^TX). 2) Proving that N(X^TX)\subset N(X). If a\in N(X^TX), then X^TXa=0 and 0=(X^TXa)\cdot a=(Xa)\cdot(Xa)=\|Xa\|^2, so a\in N(X).

Exercise 2. X^TX is a square symmetric matrix, for any X.

Proof. If X is of size n\times k, then X^TX is k\times k. It is symmetric: (X^TX)^T=X^T(X^T)^T=X^TX. By the way, some students write (X^TX)^{-1}=X^{-1}(X^T)^{-1}. You cannot do this if X is not square.

Criterion of linear independence. Vectors x^{(1)},...,x^{(k)}\in R^n are linearly independent if and only if \det X^TX\neq 0.

Proof. We are going to use (3). By Exercise 2, A=X^TX is a square matrix and for a square matrix we have the equivalence N(A)=\{0\}\Longleftrightarrow \det A\neq 0. Application of this result proves the statement.

Direct application of Definition 1 can be problematic. To prove that some vectors are linearly dependent, you have to produce a_1,...,a_k such that Definition 1 is satisfied. This usually involves some guesswork, see exercises below. The criterion above doesn't require guessing and can be realized on the computer. The linear independence requirement is common in multiple regression analysis but not all econometricians know this criterion.

Putting some flesh on the bones

These are simple facts you need to know in addition to the above criterion.

Exercise 3. 1) Why do we exclude the case when all a_1,...,a_k are zero?

2) What happens if among x^{(1)},...,x^{(k)} there are zero vectors?

3) Show that in case of two non-zero vectors Definition 1 is equivalent to just proportionality of one vector to another.

Solution. 1) If all a_1,...,a_k are zero, (1) is trivially satisfied, no matter what the vectors are.

2) If one of the vectors x^{(1)},...,x^{(k)} is zero, the coefficient of that vector can be set to one and all others to zero, so such vectors will be linearly dependent by Definition 1.

3) Consider two non-zero vectors x^{(1)},x^{(2)}. If they are linearly dependent, then

(4) a_1x^{(1)}+a_2x^{(2)}=0

where at least one of a_1,a_2 is not zero. Suppose a_1\neq 0. Then a_2\neq 0 because otherwise x^{(1)} would be zero. Hence

(5) x^{(1)}=-a_2/a_1x^{(2)}=cx^{(2)}

where the proportionality coefficient c is not zero. Conversely, (5) implies (4) (you have to produce the coefficients).

Exercise 4. Prove that if to the system of unit vectors we add any vector x\in R^n, the resulting system will be linearly dependent.

Jul 18

Solvability of an equation with a square matrix

Solvability of an equation with a square matrix

Everywhere in this section we assume that A is square of size k\times k. The main problem is about solvability of the equation Ax=y.

Dissecting the problem

Exercise 1. If A is invertible, then N(A)=\{0\} and \text{Img}(A)=R^k.

Proof. If Ax=0, then x=A^{-1}Ax=0, so N(A)=\{0\} (you could also say that A is one-to-one). From AA^{-1}=I we have AA^{-1}x=x. This tells us that x=Ay with y=A^{-1}x. We see that any x belongs to the image of A, so \text{Img}(A)=R^k.

Exercise 2. If N(A)=\{0\}, then A is invertible.

Proof. Since A is one-to-one, we can use the inverse f^{-1} of f(x)=Ax in the general sense: f^{-1}(Ax)=x. It is defined on the image of A. We know that f^{-1} is given by some matrix B: BAx=x for all x. Hence, BAe_i=e_i for i=1,...,k. Putting these equations side by side we get

(1) BA=BAI=I.

(In detail: the identity matrix is partitioned as I=(e_1,...,e_k), so BAI =(BAe_1,...,BAe_k)=(e_1,...,e_k)=I). Thus, B is the left inverse of A, and we've seen before that for square matrices this implies existence of the right inverse, invertibility of A and the equation B=A^{-1}.

Exercise 3. If \text{Img}(A)=R^k, then A is invertible.

Proof. If \text{Img}(A)=R^k, then by the second characterization of matrix image N(A^T)=(\text{Img}(A))^{\perp}=\{0\}. Applying Exercise 2 to A^T, we see that the transpose is invertible and hence \det A=\det A^{T}\neq 0. This implies invertibility of A.

Collecting the pieces together

Exercise 4. The following conditions are equivalent:

a) N(A)=\{0\}, b) \text{Img}(A)=R^k, c) \det A\neq 0, d) A is invertible.

Proof. The equivalence c) \Longleftrightarrow d) has been established before. For the implication d) \Longrightarrow a)+b) see Exercise 1. By Exercise 2 we have a) \Longrightarrow d). By Exercise 3, b) \Longrightarrow d.

Conclusion. The condition \det A\neq 0 is the easiest to check. If it holds, then the equation Ax=y has a unique solution for any y\in R^k.

Remark 1. By negating Exercise 4, we see that the following conditions are equivalent:

a^\prime) N(A)\neq \{0\}, b^\prime) \text{Img}(A)\neq R^k, c^\prime) \det A=0, d^\prime) A^{-1} does not exists.

Remark 2. By going through the proof one can check that the result of Exercise 4 holds if R^k is replaced by C^k (the set of vectors with complex coordinates).

Jul 18

Is the inverse of a linear mapping linear?

Is the inverse of a linear mapping linear?

Orthonormal basis

Exercise 1. I) Let e_j denote unit vectors. They possess properties

(1) e_i\cdot e_j=0 for all i\neq j, \left\Vert e_{i}\right\Vert =1 for all i.

II) For any x\in R^n we have the representation

(2) x=\sum_jx_je_j.

III) In (2) the coefficients can be found as x_{j}=x\cdot e_{j}:

(3) x=\sum_j(x\cdot e_j)e_j.

Proof. I) (1) is proved by direct calculation. II) To prove (2) we write


III) If we have (2), then it's easy to see that by (1) x\cdot e_i=\sum_jx_j(e_j\cdot e_i)=x_i.

Definition 1. Any system of vectors that satisfies (1) is called an orthonormal system. An orthonormal system is called complete if for any x\in R^n we have the decomposition (3). Exercise 1 shows that our system of unit vectors is complete orthonormal. A complete orthonormal system is called an orthonormal basis.

Analyzing a linear mapping

Exercise 2. Let A be a matrix of size n\times k. Suppose you don't know the elements of A but you know the products (Ax)\cdot y for all x,y. How can you reveal the elements of A from (Ax)\cdot y? How do you express Ax using the elements you define?

Solution. Let us partition A into rows and suppose the elements a_{ij} are known. Let us try unit vectors as x,y:

(4) (Ae_j)\cdot e_i=\left(\begin{array}{c}A_1e_j \\... \\A_ne_j\end{array}\right)\cdot e_i=A_ie_j=a_{ij}.

Using (2) and (4) one can check that (Ax)\cdot e_i=A_ix=\sum_ja_{ij}x_j. Hence, from (3) we have the answer to the second question:

(5) Ax=\sum_i[\left(Ax\right)\cdot e_i]e_i=\sum_i\left(\sum_ja_{ij}x_j\right)e_i=\sum_{i,j}x_ja_{ij}e_i.

The above calculation means that when a_{ij} are unknown, we can define them by a_{ij}=(Ae_j)\cdot e_i and then the action of A on x will be described by the last expression in (5).

We know that a mapping generated by a matrix is linear. The converse is also true: a linear mapping is given by a matrix:

Exercise 3. Suppose a mapping f:R^k\rightarrow R^n is linear: f(ax+by)=af(x)+bf(y) for any numbers a,b and vectors x,y. Then there exists a matrix A of size n\times k such that f(x)=Ax for all x.

Proof. Based on (4) in our case put a_{ij}=f(e_j)\cdot e_i. Applying (3) to f(x) we get

(6) f(x)=\sum_i[f(x)\cdot e_i]e_i

(the summation index j is replaced by i on purpose). Plugging (2) in (6)

f(x)=\sum_i\left[f\left(\sum_jx_je_j\right)\cdot e_i\right]e_i= (both f and scalar product are linear)

=\sum_i\left[\left(\sum_jx_jf\left(e_j\right)\right)\cdot e_i\right]e_i=\sum_{i,j}x_j(f\left(e_j\right)\cdot e_i)e_i=\sum_{i,j}x_ja_{ij}e_i=Ax.

The last equation is the definition of A.

Exercise 4. An inverse of a linear mapping is linear (and given by a matrix by Exercise 3).

Proof. Let f(x)=Ax be a linear mapping and suppose its inverse f^{-1} in the general sense exists. Then f^{-1}(Ax)=x for all x. Let us put x=ay+bz for arbitrary numbers a,b and vectors y,z. Then we have f^{-1}(A(ay+bz))=ay+bz or, using linearity of A,


Putting Ay=u, Az=v we get


Thus, f^{-1} is linear.

Remark. In all of the above it is important that e_j are unit vectors. For a different basis, the results drastically change.

Jul 18

Geometry of linear equations: questions for repetition

Geometry of linear equations: questions for repetition

This section evolves around the concepts of linearity, linear subspaces and orthogonality. As usual, you are expected to produce at least those proofs that I give.

  1. Prove linearity of the mapping generated by a matrix. This fundamental fact will have many implications. Do you think any linear mapping is generated by some matrix? Do you think a mapping inverse to a linear mapping should be linear?

  2. What is the first characterization of a matrix image? In other sources it is called a column space. Later on it will be linked to the notions of linear independence and rank.

  3. How would you write Ax (A is a matrix and x is a vector), if A is partitioned a) into rows and b) into columns?

  4. Show that a span of some vectors is a linear subspace. Note that one- and two-dimensional examples of linear subspaces we considered are special cases of subspaces described by linear systems of equations.

  5. Can spans of two different systems of vectors coincide? Give an example.

  6. Show that the image and null space of a matrix are linear subspaces. This facts indicate that the subspace notion is an adequate tool for problems at hand.

  7. What is the description of the set of solutions of Ax=y? It is pretty general. For example, the structure of solutions of an ordinary linear differential equation is the same.

  8. Illustrate geometrically the theorem on second orthocomplement and explain why it holds.

  9. What is the relationship between (Ax)\cdot y and x\cdot (A^Ty)?

  10. Derive the second characterization of matrix image.

  11. What are the geometric conditions for the solvability and uniqueness of solutions of an inhomogeneous equation?