7
Dec 19

## Review of mt3042 Optimization Guide by M. Baltovic

by Kairat Mynbaev, professor, ISE, KBTU, Almaty, Kazakhstan

This is a one-semester course in Optimization. It covers the following topics:

• Introduction and review of relevant parts from real analysis, with emphasis on higher dimensions.
• Weierstrass' Theorem on continuous functions on compact sets.
• Review with added rigour of unconstrained optimization of differentiable functions.
• Lagrange's Theorem on equality-constrained optimization.
• The Kuhn-Tucker Theorem on inequality-constrained optimization.
• Finite and infinite horizon dynamic programming.

The course is based mainly on the book by Sundaram, L.R. A First Course in Optimization Theory. (Cambridge University Press, 1996).

The course is given to fourth year students. I evaluate the guide on two points: how well it expands the students’ theoretical horizon and how well it prepares them to the challenges they may face while applying their knowledge in their careers, whether in industry or in academia.

1. The exposition is overly formal. Experience shows that most students don’t understand the definition of continuity in terms of epsilon-delta. Right after giving it on p.18, the author gives an alternative definition in terms of sequences. I think it’s better to omit the former definition altogether and rephrase everything in terms of sequences. After all, the compactness notion relies on sequences.
2. The differentiability definition on p.19 is simply indigestible. Why not define the matrix as a matrix of partial derivatives, which students know from previous courses?
3. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity. Therefore, a maximum should be looked for in a bounded closed set. Since this is a compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don’t need anything else. Remarks similar to this apply to many solutions in the guide. As a result, the students don’t see the forest behind the trees.
4. Regarding the theoretical conditions, the author refers to Sundaram without explaining why they are imposed. Explanations in Sundaram are far from intuitive (see the first of them on p.107). In all cases for n=2 it is possible to give relatively simple intuitive explanations. Specifically,
1. For first-order conditions see
http://raisingthebar.nl/2017/10/11/unconstrained-optimization-plane-1/.
2. For second-order conditions see
http://raisingthebar.nl/2017/10/14/unconstrained-optimization-on-the-plane-2/.
3. For the constraint qualification condition, see
http://raisingthebar.nl/2017/10/21/importance-implicit-function-theorem-optimization/.
The explanation on pp.58-60 is hard to understand because of dimensionality.
4. For the Lagrange method see
http://raisingthebar.nl/2017/10/23/lagrange-method-necessary-condition/
(necessary conditions),
http://raisingthebar.nl/2017/10/23/lagrange-method-sufficient-conditions/
(sufficient conditions),
http://raisingthebar.nl/2017/10/27/lagrange-method-case-many-equality-constraints/
(case of many constraints) and
http://raisingthebar.nl/2017/10/28/lagrangian-multiplier-interpretation/
(multiplier interpretation).
5. In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis
http://raisingthebar.nl/2017/10/29/the-kuhn-tucker-theorem-ins-and-outs/.
The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page:
http://raisingthebar.nl/2017/10/28/kuhn-tucker-theorem-first-look/.
5. In solutions that rely on the Kuhn-Tucker theorem, the author suggests to check the constraint qualification condition for all possible combinations of constraints. Not only is this time consuming, but this is also misleading, given the fact that often it is possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem.
6. Minor remark: In Example 5.2, the function $f(x,1-x)$ is obviously nonnegative.
7. There are many optimization methods not covered in Sundaram’s book. One of them, Pontryagin’s maximum principle, is more general than the Bellman approach, see
https://en.wikipedia.org/wiki/Pontryagin%27s_maximum_principle.
It may be too advanced for this course. However, convexity is covered by Sundaram and omitted by Baltovic, while it can be successfully applied to solve some problems from the guide, see
http://raisingthebar.nl/2017/11/08/using-convexity-optimization/.
8. Another important area that is covered neither in the book nor in the guide is numerical optimization. Numerical optimization is as important as the Maximum Likelihood method in Statistics because ML realization in any software employs numerical optimization. People need to learn at least one method in numerical optimization to understand error messages produced on the computer.
9. Speaking of applications, I would eliminate all big exercises that have nothing to do with Economics or Finance, such as Exercise 6.2.
10. In the solution to Exercise 6.1, the author produces a 3-page analysis but then in one line discovers that all that was for nothing because the profit is infinite. What’s the pedagogical value of such an exposition?

Conclusion. This is a badly written guide for a badly designed course. I suggest removing it from our program.

19
Aug 19

## Elementary transformations

Here we look at matrix representations of transformations called elementary.

Exercise 1. Let $e_{i}$ denote the $i$-th unit row vector and let $A$ be an arbitrary matrix. Then

a) premultiplication of $A$ by $e_{j}$ cuts out of $A$ the $j$-th row $A_{j}.$

b) Premultiplication of $A$ by $e_{i}^{T}e_{j}$ puts the row $A_{j}$ into the $i$-th row of the null matrix.

c) Premultiplication of $A$ by $I+ce_{i}^{T}e_{j}$ adds row $A_{j}$ multiplied by $c$ to row $A_{i},$ without changing the other rows of $A.$

d) The matrix $I+ce_{i}^{T}e_{j}$ has determinant 1.

Proof. a) It's easy to see that

$e_{j}A=\left( 0...0~1~0...0\right) \left(\begin{array}{ccc}a_{11} & ... & a_{1n} \\ ... & ... & ... \\a_{j1} & ... & a_{jn} \\... & ... & ... \\a_{n1} & ... & a_{nn}\end{array} \right) =\left(\begin{array}{ccc}a_{j1} & ... & a_{jn}\end{array}\right) =A_{j}.$

b) Obviously,

$e_{i}^{T}e_{j}A=\left(\begin{array}{c}0 \\... \\0 \\1 \\0 \\... \\0\end{array} \right) \left(\begin{array}{ccc}a_{j1} & ... & a_{jn}\end{array} \right) =\left(\begin{array}{ccc}0 & ... & 0 \\... & ... & ... \\a_{j1} & ... & a_{jn} \\... & ... & ... \\ 0 & ... & 0\end{array} \right) =\left(\begin{array}{c}\Theta \\A_{j} \\\Theta\end{array} \right)$

($A_{j}$ in the $i$-th row, $\Theta$ denotes null matrices of conformable dimensions)

c) $(I+ce_{i}^{T}e_{j})A=A+ce_{i}^{T}e_{j}A=\left(\begin{array}{c} A_{1} \\... \\A_{i} \\... \\A_{n}\end{array} \right) +\left(\begin{array}{c}\Theta \\... \\cA_{j} \\... \\\Theta\end{array} \right) =\left(\begin{array}{c}A_{1} \\... \\A_{i}+cA_{j} \\ ... \\A_{n}\end{array}\right) .$

d) The matrix $A=I+ce_{i}^{T}e_{j}$ has ones on the main diagonal and only one nonzero element $a_{ij}=c$ outside it. By the Leibniz formula it's determinant is 1.

The reader can easily solve the next

Exercise 2. a) Postmultiplication of $A$ by $e_{j}^{T}$ cuts out of $A$ the $j$-th column $A^{(j)}.$

b) Postmultiplication of $A$ by $e_{j}^{T}e_{i}$ puts the column $A^{(j)}$ into the $i$-th column of the null matrix.

c) Postmultiplication of $A$ by $I+ce_{j}^{T}e_{i}$ adds column $A^{(j)}$ multiplied by $c$ to column $A^{(i)},$ without changing the other columns of $A.$

d) The matrix $I+ce_{j}^{T}e_{i}=(I+ce_{i}^{T}e_{j})^{T}$ has determinant 1.

Exercise 3. a) Premultiplication of $A$ by

(1) $\left(\begin{array}{c}e_{1} \\... \\e_{j} \\... \\e_{i} \\... \\e_{n}\end{array} \right)$

permutes rows $A_{i},A_{j}.$

b) Postmultiplication of $A$ by the transpose of (1) permutes columns $A^{(i)},A^{(j)}.$

This is a general property of permutation matrices. Recall also that their determinants can be only $\pm 1.$

Definition. 1) Adding some row multiplied by a constant to another row or 2) adding some column multiplied by a constant to another column or 3) permuting rows or columns is called an elementary operation. Accordingly, matrices that realize them are called elementary matrices.

15
Aug 18

## Law and order in the set of matrices

The law is to feel by touch every little fact. The order is discussed below.

### Why do complex numbers come right before this topic?

The analog of a conjugate number is the transpose $A^T.$ How can you tell this? Using Exercise 2 we see that $(cx)\cdot {y}=x\cdot (\bar{c}y).$ This looks similar to the identity $(Ax)\cdot y=x\cdot(A^Ty)$ (see Exercise 4). Therefore the mapping $c\rightarrow \bar{c}$ is similar to $A\rightarrow A^{T}.$

Once you know this, it's easy to come up with a couple of ideas.

Idea 1. From the characterization of real numbers (see Equation (4)) we see that matrices that satisfy $A=A^T$ (symmetric matrices) correspond to real numbers and should be better in some way than asymmetric ones.

Idea 2. From Equation (3) we see that the matrix $A^TA$ should be symmetric and non-negative.

### What is a non-negative matrix?

The set of real numbers is ordered in the sense that for any two real numbers $a,b$ we can say that either $a\geq b$ or $a is true. The most important property that we used in my class is this: if $a\geq b$ and $c>0,$ then $ac\geq bc$ (the sign of an inequality is preserved if the inequality is multiplied by a positive number). Since any two numbers can be compared like that, it is a complete order.

One way in which symmetric matrices are better than more general ones is that for symmetric matrices one can define order. The limitation caused by dimensionality is that this order is not complete (some symmetric matrices are not comparable).

Exercise 1. For the matrix $A=\left(\begin{array}{cc}a_{11}&a_{12}\\a_{12}&a_{22}\end{array}\right)$ and vector $x\in R^2$ find the expression $Q_A(x)=x^TAx.$ What is the value of this expression at $x=0?$

Solution$Q_A(x)=x_1^2+2a_{12}x_1x_2+a_{22}x_2^2$ and $Q_A(0)=0.$

Definition 1. The function $Q_A(x)=x^TAx$ is called a quadratic form of the matrix $A.$ Here $A$ is symmetric of size $n\times n$ and $x\in R^n.$

Discussion. 1) The facts that $A$ is in the subscript and the argument is $x$ mean that $A$ is fixed and $x$ is changing.

2) While the argument is a vector, the value of this function is a real number: $Q_A$ acts from $R^n$ to $R.$

3) $Q_A(x)$ does not contain constant or linear terms (of type $c$ and $ax_i).$ It contains only quadratic terms (write $x_1x_2=x_1^1x_2^1$ to see that the total power is 2), that's why it is called a quadratic form and not a quadratic function.

Definition 2. We say that $A$ is positive if $Q_A(x)>0$ for all nonzero $x\in R^n$ and non-negative if $Q_A(x)\geq 0$ for all $x\in R^n.$ (Most sources say positive definite instead of just positive and non-negative definite instead of just non-negative. I prefer a shorter terminology. If you don't understand why in the definition of positivity we require nonzero $x,$ go back to Exercise 1). As with numbers, for two symmetric matrices $A,B$ of the same size, we write $A>B$ or $A\geq B$ if $A-B$ is positive or non-negative, respectively. Continuing this idea, we can say that $A$ is negative if $-A$ is positive.

More on motivation. A legitimate definition of order $A\geq B$ would obtain if we compared the two matrices element-wise. Definition 2 is motivated by the fact that quadratic forms arise in the multivariate Taylor decomposition.

Sylvester's criterion is the only practical tool for determining positivity or non-negativity. However, in one case this is simple.

Exercise 2. Show that $A^TA$ is symmetric and non-negative.

Solution. The symmetry is straightforward and has been shown before. Non-negativity is not difficult either: $Q_{A^TA}(x)=x^TA^TAx=(Ax)^TAx=\|Ax\|^2\ge 0.$

### Geometry

The graph of a quadratic form in good cases is an elliptic paraboloid and has various other names in worse cases. Geometrically, the definition of the inequality $A\geq B$ means that the graph of $A$ is everywhere above the graph of $B$ (at the origin they always coincide). In particular, $A\geq 0$ means that the graph of $A$ is everywhere above the horizontal plane.

Examples. All examples are matrices of size $2\times 2$.

Figure 1. Quadratic form of identity (elliptic paraboloid)

1) The identity matrix $I$ is positive because $Q_I(x)=x_1^2+x_2^2$, see Figure 1.

Figure 2. Parabolic cylinder

2) The matrix $A=\left(\begin{array}{cc}1&0\\0&0\end{array}\right)$ is non-negative. Its quadratic form $Q_A(x)=x_1^2$ grows when $|x_1|$ grows and stays flat when $x_2$ changes and $x_1$ is fixed, see Figure 2.

Figure 3. Hyperbolic paraboloid

3) The matrix $B=\left(\begin{array}{cc}1&0\\0&-1\end{array}\right)$ is not positive or non-negative or negative or non-positive. Its quadratic form $Q_B(x)$$=x_1^2-x_2^2$ is a parabola with branches looking upward when the second argument is fixed and a parabola with branches looking downward when the first argument is fixed, see Figure 3. When a surface behaves like that around some point, that point is called a saddle point.

7
Aug 18

## Summary and questions for repetition

I was planning to cover the Moore-Penrose inverse which allows one to solve the equation $Ax=y$ for any $A$ (not necessarily square). Now my feeling is that it would be too much for a standard linear algebra course. This is the most easily accessible sourse.

1. Give the definition and example of an orthonormal system. Prove that elements of such a system are linearly independent.

2. "To know how a matrix acts on vectors, it is enough to know how it acts on the elements of an orthonormal basis." Explain.

3. How can you reveal the elements of a matrix $A$ from $(Ax)\cdot y$?

4. A linear mapping from one Euclidean space to another generates a matrix. Prove.

5. Prove that an inverse of a linear mapping is linear.

6. What is a linear mapping in the one-dimensional case?

7. In case of a square matrix, what are the four equivalent conditions for the equation $Ax=y$ to be good (uniquely solvable for all $y$)?

8. Give two equivalent definitions of linear independence.

9. List the simple facts about linear dependence that students need to learn first.

10. Prove the criterion of linear independence.

11. Let the vectors $x^{(1)},...,x^{(k)}$ be linearly dependent and consider the regression model $y=\beta_1x^{(1)}+...+\beta_kx^{(k)}+u.$ Show that here the coefficients $\beta_1,...,\beta_k$ cannot be uniquely determined (this is a purely algebraic fact, you don't need to know anything about multiple regression).

12. Define a basis. Prove that if $x^{(1)},...,x^{(k)}$ is a basis and $x$ is decomposed as $x=a_1x^{(1)}+...+a_kx^{(k)},$ then the coefficients $a_1,...,a_k$ are unique. Prove further that they are linear functions of $x.$

13. Prove that the terms in the orthogonal sum of two subspaces have intersection zero.

15. Prove that a matrix and its transpose have the same rank.

16. Prove the rank-nullity theorem.

17. Prove the upper bound on the matrix rank in terms of the matrix dimensions.

18. Vectors are linearly independent if and only if one of them can be expressed as a linear combination of the others.

19. What can be said about linear (in)dependence if some vectors are added to or removed from the system of vectors?

20. Prove that if the number of vectors in a system is larger than the space dimension, then such a system is linearly dependent.

21. Give a list of all properties of rank that you've learned so far.

22. Prove the inequality $\text{rank}(AB)\le \min\{\text{rank}(A),\text{rank}(B)\}$ and use it to give an alternative proof of the rank-nullity theorem.

23. Prove that the definition of the space dimension does not depend on the basis used.

2
Aug 18

## Rank of a matrix and the rank-nullity theorem

If you find proofs shorter than mine, shoot me an email.

### Rank and its properties

Definition 1. The dimension of the image $\text{Img}(A)$ is called a rank of the matrix $A$ and denoted $\text{rank}(A)$.

Exercise 1. For any matrix $A$

(1) $\text{rank}(A)=\text{rank}(A^{T}).$

Proof. Suppose $A$ is of size $n\times k$. We know that

(2) $R^k=N(A)\oplus\text{Img}(A^T).$

By (2) any $x\in R^k$ can be represented as

(3) $x=y+z,$ with $y\in N(A),$ $z\in\text{Img}(A^T)$ and $y\perp z.$

From (3) $Ax=Ay+Az=Az.$ By Definition 1, $\text{Img}(A^T)$ is spanned by some linearly independent vectors $z^{(1)},...,z^{(m)}$ with $m=\dim\text{Img}(A^T).$ Hence $z=\sum_{i=1}^ma_iz^{(i)}$ with some $a_i$ depending on $z$ and $Ax=\sum_{i=1}^ma_iAz^{(i)}.$ This shows that $\text{Img}(A)$ is spanned by $Az^{(1)},...,Az^{(m)}.$ If we prove linear independence of these vectors, (1) will follow.

Suppose $\sum_{i=1}^mb_iAz^{(i)}=0$ with some $b_{i}.$ Then $A\left(\sum_{i=1}^mb_iz^{(i)}\right)=0.$ This tells us that $\sum_{i=1}^mb_iz^{(i)}\in N(A).$ Since the vector $\sum_{i=1}^mb_iz^{(i)}$ also belongs to $\text{Img}(A^T),$ by Exercise 3 it is zero. By linear independence, $\sum_{i=1}^mb_iz^{(i)}=0$ is possible only when $b_1=...=b_m=0.$ This proves the linear independence of $Az^{(1)},...,Az^{(m)}$ and the statement.

Exercise 2 (rank-nullity theorem) If $A$ is $n\times k,$ then $\dim N(A)+\dim \text{Img}(A)=k.$

Proof. Exercise 4 and equation (2) imply $k=\dim N(A)+\dim\text{Img}(A^T).$ But we know from Exercise 1 that $\dim\text{Img}(A^T)=\dim\text{Img}(A).$

Exercise 3. Let $A$ be of size $n\times k.$ Then $\text{rank}(A)\leq\min\{n,k\}.$

Proof. From Exercise 2 we see that $\text{rank}(A)=\dim\text{Img}(A)\leq k.$ Applying this inequality to $A^T$ we get $\text{rank}(A)=\text{rank}(A^T)=\dim\text{Img}(A^T)\leq n.$

1
Aug 18

## Basis and dimension

### Basis and dimension

Definition 1. We say that vectors $x^{(1)},...,x^{(m)}$ form a basis in a subspace $L$ if 1) it is spanned by $x^{(1)},...,x^{(m)}$ and 2) these vectors are linearly independent. The number of vectors in the basis is called a dimension of $L$ and the notation is $\dim L.$

An orthogonal basis is a special type of a basis, when in addition to the above conditions 1)-2) the basis vectors are orthonormal. For the dimension definition to be correct, the number of vectors in any basis should be the same. We prove correctness in a separate post.

Exercise 1. In $R^n$ the unit vectors are linearly independent. Prove this fact 1) directly and 2) using the properties of an orthonormal system.

Direct proof. Any $x\in R^n$ can be represented as

(1) $x=x_1e_1+...+x_ne_n.$

If the right side is zero, then $x=0$ and all $x_i$ are zero.

Proof using orthonormality.  If the right side in (1) is zero, then $x_i=x\cdot e_i=0$ for all $i.$

Exercise 2. $\dim R^{n}=n.$

Proof. (1) shows that $R^n$ is spanned by $e_1,...,e_n.$ Besides, they are linearly independent by Exercise 1.

Definition 2. Let $L_1,\ L_2$ be two subspaces such that any element of one is orthogonal to any element of the other. Then the set $\{x_1+x_2:x_1\in L_1,\ x_2\in L_2\}$ is called an orthogonal sum of $L_1,\ L_2$ and denoted $L=L_1\oplus L_2.$

Exercise 3. If a vector $x$ belongs to both terms in the orthogonal sum of two subspaces $L=L_1\oplus L_2$, then it is zero. This means that $L_1\cap L_2=\{0\}.$

Proof. This is because any element of $L_1$ is orthogonal to any element of $L_2,$ so $x$ is orthogonal to itself, $0=x\cdot x=\|x\|^2$ and $x=0.$

Exercise 4 (dimension additivity) Let $L=L_1\oplus L_2$ be an orthogonal sum of two subspaces. Then $\dim L=\dim L_1+\dim L_2.$

Proof. Let $l_i=\dim L_i,$ $i=1,2.$ By definition, $L_1$ is spanned by some linearly independent vectors $y^{(1)},...,y^{(l_{1})}$ and $L_2$ is spanned by some linearly independent vectors $z^{(1)},...,z^{(l_{2})}.$ Any $x\in L$ can be decomposed as $x=y+z,$ $y\in L_1,$ $z\in L_2.$ Since $y,z$ can be further decomposed as $y=\sum_{i=1}^{l_1}a_iy^{(i)},$ $z=\sum_{i=1}^{l_2}b_iz^{(i)},$ the system $y^{(1)},...,y^{(l_1)}, z^{(1)},...,z^{(l_2)}$ spans $L.$

Moreover, this system is linearly independent. If

$\sum_{i=1}^{l_1}a_iy^{(i)}+\sum_{i=1}^{l_2}b_iz^{(i)}=0,$

then

$L_1\ni\sum_{i=1}^{l_1}a_iy^{(i)}=-\sum_{i=1}^{l_2}b_iz^{(i)}\in L_2.$

By Exercise 3 then $\sum_{i=1}^{l_1}a_iy^{(i)}=0,$ $\sum_{i=1}^{l_2}b_iz^{(i)}=0.$ By linear independence of the vectors in the two systems all coefficients $a_i,b_i$ must be zero.

The conclusion is that $\dim L=l_1+l_2.$

30
Jul 18

## Linear dependence of vectors: definition and principal result

This is a topic most students have trouble with. It's because there is a lot of logic involved, so skimming it is not a good idea. We start with the most common definition.

Definition 1. Let $x^{(1)},...,x^{(k)}\in R^n$ be some vectors. They are called linearly dependent if there exist numbers $a_1,...,a_k,$ not all of which are zero, such that

(1) $a_1x^{(1)}+...+a_kx^{(k)}=0.$

The sheer length of this definition scares some people and all they remember is equation (1). Stating just (1) instead of the whole definition is like skinning an animal.

### Seizing the bull by the horns

The first matter of business is to shorten the definition and relate it to what we already know.

It's better to join the set of numbers $a_1,...,a_k$ into a vector $a=(a_1,...,a_k).$ Then the requirement that "not all of $a_1,...,a_k$ are zero" is equivalent to a single equation $a\neq 0.$ Further, let us write the vectors $x^{(1)},...,x^{(k)}$ as columns and put them into a matrix $X=\left( x^{(1)},...,x^{(k)}\right) .$ Using multiplication of partitioned matrices we see that (1) is equivalent to

(2) $Xa=0$

and therefore Definition 1 is equivalent to the following.

Definition 2. Vectors $x^{(1)},...,x^{(k)}\in R^n$ are called linearly dependent if the homogeneous equation (2) has nonzero solutions $a\in R^k,$ that is, the null space of $X$ is not $\{0\}.$

Negating Definition 2 gives the definition of linear independence. Negating Definition 2 is easier than negating Definition 1.

Definition 3. Vectors $x^{(1)},...,x^{(k)}\in R^{n}$ are called linearly independent if $N(X)=\{0\}$ (for any nonzero $a\in R^k$ we have $Xa\neq 0$ or, alternatively, $Xa=0$ only for $a=0$).

Exercise 1. For any matrix $X,$ one has

(3) $N(X)=N(X^TX).$

Proof. 1) Proving that $N(X)\subset N(X^TX).$ If $a\in N(X),$ then $Xa=0$ and $X^TXa=0,$ so $a\in N(X^TX).$ 2) Proving that $N(X^TX)\subset N(X).$ If $a\in N(X^TX),$ then $X^TXa=0$ and $0=(X^TXa)\cdot a=(Xa)\cdot(Xa)=\|Xa\|^2,$ so $a\in N(X).$

Exercise 2. $X^TX$ is a square symmetric matrix, for any $X.$

Proof. If $X$ is of size $n\times k,$ then $X^TX$ is $k\times k.$ It is symmetric: $(X^TX)^T=X^T(X^T)^T=X^TX.$ By the way, some students write $(X^TX)^{-1}=X^{-1}(X^T)^{-1}.$ You cannot do this if $X$ is not square.

Criterion of linear independence. Vectors $x^{(1)},...,x^{(k)}\in R^n$ are linearly independent if and only if $\det X^TX\neq 0.$

Proof. We are going to use (3). By Exercise 2, $A=X^TX$ is a square matrix and for a square matrix we have the equivalence $N(A)=\{0\}\Longleftrightarrow \det A\neq 0.$ Application of this result proves the statement.

Direct application of Definition 1 can be problematic. To prove that some vectors are linearly dependent, you have to produce $a_1,...,a_k$ such that Definition 1 is satisfied. This usually involves some guesswork, see exercises below. The criterion above doesn't require guessing and can be realized on the computer. The linear independence requirement is common in multiple regression analysis but not all econometricians know this criterion.

### Putting some flesh on the bones

These are simple facts you need to know in addition to the above criterion.

Exercise 3. 1) Why do we exclude the case when all $a_1,...,a_k$ are zero?

2) What happens if among $x^{(1)},...,x^{(k)}$ there are zero vectors?

3) Show that in case of two non-zero vectors Definition 1 is equivalent to just proportionality of one vector to another.

Solution. 1) If all $a_1,...,a_k$ are zero, (1) is trivially satisfied, no matter what the vectors are.

2) If one of the vectors $x^{(1)},...,x^{(k)}$ is zero, the coefficient of that vector can be set to one and all others to zero, so such vectors will be linearly dependent by Definition 1.

3) Consider two non-zero vectors $x^{(1)},x^{(2)}.$ If they are linearly dependent, then

(4) $a_1x^{(1)}+a_2x^{(2)}=0$

where at least one of $a_1,a_2$ is not zero. Suppose $a_1\neq 0.$ Then $a_2\neq 0$ because otherwise $x^{(1)}$ would be zero. Hence

(5) $x^{(1)}=-a_2/a_1x^{(2)}=cx^{(2)}$

where the proportionality coefficient $c$ is not zero. Conversely, (5) implies (4) (you have to produce the coefficients).

Exercise 4. Prove that if to the system of unit vectors we add any vector $x\in R^n$, the resulting system will be linearly dependent.

28
Jul 18

## Solvability of an equation with a square matrix

Everywhere in this section we assume that $A$ is square of size $k\times k.$ The main problem is about solvability of the equation $Ax=y$.

### Dissecting the problem

Exercise 1. If $A$ is invertible, then $N(A)=\{0\}$ and $\text{Img}(A)=R^k.$

Proof. If $Ax=0,$ then $x=A^{-1}Ax=0,$ so $N(A)=\{0\}$ (you could also say that $A$ is one-to-one). From $AA^{-1}=I$ we have $AA^{-1}x=x.$ This tells us that $x=Ay$ with $y=A^{-1}x.$ We see that any $x$ belongs to the image of $A,$ so $\text{Img}(A)=R^k.$

Exercise 2. If $N(A)=\{0\},$ then $A$ is invertible.

Proof. Since $A$ is one-to-one, we can use the inverse $f^{-1}$ of $f(x)=Ax$ in the general sense: $f^{-1}(Ax)=x$. It is defined on the image of $A.$ We know that $f^{-1}$ is given by some matrix $B$: $BAx=x$ for all $x.$ Hence, $BAe_i=e_i$ for $i=1,...,k.$ Putting these equations side by side we get

(1) $BA=BAI=I.$

(In detail: the identity matrix is partitioned as $I=(e_1,...,e_k),$ so $BAI$ $=(BAe_1,...,BAe_k)=(e_1,...,e_k)=I$). Thus, $B$ is the left inverse of $A,$ and we've seen before that for square matrices this implies existence of the right inverse, invertibility of $A$ and the equation $B=A^{-1}$.

Exercise 3. If $\text{Img}(A)=R^k,$ then $A$ is invertible.

Proof. If $\text{Img}(A)=R^k,$ then by the second characterization of matrix image $N(A^T)=(\text{Img}(A))^{\perp}=\{0\}.$ Applying Exercise 2 to $A^T,$ we see that the transpose is invertible and hence $\det A=\det A^{T}\neq 0.$ This implies invertibility of $A.$

### Collecting the pieces together

Exercise 4. The following conditions are equivalent:

a) $N(A)=\{0\},$ b) $\text{Img}(A)=R^k,$ c) $\det A\neq 0,$ d) $A$ is invertible.

Proof. The equivalence c) $\Longleftrightarrow$ d) has been established before. For the implication d) $\Longrightarrow$ a)+b) see Exercise 1. By Exercise 2 we have a) $\Longrightarrow$ d). By Exercise 3, b) $\Longrightarrow$ d.

Conclusion. The condition $\det A\neq 0$ is the easiest to check. If it holds, then the equation $Ax=y$ has a unique solution for any $y\in R^k.$

Remark 1. By negating Exercise 4, we see that the following conditions are equivalent:

$a^\prime) N(A)\neq \{0\},$ $b^\prime) \text{Img}(A)\neq R^k,$ $c^\prime) \det A=0,$ $d^\prime) A^{-1}$ does not exists.

Remark 2. By going through the proof one can check that the result of Exercise 4 holds if $R^k$ is replaced by $C^k$ (the set of vectors with complex coordinates).

26
Jul 18

## Is the inverse of a linear mapping linear?

### Orthonormal basis

Exercise 1. I) Let $e_j$ denote unit vectors. They possess properties

(1) $e_i\cdot e_j=0$ for all $i\neq j,$ $\left\Vert e_{i}\right\Vert =1$ for all $i.$

II) For any $x\in R^n$ we have the representation

(2) $x=\sum_jx_je_j.$

III) In (2) the coefficients can be found as $x_{j}=x\cdot e_{j}:$

(3) $x=\sum_j(x\cdot e_j)e_j.$

Proof. I) (1) is proved by direct calculation. II) To prove (2) we write

$x=(x_1,...,x_n)=x_1(1,0,...,0)+...+x_n(0,...,0,1)=\sum_jx_je_j.$

III) If we have (2), then it's easy to see that by (1) $x\cdot e_i=\sum_jx_j(e_j\cdot e_i)=x_i.$

Definition 1. Any system of vectors that satisfies (1) is called an orthonormal system. An orthonormal system is called complete if for any $x\in R^n$ we have the decomposition (3). Exercise 1 shows that our system of unit vectors is complete orthonormal. A complete orthonormal system is called an orthonormal basis.

### Analyzing a linear mapping

Exercise 2. Let $A$ be a matrix of size $n\times k.$ Suppose you don't know the elements of $A$ but you know the products $(Ax)\cdot y$ for all $x,y.$ How can you reveal the elements of $A$ from $(Ax)\cdot y?$ How do you express $Ax$ using the elements you define?

Solution. Let us partition $A$ into rows and suppose the elements $a_{ij}$ are known. Let us try unit vectors as $x,y:$

(4) $(Ae_j)\cdot e_i=\left(\begin{array}{c}A_1e_j\\...\\A_ne_j\end{array}\right)\cdot e_i=A_ie_j=a_{ij}.$

Using (2) and (4) one can check that $(Ax)\cdot e_i=A_ix=\sum_ja_{ij}x_j.$ Hence, from (3) we have the answer to the second question:

(5) $Ax=\sum_i[\left(Ax\right)\cdot e_i]e_i=\sum_i\left(\sum_ja_{ij}x_j\right)e_i=\sum_{i,j}x_ja_{ij}e_i.$

The above calculation means that when $a_{ij}$ are unknown, we can define them by $a_{ij}=(Ae_j)\cdot e_i$ and then the action of $A$ on $x$ will be described by the last expression in (5).

We know that a mapping generated by a matrix is linear. The converse is also true: a linear mapping is given by a matrix:

Exercise 3. Suppose a mapping $f:R^k\rightarrow R^n$ is linear: $f(ax+by)=af(x)+bf(y)$ for any numbers $a,b$ and vectors $x,y.$ Then there exists a matrix $A$ of size $n\times k$ such that $f(x)=Ax$ for all $x.$

Proof. Based on (4) in our case put $a_{ij}=f(e_j)\cdot e_i$. Applying (3) to $f(x)$ we get

(6) $f(x)=\sum_i[f(x)\cdot e_i]e_i$

(the summation index $j$ is replaced by $i$ on purpose). Plugging (2) in (6)

$f(x)=\sum_i\left[f\left(\sum_jx_je_j\right)\cdot e_i\right]e_i=$ (both $f$ and scalar product are linear)

$=\sum_i\left[\left(\sum_jx_jf\left(e_j\right)\right)\cdot e_i\right]e_i=\sum_{i,j}x_j(f\left(e_j\right)\cdot e_i)e_i$$=\sum_{i,j}x_ja_{ij}e_i=Ax.$

The last equation is the definition of $A$.

Exercise 4. An inverse of a linear mapping is linear (and given by a matrix by Exercise 3).

Proof. Let $f(x)=Ax$ be a linear mapping and suppose its inverse $f^{-1}$ in the general sense exists. Then $f^{-1}(Ax)=x$ for all $x.$ Let us put $x=ay+bz$ for arbitrary numbers $a,b$ and vectors $y,z.$ Then we have $f^{-1}(A(ay+bz))=ay+bz$ or, using linearity of $A,$

$f^{-1}(aAy+bAz)=ay+bz.$

Putting $Ay=u,$ $Az=v$ we get

$f^{-1}(au+bv)=af^{-1}(u)+bf^{-1}(v).$

Thus, $f^{-1}$ is linear.

Remark. In all of the above it is important that $e_j$ are unit vectors. For a different basis, the results drastically change.

23
Jul 18

## Geometry of linear equations: questions for repetition

This section evolves around the concepts of linearity, linear subspaces and orthogonality. As usual, you are expected to produce at least those proofs that I give.

1. Prove linearity of the mapping generated by a matrix. This fundamental fact will have many implications. Do you think any linear mapping is generated by some matrix? Do you think a mapping inverse to a linear mapping should be linear?

2. What is the first characterization of a matrix image? In other sources it is called a column space. Later on it will be linked to the notions of linear independence and rank.

3. How would you write $Ax$ ($A$ is a matrix and $x$ is a vector), if $A$ is partitioned a) into rows and b) into columns?

4. Show that a span of some vectors is a linear subspace. Note that one- and two-dimensional examples of linear subspaces we considered are special cases of subspaces described by linear systems of equations.

5. Can spans of two different systems of vectors coincide? Give an example.

6. Show that the image and null space of a matrix are linear subspaces. This facts indicate that the subspace notion is an adequate tool for problems at hand.

7. What is the description of the set of solutions of $Ax=y$? It is pretty general. For example, the structure of solutions of an ordinary linear differential equation is the same.

8. Illustrate geometrically the theorem on second orthocomplement and explain why it holds.

9. What is the relationship between $(Ax)\cdot y$ and $x\cdot (A^Ty)$?

10. Derive the second characterization of matrix image.

11. What are the geometric conditions for the solvability and uniqueness of solutions of an inhomogeneous equation?