7 Jul 18

## Euclidean space geometry: Cauchy-Schwarz inequality

At first glance, the scalar product and the distance notion are largely independent entities. As a matter of fact, they are intimately related, and the Cauchy-Schwarz inequality provides one of the links.

### Cauchy-Schwarz inequality

Statement (Cauchy-Schwarz inequality). (I) For any vectors $x,y$ one has $|{x\cdot y}|\leq \Vert x\Vert\Vert y\Vert$.

(II) If the inequality sign turns into equality, $|{x\cdot y}|=\Vert x\Vert\Vert y\Vert$, then $y$ is proportional to $x$: $y=ax$ for some number $a$.

Proof. (I) If at least one of the vectors is zero, both sides of the inequality are 0 and there is nothing to prove. To exclude the trivial case, suppose that none of $x,y$ is zero and, therefore, $\Vert x\Vert,\Vert y\Vert$ are positive. Consider a real-valued function of a real number $t$ defined by $f(t)=\Vert tx+y\Vert^2$. Here we have a norm of a linear combination $f(t)=t^2\Vert x\Vert ^2+2tx\cdot y+\Vert y\Vert^2$.

We see that $f(t)$ is a parabola with branches pointing upward (because the leading coefficient $\Vert x\Vert^2$ is positive). By nonnegativity of the squared norm, $f(t)\geq 0$ and the parabola lies above the horizontal axis in the $(t,f)$ plane. Hence, the quadratic equation $f(t)=0$ may have at most one real root. This means that the discriminant of the equation is non-positive: $D=[x\cdot y]^2-\Vert x\Vert^2\Vert y\Vert^2\leq 0.$ Taking square roots on both sides of $[x\cdot y]^2\leq\Vert x\Vert^2\Vert y\Vert^2$ we finish the proof of the first part.

(II) In the case of equality the discriminant is 0. Therefore the parabola touches the horizontal axis at some point $t_0$ where $f(t_0)=\Vert t_0x+y\Vert^2=0$. But we know that this implies $t_0x+y=0$, that is, $y=-t_0x$, which is just another way of writing $y=ax$ with $a=-t_0$.
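The proof can be sanity-checked numerically. The sketch below (plain Python, with arbitrary illustrative vectors) verifies both parts: the inequality for a generic pair, and equality for proportional vectors.

```python
# Numerical check of the Cauchy-Schwarz inequality |x.y| <= ||x|| ||y||.
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

x = [1.0, -2.0, 3.0]
y = [4.0, 0.5, -1.0]

# Part (I): the inequality holds for a generic pair.
assert abs(dot(x, y)) <= norm(x) * norm(y)

# Part (II): for proportional vectors y = a*x the inequality becomes equality.
a = -2.5
y_prop = [a * xi for xi in x]
assert math.isclose(abs(dot(x, y_prop)), norm(x) * norm(y_prop))
```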

Do you think this proof is tricky? During the long history of development of mathematics, mathematicians have invented many tricks, small and large. No matter how smart you are, you cannot reinvent all of them. Studying the existing tricks is a necessity. By the way, the definition of what is tricky and what is not is strictly personal and time-dependent.

### Consequences of the Cauchy-Schwarz inequality

Exercise 1. Prove the triangle inequality.

Proof. From the expression for the norm of a linear combination

$\Vert x+y\Vert^2=\Vert x\Vert^2+2x\cdot y+\Vert y\Vert^2$           (using Cauchy-Schwarz)

$\leq\Vert x\Vert^2+2\Vert x\Vert\Vert y\Vert+\Vert y\Vert^2=(\Vert x\Vert+\Vert y\Vert )^2$

which gives $\Vert x+y\Vert\leq\Vert x\Vert+\Vert y\Vert.$
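The triangle inequality can likewise be checked by brute force; here is a small sketch over a grid of integer vectors in $R^2$ (the grid size is arbitrary):

```python
# Check ||x + y|| <= ||x|| + ||y|| on all pairs from a small integer grid.
import math
import itertools

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

grid = range(-3, 4)
for x in itertools.product(grid, repeat=2):
    for y in itertools.product(grid, repeat=2):
        s = [xi + yi for xi, yi in zip(x, y)]
        # small epsilon guards against floating-point rounding
        assert norm(s) <= norm(x) + norm(y) + 1e-12
```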

Definition. For any two nonzero vectors from the Cauchy-Schwarz inequality we have $\left|\frac{x\cdot y}{\Vert x\Vert\Vert y\Vert}\right|\leq 1.$ Therefore we can define the cosine of the angle $\widehat{x,y}$ between $x,y$ by

$\cos (\widehat{x,y})=\frac{x\cdot y}{\Vert x\Vert\Vert y\Vert}.$

This definition agrees with the definition of orthogonality: ${x\cdot y=0}$ means that the angle between $x,y$ is $\pi/2$ and $\cos (\widehat{x,y})$ should be zero.
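As a quick illustration, here is a hypothetical `angle` helper built directly from this definition (the clamp guards against harmless floating-point rounding):

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def angle(x, y):
    """Angle between nonzero vectors, from cos = x.y / (||x|| ||y||)."""
    c = dot(x, y) / (norm(x) * norm(y))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding

# Orthogonal vectors give pi/2, as the definition requires.
assert math.isclose(angle([1, 0], [0, 5]), math.pi / 2)
# Proportional vectors with a positive coefficient give a zero angle.
assert math.isclose(angle([2, 1], [4, 2]), 0.0, abs_tol=1e-7)
```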

Remark. Every bit of information about scalar products, norms and the Cauchy-Schwarz is true in the infinite-dimensional case. For applications in Optimization see this post and that post.

### Playing with balls

Once we have a norm, we can define the distance between $x,y$ by $\text{dist}(x,y)=\Vert x-y\Vert .$ For any $c\in R^n$ and $r>0$ let us put $B(c,r)=\{ x\in R^n:\Vert x-c\Vert <r\}.$ This is the set of points whose distance from $c$ is less than $r.$ Therefore it is a ball centered at $c$ and with radius $r.$

Exercise 2. (Application of the triangle inequality) Consider two balls $B(C,R)$ and $B(c,r),$ where $R>r,$ so the first ball is larger. Under what condition is the smaller ball contained in the larger one:

(1) $B(c,r)\subset B(C,R)?$

Solution. There are two possible ways to solve this exercise. One is to look at the geometry first and then translate it into math. The other is to try to see the solution directly from calculations. We follow the second way.

The inclusion relationship between sets in (1) is equivalent to a point-wise statement: any element $x$ of the smaller ball belongs to the larger ball. This means that we have to start with the assumption $\Vert x-c\Vert<r$ and arrive at $\Vert x-C\Vert<R.$ Using the triangle inequality we have

$\Vert x-C\Vert=\Vert x-c+c-C\Vert\leq \Vert x-c\Vert +\Vert c-C\Vert<r+\Vert c-C\Vert.$

We want $\Vert x-C\Vert$ to be smaller than $R.$ This is achieved if we require $r+\Vert c-C\Vert\leq R.$
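The sufficient condition $r+\Vert c-C\Vert\leq R$ can be probed empirically; the sketch below uses illustrative centers and radii and a simple rejection sampler:

```python
import math
import random

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def ball_contained(c, r, C, R):
    """Sufficient condition from the text: B(c,r) is inside B(C,R) when r + ||c-C|| <= R."""
    return r + dist(c, C) <= R

# Illustrative balls: r + ||c-C|| = 1.5 + sqrt(5) < 5 = R.
C, R = (0.0, 0.0), 5.0
c, r = (2.0, 1.0), 1.5
assert ball_contained(c, r, C, R)

# Every sampled point of the small ball indeed lands in the large ball.
rng = random.Random(0)
for _ in range(1000):
    while True:  # rejection-sample a point x with ||x-c|| < r
        x = (c[0] + rng.uniform(-r, r), c[1] + rng.uniform(-r, r))
        if dist(x, c) < r:
            break
    assert dist(x, C) < R
```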

Exercise 3. A set $A\subset R^n$ is called open if each of its elements $a\in A$ belongs to $A$ together with some ball $B(a,\varepsilon ).$ Prove that $B(c,r)$ is open.

Proof. Take any $a\in B(c,r).$ We have to produce $\varepsilon>0$ such that $B(a,\varepsilon )\subset B(c,r).$ From the previous exercise we have this inclusion if $\varepsilon =r-\Vert a-c\Vert$, which is positive because $\Vert a-c\Vert<r$.

27 Dec 17

## How to study mt3042 Optimisation: a guide to a guide

Section and example numbering follows that of the mt3042 Optimization Guide by M. Baltovic.

## Main idea: look at geometry in the two-dimensional case

Here is an example. The norm of a vector $x\in R^n$ is defined by $\left\Vert x\right\Vert =\sqrt{\sum_{i=1}^nx_i^2}.$ The combination of squaring and extracting a square root often makes it difficult to understand how this construction works. Here is a simple inequality that allows one to do without this norm (or, put it differently, replace it with another norm). Take $n=2.$

$\max \{|x_1|,|x_2|\}=\max \{\sqrt{x_1^2},\sqrt{x_2^2}\}\leq\sqrt{x_1^2+x_2^2}=\left\Vert x\right\Vert$

and

$\left\Vert x\right\Vert =\sqrt{x_1^2+x_2^2}\leq\sqrt{\max\{|x_1|,|x_2|\}^2+\max \{|x_1|,|x_2|\}^2}=\sqrt{2}\max\{|x_1|,|x_2|\}.$

We have proved that $\max \{|x_1|,|x_2|\}\leq\left\Vert x\right\Vert\leq\sqrt{2}\max\{|x_1|,|x_2|\}.$ This easily generalizes to $R^{n}$:

(1) $\max \{|x_1|,...,|x_n|\}\leq\left\Vert x\right\Vert\leq\sqrt{n}\max\{|x_1|,...,|x_n|\}.$
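A numerical sanity check of (1), with an assumed dimension $n=5$ and random test vectors:

```python
import math
import random

def bounds_hold(x):
    """Check max_i |x_i| <= ||x|| <= sqrt(n) * max_i |x_i| for a vector x."""
    m = max(abs(xi) for xi in x)
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    # small epsilon guards against floating-point rounding
    return m <= norm_x + 1e-12 and norm_x <= math.sqrt(len(x)) * m + 1e-12

rng = random.Random(42)
assert all(bounds_hold([rng.uniform(-10, 10) for _ in range(5)]) for _ in range(100))
```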

Application. The set $A\subset R^n$ is called bounded if there is a constant $C$ such that $\left\Vert x\right\Vert \leq C$ for all $x\in A.$ (1) implies an equivalent definition: the set $A\subset R^n$ is called bounded if there is a constant $C$ such that $\max\{|x_1|,...,|x_n|\}\leq C$ for all $x\in A.$ See p.35 of Baltovic's guide, where the inequality $y_{i}\leq \frac{f(\hat{x})}{p_{i}}$ is sufficient for proving boundedness of the set $Y.$

Theorem 2.2 (The Cauchy-Schwarz Inequality). This inequality does not have serious applications in the guide. For a nontrivial application of the Cauchy-Schwarz inequality see my post.

2.1.8. Avoid using the definition of continuity in terms of $\varepsilon-\delta$ (Definition 2.18). Use Definition 2.19 in terms of sequences instead.

2.6.2. Definition 2.21 is indigestible for many students. Just say that the matrix $A$ consists of the partial derivatives of the components of $f=(f_1,...,f_m):$

$A=\left(\begin{array}{ccc} \frac{\partial f_1}{\partial x_1}&...&\frac{\partial f_m}{\partial x_1} \\...&...&...\\ \frac{\partial f_1}{\partial x_n}&...&\frac{\partial f_m}{\partial x_n}\end{array}\right) .$

Theorem 2.11. The proof is really simple in the one-dimensional case. By the definition of the derivative, $\frac{f(x_n)-f(x)}{x_n-x}\rightarrow f^{\prime }(x)$ for any sequence $x_n\rightarrow x.$ Multiplying this ratio by $x_n-x\rightarrow 0$ we get $f(x_n)-f(x)=\frac{f(x_n)-f(x)}{x_n-x}(x_n-x)\rightarrow f^{\prime }(x)\cdot 0=0,$ which proves continuity of $f$ at $x.$

3.3.1. There is Math that happens on paper (formulas) and the one that happens in the head (logic). Many students see the formulas and miss the logic. Carefully read this section and see if the logic happens in your head.

3.4. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity $\{x:|x|>N\}$. Therefore, a maximum should be looked for in the set $\{x:|x|\leq N\}$. Since this is a compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don't need anything else.

4.2 First-order conditions for optima. See the proof.

4.4 Second-order conditions for optima. See explanation using the Taylor decomposition.

5.3 The Theorem of Lagrange. For the Lagrange method see necessary conditions, sufficient conditions, and the case of many constraints.

5.4 Proof of Lagrange's Theorem. See a simple explanation of the constraint qualification condition. The explanation on pp.58-60 is hard to understand because of dimensionality.

5.6 The Lagrangian multipliers. See simpler derivation.

6.4 Proof of the Kuhn-Tucker Theorem. In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis. The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page.

Example 6.4. In solutions that rely on the Kuhn-Tucker theorem, the author suggests checking the constraint qualification condition for all possible combinations of constraints. Not only is this time-consuming, but it is also misleading, given that it is often possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem or, alternatively, to use the Kuhn-Tucker theorem to eliminate simple cases. The same problem can be solved using convexity theory.

Example 6.5. In this case Baltovic makes a controversial experiment: he shows what happens if we go the wrong way (expectedly, bad things happen), without providing the correct solution.

Solution to Exercise 6.1. In this exercise, the revenue is homogeneous of degree 2 and the cost is homogeneous of degree 1, which indicates that the profit is infinite. No need to do a three-page analysis!

7.6 The Bellman equations. There are many optimization methods not covered in Sundaram's book. One of them, Pontryagin's maximum principle, is more general than the Bellman approach.

p. 172. The bound $\sum \delta ^{t}|r(s_{t},a_{t})|\leq K\sum \delta ^{t}$ is obvious and does not require the Cauchy-Schwarz inequality.

Example 8.1. See the solution of this example using the Cauchy-Schwarz inequality.

26 Nov 17

## Cauchy-Schwarz inequality and optimization 2

Here we consider a problem from mt3042 Optimization Guide by M. Baltovic:

Example 8.1. Given the initial state $s_{1}=W$ for some $W>0$ and a discount factor $\delta\in(0,1),$ we need to maximize the discounted stream of future rewards

(1) $\sum_{t=1}^{\infty }\delta^t\sqrt{a_t}.$

The maximization is over actions $(a_1,a_2,...)$ (you can think of them as consumptions). The initial state $s_1=W$ is the initial wealth; at each step, the consumption is chosen between zero and the wealth left from the previous period, $0\leq a_t\leq s_t.$ This reduces the amount available for consumption in the next period:

(2) $s_{t+1}=s_t-a_t,$ $t\geq 1.$

Step 1. Here we reformulate the problem to remove the states from the constraints. From (2) we have $a_t=s_t-s_{t+1}.$ Fixing an integer $T>0$, we obtain

(3) $a_1+a_2+...+a_{T-1}+a_T=s_1-s_2+s_2-s_3+...+s_{T-1}-s_T+s_T-s_{T+1}=s_1-s_{T+1}.$

Since in each state $s_t$ consumption $a_t$ is nonnegative, the states can only decrease over time. On the other hand, they cannot be negative. Hence, the sequence $\{s_T\}$ is nonincreasing and bounded from below by zero. Therefore the limit $S=\lim_{T\rightarrow\infty}s_T$ exists and is nonnegative. Letting $T\rightarrow\infty$ in (3) we get $\sum_{t=1}^{\infty}a_t=W-S\leq W.$

The problem becomes: maximize (1) subject to

(4) $\sum_{t=1}^{\infty}a_t\leq W.$

Step 2. Getting rid of square roots. Denoting $x_t=\sqrt{a_t},$ we have to maximize $\sum_{t=1}^{\infty }\delta^tx_t$ subject to $\sum_{t=1}^{\infty }x_t^2\leq W.$

Step 3. Applying the result on the Cauchy-Schwarz inequality. The result from the previous post does not depend on the number of variables. For our present needs it is formulated as follows: Let $a_1,a_2,...\geq 0$ be fixed numbers and let $x_1,x_2,...\geq 0$ be variables satisfying the restriction $x_1^2+x_2^2+...\leq W.$ Then the infinite weighted sum $a_1x_1+a_2x_2+...$ is maximized at the point

$(x_1,x_2,...)=\frac{W^{1/2}}{\left( a_1^2+a_2^2+...\right)^{1/2}}(a_1,a_2,...)$

and the maximized value is $a_1x_1+a_2x_2+...=W^{1/2}\left(a_1^2+a_2^2+...\right)^{1/2}.$

Letting $a_{t}=\delta^t$ we get $a_1^2+a_2^2+...=\delta^2/(1-\delta^2)$ and

$(x_1,x_2,...)=[(1-\delta^2)W]^{1/2}(1,\delta,\delta^2,...)$ with the maximized value $\delta\sqrt{W/(1-\delta^2)}.$

And no clumsy terminology and notation from the Infinite Horizon Dynamic Programming method!
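As a sanity check of Steps 1-3, the sketch below evaluates the candidate plan numerically (the values of $\delta$ and $W$ are arbitrary illustrations; note that with the discounting $\delta^t$ starting at $t=1$ in (1), the total reward works out to $\delta\sqrt{W/(1-\delta^2)}$):

```python
import math

delta, W = 0.9, 10.0
T = 2000  # truncation horizon; the neglected tail is geometrically small

# Candidate optimal plan from the Cauchy-Schwarz argument:
# x_t = sqrt((1-delta^2) W) * delta^(t-1),  a_t = x_t^2.
scale = math.sqrt((1 - delta**2) * W)
x = [scale * delta ** (t - 1) for t in range(1, T + 1)]
a = [xt**2 for xt in x]

# Budget: total consumption equals the initial wealth W.
assert math.isclose(sum(a), W, rel_tol=1e-9)

# Objective (1): sum of delta^t * sqrt(a_t) equals delta * sqrt(W/(1-delta^2)).
value = sum(delta**t * math.sqrt(a[t - 1]) for t in range(1, T + 1))
assert math.isclose(value, delta * math.sqrt(W / (1 - delta**2)), rel_tol=1e-9)
```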

26 Nov 17

## Cauchy-Schwarz inequality and optimization 1

Between Cauchy and Schwarz there was also Bunyakovsky who contributed to this inequality.

Inequalities, if they are precise, can be used for optimization, even in the infinite-dimensional case. The method explained here does not rely on the Lagrange method, the Kuhn-Tucker theorem, or convexity. It illustrates the point that it is not a good idea to try to fit all problems into one framework. The solution method should depend on the problem.

Problem. Let $a_1,a_2\geq 0$ be fixed numbers and let $x_1,x_2\geq 0$ be variables satisfying the restriction

(1) $x_1^2+x_2^2\leq W.$

The problem is to maximize the weighted sum $a_1x_1+a_2x_2$ subject to the above restriction. This can be done using the convexity method. However, our future application will contain an infinite number of variables in which case the convexity method becomes conceptually complex. Therefore here we provide a different method based on the Cauchy-Schwarz inequality. In our case it looks like this

(2) $\left\vert a_1x_1+a_2x_2\right\vert \leq \left(a_1^2+a_2^2\right)^{1/2}\left(x_1^2+x_2^2\right)^{1/2}.$

We solve the problem in two steps. First we bound $a_1x_1+a_2x_2$ from above and then we find the point $(x_1,x_2)$ at which the upper bound is attained.

Step 1. (1) and (2) obviously imply

(3) $a_1x_1+a_2x_2\leq\left(a_1^2+a_2^2\right)^{1/2}W^{1/2}.$

Step 2. We need to find $(x_1,x_2)$ such that (1) is satisfied and (3) turns into an equality. This involves a little guessing. The guiding idea is that on the right in (3) we have squares of $a_1,a_2.$ For them to appear at the left, we choose $x_1=ca_1,x_2=ca_2$ with some constant $c$. Then we want

$ca_1^2+ca_2^2=\left(a_1^2+a_2^2\right)^{1/2}W^{1/2}$

to be true. This gives the value of the constant $c=\frac{W^{1/2}}{\left(a_1^2+a_2^2\right)^{1/2}}.$ The point we are looking for becomes

(4) $(x_1,x_2)=\frac{W^{1/2}}{\left(a_1^2+a_2^2\right)^{1/2}}(a_1,a_2).$ It satisfies (1):

$x_1^2+x_2^2=\frac{W}{\left(a_1^2+a_2^2\right)}\left(a_1^2+a_2^2\right)=W.$

Further, the maximized value of $a_1x_1+a_2x_2$ is $\left(a_1^2+a_2^2\right)^{1/2}W^{1/2}$.
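A numerical check of the two-variable solution: the candidate (4) is feasible, attains the bound (3), and no randomly sampled feasible point exceeds it (the values of $a_1,a_2,W$ are arbitrary):

```python
import math
import random

a1, a2, W = 3.0, 4.0, 2.0
bound = math.sqrt(a1**2 + a2**2) * math.sqrt(W)  # right side of (3)

# Candidate maximizer (4).
c = math.sqrt(W) / math.sqrt(a1**2 + a2**2)
x1, x2 = c * a1, c * a2
assert math.isclose(x1**2 + x2**2, W)          # feasibility: (1) holds with equality
assert math.isclose(a1 * x1 + a2 * x2, bound)  # the bound (3) is attained

# No random feasible point does better.
rng = random.Random(1)
for _ in range(10_000):
    u1, u2 = rng.uniform(0, math.sqrt(W)), rng.uniform(0, math.sqrt(W))
    if u1**2 + u2**2 <= W:
        assert a1 * u1 + a2 * u2 <= bound + 1e-9
```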

26 Nov 16

## Properties of correlation

### Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

### The hierarchy of definitions

Suppose random variables $X,Y$ are not constant. Then their standard deviations are not zero and we can define their correlation as in Chart 1: $\rho (X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.$

Chart 1. Correlation definition

### Properties of correlation

Property 1. Range of the correlation coefficient: for any $X,Y$ one has $- 1 \le \rho (X,Y) \le 1$.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is cosine of the angle between $X-EX$ and $Y-EY$.
Property 2. Interpretation of extreme cases. (Part 1) If $\rho (X,Y) = 1$, then $Y = aX + b$ with $a > 0.$

(Part 2) If $\rho (X,Y) = - 1$, then $Y = aX + b$ with $a < 0$.

Proof. (Part 1) $\rho (X,Y) = 1$ implies
(1) $Cov (X,Y) = \sigma (X)\sigma (Y)$
which, in turn, implies that $Y$ is a linear function of $X$: $Y = aX + b$ (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number $a$. By the properties of variance and covariance
$Cov(X,Y)=Cov(X,aX+b)=aCov(X,X)+Cov(X,b)=aVar(X)$,

$\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X)$.
Plugging this into Eq. (1) we get $aVar(X) = |a|\sigma^2(X)$; since $\sigma (Y)=|a|\sigma (X)>0$ forces $a \neq 0$, we see that $a$ is positive.

The proof of Part 2 is left as an exercise.
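Parts 1 and 2 can be illustrated numerically with a hand-rolled correlation function; the data sets below are arbitrary examples:

```python
import math

def corr(xs, ys):
    """Population-style correlation: cov / (sd_x * sd_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

xs = [1.0, 2.0, 4.0, 7.0]
assert math.isclose(corr(xs, [2 * x + 3 for x in xs]), 1.0)    # a > 0 gives rho = 1
assert math.isclose(corr(xs, [-5 * x + 1 for x in xs]), -1.0)  # a < 0 gives rho = -1
```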

Property 3. Suppose we want to measure correlation between weight $W$ and height $H$ of people. The measurements are either in kilos and centimeters ${W_k},{H_c}$ or in pounds and feet ${W_p},{H_f}$. The correlation coefficient is unit-free in the sense that it does not depend on the units used: $\rho (W_k,H_c)=\rho (W_p,H_f)$. Mathematically speaking, correlation is homogeneous of degree $0$ in both arguments.
Proof. One measurement is proportional to another, $W_k=aW_p,\ H_c=bH_f$ with some positive constants $a,b$. By homogeneity
$\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).$
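The unit-invariance argument can be illustrated with toy data (the simulated weights and heights below are for illustration only; the conversion factors are 0.4536 kg per pound and 30.48 cm per foot):

```python
import math
import random

def corr(xs, ys):
    """Population-style correlation: cov / (sd_x * sd_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

rng = random.Random(7)
W_p = [rng.uniform(100, 250) for _ in range(50)]       # weights in pounds (toy data)
H_f = [w / 40 + rng.uniform(-0.5, 0.5) for w in W_p]   # heights in feet (toy data)
W_k = [0.4536 * w for w in W_p]                        # pounds -> kilos
H_c = [30.48 * h for h in H_f]                         # feet -> centimeters

# Correlation does not depend on the units of measurement.
assert math.isclose(corr(W_k, H_c), corr(W_p, H_f))
```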

12 Nov 16

## Properties of standard deviation

Properties of standard deviation fall into two parts. The definitions and their consequences are given here. Both variance and standard deviation are used to measure variability of values of a random variable around its mean. Then why use both of them? The answer will be given in another post.

### Properties of standard deviation: definitions and consequences

Definition. For a random variable $X$, the quantity $\sigma (X) = \sqrt {Var(X)}$ is called its standard deviation.

#### Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to $Var(X) \ge 0$ (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number $a$ is defined by
(1) $|a| =a$ if $a$ is nonnegative and $|a| =-a$ if $a$ is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point $a$ to the origin. For example, $dist(3,0) = |3| = 3$ and $dist(-3,0) = |-3| = 3$. More generally, for any points $a,b$ on the real line the distance between them is given by $dist(a,b) = |a - b|$.

By squaring both sides in Eq. (1) we obtain $|a|^2={a^2}$. Application of the arithmetic square root gives

(2) $|a|=\sqrt {a^2}.$

This is the equation we need right now.

### Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

$\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).$

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient $a$.
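A short numerical illustration of Property 1, with an arbitrary sample and a negative scaling coefficient:

```python
import math

def var(xs):
    """Population variance of a sample."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sd(xs):
    return math.sqrt(var(xs))

xs = [1.0, 4.0, 4.0, 9.0]
a = -3.0
# sigma(aX) = |a| sigma(X): the absolute value matters because a < 0.
assert math.isclose(sd([a * x for x in xs]), abs(a) * sd(xs))
```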

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables $X,Y$ one has

(3) $|Cov(X,Y)|\le\sigma(X)\sigma(Y)$.

(Part 2) If the inequality sign in (3) turns into equality, $|Cov(X,Y)|=\sigma (X)\sigma (Y)$, then $Y$ is a linear function of $X$: $Y = aX + b$, with some constants $a,b$.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are $0$ and there is nothing to prove. To exclude the trivial case, let $X,Y$ be non-constant and, therefore, $Var(X),\ Var(Y)$ are positive. Consider a real-valued function of a real number $t$ defined by $f(t) = Var(tX + Y)$. Here we have variance of a linear combination

$f(t)=t^2Var(X)+2tCov(X,Y)+Var(Y)$.

We see that $f(t)$ is a parabola with branches pointing upward (because the leading coefficient $Var(X)$ is positive). By nonnegativity of variance, $f(t)\ge 0$ and the parabola lies above the horizontal axis in the $(t,f)$ plane. Hence, the quadratic equation $f(t) = 0$ may have at most one real root. This means that the discriminant of the equation is non-positive:

$D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.$

Taking square roots on both sides of $Cov(X,Y)^2\le Var(X)Var(Y)$ we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is $0$. Therefore the parabola touches the horizontal axis where $f(t)=Var(tX + Y)=0$. But we know that this implies $tX + Y = constant$ which is just another way of writing $Y = aX + b$.

Comment. (3) explains one of the main properties of the correlation:

$-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1$.
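A final numerical check of (3) and the resulting correlation bound, using arbitrary simulated data:

```python
import math
import random

rng = random.Random(3)
xs = [rng.gauss(0, 1) for _ in range(200)]  # simulated sample of X
ys = [rng.gauss(0, 2) for _ in range(200)]  # simulated sample of Y

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)

# (3): |Cov(X,Y)| <= sigma(X) sigma(Y), hence -1 <= rho <= 1.
assert abs(cov) <= sx * sy
assert -1.0 <= cov / (sx * sy) <= 1.0
```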