Jul 18

Geometry of linear equations: structure of image and null space

Geometry of linear equations: structure of image and null space

Definition 1. The subspace notion allows us to describe the algebraic structure of the set of solutions of Ax=y. The special case Ax=0 is called a homogeneous equation. Obviously, x=0 satisfies it but there may be other solutions. The equation Ax=y is called an inhomogeneous equation. We address the questions of existence and uniqueness of its solutions.

Structure of the image of A

Recall Basic observation 1: The image \text{Img}(A)=\{ Ax:x\in R^k\} is the set of y=Ax for which the inhomogeneous equation has solutions.

Exercise 1. \text{Img}(A) is a linear subspace in R^n.

Proof. This follows from the first characterization of the matrix image. Here is a direct proof. Suppose y_1,y_2\in \text{Img}(A). Then there exist x_1,x_2\in R^k such that Ax_1=y_1, Ax_2=y_2. By linearity this gives ay_1+by_2=aAx_1+bAx_2=A(ax_1+bx_2). Thus for y=ay_1+by_2 we have found x=ax_1+bx_2 such that y=Ax, which means ay_1+by_2\in \text{Img}(A) for any a,b and \text{Img}(A) is a subspace.

Structure of the null space of A

Definition 2. The set of solutions x of the homogeneous equation Ax=0 is called the null space of A. It is denoted N(A)=\{ x:Ax=0\}.

Exercise 2. The null space of A is a linear subspace of D(A).

Proof. Suppose x_1,x_2\in N(A) so that Ax_1=0, Ax_2=0. Then by linearity A(ax_1+bx_2)=aAx_1+bAx_2=0, so ax_1+bx_2\in N(A) for any a,b and N(A) is a linear subspace.

Description of the set of solutions of Ax=y

Intuition. In R^{3}, straight lines and planes that don't contain the origin can be obtained by shifting straight lines and planes that do (geometry should dominate the algebra at this point, see Figure 1). This is generalized in the next definition.

Figure 1. Shifting a subspace gives a hyperplane

Figure 1. Shifting a subspace gives a hyperplane

Definition 3. Let x\in R^n be any vector and let L be a subspace. x+L=\{x+l:l\in L\} denotes a shift of L by x and it is obtained by adding to x all elements of L. Some people call x+Lhyperplane.

Exercise 3. As we know, the equation Ax=y has solutions if and only if y\in \text{Img}(A). Let us fix y\in \text{Img}(A) and let x_{0} be some solution of Ax=y. Then the set of all solutions of this equation is x_0+N(A) (in detail: any other solution x of that equation can be obtained by adding an element z of the null space to x_0: x=x_0+z). This is written as \{ x:Ax=y\}=x_0+N(A).

Proof. Let Ax_0=y and let x be any solution of Ax=y. Then A(x-x_0)=0, x-x_0\in N(A) and x-x_{0}=z for some z\in N(A). We obtain x=x_0+z which proves the statement.

Conclusion. If N(A)=\{0\}, then A is one-to-one and we have uniqueness of solutions of the inhomogeneous equation; otherwise, N(A) can serve as a measure of non-uniqueness.

Jul 18

Geometry of linear equations: linear spaces and subspaces

Geometry of linear equations: linear spaces and subspaces

R^n is a linear space in the following sense: for any x,y\in R^n and any a,b\in R the linear combination ax+by=(ax_1+by_1,...,ax_n+by_n) belongs to R^{n}. From earlier exercises we know that \{ ax:a\in R\} and \{ by:b\in R\} are straight lines drawn through the vectors x,y. Therefore \{ ax+by:a,b\in R\} is a plane drawn through those vectors. We can say that R^n is a linear space in the following sense: for any x,y\in R^n, the whole plane drawn through x,y is contained in R^n.

Subsets of R^n may have this property.

Example 1. On the plane, take any straight line passing through the origin (the slope doesn't matter). The equation of such a line is

(1) ax+by=0,

where at least one of the coefficients a,b is different from zero. If b\neq 0, we can solve the equation for y and get a more familiar form y=kx. If b=0, we get a vertical line x=0. If two points (x_1,y_1), (x_2,y_2) satisfy (1), then you can check that their linear combination c(x_1,y_1)+d(x_2,y_2) also satisfies (1). A straight line that does not pass through the origin is described by ax+by=c with c\neq 0.

Example 2. In R^3, take any straight line or any 2D plane passing through the origin. A straight line is described by a system of two equations a_1x+b_1y+c_1z=0, a_2x+b_2y+c_2z=0 and a plane is described by one equation ax+by+cz=0. You can do the algebra as above to show that linear combinations of elements of these sets belong to these sets. However, I suggest using the geometric interpretation of linear operations to show that these are examples of subspaces. A straight line or a plane that does not pass through the origin is not a subspace, see Figure 1. The straight line L does not pass through the origin. If we take vectors X,Y from it, their sum, found by the parallelogram rule, does not belong to L.

Figure 1. A hyperplane is not a subspace

Figure 1. A hyperplane is not a subspace

Definition 1. The set L\subset R^n is called a linear subspace of R^{n} if for any x,y\in L it contains the whole plane drawn through x,y. By induction, any linear combination a_1x^{(1)}+...+a_mx^{(m)} of any elements of a subspace belongs to it.

The definition does not exclude the extreme cases L=\{0\} and L=R^n. These two cases are called trivial.

Sometimes we are interested in how a subspace is generated.

Definition 2. Take any vectors x^{(1)},...,x^{(k)}\in R^n and consider the set L of all linear combinations

(2) a_1x^{(1)}+...+a_kx^{(k)} with a_1,...,a_k\in R.

This set is a linear subspace (because a linear combination of expressions of type (2) is again of this type) and it is called a span of x^{(1)},...,x^{(k)}. We also say that L is spanned by x^{(1)},...,x^{(k)}.

In this terminology, the first characterization of the matrix image says that it is spanned by that matrix' columns.

Jul 18

Geometry of linear equations: matrix as a mapping

Geometry of linear equations: matrix as a mapping

General idea. Let f(x) be a function with the domain D(f). We would like to know for which y the equation f(x)=y has solutions, and if it does, then how many (one or more). For the existence part, let the argument x run over D(f) and see what set the values f(x) belong to. It is the image \text{Img}(f)=\{ f(x):x\in D(f)\}. This definition directly implies

Basic observation 1. The equation f(x)=y has solutions if and only if y\in\text{Img}(f).

For the uniqueness part, fix y\in\text{Img}(f) and see for which arguments x\in D(f) the value f(x) equals that y. This is the counter-image f^{-1}(y)=\{ x\in D(f):f(x)=y\} of y.

Basic observation 2. If f^{-1}(y) consists of one point, you have uniqueness.

See how this works for the function f(x)=0, x\leq 0, f(x)=x, x>0. This function is not linear. The function generated by a matrix is linear, and a lot more can be said in addition to the above observations.

Matrix as a mapping

Definition 1. Let A be a matrix of size n\times k. It generates a mapping f:R^k\rightarrow R^n according to f(x)=Ax (x is written as a k\times 1 column). Following the common practice, we identify the mapping f with the matrix A.

Exercise 1. The mapping we just defined is linear: for any vectors x,y and numbers a,b one has

(1) A(ax+by)=aAx+bAy.

Proof. Partitioning A into rows and using linearity of scalar products we have

A(ax+by)=\left(\begin{array}{c}A_1(ax+by)\\...\\A_n(ax+by)\end{array}\right)=\left(\begin{array}{c}aA_1x+bA_1y\\  ...\\aA_nx+bA_ny\end{array}\right)

=a\left(\begin{array}{c}A_1x\\...\\A_nx\end{array}\right)+b\left(\begin{array}{c}  A_1y\\...\\A_ny\end{array}\right)=aAx+bAy.

Exercise 2 (first characterization of matrix image) Show that the image \text{Img}(A) consists of linear combinations of the columns of A.

Solution. Partitioning A into columns, for any x we have

(2) Ax=\left(A^1,...,A^k\right)\left(\begin{array}{c}x_1\\...\\x_k\end{array}\right)=x_1A^1+...+x_kA^k.

This means that when x runs over D(A)=R^k, the images Ax are linear combinations of the column-vectors A^1,...,A^k.

Remark. In (2) we silently used the multiplication rule for partitioned matrices. Here is the statement of the rule in a simple situation. Let A,B be two matrices compatible for multiplication. Let us partition them into smaller matrices

A=\left(\begin{array}{cc}A_{11}&A_{12}\\A_{21}&A_{22}\end{array}\right), B=\left(\begin{array}{cc}B_{11}&B_{12}\\B_{21}&B_{22}\end{array}\right).

Then the product AB can be found as if those blocks were numbers:


The only requirement for this to be true is that the blocks A_{ij}, B_{ij} be compatible for muliplication. You will not be bored with the proof. In (2) the multiplication is performed as if A^1,...,A^k were numbers.

Jul 18

Euclidean space geometry: questions for repetition

Euclidean space geometry: questions for repetition

How to study definitions and statements

For a definition, choose the shortest and most geometrical formulation. If it's still difficult to remember, repeat it aloud several times. Speaking aloud sends more energy to your brain. Think out equivalent definitions and consequences. As you do this, new neurons will be created in your brain, and that requires time.

Unlike definitions, statements require proofs. If you use a particular fact in your proof, state it explicitly. My criterion of understanding is this: how could I guess a result like this? Can I generalize it or find an analogy? Repeat the theory in large chunks. Explain clearly, as if lecturing. This will allow you to concentrate on logic, as opposed to memorizing.

Roll up your sleeves

Prove or solve the next statements and exercises.

Statement 1. A scalar product is a function of two vector arguments f(x,y)=x\cdot y such that

1) it is linear in the first argument when the second argument is fixed and linear in the second argument when the first one is fixed (this property is called bilinearity);

2) it is symmetric,

3) the scalar product of a vector by itself is non-negative: x\cdot x\geq0.

4) x\cdot x=0 if and only if x=0.

Statement 2. Once we have a scalar product, we can define an associated norm by \Vert x\Vert =\sqrt{x\cdot x}. It has the properties:

1) \Vert x\Vert \geq 0 for all x,

2) \Vert x\Vert =0 if and only if x=0,

3) the norm is homogeneous of degree 1,

4) Cauchy-Schwarz inequality: |x\cdot y|\leq\Vert x\Vert\Vert y\Vert,

5) triangle inequality: \Vert x+y\Vert \leq\Vert x\Vert +\Vert y\Vert .

Statement 3. Once we have a norm, we can define a distance between two points by \text{dist}(x,y)=\Vert x-y\Vert . It satisfies the triangle inequality \text{dist}(x,y)\leq\text{dist}(x,z)+\text{dist}(z,y) for any three points x,y,z.

Statement 4. With a scalar product and the associated norm at hand we can define the notions of orthogonality and cosine of an angle between two vectors. The Pythagoras theorem becomes a simple consequence of definitions.

Exercise 1. Derive the expression for the norm of a linear combination.

Exercise 2. What condition on the centers and radii is sufficient for inclusion of balls B(c,r)\subset B(C,R)?

Exercise 3. Prove that the upper part \{(x_1,x_2)\in R^2: x_2>0\} of the plane is an open set.

Exercise 4. Describe the geometry related to the quadratic equation ax^2+bx+c=0:

1) How do we know if the branches of the parabola y=ax^2+bx+c look upward or downward?

2) What is the link between the discriminant and the following situations: a) the parabola does not touch or cross the x axis, b) the parabola touches (is tangent to) the x axis and c) the parabola crosses the x axis at two different points?

Exercise 5. Prove and interpret the parallelogram law:

\|x+y\|^2+\|x-y\|^2=2\| x\|^2+2\| y\|^2.
Jul 18

Euclidean space geometry: Cauchy-Schwarz inequality

Euclidean space geometry: Cauchy-Schwarz inequality

At first glance, the scalar product has nothing to do with the distance notion. It has, and the Cauchy-Schwarz inequality provides one of the links.

Cauchy-Schwarz inequality

Statement (Cauchy-Schwarz inequality) (I) For any vectors x,y one has |{x\cdot y}|\leq \Vert x\Vert\Vert y\Vert .

(II) If the inequality sign turns into equality, |{x\cdot y}|=\Vert x\Vert\Vert y\Vert , then y is proportional to x: y=ax.

Proof. (I) If at least one of the vectors is zero, both sides of the inequality are 0 and there is nothing to prove. To exclude the trivial case, suppose that none of x,y is zero and, therefore, \Vert x\Vert,\Vert y\Vert are positive. Consider a real-valued function of a real number t defined by f(t)=\Vert tx+y\Vert^2. Here we have a norm of a linear combination f(t)=t^2\Vert x\Vert  ^2+2tx\cdot y+\Vert y\Vert^2.

We see that f(t) is a parabola with branches looking upward (because the senior coefficient \Vert x\Vert^2 is positive). By nonnegativity of the norm, f(t)\geq 0 and the parabola lies above the horizontal axis in the (f,t) plane. Hence, the quadratic equation f(t)=0 may have at most one real root. This means that the discriminant of the equation is non-positive: D=[x\cdot y]^2-\Vert x\Vert^2\Vert y\Vert^2\leq 0. Applying square roots to both sides of [x\cdot y]^2\leq\Vert x\Vert^2\Vert y\Vert^2 we finish the proof of the first part.

(II) In case of the equality sign the discriminant is 0. Therefore the parabola touches the horizontal axis where f(t)=\Vert tx+y\Vert^2=0. But we know that this implies tx+y=0 which is just another way of writing y=ax.

Do you think this proof is tricky? During the long history of development of mathematics, mathematicians have invented many tricks, small and large. No matter how smart you are, you cannot reinvent all of them. Studying the existing tricks is a necessity. By the way, the definition of what is tricky and what is not is strictly personal and time-dependent.

Consequences of the Cauchy-Schwarz inequality

Exercise 1. Prove the triangle inequality.

Proof. From the expression for the norm of a linear combination

\Vert x+y\Vert^2=\Vert x\Vert^2+2x\cdot y+\Vert y\Vert^2           (using Cauchy-Schwarz)

\leq\Vert x\Vert^2+2\Vert x\Vert\Vert y\Vert+\Vert y\Vert^2=(\Vert x\Vert+\Vert y\Vert )^2

which gives \Vert x+y\Vert\leq\Vert x\Vert+\Vert y\Vert.

Definition. For any two nonzero vectors from the Cauchy-Schwarz inequality we have \left|\frac{x\cdot y}{\Vert x\Vert\Vert y\Vert}\right|\leq 1. Therefore we can define the cosine of the angle \widehat{x,y} between x,y by

\cos (\widehat{x,y})=\frac{x\cdot y}{\Vert x\Vert\Vert y\Vert}.

This definition agrees with the definition of orthogonality: {x\cdot y=0} means that the angle between x,y is \pi/2 and \cos (\widehat{x,y}) should be zero.

Remark. Every bit of information about scalar products, norms and the Cauchy-Schwarz is true in the infinite-dimensional case. For applications in Optimization see this post and that post.

Playing with balls

Once we have a norm, we can define the distance between x,y by \text{dist}(x,y)=\Vert x-y\Vert . For any c\in R^n and r>0 let us put B(c,r)=\{ x\in R^n:\Vert x-c\Vert <r\}. This is a set of points whose distance from c is less than r. Therefore it is a ball centered at c and with radius r.

Exercise 2. (Application of the triangle inequality) Consider two balls B(C,R) and B(c,r), where R>r, so the first ball is larger. Under what condition the smaller ball is contained in the larger ball:

(1) B(c,r)\subset B(C,R)?

Solution. There are two possible ways to solve this exercise. One is to look at the geometry first and then try to translate it to math. The other is just try to see the solution from calculations. We follow the second way.

The inclusion relationship in terms of sets (1) is equivalent to a point-wise statement: any element x of the smaller ball belongs to the larger ball. This means that we have to start with the assumption \Vert  x-c\Vert<r and arrive to \Vert x-C\Vert<R. Using the triangle inequality we have

\Vert x-C\Vert=\Vert x-c+c-C\Vert\leq  \Vert x-c\Vert +\Vert c-C\Vert <r+\Vert  c-C\Vert.

We want \Vert x-C\Vert to be smaller than R. This is achieved if we require r+\Vert c-C\Vert\leq R.

Exercise 3. A set A\subset R^n is called open if any its element a\in A belongs to A together with some ball B(a,\varepsilon ). Prove that B(c,r) is open.

Proof. Take any a\in B(c,r). We have to produce \varepsilon such B(a,\varepsilon )\subset B(c,r). From the previous exercise we have this inclusion if \varepsilon =r-\Vert a-c\Vert.

Jul 18

Euclidean space geometry: scalar product, norm and distance

Euclidean space geometry: scalar product, norm and distance

Learning this material has spillover effects for Stats because everything in this section has analogs for means, variances and covariances.

Scalar product

Definition 1. The scalar product of two vectors x,y\in R^n is defined by x\cdot y=\sum_{i=1}^nx_iy_i.

Remark. If matrix notation is of essence and x,y are written as column vectors, we have x\cdot y=x^Ty. The first notation is better when we want to emphasize symmetry x\cdot y=y\cdot x.

Linearity. The scalar product is linear in the first argument when the second argument is fixed: for any vectors x,y,z and numbers a,b one has

(1) (ax+by)\cdot z=ax\cdot z+by\cdot z.

Proof. (ax+by)\cdot z=\sum_{i=1}^n(ax_i+by_i)z_i=\sum_{i=1}^n(ax_iz_i+by_iz_i)

=a\sum_{i=1}^nax_iz_i+b\sum_{i=1}^ny_iz_i=ax\cdot z+by\cdot z.

Special cases. 1) Homogeneity: by setting b=0 we get (ax)\cdot z=ax\cdot  z. 2) Additivity: by setting a=b=1 we get (x+y)\cdot z=x\cdot z+y\cdot  z.

Exercise 1. Formulate and prove the corresponding properties of the scalar product with respect to the second argument.

Definition 2. The vectors x,y are called orthogonal if x\cdot y=0.

Exercise 2. 1) The zero vector is orthogonal to any other vector. 2) If x,y are orthogonal, then any vectors proportional to them are also orthogonal. 3) The unit vectors in R^n are defined by e_i=(0,...,1,...,0) (the unit is in the ith place, all other components are zeros), i=1,...,n. Check that they are pairwise orthogonal.


Exercise 3. On the plane find the distance between a point x and the origin.

Figure 1. Pythagoras theorem

Figure 1. Pythagoras theorem

Once I introduce the notation on a graph (Figure 1), everybody easily finds the distance to be \text{dist}(0,x)=\sqrt{x_1^2+x_2^2} using the Pythagoras theorem. Equally easily, almost everybody fails to connect this simple fact with the ensuing generalizations.

Definition 3. The norm in R^n is defined by \left\Vert x\right\Vert=\sqrt{\sum_{i=1}^nx_i^2}. It is interpreted as the distance from point x to the origin and also the length of the vector x.

Exercise 4. 1) Can the norm be negative? We know that, in general, there are two square roots of a positive number: one is positive and the other is negative. The positive one is called an arithmetic square root. Here we are using the arithmetic square root.

2) Using the norm can you define the distance between points x,y\in R^n?

3) The relationship between the norm and scalar product:

(2) \left\Vert x\right\Vert =\sqrt{x\cdot x}.

True or wrong?

4) Later on we'll prove that \Vert x+y\Vert\leq\Vert x\Vert+\Vert{ y}\Vert . Explain why this is called a triangle inequality. For this, you need to recall the parallelogram rule.

5) How much is \left\Vert 0\right\Vert ? If \left\Vert x\right\Vert =0, what can you say about x?

Norm of a linear combination. For any vectors x,y and numbers a,b one has

(3) \left\Vert ax+by\right\Vert^2=a^2\left\Vert x\right\Vert^2+2ab(x\cdot y)+b^2\left\Vert y\right\Vert^2.

From (2) we have

\left\Vert ax+by\right\Vert^2=\left(ax+by\right)\cdot\left(ax+by\right)     (using linearity in the first argument)

=ax\cdot\left(ax+by\right)+by\cdot\left(ax+by\right)         (using linearity in the second argument)

=a^2x\cdot x+abx\cdot y+bay\cdot x+b^2y\cdot y (applying symmetry of the scalar product and (2))

=a^2\left\Vert x\right\Vert^2+2ab(x\cdot y)+b^2\left\Vert y\right\Vert^2.

Pythagoras theorem. If x,y are orthogonal, then \left\Vert x+y\right\Vert^2=\left\Vert x\right\Vert^2+\left\Vert y\right\Vert^2.

This is immediate from (3).

Norm homogeneity. Review the definition of the absolute value and the equation |a|=\sqrt{a^2}. The norm is homogeneous of degree 1:

\left\Vert ax\right\Vert=\sqrt{(ax)\cdot (ax)}=\sqrt{{a^2x\cdot x}}=|a|\left\Vert x\right\Vert.

Jul 18

Euclidean space geometry: vector operations

Euclidean space geometry: vector operations

The combination of these words may sound frightening. In fact, if you want to succeed with matrix algebra, you need to start drawing inspiration from geometry as early as possible.

Sum of vectors

Definition. The set of all n-dimensional vectors x=(x_1,...,x_n) with x_{i}\in R is denoted R^n and is called a Euclidean space.

R^2 is a plane. The space we live in is R^3. Our intuition doesn't work in dimensions higher than 3 but most facts we observe in real life on the plane and in the 3-dimensional space have direct analogs in higher dimensions. Keep in mind that x=(x_1,...,x_n) can be called a vector or a point in R^{n}, depending on the context. When we think of it as a vector, we associate with it an arrow that starts at the origin (0,...,0) and ends at the point (x_1,...,x_n).

Careful inspection shows that the sum of two vectors x+y=(x_1+y_1,...,x_n+y_n) is found using the parallelogram rule in Figure 1. The rule itself comes from physics: if two forces are applied to a point, their resultant force is found by the parallelogram rule. Whatever works in real life is guaranteed to work in Math.

Figure 1. Sum of vectors

Figure 1. Sum of vectors

Exercise 1. Let e_1=(1,0) be the unit vector of the x axis and e_2=(0,1) the unit vector of the y axis. Find the sum e_1+e_2. Generalization: if x runs over the whole x axis and y runs over the whole y axis, what is the set of resulting sums x+y? Further generalization: on the plane take two straight lines L_1 and L_2 that pass through the origin and are not parallel to one another. If x runs over L_1 and y runs over L_2, what is the set of resulting sums x+y?

This seemingly innocuous exercise leads to profound ideas, to be considered later. The answer for the last question is that the sums x+y will cover the whole R^{2}. This fact is written like this: \{x+y:x\in L_1,\ y\in L_2\}=R^2. Note that I intentionally use words that emphasize movement and geometry: "runs over" and "covers the whole". One of the differences between elementary and higher mathematics is that the former deals with fixed elements and the latter with sets within which there are movement and change.

Multiplication of a vector by a number

If a is a number and x=(x_1,...,x_n) is a vector, we put ax=(ax_1,...,ax_n) (scaling or multiplication by a number). Scaling x by a positive number a means lengthening it in case a>1 and shortening in case a<1. Scaling by a negative number means, additionally, reverting the direction of x,  see Figure 2.

Scaling a vector

Figure 2. Scaling a vector


Exercise 2. Take a nonzero vector x=(x_1,x_2) and find the set of all products ax, where a runs over R.

The answer is that \left\{ ax:a\in R\right\} is the straight line drawn through the vector x or, alternatively, the straight line drawn through the origin and point x.

The expression ax+by is called a linear combination of vectors x,y with coefficients a,b\in R. To find ax+by, you first scale x and y and then add the results.

Exercise 3. Let x,y be two nonparallel vectors on the plane. Describe the set \left\{ ax+by:a,b\in R\right\} verbally and geometrically.

Verbally, this is the set of all linear combinations ax+by that result when a,b run over R. Geometrically, this will be the whole plane R^2.

Apr 18

From minimum to infimum: Math is just a logical game

From minimum to infimum: Math is just a logical game

The true Math is a continuous exercise in logic. A good teacher makes that logic visible and tangible. A genius teacher does the logic mentally and only displays the result. I am trying to be as good a teacher as humanly possible.

Steps to develop a new definition

Step 1. Start with a simple example.

What is the minimum m of the set A=[1,\infty)? In my class everybody says m=1.

Step 2. Give the most visual definition.

Here is a good candidate: the minimum m of a set A is its leftmost point.

Step 3. Formalize the definition you gave.

We said "its point". This means m should be an element of A. "Leftmost" is formalized as m\le a for any a\in A. Thus the formal definition should be:  the number m is called a minimum of a set A if

1) m\le a for any a\in A and

2) m\in A.

Step 4. Look at bad cases when the definition fails. Try to come up with a generalized definition that would cover the bad cases.

Here is a bad case: meet the infinity

Let A=(1,\infty). In my class, some students suggested m=0 and m=2. Can you see why neither is good? See the explanation below if you can't. I would try m=1. It still satisfies part 1) of the above definition but does not belong to A. This is the place to stretch one's imagination.

Statement 1. No element of A satisfies the formal definition above.

Proof. Take any element a\in A. Then it is larger than 1 and the number a_1=\frac{a+1}{2} is halfway between a and 1. We have shown that to the left of any a\in A there is another element of A. So no element of A is leftmost.

In the game of chess, the one who thinks several moves ahead wins. A stronger version of Statement 1 is the following.

Statement 2. To the left of any a\in A there are infinitely many elements of A.

Proof. Indeed, set a_n=1+1/n,\ n=1,2,... The numbers a_n approach 1 from the right. As n increases, they become closer and closer to 1. For any given a\in A, infinitely many of these numbers will satisfy 1<a_n<a.

Step 5. When thinking about a new definition, try to exclude undesirable outcomes.

Since for m=1 condition 2) is not satisfied, one might want to omit it altogether. However, leaving only part 1) would give a bad result. For example, m=0 would satisfy such a "definition", and it would be bad for two reasons. Firstly, there is no uniqueness. We could take any number m<1. Secondly, if you take any m<1, in the interval (m,1) there are no elements of A, so there is no reason to call such a number a minimum of A.

Definition. A number m is called an infimum of A (denoted m=\inf A) if

I) m\le a for any a\in A and

II) in A, there is a sequence \{a_n\} approaching m.

The above discussion shows that \inf (1,\infty)=1.

Exercise 1. Repeat all of the above for A=(-\infty,-1). First define the maximum of a set. Then modify the definition to obtain what is called a supremum.

Exercise 2. Let A=(e,\pi), where e=2.71828... is the basis of the natural log and \pi=3.1415... is the ratio Length of circumference/Diameter of that circumference. Find the infimum and supremum of A. Simply naming them is not enough; you have to prove that both parts of the definitions are satisfied.

Exercise 3. The infimum is also defined as the largest lower bound. Can you prove equivalence of the two definitions?

Exercise 4. The supremum is also defined as the least upper bound. Can you prove equivalence of the two definitions?

Exercise 5. Consider any two sets and their union. What is the relationship between the infimums of the three sets?


Mar 18

See if this definition of a function is better than others

See if this definition of a function is better than others

Try the definitions from the Khan Academy, Math is Fun, WikipediaPaul's Online Math Notes, and there are some others. Most of them are obsessed with the input-output terminology, which is not used by professionals. I think my explanation is more visual.

"Sending letters" is associated with arrows

Situation: every citizen a from the city of A sends letters to some citizen(s) b of the city of B. In Figure 1, the arrows show that citizen a_1 sends letters to three citizens of B: they are b_1,b_2,b_3, while citizen a_2 sends letters to only b_1. Let us write f(a)=b if a sends letters to b. Here, a is called an argument (some people say an input) and b is called a value (some people say an output).

This is not a function

Figure 1. This is not a function

Definitionf is called a function if for each argument, the corresponding value is unique.

The situation depicted in Figure 1 does not describe a function, because a_1 sends letters to more than one person. The visualization of a function is given in Figure 2, where arrows going from one citizen of A to more that one citizen of B are excluded.

This is a function

Figure 2. This is a function

We finalize the function definition with some remarks. In theory, of course, instead of cities, we talk about sets. Also, instead of "sending letters" we say "the function f maps a to b". f(a) is called an image of a.

The set A is called the domain of the function f. Go to the beginning of this section and see that we said "every citizen a from the city of A..." This means that our function is defined everywhere in the domain. In practice, when you are given a function, you may have to describe the domain by removing all elements where your function is not defined.

Here is a detective story

One citizen of B mysteriously dies. Before dying, he manages to tell those around him that he had visited his correspondent in A and was probably poisoned during the visit. The local detective's task will be simpler if the poor guy had been receiving letters from only one person. We write f^{-1}(b)=a if b receives letters from a. If nobody in B receives letters from more than one sender, then f^{-1} satisfies the definition of a function.

Definition. We say that the inverse function exists if f^{-1} satisfies the definition of a function. Equivalently,

(1) whenever a_1\ne a_2, their images should be different: f(a_1)\ne f(a_2).

In Figure 2, the inverse does not exist. In Figure 3, it does. The function f takes us along the red arrow from a to its image f(a)=b. The inverse function f^{-1} takes us back along the yellow arrow from b to a=f^{-1}(b), which is called a counterimage (or preimage) of b.

This function has an inverse

Figure 3. This function has an inverse

How to find the derivative of the inverse

From now on the arguments and values will be real.

The direct function f maps a to b and the inverse function maps b back to a. This gives an obvious identity: f^{-1}(f(a))=a. Similarly, f(f^{-1}(b))=b. Since this is true for all b, we can differentiate this equation using the chain rule to get


Hence, \frac{df^{-1}(b)}{db}=\frac{1}{f'(f^{-1}(b))}, which is often written as \frac{df^{-1}(b)}{db}=\frac{1}{f'(a)} with a=f^{-1}(b).

Conditions sufficient for existence of the inverse

Condition 1. Let the function f be strictly monotone. If it is increasing, for example, then this condition means that x_1<x_2 implies f(x_1)<f(x_2). Obviously, this implies (1) and the inverse exists.

Condition 2 (condition sufficient for Condition 1). Suppose the derivative of f does not change sign. For example, if the derivative is everywhere positive (think about the exponential function), then the function is strictly increasing and we can apply Condition 1.

Example. The function f(x)=x^2 is not one-to-one on the real line. If we take as its domain [0,\infty), it will be one-to-one (it is strictly monotone and its derivative is positive on the set of positive numbers). On this set, it's inverse is f^{-1}(y)=\sqrt{y}. The equations f^{-1}(f(x))=x and f(f^{-1}(y))=y take the form \sqrt{x^2}=x and (\sqrt{y})^2=y. The equations y=x^2 and x=\sqrt{y} are equivalent.

Mar 18

Logical structure of definitions

Logical structure of definitions

Roughly speaking, there are two types of Math: formula Math, where all calculations can be seen, and abstract (or invisible) Math, which happens in the head. In my Optimization class last year, students had serious problems with abstract Math. The problems started at the definitions level, which meant the students couldn’t move any further. This post is my attempt to help them with abstract Math. There is established terminology that people use to describe rules of logic. Good luck with that terminology. Below I introduce my own. Trying to explain something at an elementary level is always a thankless task, so I apologize in advance for lapses.

Math uses the usual human logic

Example 1. In each subject, our students have to take two exams: one at KBTU and another (at a later date) at the University of London (UoL). Experience shows that the UoL grade on average is 30 points lower than mine. The passing grade at the UoL is 40. Therefore I call my student successful if he/she gets at least 70 points (out of 100) in my class.

Each definition has a preamble, which provides logical grounds for the definition. All objects under consideration are elements of one large set, which I call an encompassing set. The set we are defining is a subset of the encompassing set. We use a certain property to separate elements in the set we are defining from those which are not its elements. This defining property should make sense in the encompassing set.

In Example 1, the preamble is “In each subject … passing grade at the UoL is 40.” The encompassing set is “Students taking my class”, the set we are defining is “Successful students”. The defining property is “Getting at least 70 in my class”. This property wouldn’t make sense if we used “All KZ citizens” as the encompassing set.

Let E denote the encompassing set, D the set we are defining and \bar{D} its complement. The original definition (of D) is called a direct definition. The definition of \bar{D} is its opposite. The direct definition and its opposite should be stated in such a way that D and \bar{D} a) do not intersect and b) together cover the whole encompassing set. In proofs by contradiction often it is necessary to formulate the opposite definition.

The set \bar{D} can be defined using two types of definitions. In the first type we directly negate the property P that is used to define D. I call such a definition a negative opposite. In the second type we use the opposite of the property P. I call such a definition a positive opposite. The opposite of “Getting at least 70 in my class” is “Getting less than 70 in my class”.

Example 2. I call a student unsuccessful if he/she does not get a grade of at least 70 in my class (negative opposite). I call a student unsuccessful if he/she gets a grade lower than 70 in my class (positive opposite).

In the negative opposite we don’t change the defining property; we simply say that it is not satisfied. In the positive opposite we pass from the property to its opposite and say that the elements satisfy the opposite property. The negative opposite contains negation “does not”. The positive opposite does not contain it (it may contain negations in more complex cases).

... but applies it to unusual objects

One and the same definition can be formulated in several different (however, equivalent) ways. Always try to find the most geometric form and then establish equivalence of different definitions.

Example 3. A set A is called bounded from above if there is a number M such that A is a subset of a half-infinite interval (-\infty,M]A\subset(-\infty,M]. Any such M is called an upper bound for A.

I start with this version of the definition because it is the most visual. Note that we require existence of a number M with a certain property. A slightly shorter version of the above definition is:

A set A is called bounded from above if A is a subset of a half-infinite interval (-\infty,M], for some M.

In this version, the word “some” indicates existence. When we say “for all”, we affirm universality. A replacement of an existence requirement by a universal requirement drastically changes the definition. See what we get from Example 3 by such a replacement.

Example 4. A set A is called bounded from above if for any number M, A is a subset of a half-infinite interval (-\infty,M]. Equivalently, a set A is called bounded from above if A is a subset of a half-infinite interval (-\infty,M], for any M.

Can you tell for which A this definition holds? Some people can answer this question without hesitation. If you are not one of them, try to move the number M. Many mathematical arguments require a choice of an object that would allow the researcher to prove or disprove a statement. In Example 4, we say “for any M”, and it is UP TO YOU to try any M and select the ones which show what is going on.

Example 5. An inclusion relation in terms of sets A\subset(-\infty,M] equivalently in terms of set elements takes the form x\le M for all x\in A. If we replace the set relation by the element-wise relation in Example 3, we obtain a longer definition: a set A is called bounded from above if there is a number M such that x\le M for all x\in A.

This is the place people start having problems. The errors that I saw include confusion of “there is” (existence) and “for all” (universality). See what is wrong with the following three definitions (what does not correspond to the intuition of being “bounded from above”; we also don't want A to be empty).

Example 6. A set A is called bounded from above if for any number M there exists x\in A such that x\le M.


Example 7. A set A is called bounded from above if for any number M and for any x\in A one has x\le M.


Example 8. A set A is called bounded from above if there is x\in A such that for any number M one has x\le M.
Example 9. A set A is called bounded from above if for any x\in A there exists a number M such that x\le M.


Remark. In Example 6, first M is chosen and x depends on it. In Example 7, the choices of M and x are independent. In Example 8, x is fixed and M is arbitrary. Finally, in Example 9, first x is chosen and the choice of M depends on x.

In the next exercises, start with the most graphic version. In case of doubt, do everything step by step (describe the encompassing set, defining property etc.).

Exercise 1. Define a set bounded from below using a) sets and b) element-wise terminology. In both cases formulate the opposite definition and the definition of a lower bound.

Exercise 2. Define a bounded set (that is, bounded from below and above) using a) sets and b) element-wise terminology. In both cases formulate the opposite definition.