18
Oct 20

## People need real knowledge

### Traffic analysis

The number of visits to my website has exceeded 206,000. This number depends on what counts as a visit. An external counter, visible to everyone, writes cookies to the reader's computer and counts repeated visits from one reader as one; by its count, the number of individual readers has reached 23,000. The external counter does not give any further statistics, so all the numbers below come from the internal counter, which is visible only to the site owner.

I have a high percentage of complex content. After reading one post, the reader finds that the answer he is looking for depends on preliminary material. He starts digging into it and then has to go deeper and deeper. Hence the number 206,000: one reader visits the site on average 9 times, on different days. Sometimes a visitor goes from one post to another by link on the same day. Hence another figure: 310,000 reads.

I originally wrote simple things about basic statistics. Then I began to write accompanying materials for each advanced course that I taught at Kazakh-British Technical University (KBTU). The shift in the number and level of readers shows that people need deep knowledge, not one-day clickbait.

For example, my simple post on basic statistics has been read 2,300 times. In comparison, the more complex post on the Cobb-Douglas function has been read 7,100 times. This function is widely used in economics to model consumer preferences (utility function) and producer capabilities (production function). All textbooks teach it using two-dimensional graphs, as P. Samuelson proposed 85 years ago. In fact, the two-dimensional graphs are projections of a three-dimensional graph, which I show, making everything clear and obvious.
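As a minimal numerical sketch of the function itself (my own illustration; the parameter values are assumptions, not from the post): with $\alpha+\beta=1$ the Cobb-Douglas function has constant returns to scale, the very property that the three-dimensional graph makes visible.

```python
# Cobb-Douglas function f(K, L) = A * K**alpha * L**beta
# (parameter values below are illustrative, not from the post)
def cobb_douglas(K, L, A=1.0, alpha=0.3, beta=0.7):
    return A * K**alpha * L**beta

q = cobb_douglas(4.0, 9.0)

# With alpha + beta = 1, doubling both inputs doubles output
# (constant returns to scale):
q2 = cobb_douglas(8.0, 18.0)
print(round(q2 / q, 6))  # 2.0
```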

The answer to one of the University of London (UoL) exam problems attracted 14,300 readers. It is so complicated that I split the answer into two parts, and there are links to additional material. On the UoL exam, students have to solve this problem in 20-30 minutes, which even I would not be able to do.

### Why my site is unique

My site is unique in several ways. Firstly, I tell the truth about the AP Statistics books. This is a basic statistics course for those who need to interpret tables, graphs and simple statistics. If you have a head on your shoulders, and not a Google search engine, all you need to do is read a small book and look at the solutions. I praise one such book in my reviews. You don't need to attend a two-semester course and read an 800-page book. Moreover, one doesn't need 140 high-quality color photographs that have nothing to do with science and double the price of a book.

Many AP Statistics consumers believe that learning should be fun. Such people are attracted by a book with anecdotes that have no relation to statistics or to the lives of scientists. In the West, everyone depends on everyone else, and therefore all reviews are written in superlatives and are carefully streamlined. Thank God, I do not depend on the Western labor market, and therefore I tell the truth. Part of my criticism, including of the statistics textbook selected for the program "100 Textbooks" of the Ministry of Education and Science of Kazakhstan (MES), is on Facebook.

Secondly, I have the world's only online, free, complete matrix algebra tutorial with all the proofs. Free courses on Udemy, Coursera and edX are not far from AP Statistics in terms of level. Courses at MIT and Khan Academy are also simpler than mine, but have the advantage of being given in video format.

The third distinctive feature is that I help UoL students. It is a huge organization spanning 17 universities and colleges in the UK, with many branches in other parts of the world. The Economics program was developed by the London School of Economics (LSE), one of the world's leading universities.

The problem with LSE courses is that they are very difficult. After the exams, LSE posts short recommendations on the Internet for solving the problems, of the type: here you need to use such and such theory and such and such idea. Complete solutions are not given, for two reasons: LSE does not want to help future examinees, and sometimes the problems or solutions contain errors (who doesn't make errors?). Moreover, the short recommendations are deleted after a year. My site is the only place in the world with complete solutions to the most difficult problems of the last few years. It is not for nothing that the solution to the problem noted above attracted 14,000 visits.

Fourthly, my site is unique in terms of the variety of material: statistics, econometrics, algebra, optimization, finance.

The average number of visits is about 100 per day. When it's time for students to take exams, it jumps to 1,000-2,000. The total amount of material created in 5 years is equivalent to 5 textbooks. Creating one post takes from 2 hours to a full day, depending on the level. After I published this analysis of the site traffic on Facebook, my colleague Nurlan Abiev decided to write posts for the site. I pay for the domain myself, \$186 per year. It would be nice to make the site accessible to students and schoolchildren of Kazakhstan, but I don't have time to translate from English. Once I looked at the MES requirements for approval of electronic textbooks. They want several copies of printouts of all (!) materials and a substantial payment for the examination of the site. As a result, all my efforts to create and maintain the site have so far been a personal initiative without any support from the MES and its Committee on Science.

26
May 20

## My book in Basic Statistics

Figure. Number of reads of my book on October 29, 2020

For more information see my site.

17
Mar 19

## AP Statistics the Genghis Khan way 2

Last semester I tried to explain theory through numerical examples. The results were terrible. Even the best students didn't live up to my expectations. The midterm grades were so low that I did something I had never done before: I allowed my students to write an analysis of the midterm at home. Those who were able to verbally articulate the answers to me received a bonus that allowed them to pass the semester. This semester I made a U-turn. I announced that in the first half of the semester we would concentrate on theory, and we followed this methodology. Out of 35 students, 20 significantly improved their performance and 15 remained where they were.

### Midterm exam, version 1

#### 1. General density definition (6 points)

a. Define the density $p_X$ of a random variable $X.$ Draw the density of heights of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral $\int_{-\infty}^0 p_X(t)dt?$ Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are basketball players on your graph? Write down the corresponding expression for probability.

f. Where are dwarfs on your graph? Write down the corresponding expression for probability.

This question is about the interval formula. In each case students have to write the equation for the probability and the corresponding integral of the density. At this level, I don't talk about the distribution function and introduce the density through the interval formula.

#### 2. Properties of means (8 points)

a. Define a discrete random variable and its mean.

b. Define linear operations with random variables.

c. Prove linearity of means.

d. Prove additivity and homogeneity of means.

e. How much is the mean of a constant?

f. Using induction, derive the linearity of means for the case of $n$ variables from the case of two variables (3 points).

#### 3. Covariance properties (6 points)

a. Derive linearity of covariance in the first argument when the second is fixed.

b. How much is covariance if one of its arguments is a constant?

c. What is the link between variance and covariance? If you know one of these functions, can you find the other (there should be two answers)? (4 points)

#### 4. Standard normal variable (6 points)

a. Define the density $p_z(t)$ of a standard normal variable.

b. Why is the function $p_z(t)$ even? Illustrate this fact on the plot.

c. Why is the function $f(t)=tp_z(t)$ odd? Illustrate this fact on the plot.

d. Justify the equation $Ez=0.$

e. Why is $V(z)=1?$

f. Let $t>0.$ Show on the same plot the areas corresponding to the probabilities $A_1=P(0<z<t),$ $A_2=P(z>t),$ $A_3=P(z<-t),$ $A_4=P(-t<z<0).$ Write down the relationships between $A_1,...,A_4.$

#### 5. General normal variable (3 points)

a. Define a general normal variable $X.$

b. Use this definition to find the mean and variance of $X.$

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters $\sigma=2,$ $\mu=3.$

### Midterm exam, version 2

#### 1. General density definition (6 points)

a. Define the density $p_X$ of a random variable $X.$ Draw the density of work experience of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral $\int_{-\infty}^0 p_X(t)dt?$ Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are retired people on your graph? Write down the corresponding expression for probability.

f. Where are young people (up to 25 years old) on your graph? Write down the corresponding expression for probability.

#### 2. Variance properties (8 points)

a. Define the variance of a random variable. Why is it non-negative?

b. Derive the formula for the variance of a linear combination of two variables.

c. How much is the variance of a constant?

d. What is the formula for the variance of a sum? What do we call homogeneity of variance?

e. Which is larger: $V(X+Y)$ or $V(X-Y)$? (2 points)

f. One investor has 100 shares of Apple, another has 200 shares. Which investor's portfolio has larger variability? (2 points)

#### 3. Poisson distribution (6 points)

a. Write down the Taylor expansion and explain the idea. How are the Taylor coefficients found?

b. Use the Taylor series for the exponential function to define the Poisson distribution.

c. Find the mean of the Poisson distribution. What is the interpretation of the parameter $\lambda$ in practice?

#### 4. Standard normal variable (6 points)

a. Define the density $p_z(t)$ of a standard normal variable.

b. Why is the function $p_z(t)$ even? Illustrate this fact on the plot.

c. Why is the function $f(t)=tp_z(t)$ odd? Illustrate this fact on the plot.

d. Justify the equation $Ez=0.$

e. Why is $V(z)=1?$

f. Let $t>0.$ Show on the same plot the areas corresponding to the probabilities $A_1=P(0<z<t),$ $A_2=P(z>t),$ $A_3=P(z<-t),$ $A_4=P(-t<z<0).$ Write down the relationships between $A_1,...,A_4.$

#### 5. General normal variable (3 points)

a. Define a general normal variable $X.$

b. Use this definition to find the mean and variance of $X.$

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters $\sigma=2,$ $\mu=3.$

16
Mar 19

## AP Statistics the Genghis Khan way 1

Recently I enjoyed reading Jack Weatherford's "Genghis Khan and the Making of the Modern World" (2004). I was reading the book with a specific question in mind: what were the main reasons for the success of the Mongols? Here you can see the list of their innovations, some of which were in fact adapted from the nations they subjugated. But what was the main driving force behind those innovations? The conclusion I came to is that Genghis Khan was a brilliant psychologist. He used what he knew about individual and social psychology to constantly improve the government of his empire. I am no Genghis Khan, but I try to base my teaching methods on my knowledge of student psychology.

### Problems and suggested solutions

Steven Krantz in his book (How to Teach Mathematics, second edition, 1998; I don't remember the page) says something like this: if you want your students to do something, arrange your classes so that they do it in class.

Problem 1. Students mechanically write down what the teacher says and writes.

Solution. I don't allow my students to write while I am explaining the material. When I explain, their task is to listen and try to understand.
I invite them to ask questions and prompt me to write more explanations and comments. After they all say "We understand," I clean the board, and then they write down whatever they understood and remembered.

Problem 2. Students are not used to analyzing what they read or write.

Solution. After students finish writing, I ask them to exchange notebooks and check each other's writing. It's easier for them to do this while everything is fresh in their memory. I bought and distributed red pens. When they see that something is missing or wrong, they have to write in red: errors and omissions must stand out. Thus, right there in class, students repeat the material twice.

Problem 3. Students don't study at home.

Solution. I let my students know in advance what the next quiz will be about. Even with this knowledge, most of them don't prepare at home. Before the quiz I give them about half an hour to review and discuss the material (this is at least the third repetition). We start the quiz when they say they are ready.

Problem 4. Students don't understand that active repetition (writing without looking at one's notes) is much more productive than passive repetition (just reading the notes).

Solution. Each time before discussion sessions I distribute scratch paper and urge students to write, not just read or talk. About half of them follow my recommendation. Their desire to keep their notebooks neat is not the last of their considerations. The solution to Problem 1 also hinges on active repetition.

Problem 5. If students work and are evaluated individually, there is usually little or no interaction between them.

Solution. My class is divided into teams (currently I have teams of two to six people). I randomly select one person from each team to write the quiz. That person's grade is the team's grade. This forces better students to coach others and weaker students to seek help.

Problem 6. Some students don't want to work in teams. They are usually either good students who don't want to suffer because of weak team members, or weak students who don't want their low grades to harm other team members.

Solution. The good students usually argue that it's not fair if their grade becomes lower because of somebody else's fault. My answer to them is that the meaning of fairness depends on the definition. In my grading scheme, 30 points out of 100 are allocated for team work and the rest for individual achievement. Therefore I never allow good students to work individually. I want them to be my teaching assistants and help other students. I tell them that I may reward good students with a bonus at the end of the semester. In some cases I allow weak students to write quizzes individually, but only if the team so requests; the request of the weak student alone doesn't matter. The weak student still has to participate in team discussions.

Problem 7. There is no accumulation of theoretical knowledge (a flat learning curve).

Solution.

a) Most students come from high school with little experience in algebra. I raise the level gradually and emphasize understanding. Students never see multiple-choice questions in my classes. They also know that correct answers without explanations will be discarded.

b) Normally, during my explanations I fill the board. The amount of information the students have to remember is substantial and increases over time. If you know a better way to develop one's internal vision, let me know.

c) I don't believe in learning theory by doing applied exercises. After explaining the theory, I formulate it as a series of theoretical exercises. I give the theory in large, logically consistent blocks so that students see the system. Half of the exam questions are theoretical (students have to provide proofs and derivations) and the other half applied.

d) The right motivation can be of two types, theoretical or applied, and I never substitute one for the other.

Problem 8. In low-level courses you need to conduct frequent evaluations to keep your students in working shape. Multiply that by the number of students, and you get a serious teaching overload.

Solution. Once, at a teaching conference in Prague, a colleague from New York boasted that he grades 160 papers per week. Evaluating one paper per team saves you from that hell.

### Outcome

At the beginning of the academic year I had 47 students. In the second semester, 12 students dropped the course entirely or enrolled in Stats classes taught by other teachers. Based on current grades, I expect 15 more to fail. Thus, after the first year I'll have about 20 students in my course (if they don't fail other courses). These students will master statistics at the level of my book.

8
Oct 17

## Reevaluating probabilities based on a piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find the probabilities of all elementary events. This post builds upon the post on significance level and power of test, including the notation; be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2. Consider a test that has a Type I error rate of 5% and power of 50%. Suppose that, before running the test, the researcher thinks that the null and the alternative are equally likely.

1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?
2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote the events R = {Reject null}, A = {fAil to reject null}, T = {null is True}, F = {null is False}.
Then we are given:

(1) $P(F)=0.5,\ P(T)=0.5;$

(2) $P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05,\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5.$

(1) and (2) show that we can find $P(R\cap T)=0.025$ and $P(R\cap F)=0.25,$ and therefore also $P(A\cap T)=0.475$ and $P(A\cap F)=0.25.$ Once we know the probabilities of the elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug the probabilities into

$P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}=\frac{0.25}{0.275}\approx 0.91.$

Answering the second question: just plug the probabilities into

$P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}=\frac{0.475}{0.725}\approx 0.66.$

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.

6
Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, Type I and Type II errors, and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

### Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (say, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature \ Decision taken | Fail to reject null | Reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video fills in the blanks.

Video. Significance level and power of test

The conclusion from the video is that

$\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level},$

$\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{correctly rejecting a false null})=\text{power}.$

4
Sep 17

## Geometry related to derivatives

In a sequence of videos I explain the main ideas pursued by the fathers of Calculus, Isaac Newton and Gottfried Wilhelm Leibniz.

### Derivative equals speed

Here we assume that a point moves along a straight line and try to find its speed as usual: we divide the distance traveled by the time it takes to travel it. The ratio is an average speed over a time interval. As we reduce the length of the time interval, we get a better and better approximation to the exact (instantaneous) speed at a point in time.

Video 1. Derivative is speed

### Position of a point as a function of time

Visualizing the movement of a point on a straight line is inconvenient because it is difficult to correlate the point's position to time. It is much better to visualize the movement on the space-time plane, where the horizontal axis is for time and the vertical axis is for the point's position.

Video 2. Position of a point as a function of time

### Measuring the slope of a straight line

A little digression: how do you measure the slope of a straight line if you know the values of the function at different points?

Video 3. Measuring the slope of a straight line

### Derivative as the slope of a tangent line

This is like putting two and two together: we apply the previous definition to the slope of a secant drawn through two points on a graph. Then it remains to notice that the secant approaches the tangent line as the second point approaches the first.

Video 4. Derivative as tangent slope

### From function to its derivative

This is a very useful exercise that later allows one to come up with the optimization conditions, called the first-order and second-order conditions.

Video 5. From function to its derivative

### Conclusion

Let $P(t)$ be some function and fix an initial point $t_1.$ The derivative $P^\prime(t_1)$ is defined as the limit

$P^\prime(t_1)=\lim_{t_2\rightarrow t_1}\frac{P(t_2)-P(t_1)}{t_2-t_1}.$

When $P(t)$ describes the movement of a point along a straight line, the derivative gives the speed of that point. When $P(t)$ is drawn on a plane, the derivative gives the slope of the tangent line to the graph.

11
Aug 17

## Violations of classical assumptions 2

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables." Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

One of the classical assumptions is homoscedasticity: all errors have the same variance, $Var(e_i)=\sigma^2$ for all $i.$ We discuss its opposite, heteroscedasticity: not all errors have the same variance. It would be wrong to write this as $Var(e_i)\ne\sigma^2$ for all $i$ (which would mean that all errors have variance different from $\sigma^2$). You can write that not all $Var(e_i)$ are the same, but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that the variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as $x$ increases, the variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example.
The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least \$5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of \$200 bln, which is comparable to the Kazakhstan GDP. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, if the underground economy is not too big. Often the assumption that the standard deviation of the regression error is proportional to one of regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
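As a sketch of the second option (standard library only; the model, the coefficients, and the auxiliary regression in the spirit of the Breusch-Pagan test are my own illustration, not from the post), one can simulate errors whose standard deviation is proportional to the regressor, fit OLS, and then regress the squared residuals on $x$. A clearly positive slope in the auxiliary regression signals heteroscedasticity.

```python
import random

random.seed(0)

def ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Simulate y = 2 + 3x + e with sd(e) proportional to x (heteroscedasticity)
n = 500
x = [random.uniform(1.0, 10.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0.0, 0.5 * xi) for xi in x]

a, b = ols(x, y)
resid_sq = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]

# Auxiliary regression of squared residuals on x:
# a positive slope indicates variance growing with x
_, slope = ols(x, resid_sq)
print(slope > 0)
```

The same auxiliary-regression idea, with proper test statistics, is what canned heteroscedasticity tests implement.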

7
Aug 17

## Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

## Violations of the first three assumptions

We consider the simple regression

(1) $y_i=a+bx_i+e_i$

Make sure to review the assumptions. Their numbering and names sometimes differ from those in Dougherty's book. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If the variance of the regressor is zero, the OLS estimator does not exist. The fitted line would have to be vertical, and you can instead regress $x$ on $y$. In multiple regression, violation of the existence condition leads to multicollinearity, and that's where economic considerations are important.
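A tiny constructed example (mine, not from the post) shows the breakdown directly: the OLS slope is $\hat b=\sum(x_i-\bar x)(y_i-\bar y)/\sum(x_i-\bar x)^2,$ and a zero-variance regressor makes the denominator zero.

```python
x = [5.0, 5.0, 5.0, 5.0]   # regressor with zero variance
y = [1.0, 2.0, 3.0, 4.0]

mx = sum(x) / len(x)
denominator = sum((xi - mx) ** 2 for xi in x)  # sum of squared deviations
print(denominator)  # 0.0 -> the OLS slope b = S_xy / S_xx is undefined
```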

A2. The convenience condition is so called because it is adopted for convenience: when it is violated, that is, when the regressor is stochastic, there are still ways to deal with the problem, namely finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided into two: 1) the means of the errors are the same, $Ee_i=c\ne 0$ for all $i$, and 2) the means are different. Read the post about centering and see if you can come up with the answer to the first question. The means may be different because of the omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.
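For the first question, a simulation can confirm the centering argument (a sketch of my own; the numbers are assumptions): writing $e_i=c+(e_i-c)$ shows the model is equivalent to one with intercept $a+c$ and zero-mean errors, so OLS estimates the slope correctly while the intercept absorbs $c$.

```python
import random

random.seed(1)

# True model: y = 1 + 2x + e, where the errors have common mean c = 3
a_true, b_true, c = 1.0, 2.0, 3.0
n = 10000
x = [random.uniform(0.0, 5.0) for _ in range(n)]
y = [a_true + b_true * xi + random.gauss(c, 1.0) for xi in x]

mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

# The slope estimate is near b_true = 2;
# the intercept estimate is near a_true + c = 4
print(round(b, 2), round(a, 2))
```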

Violations of A4 and A5 will be treated later.

26
Jul 17

## Nonlinear least squares

Here we explain the idea, illustrate the possible problems in Mathematica and, finally, show the implementation in Stata.

## Idea: minimize RSS, as in ordinary least squares

Observations come in pairs $(x_1,y_1),...,(x_n,y_n)$. In the case of ordinary least squares, we approximated the $y$'s with functions linear in the parameters, possibly nonlinear in the $x$'s. Now we use a function $f(a,b,x_i)$ which may be nonlinear in $a,b$. We still minimize RSS, which takes the form $RSS=\sum r_i^2=\sum(y_i-f(a,b,x_i))^2$. The nonlinear least squares estimators are the values $a,b$ that minimize RSS. In general, it is difficult to find a closed-form solution, so in practice software such as Stata is used for RSS minimization.
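As a sketch of the definition (the data and parameter values are my own, assumed for illustration), RSS for a model that is nonlinear in a parameter can be computed and compared across candidate parameter values directly:

```python
import math

# Toy data (assumed for illustration): y roughly follows 2*exp(0.5*x)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.1, 3.2, 5.6, 8.9]

def f(a, b, x):                 # model nonlinear in the parameter b
    return a * math.exp(b * x)

def rss(a, b):
    return sum((y - f(a, b, x)) ** 2 for x, y in zip(xs, ys))

# RSS ranks candidate parameter values: the smaller, the better the fit
print(rss(2.0, 0.5) < rss(1.0, 1.0))  # True
```

Nonlinear least squares simply searches for the $a,b$ pair that makes this sum as small as possible.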

## Simplified idea and problems in one-dimensional case

Suppose we want to minimize a function $f(x)$. The Newton algorithm (the default in Stata) is an iterative procedure consisting of the following steps:

1. Select the initial value $x_0$.
2. Find the derivative of $f$ at $x_0$. Make a small step in the descent direction (indicated by the derivative) to obtain the next value $x_1$.
3. Repeat Step 2, using $x_1$ as the starting point, until the difference between the values of the objective function at two successive points becomes small. The last point $x_n$ will approximate the minimizing point.
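The three steps above can be sketched in one dimension (a toy objective of my own choosing, with a numerical derivative standing in for the analytic one):

```python
def f(x):                     # toy objective with minimum at x = 3
    return (x - 3.0) ** 2 + 1.0

def fprime(x, h=1e-6):        # numerical derivative (central difference)
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.0                       # step 1: initial value
step = 0.1
for _ in range(10000):        # steps 2-3: descend until f barely changes
    x_new = x - step * fprime(x)
    if abs(f(x_new) - f(x)) < 1e-12:
        x = x_new
        break
    x = x_new

print(round(x, 3))  # 3.0
```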

Problems:

1. The minimizing point may not exist.
2. When it exists, it may not be unique. In general, there is no way to find out how many local minima there are and which ones are global.
3. The minimizing point depends on the initial point.

See Video 1 for illustration in the one-dimensional case.

Video 1. NLS geometry
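Problems 2 and 3 are easy to reproduce (a constructed objective of my own, not one of Dougherty's examples): $f(x)=x^4-4x^2$ has two global minima, at $\pm\sqrt{2}$, and the descent iteration converges to different ones depending on the initial point.

```python
def f(x):                       # two global minima, at +sqrt(2) and -sqrt(2)
    return x ** 4 - 4.0 * x ** 2

def minimize(x, step=0.01, iterations=2000):
    """Crude derivative descent from a given initial point."""
    for _ in range(iterations):
        g = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6   # numerical derivative
        x -= step * g
    return x

# The answer depends on where the iteration starts
print(round(minimize(1.0), 3), round(minimize(-1.0), 3))  # 1.414 -1.414
```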

## Problems illustrated in Mathematica

Here we look at three examples of nonlinear functions, two of which are considered in Dougherty. The first is a power function (it can be linearized by applying logs) and the second is an exponential function (it cannot be linearized). The third function gives rise to two minima. The possibilities are illustrated in Mathematica.

Video 2. NLS illustrated in Mathematica

## Finally, implementation in Stata

Here we show how to 1) generate a random vector, 2) create a vector of initial values, and 3) program a nonlinear dependence.

Video 3. NLS implemented in Stata