AP Statistics the Genghis Khan way 2
Last semester I tried to explain theory through numerical examples. The results were terrible. Even the best students didn't live up to my expectations. The midterm grades were so low that I did something I had never done before: I allowed my students to write an analysis of the midterm at home. Those who were able to verbally articulate the answers to me received a bonus that allowed them to pass the semester.
This semester I made a U-turn. I announced that in the first half of the semester we would concentrate on theory, and we followed this methodology. Out of 35 students, 20 significantly improved their performance and 15 remained where they were.
Midterm exam, version 1
1. General density definition (6 points)
a. Define the density of a random variable $X$. Draw the density of heights of adults, making simplifying assumptions if necessary. Don't forget to label the axes.
b. According to your plot, how much is the integral Explain.
c. Why can't the density be negative?
d. Why should the total area under the density curve be 1?
e. Where are basketball players on your graph? Write down the corresponding expression for probability.
f. Where are dwarfs on your graph? Write down the corresponding expression for probability.
This question is about the interval formula. In each case students have to write the equation for the probability and the corresponding integral of the density. At this level, I don't talk about the distribution function and introduce the density by the interval formula.
2. Properties of means (8 points)
a. Define a discrete random variable and its mean.
b. Define linear operations with random variables.
c. Prove linearity of means.
d. Prove additivity and homogeneity of means.
e. How much is the mean of a constant?
f. Using induction, derive the linearity of means for the case of $n$ variables from the case of two variables (3 points).
3. Covariance properties (6 points)
a. Derive linearity of covariance in the first argument when the second is fixed.
b. How much is covariance if one of its arguments is a constant?
c. What is the link between variance and covariance? If you know one of these functions, can you find the other (there should be two answers)? (4 points)
4. Standard normal variable (6 points)
a. Define the density of a standard normal.
b. Why is the function even? Illustrate this fact on the plot.
c. Why is the function odd? Illustrate this fact on the plot.
d. Justify the equation
e. Why is
f. Let Show on the same plot areas corresponding to the probabilities Write down the relationships between
5. General normal variable (3 points)
a. Define a general normal variable
b. Use this definition to find the mean and variance of
c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters
Midterm exam, version 2
1. General density definition (6 points)
a. Define the density of a random variable $X$. Draw the density of work experience of adults, making simplifying assumptions if necessary. Don't forget to label the axes.
b. According to your plot, how much is the integral Explain.
c. Why can't the density be negative?
d. Why should the total area under the density curve be 1?
e. Where are retired people on your graph? Write down the corresponding expression for probability.
f. Where are young people (up to 25 years old) on your graph? Write down the corresponding expression for probability.
2. Variance properties (8 points)
a. Define variance of a random variable. Why is it nonnegative?
b. Define the formula for variance of a linear combination of two variables.
c. How much is variance of a constant?
d. What is the formula for variance of a sum? What do we call homogeneity of variance?
e. What is larger: or ? (2 points)
f. One investor has 100 shares of Apple, another has 200 shares. Which investor's portfolio has larger variability? (2 points)
3. Poisson distribution (6 points)
a. Write down the Taylor expansion and explain the idea. How are the Taylor coefficients found?
b. Use the Taylor series for the exponential function to define the Poisson distribution.
c. Find the mean of the Poisson distribution. What is the interpretation of the parameter in practice?
4. Standard normal variable (6 points)
a. Define the density of a standard normal.
b. Why is the function even? Illustrate this fact on the plot.
c. Why is the function odd? Illustrate this fact on the plot.
d. Justify the equation
e. Why is
f. Let Show on the same plot areas corresponding to the probabilities Write down the relationships between
5. General normal variable (3 points)
a. Define a general normal variable
b. Use this definition to find the mean and variance of
c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters
AP Statistics the Genghis Khan way 1
Recently I enjoyed reading Jack Weatherford's "Genghis Khan and the Making of the Modern World" (2004). I was reading the book with a specific question in mind: what were the main reasons for the success of the Mongols? Here you can see the list of their innovations, some of which were in fact adapted from the nations they subjugated. But what was the main driving force behind those innovations? The conclusion I came to is that Genghis Khan was a psychologist of genius. He used what he knew about individual and social psychology to constantly improve the government of his empire.
I am no Genghis Khan but I try to base my teaching methods on my knowledge of student psychology.
Problems and suggested solutions
Steven Krantz in his book (How to Teach Mathematics, second edition, 1998; I don't remember the page) says something like this: if you want your students to do something, arrange your classes so that they do it in class.
Problem 1. Students mechanically write down what the teacher says and writes.
Solution. I don't allow my students to write while I am explaining the material. When I explain, their task is to listen and try to understand. I invite them to ask questions and prompt me to write more explanations and comments. After they all say "We understand", I clean the board and then they write down whatever they understood and remembered.
Problem 2. Students are not used to analyzing what they read or write.
Solution. After students finish their writing, I ask them to exchange notebooks and check each other's writings. It's easier for them to do this while everything is fresh in their memory. I bought and distributed red pens. When they see that something is missing or wrong, they have to write in red. Errors or omissions must stand out. Thus, right there in the class students repeat the material twice.
Problem 3. Students don't study at home.
Solution. I let my students know in advance what the next quiz will be about. Even with this knowledge, most of them don't prepare at home. Before the quiz I give them about half an hour to repeat and discuss the material (this is at least the third repetition). We start the quiz when they say they are ready.
Problem 4. Students don't understand that active repetition (writing without looking at one's notes) is much more productive than passive repetition (just reading the notes).
Solution. Each time before discussion sessions I distribute scratch paper and urge students to write, not just read or talk. About half of them follow my recommendation. Their desire to keep their notebooks neat is not their last consideration. The solution to Problem 1 also hinges upon active repetition.
Problem 5. If students work and are evaluated individually, usually there is no or little interaction between them.
Solution. My class is divided into teams (currently I have teams of two to six people). I randomly select one person from each team to write the quiz. That person's grade is the team's grade. This forces better students to coach others and weaker students to seek help.
Problem 6. Some students don't want to work in teams. They are usually either good students, who don't want to suffer because of weak team members, or weak students, who don't want their low grades to harm other team members.
Solution. The good students usually argue that it's not fair if their grade becomes lower because of somebody else's fault. My answer to them is that the meaning of fairness depends on the definition. In my grading scheme, 30 points out of 100 are allocated for team work and the rest for individual achievements. Therefore I never allow good students to work individually. I want them to be my teaching assistants and help other students. While doing so, I tell them that I may reward good students with a bonus at the end of the semester. In some cases I allow weak students to write quizzes individually but only if the team so requests. The request of the weak student doesn't matter. The weak student still has to participate in team discussions.
Problem 7. There is no accumulation of theoretical knowledge (flat learning curve).
Solution. a) Most students come from high school with little experience in algebra. I raise the level gradually and emphasize understanding. Students never see multiple choice questions in my classes. They also know that right answers without explanations will be discarded.
b) Normally, during my explanations I fill out the board. The amount of the information the students have to remember is substantial and increases over time. If you know a better way to develop one's internal vision, let me know.
c) I don't believe in learning the theory by doing applied exercises. After explaining the theory I formulate it as a series of theoretical exercises. I give the theory in large, logically consistent blocks for students to see the system. Half of the exam questions are theoretical (students have to provide proofs and derivations) and the other half are applied.
d) The right motivation can be of two types: theoretical or applied, and I never substitute one for another.
Problem 8. In low-level courses you need to conduct frequent evaluations to keep your students in working shape. Multiply that by the number of students, and you get a serious teaching overload.
Solution. Once at a teaching conference in Prague my colleague from New York boasted that he grades 160 papers per week. Evaluating one paper per team saves you from that hell.
Outcome
At the beginning of the academic year I had 47 students. In the second semester 12 students dropped the course entirely or enrolled in Stats classes taught by other teachers. Based on current grades, I expect 15 more students to fail. Thus, after the first year I'll have about 20 students in my course (if they don't fail other courses). These students will master statistics at the level of my book.
Little tricks for AP Statistics
This year I am teaching AP Statistics. If things continue the way they are, about half of the class will fail. Here is my diagnosis and how I am handling the problem.
On the surface, the students lack algebra training but I think the problem is deeper: many of them have underdeveloped cognitive abilities. Their perception is slow, memory is limited, analytical abilities are rudimentary and they are not used to working at home. Limited resources require careful allocation.
Terminology
Short and intuitive names are better than two-word professional names.
Instead of "sample space" or "probability space" say "universe". The universe is the widest possible event, and nothing exists outside it.
Instead of "elementary event" say "atom". Simplest possible events are called atoms. This corresponds to the theoretical notion of an atom in measure theory (an atom is a measurable set which has positive measure and contains no set of smaller positive measure).
Then the formulation of classical probability becomes short. Let $n$ denote the number of atoms in the universe and let $n_A$ be the number of atoms in event $A$. If all atoms are equally likely (have equal probabilities), then $P(A)=n_A/n$.
The clumsy "mutually exclusive events" are better replaced by more visual "disjoint sets". Likewise, instead of "collectively exhaustive events" say "events that cover the universe".
The combination of "mutually exclusive" and "collectively exhaustive" events is beyond comprehension for many. I say: if events are disjoint and cover the universe, we call them tiles. To support this definition, play one of the jigsaw puzzles on screen (Video 1) and produce the picture from Figure 1.
Video 1. Tiles (disjoint events that cover the universe)
The philosophy of team work
We are in the same boat. I mean the big boat. Not the class. Not the university. It's the whole country. We depend on each other. Failure of one may jeopardize the wellbeing of everybody else.
You work in teams. You help each other to learn. My lectures and your presentations are just the beginning of the journey of knowledge into your heads. I cannot control how it settles there. Be my teaching assistants, share your big and little discoveries with your classmates.
I don't just preach about you helping each other. I force you to work in teams. 30% of the final grade is allocated to team work. Team work means joint responsibility. You work on assignments together. I randomly select a team member for reporting. His or her grade is what each team member gets.
This kind of team work is incompatible with the Western obsession with grades privacy. If I say my grade is nobody's business, by extension I consider the level of my knowledge a private issue. This will prevent me from asking for help and admitting my errors. The situation when students hide their errors and weaknesses from others also goes against the ethics of many workplaces. In my class all grades are public knowledge.
In some situations, keeping the grade private is technically impossible. Conducting a competition without announcing the points won is impossible. If I catch a student cheating, I announce the failing grade immediately, as a warning to others.
To those of you who think team-based learning is unfair to better students I repeat: 30% of the final grade is given for team work, not for personal achievements. The other 70% is where you can shine personally.
Breaking the wall of silence
Team work serves several purposes.
Firstly, joint responsibility helps break communication barriers. See in Video 2 my students working in teams on classroom assignments. The situation when a weaker student is too proud to ask for help and a stronger student doesn't want to offend by offering help is not acceptable. One can ask for help or offer help without losing respect for each other.
Video 2. Teams working on assignments
Secondly, it turns on resources that are otherwise idle. Explaining something to somebody is the best way to improve your own understanding. The better students master a kind of leadership that is especially valuable in a modern society. For the weaker students, feeling responsible for a team improves motivation.
Thirdly, I save time by having to grade fewer student papers.
On exams and quizzes I mercilessly punish the students for Yes/No answers without explanations. There are no half-points for half-understanding. This, in combination with the team work and open grades policy, allows me to achieve my main objective: students are eager to talk to me about their problems.
Set operations and probability
After studying the basics of set operations and probabilities we had a midterm exam. It revealed that about one-third of students didn't understand this material and some of that misunderstanding came from high school. During the review session I wanted to see if they were ready for a frank discussion and told them: "Those who don't understand probabilities, please raise your hands", and about one-third raised their hands. I invited two of them to work at the board.
Video 3. Translating verbal statements to sets, with accompanying probabilities
Many teachers think that the Venn diagrams explain everything about sets because they are visual. No, for some students they are not visual enough. That's why I prepared a simple teaching aid (see Video 3) and explained the task to the two students as follows:
I am shooting at the target. The target is a square with two circles on it, one red and the other blue. The target is the universe (the bullet cannot hit points outside it). The probability of a set is its area. I am going to tell you one statement after another. You write that statement in the first column of the table. In the second column write the mathematical expression for the set. In the third column write the probability of that set, together with any accompanying formulas that you can come up with. The formulas should reflect the relationships between relevant areas.
Table 1. Set operations and probabilities
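| Statement | Set | Probability |
|---|---|---|
| 1. The bullet hit the universe | $\Omega$ | $P(\Omega)=1$ |
| 2. The bullet didn't hit the universe | $\varnothing$ | $P(\varnothing)=0$ |
| 3. The bullet hit the red circle | $R$ | $P(R)$ |
| 4. The bullet didn't hit the red circle | $R^c$ | $P(R^c)=1-P(R)$ |
| 5. The bullet hit both the red and blue circles | $R\cap B$ | $P(R\cap B)$ (in general, this is not equal to $P(R)P(B)$) |
| 6. The bullet hit $R$ or $B$ (or both) | $R\cup B$ | $P(R\cup B)=P(R)+P(B)-P(R\cap B)$ (additivity rule) |
| 7. The bullet hit $R$ but not $B$ | $R\setminus B$ | $P(R\setminus B)=P(R)-P(R\cap B)$ |
| 8. The bullet hit $B$ but not $R$ | $B\setminus R$ | $P(B\setminus R)=P(B)-P(R\cap B)$ |
| 9. The bullet hit either $R$ or $B$ (but not both) | $(R\setminus B)\cup(B\setminus R)$ | $P(R)+P(B)-2P(R\cap B)$ |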
During the process, I was illustrating everything on my teaching aid. This exercise allows the students to relate verbal statements to sets and further to their areas. The main point is that people need to see the logic, and that logic should be repeated several times through similar exercises.
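The target exercise can also be replayed on a computer. The sketch below is only an illustration under made-up assumptions: the target is a hypothetical 20 by 20 grid of equally likely points, and the two circles (their centres and radii) are invented. Each verbal statement is turned into a Python set operation, and the probability is the share of grid points in the set.

```python
from fractions import Fraction

# Hypothetical discretized target: a 20 x 20 grid of equally likely points.
SIZE = 20
universe = {(x, y) for x in range(SIZE) for y in range(SIZE)}

def disc(cx, cy, r):
    """Grid points inside a circle (centre and radius are invented)."""
    return {(x, y) for (x, y) in universe
            if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2}

red = disc(7, 10, 5)
blue = disc(12, 10, 5)

def prob(event):
    """Points in the event divided by points in the universe."""
    return Fraction(len(event), len(universe))

# One entry per verbal statement: statement -> corresponding set.
rows = {
    "hit the universe": universe,
    "did not hit the universe": universe - universe,
    "hit red": red,
    "did not hit red": universe - red,
    "hit both red and blue": red & blue,
    "hit red or blue (or both)": red | blue,
    "hit red but not blue": red - blue,
    "hit blue but not red": blue - red,
    "hit exactly one circle": red ^ blue,
}

for statement, event in rows.items():
    print(f"{statement:28s} P = {float(prob(event)):.4f}")
```

Because exact fractions are used, relationships between areas such as the one for "red or blue" can be checked with `==` rather than approximately.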
Law of total probability: you could have invented this
A knight wants to kill (event $K$) a dragon. There are two ways to do this: by fighting (event $F$) the dragon or by outwitting ($O$) it. The choice of the way ($F$ or $O$) is random, and in each case the outcome ($K$ or not $K$) is also random. For the probability of killing there is a simple, intuitive formula:
$P(K)=P(K|F)P(F)+P(K|O)P(O)$.
Its derivation is straightforward from the definition of conditional probability: since $F$ and $O$ cover the whole sample space and are disjoint, we have by additivity of probability
$P(K)=P(K\cap F)+P(K\cap O)=P(K|F)P(F)+P(K|O)P(O)$.
This is easy to generalize to the case of many conditioning events. Suppose $A_1,...,A_n$ are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event $B$ one has
$P(B)=P(B|A_1)P(A_1)+...+P(B|A_n)P(A_n)$.
This equation is called the law of total probability.
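A small numerical sketch of the knight example may help. All probabilities below are invented for illustration; the function name `total_probability` is also just a label chosen here.

```python
from fractions import Fraction

# Invented numbers: how likely each way is, and how likely the kill is given the way.
p_way = {"fight": Fraction(3, 10), "outwit": Fraction(7, 10)}
p_kill_given = {"fight": Fraction(1, 5), "outwit": Fraction(1, 2)}

def total_probability(p_cond, p_events):
    """Sum of P(outcome | event) * P(event) over disjoint events covering everything."""
    assert sum(p_events.values()) == 1, "conditioning events must cover the universe"
    return sum(p_cond[e] * p_events[e] for e in p_events)

p_kill = total_probability(p_kill_given, p_way)
print(p_kill)  # 3/10 * 1/5 + 7/10 * 1/2 = 41/100
```

The same function works for any number of conditioning events, as long as their probabilities add up to one.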
Application to a sum of continuous and discrete random variables
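Let $X,Y$ be independent random variables. Suppose that $X$ is continuous, with a distribution function $F_X$, and suppose $Y$ is discrete, with values $y_1,...,y_n$. Then for the distribution function of the sum $F_{X+Y}$ we have
$F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^n P(X+Y\le t|Y=y_j)P(Y=y_j)=\sum_{j=1}^n P(X\le t-y_j|Y=y_j)P(Y=y_j)$
(by independence conditioning on $Y=y_j$ can be omitted)
$=\sum_{j=1}^n P(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^n F_X(t-y_j)P(Y=y_j)$.
Compare this to the much more complex derivation in the case of two continuous variables.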
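The resulting formula is a weighted sum of shifted distribution functions, which is easy to compute. The sketch below assumes, purely for illustration, that the continuous variable is standard normal and that the discrete one takes the values -1 and 1 with equal probabilities; `cdf_of_sum` is a name made up here.

```python
from statistics import NormalDist

def cdf_of_sum(t, cdf_x, y_dist):
    """Distribution function of X + Y at t.

    cdf_x  : distribution function of the continuous variable X
    y_dist : dict mapping each value of the discrete variable Y to its probability
    """
    return sum(cdf_x(t - y) * p for y, p in y_dist.items())

# Assumed example: X is standard normal, Y is -1 or 1 with probability 1/2 each.
X = NormalDist(0, 1)
y_dist = {-1: 0.5, 1: 0.5}

for t in (-2.0, 0.0, 2.0):
    print(t, cdf_of_sum(t, X.cdf, y_dist))
```

In this symmetric example the value at zero should be one half, which is a quick sanity check of the formula.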
Distribution function estimation
The relativity theory says that what initially looks absolutely difficult, on closer examination turns out to be relatively simple. Here is one such topic. We start with a motivating example.
Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If, as in our motivating example, we denote by $T$ the temperature outside and by $t$ a cutoff value, then a cloud service provider is interested in knowing the probability $P(T\le t)$ for different values of $t$. This is exactly the distribution function of temperature: $F_T(t)=P(T\le t)$. So how do you estimate it?
It comes down to usual sampling. Fix some cutoff, for example $t=20$, and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability $P(T\le 20)$.
It remains to dress this idea in mathematical clothes.
Empirical distribution function
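If an observation belongs to the event $\{T\le 20\}$, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable
(1) $1_{\{T\le 20\}}=1$ if $T\le 20$ and $1_{\{T\le 20\}}=0$ if $T>20$.
The total count is $\sum_i 1_{\{T_i\le 20\}}$, where $T_i$ are the observed temperatures, and this is divided by the total number of observations, which is 365, to get 200/365.
It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin $C$ with probability of 1 equal to $p$ and probability of zero equal to $1-p$ the mean is
$EC=p\cdot 1+(1-p)\cdot 0=p$
and the variance is
$Var(C)=EC^2-(EC)^2=p-p^2=p(1-p)$.
For the variable in (1) $p=P(T\le 20)$, so the mean and variance are
(2) $E1_{\{T\le 20\}}=P(T\le 20)$, $Var(1_{\{T\le 20\}})=P(T\le 20)(1-P(T\le 20))$.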
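Generalizing, the probability $F_T(t)=P(T\le t)$ is estimated by
(3) $\hat{F}_T(t)=\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}$
where $n$ is the number of observations. (3) is called an empirical distribution function because it is a direct empirical analog of $P(T\le t)$.
Applying expectation to (3) and using an equation similar to (2), we prove unbiasedness of our estimator:
(4) $E\hat{F}_T(t)=\frac{1}{n}\sum_{i=1}^n E1_{\{T_i\le t\}}=\frac{1}{n}nF_T(t)=F_T(t)$.
Further, assuming independent observations we can find variance of (3):
(5) $Var(\hat{F}_T(t))=\frac{1}{n^2}Var\left(\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using homogeneity of degree 2)
$=\frac{1}{n^2}\sum_{i=1}^n Var(1_{\{T_i\le t\}})=\frac{1}{n}F_T(t)(1-F_T(t))$ (using independence and applying an equation similar to (2)).
Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to $F_T(t)$ in probability.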
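Here is a minimal sketch of the empirical distribution function in code. The temperature data are simulated (a made-up normal model with mean 15 and standard deviation 8), so the numbers carry no meteorological meaning; the standard error is a plug-in version of the variance formula (5), with the unknown probability replaced by its estimate.

```python
import random

def ecdf(sample, t):
    """Empirical distribution function: share of observations not exceeding t."""
    return sum(1 for obs in sample if obs <= t) / len(sample)

# Simulated daily temperatures for one year (invented model, for illustration only).
random.seed(0)
temps = [random.gauss(15, 8) for _ in range(365)]

estimate = ecdf(temps, 20)
# Plug-in standard error based on the variance formula F(1 - F) / n.
std_err = (estimate * (1 - estimate) / len(temps)) ** 0.5

print(f"estimated P(T <= 20) = {estimate:.3f}, standard error = {std_err:.3f}")
```

Evaluating `ecdf` on a grid of cutoffs gives the whole estimated distribution function, not just one probability.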
Reevaluating probabilities based on a piece of evidence
This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find the probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.
Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).
Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.
Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.
1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?
2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?
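Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}. Then we are given:
(1) $P(R|T)=0.05$, $P(R|F)=0.5$,
(2) $P(T)=P(F)=0.5$.
(1) and (2) show that we can find $P(R\cap T)=P(R|T)P(T)=0.025$ and $P(R\cap F)=P(R|F)P(F)=0.25$ and therefore also $P(A\cap T)=P(T)-P(R\cap T)=0.475$ and $P(A\cap F)=P(F)-P(R\cap F)=0.25$. Once we know probabilities of elementary events, we can find everything about everything.
Answering the first question: just plug probabilities in $P(F|R)=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}=\frac{0.25}{0.275}\approx 0.91$.
Answering the second question: just plug probabilities in $P(T|A)=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}=\frac{0.475}{0.725}\approx 0.66$.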
Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
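The elementary-events approach is mechanical enough to be written as a few lines of code. The sketch below takes the numbers from Activity 7.2 (Type I error rate 5%, power 50%, equal prior probabilities), reading the Type I error rate as the probability of rejection given a true null and the power as the probability of rejection given a false null. The variable names are chosen here for illustration.

```python
from fractions import Fraction

alpha = Fraction(5, 100)    # P(R | T): Type I error rate
power = Fraction(50, 100)   # P(R | F): power of the test
p_true = Fraction(1, 2)     # P(T): prior probability that the null is true
p_false = 1 - p_true        # P(F)

# Probabilities of the four elementary events (decision, state of nature).
atoms = {
    ("R", "T"): alpha * p_true,
    ("A", "T"): (1 - alpha) * p_true,
    ("R", "F"): power * p_false,
    ("A", "F"): (1 - power) * p_false,
}
assert sum(atoms.values()) == 1

# Everything else follows by additivity and the definition of conditional probability.
p_reject = atoms[("R", "T")] + atoms[("R", "F")]
p_fail = atoms[("A", "T")] + atoms[("A", "F")]
p_false_given_reject = atoms[("R", "F")] / p_reject
p_true_given_fail = atoms[("A", "T")] / p_fail

print(p_false_given_reject, float(p_false_given_reject))  # 10/11
print(p_true_given_fail, float(p_true_given_fail))        # 19/29
```

Changing `alpha`, `power` or `p_true` shows how strongly the answers depend on the prior belief about the null.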
Significance level and power of test
In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.
Type I and Type II errors
Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (for example, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative.
Usually in books you can see the following table.
| State of nature | Decision taken: fail to reject null | Decision taken: reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |
This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.
Significance level and power of test
The conclusion from the video is that
Nonlinear least squares: idea, geometry and implementation in Stata
Nonlinear least squares
Here we explain the idea, illustrate the possible problems in Mathematica and, finally, show the implementation in Stata.
Idea: minimize RSS, as in ordinary least squares
Observations come in pairs $(x_i,y_i)$. In case of ordinary least squares, we approximated the y's with linear functions of the parameters, possibly nonlinear in x's. Now we use a function $f(x_i,b)$ which may be nonlinear in the parameter $b$. We still minimize RSS, which takes the form $RSS=\sum_i (y_i-f(x_i,b))^2$. Nonlinear least squares estimators are the values of $b$ that minimize RSS. In general, it is difficult to find a formula (closed-form solution), so in practice software, such as Stata, is used for RSS minimization.
Simplified idea and problems in one-dimensional case
Suppose we want to minimize $RSS(b)$. The Newton algorithm (default in Stata) is an iterative procedure that consists of steps:
1. Select the initial value $b_0$.
2. Find the derivative (or tangent) of RSS at $b_0$. Make a small step in the descent direction (indicated by the derivative), to obtain the next value $b_1$.
3. Repeat Step 2, using $b_1$ as the starting point, until the difference between the values of the objective function at two successive points becomes small. The last point will approximate the minimizing point.
Problems:
 The minimizing point may not exist.
 When it exists, it may not be unique. In general, there is no way to find out how many local minima there are and which ones are global.
 The minimizing point depends on the initial point.
See Video 1 for illustration in the one-dimensional case.
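The simplified procedure and the dependence on the initial point can be sketched in a few lines. Everything in the example is an assumption made for illustration: the model $y=b^2x$ (chosen because its RSS has two minima, at $b=1$ and $b=-1$), the tiny data set, the step size and the tolerance. It implements the plain small-step descent described above, not the more refined routine a statistical package would use.

```python
def rss(b, xs, ys):
    """Residual sum of squares for the assumed model y = b^2 * x."""
    return sum((y - b * b * x) ** 2 for x, y in zip(xs, ys))

def rss_derivative(b, xs, ys):
    """Derivative of RSS with respect to b for the same model."""
    return sum(-4 * b * x * (y - b * b * x) for x, y in zip(xs, ys))

def descend(b0, xs, ys, step=0.01, tol=1e-12, max_iter=10_000):
    """Small steps against the derivative until RSS stops changing."""
    b = b0
    for _ in range(max_iter):
        b_next = b - step * rss_derivative(b, xs, ys)
        if abs(rss(b_next, xs, ys) - rss(b, xs, ys)) < tol:
            return b_next
        b = b_next
    return b

# Made-up data generated exactly by y = x, so both b = 1 and b = -1 give RSS = 0.
xs = [1.0, 2.0]
ys = [1.0, 2.0]

print(descend(0.5, xs, ys))   # ends near one minimum
print(descend(-0.5, xs, ys))  # ends near the other minimum
```

Starting exactly at zero the derivative vanishes and the procedure does not move at all, which is another way the initial point can mislead the algorithm.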
Problems illustrated in Mathematica
Here we look at three examples of nonlinear functions, two of which are considered in Dougherty. The first one is a power function (it can be linearized by applying logs) and the second is an exponential function (it cannot be linearized). The third function gives rise to two minima. The possibilities are illustrated in Mathematica.
Finally, implementation in Stata
Here we show how to 1) generate a random vector, 2) create a vector of initial values, and 3) program a nonlinear dependence.
Alternatives to simple regression in Stata
In an earlier post we looked at the dependence of EARNINGS on S (years of schooling). At the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.
Quadratic regression
This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is
$EARNINGS=a+bS+cS^2+u$.
Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we are still in the realm of linear regression. Video 1 shows how to run this regression.
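To see why a quadratic dependence still counts as linear regression, it may help to look at a bare-bones computation. The sketch below is not the Stata procedure from the video; it is a hypothetical pure-Python illustration in which the regressors are a constant, S and S squared, and ordinary least squares is obtained from the normal equations. The data are invented and generated without noise.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def quadratic_ols(s_values, y_values):
    """OLS of y on (1, S, S^2): linear in the parameters, quadratic in S."""
    rows = [[1.0, s, s * s] for s in s_values]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, y_values)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Invented noise-free data: earnings = 1 + 2 S + 0.5 S^2.
schooling = [8, 10, 12, 14, 16, 18]
earnings = [1 + 2 * s + 0.5 * s * s for s in schooling]

a, b, c = quadratic_ols(schooling, earnings)
print(a, b, c)
```

The only change compared with simple regression is the extra column with squared schooling; the estimation method itself is untouched.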
Nonparametric regression
The general way to write this model is $y=m(x)+u$.
The beauty and power of nonparametric regression lies in the fact that we don't need to specify the functional form of dependence of $y$ on $x$. Therefore there are no parameters to interpret, there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.