What is cointegration? The discussions here and here are bad because they link the definition to differencing a time series. In fact, to understand cointegration, you need two notions: stationary processes (please read before continuing) and linear dependence.
Definition. We say that vectors are linearly dependent if there exist numbers
, not all of which are zero, such that the linear combination
is a zero vector.
Recall from this post that stationary processes play the role of zero in the set of all processes. Replace in the above definition "vectors" with "processes" and "a zero vector" with "a stationary process" and - voilà - you have the definition of cointegration:
Definition. We say that processes are cointegrated if there exist numbers
, not all of which are zero, such that the linear combination
is a stationary process. Remembering that each process is a collection of random variables indexed with time moments
, we obtain a definition that explicitly involves time: processes
are cointegrated if there exist numbers
, not all of which are zero, such that
where
is a stationary process.
To fully understand the implications, you need to know all the intricacies of linear dependence. I do not want to plunge into this lengthy discussion here. Instead, I want to explain how this definition leads to a regression in case of two processes.
If are cointegrated, then there exist numbers
, at least one of which is not zero, such that
where
is a stationary process. If
, we can solve for
obtaining
with
and
. This is almost a regression, except that the mean of
may not be zero. We can represent
, where
,
. Then the above equation becomes
, which is simple regression. The case
leads to a similar result.
Practical recommendation. To see if are cointegrated, regress one of them on the other and test the residuals for stationarity.