Sims and Uhlig argue that although classical (frequentist) \(p\)-values are asymptotically equivalent to Bayesian posterior probabilities in stationary models, they should not be interpreted as probabilities, because the equivalence breaks down in non-stationary models.
The paper uses small samples, with \(T=100\). This post examines how the results change with \(T=10,000\), when the asymptotic behavior kicks in.

The Setup

Consider a simple AR(1) model: \begin{equation} y_t=\rho y_{t-1} + \epsilon_t \end{equation} To simplify things, suppose \(\epsilon_t \sim N(0,1)\). Classical inference says that for \(|\rho|<1\), the OLS estimator is asymptotically normal and converges at rate \(\sqrt{T}\): \begin{equation} \sqrt{T}(\hat{\rho}-\rho) \rightarrow^{L} N(0,1-\rho^2) \end{equation} For \(\rho=1\), however, we get a completely different distribution, which converges at rate \(T\) instead of rate \(\sqrt{T}\): \begin{equation} T(\hat{\rho}-\rho)=T(\hat{\rho}-1) \rightarrow^{L} \frac{(1/2)([W(1)]^2-1)}{\int_0^1 [W(r)]^2 dr} \end{equation} where \(W\) is a standard Brownian motion.

Although the limit looks complicated, it is easier to visualize once you notice that \([W(1)]^2\) is a \(\chi^2(1)\) random variable. The limiting distribution is left skewed for two reasons: the probability that a \(\chi^2(1)\) variable is less than one is about 0.68, and large realizations of \([W(1)]^2\) in the numerator get down-weighted by a large denominator (it is the same Brownian motion in the numerator and the denominator).

In the paper, the authors choose 31 values of \(\rho\) between 0.8 and 1.1 in increments of 0.01. For each \(\rho\) they simulate 10,000 samples of the AR(1) model above with \(T=100\), then run an OLS regression of \(y_t\) on \(y_{t-1}\) to get the distribution of \(\hat{\rho}\) (the OLS estimator of \(\rho\)). Below I show the distribution of \(\hat{\rho}\) for selected values of \(\rho\):
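The simulation design can be sketched in a few lines of Python (a rough sketch, not the authors' code; the function names are mine, and I use fewer draws per \(\rho\) than the paper's 10,000 for speed):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(rho, T, rng):
    """Simulate y_t = rho * y_{t-1} + eps_t with eps_t ~ N(0,1)."""
    eps = rng.standard_normal(T)
    y = np.empty(T)
    y[0] = eps[0]
    for t in range(1, T):
        y[t] = rho * y[t - 1] + eps[t]
    return y

def ols_rho_hat(y):
    """OLS estimate from regressing y_t on y_{t-1} (no intercept)."""
    x, z = y[:-1], y[1:]
    return x @ z / (x @ x)

# The paper uses 31 values of rho from 0.8 to 1.1; three are enough to
# see the shape of the distributions change.
T, n_sims = 100, 2000
draws = {rho: np.array([ols_rho_hat(simulate_ar1(rho, T, rng))
                        for _ in range(n_sims)])
         for rho in (0.80, 0.90, 1.00)}
```

Histograms of `draws[rho]` give the distributions plotted below; the left skew near \(\rho=1\) is already visible at \(T=100\).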


Another way to think about the data is to look at the distribution of \(\rho\) given observed values of \(\hat{\rho}\). This distribution is roughly symmetric about 0.95:


Their problem with using \(p\)-values as probabilities is this: if we observe \(\hat{\rho}=0.95\), we can reject the null of \(\rho=0.9\) but fail to reject the null of \(\rho=1\) (compare the tail areas after normalizing each distribution to integrate to 1), even though the distribution of \(\rho\) given \(\hat{\rho}\) is roughly symmetric about 0.95:


The problem is distortion by irrelevant information: values of \(\hat{\rho}\) far below 0.95 are more likely given \(\rho=1\) than values of \(\hat{\rho}\) far above 0.95 are given \(\rho=0.9\). But this information is irrelevant once we have observed \(\hat{\rho}=0.95\): we already know the estimate is not far above or below.
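The asymmetry of the two tail areas can be checked directly by simulation (my own sketch, with \(T=100\) as in the paper; function names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_ols(rho, T, rng):
    """Simulate one AR(1) sample and return the OLS estimate of rho."""
    eps = rng.standard_normal(T)
    y = np.empty(T)
    y[0] = eps[0]
    for t in range(1, T):
        y[t] = rho * y[t - 1] + eps[t]
    x, z = y[:-1], y[1:]
    return x @ z / (x @ x)

T, n = 100, 5000
# One-sided tail probabilities at the observed estimate 0.95:
# P(rho_hat <= 0.95 | rho = 1)  vs.  P(rho_hat >= 0.95 | rho = 0.9)
left_tail = np.mean([ar1_ols(1.0, T, rng) <= 0.95 for _ in range(n)])
right_tail = np.mean([ar1_ols(0.9, T, rng) >= 0.95 for _ in range(n)])
# The left-tail mass under rho = 1 exceeds the right-tail mass under
# rho = 0.9, which is why the same estimate rejects 0.9 but not 1.
```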

The prior required to generate these results (i.e., the prior that would let us interpret \(p\)-values as posterior probabilities) is sample dependent. Usually classical inference is asymptotically equivalent to Bayesian inference under a flat prior, but that is not the case here. The authors show that the classical analysis implicitly puts progressively more weight on values of \(\rho\) above one as \(\hat{\rho}\) gets closer to 1.

Testing with Larger Samples

At first, I found the results counter-intuitive. The first figure above shows that the skewness arrives gradually in finite samples. This is strange, because the asymptotic distribution of \(\hat{\rho}\) is non-normal only at \(\rho=1\). I figured this was the result of using small samples of \(T=100\). Under a flat prior, with \(\epsilon_t\) having unit variance, the distribution of \(\rho\) given the data is: \begin{equation} \rho \sim N\left(\hat{\rho}, \left(\sum\limits_{t=1}^T y_{t-1}^2\right)^{-1}\right) \end{equation} This motivates my intuition for why the skewness arrives gradually: even in small samples, as \(\rho\) gets close to 1, \(\sum\limits_{t=1}^T y_{t-1}^2\) can be very large.
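The flat-prior posterior precision is the sum of squared lagged values, and its rapid growth as \(\rho\) approaches 1 is easy to verify by simulation (a sketch under the same assumptions, unit-variance Gaussian errors; names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

def lagged_sum_sq(rho, T, rng):
    """Simulate one AR(1) sample and return the sum of squared lags."""
    eps = rng.standard_normal(T)
    y = np.empty(T)
    y[0] = eps[0]
    for t in range(1, T):
        y[t] = rho * y[t - 1] + eps[t]
    return np.sum(y[:-1] ** 2)

T, n = 100, 2000
for rho in (0.80, 0.95, 1.00):
    avg = np.mean([lagged_sum_sq(rho, T, rng) for _ in range(n)])
    # The flat-prior posterior sd of rho is roughly avg ** -0.5,
    # so the posterior tightens sharply as rho approaches 1.
    print(f"rho = {rho}: mean sum of squares = {avg:.0f}, "
          f"implied posterior sd = {avg ** -0.5:.3f}")
```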

I repeat their analysis with \(T=10,000\) instead of \(T=100\). As you can see, the asymptotic behavior kicks in and the skewness appears only at \(\rho=1\):
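This can be checked by comparing the sample skewness of \(\hat{\rho}\) at \(\rho=0.99\) and \(\rho=1\) for \(T=10,000\) (my own sketch, vectorized over simulations for speed; simulation counts are reduced relative to the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def rho_hats(rho, T, n, rng):
    """n OLS estimates of rho, simulating n AR(1) paths of length T at once."""
    eps = rng.standard_normal((n, T))
    y = np.empty((n, T))
    y[:, 0] = eps[:, 0]
    for t in range(1, T):
        y[:, t] = rho * y[:, t - 1] + eps[:, t]
    x, z = y[:, :-1], y[:, 1:]
    return np.sum(x * z, axis=1) / np.sum(x * x, axis=1)

def skewness(a):
    """Sample skewness (third standardized moment)."""
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

T, n = 10_000, 400
skew_stationary = skewness(rho_hats(0.99, T, n, rng))  # near zero
skew_unit_root = skewness(rho_hats(1.00, T, n, rng))   # strongly negative
```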


I also found that for \(T=10,000\) the distribution of \(\rho\) conditional on \(\hat{\rho}\) does not spread out more for smaller values of \(\hat{\rho}\); that spreading is a small-sample result.


The point of this paper is to show that the classical way of dealing with unit roots implicitly makes undesirable assumptions: you need a sample-dependent prior that puts more weight on high values of \(\rho\). To a degree, the authors' results are driven by the short length of the simulated series. The example where you reject \(\rho=0.9\) but fail to reject \(\rho=1\) wouldn't happen in large samples, because the asymptotics kick in and the faster rate of convergence at \(\rho=1\) gives the distribution less spread.

For now, however, the authors' criticism remains valid. With quarterly data from 1950 to the present, you get about 260 observations. Macroeconomics would have to survive until around the year 4450 for there to be 10,000 quarterly observations, and that's a long way off.