Skewed sampling from heteroscedastic distributions
Here we consider the problems that arise when a heteroscedastic distribution (one where the variance changes along the x-axis) is analysed by a simple straight line fit. We use Monte Carlo simulations to show that skewed sampling from such a distribution will lead to statistically significant trends which are entirely artefactual. Even the minor skew that arises naturally in any random sample can be large enough to generate artefactual trends.
As an illustration, suppose a dataset is sampled evenly from within the blue area in the left-hand plot in Figure 1. The y variable is heteroscedastic with respect to x: the variance of y is higher at low values of x. Regression of y on x gives a horizontal line (dotted line), which is not affected by the uneven variance. Now consider non-random sampling of y, in which more samples are collected at high values of y (shaded area of the right-hand plot). In this case the heteroscedasticity matters: the regression line is no longer horizontal, but is pulled up at the left-hand side, where the wider spread of y allows the oversampled high values to lie further above the centre line. That is exactly what happened in the LOG13-blogs case.
To show that this effect works in practice we generated 1000 points drawn at random from a heteroscedastic distribution (in which the variation in p is a function of q) for two cases: unevenly and evenly sampled.
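The two cases can be sketched in Python as follows. This is a minimal illustration rather than the simulation used for the figures: the linear form of the spread (`1.0 - 0.8 * q`), the logistic acceptance rule used to skew the sample towards high p, the `strength` parameter and all function names are assumptions introduced here, since the text does not specify them.

```python
import numpy as np

def draw_pool(n, rng):
    """Draw n (q, p) pairs in which p is zero-mean noise whose spread depends on q."""
    q = rng.uniform(0.0, 1.0, size=n)
    sd = 1.0 - 0.8 * q           # assumed form: p varies more at low q
    p = rng.normal(0.0, sd)      # no true relationship between the means of p and q
    return q, p

def sample_even(n, rng):
    """Evenly sampled case: every drawn point is kept."""
    return draw_pool(n, rng)

def sample_skewed(n, rng, strength=2.0):
    """Unevenly sampled case: points are kept with a probability that rises with p."""
    q, p = draw_pool(20 * n, rng)                     # oversized pool to thin from
    keep_prob = 1.0 / (1.0 + np.exp(-strength * p))   # assumed acceptance rule
    keep = rng.uniform(size=q.size) < keep_prob
    return q[keep][:n], p[keep][:n]

rng = np.random.default_rng(0)
q_even, p_even = sample_even(1000, rng)
q_skew, p_skew = sample_skewed(1000, rng)

# Simple straight-line fit of p on q in each case.
slope_even = np.polyfit(q_even, p_even, 1)[0]
slope_skew = np.polyfit(q_skew, p_skew, 1)[0]
print(f"slope, evenly sampled:   {slope_even:+.3f}")
print(f"slope, unevenly sampled: {slope_skew:+.3f}")
```

Under these assumptions the skewed sample should give a clearly negative slope, because oversampled high values of p lie further from zero where the spread is widest, while the evenly sampled slope should stay close to zero.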
We begin with the unevenly sampled dataset. Figure 2 shows scatter plots in both directions (p predicting q and vice versa), with summary lines drawn using a simple linear fit and using loess.
In Figure 2, the simple linear fit appears to reveal a clear linear relationship (left-hand plots), but that is simply a consequence of combining skewed sampling and heteroscedasticity. As the sampling becomes more skewed the linear trend becomes stronger (not shown).
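The dependence of the trend on the degree of skew can be explored with a short sweep. The sketch below is illustrative only: the spread function, the logistic acceptance rule and the `strength` values are assumptions made here, not the settings behind Figure 2. A `strength` of zero keeps points independently of p and so corresponds to no skew.

```python
import numpy as np

def skewed_sample(n, strength, rng):
    """Return n (q, p) points, kept with a probability that rises with p."""
    q = rng.uniform(0.0, 1.0, size=20 * n)
    p = rng.normal(0.0, 1.0 - 0.8 * q)                # assumed spread: wider at low q
    keep_prob = 1.0 / (1.0 + np.exp(-strength * p))   # assumed acceptance rule
    keep = rng.uniform(size=q.size) < keep_prob
    return q[keep][:n], p[keep][:n]

rng = np.random.default_rng(1)
strengths = [0.0, 0.5, 1.0, 2.0, 4.0]   # hypothetical skew settings
slopes = {}
for s in strengths:
    q, p = skewed_sample(1000, s, rng)
    slopes[s] = np.polyfit(q, p, 1)[0]   # slope of the straight-line fit of p on q
    print(f"strength {s:>3}: slope {slopes[s]:+.3f}")
```

With this setup the fitted slope is expected to move further from zero as `strength` increases, although neighbouring settings can overlap in any single run because each slope carries sampling noise.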
The loess fit for p predicting q (top right-hand plot) shows a clear quadratic relationship reflecting the shape of the error distribution. Note that the loess fit for q predicting p does not show a non-linear relationship.
Next we turn to unskewed (uniform) sampling and show the same four plots for the evenly sampled dataset in Figure 3.
When the same heteroscedastic distribution was evenly sampled (Figure 3), the simple linear fits often showed a marginally significant relationship between p and q, even though the underlying relationship was simply symmetric random noise. The weak trend in the ‘unskewed’ data reflects the minor non-uniformity that naturally arises in a random sample (and looks very like the trend exhibited by the LOG13-panel dataset). The direction and significance of this trend varied from run to run, because both depend on the random numbers that happened to be drawn each time.
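The run-to-run variation can be checked by repeating the evenly sampled case many times and recording the sign and t statistic of the fitted slope. This is a sketch under stated assumptions: the spread function, the number of runs and the helper name `slope_and_t` are introduced here for illustration, and |t| > 1.96 is used as an approximate two-sided 5% threshold, which is reasonable for 1000 points. The standard error is the ordinary least-squares one, which assumes constant variance, as a simple straight-line fit would.

```python
import numpy as np

def slope_and_t(x, y):
    """Ordinary least-squares slope of y on x and its conventional t statistic."""
    n = len(x)
    xc = x - x.mean()
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * y) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)   # assumes constant variance
    return slope, slope / se

rng = np.random.default_rng(2)
n_runs = 200                       # hypothetical number of repeat simulations
t_values = np.empty(n_runs)
for i in range(n_runs):
    q = rng.uniform(0.0, 1.0, size=1000)
    p = rng.normal(0.0, 1.0 - 0.8 * q)   # assumed spread: wider at low q
    _, t_values[i] = slope_and_t(q, p)

n_positive = int(np.sum(t_values > 0))
n_significant = int(np.sum(np.abs(t_values) > 1.96))
print(f"positive slopes:           {n_positive} of {n_runs}")
print(f"nominally significant fits: {n_significant} of {n_runs}")
```

How often a fit crosses the threshold depends on the assumed shape of the error distribution, so the counts printed here should not be read as reproducing the frequencies behind Figure 3.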
Note that the reason for the skew in the data does not matter (it may correctly or incorrectly reflect the underlying population). Any uneven distribution of data with this type of heteroscedastic error distribution will generate spurious linear trends. The LOG13-panel data showed a (slight) negative trend because there was a slight preponderance of data on the high side of CLIM.