Many predictors, few observations – Dimension reduction and regularization
In our last blog we touched upon the challenging situation where we want to study the relation between some consequence y and many candidate predictors while, at the same time, having relatively few observations. The main goal is to study the cause-and-effect relationship y = f(x1,…,xp). We handle this by answering two consecutive questions: is there evidence for a relation, and is it reasonably strong? This blog focuses on two possible approaches, based on dimension reduction and regularization.
Approach 1: Dimension reduction
Possibly, your large set of predictors can be reduced to a smaller number. For instance, we once had a dataset with about 100 predictors representing the intensities of a spectrum analysis of chemical compounds in soya beans: the 100 predictors ran along the wavelengths of the analysis. A principal component analysis gave a shorter description of the data. The first principal component was basically an average of all xi's and described the general intensity over all wavelengths in the soya bean analysis. The second component was a contrast describing whether there is more intensity at the lower or at the higher frequencies. This should be verified with the local expert, using the relation between frequencies and chemistry, to see whether it makes sense. Note that there is a danger here: an expert can always find reasons why the data findings are reasonable. If we are convinced, we have reduced our problem to a few predictors, and classical methods such as regression may be helpful.
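As a sketch of this approach, here is a minimal principal component analysis on simulated spectra. All numbers and dimensions below are made-up illustrations (the real soya bean data is not reproduced here); the point is only that PCA recovers an "overall level" component and a "low-versus-high" contrast when the data contain them.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical spectra: 30 samples measured at 100 wavelengths.
n_samples, n_wavelengths = 30, 100
overall = rng.normal(10, 2, size=(n_samples, 1))   # sample-wide intensity level
tilt = rng.normal(0, 1, size=(n_samples, 1))       # low-vs-high contrast per sample
wavelengths = np.linspace(-1, 1, n_wavelengths)
spectra = overall + tilt * wavelengths + rng.normal(0, 0.3, (n_samples, n_wavelengths))

pca = PCA(n_components=2)
scores = pca.fit_transform(spectra)

# The first loading vector is roughly flat (an average of all x_i);
# the second is roughly a linear contrast between low and high wavelengths.
print(pca.explained_variance_ratio_)
```

The two scores per sample can then replace the 100 original predictors in a classical regression.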
Approach 2: Regularization and cross validation to be future proof
In our previous blog, we hinted at the danger of overfitting. To reduce this risk, one can use methods that feature regularization (a.k.a. weight decay in machine learning). Regularization pushes the regression coefficients bi towards 0, so that the xi have a less prominent effect in y = f(x1,…,xp) = b0 + b1x1 + ... + bpxp. In other models, and more generally, it pushes the model towards the constant model. The amount of regularization is chosen so that the average prediction error for future observations is minimal. The consequence is that predictions tend to be slightly off on average (called bias), but they become less sensitive to which particular sample of observations you happened to collect (less variance). This is known as the bias-variance trade-off.
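A minimal sketch of this shrinkage effect, using Ridge regression from scikit-learn on made-up data (the dataset and the alpha values are illustrative assumptions): increasing the regularization strength pushes the fitted coefficients towards 0.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Hypothetical data: 20 observations, 10 predictors.
X = rng.normal(size=(20, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=20)

# Larger alpha means more regularization: coefficients shrink towards 0.
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.abs(model.coef_).sum())
```

The printed total coefficient magnitude decreases as alpha grows, which is exactly the "push towards the constant model" described above.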
But how do you know what the future prediction errors will be? One way is to split the data into a training set and a test set, where the regression model is fitted solely on the training set and the test set's only purpose is to assess the performance. With few observations, such a split is not attractive. An alternative is k-fold cross validation, which is often available, or even the default method, in several implementations. In about half of our projects we have to do an elaborate version of this: if the data appears in groups or has time-order dependencies, the splitting of the data should respect those structures. This can be the topic of another blog.
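A minimal sketch of choosing the amount of regularization by k-fold cross validation, here with scikit-learn's RidgeCV on made-up data (the dataset, the alpha grid, and the number of folds are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)

# Hypothetical data: 30 observations, 15 predictors, only two of which matter.
X = rng.normal(size=(30, 15))
y = X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=30)

# 5-fold cross validation over a grid of regularization strengths;
# RidgeCV keeps the alpha with the best average out-of-fold score.
model = RidgeCV(alphas=np.logspace(-2, 3, 20), cv=5).fit(X, y)
print("chosen alpha:", model.alpha_)
```

Each candidate alpha is judged only on observations left out of the fit, which is what makes the chosen model "future proof" in the sense of this section.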
Evidence of a relation
Methods that use regularization give a measure of model fit quality for future observations. For continuous y, this can be expressed in the classical R-squared: the percentage of variation in future observations explained by the model. With really few observations, the regularization tends to stay on the safe side and explain only a little. This can be disappointing, but it is also a consequence of the choice to look only at safe, regularized models.
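As an illustration, such a future R-squared can be estimated from out-of-fold scores. The data, the regularization strength, and the noise level below are made-up assumptions, chosen so that the true signal is weak and the predicted R-squared comes out modest:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Few observations, many predictors: only the first two carry signal,
# and the noise is deliberately large relative to that signal.
X = rng.normal(size=(40, 30))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=40)

# Out-of-fold R-squared estimates the variation the model would explain
# in future observations; expect a modest (possibly even negative) value.
scores = cross_val_score(Ridge(alpha=10.0), X, y, cv=5, scoring="r2")
print(scores.mean())
```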
A parallel can be drawn between regularization and a classical setting. Suppose there is an old and a new version of a product or process step. A test is performed, collecting observations for both old and new, say n = 40 each. Here is a technical example.
We are mainly interested in comparing the means µold and µnew. If management is convinced that µnew > µold, they are willing to make the sizable investment. Here, the classical techniques are statistical significance and confidence intervals. Let's say that an improvement of 3 units is minor, 6 units interesting, and 9 units impressive. Our small dataset brings uncertainty to the comparison. The analyst performs a two-sample t-test, sees a p-value of 0.04 and concludes that there is a statistically significant difference between the old and the new situation. More importantly, she looks at the 95% confidence interval for µnew - µold, which runs from 3 units to 9 units, i.e., from minor to impressive, with a point estimate of 6 units: interesting.
One good way of answering the question “do we have evidence of an improvement?” would be to say “We have evidence of at least a minor improvement, but it could be more.”
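The computation behind such a comparison can be sketched as follows. The data are simulated with assumed means and spread, so the resulting p-value and interval will not match the exact numbers above; the code only shows the mechanics of the two-sample t-test and its confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical measurements for the old and new process, n = 40 each.
old = rng.normal(loc=100, scale=9, size=40)
new = rng.normal(loc=106, scale=9, size=40)

# Two-sample t-test (pooled variance, scipy's default).
t_res = stats.ttest_ind(new, old)
print("p-value:", t_res.pvalue)

# 95% confidence interval for mu_new - mu_old.
# With equal group sizes this standard error equals the pooled one.
diff = new.mean() - old.mean()
se = np.sqrt(old.var(ddof=1) / 40 + new.var(ddof=1) / 40)
t_crit = stats.t.ppf(0.975, df=78)  # df = n1 + n2 - 2
print("95% CI:", (diff - t_crit * se, diff + t_crit * se))
```

As in the story above, the lower end of the interval is the honest answer to "how much improvement do we have evidence for?"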
The predicted performance for future observations can be compared with this focus on the least favorable end of the 95% confidence interval. In both settings, the results would become more favorable with more data. For example, with n = 100 and p = 50, you may get a very modest predicted R-squared of 10%. With n = 2500, the predicted R-squared could turn out much higher, but the apparent relation could just as well vanish.
Strength of a relation
Suppose we have seen evidence of a relationship; the logical questions then become which xi's are dominant and what f looks like. This is a major focus in statistical learning and machine learning, and many techniques are available. We mention Ridge regression, Lasso, and random forests, but you could plug in others as well. In our next blog these methods will be reviewed.
- Blog 1 (of 3) - Many predictors and few observations
- Blog 3 (of 3) - Many predictors, few observations - Lasso versus Ridge regression
- Neural networks are clever, but are they also reliable?
- Predictive maintenance - Big Data versus Small Data in prediction models
- Examples from our practice... Dutch cases and cases in English