Many predictors, few observations – Lasso versus Ridge regression

In this series of blogs we discuss the challenging situation where we want to study the relation between some outcome and many candidate predictors while, at the same time, having relatively few observations. In our previous blogs we came to the point where we have seen evidence of a relationship y = f(x1, ..., xp). So the logical question becomes: which xi are dominant, and what is f? This is a major focus in statistical learning and machine learning, and many techniques are available. In this blog we review some of them.

 

High level approach and the best technique

For the last decade or so, new techniques and modelling approaches have been proposed at an increasing pace. It is easy to get the impression that you simply have to pick the most advanced technique. However, the more basic techniques perform reasonably well for most problems; it is the steps surrounding the modelling that should be chosen carefully. So we assume that, after some work, we have arrived at a sensible set of cleaned predictors x1, ..., xp.

 

Examples of techniques

Lasso and Ridge regression are variants of classical regression that feature regularization. Lasso will typically select only a subset of the xi as important. Stepwise regression is a classical technique with the same aim, but some regard Lasso as superior.

 

Ridge shares characteristics with the procedure of first doing PCA and then doing regression on the components.
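To make this concrete, here is a minimal sketch of that two-step procedure, often called principal component regression (PCR): PCA followed by ordinary least squares on the leading components. The data below are synthetic and the number of components (5) is an arbitrary illustrative choice; in practice both would come from your own cleaned dataset and cross-validation.

```python
# Sketch of principal component regression (PCR): PCA, then ordinary
# least squares on the leading components. Synthetic data for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))          # 50 observations, 20 predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Keep only the 5 leading principal components, then regress on them.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print("in-sample R^2:", pcr.score(X, y))
```

Like Ridge, PCR shrinks the influence of low-variance directions in the predictor space; Ridge does so smoothly, while PCR cuts them off entirely.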

 

Random forests are really a bunch of trees: many decision trees are constructed in a special way, and the final model averages their outcomes. The advantage compared to Ridge and Lasso is that random forests can automatically model non-linear and interaction effects; they can also take categorical variables as predictors. Random forests are still relatively simple to use, yet quite flexible in what they can do.
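A short sketch of that automatic non-linearity handling, on synthetic data: the outcome depends non-linearly on two of five predictors, and the forest's built-in importance scores pick those two out without any manual feature engineering.

```python
# Sketch: a random forest captures non-linear effects without manual
# feature engineering. Synthetic data; 2 of 5 predictors carry signal.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2     # non-linear in x1 and x2 only

forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)
# Impurity-based importances: the two informative predictors dominate.
print(forest.feature_importances_)
```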

 

All of these techniques give an indication of the importance of each xi, but the relation to true cause and effect remains non-trivial. These techniques have profound and deep mathematics behind them; see e.g. The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. Also, they are not complete black boxes, so interpretation is possible, in particular checking whether the results are consistent with domain knowledge.

 

Lasso and bet on sparsity – if you have really few observations

In the first blog, the most extreme practical example was this:

A production process in the food industry has 10 steps where in total about 600 variables are monitored (pH value, concentrations of metabolites, temperatures, durations). However, in the course of years only 20 relevant batches have been produced. For those batches, what is the relation between the production steps and a quality measure?

 

Note that this situation is extremely challenging in terms of what we expect of the data. The real question is whether it is reasonable to get an answer at all.

 

Lasso has the property of selecting only a subset of the x's, setting the regression coefficients of all the others exactly to zero. This can be very useful for understanding the relation.
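A small sketch of this selection behaviour in the many-predictors, few-observations regime: with 100 candidate predictors, 30 observations and only 3 truly active x's, most fitted Lasso coefficients come out exactly zero. The data and the penalty strength (alpha=0.3) are illustrative choices, not a recipe.

```python
# Sketch of Lasso's variable selection with p >> n: most coefficients
# are set exactly to zero. Synthetic data with a sparse true model.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 30, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]            # sparse truth: only 3 active x's
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.3).fit(X, y)
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```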

 

The nature of the problem at hand can take on two extremes. If the predictor effects are dense, all x's contribute a little bit; if only a few predictors matter, the effects are called sparse. Different techniques work out differently when observations are very few.

 

 

|                                 | Technique             | Dense                                      | Sparse                 |
|---------------------------------|-----------------------|--------------------------------------------|------------------------|
| Very few observations           | Lasso                 | Very poor performance                      | Reasonable performance |
|                                 | Ridge, random forests | Poor performance                           | Poor performance       |
| Moderate number of observations | Lasso                 | Reasonable                                 | Good                   |
|                                 | Ridge                 | Reasonable                                 | Reasonable             |
|                                 | Random forests        | Reasonable (and automatic non-linearities) | Reasonable             |

 

The message is that in the very challenging situations you might as well take the Lasso model: the others probably perform poorly anyway, and if you are lucky and the situation is sparse, Lasso will save the day: the bet on sparsity.

 

Summary

In our series of blogs, we elaborated on how to deal with many predictors and few observations: we wish to understand the relation between y and many predictors xi based on relatively few observations. Once all necessary steps are taken to arrive at a clean dataset, the following are some possible approaches.

  • Approach 1: Dimension reduction.
    Using PCA and domain knowledge, you gain more understanding, reduce the input space to only a limited number of variables, and use classical techniques.
  • Approach 2: Regularization and cross-validation
    • Investigate the evidence that the inputs can predict y and how strong this relation is.
      • Example techniques are Lasso, Ridge, and random forests. You should only use techniques that give a realistic idea of the performance on future observations.
      • Lasso is one of the few techniques that might be useful in case of really few observations (“bet on sparsity”).
    • If there is evidence for a reasonably strong relation, study the relation between y and the xi. This can still be a challenge, and domain knowledge should play an important role at this point.
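The cross-validation part of Approach 2 can be sketched as follows. This is a minimal illustration on synthetic data: LassoCV chooses the penalty strength by k-fold cross-validation, and a separate round of cross-validation then gives an honest estimate of performance on future observations; the fold counts are arbitrary illustrative choices.

```python
# Sketch of regularization plus cross-validation: LassoCV tunes the
# penalty by 5-fold CV; an outer CV round estimates future performance.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 60))          # 40 observations, 60 predictors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=40)

model = LassoCV(cv=5).fit(X, y)        # inner CV picks the alpha
scores = cross_val_score(LassoCV(cv=5), X, y, cv=5)  # outer CV: honest R^2
print("chosen alpha:", model.alpha_, "CV R^2:", scores.mean())
```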

 

The list of approaches is probably incomplete and likely to be refined as our understanding grows. If you are interested in the latest status of our understanding, or want to contribute to this topic, please get in touch with me.


Drs. Jan Willem Bikker


Senior Consultant