Many predictors, few observations – Lasso versus Ridge regression
In this series of blogs we touch upon the challenging situation where we want to study the relation between some outcome and many candidate predictors, but, at the same time, have relatively few observations. In our previous blogs we came to the point where we have seen evidence of a relationship. So the logical question becomes: which x_i's are dominant, and what does the relation f look like? This is a major focus in statistical learning and machine learning, and many techniques are available. In this blog, some methods are reviewed.
High-level approach and the best technique
For the last decade or so, new techniques and modelling approaches have been proposed at an increasing pace. It is easy to get the impression that you just have to pick the most advanced technique. However, the more basic techniques perform reasonably well for most problems. Instead, the steps surrounding the modelling should be chosen carefully. So we assume that, after some work, we have arrived at a sensible set of cleaned predictors x_1, ..., x_p.
Examples of techniques
Lasso and Ridge regression are variants of classical regression that feature regularization. Lasso will typically select only a subset of the x_i as important. Stepwise regression is a classical technique with the same aim, but many regard Lasso as superior.
Ridge shares characteristics with the procedure of first doing PCA and then doing regression on the components.
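To make the difference concrete, here is a minimal sketch in Python with scikit-learn on simulated data (the sample size, noise level and number of informative predictors are illustrative assumptions, not taken from this blog series). Cross-validated Lasso sets most coefficients to exactly zero, whereas Ridge only shrinks them.

```python
# Sketch: Lasso vs Ridge on simulated data with many more candidate
# predictors than truly informative ones (scikit-learn, synthetic data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# 100 observations, 50 candidate predictors, only 5 truly informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5, max_iter=10000).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)

# Lasso sets many coefficients exactly to zero; Ridge only shrinks them
print("Lasso non-zero coefficients:", np.count_nonzero(lasso.coef_), "of 50")
print("Ridge non-zero coefficients:", np.count_nonzero(ridge.coef_), "of 50")
```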
Random forests are really a bunch of trees: many decision trees are constructed in a particular way, and the final model averages the outcomes of these trees. The advantage compared to Ridge and Lasso is that random forests can automatically model nonlinear and interaction effects; they are also able to take categorical variables as predictors. Random forests are still relatively simple to use, yet quite flexible in what they can do.
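A small sketch (again scikit-learn on simulated data; the nonlinear ground truth is an assumption chosen purely for illustration) of how a random forest picks up nonlinear effects without them being specified, and reports a rough importance per predictor:

```python
# Sketch: a random forest fitted to a nonlinear relation, followed by
# the per-predictor importances the trees derive (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))                                      # 6 candidate predictors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)    # nonlinear in x1 and x2

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Importances indicate which predictors the trees actually use
for i, imp in enumerate(forest.feature_importances_, start=1):
    print(f"x{i}: importance {imp:.2f}")
```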
All of these techniques give an indication of the importance of each x_i, but the relation to true cause and effect remains nontrivial. These techniques have profound and deep mathematics behind them; see e.g. Hastie's book The Elements of Statistical Learning. They are also not complete black boxes, so interpretation is possible, in particular checking whether the results are consistent with domain knowledge.
Lasso and the bet on sparsity – if you have really few observations
The most extreme practical example from the first blog is this:
A production process in the food industry has 10 steps in which, in total, about 600 variables are monitored (pH values, concentrations of metabolites, temperatures, durations). However, over the course of years only 20 relevant batches have been produced. For those batches, what is the relation between the production steps and a quality measure?
Note that this situation is extremely challenging in terms of what we expect of the data. The real question is whether it is reasonable to get an answer at all.
Lasso has the property that it selects only a subset of the x_i and sets the regression coefficients of the others to exactly zero. This can be very useful for understanding the relation.
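As a rough sketch of what this looks like in the extreme example above, the snippet below fits a cross-validated Lasso on simulated data with 20 observations and 600 variables, of which only 3 truly matter in the simulation (the real batch data are of course not available here):

```python
# Sketch of the extreme case: 20 observations, 600 monitored variables.
# Simulated data; only 3 variables carry real signal.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=20, n_features=600, n_informative=3,
                       noise=1.0, random_state=1)

# Cross-validated Lasso: most coefficients end up exactly zero
lasso = LassoCV(cv=5, max_iter=50000).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("Variables with a non-zero coefficient:", selected)
```

Whether the selected variables coincide with the truly relevant ones depends strongly on how sparse and how noisy the real situation is.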
The nature of the problem at hand can take on two extremes. If the predictor effects are dense, all x_i contribute a little bit. In contrast, if only a few predictors are important, the effects are called sparse. These two situations favour different techniques, especially when there are very few observations; the table below gives a rough indication.

| Observations | Technique | Dense | Sparse |
| --- | --- | --- | --- |
| Very few | Lasso | Very poor performance | Reasonable performance |
| Very few | Ridge, random forests | Poor performance | Poor performance |
| Moderate | Lasso | Reasonable | Good |
| Moderate | Ridge | Reasonable | Reasonable |
| Moderate | Random forests | Reasonable (and automatic nonlinearities) | Reasonable |
The message is that in very challenging situations you might as well use the Lasso model: the other techniques probably perform poorly anyway, and if you are lucky and the situation is sparse, Lasso will save the day. This is the bet on sparsity.
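The simulation below sketches the reasoning behind this bet (simulated data; the dimensions, coefficient patterns and noise level are assumptions made purely for illustration). It compares the cross-validated R² of Lasso and Ridge when the true effects are sparse versus dense, with far fewer observations than predictors.

```python
# Sketch behind the "bet on sparsity": cross-validated R^2 of Lasso and
# Ridge under a sparse versus a dense ground truth (synthetic data).
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 30, 100
X = rng.normal(size=(n, p))

scenarios = {
    "sparse": np.r_[np.ones(5), np.zeros(p - 5)],   # only 5 predictors matter
    "dense":  np.full(p, 0.2),                      # all predictors matter a little
}

for name, beta in scenarios.items():
    y = X @ beta + rng.normal(scale=1.0, size=n)
    for label, model in [("Lasso", LassoCV(cv=5, max_iter=10000)),
                         ("Ridge", RidgeCV(alphas=np.logspace(-3, 3, 13)))]:
        r2 = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name:6s} truth, {label}: mean CV R^2 = {r2:.2f}")
```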
Summary
In our series of blogs we elaborated on how to deal with many predictors and few observations: we wish to understand the relation between y and many candidate predictors, while having relatively few observations. Once all necessary steps have been taken to arrive at a clean dataset, the following are some possible approaches.
- Approach 1: Dimension reduction. Using PCA and domain knowledge, you gain more understanding, reduce the input space to a limited number of variables, and then use classical techniques (a small sketch follows after this list).
- Approach 2: Regularization and cross-validation.
  - Investigate the evidence that the inputs can predict y and how strong this relation is.
  - Example techniques are Lasso, Ridge and random forests. Only use techniques that give a realistic idea of the performance on future observations.
  - Lasso is one of the few techniques that might be useful in the case of really few observations (the "bet on sparsity").
  - Study the relation between y and the x_i if there is evidence for a reasonably strong relation. This can still be a challenge, and domain knowledge should play an important role at this point.
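As an illustration of Approach 1, here is a minimal principal-component-regression sketch in Python with scikit-learn (simulated data; the number of components and the data dimensions are assumptions chosen for the example). In practice, the number of components would be guided by domain knowledge and cross-validation.

```python
# Sketch of Approach 1: standardize, reduce the predictors to a few
# principal components, then run ordinary regression on those components.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=40, n_features=80, n_informative=5,
                       noise=2.0, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
print("Mean CV R^2:", cross_val_score(pcr, X, y, cv=5).mean().round(2))
```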
The list of approaches is probably incomplete and likely to be refined as our understanding grows. If you are interested in the latest status of our understanding, or want to contribute to this topic, please get in touch with me.
Also interesting:
- Blog 2 (of 3) – Many predictors, few observations – Dimension reduction and regularization
- Blog 1 (of 3) – Many predictors and few observations
- Neural networks are clever, but are they also reliable?
- Predictive maintenance – Big Data versus Small Data in prediction models
- Examples from our practice... Dutch cases and cases in English