2020 ESA Annual Meeting (August 3 - 6)

COS 195 Abstract - Using the right tool for the job: Understanding the difference between unsupervised and supervised analyses of multivariate ecological data

Eric R. Scott, Biology, Tufts University, Medford, MA and Elizabeth Crone, Department of Biology, Tufts University, Medford, MA

Background/Question/Methods

Ecologists often collect multivariate data with the aim of determining which of many possible predictor variables are associated with a response. Unsupervised analyses (e.g., principal components analysis, PCA) find axes that explain variation in the predictor variables alone, whereas supervised analyses (e.g., partial least squares, PLS) find axes that explain co-variation between the predictor variables and one or more response variables. These approaches are not interchangeable, especially when the predictors most responsible for variation in the response are not the greatest source of overall variation in the data, a situation that ecologists are likely to encounter.
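
As a compact restatement of this distinction, the first axis found by each method can be written as an optimization problem. The notation here is an assumption for illustration and is not taken from the abstract: X is the centered predictor matrix, y is a single centered response, and w is a unit-length weight vector.

```latex
\begin{align*}
\text{PCA:}\quad & \mathbf{w}_{\mathrm{PCA}} = \arg\max_{\lVert \mathbf{w} \rVert = 1} \operatorname{Var}(\mathbf{X}\mathbf{w}) \\
\text{PLS:}\quad & \mathbf{w}_{\mathrm{PLS}} = \arg\max_{\lVert \mathbf{w} \rVert = 1} \operatorname{Cov}(\mathbf{X}\mathbf{w}, \mathbf{y})^{2}
\end{align*}
```

The response y appears only in the PLS criterion, so the PCA axis is the same whatever response is later regressed on it.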

Results/Conclusions

To illustrate the differences between PCA and PLS, we used PLS to re-analyze data from a case study that originally used PCA. The original study used leaf traits of several Solanum species and asked whether the main axis of variation in the leaf traits (i.e., the first principal component axis) varied with habitat temperature and precipitation. That study found a significant relationship with temperature, but not with precipitation. However, when we instead asked the question "do leaf traits vary with temperature or precipitation?" and analyzed the data with PLS, we found a highly significant relationship between leaf traits and both temperature and precipitation. Examining the loadings of the PCA and PLS showed that the leaf traits that contributed most to the overall variation in the data (PCA) differed from those that explained variation in habitat temperature or precipitation (PLS).

We also used simulated datasets generated with different covariance structures to further illustrate differences between unsupervised and supervised analyses. When there were many predictor variables that strongly co-varied but were unrelated to the response, PLS greatly outperformed PCA at identifying which of the many predictors were most closely associated with the response.
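
The kind of simulation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' simulation: the covariance structure (eight predictors sharing one strong latent factor, plus two independent predictors that drive the response), the sample size, the seed, and all function names are hypothetical. It computes only the first axis of each method, using power iteration for the first principal component and the fact that, for a single response, the first PLS weight vector is proportional to X'y.

```python
import random


def simulate(n=300, n_noise=8, seed=1):
    """Hypothetical data: a strongly co-varying block unrelated to y, plus two signal predictors."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        latent = rng.gauss(0, 3)  # shared factor, unrelated to the response
        noise_block = [latent + rng.gauss(0, 1) for _ in range(n_noise)]
        signal = [rng.gauss(0, 1), rng.gauss(0, 1)]  # the only predictors that drive y
        X.append(noise_block + signal)
        y.append(signal[0] + signal[1] + rng.gauss(0, 0.5))
    return X, y


def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    n = len(X)
    cols = []
    for col in zip(*X):
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / (n - 1)) ** 0.5
        cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*cols)]


def normalize(v):
    """Scale a vector to unit length."""
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]


def first_pc_loadings(Xs, iters=100):
    """Loadings of the first principal component, by power iteration on X'X."""
    p = len(Xs[0])
    w = normalize([1.0] * p)
    for _ in range(iters):
        scores = [sum(row[j] * w[j] for j in range(p)) for row in Xs]  # X w
        w = normalize([sum(row[j] * s for row, s in zip(Xs, scores)) for j in range(p)])  # X'(X w)
    return w


def first_pls_weights(Xs, yc):
    """First PLS weight vector for a single response: proportional to X'y."""
    p = len(Xs[0])
    return normalize([sum(row[j] * yi for row, yi in zip(Xs, yc)) for j in range(p)])


def top_predictors(w, k=2):
    """Indices of the k predictors with the largest absolute weight."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]


X, y = simulate()
Xs = standardize(X)
y_mean = sum(y) / len(y)
yc = [v - y_mean for v in y]

pca_top = top_predictors(first_pc_loadings(Xs))
pls_top = top_predictors(first_pls_weights(Xs, yc))
print("Top predictors on PC1: ", pca_top)
print("Top predictors on PLS1:", pls_top)
```

In this setup the signal predictors are indices 8 and 9. The first principal component is dominated by the co-varying block (indices 0 to 7), whereas the first PLS axis weights the two signal predictors most heavily.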

There are many applications for both unsupervised and supervised approaches in ecology. However, PCA is currently overused, at least in part because supervised approaches such as PLS are less familiar.