Balancing the scales: addressing class imbalance in your omics datasets

February 14th, 2023, by Emily Hashimoto-Roth

There are a handful of foundational concepts that every machine learning enthusiast learns early on. Get rid of non-informative and redundant features. Spend time optimizing your model’s hyperparameters; it is time well spent. Finally, and most important here, build your model with training and testing sets that represent each class in your dataset equally. Class imbalance can drastically affect a model’s performance by biasing its predictions toward the over-represented class. This issue is pronounced in biological research, particularly in studies of rare conditions, where only a limited amount of data can be collected. These circumstances often lead to class imbalance that favours control groups (generally healthy patients, or animal models that serve as controls).
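To make the bias concrete, here is a small sketch with made-up numbers (a hypothetical cohort of 95 controls and 5 disease samples, not data from any study). A "model" that always predicts the majority class looks accurate overall, yet never identifies a single disease sample.

```python
# Hypothetical, deliberately imbalanced cohort: 95 controls, 5 disease samples.
labels = ["control"] * 95 + ["disease"] * 5

# A degenerate model that always predicts the majority class.
predictions = ["control"] * len(labels)

# Overall accuracy: fraction of samples labelled correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall for the minority class: fraction of disease samples that were found.
disease_total = labels.count("disease")
disease_found = sum(p == y == "disease" for p, y in zip(predictions, labels))
recall = disease_found / disease_total

print(f"accuracy = {accuracy:.2f}, disease recall = {recall:.2f}")
```

High accuracy paired with zero minority-class recall is the pattern to watch for when classes are imbalanced.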

Enter META-BOA. META-BOA (METAbolomics data Balancing with Over-sampling Algorithms) is a web application designed to address class imbalance in a manner accessible to all users. This tool, available on the CompLiMet (Computational Lipidomics and Metabolomics) suite, provides four over-sampling algorithms that augment the minority class(es) in a dataset, ultimately generating a dataset better suited for building a machine learning model.

  1. Synthetic minority over-sampling technique (SMOTE)
  2. Borderline synthetic minority over-sampling technique (BSMOTE)
  3. Adaptive synthetic over-sampling (ADASYN)
  4. Random over-sampling examples (ROSE)
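
For readers curious about what over-sampling does under the hood, the core idea of SMOTE can be sketched in a few lines of plain Python: pick a minority sample, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line between them. This is a simplified illustration, not META-BOA's implementation; the function name `smote`, its parameters, and the toy data are all assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 10 control samples and 4 disease samples, two features each.
controls = [[float(i), float(i % 3)] for i in range(10)]
disease = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5], [12.0, 11.0]]

# Generate just enough synthetic disease samples to match the control group.
new_samples = smote(disease, n_new=len(controls) - len(disease), k=3)
balanced_disease = disease + new_samples
print(len(controls), len(balanced_disease))
```

Because every synthetic point lies between two real minority samples, the new data stays within the region the minority class already occupies. BSMOTE and ADASYN build on the same interpolation step but choose where to generate points differently.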

To help users understand the effects of over-sampling, META-BOA also visualizes the user’s data before and after over-sampling via principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE). In addition, it performs a simple random forest classification and plots the results as a ROC curve, so that classification performance can be compared before and after over-sampling. These visualizations are provided so that users can pick the most appropriate over-sampling method for their specific dataset.
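
If ROC plots are new to you, the sketch below shows how one can be computed from a classifier's scores: sweep the decision threshold from high to low and record the true-positive and false-positive rates at each step. The helper names `roc_curve` and `auc` and the toy labels and scores are assumptions for illustration; they are not META-BOA output or code.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold
    from the highest score to the lowest. Labels are 1 (positive) or 0."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a group of tied scores.
        if idx == len(ranked) - 1 or ranked[idx + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical toy example: 1 = disease, 0 = control, with made-up classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_curve(labels, scores)
print(f"AUC = {auc(fpr, tpr):.3f}")
```

An area under the curve close to 1 means the classifier ranks disease samples above controls almost every time, while 0.5 is no better than chance. Comparing this value before and after over-sampling is one way to judge whether a given method helped.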

Want to learn more?
Hashimoto-Roth, E.*, Surendra, A., Lavallée-Adam, M., Bennett, S. A. L., and Čuperlović-Culf, M. (2022) METAbolomics data Balancing with Over-sampling Algorithms (META-BOA): an online resource for addressing class imbalance. Bioinformatics, 38(23), 5326–5327. https://doi.org/10.1093/bioinformatics/btac649