Every machine learning enthusiast learns a handful of foundational concepts early on. Remove non-informative and redundant features. Take the time to optimize your model's hyperparameters. Finally, and most important here, build your model with training and testing sets that represent each class in your dataset equally. Class imbalance can drastically degrade a model's performance by biasing its predictions toward the over-represented class. The issue is pronounced in biological research, particularly in studies of rare conditions, where only limited data can be collected. These circumstances often lead to class imbalance that favours control groups (generally healthy patients or control animal models).
Welcome to META-BOA. META-BOA (METAbolomics data Balancing with Over-sampling Algorithms) is an online web application designed to address class imbalance in a manner accessible to all users. This tool, available in the CompLiMet (Computational Lipidomics and Metabolomics) suite, provides four over-sampling algorithms that augment the minority class(es) in a dataset, ultimately generating a dataset better suited for building a machine learning model.
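To make the idea of over-sampling concrete, here is a minimal sketch of random over-sampling, the simplest way to augment a minority class. The four algorithms offered by META-BOA are not named here, so this is an illustration of the general technique under that assumption, not the tool's own implementation. The function name `random_oversample` and the toy "control"/"disease" data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (random over-sampling)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_samples = list(samples)
    out_labels = list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Add copies until this class reaches the target size
        for _ in range(target - n):
            out_samples.append(list(rng.choice(pool)))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy dataset: 6 control samples and 2 disease samples,
# each described by two made-up feature values.
samples = [
    [0.1, 1.2], [0.2, 1.1], [0.0, 1.3], [0.3, 1.0], [0.1, 1.4], [0.2, 1.2],
    [2.1, 0.3], [2.4, 0.1],
]
labels = ["control"] * 6 + ["disease"] * 2

balanced_samples, balanced_labels = random_oversample(samples, labels)
print("before:", dict(Counter(labels)))
print("after: ", dict(Counter(balanced_labels)))
```

Random over-sampling only repeats existing samples. Other over-sampling algorithms instead generate new synthetic samples, which is one reason it helps to compare the results of several methods on the same dataset.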
To help users understand the effects of over-sampling, META-BOA also visualizes the user's data before and after over-sampling via principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE). META-BOA also performs a simple random forest classification, with results shown as a ROC plot, to compare sample classification performance before and after over-sampling. These visualization methods are implemented so that users may pick the most appropriate over-sampling method for their specific dataset.
Hashimoto-Roth, E.*, Surendra, A., Lavallée-Adam, M., Bennett, S. A. L., and Čuperlović-Culf, M. (2022) METAbolomics data Balancing with Over-sampling Algorithms (META-BOA): an online resource for addressing class imbalance. Bioinformatics, 38(23), 5326–5327. doi: 10.1093/bioinformatics/btac649.