Intro to Machine Learning: Obesity Prediction

Federico Gerardi

In this post we will explore how to predict obesity using machine learning. We will use the Obesity Dataset to try different machine learning models and see which one performs best.

The Dataset

We have 17 features and 2111 samples. For more informations about the dataset, you can check the Kaggle page.

The Models

Linear Regression

Linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression.

Regression Equation: y = 0.00x + 150.00
Loss: 5625.00

With linear regression we aim to find the best line that fits the data. We can then use this line to predict new values.

Trying to predict the weight of a person based on the other features, we get the following results:

Decision Tree

BMI ≥ 25?NormalOverweight
Select Field to Build Decision Tree:

A decision tree makes decisions based on splitting data by chosen fields. The root node shows the primary decision, with branch nodes representing outcomes.

Decision trees split the data into branches based on feature values to make predictions. They are easy to interpret and can capture non-linear relationships in the data.

For obesity classification using decision trees, we achieved:

Logistic Regression

Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable (in this case, whether someone is overweight or not).

Height (cm)Weight (kg)160170180607080
Decision Boundary Equation: 0.001*x + 0.001*y + 0.00 = 0
Accuracy: 50.0%
Normal Weight Overweight

Unlike linear regression which predicts continuous values, logistic regression estimates probabilities for classification tasks. The S-shaped sigmoid curve transforms linear combinations of features into probabilities between 0 and 1.

For our obesity dataset, logistic regression achieved:

Multi-Layer Perceptron (Neural Network)

A multi-layer perceptron (MLP) is a class of feedforward artificial neural network that consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.

BMIAgeGenderActivityInputHidden 1Hidden 2NormalOverweightObeseOutputPhase: Input Processing

MLPs can learn non-linear relationships in the data and are capable of modeling complex patterns. Each neuron processes its inputs and passes the result through an activation function to the next layer.

For our obesity prediction task, the MLP model achieved:

Model Comparison

After evaluating all models, we can observe that:

  1. Linear Regression: Good for predicting continuous variables like weight, but not ideal for classification.
  2. Decision Tree: Provides interpretable rules but may not capture all patterns.
  3. Logistic Regression: Better than linear regression for classification, with good interpretability.
  4. Multi-Layer Perceptron: Achieved the highest accuracy but works as a "black box" with less interpretability.

The choice of model depends on whether you prioritize performance or interpretability. For clinical applications, a balance of both might be preferred.