From Hypothesis to Insights: A Step-by-Step Guide for people interested in Machine Learning Engineering

As a data scientist and machine learning (ML) engineer, I find transforming raw data into actionable insights an exhilarating journey. This journey involves a series of well-defined steps, each crucial for the success of the project. In this post, I’ll walk you through the entire process, from formulating a hypothesis to reporting results to stakeholders. Let’s start!

Formulate a Hypothesis

Every successful ML project begins with a clear and concise hypothesis. This hypothesis is a testable statement that your project will aim to prove or disprove. It sets the direction and scope of your work. For instance, you might hypothesize: “Using image data, we can classify plant species with an accuracy of at least 80%.”

It’s essential to clearly define the problem you’re solving, ensure your hypothesis is specific and measurable, and understand the business context and the impact of your findings.
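One way to make "specific and measurable" concrete is to decide up front how the 80% target will be checked. The sketch below is my own illustration, not part of the workflow above: it assumes you count correct predictions on a test set and uses the Wilson score interval (my choice of method) so that a small test set cannot pass the bar by luck. The function names are hypothetical.

```python
import math

def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return 0.0
    p = correct / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

def hypothesis_supported(correct: int, total: int, target: float = 0.80) -> bool:
    """True only if even the pessimistic end of the interval reaches the target."""
    return wilson_lower_bound(correct, total) >= target

# Made-up counts, for illustration only
print(hypothesis_supported(850, 1000))  # large test set, 85% observed accuracy
print(hypothesis_supported(81, 100))    # small test set, 81% observed accuracy
```

With a check like this, 81 correct out of 100 does not yet support the hypothesis, because the interval around such a small sample is wide.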

Gather Images and Labels

Once you have a hypothesis, the next step is to gather the necessary data. For an image classification project, this means collecting images and their corresponding labels. You can use tools like Beautiful Soup or Scrapy for web scraping, explore image databases such as Google Images and Kaggle Datasets, or use manual labeling tools like Label Studio and CVAT. Ensure your dataset is diverse and representative, balance the number of images per class to avoid bias, and preprocess images for consistency.
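Before training, it helps to check how balanced the collected labels are. The following is a minimal sketch using only the standard library; the label list is made up and `class_balance` is a hypothetical helper name.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts and the ratio between the largest and smallest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest

# Made-up labels for a plant dataset, for illustration only
labels = ["rose"] * 120 + ["tulip"] * 100 + ["daisy"] * 30
counts, ratio = class_balance(labels)

print(counts)
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio far above 1 suggests collecting more images for the rare classes or compensating during training.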

Define Experiments

Defining your experiments involves choosing the types of models, hyperparameters, and datasets you will use. This step requires careful planning to ensure comprehensive coverage of potential solutions. Select models based on the complexity and nature of your data, define a range of hyperparameters for tuning, and split your data into training, validation, and test sets so that you can detect overfitting and report an unbiased final score. Tools like TensorFlow, PyTorch, Keras, Optuna, and Scikit-learn can be very helpful here.
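The three-way split can be sketched in plain Python. This version splits each class separately so that every subset keeps roughly the same class mix; the 10% fractions, the seed, and the function name `stratified_split` are assumptions for illustration. In practice Scikit-learn's `train_test_split` (used later in this post) covers the same need.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = int(len(indices) * val_frac)
        n_test = int(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test

# Made-up labels, for illustration only
labels = ["cat"] * 50 + ["dog"] * 50
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```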

Set Up Experiment Tracking

Experiment tracking is essential for maintaining reproducibility and managing multiple experiments simultaneously. It allows you to compare different models and configurations systematically. Platforms like MLflow, Weights & Biases, and TensorBoard can help you track hyperparameters, metrics, and model artifacts. Log every configuration so that each experiment can be reproduced, and visualize training progress and results so that problems show up early.
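To show the core idea behind these platforms, here is a deliberately tiny stand-in: each run appends its parameters and metrics to a JSON Lines file, and the best run can be looked up later. This is a sketch under my own assumptions, not how MLflow or Weights & Biases work internally; the function names and the logged numbers are made up.

```python
import json
import tempfile
import time
from pathlib import Path

def log_run(log_file, params, metrics):
    """Append one experiment record (parameters plus metrics) as a JSON line."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def best_run(log_file, metric):
    """Return the logged run with the highest value for the given metric."""
    with open(log_file, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])

# Made-up runs written to a temporary directory, for illustration only
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"lr": 1e-3, "epochs": 10}, {"val_accuracy": 0.68})
log_run(log_file, {"lr": 1e-4, "epochs": 20}, {"val_accuracy": 0.74})

best = best_run(log_file, "val_accuracy")
print(best["params"])
```

A dedicated tracking platform adds what this sketch lacks: artifact storage, dashboards, and comparison across many runs.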

Train the Machine Learning Model(s)

Training your ML models is where the magic happens. This step involves feeding your data into the models and adjusting weights to minimize errors. Use frameworks like TensorFlow, PyTorch, and Keras, and leverage compute resources such as GPUs, TPUs, or cloud platforms like AWS, GCP, and Azure. Monitor training metrics to detect overfitting or underfitting, use techniques like early stopping and learning rate decay to improve training, and save checkpoints to resume training if needed.
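Early stopping is easy to reason about once you see the bookkeeping behind it. The sketch below is a framework-free illustration with made-up validation losses; the class name `EarlyStopper` and the patience value are assumptions. In Keras you would normally use the built-in `tf.keras.callbacks.EarlyStopping` callback instead.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, one per epoch
val_losses = [1.20, 0.95, 0.90, 0.92, 0.91, 0.93, 0.94]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best val loss {stopper.best_loss}")
```

The best loss here occurs three epochs before the stop, which is why saving a checkpoint at each improvement matters.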

Test the Models on a Hold-Out Test Set

After training, it’s crucial to evaluate your models on a hold-out test set that was not used during training or validation. This provides an unbiased estimate of model performance. Use evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrix with libraries like Scikit-learn and TensorFlow. Ensure the test set is representative of real-world data and perform error analysis to understand model weaknesses.
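Precision, recall, and F1 can all be derived from the confusion matrix, which makes it a useful starting point for error analysis. The sketch below assumes the Scikit-learn convention (rows are true classes, columns are predicted classes) and uses a made-up 2x2 matrix; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(cm):
    """Compute precision, recall, and F1 per class from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, but true class differs
        fn = sum(cm[k]) - tp                       # true k, but predicted something else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results

# Made-up confusion matrix for two classes
cm = [[8, 2],
      [1, 9]]
metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k, m in enumerate(metrics):
    print(k, {name: round(value, 3) for name, value in m.items()})
print("accuracy:", accuracy)
```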

Register the Most Suitable Model

Once you have identified the best-performing model, register it for deployment. This step involves saving the model along with its configuration and metadata. Use a model registry such as MLflow Model Registry, and serialization formats like ONNX and TensorFlow SavedModel. Document model details, ensure compatibility with your deployment environment, and maintain a version history to track improvements over time.
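The essence of a version history is a record that ties each saved artifact to its metrics and parameters. The following sketch keeps that record in a single JSON file; it is a hypothetical stand-in for a real registry, and the function name, file layout, and second set of numbers are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def register_model(registry_path, name, artifact, metrics, params):
    """Append a new version entry for `name` and return it."""
    registry_path = Path(registry_path)
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    else:
        registry = {}
    versions = registry.setdefault(name, [])
    entry = {
        "version": len(versions) + 1,  # versions count up from 1
        "artifact": artifact,
        "metrics": metrics,
        "params": params,
    }
    versions.append(entry)
    registry_path.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return entry

# Temporary registry file; the second entry uses placeholder numbers
registry_file = Path(tempfile.mkdtemp()) / "registry.json"
v1 = register_model(registry_file, "plant_classifier", "best_model.h5",
                    {"test_accuracy": 0.68}, {"epochs": 10})
v2 = register_model(registry_file, "plant_classifier", "best_model.keras",
                    {"test_accuracy": 0.75}, {"epochs": 20})
print(v1["version"], v2["version"])
```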

Visualize and Report Results

The final step is to visualize your findings and report them to your team or stakeholders. Effective visualization and clear reporting can significantly enhance the impact of your work. Use tools like Matplotlib, Seaborn, Plotly, and Dash for visualization, and Jupyter Notebooks, Google Slides, or PowerPoint for reporting. Use clear and intuitive visualizations to communicate results, highlight key insights and their implications for the business, and provide actionable recommendations based on your findings.

First Model: CNN Model

Here is a simple CNN model for the CIFAR-10 dataset:

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns

# Load CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Split training data into training and validation sets
from sklearn.model_selection import train_test_split
train_images, val_images, train_labels, val_labels = train_test_split(train_images, train_labels, test_size=0.2, random_state=42)

# Define the CNN model
def create_model():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    return model

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Setup TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")

# Train the model
history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(val_images, val_labels), 
                    callbacks=[tensorboard_callback])

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f'\nTest accuracy: {test_acc:.2f}')

# Save the model
model.save('best_model.h5')

# Visualize training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label = 'val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()

# Confusion matrix
predictions = model.predict(test_images)
predicted_labels = np.argmax(predictions, axis=1)
cm = confusion_matrix(test_labels, predicted_labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=range(10))
disp.plot(cmap=plt.cm.Blues)
plt.show()

With the following resulting Accuracy and Loss:

Although training and evaluation run without problems, the final test accuracy of 68% falls short of the 80% target. To improve performance, we can try several strategies such as data augmentation, adding more layers, or tuning hyperparameters.

Second Model: Data Augmentation and Updated CNN

Here’s an updated version of the code that includes data augmentation and a more complex CNN architecture. Additionally, we’ll update the model saving format to the recommended native Keras format.

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns

# Load CIFAR-10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Split training data into training and validation sets
from sklearn.model_selection import train_test_split
train_images, val_images, train_labels, val_labels = train_test_split(train_images, train_labels, test_size=0.2, random_state=42)

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
datagen.fit(train_images)

# Define an enhanced CNN model
def create_model():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(128, (3, 3), activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation='softmax'))
    return model

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Setup TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")

# Train the model with data augmentation
history = model.fit(datagen.flow(train_images, train_labels, batch_size=32),
                    epochs=20, 
                    validation_data=(val_images, val_labels), 
                    callbacks=[tensorboard_callback])

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f'\nTest accuracy: {test_acc:.2f}')

# Save the model in the native Keras format
model.save('best_model.keras')

# Visualize training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label = 'val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()

# Confusion matrix
predictions = model.predict(test_images)
predicted_labels = np.argmax(predictions, axis=1)
cm = confusion_matrix(test_labels, predicted_labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=range(10))
disp.plot(cmap=plt.cm.Blues)
plt.show()

Explanation of Updates

  • Data Augmentation: Added data augmentation using ImageDataGenerator to improve the model’s ability to generalize.
  • Enhanced CNN Model: Added more layers, batch normalization, and dropout to the CNN architecture to improve learning and reduce overfitting.
  • Updated Model Saving: Changed the model saving format to the recommended Keras format (model.save('best_model.keras')).

Expected Improvements

  • Data Augmentation: Helps the model learn more robust features by introducing variations in the training data.
  • Enhanced Model Architecture: Additional layers, batch normalization, and dropout can improve model performance and generalization.
  • Experiment with Hyperparameters: Adjusting hyperparameters such as learning rate, batch size, and the number of epochs could further improve performance.

Hyperparameter Tuning

There are at least three kinds of hyperparameter tuning that I would like to dive into:

  1. Grid Search
  2. Random Search
  3. Bayesian Optimization

Grid Search

Grid Search performs an exhaustive evaluation of every possible combination of hyperparameters, ensuring that all potential configurations are explored. Its simplicity and ease of implementation make it accessible and straightforward to use.

However, Grid Search can be computationally expensive and slow, especially when dealing with numerous hyperparameters and large datasets. This exhaustive approach may lead to inefficiency, as it often spends time evaluating suboptimal combinations that are unlikely to yield the best results.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None]
}

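# X_train and y_train are assumed to hold your training features and labels
grid_search = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best combination found and its mean cross-validated score
print(grid_search.best_params_, grid_search.best_score_)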
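To see where the cost of Grid Search comes from, you can enumerate the grid yourself. This sketch uses the same grid as above and only the standard library; it counts combinations rather than training anything.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

cv_folds = 5
total_fits = len(combinations) * cv_folds

for combo in combinations:
    print(combo)
print(f"{len(combinations)} combinations x {cv_folds} folds = {total_fits} model fits")
```

Two values times three values is already 30 model fits with 5-fold cross-validation, plus the final refit that `GridSearchCV` performs by default. Each extra hyperparameter multiplies that number.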

Random Search

Random Search is faster than Grid Search, as it evaluates a fixed number of random hyperparameter combinations, significantly saving time. By exploring a broader range of hyperparameter values, it has the potential to discover better configurations.

On the other hand, Random Search is not exhaustive and might miss the best hyperparameters since it does not evaluate all possible combinations. Additionally, its results can vary with different random seeds, introducing a level of uncertainty in the outcomes.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(100, 200),
    'max_depth': [10, 20, None]
}

random_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X_train, y_train)
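The mechanics of Random Search fit in a few lines. In this sketch, `toy_score` is a made-up stand-in for a cross-validated score, so the numbers mean nothing; the point is the fixed evaluation budget and the effect of the seed. The integer range 100 to 199 mirrors `randint(100, 200)` above, whose upper bound is exclusive.

```python
import random

def sample_params(rng):
    """Draw one random configuration from the search space."""
    return {
        "n_estimators": rng.randint(100, 199),  # inclusive on both ends
        "max_depth": rng.choice([10, 20, None]),
    }

def random_search(score_fn, n_iter=10, seed=0):
    """Evaluate n_iter random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical objective: pretends 150 trees is ideal."""
    return -abs(params["n_estimators"] - 150) / 100

best_params, best_score = random_search(toy_score, n_iter=10, seed=0)
print(best_params, best_score)

# A different seed samples different configurations and may find a different best
print(random_search(toy_score, n_iter=10, seed=1))
```

Fixing the seed makes a search repeatable, which is also why `RandomizedSearchCV` accepts a `random_state` argument.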

Bayesian Optimization: Efficient Hyperparameter Tuning

Bayesian Optimization builds a probabilistic model of the function mapping hyperparameters to the objective. It selects hyperparameters to evaluate by balancing exploration (trying new areas) and exploitation (focusing on promising areas).

The two main advantages are efficiency and smart search. It is efficient because it typically needs fewer evaluations to find good hyperparameters, and it searches smartly because it prioritizes promising regions of the hyperparameter space.

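from hyperopt import fmin, tpe, hp, Trials, space_eval
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

def objective(params):
    # hp.quniform yields floats, but n_estimators must be an integer
    params = {**params, 'n_estimators': int(params['n_estimators'])}
    clf = RandomForestClassifier(**params)
    # fmin minimizes, so return the negated mean cross-validation accuracy
    return -np.mean(cross_val_score(clf, X_train, y_train, cv=5, n_jobs=-1))

space = {
    'n_estimators': hp.quniform('n_estimators', 100, 200, 1),
    'max_depth': hp.choice('max_depth', [10, 20, None])
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# fmin reports hp.choice entries as list indices; space_eval maps them back to values
best_params = space_eval(space, best)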

Automated Tuning Tools: Hyperopt, Optuna, and AutoML Frameworks

Hyperopt: A Python library for serial and parallel optimization over hyperparameters. Suitable for scenarios where traditional Grid or Random Search are too slow or infeasible.

Optuna: An automatic hyperparameter optimization software framework, designed for both ease of use and performance. Ideal for machine learning tasks requiring sophisticated and efficient optimization strategies.

Google AutoML: A suite of machine learning products that enables developers with limited ML expertise to train high-quality models specific to their business needs. Useful for rapid development and deployment of custom ML models without deep expertise.

H2O.ai: An open-source platform known for scalable, distributed machine learning; its H2O AutoML component automates model training and tuning. Preferred for enterprises requiring robust, scalable machine learning solutions.

Small example with Optuna:

import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 100, 200)
    max_depth = trial.suggest_categorical('max_depth', [10, 20, None])
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    return -np.mean(cross_val_score(clf, X_train, y_train, cv=5))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)

TensorBoard

To access the TensorBoard functionality, activate your virtual environment and launch TensorBoard:

source path/to/your/virtualenv/bin/activate
tensorboard --logdir=logs

This will give you a comprehensive overview of all training sessions and metrics, providing valuable insights into your model’s performance.
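If every run writes to the same directory, TensorBoard cannot tell training sessions apart. A common pattern is one timestamped subdirectory per run. The helper below is a sketch: `make_run_log_dir` and the run name are hypothetical, and the returned path is meant to be passed as `log_dir` to `tf.keras.callbacks.TensorBoard`.

```python
from datetime import datetime

def make_run_log_dir(base_dir="logs", run_name=None, now=None):
    """Build a per-run log directory path such as logs/20240115-093000-cnn."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    name = f"{stamp}-{run_name}" if run_name else stamp
    return f"{base_dir}/{name}"

# Fixed timestamp so the example output is predictable
log_dir = make_run_log_dir(run_name="cnn_augmented",
                           now=datetime(2024, 1, 15, 9, 30, 0))
print(log_dir)

# In the training script (not executed here):
# tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
```

Pointing `tensorboard --logdir` at the base directory then lists each run separately, which makes comparisons between models much easier.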

Conclusion

Navigating the machine learning project lifecycle requires careful planning, execution, and communication. By following these steps and leveraging the right tools and libraries, you can enhance the accuracy, reproducibility, and impact of your ML projects.
