# Exploring Supervised Learning Techniques in Anomaly Detection


Supervised learning is a category of machine learning where algorithms are developed using labeled datasets. Common methods in this area include linear regression and decision trees. Among these techniques, ensemble methods hold significant importance. These methods integrate various individual models to form a more effective composite model. This approach enhances predictive accuracy and fosters improved generalization to novel data instances. Therefore, discussing supervised learning necessitates an exploration of ensemble methods.

The term "ensemble" draws a parallel to a musical ensemble, where different instruments, such as flutes and violins, play together. Each instrument contributes its own sound, enriching the overall experience. Similarly, ensemble methods in machine learning combine several base models into a single predictive model that is often more accurate and more robust than any one of its members. Bagging mainly reduces variance, which helps against overfitting, while boosting mainly reduces bias by correcting the errors of earlier models.

Hyperparameter tuning is a vital aspect of machine learning. It involves selecting the best values for hyperparameters, which are set prior to training and not learned during the process. A well-tuned model can significantly boost performance. We will guide you through the key hyperparameters for different techniques. Given the complexity of tuning multiple hyperparameters, an automated iterative method known as grid search has been developed to facilitate this process.

This chapter, along with the next, will assist you in constructing a range of supervised learning models, including Random Forests, Gradient Boosting, XGBoost, Deep Learning, and Generalized Linear Models (GLMs). We will utilize H2O, chosen for its user-friendly API and popularity in the field. H2O is an open-source machine learning platform that offers a powerful and scalable environment for model development and deployment, compatible with R and Python, and integrates seamlessly with big data technologies like Hadoop and Spark.

We will delve into techniques such as Bagging, Boosting, and Deep Learning sequentially, followed by hyperparameter tuning through grid search, complete with code examples. The learning roadmap includes:

- Revisiting a single decision tree
- Ensemble method — Bagging (Random Forests)
- Ensemble method — Boosting (Gradient Boosting Machine)
- Neural networks
- Grid search for hyperparameter tuning

With this comprehensive list, you will be well-equipped to handle complex predictive challenges with increased accuracy and reliability.

## (A) Revisiting a Single Decision Tree

Decision trees are a familiar concept, often reflected in everyday decision-making processes. For instance, consider someone contemplating playing golf outdoors. This decision involves evaluating various factors, such as weather conditions (sunny or rainy), humidity, wind, whether to go alone or with a companion, tournament preparation, and costs. Figure (1) visually represents this decision-making process, highlighting how we deconstruct complex decisions into simpler components, which contributes to the popularity of decision trees.

Imagine this individual has consistently weighed similar factors in past decisions regarding playing golf. Over time, a significant dataset emerges, capturing the conditions (inputs) and corresponding decisions (outputs). By examining this historical data, we can construct a decision tree to identify patterns and forecast future choices.
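To make the idea of learning a tree from historical records concrete, here is a minimal sketch of how one split could be chosen. It assumes Gini impurity as the split criterion (one common choice, not necessarily the one used later in this chapter), and the records and the names `gini` and `split_impurity` are hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (outlook, windy, played_golf)
records = [
    ("sunny", False, True),
    ("sunny", True, True),
    ("rainy", False, False),
    ("rainy", True, False),
    ("sunny", False, True),
    ("rainy", False, True),
]

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(rows, feature_index):
    """Weighted Gini impurity after grouping rows by one feature's values."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

outlook_score = split_impurity(records, 0)
windy_score = split_impurity(records, 1)

# The feature with the lower weighted impurity gives the cleaner split
best_feature = "outlook" if outlook_score < windy_score else "windy"
print("outlook:", round(outlook_score, 3), "windy:", round(windy_score, 3))
print("split on:", best_feature)
```

A full tree repeats this choice recursively inside each resulting group until a stopping rule, such as a maximum depth, is reached.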

While decision tree models are powerful, they are not without drawbacks. A primary concern is overfitting, where the model becomes overly tailored to the specific details of the training data, capturing noise rather than generalizable trends. This occurs because the decision tree fits the training data too closely, leading to excessive complexity and reduced effectiveness in predicting unseen data.

To combat this, ensemble methods come into play. Instead of relying on a solitary decision tree, ensemble techniques combine multiple decision trees to enhance predictive performance and stability. The final prediction aggregates the outcomes of all individual trees.

## (B) Ensemble Methods: Bagging (Random Forests)

The term Bagging, short for Bootstrap Aggregating, refers to building multiple models independently and averaging their predictions to enhance accuracy and stability. A well-known example of bagging is the Random Forest technique. In a random forest, several decision trees are built on different bootstrap samples of the training data, and the final prediction is the average of the trees' predictions (for regression) or their majority vote (for classification). This reduces variance and helps prevent overfitting, resulting in more robust models.

The random forest technique samples data multiple times to build various models, all executed independently and concurrently, as illustrated in Figure (2). The outcome is the average of the predicted values from all models.

From a mathematical perspective, picture the training data as a matrix whose rows are observations and whose columns are the features plus the target variable. Bagging draws subsets of rows by sampling with replacement, so a given row can appear several times in one subset and not at all in another. The subsets can also vary in size. For each subset, the bagging method grows a full decision tree. These trees may be deep and complex, which makes each one prone to overfitting the details of its own sample, but the ensemble approach mitigates this risk.
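The sampling step can be sketched in a few lines of plain Python. This is an illustration only: it assumes the classic bootstrap, where each sample has the same size as the original data, and the row indices and the helper name `bootstrap_sample` are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(42)   # fixed seed so the run is repeatable
rows = list(range(10))    # stand-in for the row indices of a training set

samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for s in samples:
    left_out = sorted(set(rows) - set(s))
    print("sample:", sorted(s), "rows never drawn:", left_out)
```

Each sample typically repeats some rows and omits others, which is what makes the trees built on them differ from one another.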

### (B.1) Random Forests

When numerous trees are constructed in this manner, they collectively form a forest. Each tree generates its own prediction for a given input, and the final prediction is derived by averaging the outputs of all trees (or, for classification, by majority vote). Averaging smooths out the errors of the individual overfitted trees, chiefly by reducing variance, resulting in a more balanced and generalizable model.

Figure (3) demonstrates how the predictions from multiple decision trees are averaged to arrive at the final prediction.

By integrating the outputs from numerous trees, the ensemble method capitalizes on the diversity among models to enhance overall accuracy and diminish the effects of overfitting present in any single tree.
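The aggregation step itself is simple. The sketch below uses made-up tree outputs and hypothetical function names to show averaging for regression and majority voting for classification.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    """Average the numeric predictions of all trees."""
    return mean(tree_predictions)

def aggregate_classification(tree_predictions):
    """Return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs from three trees for a single input
avg = aggregate_regression([5.0, 6.0, 7.0])
vote = aggregate_classification(["play", "stay", "play"])
print(avg, vote)
```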

### (B.2) Essential Hyperparameters for Random Forests

Hyperparameter tuning is a process of determining optimal values for a machine learning model's hyperparameters, which are established before training and not learned during the training phase. Below are some key hyperparameters for random forests:

- **Number of Trees**: This hyperparameter specifies how many decision trees are included in the forest. For instance, a random forest with 100 trees comprises 100 decision trees.
- **Maximum Tree Depth**: This indicates the maximum depth allowed for any tree in the random forest model. A maximum depth of 10 means no tree will exceed this depth.
- **Minimum Rows**: This defines the minimum number of observations required for a leaf.

However, identifying the best hyperparameter combinations can be challenging without constructing and comparing various random forest models. To explore these combinations, grid search is an effective tool. Further details on tuning hyperparameters will be covered in **Section (E) Grid Search for Hyperparameter Tuning**.

Now, let’s move on to another ensemble method: Boosting.

## (C) Ensemble Methods: Boosting (GBM)

Gradient boosting boasts a rich history, significantly shaped by researchers such as Freund et al. (1996), Freund and Schapire (1997), Breiman et al. (1998), Breiman (1999), Friedman et al. (2000), and Friedman (2001). The fundamental principle of gradient boosting involves constructing a series of small, simple models (typically decision trees) sequentially. Each new model is trained to rectify the residual errors (the difference between actual and predicted values) made by preceding models. The Gradient Boosting Machine (GBM) represents a prominent example of this technique.

### (C.1) How GBM Works

To explain GBM, I often liken it to operating a cannon to hit a target. You aim and fire a shot.

- Your initial shot symbolizes your first weak model in GBM, likely missing the target or only nearing it. After this shot, you assess the distance and direction of the miss, which corresponds to the residual errors in GBM — indicating how far off your initial prediction is from actual values. This assessment resembles training a new model in GBM to correct errors made by the first model.
- You fire a second shot aimed at correcting the error identified in the first shot.
- The third shot is aimed to reduce the error from the second shot.
- This process continues, with each shot targeting the remaining errors from previous attempts. The cumulative adjustments lead to increasingly accurate shots, culminating in a model that combines all previous models.

The strength of boosting lies not in the individual weak models, but in the sequential correction process. By consistently addressing and minimizing errors, the predictive accuracy of the combined model significantly improves.
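The sequence of shots can be sketched in plain Python. This is a simplified illustration, not H2O's implementation: it assumes squared-error loss (so the residuals are what each new model fits), uses one-split "stumps" as the weak models, and runs on made-up one-dimensional data. The names `fit_stump` and `boost` are hypothetical.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_stump(xs, residuals):
    """Weak model: the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_rounds, learn_rate):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)          # the first "shot": predict the mean
    preds = [base] * len(ys)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # how far off we are
        stump = fit_stump(xs, residuals)                  # model the miss
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))

    def predict(x):
        return base + learn_rate * sum(s(x) for s in stumps)

    return predict, history

# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 2.0, 4.0, 4.5, 6.0, 7.5, 8.0]

predict, history = boost(xs, ys, n_rounds=20, learn_rate=0.5)
print("MSE before boosting:", round(history[0], 4))
print("MSE after 20 rounds:", round(history[-1], 4))
```

The `history` list records the training error after each round, so the effect of every additional weak model can be inspected directly.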

A critical element of gradient boosting is its dependence on an optimization algorithm called "gradient descent." Let’s explore how this functions.

### (C.2) Gradient Descent to Minimize a Loss Function

At the core of many machine learning algorithms is the objective of minimizing a loss function, which quantifies the disparity between the model's predictions and actual target values. Dealing with extensive datasets or complex features often makes analytical solutions for minimizing the loss function impractical. Gradient descent, along with its variant Stochastic Gradient Descent (SGD), offers a numerical method to navigate high-dimensional spaces and optimize model parameters.

To illustrate gradient descent, consider a simple parabola as shown in Figure (5). To find the x value that minimizes the parabola y=x²-x-2, one typically sets the first-order derivative to zero: dy/dx=2x-1=0. The slope is therefore zero at x=1/2, which is where the minimum lies. A function is termed tractable when it allows for straightforward mathematical calculations leading to a solution, as this one does.

However, many functions are not tractable. In such cases, a numerical approach is required to discover the optimal value minimizing the target function. Starting from any randomly chosen number on the parabola, we can perform iterative calculations to approach the optimal value:

- Suppose we begin with x=3. At this point, the slope is dy/dx=2x-1=2(3)-1=5 and y=4. The positive slope means y increases as x increases, so the minimum lies to the left.
- Now take another point on the parabola, say x=-3. The slope is dy/dx=2x-1=2(-3)-1=-7 and y=10. The negative slope means y decreases as x increases, so the minimum lies to the right.
- In both cases, to reduce y we move x in the direction opposite to the sign of the slope, hence the term "descent."
- Mathematically, the next x is the current x minus the step length times the slope: x_next = x_current - η·(dy/dx). The value of η is the step length, also known as the learning rate (lr). Smaller η values mean smaller steps, which take longer to converge, while values that are too large can overshoot the minimum.

In summary, gradient descent makes small, incremental adjustments to model parameters. This iterative approach facilitates continuous model improvement, as each step progresses closer to the optimal solution.
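The iteration on the parabola y=x²-x-2 can be written out directly. This sketch assumes a learning rate of 0.1 and 100 steps, both chosen only for illustration, and starts from the two points used in the walkthrough.

```python
def gradient(x):
    """Derivative of y = x**2 - x - 2."""
    return 2 * x - 1

def gradient_descent(x_start, learning_rate, n_steps):
    """Repeatedly step against the slope, scaled by the learning rate."""
    x = x_start
    for _ in range(n_steps):
        x = x - learning_rate * gradient(x)
    return x

x_from_right = gradient_descent(3.0, learning_rate=0.1, n_steps=100)
x_from_left = gradient_descent(-3.0, learning_rate=0.1, n_steps=100)
print(x_from_right, x_from_left)  # both end up very close to the analytical minimum at 0.5
```

Both starting points converge to the same place because the sign of the slope always points away from the minimum, so subtracting it moves x toward the minimum.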

Having covered bagging and boosting methods, let's transition to deep learning.

## (D) Neural Networks

In this chapter, we introduce the simplest form of neural network: the Feedforward Neural Network. It comprises three primary layer types: the input layer, hidden layers, and the output layer. Unlike recurrent neural networks (RNNs), feedforward networks do not contain cycles or loops; information flows strictly in one direction, forward.

As noted, we previously covered feedforward neural networks in Chapter 8. The book is designed to gradually build knowledge from simple to complex concepts, making it easier to learn. For further details, please refer to Chapter 8, which discusses the input layer, hidden layers, output layer, activation functions, weights and biases, loss functions, and optimization algorithms.

A crucial term in neural networks is propagation, which refers to the process of transferring information through the network layers. This occurs in two main forms: forward propagation and backpropagation. Forward propagation calculates the outputs based on the weights, while backpropagation updates those weights. Let’s define each process.

Forward propagation involves passing input data through the network to generate an output, comprising several steps:

- **Input Layer**: The input data enters the network through the input layer, with each feature corresponding to a neuron in this layer.
- **Weighted Sum Calculation**: The input data is multiplied by the weights associated with each connection between neurons.
- **Activation Function**: The weighted sum then passes through an activation function.
- **Propagation to the Next Layer**: The activation outputs from one layer become the inputs for the next layer, continuing through the hidden layers until reaching the output layer.
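The steps above can be traced by hand on a tiny network. The sketch below assumes two inputs, one hidden layer of two ReLU neurons, and one linear output neuron; the weights, biases, and the helper name `dense_layer` are made up for illustration.

```python
def relu(z):
    """Rectifier activation: pass positive values, clip negatives to zero."""
    return max(0.0, z)

def identity(z):
    """Linear output, as used for a regression target."""
    return z

def dense_layer(inputs, weights, biases, activation):
    """Compute one layer. weights[j] holds the incoming weights of neuron j."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs

# Hypothetical input and parameters
x = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.5]

hidden = dense_layer(x, hidden_weights, hidden_biases, relu)          # input -> hidden
output = dense_layer(hidden, output_weights, output_biases, identity) # hidden -> output
print(hidden, output)  # [0.0, 2.0] [-1.5]
```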

Backpropagation, short for backward propagation of errors, is also vital in training neural networks. Although it’s a newer term, it's relatively easy to grasp. Consider a basketball player attempting to make a shot. Initially, the player may miss the hoop. After each attempt, they reflect on what went wrong — perhaps they threw too hard or too soft — and adjust their aim. Backpropagation functions similarly in neural networks:

- **Initial Attempt**: Think of the neural network as a player taking a shot, making an initial guess.
- **Check the Result**: After the network generates a prediction, we assess its closeness to the correct answer (akin to verifying if the basketball went through the hoop).
- **Adjust the Aim**: If the prediction is inaccurate, we need to modify the network to improve its accuracy, similar to how a player adjusts their throw based on previous attempts.
- **Learning from Mistakes**: The network evaluates the difference between its prediction and the actual answer (the "error" or "loss") and determines how to adjust its parameters (like changing its "throwing technique") to enhance accuracy.
- **Iterate and Improve**: This process repeats multiple times, with each iteration enabling the network to gradually improve its predictive capabilities.

Thus, backpropagation learns from errors and enhances model performance by performing Steps 1–3 as depicted in Figure (6):

1. **Calculate the Error**: After forward propagation, the network's prediction is compared to actual target values to compute the loss.
2. **Gradient Calculation**: Backpropagation computes the gradient of the loss function with respect to each weight and bias in the network.
3. **Weight Update**: Using the gradients, the network's weights and biases are adjusted to minimize the loss. This update typically employs an optimization algorithm like Stochastic Gradient Descent (SGD) or Adam.

In summary, forward propagation computes the scores based on weights, while backpropagation calculates the gradient of the error with respect to each weight and updates those weights.
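The three steps can be seen in miniature on a single linear neuron. This is a deliberately reduced sketch: with only one neuron there are no layers to propagate through, so the gradient is written out directly, whereas backpropagation in a real network applies the chain rule layer by layer. The data, the learning rate, and the name `train_step` are assumptions for illustration.

```python
def train_step(w, b, xs, ys, learning_rate):
    """One forward pass followed by Steps 1-3 for the model pred = w * x + b."""
    n = len(xs)
    preds = [w * x + b for x in xs]                    # forward propagation
    errors = [p - y for p, y in zip(preds, ys)]

    loss = sum(e ** 2 for e in errors) / n             # Step 1: mean squared error

    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n   # Step 2: d(loss)/dw
    grad_b = 2 * sum(errors) / n                               #         d(loss)/db

    w = w - learning_rate * grad_w                     # Step 3: update the parameters
    b = b - learning_rate * grad_b
    return w, b, loss

# Hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
losses = []
for _ in range(2000):
    w, b, loss = train_step(w, b, xs, ys, learning_rate=0.05)
    losses.append(loss)

print("w:", round(w, 4), "b:", round(b, 4))
print("first loss:", round(losses[0], 4), "last loss:", losses[-1])
```

The learned weight and bias approach the values used to generate the data, and the recorded losses shrink as the iterations proceed.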

## (E) Grid Search for Hyperparameter Tuning

Grid search is employed in hyperparameter tuning to identify the optimal hyperparameter set for a model. It entails defining a range of potential values for each hyperparameter and systematically assessing the model's performance across all combinations. The objective is to pinpoint the combination yielding the best performance according to a specified evaluation metric (e.g., accuracy, mean squared error).
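Before turning to a library, the mechanics of "all combinations" can be sketched in plain Python. The grid values here are illustrative, and `toy_score` is a hypothetical stand-in for training a model and measuring its validation error; it is not a real evaluation.

```python
from itertools import product

param_grid = {
    "ntrees": [100, 200, 300],
    "max_depth": [10, 20, 30],
    "min_rows": [10, 50, 100],
}

def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

def toy_score(params):
    """Hypothetical validation error (lower is better); replace with a real fit."""
    return (abs(params["ntrees"] - 200) / 100
            + abs(params["max_depth"] - 20) / 10
            + params["min_rows"] / 100)

combos = expand_grid(param_grid)
best = min(combos, key=toy_score)   # pick the combination with the lowest error
print(len(combos), "combinations; best:", best)
```

The number of combinations is the product of the list lengths (3 × 3 × 3 = 27 here), which is why grids grow expensive quickly as hyperparameters are added.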

After importing the H2O library, H2O requires initialization. We then load a dataset, partition the data into training and testing sets, and create a random forest model. Subsequently, we evaluate model performance and generate predictions. The coding style of H2O is notably straightforward.

In this chapter, we will utilize the publicly available red-wine quality dataset, accessible on Kaggle.com or the UC Irvine Machine Learning Repository. The target variable is 'quality', the wine quality rated from 0 (low) to 10 (high). The input features are 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', and 'alcohol'. Note that the dataset spells the column 'sulphates'.

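```python
import h2o
import pandas as pd
from h2o.estimators import H2ORandomForestEstimator

# Initialize H2O
h2o.init()

# Load the data into an H2OFrame.
# The file name is an assumption; point it at your copy of the red-wine data.
# The UCI file is semicolon-delimited; use sep=',' if your copy is comma-delimited.
df = pd.read_csv('winequality-red.csv', sep=';')
df_hex = h2o.H2OFrame(df)

# Split the data
train, test = df_hex.split_frame(ratios=[0.8], seed=1234)

# Define the predictors and the target
predictors = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
              'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
              'pH', 'sulphates', 'alcohol']
response = 'quality'
```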

To build a random forest model with H2O, we can use:

```python
# Define the model
rf0 = H2ORandomForestEstimator(ntrees=100, max_depth=5, min_rows=10)

# Train the RF model
rf0.train(x=predictors, y=response, training_frame=train, validation_frame=test)

# Evaluate performance (with no arguments, this returns the training metrics)
perf = rf0.model_performance()

# Generate predictions on the test set (if necessary)
pred = rf0.predict(test)
```

Next, we will incorporate hyperparameter tuning into the random forest model.

### (E.1) Grid Search for Random Forests

H2O uses H2OGridSearch() to conduct grid search. When the model is specified as H2ORandomForestEstimator, the grid trains one random forest per hyperparameter combination.

```python
from h2o.grid.grid_search import H2OGridSearch

# Define the hyperparameter space for Random Forest (you can adjust these parameters)
rf_params = {
    'ntrees': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_rows': [10, 50, 100]
}

# Create a Cartesian grid search (every combination is trained)
rf_grid = H2OGridSearch(model=H2ORandomForestEstimator,
                        grid_id='rf_grid',
                        hyper_params=rf_params)

# Train the grid of Random Forest models
rf_grid.train(x=predictors, y=response, training_frame=train, validation_frame=test)

# Sort the models by MAE; lower is better, so sort in ascending order
rf_models = rf_grid.get_grid(sort_by='mae', decreasing=False)

# Get the best model (lowest MAE)
best_rf = rf_models.models[0]

# Evaluate the best model on the held-out frame
test_perf = best_rf.model_performance(test_data=test)
print("MAE on test set:", test_perf.mae())
```

The code above outlines the following steps:

- **hyper_params**: This specifies the grid of hyperparameters to explore. In this example it covers the number of trees (ntrees), the maximum tree depth (max_depth), and the minimum number of rows per leaf (min_rows). Regularization hyperparameters (L1 and L2) will be tested in the subsequent chapter.
- **H2OGridSearch()**: This function defines the grid search over the specified hyperparameters.
- **grid.train()**: This trains the models for each combination of hyperparameters using the training data and evaluates them on the validation set.
- **grid.get_grid()**: This retrieves the results of the grid search, sorted by the specified metric (MAE in this example). You can adjust the sort_by parameter based on the problem type. For an error metric such as MAE, sort in ascending order so that the best model comes first.

The chapter presents code with slight variations for different techniques to build familiarity in constructing all models sequentially.

### (E.2) Grid Search for GBM

Next, we will use H2OGradientBoostingEstimator in H2OGridSearch to perform grid search for GBM.

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# GBM hyperparameters
gbm_params = {
    'ntrees': [100, 200, 400],
    'learn_rate': [0.01, 0.1],
    'max_depth': [3, 5, 9],
    'min_rows': [10, 50, 100]
}

# Train and validate a Cartesian grid of GBMs
gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                         grid_id='gbm_grid',
                         hyper_params=gbm_params)
gbm_grid.train(x=predictors,
               y=response,
               training_frame=train,
               validation_frame=test,
               seed=1)

# Get the grid results, sorted by MAE in ascending order (lower is better)
gbm_gridperf = gbm_grid.get_grid(sort_by='mae', decreasing=False)

# Grab the top GBM model (lowest MAE)
best_gbm = gbm_gridperf.models[0]

# Now let's evaluate the model performance on the held-out frame
best_gbm_perf = best_gbm.model_performance(test)
print("MAE on test set:", best_gbm_perf.mae())
```

The code above mirrors that in Section (E.1). The components include:

- **ntrees**: This denotes the number of trees to construct. H2O's default is 50.
- **learn_rate**: This represents the learning rate, as discussed in Section (C.2) regarding gradient descent.
- **max_depth**: This indicates the maximum tree depth.
- **min_rows**: This specifies the minimum number of observations for a leaf, defaulting to 10.

Next, we will apply a similar procedure for neural networks.

### (E.3) Grid Search for Neural Networks

To conduct hyperparameter tuning for a deep learning model using the H2O library in Python, simply pass a deep learning estimator to H2O's grid search function, as in the following example:

```python
import h2o
from h2o.estimators import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

hyper_params = {
    'activation': ['Rectifier', 'Tanh', 'Maxout'],
    'hidden': [[50, 50], [100, 100], [200, 200]],
    'epochs': [10, 50, 100],
}

# Define the deep learning estimator
dl = H2ODeepLearningEstimator(
    # 'quality' is numeric, so this is a regression problem;
    # use "bernoulli" or "multinomial" for classification targets
    distribution="gaussian",
    stopping_metric="MAE",  # use the appropriate metric for your task
    stopping_tolerance=0.001,
    stopping_rounds=5,
    score_each_iteration=True,
    seed=1234
)

# Perform grid search
dl_grid = H2OGridSearch(model=dl,
                        grid_id='dl_grid',
                        hyper_params=hyper_params)
dl_grid.train(x=predictors, y=response,
              training_frame=train,
              validation_frame=test)

# Get the grid results, sorted by validation MAE in ascending order (lower is better)
dl_gridperf = dl_grid.get_grid(sort_by='mae', decreasing=False)

# Get the best model based on validation performance
best_dl = dl_gridperf.models[0]
best_dl_perf = best_dl.model_performance(test)
print("MAE on test set:", best_dl_perf.mae())
```

The code above encompasses:

- **hyper_params**: The activation parameter denotes the activation function, hidden specifies the architecture (number of neurons in each hidden layer), and epochs defines the number of passes over the training data.
- **H2ODeepLearningEstimator()**: This function defines a feedforward neural network.

The remainder of the code parallels that of other sections.

Grid search can be computationally intensive, especially with extensive grids and complex models. Consider limiting the number of hyperparameter combinations, or employing more efficient search methods like Random Search or Bayesian Optimization when necessary.
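As a rough illustration of the random search idea, the sketch below samples a fixed number of combinations from a grid instead of training all of them. It is plain Python and does not call H2O; the grid values, the budget of five models, and the name `random_search` are assumptions. H2O's grid search also accepts a `search_criteria` argument for random search, which this sketch does not use.

```python
import random
from itertools import product

gbm_grid = {
    "ntrees": [100, 200, 400],
    "learn_rate": [0.01, 0.1],
    "max_depth": [3, 5, 9],
    "min_rows": [10, 50, 100],
}

# Every combination a full (Cartesian) grid search would train
names = list(gbm_grid)
all_combos = [dict(zip(names, values)) for values in product(*gbm_grid.values())]

def random_search(combos, n_models, seed):
    """Pick n_models distinct combinations at random instead of trying all."""
    rng = random.Random(seed)
    return rng.sample(combos, k=n_models)

sampled = random_search(all_combos, n_models=5, seed=0)
print(len(all_combos), "combinations in the full grid")
for params in sampled:
    print(params)
```

Here the full grid holds 54 combinations, while the random search would train only the five sampled ones, trading exhaustive coverage for a fixed compute budget.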

## (F) Summary

This chapter provides an extensive overview of supervised learning models, including Random Forests, Gradient Boosting, XGBoost, and Deep Learning. It underscores the significance of hyperparameter tuning, a crucial process for optimizing machine learning models, detailing the key hyperparameters for various techniques and employing grid search for systematic optimization.

Handbook of Anomaly Detection: Cutting-edge Methods and Hands-On Code Examples, 2nd edition
