Ensemble models: Techniques, benefits, applications, algorithms and implementation
In life, the decisions we make hold the potential to shape our future. Whether it’s choosing a university program or signing a job contract, these decisions hold tremendous weight. Thus, we often seek advice from trusted sources and gather copious amounts of information to navigate these pivotal moments. Machine Learning (ML) has embraced a similar philosophy through ensemble learning. Machine learning predictions, driven by the patterns observed during training, mirror our quest for informed decision-making.
As per Fortune Business Insights, the machine learning market has witnessed remarkable growth, reaching a value of $19.20 billion in 2022. Projections indicate that it is set to soar even further, reaching an impressive $225.91 billion by 2030, reflecting a Compound Annual Growth Rate (CAGR) of 36.2%. With the machine learning market experiencing significant growth, there is growing interest in and acceptance of advanced ML algorithms and models. Within this diverse landscape of machine learning models, ensemble models are seeing a noticeable increase in adoption. This heightened interest can be attributed to the widespread acknowledgment of their superiority over individual or single models.
Ensemble models operate on a captivating premise inspired by our own inclination to seek diverse opinions and perspectives before making important decisions. It is a technique that aims to enhance performance and decision-making capabilities by harnessing the collective wisdom of multiple models. These models combine the outputs and insights from multiple individual models, leveraging their unique strengths and compensating for their weaknesses. The result is a synergistic collaboration that often outperforms any single model, paving the way for more accurate predictions and robust solutions. For instance, while one model may excel in capturing certain patterns or features, another model might perform better in handling complex scenarios. This fusion of models creates a more comprehensive and robust prediction, reducing the risk of overfitting and improving generalization capabilities.
Ensemble learning represents a paradigm shift in the world of decision-making, drawing inspiration from our innate desire to seek diverse perspectives. Let’s delve deeper into this article to learn the intricacies of ensemble models, exploring their wide-ranging applications, benefits and more.
- What is an ensemble model?
- Ensemble model techniques
- Ensemble model algorithms
- Implementation of an ensemble model using Python
- Benefits of an ensemble model
- Applications of an ensemble model
What is an ensemble model?
The ensemble model is a machine-learning approach where multiple models work together to make better predictions. Just like a group of people can make smarter decisions than individuals, a group of models can produce more accurate results. By combining the predictions from different models, ensemble learning helps to improve accuracy and make predictions more reliable. It’s like getting feedback from many people with different perspectives to make a better decision. Ensemble modeling is widely used in various areas of machine learning to enhance performance and make more accurate predictions.
For example, suppose you made a short film and want feedback before releasing it publicly. You could ask your friends, but they might be too nice to give you honest opinions. Another option is to ask a group of colleagues, who may lack the expertise to judge your film accurately.
A better idea would be to gather feedback from a larger and more diverse group of, say, 50 people, including friends, colleagues, and even strangers. This way, you get a wider range of perspectives and knowledge to evaluate your film.
This is similar to the ensemble approach, where you combine the predictions of multiple models. Just like getting feedback from many people helps you make better decisions, combining the predictions from different models helps improve accuracy and reliability in machine learning.
The ensemble model takes advantage of the strengths of different models and reduces their weaknesses, resulting in more accurate predictions. By bringing together the wisdom of multiple models, the ensemble model enhances performance and can be used in various fields like computer vision, language processing, and financial forecasting.
Ensemble model techniques
Here, we will discuss both simple and advanced ensemble model techniques.
The simple ensemble model techniques
The simple ensemble model techniques include:
Majority voting (Max voting)
Max voting is a simple technique used in ensemble modeling to make predictions by combining the outputs of multiple models. In the context of classification tasks, each model in the ensemble independently predicts the class label for a given input, and the final prediction is determined by selecting the class label that receives the highest number of votes from the individual models.
Imagine you made a movie and asked five of your friends to rate it on a scale of 1 to 5. Here are their ratings:
| Friend A | Friend B | Friend C | Friend D | Friend E | Final rating |
|----------|----------|----------|----------|----------|--------------|
| 4        | 4        | 5        | 4        | 5        | 4            |
To decide the overall rating using max voting, you look at the ratings and see which one has the most votes. In this case, the rating “4” got the most votes, with three friends giving it that rating. Only two friends gave a 5-star rating.
So, based on the majority vote, the final rating for your movie would be “4.” It means that most of your friends thought it deserved a rating of “4”.
This shows how an ensemble model works using max voting. Each model in the ensemble gives its prediction, and the majority vote of the models determines the final prediction. It helps combine the opinions of different models to make a final decision.
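To make this concrete, here is a minimal sketch of max voting using scikit-learn's VotingClassifier with hard voting. The synthetic dataset, choice of base estimators and hyperparameters below are illustrative placeholders rather than part of the original example:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for any tabular classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: each base model casts one vote, and the majority class wins
voting_model = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
voting_model.fit(X_train, y_train)
print("Majority-voting accuracy:", voting_model.score(X_test, y_test))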
Averaging
Averaging is a technique used in machine learning to combine predictions from multiple models. When using averaging, several models are trained on the same dataset, and each model makes predictions for each data point.
In regression problems, where the goal is to predict a continuous numerical value, such as predicting the price of a house, the averaging technique involves taking the average of the predictions made by all the models. This average value is then used as the final prediction.
In classification problems, where the goal is to assign a data point to one of several predefined categories, probabilities can be calculated for each class by each model. Averaging is used to calculate the average probability assigned to each class by all the models. The class with the highest average probability is then chosen as the final prediction. Let’s consider the previous example:
Here, the average of the five ratings, (4 + 4 + 5 + 4 + 5) / 5, gives a final rating of 4.4:

| Friend A | Friend B | Friend C | Friend D | Friend E | Final rating |
|----------|----------|----------|----------|----------|--------------|
| 4        | 4        | 5        | 4        | 5        | 4.4          |
The averaging technique can improve the overall accuracy and robustness of predictions by leveraging the diversity and collective knowledge of multiple models. It reduces the impact of individual model biases or errors, leading to more reliable predictions.
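As a minimal illustration of averaging for classification, the sketch below averages the class probabilities predicted by three scikit-learn models on a synthetic dataset; the models and data are placeholders chosen for brevity:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), KNeighborsClassifier()]
for model in models:
    model.fit(X_train, y_train)

# Average the class probabilities predicted by each model
avg_proba = np.mean([model.predict_proba(X_test) for model in models], axis=0)

# The class with the highest average probability becomes the final prediction
final_pred = np.argmax(avg_proba, axis=1)
print("Averaging accuracy:", np.mean(final_pred == y_test))

scikit-learn's VotingClassifier with voting="soft" implements the same idea in a single estimator.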
Weighted averaging
The weighted average is the extension of the averaging method, where each model is assigned a weight that reflects its importance in the final prediction. These weights determine the level of influence each model’s predictions will have on the overall result.
To illustrate the concept, let’s consider a scenario where five colleagues provide feedback. Two of these colleagues are recognized experts and critics in the field, while the remaining three lack prior experience. In this case, we would assign higher weights to the expert colleagues’ feedback, reflecting their superior understanding and knowledge compared to the inexperienced reviewers. The higher weights acknowledge the greater significance we attribute to their predictions in the final outcome.
|        | Friend A | Friend B | Friend C | Friend D | Friend E | Final rating |
|--------|----------|----------|----------|----------|----------|--------------|
| Rating | 5        | 4        | 5        | 4        | 4        | 4.41         |
| Weight | 0.23     | 0.23     | 0.18     | 0.18     | 0.18     |              |
The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
The same process is followed in ensemble modeling. Weighted averaging offers several advantages over traditional averaging methods. First, it allows us to account for different models’ varying expertise and performance. Assigning higher weights to more reliable models ensures that their predictions carry more weight in the final result. This helps mitigate the impact of models with lower performance or limited domain knowledge.
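Here is a small sketch of the weighted-average calculation above using NumPy; the same np.average call can be applied to model probabilities, with the weights chosen, for example, from each model's validation accuracy (an assumption made purely for illustration):

import numpy as np

# Ratings from five reviewers and weights reflecting each reviewer's expertise
ratings = np.array([5, 4, 5, 4, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])  # weights sum to 1

final_rating = np.average(ratings, weights=weights)
print(round(final_rating, 2))  # 4.41

# The same idea applies to model outputs, e.g. weighting each model's predicted
# probabilities by its validation accuracy before averaging:
# weighted_proba = np.average([m.predict_proba(X_test) for m in models],
#                             axis=0, weights=model_weights)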
The advanced ensemble model techniques
The advanced ensemble model techniques include:
Stacking
Stacking is a popular ensemble modeling technique used in machine learning. It involves combining multiple weak learners in a parallel manner and leveraging meta-learners to improve the accuracy of future predictions.
The stacking process begins with training several weak learners on the same dataset. These weak learners can be different machine learning algorithms or models with distinct characteristics. Each weak learner independently makes predictions on the input data.
Once the weak learners have generated their predictions, these outputs are used as input for a meta-learner. The meta-learner learns how to effectively combine the predictions from the weak learners to generate an improved output prediction. The meta-learner’s task is to find the best way to weigh and merge the predictions from the weak learners to achieve better overall performance.
The architecture of stacking
The stacking ensemble method is structured with a specific architecture comprising two or more base/learner models and a meta-model that combines the predictions of the base models. This architecture involves the original data, primary-level models, primary-level predictions, a secondary-level model, and the final prediction. The fundamental components of the stacking architecture are as follows:
Original data: The original dataset is divided into n-folds and serves as training and test data.
Base models: These models are also known as level-0 models. They utilize the training data to generate individual predictions, which are combined to form the level-0 predictions.
Level-0 predictions: Each base model is trained on a portion of the training data and produces distinct predictions, known as level-0 predictions.
Meta-model: The stacking model incorporates a meta-model, also referred to as the level-1 model. The meta-model is responsible for effectively combining the predictions from the base models.
Level-1 predictions: The meta-model is trained using the level-0 predictions generated by the base models. It learns how to optimize the combination of these predictions by utilizing a separate data set that was not used to train the base models. The level-1 predictions are obtained by feeding this data into the meta-model.
The training process for the meta-model involves using the level-0 predictions as input and the corresponding expected outputs as the target. By leveraging this input-output pairing, the meta-model learns how to integrate the base models’ predictions best.
The final prediction is obtained by applying the meta-model to the level-0 predictions made on the original dataset.
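A minimal sketch of this architecture using scikit-learn's StackingClassifier, which handles the level-0/level-1 split internally via cross-validation; the base models, meta-model and synthetic dataset below are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Level-0 (base) models
base_models = [
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier()),
]

# Level-1 meta-model trained on the out-of-fold predictions of the base models
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # n-fold split used to generate the level-0 predictions
)
stacking_model.fit(X_train, y_train)
print("Stacking accuracy:", stacking_model.score(X_test, y_test))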
Blending
Blending is an ensemble technique that combines predictions from multiple models to improve performance and accuracy. It follows a similar idea as stacking but takes a simpler approach. The blending process typically involves the following steps:
Data split: The original training set is split into two parts: the training and validation sets. The validation set is typically a smaller portion of the data, often around 10-15% of the training set.
Model training: Models are trained on the training set using different algorithms or techniques. These models can be of different types, have different hyperparameters, or use different feature sets, creating diversity in the ensemble.
Prediction generation: The trained models make predictions on both the validation and test sets. These predictions are obtained by feeding the respective datasets into each model and obtaining the output.
Blending model creation: A new model, called the blending model, is built using the validation set and its corresponding predictions from the individual models. The blending model learns to combine the predictions of the individual models on the validation set and make final predictions.
Final prediction: The blending model is used to make predictions on the test set or unseen data. It leverages the patterns learned from the validation set and its predictions to generate the final predictions.
The key difference between blending and stacking is that blending uses a separate validation set to train the blending model while stacking uses out-of-fold predictions from the training set.
Blending provides a simpler approach to ensemble learning: it avoids the out-of-fold computations that stacking requires and reduces the potential for data leakage. By training the blending model on the predictions made for a held-out validation set, it combines the outputs of multiple models into improved final predictions.
Bagging
Bagging, or bootstrap aggregation, is an ensemble learning technique used to make more accurate predictions and handle noisy data. It is a method to make machine learning algorithms more reliable and accurate. Bagging reduces the chance of overfitting and decreases variability in the predictions. It is often used with decision trees and is a way of combining multiple models to improve performance.
Here’s how it works:
Data sampling: We randomly select subsets of data from the training set, allowing some data points to be chosen more than once. These subsets are called bootstrap samples.
Independent training: Each bootstrap sample is used to train a separate model. These models are trained independently of each other, so they don’t know what the other models are doing.
Combining predictions: Once the models are trained, we use them to make predictions on new, unseen data. For regression tasks, we average the predictions from all the models. We choose the prediction that occurs most frequently (majority vote) for classification tasks.
For example, if there are five base models and three of them predict Class A while two predict Class B, the bagged classifier would assign Class A to the unknown sample since it received the majority of votes. The goal of bagging is to reduce variance and improve the accuracy of predictions. We create a more reliable and robust model by training multiple models on different subsets of data and combining their predictions.
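The three steps above can be sketched by hand in a few lines of Python; the synthetic dataset, the choice of decision trees as base models and the ensemble size are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

rng = np.random.default_rng(7)
n_models = 5
predictions = []

for _ in range(n_models):
    # Data sampling: draw a bootstrap sample (with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Independent training: fit a separate tree on each bootstrap sample
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    predictions.append(tree.predict(X_test))

# Combining predictions: majority vote across the five trees
votes = np.stack(predictions)  # shape: (n_models, n_test_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("Bagging accuracy:", np.mean(majority == y_test))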
Boosting
Boosting is a powerful ensemble learning technique that aims to improve a predictive model’s overall performance by combining weaker models, referred to as weak learners, into strong learners.
Here’s how boosting works:
Subset creation: Initially, a subset of the original dataset is created. This subset is used to train the first weak learner.
Weight initialization: Each data point in the subset is assigned an initial weight. Initially, all weights are equal, indicating that each data point has the same importance.
Model creation and training: A base model, often a simple one like a decision tree with limited depth (also known as a stump), is created and trained on the subset.
Prediction and error calculation: The trained model is used to make predictions on the entire dataset. The errors are calculated by comparing the predicted values with the actual values.
Weight update: The weights of the incorrectly predicted data points are increased, emphasizing the importance of these points in subsequent iterations. This adjustment allows the next weak learner to focus more on the misclassified observations.
Model iterations: Multiple iterations are performed, with each iteration creating a new weak learner. Each subsequent weak learner is trained on a modified version of the dataset, where the weights of the data points have been adjusted based on the errors of the previous weak learner.
Final model creation: The final model, often called the strong learner or the boosted model, is formed by combining the predictions of all the weak learners, usually through a weighted mean or voting scheme. The weights assigned to each weak learner may depend on their individual performance or other factors.
It’s important to note that various boosting algorithms, such as AdaBoost, Gradient boosting, and XGBoost, each with specific variations and enhancements, are available. These algorithms employ different techniques to update the weights, combine the weak learners, and handle various aspects of the boosting process.
Ensemble model algorithms
Ensemble model algorithms leverage the collective power of multiple models to improve predictive accuracy and enhance generalization capabilities. These algorithms combine individual models’ outputs, compensating for weaknesses and leveraging diverse strengths, resulting in superior performance. Let’s delve into the intricacies of the algorithms employed in bagging and boosting, which are distinctive approaches within the ensemble modeling framework.
Bagging algorithms include:
Bagging meta-estimator
The bagging meta-estimator is an ensemble algorithm that can be applied to both classification (Bagging classifier) and regression (Bagging regressor) problems. It utilizes the bagging technique to make predictions by combining the outputs of multiple models.
Here are the steps involved in the bagging meta-estimator algorithm:
Random subset creation: The algorithm begins by creating random subsets, also known as bootstrap samples, from the original dataset. Bootstrapping involves sampling the data with replacement, so each subset can contain duplicate instances while other instances are left out entirely.
Feature inclusion: Each subset of the dataset contains all the available features. In other words, when creating the subsets, the algorithm uses the same set of features for each subset.
Base estimator fitting: A user-specified base estimator, such as a decision tree or a support vector machine, is fitted on each of the smaller subsets. This means the same base estimator is trained multiple times using a different bootstrap sample.
Prediction: Once the base estimators are trained on their respective subsets, predictions are made using each model. The final prediction is obtained by combining the predictions from all the models. The combination can vary depending on the task, such as averaging the predictions for regression or majority voting for classification.
By leveraging multiple models trained on different bootstrap samples, the bagging meta-estimator helps to reduce variance and improve the stability and accuracy of the predictions. The use of bootstrapping allows for diversity in the subsets and helps capture different aspects of the data.
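In practice, scikit-learn's BaggingClassifier implements this meta-estimator directly. A minimal sketch follows; note that in recent scikit-learn releases the base-model parameter is named estimator (older releases use base_estimator), and the dataset and settings are placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Bagging meta-estimator: 10 decision trees, each fitted on a bootstrap sample
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=10,
    bootstrap=True,     # sample the training data with replacement
    random_state=3,
)
bagging_model.fit(X_train, y_train)
print("Bagging meta-estimator accuracy:", bagging_model.score(X_test, y_test))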
Random forest
Random forest is an enhanced version of the bagging technique that builds upon the concept of decision trees. Decision trees serve as the base estimators, and additional randomness is introduced to enhance the diversity and robustness of the ensemble.
The process begins by creating random subsets, known as bootstrap samples, from the original dataset. Each subset is generated by sampling with replacement, allowing for the inclusion of multiple instances of the same data point.
When constructing each decision tree, only a random subset of features is considered at each node to determine the optimal split. This random feature selection ensures that different trees focus on different aspects of the data, introducing variability into the ensemble.
For each subset, a decision tree model is built independently. These individual trees capture different patterns and relationships within the data, utilizing the specific set of features available at each node.
Finally, the outputs from each tree in the random forest are combined when making predictions. The aggregation process varies depending on the task (classification or regression) but typically involves averaging the predictions from all trees to obtain the final result. This aggregation leverages the collective wisdom of the models, incorporating diverse perspectives to improve the random forest model’s overall accuracy and generalization capability.
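A minimal random forest sketch with scikit-learn, highlighting the max_features parameter that controls the random subset of features considered at each split; the synthetic dataset and hyperparameters are placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped trees
    max_features="sqrt",  # random subset of features considered at each split
    random_state=5,
)
forest.fit(X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))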
Boosting algorithms include:
Adaptive boosting (AdaBoost)
AdaBoost is a fundamental boosting algorithm commonly employed with decision trees as base models. It operates through a sequential process of creating multiple models that progressively correct errors made by preceding models. Here is how it works:
Weight initialization: The initial step involves assigning equal weights to all observations in the dataset.
Model creation: Next, a model is constructed using a subset of the data, often a decision tree.
Prediction and error calculation: Using this model, predictions are generated for the entire dataset, and errors are calculated by comparing the predictions with the actual values.
Weight update: In the subsequent iteration, higher weights are assigned to the observations that were inaccurately predicted in the previous iteration. The weights are determined based on the error values, where observations with greater errors receive larger weights.
Iteration: The process is repeated, constructing a new model using the updated weights and generating predictions on the dataset. Each new model focuses on improving predictions for the previously misclassified observations. The iterations persist until a stopping condition is met, such as the error function reaching a satisfactory level or the predefined maximum number of estimators being reached.
Final prediction: By iteratively adjusting the weights and creating new models that target the misclassified observations, AdaBoost enhances the overall predictive performance of the ensemble. The final prediction is typically determined by aggregating the predictions from all models, usually through weighted voting or weighted averaging based on each model’s performance during training.
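A minimal AdaBoost sketch with scikit-learn, using depth-1 decision trees (stumps) as the weak learners; as with the bagging example, the estimator parameter name applies to recent scikit-learn versions, and the dataset and settings are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# Decision stumps as the weak learners, reweighted sequentially
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,    # maximum number of weak learners (stopping condition)
    learning_rate=1.0,
    random_state=11,
)
ada_model.fit(X_train, y_train)
print("AdaBoost accuracy:", ada_model.score(X_test, y_test))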
Gradient boosting (GBM)
Gradient Boosting is a powerful algorithm that combines weak models, such as decision trees, to create a strong predictive model. The goal is to iteratively improve the model’s accuracy by learning from the mistakes made in previous iterations.
Initially, a weak model is trained on the dataset, and its predictions are compared to the actual target values. The differences between the predictions and the true values are the errors.
In each iteration, a new weak model is trained to minimize these errors. However, instead of using the original target values as labels, the new model is trained on the errors made by the ensemble models built so far. This approach allows the new model to focus on the areas where the previous models performed poorly.
The new model’s predictions are then added to the ensemble, and the process is repeated. Each new model is trained to minimize the combined ensemble’s errors further. By iteratively adding new models, the ensemble gradually improves its predictive performance.
A shrinkage or learning rate is applied to prevent any single model from dominating the ensemble. The predictions of each model are scaled down by the learning rate before being combined. This helps to find a balance between accuracy and overfitting, as a smaller learning rate reduces the impact of each individual model.
The iteration process continues until a stopping criterion is met, which could be reaching a certain level of accuracy or a predefined number of models.
Gradient boosting is an effective algorithm for creating accurate predictive models by iteratively combining weak models and learning from their errors. In gradient-boosted trees, the final prediction is given by the formula:
y(pred) = y1 + (eta * r1) + (eta * r2) + … + (eta * rN)
Here, y1 is the initial prediction, r1, r2, …, rN are the predictions (residual corrections) of the successive trees, and eta is the learning rate. The learning rate scales the contribution of each tree’s prediction; it ranges between 0 and 1 and helps control the impact of each tree on the final result. By multiplying each tree’s correction by the learning rate, the overall effect of each individual tree is reduced, or “shrunk.”
The scaled predictions from each tree in the ensemble are summed to obtain the final prediction. The larger the learning rate, the more weight each tree’s prediction carries in the final result; a smaller learning rate, however, can improve the model’s overall performance by reducing the risk of overfitting.
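A minimal gradient boosting sketch with scikit-learn's GradientBoostingRegressor, where learning_rate plays the role of eta in the formula above; the synthetic regression dataset and hyperparameters are placeholders:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

gbm = GradientBoostingRegressor(
    n_estimators=200,    # N trees in the formula above
    learning_rate=0.1,   # eta: shrinks each tree's contribution
    max_depth=3,
    random_state=13,
)
gbm.fit(X_train, y_train)
print("GBM R^2 score:", gbm.score(X_test, y_test))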
Extreme gradient boosting (XGBoost)
XGBoost is an optimized distributed gradient boosting library that researchers at the University of Washington originally proposed. It is written in C++ and provides efficient and scalable training for gradient-boosting models.
Gradient boosting is an ensemble learning method that combines the predictions of multiple weak models, typically decision trees, to create a stronger predictive model. XGBoost improves upon traditional gradient boosting by introducing several optimizations and algorithmic enhancements, which result in faster training speed and better performance.
XGBoost works by training an ensemble of decision trees sequentially to create a strong predictive model. Let’s dive into the details of how XGBoost operates:
Objective function: XGBoost starts by defining an objective function that must be optimized during training. The objective function consists of two components: a loss function that measures the difference between the predicted and actual values and a regularization term that controls the complexity of the model. The goal is to minimize the objective function to improve the model’s performance.
Boosting and gradient descent: XGBoost uses boosting, a technique where weak learners (decision trees) are iteratively added to the ensemble to correct the mistakes of the previous models. During each iteration, XGBoost calculates the gradient of the loss function with respect to the current predictions. It then fits a new decision tree to the loss function’s negative gradient (residuals), effectively updating the ensemble to reduce the remaining error.
Tree construction: XGBoost greedily constructs decision trees. It starts with an empty tree and iteratively adds new nodes to create splits. It evaluates different splitting points based on the reduction in the objective function (loss function plus regularization). The algorithm searches for the optimal split by considering multiple factors, such as the purity of the resulting subsets and the overall gain in the objective function.
Weighted instances: Each training instance is assigned a weight, initially set to the same value for all instances. However, as the algorithm progresses, the weights are updated based on the errors of the previous models. Instances that were predicted incorrectly by the ensemble are given higher weights to focus the subsequent models’ attention on these harder-to-predict cases.
Tree pruning: XGBoost allows the decision trees to grow up to a specified maximum depth (controlled by the max depth hyperparameter). After reaching the maximum depth, the algorithm performs backward pruning. It evaluates the quality of each split in the tree and removes those that do not result in a positive gain in the objective function. Pruning helps to control the complexity of the trees and prevent overfitting.
Regularization: XGBoost incorporates regularization techniques to prevent overfitting. The objective function includes both L1 (Lasso) and L2 (Ridge) regularization terms. These regularization terms penalize the complexity of the model by adding additional terms based on the weights of the features. Regularization encourages the model to favor simpler trees and helps improve generalization.
Ensemble: After training a set of decision trees, XGBoost combines their predictions to make the final prediction. The individual predictions of the trees are weighted based on their importance, which is determined by factors such as their performance and the number of times they were selected during the boosting process. The ensemble of trees produces a more accurate and robust prediction than a single tree.
Parallel processing: XGBoost is designed to leverage parallel processing capabilities. It can utilize multiple threads or distributed computing frameworks to train the ensemble of trees concurrently. This parallelization speeds up the training process, particularly for large datasets, by simultaneously allowing multiple trees to be built and updated.
The flexibility and customizability of XGBoost’s algorithm and its ability to handle missing values and perform cross-validation contribute to its exceptional performance in various machine-learning tasks.
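A minimal XGBoost sketch using the xgboost Python package (assumed to be installed separately, e.g. via pip install xgboost), showing how the regularization and parallelism options described above surface as hyperparameters; the dataset and values are placeholders:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=4,          # maximum tree depth before pruning
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    reg_lambda=1.0,       # L2 regularization on leaf weights
    reg_alpha=0.0,        # L1 regularization
    n_jobs=-1,            # parallel tree construction
)
xgb_model.fit(X_train, y_train)
print("XGBoost accuracy:", xgb_model.score(X_test, y_test))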
Light gradient boosting machine (Light GBM)
LightGBM is a gradient-boosting framework that leverages decision trees to enhance model efficiency and reduce memory usage. It introduces two innovative techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address limitations in the histogram-based algorithms commonly used in gradient-boosting decision tree frameworks. These techniques work together to make LightGBM a highly efficient and competitive framework.
In lightGBM, the GOSS technique optimizes information gain computation by considering the varying roles of different data instances. Instances with larger gradients, indicating under-trained instances, contribute more to information gain. GOSS retains instances with large gradients (above a threshold or among the top percentiles) while randomly dropping instances with small gradients. This selective approach ensures accurate estimation of information gain, leading to more precise gain estimations compared to uniformly random sampling. This technique is particularly valuable when information gain has a wide range of values.
Additionally, lightGBM employs Exclusive Feature Bundling (EFB) to handle high-dimensional and sparse data efficiently. Many features are mutually exclusive in high-dimensional, sparse feature spaces, meaning they never take non-zero values simultaneously. These mutually exclusive features can be bundled together into a single feature, forming an EFB. By bundling exclusive features, the complexity of histogram building is reduced from O(data × feature) to O(data × bundle), where the bundle is significantly smaller than the original number of features. This reduction in complexity improves the training framework’s speed without sacrificing accuracy.
LightGBM adopts a leaf-wise tree growth strategy, unlike other boosting algorithms that grow trees level-wise. It selects the leaf with the maximum delta loss for further growth in this approach. By fixing the leaf and growing it, the leaf-wise algorithm achieves lower loss compared to the level-wise algorithm. However, it’s important to note that leaf-wise tree growth can potentially increase model complexity and may lead to overfitting in small datasets.
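A minimal LightGBM sketch using the lightgbm Python package (assumed installed, e.g. via pip install lightgbm), where num_leaves governs the leaf-wise growth discussed above; the dataset and hyperparameters are placeholders:

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=19)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)

lgbm_model = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=31,        # controls leaf-wise tree growth / model complexity
)
lgbm_model.fit(X_train, y_train)
print("LightGBM accuracy:", lgbm_model.score(X_test, y_test))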
Categorical boosting (CatBoost)
CatBoost is a machine learning algorithm that has been recently made open-source by Yandex. It is designed to handle diverse data types, making it valuable for solving a wide range of business problems. One of its notable strengths is its ability to achieve top-notch accuracy without requiring extensive data training like other machine learning methods. Additionally, CatBoost provides built-in support for descriptive data formats commonly encountered in business scenarios.
The name “CatBoost” originates from two words: “Category” and “Boosting.” The algorithm is adept at working with different categories of data, such as audio, text, images, and historical data. The “Boosting” part refers to the algorithm’s foundation in gradient boosting, which is a powerful machine-learning technique. Gradient boosting is widely used to tackle various business challenges, including fraud detection, recommendation systems, and forecasting. It excels in delivering high-quality results even with relatively smaller amounts of data, unlike deep learning models that often require massive datasets to learn effectively. In simple terms, CatBoost combines the strengths of handling diverse data types and leveraging the power of gradient boosting to provide accurate and efficient solutions for business problems.
One of the key advantages of CatBoost is its ability to automatically handle categorical features without the need for explicit feature encoding techniques such as a one-hot encoder or label encoder. Unlike traditional gradient boosting methods, CatBoost can directly process categorical features, saving time and effort in preprocessing steps.
Additionally, CatBoost employs an algorithm called Symmetric Weighted Quantile Sketch (SWQS) to handle missing values in the dataset effectively. By addressing missing values, CatBoost helps reduce the risk of overfitting and improves the model’s overall performance.
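A minimal CatBoost sketch using the catboost Python package (assumed installed, e.g. via pip install catboost), illustrating how a raw categorical column can be passed directly via cat_features without manual encoding; the tiny synthetic frame and settings are purely illustrative:

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Small synthetic frame with a categorical column that CatBoost can use directly
df = pd.DataFrame({
    "purpose": ["credit_card", "debt_consolidation", "home_improvement"] * 100,
    "fico": [700, 680, 720] * 100,
    "defaulted": [0, 1, 0] * 100,
})
X = df[["purpose", "fico"]]
y = df["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

cat_model = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    cat_features=["purpose"],  # no one-hot or label encoding needed
    verbose=0,
)
cat_model.fit(X_train, y_train)
print("CatBoost accuracy:", cat_model.score(X_test, y_test))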
Implementation of an ensemble model using Python
By now, you understand ensemble modeling and the functioning of certain models. This section will delve into the practical implementation of an ensemble model using Python.
Step-1: Environment setup
To start, ensure that you have Python installed on your computer, and make sure to have the following libraries installed as well:
- Pandas, used for loading the data frame.
- Scikit-learn, which provides access to various pre-existing machine learning algorithms.
- Seaborn and Matplotlib, both of which are useful for visualization purposes.
You can install these libraries using pip, the Python package manager, as follows:
pip install pandas
pip install scikit-learn
pip install matplotlib seaborn
Step-2: Understanding data
We import the pandas library under the alias pd, read a CSV file named "loan_data.csv" into a pandas DataFrame called loan_data, and display its first few rows using the head() function:
import pandas as pd

loan_data = pd.read_csv("loan_data.csv")
loan_data.head()
To illustrate the use case, we will be working with the following loan data:
| Credit policy | Purpose | Int. rate | Installment | Log. annual. inc | Dti | Fico | Days. with. cr. line | Revol. bal | Revol. util |
|---|---|---|---|---|---|---|---|---|---|
| 0 | debt consolidation | 0.1189 | 829.10 | 11.35040 | 19.4 | 737 | 5639.9583 | 288554 | 52.1 |
| 1 | credit card | 0.1071 | 228.22 | 11.08214 | 14.29 | 707 | 2760.0000 | 33623 | 76.7 |
| 2 | debt consolidation | 0.1357 | 366.86 | 10.37349 | 11.63 | 682 | 4710.0000 | 3511 | 25.6 |
| 3 | debt consolidation | 0.1008 | 162.34 | 11.35049 | 8.10 | 712 | 2699.9583 | 33667 | 73.2 |
| 4 | credit card | 0.1426 | 102.92 | 11.29973 | 14.97 | 667 | 4066.0000 | 4740 | 39.5 |
The info() function provides a summary of the data:
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   credit.policy      9578 non-null   int64
 1   purpose            9578 non-null   object
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64
 11  delinq.2yrs        9578 non-null   int64
 12  pub.rec            9578 non-null   int64
 13  not.fully.paid     9578 non-null   int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
Upon closer examination, we have identified an imbalanced data scenario in which the number of observations for class 0 significantly outweighs the number of observations for class 1. Specifically, we have 8045 observations for class 0, while only 1533 observations exist for class 1.
print(loan_data['not.fully.paid'].value_counts())
loan_data['not.fully.paid'].value_counts().plot(kind='barh', color='#144086')
The above code calculates the frequency of each unique value in the ‘not.fully.paid’ column of the ‘loan_data’ dataset, providing insights into the distribution. It then visualizes this distribution using a horizontal bar plot.
When dealing with imbalanced data during model training, the skewed distribution can negatively impact overall performance. To address this, one approach is to utilize undersampling, which involves reducing the number of majority observations (class 0) to match the number of observations in class 1.
The following steps outline the process:
- Determine the number of observations in class 1.
- Randomly select N observations from class 0, where N represents the data frame size in class 1.
- Combine the two data frames obtained from the previous steps through concatenation.
By implementing undersampling, we can create a more balanced training dataset that helps alleviate the bias caused by imbalanced data.
loan_data_class_1 = loan_data[loan_data['not.fully.paid'] == 1]
number_class_1 = len(loan_data_class_1)
loan_data_class_0 = loan_data[loan_data['not.fully.paid'] == 0].sample(number_class_1)

final_loan_data = pd.concat([loan_data_class_1, loan_data_class_0])
print(final_loan_data.shape)
After undersampling, both classes contain 1,533 observations each, giving a balanced dataset of 3,066 rows and 14 columns.
Step-3: Prepare the data for training
After analyzing the dataset, it becomes evident that the features do not share the same scale. This discrepancy can pose a problem for certain machine learning models, including KNN, as they are sensitive to such variations in scaling. To address this issue, we can utilize the MinMaxScaler module to normalize the feature ranges to a consistent scale, typically between 0 and 1.
In addition, for the sake of simplicity and to streamline the preprocessing steps, we have decided to remove the “purpose” column from the dataset. While the purpose column may contain valuable information, its inclusion would require additional preprocessing steps that we want to avoid in this particular scenario.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

# Remove the unwanted 'purpose' column and get the features
final_loan_data.drop('purpose', axis=1, inplace=True)
X = final_loan_data.drop('not.fully.paid', axis=1)
normalized_X = scaler.fit_transform(X)
As with any predictive task, splitting the dataset into training and testing sets is essential. For this specific case, we will allocate 67% of the data for training purposes, while the remaining 33% will be used for testing.
To ensure reproducibility and consistency in our results, we initialize the random state attribute with a specific value, which in our case, is set to 2023. This ensures that the same randomization process can be replicated, yielding consistent outcomes in subsequent runs.
In addition, we employ the “stratify” parameter during the data splitting process. Using stratified sampling, we ensure that the target variable, represented by the “y” value, is evenly distributed across the training and testing datasets. This helps maintain the integrity of the original distribution and prevents potential biases during the evaluation of the model’s performance.
from sklearn.model_selection import train_test_split

y = final_loan_data['not.fully.paid']
r_state = 2023
t_size = 0.33

X_train, X_test, y_train, y_test = train_test_split(
    normalized_X, y, test_size=t_size, random_state=r_state, stratify=y)
The models will be trained on the training data, used to predict the testing data, and their performance will be evaluated accordingly.
Step-4: Bagging model
The Random Forest model will be utilized as the bagging model in this section, with the default parameters employed for simplicity.
from sklearn.ensemble import RandomForestClassifier

# Define the model
random_forest_model = RandomForestClassifier()

# Fit the model to the training data
random_forest_model.fit(X_train, y_train)
The performance of the trained model is evaluated using the accuracy_score function.
from sklearn.metrics import accuracy_score

# Make predictions
y_pred = random_forest_model.predict(X_test)

# Get the performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Now, let’s compare this score to the performance of a blending aggregation approach, which combines multiple models and lets us assess whether the ensemble improves on the random forest’s accuracy.
Step-5: Blending
For blending, two base models (a decision tree and a K-Nearest Neighbors classifier) will be used, and a logistic regression model will serve as the final estimator that makes the final predictions.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
The original training data is divided into a new training dataset and a validation dataset, upon which the base models are trained.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=t_size, random_state=r_state)
Two data frames are created:
The first data frame consists of the predictions made on the validation data concatenated with the original validation data.
The second data frame comprises the predictions made on the test data, concatenated with the original test data.
# Decision tree model
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_model_pred_val = dt_model.predict(X_val)
dt_model_pred_test = dt_model.predict(X_test)
dt_model_pred_val = pd.DataFrame(dt_model_pred_val)
dt_model_pred_test = pd.DataFrame(dt_model_pred_test)

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_model_pred_val = knn_model.predict(X_val)
knn_model_pred_test = knn_model.predict(X_test)
knn_model_pred_val = pd.DataFrame(knn_model_pred_val)
knn_model_pred_test = pd.DataFrame(knn_model_pred_test)
The final logistic regression model is constructed using the concatenated validation data and evaluated using the concatenated test data.
x_val = pd.DataFrame(X_val)
x_test = pd.DataFrame(X_test)

# Keep the column order consistent between the validation and test frames
df_val_lr = pd.concat([x_val, dt_model_pred_val, knn_model_pred_val], axis=1)
df_test_lr = pd.concat([x_test, dt_model_pred_test, knn_model_pred_test], axis=1)

# Logistic regression meta-model
lr_model = LogisticRegression()
lr_model.fit(df_val_lr, y_val)
lr_model.score(df_test_lr, y_test)
By utilizing the .score() function, the performance of the blending model is computed, which defaults to the accuracy score and does not require explicit predictions. The resulting performance score is 0.6383, equivalent to 63.83%, demonstrating an improvement of approximately 2% compared to the initial random forest model. Therefore, the blending model showcases superior performance.
Benefits of an ensemble model
Ensemble modeling refers to the technique of combining multiple individual models to make more accurate predictions or classifications. The benefits of the ensemble model include the following:
Improved accuracy: Ensemble models often outperform individual models by combining their predictions. Each model in the ensemble may have different strengths and weaknesses, and by aggregating their predictions, the ensemble model can achieve higher accuracy and make more reliable predictions.
Reduced overfitting: Overfitting occurs when a model learns too much from the training data and performs poorly on new, unseen data. Ensemble models can help mitigate overfitting by combining multiple models that have been trained on different subsets of the data or using different algorithms. The models’ diversity helps reduce the risk of over-relying on specific patterns or noise in the training data.
Increased stability: Ensemble models tend to be more stable than individual models. When individual models exhibit high variance or instability due to the randomness in the training process, combining them into an ensemble can smooth out those fluctuations and provide more consistent predictions.
Handling complex relationships: Ensemble models can effectively capture complex relationships in the data by combining the outputs of different models. Each model may focus on different aspects or patterns in the data, and their combined predictions can provide a more comprehensive understanding of the underlying relationships.
Robustness to outliers and noise: Individual models may be sensitive to outliers or noisy data points, leading to inaccurate predictions. Ensemble models are generally more robust to such anomalies because combining multiple models can help mitigate the impact of outliers or noisy data. Outliers may have a lesser influence on the overall ensemble prediction compared to a single model.
Applications of an ensemble model
Here are some real-life applications of the ensemble model in machine learning:
Disease detection: Ensemble modeling can be applied to disease detection tasks, such as cardiovascular disease detection from medical images like X-rays and CT scans. By combining the predictions of multiple models or classifiers, ensemble techniques can improve the accuracy and reliability of disease diagnosis and prognosis.
Remote sensing: Ensemble modeling can aid in the analysis of remote sensing data. The combination of multiple models can help overcome the challenges posed by varying data resolutions and inconsistencies in data distribution. Tasks like landslide detection and scene classification can benefit from ensemble techniques to improve the accuracy of predictions.
Fraud detection: Ensemble learning has been successfully applied to fraud detection tasks, including credit card fraud and impression fraud detection. Combining multiple models’ outputs, ensemble techniques can enhance the precision and efficiency of detecting fraudulent activities in digital systems.
Speech emotion recognition: Ensemble modeling can be utilized in speech emotion recognition, particularly in multilingual environments. By combining the effects of different classifiers or models, ensemble techniques can improve the accuracy and performance of recognizing and classifying emotions in speech data, accounting for different languages and cultural variations.
Portfolio management: Ensemble models can be applied in financial portfolio management to make investment decisions. By combining the predictions of multiple models, portfolio managers can reduce risks, enhance returns, and make more informed investment choices based on diverse models’ outputs.
Ensemble learning’s ability to combine the strengths of multiple models and classifiers makes it a valuable approach in many other real-life applications, enabling improved accuracy, robustness, and reliability in complex tasks.
Endnote
The ensemble model represents a paradigm shift in the world of AI-enabled decision-making, drawing inspiration from our innate desire to seek diverse perspectives. Combining multiple models’ strengths, ensemble learning has brought enhanced performance and decision-making capabilities to the table. Its ability to unlock superior predictive accuracy and robustness has captivated researchers and businesses alike.
By leveraging the diverse insights of individual models, ensemble techniques overcome limitations, reduce overfitting, and improve generalization. They have found practical applications in areas such as disease detection, remote sensing, fraud detection, and speech-emotion recognition, showcasing their effectiveness in real-life scenarios.
The success of ensemble models can be attributed to their ability to handle complex data, exploit diverse modeling approaches, and overcome the limitations of individual models. Techniques such as bagging, boosting, stacking, and random forests have been widely used to create ensemble models, each with its own strengths and applications. As we venture into the future, ensemble learning is poised to change how we approach complex problems.
Ready to optimize your ML predictive potential? Leverage ensemble models, where multiple algorithms harmonize to enhance accuracy. Contact LeewayHertz for expert guidance.