Model validation in machine learning: Its importance, types of validations and how to perform it
In the rapidly evolving digital landscape, the power of data and machine learning has transformed the way we approach complex problems and make critical decisions. From healthcare and finance to marketing and autonomous vehicles, machine learning models have become indispensable tools for extracting insights from vast datasets and predicting outcomes with remarkable accuracy. Every decision, every strategic move, and every prediction made by businesses today hinges on the strength and reliability of these models. But how can we ensure these models are reliable? The answer lies in a critical, often underrated step in model development: validation. Once a model has been built, the subsequent step is not immediate implementation but validation, making it an indispensable part of the model development process. While it may seem counterintuitive, the time invested in model validation often surpasses the time spent on the model’s development. Why? A model’s efficacy, robustness, and accuracy can only be ascertained through thorough and rigorous validation.
In today’s high-stakes business landscape, an unvalidated or poorly validated model can lead to imprecise predictions, ineffective strategies, and business failures. As businesses lean heavily on data-driven decisions, it’s not an exaggeration to say that a company’s success may very well hinge on the strength of its model validation techniques.
These validation methods test a model’s accuracy and evaluate its adaptability to different scenarios, resilience under stress, and applicability to real-world situations. These aspects collectively ensure that the model is accurate in its predictions, versatile and robust, ready to drive your business success in a fast-paced, dynamic marketplace. Given the crucial role of model validation, investing significant time and resources in this phase of the model development process is not just a recommended best practice; it’s a business imperative. Hence, understanding and implementing various model validation techniques are key to ensuring business success.
This article will delve into different model validation techniques in machine learning, illustrating their significance and underscoring their indispensable role in shaping successful business strategies.
- What is model validation in machine learning?
- Why is model validation important?
- Various machine learning models and their validation necessities
- The cross-validation technique
- Different types of model validation techniques
- How to validate machine learning models using TensorFlow Model Analysis (TFMA)?
- Challenges in ML model validation
What is model validation in machine learning?
Model validation in machine learning represents an indispensable step in the development of AI models. It involves verifying the efficacy of an AI model by assessing its performance against certain predefined standards. This process does not merely involve feeding data to a model, training it, and deploying it. Rather, model validation in machine learning necessitates a rigorous check on the model’s results to ascertain it aligns with our expectations.
Numerous model validation techniques are available, each designed to evaluate and validate a model according to its distinct characteristics and behaviors.
In the realm of machine learning, the quality and quantity of data, as well as the ability to manipulate it effectively, are crucial. More often than not, this involves gathering data, cleaning it, preprocessing it, applying the appropriate algorithm, and finally determining the most suitable model. However, the process doesn’t end there. Validating a model is as critical as training it.
Deploying a model without validation is not a viable approach, especially in sensitive sectors such as healthcare, where the stakes are considerably high. In scenarios where real-world predictions have life-altering implications, the margin for error in a model is virtually non-existent. Therefore, an uncompromising model validation process is crucial to ensuring AI systems’ reliability and accuracy.
Launch your project with LeewayHertz!
Elevate your business with our high-performing AI models tailored to your specific needs. Our robust models undergo rigorous validation techniques to guarantee accuracy and adaptability over time.
Why is model validation important?
Model validation is essential in creating any machine learning or artificial intelligence system, as it confirms that the model functions at an optimal level and can handle unseen data. Without it, there is little confidence in the model’s capacity to interpret new data accurately. Moreover, validation aids in identifying the most suitable model, parameters, and accuracy measures for a particular task.
Through proper model validation, potential issues can be detected and rectified early on. It also facilitates the comparison of different models, enabling us to select the most fitting one for our requirements. Further, it helps establish the model’s accuracy when dealing with new data.
Notably, model validation is carried out impartially, often by a separate team or third party, to guarantee that the model complies with the required regulations and standards. This builds user trust in the model, reassuring them of its dependability.
Here are some of the consequences of improper model validation:
Inadequate model performance
If model validation isn’t done correctly, the model might not perform optimally when exposed to unseen data, which is, after all, the ultimate goal of any predictive model. Various validation techniques exist, with the most critical ones being in-time and out-of-time validations.
In-time validation involves setting aside a portion of the development dataset. The model is then tested against this data to assess how it performs on unseen data from the same time frame used for its creation.
In contrast, out-of-time validation entails obtaining a dataset from a different time period and testing the model’s response to this new, unseen data.
Both validation methods ensure developers’ confidence in their model’s performance. Without them, the model might not be able to deliver accurate results consistently, leading to potential problems down the line.
Questionable robustness
The effectiveness of a machine learning model hinges on its proper validation. When a model undergoes thorough validation, developers gain confidence in its performance and capability to act robustly in future scenarios. It ensures the model’s robustness and reliability, which leads to trustworthy outcomes.
During the validation process, sensitivity analysis tests are performed. These tests involve modifying the independent model variables to a certain degree to account for potential fluctuations, such as economic changes. The goal is to ensure that these changes don’t cause a severe impact on the dependent variable that could render the model ineffective.
Without this critical step, the model’s results could become dubious. In other words, the model might not handle future variables well and produce questionable or even incorrect results, which could undermine the model’s value in real-world applications.
Inability to handle stress scenarios
Machine learning models need to be resilient and adaptable, especially in extreme scenarios such as economic recessions or global health crises like a pandemic. If a model isn’t validated properly, it may struggle to adapt and deliver reliable predictions amidst such volatility.
This is where stress testing measures during model validation play a crucial role. By incorporating these tests, the model’s performance under strenuous conditions is evaluated, and its ability to withstand unexpected shocks is strengthened.
If the validation process encompasses these stress-testing measures, it aids in deploying a model version that has already been rigorously tested for turbulent scenarios. Consequently, the model is less likely to fail when encountering real-world adversities, ensuring it continues delivering valuable insights during crucial times.
Untrustworthy model outputs
Without proper model validation, particularly ‘out-of-time’ validation tests, there’s a risk that a machine learning model may be overfitted. An overfitted model performs exceptionally well on the data it was trained on, the development sample, but fails to generalize and perform well on unseen data. This discrepancy can lead to unreliable and untrustworthy outputs when the model is applied in real-world scenarios. To circumvent this issue, it’s crucial to carry out thorough validation of your machine learning models so that their predictions remain reliable and accurate when presented with new data.
Various machine learning models and their validation necessities
Supervised learning models
Supervised learning models hold a crucial place in machine learning, being extensively used for predictive analysis based on data processing.
This category encompasses several model types, such as linear regression, logistic regression, support vector machines, decision trees and random forests, and artificial neural networks, each having unique validation needs.
Models like linear and logistic regression necessitate scrutiny for potential overfitting and underfitting situations. Support vector machines, decision trees and random forests all demand a split approach where the data is divided into training and test sets. The model learns from the training set, while the test set aids in assessing its performance.
Incorporating a separate validation set is essential when dealing with artificial neural networks. This validation set allows for evaluating and comparing the performance of different models, ensuring the most accurate one is selected.
Unsupervised learning models
Models falling under the unsupervised learning umbrella are distinguished by their ability to discern patterns in data devoid of external labels. These encompass varied models like clustering, anomaly detection, neural networks, and self-organizing maps.
To validate these models, different criteria are used based on the specific model type. For example, clustering models necessitate performance assessment measures like the silhouette coefficient or Davies-Bouldin Index.
Anomaly detection models, on the other hand, typically rely on precision-recall curves and ROC curves to gauge performance. Neural networks can be examined using hold-out validation and k-fold cross-validation methods. Lastly, self-organizing maps need to be checked using metrics like topographic or quantization errors.
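As a rough illustration of these clustering criteria (a minimal sketch that assumes scikit-learn and a purely synthetic dataset), the silhouette coefficient and Davies-Bouldin Index can be computed directly from the fitted cluster labels:

# Minimal sketch of internal cluster validation; scikit-learn and synthetic
# data are assumptions made here for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Silhouette: higher is better (up to 1.0); Davies-Bouldin: lower is better.
print("Silhouette coefficient:", silhouette_score(X, kmeans.labels_))
print("Davies-Bouldin index  :", davies_bouldin_score(X, kmeans.labels_))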
Hybrid models
Hybrid models in machine learning bring together multiple techniques to optimize predictive performance. Validating these models is crucial, as the fusion of various models can enhance both accuracy and efficiency.
Confirming the reliability and consistency of hybrid models is another integral part of validation. During this phase, the model is tested against data it hasn’t encountered before to assess its accuracy and performance metrics.
Validation is key to gaining insights into the machine learning model’s capabilities and confirming that the hybrid models neither overfit nor underfit the data.
Further, validation serves as a tool to detect any existing biases and data leakage in the model and to identify modifications necessary for model improvement.
Deep learning models
Deep learning models, a robust variant of artificial intelligence, can be harnessed for diverse tasks, spanning image recognition and natural language processing to powering autonomous vehicles.
To ensure their optimal functionality, these models require rigorous validation. This step confirms the model’s capacity to identify objects, categorize data, or forecast outcomes precisely.
Commonly used deep learning models, such as Convolutional Neural Networks (CNNs), deployed for image classification, need to undergo validation against known object datasets to confirm their identification accuracy.
Similarly, Recurrent Neural Networks (RNNs), employed for natural language processing, require testing against a text corpus to ascertain their ability to dissect text accurately and yield correct results.
Lastly, reinforcement learning models designed for autonomous vehicles need to be trialed against driving simulators to ensure their competence in accurately processing and reacting to their environment.
Random Forest models
The Random Forest model, an ensemble machine learning method, amalgamates numerous decision trees to form a highly precise and stable model. Its role in model validation is pivotal, given its capacity to curtail the overfitting risk, thus delivering a more accurate model performance prediction.
Randomly selecting samples (with replacement) from the training dataset creates multiple decision trees, each offering a prediction. The final prediction aggregates the individual tree predictions, typically an average for regression or a majority vote for classification, offering a more precise outcome than any individual tree.
The ability of this approach to promote better generalization of the model comes in handy during model validation. It heightens the probability of the model delivering an accurate result when deployed on fresh data.
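A brief sketch of this idea, assuming scikit-learn and synthetic data: the out-of-bag (OOB) score uses the samples left out of each bootstrap draw as a built-in estimate of how well the forest generalizes.

# Illustrative only; scikit-learn and a synthetic dataset are assumed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# Rows not drawn into a tree's bootstrap sample act as that tree's mini test set.
print("Out-of-bag accuracy estimate:", forest.oob_score_)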
Support Vector Machines (SVMs)
A Support Vector Machine (SVM) is a widely employed machine learning model, frequently featured in validation exercises owing to its capability to maximize the margin between data points of different classes.
It can identify the best hyperplane that distinguishes data points from varying classes, facilitating the accurate and dependable classification of these points.
Moreover, an SVM’s utility extends to identifying outliers, uncovering non-linear correlations in data, and dealing with regression and classification issues. This versatility enhances its popularity and makes it a go-to choice for model validation.
Neural network models
Artificial neural network models are a subset of machine learning models that emulate the workings of a human brain. They can independently learn and make decisions without being limited by pre-set parameters or prior knowledge. Certain validation requisites must be met for these models to perform effectively and provide accurate results.
Primarily, these models need substantial training data to make accurate decisions and build links between the diverse inputs and outputs. The training data should reflect the data the model will encounter in production, as inconsistencies between the training and production data could lead to inaccurate results.
Additionally, the data should be normalized to ensure all variables are on a consistent scale, directly influencing the model’s performance.
Moreover, the model should be subjected to various parameters and data types to verify its capability of handling diverse inputs and outputs.
Lastly, the model’s performance should be measured against various metrics to confirm that it delivers the expected level of accuracy. These metrics can encompass accuracy scores, precision, recall, and F1 scores. Evaluating the model against various metrics helps identify if the model’s performance aligns with expectations and if any modifications are necessary to enhance its performance.
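For instance (a minimal sketch assuming scikit-learn, with hypothetical label arrays y_true and y_pred), these metrics can be computed side by side:

# Hypothetical predictions scored against several metrics; scikit-learn assumed.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels (illustrative)
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]   # model outputs (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))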
k-nearest neighbors models
The k-Nearest Neighbors (k-NN) model is a widely used supervised learning algorithm primarily used for tackling classification and regression tasks. Its simplicity and ease of implementation make it a favored choice for model validation.
k-NN operates by locating the k nearest neighbors, that is, the data points most similar to a given input, and then classifying the input based on the most common label among these neighbors. Because the model simply stores the training data rather than fitting parameters, it can make predictions without an explicit training phase.
Furthermore, k-NN is relatively simpler when compared to other machine learning models, hence making it an optimal selection for validation processes.
Being a non-parametric model, k-NN makes no assumptions about the underlying distribution of the data. This distinctive attribute further bolsters k-NN’s suitability for validation exercises, as its behavior on unseen data is straightforward to inspect and reason about.
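A toy sketch of the mechanism (scikit-learn assumed; the features and labels below are invented purely for illustration):

# k-NN classifies a new point by majority vote among its k closest neighbors;
# "fit" only stores the data. Dataset and labels here are purely illustrative.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["low", "low", "low", "high", "high", "high"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The 3 nearest neighbors of [7.5, 8.5] all carry the label "high", so it wins.
print(knn.predict([[7.5, 8.5]]))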
Bayesian models
Bayesian models are a category of probabilistic models that employ Bayes’ theorem to calculate the likelihood of a given hypothesis based on certain data. They typically rely on prior knowledge and often hinge on the data scientist’s preliminary assumptions. Bayesian models are employed to predict and gauge the likely distributions of unknown variables.
There are three main types of Bayesian models: Bayesian parameter estimation models, Bayesian network models, and Bayesian non-parametric models.
Bayesian parameter estimation models are used when there’s a need to estimate a probabilistic model’s uncertain or unknown parameters. These models are utilized to deduce the posterior distribution of a parameter set within a probabilistic model based on the observed data.
Bayesian network models are probabilistic graphical models that illustrate the relationships between various variables. They predict the value of a variable based on the values of other variables in the system.
Bayesian non-parametric models, on the other hand, are probabilistic models that make no assumptions about the data’s underlying distribution. They are primarily used to estimate the likelihood of a hypothesis without having to define the parameters of the distribution.
Bayesian models are beneficial for modeling intricate systems and predicting a system’s behavior using observed data. They have found extensive applications in machine learning, AI, medical research, and more.
Clustering models
Clustering models need to be validated to confirm that the produced clusters are significant and that the model is dependable.
Several prerequisites must be satisfied when utilizing this methodology:
- Evaluation of the produced clusters’ quality is crucial.
- Comparing the clusters generated by different algorithms provides important insights.
- It is important to assess the stability of clusters over multiple iterations.
- The scalability of the clustering model needs to be tested to ensure it can handle large datasets.
- Reviewing the results of the clustering model is essential to ensure they are meaningful, reliable, and reflect the characteristics of the original data.
The cross-validation technique
Let’s imagine you have created a machine learning model and trained it on a specific dataset. The model’s accuracy on this training data is about 95%. Does this mean your model is superbly trained and the best model due to its high accuracy? Not necessarily! Your model is intimately familiar with the training data, and it may even have captured minor fluctuations, effectively memorizing this specific data rather than learning patterns that generalize. When faced with entirely new, unseen data, it might not predict with the same accuracy. This problem is known as overfitting.
On the other hand, there could be instances where the model fails to train effectively on the training set as it’s unable to identify patterns. Consequently, the model might not perform well on the test set either. This problem is known as underfitting.
To overcome these issues of overfitting and underfitting, we use a technique known as cross-validation.
Cross-validation is a resampling method that repeatedly segments the dataset into two portions – one for training and the other for testing. The training data is used to train the model, while the unseen test data assesses its predictive capability. If the model performs well on the test data with a high level of accuracy, it suggests that the model has not overfit the training data and can be used for future predictions.
Cross-validation is a statistical process utilized to determine machine learning models’ performance (or accuracy). It aids in preventing overfitting in a predictive model, particularly in situations where data may be limited. The data is divided into a fixed number of partitions or folds during cross-validation. The model analysis is run on each fold, and the overall error estimate is averaged.
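In code, the whole procedure can be very compact. The sketch below assumes scikit-learn and its bundled iris dataset purely for illustration:

# Minimal cross-validation sketch: 5 folds, 5 fits, averaged accuracy.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # one score per fold
print("Per-fold accuracy:", scores)
print("Average accuracy :", scores.mean())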
When embarking on a machine learning task, you must first correctly define the problem to select the most fitting algorithm to yield the best results. But how do we compare different models? For instance, you have trained a model with available data, and now you want to assess its performance. One method could be to test the model on the same dataset used for training. However, this is not always the best practice.
Testing the model on the training dataset could lead us to presume that the training data represents all possible real-world scenarios, which is rarely the case. Our main goal is to ensure the model performs well on real-world data. Although the training dataset is derived from real-world data, it only represents a small subset of all possible data points (examples) out there.
Therefore, to truly gauge the model’s capabilities, it should be tested on data it has never seen before, often referred to as the test dataset. However, by splitting our data into training and testing data, aren’t we potentially missing out on some important information that the test dataset may hold? Let’s explore the different types of cross-validation to find the answers to this question.
Different types of model validation techniques
We will illustrate the following strategies for validation:
- Splitting data into training and testing sets
- Cross-validation using k-folds
- Leave-one-out cross-validation method
- Leave-one-group-out cross-validation
- Nested cross-validation technique
- Cross-validation for time-series data
- Stratified k-fold cross-validation
Splitting data into training and testing sets
The crux of all validation methods is the division of your data during model training. This step is crucial to simulate how your model would react to data it has never encountered before.
- Training and testing data division: The most fundamental approach involves dividing your data into training and testing sets. Generally, you randomly split your data into two parts, approximately 70% for training the model and 30% for testing its performance. The advantage of this method lies in its ability to evaluate the model’s response to new, unseen data. Nevertheless, this approach could pose an issue if a specific subset of our data only contains individuals with a certain age or income bracket, leading to a situation commonly referred to as sampling bias. We could employ methods like k-fold cross-validation to mitigate this bias, which we will delve into later. However, before that, let’s examine the concept of a ‘holdout set.’
- Holdout set: When adjusting the hyperparameters of your model, there’s a possibility of overfitting if you optimize using just the train/test split. The reason is that the model attempts to find hyperparameters that fit the specific train/test split you have made. To address this problem, an additional holdout set can be created.
The holdout method is one of the most straightforward and commonly implemented evaluation techniques in machine learning projects. This approach splits the entire dataset into two subsets – a training set and a test set. The division ratio can vary based on the use case – common splits include 70-30, 60-40, 75-25, 80-20, or even 50-50, with the training data usually forming the larger proportion. The split is done randomly, and unless a specific ‘random_state’ is specified, we have no control over which data points fall into the training or testing buckets. This could result in considerable variance, as each change in the split can lead to variations in accuracy.
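As a quick sketch (assuming scikit-learn and its iris dataset for illustration), a 70-30 holdout split with a fixed random_state looks like this:

# Holdout evaluation: fit on 70% of the data, score on the held-out 30%.
# Fixing random_state keeps the split reproducible between runs.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))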
However, this method has certain limitations:
- The test error rate can be highly variable, depending heavily on which observations land in the training and testing sets. This leads to high variance.
- The model is only trained on a portion of the data, which might not be ideal when the available data is not abundant. This can lead to overestimating the test error, indicating a high bias.
On the other hand, one of the primary advantages of the holdout method is that it is computationally less demanding than other cross-validation techniques, making it a cost-effective option.
If you are only using a train/test split, comparing the distributions of your training and testing sets is advisable. If the distributions significantly differ, you might encounter generalization problems. You can use tools like Facets to compare their distributions easily.
Cross-validation using k-folds (k-fold CV)
Cross-validation using k-folds is an effective technique used to mitigate the issue of sampling bias in machine learning. Instead of making a single split, it involves creating numerous splits and then conducting validation across all these different splits.
The k-fold cross-validation method divides the data into ‘k’ subsets, or folds. The model is then trained on k-1 of these folds and tested on the remaining fold that was set aside. This process repeats until every fold has acted as the test set exactly once, and the results of the k iterations are then averaged.
One of the key advantages of this approach is that every data point is used both for training and validation, and each one is used for validation precisely once. A common practice is to choose k=5 or k=10, as these values provide a good balance between computational efficiency and validation accuracy.
The scores obtained from each fold in the cross-validation process can offer more insights than just the average performance. Examining these scores’ variance or standard deviation can provide information about the model’s stability across varying data inputs.
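A short sketch of this (scikit-learn and its breast-cancer dataset assumed for illustration), reporting both the mean score and its spread across folds:

# Explicit k-fold setup via KFold; the standard deviation across folds is a
# rough stability check in addition to the mean score.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)

print("Mean accuracy     :", scores.mean())
print("Standard deviation:", scores.std())   # a large spread signals instability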
With k-fold cross-validation, k is smaller than the number of observations in the dataset. The method averages the outcomes of k fitted models, and because the training sets behind those models overlap less than in the leave-one-out approach, the models are somewhat less correlated with one another. This results in reduced variance.
The k-fold method balances bias, as each training set contains fewer observations than the leave-one-out method but more than the holdout method. This leads to an intermediate level of bias.
Typically, k-fold cross validation is performed with k set to either 5 or 10, as these values have been proven, through practical evidence, to provide test error estimates that are not overly impacted by high bias or variance.
The primary drawback of this method is its computational expense. The model must be run from scratch k times, which requires more computational resources than the holdout method, but it’s still more efficient than the leave-one-out method.
Leave-one-out Cross-validation method
Leave-one-out Cross-validation (LOOCV) is a nuanced variant of the k-fold cross-validation method. In LOOCV, each data point in the dataset serves as a separate test set, while all the remaining data points form the training set. This method is identical to k-fold cross-validation, where k equals the total number of observations.
Though LOOCV is highly thorough, it can be computationally demanding since the model must be trained for every data point in the dataset. This method is only recommended if the dataset is small or the computational resources can accommodate such intensive calculations.
The process involves selecting one observation as the test data and treating all remaining observations as training data. The model is then trained, and this procedure is repeated for every observation in the dataset. The test error is estimated by averaging the errors across all iterations.
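A minimal sketch (scikit-learn and the iris dataset assumed) shows why LOOCV is expensive: the model is fitted once per observation.

# Leave-one-out: each of the 150 iris rows serves as a single-row test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())   # 150 separate fits
print("LOOCV accuracy estimate:", scores.mean())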
When it comes to estimating test errors, LOOCV is known to offer unbiased estimates due to its comprehensive approach to testing on each data point. However, bias isn’t the only factor to consider in estimation problems – variance must also be accounted for. LOOCV tends to have high variance as it averages the outputs of multiple models, each trained on nearly identical data. As such, the outputs are highly correlated with each other.
LOOCV is a computationally expensive method since the model must be run as often as there are data points. This issue is addressed in other methods which aim to strike a better balance between bias and variance.
Leave-one-group-out Cross-validation (LOGOCV)
Leave-one-group-out Cross-validation (LOGOCV) is a specialized form of k-fold cross-validation designed to handle data grouped in distinct categories. Suppose you have a dataset containing information about multiple companies and their clients, and your goal is to predict the companies’ success. In a standard k-fold CV, each fold may contain data from multiple companies. However, with LOGOCV, you construct each fold to include data from a single company. This way, you maintain the integrity of company-specific data within each fold.
The process is akin to a hybrid of k-fold CV and LOOCV, but with a twist: instead of leaving out a single data point (as in LOOCV) or random subsets of data (as in k-fold CV), you exclude all data pertaining to one company or ‘group.’ This ensures each validation step tests the model’s performance on unseen data from a completely different company.
This method is particularly advantageous when the aim is to understand how well the model generalizes across different groups or categories in the data, as it allows for an unbiased assessment of the model’s predictive capability on entirely new groups.
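The sketch below (scikit-learn assumed, with an invented ‘company’ grouping and random synthetic data) shows how each fold holds out every row belonging to one group:

# Leave-one-group-out: the model is always scored on a company it never saw.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))                          # synthetic client features
y = rng.integers(0, 2, size=90)                       # synthetic success labels
groups = np.repeat(["company_a", "company_b", "company_c"], 30)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print("Per-company scores:", scores)                  # one score per held-out company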
Nested cross-validation technique
Nested cross-validation is a robust method used in scenarios where we need to optimize hyperparameters and estimate model performance simultaneously yet want to avoid overfitting. This approach ingeniously integrates two loops of k-fold cross-validation – an inner loop and an outer loop.
The inner loop primarily deals with hyperparameter optimization. It helps determine the best hyperparameters that would tune the model for optimal performance. However, estimating the model’s accuracy on the same data used for selecting hyperparameters can lead to overfitting.
This is where the outer loop comes into play. The role of the outer loop is to validate the model’s performance. It provides an unbiased estimate of model accuracy using data that the model has not seen during the hyperparameter tuning phase in the inner loop.
This arrangement ensures that the model’s performance is assessed on unseen data, mitigating the risk of overfitting. The nested structure of these two loops, which gives it the name ‘nested cross-validation,’ provides a more robust and fair assessment of the model’s predictive power and generalization ability.
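As a compact sketch (scikit-learn and its breast-cancer dataset assumed), the inner loop can be a GridSearchCV object and the outer loop a plain cross_val_score wrapped around it:

# Nested cross-validation: the inner loop tunes hyperparameters, the outer loop
# scores the tuned model on folds the tuning never touched.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner_search = GridSearchCV(SVC(), param_grid, cv=3)        # inner loop: tuning
outer_scores = cross_val_score(inner_search, X, y, cv=5)    # outer loop: evaluation

print("Nested CV accuracy:", outer_scores.mean())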
Cross-validation for time-series data
Validating time-series data demands a distinctive approach, unlike other forms of data. When employing cross-validation on time-series data, preserving the temporal order of the observations is essential. This means the usual random partitioning strategies like k-fold CV can produce misleadingly optimistic results, as they may inadvertently leak future information into the training set.
Properly handling cross-validation for time-series data involves structuring your folds so that all the training data occurs chronologically before your test data. This technique, often called time-series cross-validation, ensures that the model is trained on a past ‘slice’ of data at each fold and validated on a subsequent, future ‘slice.’
This preserves the temporal structure and dependency inherent in time-series data, offering a more reliable and realistic estimate of the model’s predictive performance on unseen, future data. It effectively curbs the risk of information leakage from the future, keeping the model evaluation process in line with how time-series forecasting models are intended to work in real-world applications.
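A small sketch of this idea, assuming scikit-learn’s TimeSeriesSplit and a toy sequence of twelve time-ordered observations:

# Every training window ends strictly before its test window begins, so no
# future information leaks into model fitting.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)          # 12 observations in chronological order
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train on {train_idx.tolist()}, test on {test_idx.tolist()}")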
Stratified k-fold cross-validation
Stratified k-fold cross-validation offers a slight modification to the typical k-fold cross-validation method, utilizing the concept of ‘stratified sampling’ as an alternative to ‘random sampling.’
Before we delve into the specifics of stratified sampling, let’s first distinguish it from random sampling. Imagine a dataset consisting of reviews for a skincare product used by both genders. If we rely on random sampling to split the data into training and testing subsets, one gender might end up under-represented in the training data and over-represented in the testing data. If the model is trained on data that doesn’t truly represent the full diversity of the actual population, it’s likely to deliver less accurate predictions when tested.
To circumvent this issue, we employ stratified sampling. In this method, the data is partitioned to maintain the proportion of each class from the overall population.
For instance, consider a dataset containing 1000 customer reviews for a product, with a gender ratio of 60% females and 40% males. If we wish to divide the data into training and testing subsets following an 80:20 ratio, stratified sampling ensures that gender proportionality is preserved in each subset. Therefore, the 800-sample training set would include 480 reviews from females and 320 from males. Similarly, the 200-sample testing set would maintain the same gender distribution.
This is the essence of Stratified k-fold cross-validation – creating k-folds that preserve the original class proportionality. Consequently, it addresses the imbalance issue potentially introduced by the Holdout and k-fold methods, enhancing the model’s ability to generalize across the entire dataset.
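The sketch below (scikit-learn assumed, with a synthetic 60/40 label vector echoing the example above) shows that every fold keeps roughly the original class proportions:

# Stratified k-fold preserves the 60/40 class ratio inside each fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((1000, 1))                               # placeholder features
y = np.array([1] * 600 + [0] * 400)                   # 60% one class, 40% the other

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    share = y[test_idx].mean()
    print(f"Fold {fold}: share of the majority class in the test fold = {share:.2f}")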
How to validate machine learning models using TensorFlow Model Analysis (TFMA)?
TensorFlow Model Analysis (TFMA) is a robust tool developed by Google, intended to present an in-depth evaluation of the performance of machine learning models. By leveraging Apache Beam’s capabilities, TFMA conducts computations in a distributed manner over large-scale datasets. The unique strength of TFMA lies in its capacity to provide deep insights into the variations in model performance across different data segments. It is capable of calculating metrics that were utilized during the training phase, known as built-in metrics, as well as metrics determined after the model was saved as part of the TFMA configuration settings.
In this example, we focus on a comprehensive evaluation and analysis of a machine learning model that has been previously trained. The complete code of the model is available in this GitHub location, leveraging the Taxi Trips dataset released by the city of Chicago.
Pre-requisites
- Basic proficiency in Apache Beam.
- An elementary comprehension of how machine learning models function.
- A fresh Google Colab notebook in Google Drive to execute the Python code.
Installation of TensorFlow Model Analysis (TFMA)
Once the Google Colab notebook is set up, the initial step involves importing all the necessary dependencies. This operation may require some time to complete.
Rename the file from Untitled.ipynb to TFMA.ipynb.
Next, execute the code below to install the dependencies:
!pip install -U pip
!pip install tensorflow-model-analysis
!pip install apache-beam
The first command upgrades pip, the package management system responsible for installing and managing Python software packages (“pip” stands for “preferred installer program”). The second command installs TensorFlow Model Analysis (TFMA), and the third installs Apache Beam.
Following the completion of this process, it is crucial to reset the runtime environment before proceeding to execute the forthcoming cells.
import sys
assert sys.version_info.major == 3

import tensorflow as tf
import apache_beam as beam
import tensorflow_model_analysis as tfma
This segment of code brings in the necessary libraries such as ‘sys’, ‘tensorflow’, ‘apache_beam’, and ‘tensorflow_model_analysis’. The command ‘assert sys.version_info.major==3’ is used to confirm that Python 3 is the version being employed to execute the notebook.
Loading the data set
You need to download the data set stored as a tar file and extract it. Use the following code for the same:
import io, os, tempfile

TAR_NAME = 'saved_models-2.2'
BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, TAR_NAME, 'data')
MODELS_DIR = os.path.join(BASE_DIR, TAR_NAME, 'models')
SCHEMA = os.path.join(BASE_DIR, TAR_NAME, 'schema.pbtxt')
OUTPUT_DIR = os.path.join(BASE_DIR, 'output')

!curl -O https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/{TAR_NAME}.tar
!tar xf {TAR_NAME}.tar
!mv {TAR_NAME} {BASE_DIR}
!rm {TAR_NAME}.tar
The downloaded dataset comes in the form of a tar file. This tar file contains a variety of resources, including the training and evaluation datasets, the data schema, and the saved models for training, serving, and evaluation.
Parsing the schema
The downloaded schema has to be parsed to be effectively utilized with TFMA.
import tensorflow as tf from google.protobuf import text_format from tensorflow.python.lib.io import file_io from tensorflow_metadata.proto.v0 import schema_pb2 from tensorflow.core.example import example_pb2 schema = schema_pb2.Schema() contents = file_io.read_file_to_string(SCHEMA) schema = text_format.Parse(contents, schema)
The schema is parsed using the text_format function from the google.protobuf library, which converts the text-format schema into its protobuf message representation, together with TensorFlow’s schema_pb2.
Using the schema to create TFRecords
The subsequent step involves providing TFMA with access to the dataset. To accomplish this, a TFRecords file must be established. Utilizing the existing schema to do this is advantageous because it ensures each feature is assigned the accurate type.
import csv

datafile = os.path.join(DATA_DIR, 'eval', 'data.csv')
reader = csv.DictReader(open(datafile, 'r'))
examples = []
for line in reader:
  example = example_pb2.Example()
  for feature in schema.feature:
    key = feature.name
    if feature.type == schema_pb2.FLOAT:
      example.features.feature[key].float_list.value[:] = (
          [float(line[key])] if len(line[key]) > 0 else [])
    elif feature.type == schema_pb2.INT:
      example.features.feature[key].int64_list.value[:] = (
          [int(line[key])] if len(line[key]) > 0 else [])
    elif feature.type == schema_pb2.BYTES:
      example.features.feature[key].bytes_list.value[:] = (
          [line[key].encode('utf8')] if len(line[key]) > 0 else [])
  # Add a new column 'big_tipper' that indicates if the tip was > 20% of the fare.
  # TODO(b/157064428): Remove after label transformation is supported for Keras.
  big_tipper = float(line['tips']) > float(line['fare']) * 0.2
  example.features.feature['big_tipper'].float_list.value[:] = [big_tipper]
  examples.append(example)

tfrecord_file = os.path.join(BASE_DIR, 'train_data.rio')
with tf.io.TFRecordWriter(tfrecord_file) as writer:
  for example in examples:
    writer.write(example.SerializeToString())

!ls {tfrecord_file}
It is important to highlight that TFMA supports various model types, such as TF Keras models, models built on generic TF2 signature APIs, and TF estimator-based models.
Here, we will explore how to configure a Keras-based model.
In configuring the Keras model, the metrics and plots will be manually integrated as part of the setup process.
Set up and run TFMA using Keras
Import TFMA using the code below:
import tensorflow_model_analysis as tfma
You will now use the imported tensorflow_model_analysis module (tfma) to set up the evaluation configuration.
# Set up the tfma.EvalConfig settings.
keras_eval_config = text_format.Parse("""
  ## Model information
  model_specs {
    # For keras (and serving models) we need to add a `label_key`.
    label_key: "big_tipper"
  }

  ## Post-training metric information. These will be merged with any built-in
  ## metrics from training.
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "BinaryCrossentropy" }
    metrics { class_name: "AUC" }
    metrics { class_name: "AUCPrecisionRecall" }
    metrics { class_name: "Precision" }
    metrics { class_name: "Recall" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics { class_name: "CalibrationPlot" }
    metrics { class_name: "ConfusionMatrixPlot" }
    # ... add additional metrics and plots ...
  }

  ## Slicing information
  slicing_specs {}  # overall slice
  slicing_specs {
    feature_keys: ["trip_start_hour"]
  }
  slicing_specs {
    feature_keys: ["trip_start_day"]
  }
  slicing_specs {
    feature_values: {
      key: "trip_start_month"
      value: "1"
    }
  }
  slicing_specs {
    feature_keys: ["trip_start_hour", "trip_start_day"]
  }
""", tfma.EvalConfig())
It’s crucial to establish a TensorFlow Model Analysis (TFMA) EvalSharedModel that references the Keras model.
keras_model_path = os.path.join(MODELS_DIR, 'keras', '2')
keras_eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=keras_model_path,
    eval_config=keras_eval_config)

keras_output_path = os.path.join(OUTPUT_DIR, 'keras')
Finally, run TFMA:
keras_eval_result = tfma.run_model_analysis(
    eval_shared_model=keras_eval_shared_model,
    eval_config=keras_eval_config,
    data_location=tfrecord_file,
    output_path=keras_output_path)
Upon completing the evaluation, the next step is to examine the output through the lens of TensorFlow Model Analysis (TFMA) visualizations.
To showcase metrics, the function tfma.view.render_slicing_metrics can be employed. By default, this function presents an “Overall” slice. If the aim is to view a specific slice, one can input the column name (by setting slicing_column) or supply a tfma.SlicingSpec.
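For example, using the evaluation result produced above:

# Render the overall slice, then metrics sliced by a specific feature column.
tfma.view.render_slicing_metrics(keras_eval_result)
tfma.view.render_slicing_metrics(keras_eval_result, slicing_column='trip_start_hour')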
Track your model’s performance
Model training will be executed utilizing the training dataset, which should ideally echo the characteristics of the test dataset and data destined for production deployment.
However, it’s worth noting that even though the inference request data may mirror the training data initially, over time, it’s likely to deviate. This can influence the model’s performance.
Therefore, it is crucial to continuously monitor and gauge the model’s performance to stay alert to and mitigate any potential changes.
This is where TensorFlow Model Analysis (TFMA) steps in as an invaluable tool.
output_paths = []
for i in range(3):
  # Create a tfma.EvalSharedModel that points to our saved model.
  eval_shared_model = tfma.default_eval_shared_model(
      eval_saved_model_path=os.path.join(MODELS_DIR, 'keras', str(i)),
      eval_config=keras_eval_config)
  output_path = os.path.join(OUTPUT_DIR, 'time_series', str(i))
  output_paths.append(output_path)
  # Run TFMA
  tfma.run_model_analysis(eval_shared_model=eval_shared_model,
                          eval_config=keras_eval_config,
                          data_location=tfrecord_file,
                          output_path=output_path)

eval_results_from_disk = tfma.load_eval_results(output_paths[:2])
tfma.view.render_time_series(eval_results_from_disk)
With TensorFlow Model Analysis (TFMA), it’s possible to assess and verify machine learning models across various data segments, which makes it easier to understand how models perform under different data conditions and scenarios.
Challenges in ML model validation
- Data density: With the advancement of new ML and AI techniques, there is a growing need for large, diverse, and detailed datasets to build effective models. This data can also often be less structured, which makes validation challenging. Thus, developing innovative tools becomes essential for data integrity and suitability checks.
- Theoretical clarity: A major challenge in the ML/AI domain lies in understanding its methodologies. They are not as comprehensively grasped by practitioners as more traditional techniques. This lack of understanding extends to the specifics of a certain method and how well-suited a procedure is for a given modeling scenario. This makes it harder for model developers to substantiate the appropriateness of the theoretical framework in a specific modeling or business context.
- Model documentation and coding: An integral part of model validation is evaluating the extent and completeness of the model documentation. The ideal documentation would be exhaustive and standalone, enabling third-party reviewers to recreate the model without accessing the model code. This level of detail is challenging even with standard modeling techniques and even more so in the context of ML/AI, making model validation a demanding task.
- Evaluation of results and model testing: With ML/AI, financial institutions may need to reconsider standard backtesting methods, like measures of model fit and other analytics such as sensitivity analysis or stability analysis. More computationally taxing techniques, like k-fold cross-validation, might be necessary to test the accuracy and robustness of ML/AI models. Unlike traditional methods like ordinary regression, sensitivity analysis may vary significantly depending on the model type due to the less direct correlation between inputs and outputs.
- Third-party models: In the world of ML and AI, it’s often more common for companies to use models created by external vendors. Regulatory standards require these externally sourced models to undergo the same rigorous checks as those built in-house, and this could be even more important with ML/AI models. Traditional methods of checking model performance can be more difficult in the context of ML/AI. Therefore, companies may rely more on less formal methods of validation like regular monitoring of models and ensuring they are based on solid theory. These checks would involve a thorough review of documents that detail how the model was customized, how it was developed, and how suitable it is for the company’s needs.
Endnote
Model validation is a critical step in machine learning, determining the accuracy and reliability of various models, including supervised, unsupervised, deep learning models, and many others. The importance of model validation lies in its ability to ensure that models perform adequately, are robust, and can handle stress scenarios.
Techniques such as cross-validation, splitting data into training and testing sets, and tools like TensorFlow Model Analysis (TFMA) help in this process. Despite the challenges, model validation is essential. Without it, the outputs of the models may not be reliable, which can negatively affect business decisions.
Model validation safeguards the integrity of data-driven decision-making and is pivotal to the success of businesses in a data-centric world. It’s not just an option; it’s a necessity.
Want reliable, high-performing machine learning models? Collaborate with LeewayHertz’s team of ML experts who specialize in developing robust ML models capable of withstanding stress scenarios and delivering consistent results. Contact us today to elevate your machine learning initiatives!