Parameter-efficient Fine-tuning (PEFT): Overview, benefits, techniques and model training
Transfer learning plays a crucial role in the development of large language models such as GPT-3 and BERT. It is an ML technique in which a model trained on a certain task is used as a starting point for a distinct but similar task. The idea behind transfer learning is that the knowledge gained by a model from solving one problem can be leveraged to help solve another problem.
One of the earliest examples of transfer learning was using pre-trained word embeddings, such as Word2Vec, to improve the performance of NLP-based models. More recently, with the emergence of large pre-trained language models such as BERT and GPT-3, the scope of transfer learning has extended remarkably. Fine-tuning is one of the most popular methods used in transfer learning. It involves adapting a pre-trained model to a particular task by training it on a smaller set of task-specific labeled data.
However, with the parameter counts of large language models reaching hundreds of billions, and in some cases trillions, fine-tuning the entire model has become computationally expensive and often impractical. In response, the focus has shifted towards in-context learning, where the model is given a prompt describing the task, often with a few examples, and produces predictions without any update to its weights. However, in-context learning has inefficiencies of its own: the prompt must be processed every time the model makes a prediction, and its performance is sometimes poor, which makes it a less favorable choice. This is where Parameter-efficient Fine-tuning (PEFT) comes in as an alternative paradigm to prompting. PEFT aims to fine-tune only a small subset of the model’s parameters, achieving comparable performance to full fine-tuning while significantly reducing computational requirements. This article will discuss the PEFT method in detail, exploring its benefits and how it has become an efficient way to fine-tune LLMs on downstream tasks.
- A glossary of important terms
- What is PEFT?
- What is the difference between fine-tuning and parameter-efficient fine-tuning?
- Benefits of PEFT
- Use cases of parameter-efficient fine-tuning
- PEFT: A better alternative to standard fine-tuning
- Parameter-efficient fine-tuning techniques
- Training your model using PEFT
- Few-shot in-context learning vs. parameter-efficient fine-tuning
- Is PEFT more efficient than ICL?
- Pitfalls to avoid in parameter-efficient fine-tuning
- The process of parameter-efficient fine-tuning
A glossary of important terms
LLMs: Large language models, or LLMs, are a type of machine learning model that can learn the underlying structure and semantics of text data for NLP tasks. They do this by learning a set of latent variables representing the text’s high-level concepts and features. Essentially, LLMs try to capture what the text is about, rather than focusing solely on which words are used.
Pre-trained models: Pre-trained models are machine learning models that have been trained on large amounts of data to facilitate a specific task, such as image classification, speech recognition, or natural language processing. These models have already learned the optimal set of weights and parameters needed to perform the task effectively so that they can be used as a starting point for further training on new data or for use in other applications.
Parameters: Parameters are the values/variables that a model learns during training to make predictions or classifications on new data. Parameters are usually represented as weights and biases in neural networks, and they control how the input data is transformed into output predictions.
Transfer learning: Transfer learning refers to taking a pre-trained model developed for a specific task and reusing it as a starting point for a new, related task. This involves using the pre-trained model’s learned feature representations as a starting point for a new model, which is then trained on a smaller dataset specific to the new task.
Fine-tuning: Fine-tuning is a specific type of transfer learning where the pre-trained model’s weights are adjusted or fine-tuned on a new task-specific dataset. The pre-trained model is used as a starting point in this process, but the weights are adjusted during training to fit the new data better. The amount of fine-tuning can vary depending on the amount of available data and the similarity between the original and new tasks.
Padding: Padding is a common technique used during fine-tuning language models to handle variable-length input sequences. It is the process of adding special tokens (typically a “padding” token) to the input sequence to bring it up to a fixed length.
Hidden representations: Hidden representations are the internal representations of the input data learned by the pre-trained model’s layers. These representations capture different levels of abstraction of the input data and can be used as features to train a new model for the task at hand.
Few-shot learning: Few-shot learning is a machine learning technique that aims to train models on a limited amount of labeled data, typically in the range of a few dozen to a few hundred examples, and then generalize to new tasks with only a few or even a single labeled example. Few-shot learning algorithms can learn to recognize novel objects, categories, or concepts with very few examples by leveraging prior knowledge from related tasks or domains.
What is PEFT?
Parameter-efficient Fine-tuning (PEFT) is a technique used in Natural Language Processing (NLP) to improve the performance of pre-trained language models on specific downstream tasks. It involves reusing the pre-trained model’s parameters and fine-tuning them on a smaller dataset, which saves computational resources and time compared to training the entire model from scratch.
PEFT achieves this efficiency by freezing most of the pre-trained model’s parameters and training only a small number of them, for example the last few layers that are specific to the downstream task, or a handful of small modules added to the model. This way, the model can be adapted to new tasks with less computational overhead and fewer labeled examples. Although PEFT is a relatively new concept, updating only the last layer of a model has been common practice in computer vision since the introduction of transfer learning. Even in NLP, experiments with static and non-static word embeddings were carried out early on.
Parameter-efficient fine-tuning aims to improve the performance of pre-trained models, such as BERT and RoBERTa, on various downstream tasks, including sentiment analysis, named entity recognition, and question-answering. It achieves this in low-resource settings with limited data and computational resources. It modifies only a small subset of model parameters and is less prone to overfitting.
What is a PEFT model?
A PEFT model is a pre-trained model that has been fine-tuned using the parameter-efficient fine-tuning technique. A PEFT model starts as a general-purpose model trained on vast amounts of data to learn a broad understanding of language or image patterns. The fine-tuning process then adapts this model to perform well on more specific tasks by modifying only a select few parameters rather than the entire network. This selective updating makes PEFT models particularly useful for applications where deploying large-scale models is computationally or financially prohibitive. By focusing updates on the most impactful parameters, a PEFT model maintains high performance while being more efficient and agile than fully retrained models.
What is the difference between fine-tuning and parameter-efficient fine-tuning?
Fine-tuning and parameter-efficient fine-tuning are two approaches used in machine learning to improve the performance of pre-trained models on a specific task.
Fine-tuning is taking a pre-trained model and training it further on a new task with new data. The entire pre-trained model is usually trained in fine-tuning, including all its layers and parameters. This process can be computationally expensive and time-consuming, especially for large models.
On the other hand, parameter-efficient fine-tuning is a method of fine-tuning that focuses on training only a subset of the pre-trained model’s parameters. This approach involves identifying the most important parameters for the new task and only updating those parameters during training. Doing so, PEFT can significantly reduce the computation required for fine-tuning.
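To make the contrast concrete, here is a minimal sketch that compares the share of trainable parameters under the two regimes. The layer names and parameter counts are made up for illustration and do not describe any real model.

```python
# Toy comparison of full fine-tuning vs. updating only a small subset.
# Layer names and parameter counts are hypothetical.
layer_params = {
    "embeddings": 30_000_000,
    "encoder_block_1": 7_000_000,
    "encoder_block_2": 7_000_000,
    "task_module": 50_000,  # small module that is trained for the new task
}


def trainable_fraction(trainable_layers):
    """Return the share of all parameters that would be updated during training."""
    total = sum(layer_params.values())
    trainable = sum(layer_params[name] for name in trainable_layers)
    return trainable / total


full_ft = trainable_fraction(layer_params.keys())  # every layer is updated
peft = trainable_fraction(["task_module"])         # everything else stays frozen

print(f"full fine-tuning: {full_ft:.2%} of parameters are trainable")
print(f"PEFT-style:       {peft:.2%} of parameters are trainable")
```

In a real framework the same idea is usually expressed by disabling gradient computation for the frozen parameters, so that the optimizer only ever sees the small trainable subset.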
Aspect | Parameter-efficient fine-tuning | Standard fine-tuning |
Goal | Improve the performance of a pre-trained model on a specific task with limited data and computation | Improve the performance of a pre-trained model on a specific task with ample data and computation |
Training Data | Small dataset (fewer examples) | Large dataset (many examples) |
Training Time | Faster training time as compared to fine-tuning | Longer training time as compared to PEFT |
Computational Resources | Uses fewer computational resources | Requires larger computational resources |
Model Parameters | Modifies only a small subset of model parameters | Re-trains the entire model |
Overfitting | Less prone to overfitting as the model is not excessively modified | More prone to overfitting as the model is extensively modified |
Task Performance | Often comparable to full fine-tuning, though it can fall slightly short | Typically matches or exceeds PEFT, at a higher computational cost |
Use Cases | Ideal for low-resource settings or where large amounts of training data are not available | Ideal for high-resource settings with ample training data and computational resources |
Parameter-efficient fine-tuning can be particularly useful in scenarios where computational resources are limited or where large pre-trained models are involved. In such cases, PEFT can provide a more efficient way of fine-tuning the model with little or no loss in performance. However, it’s important to note that PEFT may sometimes fall short of full fine-tuning, especially in cases where the pre-trained model requires significant modification to perform well on the new task.
Benefits of PEFT
Here, we discuss the benefits of PEFT relative to traditional fine-tuning, and why it is often the more practical choice.
- Decreased computational and storage costs: PEFT involves fine-tuning only a small number of extra model parameters while freezing most parameters of the pre-trained LLMs, thereby reducing computational and storage costs significantly.
- Overcoming catastrophic forgetting: During full fine-tuning of LLMs, catastrophic forgetting can occur where the model forgets the knowledge it learned during pretraining. PEFT stands to overcome this issue by only updating a few parameters.
- Better performance in low-data regimes: PEFT is particularly advantageous in scenarios with sparse data. PEFT can achieve superior performance and better generalize to new, unseen domains compared to traditional fine-tuning methods by optimizing a smaller parameter subset.
- Portability: PEFT methods enable users to obtain tiny checkpoints worth a few MBs compared to the large checkpoints of full fine-tuning. This makes the trained weights from PEFT approaches easy to deploy and use for multiple tasks without replacing the entire model. Models tuned via PEFT are smaller and more manageable, resulting in lightweight models that are easier to deploy across various platforms, including mobile and other resource-constrained devices.
- Comparable performance with fewer trainable parameters: Despite tuning fewer parameters, PEFT can achieve performance levels comparable to full model fine-tuning. This efficiency enables scaling to various applications without extensive computational resources.
- Sustainability: PEFT represents a more sustainable approach to model training, requiring less computational power and time. It reduces the carbon footprint associated with extensive computational tasks, aligning with eco-friendly operational goals.
- Faster training cycles: PEFT’s focus on fewer parameters speeds up the training process, enabling quicker iterations and deployments. This is ideal for projects with tight development schedules.
- Lower storage requirements: The resultant model footprint is smaller with most of the original model parameters unchanged. This size reduction facilitates easier management and distribution of model updates, making continuous improvement more practical in operational environments.
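The storage and portability benefits listed above come down to simple arithmetic. The sketch below estimates checkpoint sizes for a hypothetical 3-billion-parameter base model and a hypothetical set of PEFT weights; both parameter counts are assumptions chosen for illustration, not measurements.

```python
# Back-of-the-envelope checkpoint sizes. All numbers are illustrative assumptions.
BYTES_PER_PARAM_FP16 = 2  # a 16-bit float occupies 2 bytes


def checkpoint_megabytes(num_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate size of the saved weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


base_model_params = 3_000_000_000  # hypothetical 3B-parameter base model
adapter_params = 2_000_000         # hypothetical number of PEFT weights

full_ckpt_mb = checkpoint_megabytes(base_model_params)
adapter_ckpt_mb = checkpoint_megabytes(adapter_params)

print(f"full fine-tuned checkpoint: ~{full_ckpt_mb:,.0f} MB per task")
print(f"PEFT checkpoint:            ~{adapter_ckpt_mb:,.0f} MB per task")
```

With full fine-tuning, every new task needs its own copy of the large checkpoint, whereas with PEFT the base model is stored once and only the small task-specific file is added per task.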
Use cases of parameter-efficient fine-tuning
Natural Language Processing (NLP)
- Text classification
- Sentiment analysis: Utilize PEFT to adapt large language models for sentiment analysis with minimal resources, ideal for real-time social media evaluation, reviews, and customer feedback.
- Named Entity Recognition (NER): Efficiently refine models to identify key entities such as names, organizations, and locations in text. This is pivotal for data extraction in sectors like healthcare and finance.
- Machine translation: Adapt pre-trained models for specific language pairs or sectors using PEFT, delivering high-quality translations with reduced computational demand, suitable for deployment in resource-limited settings.
Conversational AI
- Chatbots and virtual assistants: Employ PEFT to tailor pre-trained conversational models for distinct industries or corporate environments, enhancing the model’s ability to manage specialized queries and contexts.
Computer vision
- Image classification: Modify pre-trained vision models for particular datasets using minimal parameter adjustments. This approach benefits medical imaging, where models are fine-tuned to detect specific conditions.
- Object detection: Enhance models to efficiently identify and classify objects in imagery and videos. This is vital for use in surveillance, autonomous driving, and retail inventory management.
Speech recognition
- Adapt extensive pre-trained speech recognition models to specific accents, dialects, or languages with PEFT, boosting accuracy and utility across diverse linguistic landscapes.
Recommendation systems
- Personalize recommendation engines swiftly for distinct user demographics or content categories using PEFT. This method allows systems to adapt promptly to new trends, enhancing user engagement and satisfaction.
Healthcare
- Medical diagnostics: Apply PEFT to specialize models on particular medical datasets for aiding in disease diagnosis from images or clinical data, which is crucial in settings with limited resources.
- Drug discovery: Utilize PEFT to modify pre-trained models for analyzing vast chemical datasets, accelerating drug development processes while minimizing computational costs.
Finance
- Fraud detection: With PEFT, customize models to identify fraudulent transactions in financial data, enabling the systems to adapt efficiently to emerging fraud patterns.
PEFT: A better alternative to standard fine-tuning
A standard fine-tuning process involves adjusting the hidden representations (h) extracted by transformer models to enhance their performance in downstream tasks. These hidden representations refer to any features the transformer architecture extracts, such as the output of a transformer layer or a self-attention layer.
To illustrate, suppose we have an input sentence, “This is a total waste of money.” Before fine-tuning, the transformer model computes the hidden representations (h) of each token in the sentence. After fine-tuning, the model’s parameters are updated, and the updated parameters will generate a different set of hidden representations, denoted by h’. Thus, the hidden representations generated by the pre-trained and fine-tuned models will differ even for the same sentence.
In essence, fine-tuning is a process that modifies the pre-trained language model’s hidden representations to make them more suitable for downstream tasks. However, fine-tuning all the parameters in the model is not necessary to achieve this goal. Only fine-tuning a small fraction of the parameters is often sufficient to change the hidden representations from h to h’.
Parameter-efficient fine-tuning techniques
The following are among the most widely used PEFT methods today. Research into new methods is ongoing.
Adapter
Adapters are a special type of submodule that can be added to pre-trained language models to modify their hidden representation during fine-tuning. By inserting adapters after the multi-head attention and feed-forward layers in the transformer architecture, we can update only the parameters in the adapters during fine-tuning while keeping the rest of the model parameters frozen.
Adopting adapters can be a straightforward process. All that is required is to add adapters into each transformer layer and place a classifier layer on top of the pre-trained model. By updating the parameters of the adapters and the classifier head, we can improve the performance of the pre-trained model on a particular task without updating the entire model. This approach can save time and computational resources while still producing impressive results.
How does fine-tuning using an adapter work?
The adapter module comprises two feed-forward projection layers connected with a non-linear activation layer. There is also a skip connection that bypasses the feed-forward layers.
If we take the adapter placed right after the multi-head attention layer, then the input to the adapter layer is the hidden representation h calculated by the multi-head attention layer. Here, h takes two different paths in the adapter layer; one is the skip-connection, which leaves the input unchanged, and the other way involves the feed-forward layers.
Initially, the first feed-forward layer projects h into a low-dimension space. This space has a dimension less than h. Following this, the input is passed through a non-linear activation function, and the second feed-forward layer then projects it back up to the dimensionality of h. The results obtained from the two ways are summed together to obtain the final output of the adapter module.
The skip-connection preserves the original input h of the adapter, while the feed-forward path generates an incremental change, represented as Δh, based on the original h. By adding the incremental change Δh, obtained from the feed-forward layer with the original h from the previous layer, the adapter modifies the hidden representation calculated by the pre-trained model. This allows the adapter to alter the hidden representation of the pre-trained model, thereby changing its output for a specific task.
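The adapter computation described above can be sketched in a few lines of plain Python. The dimensions, weights and the choice of ReLU as the non-linearity are assumptions made for illustration; real adapters operate on much larger vectors and learn their weights during fine-tuning.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Non-linear activation (ReLU is an assumption; other activations work too)."""
    return [max(0.0, x) for x in vector]


def adapter(h, w_down, w_up):
    """Bottleneck adapter: h' = h + W_up * relu(W_down * h)."""
    delta_h = matvec(w_up, relu(matvec(w_down, h)))  # feed-forward path
    return [a + b for a, b in zip(h, delta_h)]       # skip connection adds h back


# Toy sizes: hidden size 4, bottleneck size 2 (smaller than the hidden size).
h = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.5, 0.0, 0.0, 0.0],
          [0.0, 0.5, 0.0, 0.0]]   # projects 4 -> 2
w_up = [[1.0, 0.0],
        [0.0, 1.0],
        [2.0, 0.0],
        [0.0, 0.0]]               # projects 2 -> 4

h_prime = adapter(h, w_down, w_up)

# If the up-projection is all zeros, delta_h is zero and the adapter is an identity.
w_up_zero = [[0.0, 0.0] for _ in range(4)]
h_unchanged = adapter(h, w_down, w_up_zero)

print("h  =", h)
print("h' =", h_prime)
```

Only `w_down` and `w_up` would be trained; the skip connection guarantees that the adapter can fall back to passing the pre-trained representation through untouched.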
LoRA
Low-Rank Adaptation (LoRA) of large language models is another approach to fine-tuning models for specific tasks or domains. Like adapters, LoRA is a small trainable submodule that can be inserted into the transformer architecture. It involves freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. This method can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times while still performing on par with or better than full fine-tuning on various tasks. LoRA also allows for more efficient task-switching, lowers the hardware barrier to entry, and, unlike adapters, introduces no additional inference latency, because the low-rank matrices can be merged into the frozen weights after training.
How does it work?
LoRA is inserted in parallel to the modules in the pre-trained transformer model, specifically in parallel to the feed-forward layers. A feed-forward layer has two projection layers and a non-linear layer in between them, where the input vector is projected into an output vector with a different dimensionality using an affine transformation. The LoRA layers are inserted next to each of the two feed-forward layers.
Now, let us consider the feed-forward up-project layer and the LoRA module next to it. The original parameters of the feed-forward layer take the output of the previous layer, which has dimension d_model, and project it into dimension d_FFW (FFW is short for feed-forward). The LoRA module placed next to it consists of two feed-forward layers. LoRA’s first feed-forward layer takes the same input as the feed-forward up-project layer and projects it into an r-dimensional vector, where r is far smaller than d_model. Then, the second feed-forward layer projects that vector into another vector with a dimensionality of d_FFW. Finally, the two vectors, one from the original layer and one from LoRA, are added together to form the final representation.
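A quick calculation shows why routing the update through an r-dimensional bottleneck saves so many parameters. The dimensions below are hypothetical and bias terms are ignored for simplicity.

```python
def dense_params(d_in, d_out):
    """Number of weights in a dense projection (bias ignored)."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Number of weights in the two LoRA projections: d_in -> r and r -> d_out."""
    return d_in * r + r * d_out


# Hypothetical sizes for the model dimension, feed-forward dimension and LoRA rank.
d_model, d_ffw, r = 1024, 4096, 8

frozen = dense_params(d_model, d_ffw)        # original up-projection, kept frozen
trainable = lora_params(d_model, d_ffw, r)   # LoRA weights, the only ones trained

print(f"frozen up-projection weights: {frozen:,}")
print(f"trainable LoRA weights:       {trainable:,}")
print(f"reduction factor:             {frozen / trainable:.1f}x")
```

The smaller the rank r relative to the layer dimensions, the larger the saving, at the cost of a less expressive update.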
As we have discussed earlier, fine-tuning is changing the hidden representation h calculated by the original transformer model. Hence, in this case, the hidden representation calculated by the feed-forward up-project layer of the original transformer is h. Meanwhile, the vector calculated by LoRA is the incremental change Δh that is used to modify the original h. Thus, the sum of the original representation and the incremental change is the updated hidden representation h’.
By inserting LoRA modules next to the feed-forward layers and a classifier head on top of the pre-trained model, task-specific parameters for each task are kept to a minimum.
Prefix tuning
Prefix-tuning is a lightweight alternative to fine-tuning large pre-trained language models for natural language generation tasks. Fine-tuning requires updating and storing all the model parameters for each task, which can be very expensive given the large size of current models. Prefix-tuning keeps the language model parameters frozen and optimizes a small continuous task-specific vector called the prefix. In prefix-tuning, the prefix is a set of free parameters that are trained while the language model’s own parameters stay fixed. The goal of prefix-tuning is to find a context that steers the language model toward generating text that solves a particular task.
The prefix can be seen as a sequence of “virtual tokens” that subsequent tokens can attend to. By learning only 0.1% of the parameters, prefix-tuning obtains comparable performance to fine-tuning in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
Similar to all previously mentioned PEFT techniques, the end goal of prefix tuning is to reach h’. Prefix tuning uses prefixes to modify the hidden representations extracted by the original pre-trained language models. When the incremental change Δh is added to the original hidden representation h, we get the modified representation, i.e., h’.
When using prefix tuning, only the prefixes are updated, while the rest of the layers are fixed and not updated.
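One way to picture how a prefix changes a hidden representation is to look at a single attention step. In the sketch below, the keys and values of the real tokens come from the frozen model, and one trainable “virtual token” is prepended to them. The single-head, single-query setup and all numbers are simplifying assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Keys and values computed by the frozen model for two real tokens (toy numbers).
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 1.0], [2.0, 0.0]]

# Trainable prefix: one virtual token with its own key and value vectors.
prefix_keys = [[0.5, 0.5]]
prefix_values = [[0.0, 3.0]]

query = [1.0, 0.0]
h = attend(query, token_keys, token_values)                                        # without prefix
h_prime = attend(query, prefix_keys + token_keys, prefix_values + token_values)    # with prefix
delta_h = [b - a for a, b in zip(h, h_prime)]

print("h       =", h)
print("h'      =", h_prime)
print("delta_h =", delta_h)
```

Because the query can attend to the prefix, its output shifts from h to h’ even though none of the frozen model’s weights changed; training adjusts only `prefix_keys` and `prefix_values`.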
Prompt tuning
Prompt tuning is another PEFT technique for adapting pre-trained language models to specific downstream tasks. Unlike the traditional “model tuning” approach, where all the pre-trained model parameters are tuned for each task, prompt tuning learns soft prompts through backpropagation, using labeled examples for the task at hand. Prompt tuning outperforms the few-shot learning of GPT-3 and becomes more competitive with model tuning as the model size increases. It also improves robustness to domain transfer and enables efficient prompt ensembling. It requires storing only a small task-specific prompt for each task, making it easier to reuse a single frozen model for multiple downstream tasks, unlike model tuning, which requires making a task-specific copy of the entire pre-trained model for each task.
How does it work?
Prompt tuning is a simpler variant of prefix tuning. In it, some vectors are prepended at the beginning of a sequence at the input layer. When presented with an input sentence, the embedding layer converts each token into its corresponding word embedding, and the prefix embeddings are prepended to the sequence of token embeddings. Next, the pre-trained transformer layers will process the embedding sequence like a transformer model does to a normal sequence. Only the prefix embeddings are adjusted during the fine-tuning process, while the rest of the transformer model is kept frozen and unchanged.
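The input-layer step can be sketched as a simple list concatenation. The vocabulary, embedding values and number of virtual tokens below are hypothetical; in practice the soft prompt is a trainable tensor and the embedding table belongs to the frozen model.

```python
# Frozen embedding table of the pre-trained model (toy values, hypothetical vocabulary).
embedding_table = {
    "this": [0.1, 0.2, 0.3],
    "is": [0.0, 0.5, 0.1],
    "great": [0.9, 0.1, 0.4],
}

# Trainable soft prompt: one vector per virtual token, same width as the embeddings.
soft_prompt = [
    [0.01, 0.02, 0.03],
    [0.04, 0.05, 0.06],
]


def build_model_input(tokens):
    """Look up the frozen token embeddings and prepend the trainable prompt vectors."""
    token_embeddings = [embedding_table[token] for token in tokens]
    return soft_prompt + token_embeddings


sequence = build_model_input(["this", "is", "great"])
print(f"{len(soft_prompt)} virtual tokens + 3 real tokens = {len(sequence)} input vectors")
```

The transformer layers then process `sequence` exactly as they would any other embedding sequence; gradients flow back only into `soft_prompt`.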
This technique has several advantages over traditional fine-tuning methods, including improved efficiency and reduced computational overhead. Additionally, the fact that only the prefix embeddings are fine-tuned means that there is a lower risk of overfitting to the training data, thereby producing more robust and generalizable models.
P-tuning
P-tuning can improve the performance of language models such as GPTs in Natural Language Understanding (NLU) tasks. Traditional fine-tuning techniques have not been effective for GPTs, but P-tuning uses trainable continuous prompt embeddings to improve their performance. This method has been tested on two NLU benchmarks, LAMA and SuperGLUE, and has shown significant improvements in precision and world knowledge recovery. P-tuning also reduces the need for prompt engineering and outperforms state-of-the-art approaches on the few-shot SuperGLUE benchmark.
P-tuning can be used to improve pre-trained language models for various tasks, including sentence classification and predicting a country’s capital. The technique involves modifying the input embeddings of the pre-trained language model with differentiable output embeddings generated using a prompt. The continuous prompts can be optimized using a downstream loss function and a prompt encoder, which helps solve discreteness and association challenges.
IA3
IA3, short for Infused Adapter by Inhibiting and Amplifying Inner Activations, is another parameter-efficient fine-tuning technique designed to improve upon the LoRA technique. It focuses on making the fine-tuning process more efficient by reducing the number of trainable parameters in a model.
Both LoRA and IA3 share some similarities in their core objective of improving fine-tuning efficiency. They achieve this by introducing learned components, reducing the number of trainable parameters, and keeping the original pre-trained weights frozen. These shared characteristics make both techniques valuable tools for adapting large pre-trained models to specific tasks while minimizing computational demands. Additionally, both LoRA and IA3 prioritize maintaining model performance, ensuring that fine-tuned models remain competitive with fully fine-tuned ones. Furthermore, their capacity to merge adapter weights without adding inference latency contributes to their versatility and practicality for real-time applications and various downstream tasks.
How does IA3 work?
IA3 optimizes the fine-tuning process by rescaling the inner activations of a pre-trained model using learned vectors. These learned vectors are incorporated into the attention and feedforward modules within a standard transformer-based architecture. The key innovation of IA3 is that it freezes the original pre-trained weights of the model, making only the introduced learned vectors trainable during fine-tuning. This drastic reduction in the number of trainable parameters significantly improves the efficiency of fine-tuning without compromising model performance. IA3 is compatible with various downstream tasks, maintains inference speed, and can be applied to specific layers of a neural network, making it a valuable tool for efficient model adaptation and deployment.
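The rescaling itself is just an element-wise multiplication. In the sketch below, the activation values and the trained vector are made up; the all-ones starting point is included so that the rescaled model initially behaves exactly like the pre-trained one, which is how IA3’s learned vectors are commonly initialized.

```python
def rescale(activations, learned_vector):
    """Element-wise rescaling: each activation is multiplied by its learned factor."""
    return [a * l for a, l in zip(activations, learned_vector)]


# Activations produced by the frozen weights, e.g. an attention key vector (toy values).
keys = [0.2, -1.0, 0.7]

l_k_init = [1.0, 1.0, 1.0]     # all ones: the rescaling leaves the activations unchanged
l_k_trained = [1.5, 0.0, 1.0]  # hypothetical values after training: amplify, inhibit, keep

unchanged = rescale(keys, l_k_init)
rescaled = rescale(keys, l_k_trained)

print("before training:", unchanged)
print("after training: ", rescaled)
```

Each learned vector has only as many entries as the activation it rescales, which is why the number of trainable parameters stays so small.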
Comparative analysis of popular PEFT methods
PEFT methods | Description | When to use | Computational overhead | Memory efficiency | Versatility across tasks | Performance Impact |
Adapters | Inserts neural modules between a model’s layers; only adapter weights are updated during fine-tuning. | To perform multiple tasks on one model. Flexibility required. | Moderate | Good (only adapters are fine-tuned) | High (can be added for multiple tasks) | Typically positive if adapters are well-tuned |
LoRA | Introduces a low-rank matrix into the attention mechanism to learn task-specific patterns. | Tasks with specialized attention requirements. Limited resources. | Low-moderate | Good | Moderate | Generally positive with good training |
Prefix Tuning | Adds a trainable prefix to modify the model’s learned representation. | Task-specific adaptation. Limited resources. | Low | Moderate | Moderate | Can vary, but usually positive with proper tuning |
Prompt Tuning | Prepends trainable soft-prompt embeddings to the input sequence while the model itself stays frozen. | Large pre-trained model. Adaptation to multiple tasks. | Low | Moderate | High | Depends on prompt quality |
P-tuning | Employs trainable prompt embeddings that encapsulate task-specific information for better adaptability. | Situations requiring precise, contextual modifications without extensive model retraining. | Low | Moderate | High | Positive, especially in nuanced tasks |
IA3 | Rescales the model’s inner activations in the attention and feed-forward modules with learned vectors while the pre-trained weights stay frozen. | Adapting large models when the number of trainable parameters must be kept very small. | Moderate | Good | High | Superior adaptability and performance |
Training your model using PEFT
In our example, we will use LoRA to fine-tune a pre-trained sequence-to-sequence language model to generate text for a specific task, in this case, for Twitter complaints.
Import the dependencies and define the variables
First, let us import all the necessary libraries, modules and other dependencies, such as AutoModelForSeq2SeqLM, PeftModel, torch, datasets and AutoTokenizer. The code looks something like this:
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel, PeftConfig
import torch
from datasets import load_dataset
import os
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import default_data_collator, get_linear_schedule_with_warmup
from tqdm import tqdm
Next, we need to define the name of the dataset, the text column name, the label column name, and the batch size for training the model.
dataset_name = "twitter_complaints"
text_column = "Tweet text"
label_column = "text_label"
batch_size = 8
Now, run the following commands to define the pre-trained PEFT model and load its configuration.
peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)
In the code above, the ‘peft_model_id’ variable contains the ID of the pre-trained PEFT model, and the ‘config’ variable holds that model’s configuration, including the name of its base model.
Now, set the maximum memory allowed for each device; say, GPU is allowed to use up to 6GB of memory, and the CPU can use up to 30GB of memory.
max_memory = {0: "6GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu": "30GB"}
Load the base model of the pre-trained PEFT model specified by peft_model_id.
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory)
In the above command, the ‘AutoModelForSeq2SeqLM’ class is used to load the base model and the ‘from_pretrained’ function is used to load the weights of the pre-trained model. The ‘device_map’ argument specifies the mapping between devices and model components, and the ‘max_memory’ argument specifies the maximum memory allowed for each device.
Next, load the full PEFT model specified by ‘peft_model_id’ using the following command:
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)
Preprocess the data
Map the dataset labels to human-readable class names:
The first step in preprocessing the data is to map the dataset labels to human-readable class names. For this, you need to replace all the underscores with spaces in the label names of the training set.
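# Load the dataset first. Assumption: twitter_complaints is taken from the RAFT benchmark ("ought/raft") on the Hugging Face Hub.
dataset = load_dataset("ought/raft", dataset_name)
classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
print(classes)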
Then, run the following code to add a text_label column that holds the human-readable class name for each example.
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
print(dataset)
dataset["train"][0]
Tokenization:
First, we need to load a pre-trained tokenizer from the transformers library for tokenization. We also need to set the maximum length of the target labels by tokenizing each class label and taking the length of the resulting list of token IDs. This can be used later to pad all labels to a consistent length. For this, run the following:
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])
Run the following code to define a function that extracts the text and target labels from the input examples, tokenizes the text using the pre-trained tokenizer, and pads the labels to a consistent length. Padding positions in the labels are set to -100 so that the loss function ignores them.
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    # Tokenize the tweet text (inputs are padded later, per batch)
    model_inputs = tokenizer(inputs, truncation=True)
    # Tokenize the labels and pad them all to target_max_length
    labels = tokenizer(
        targets, max_length=target_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    labels = labels["input_ids"]
    # Replace padding token IDs with -100 so the loss ignores them
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs
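To make the label handling above concrete, here is a minimal, library-free sketch of what the padding and masking step does to a batch of label IDs. The helper name `pad_and_mask_labels` and the token IDs are hypothetical and chosen only for illustration; the one convention taken from the code above is that padded positions are replaced with -100, the index that PyTorch's cross-entropy loss ignores by default.

```python
def pad_and_mask_labels(label_ids, max_length, pad_token_id, ignore_index=-100):
    """Pad each label sequence to max_length, then mask the padding."""
    batch = []
    for ids in label_ids:
        ids = ids[:max_length]  # truncate labels that are too long
        padded = ids + [pad_token_id] * (max_length - len(ids))
        # Replace padding with ignore_index so it does not contribute to the loss
        batch.append([ignore_index if t == pad_token_id else t for t in padded])
    return batch


# Hypothetical token IDs for two class labels of different lengths
labels = [[150, 1], [220, 9731, 1]]
masked = pad_and_mask_labels(labels, max_length=3, pad_token_id=0)
print(masked)
```

Because every label ends up with the same length, the batch can be stacked into a single tensor, while the masked positions leave the loss unaffected.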
Apply the preprocessing function to every split of the dataset to prepare it for fine-tuning.
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
)
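Now, pull the training, evaluation, and test sets out of the preprocessed dataset.
# Note: this assumes the dataset object contains an "eval" split.
# If it only provides "train" and "test", create an evaluation split first.
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["eval"]
test_dataset = processed_datasets["test"]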
Define a collate function:
Next, we need to define a collate function to gather and combine the preprocessed examples into batches.
def collate_fn(examples):
    return tokenizer.pad(examples, padding="longest", return_tensors="pt")
Next, define the data loaders for the training, evaluation, and test datasets.
train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True)
test_dataloader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True)
Model training and evaluation
To train the model using the preprocessed dataset, first, define the specifications, like the number of epochs and loss function.
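The article does not show the training code itself, so the following is only a minimal sketch of what such a specification might look like in PyTorch, not the original training script. It assumes a toy model (a frozen `backbone` plus a small trainable `head`) as a stand-in for a PEFT model, and batches of `(features, labels)` tuples; the names `train`, `toy_model`, and the hyperparameter values are all hypothetical. With a Hugging Face model and the data loaders defined above, each batch would instead be a dictionary and the loss would typically be read from `model(**batch).loss`.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, dataloader, num_epochs=3, lr=1e-3, device="cpu"):
    """Generic loop that optimizes only parameters with requires_grad=True."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    total_steps = num_epochs * len(dataloader)
    # Linearly decay the learning rate to zero over the whole run
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(0.0, 1.0 - step / total_steps)
    )
    loss_fn = nn.CrossEntropyLoss()
    model.to(device)
    history = []
    for _ in range(num_epochs):
        model.train()
        running = 0.0
        for features, labels in dataloader:
            features, labels = features.to(device), labels.to(device)
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            running += loss.item()
        history.append(running / len(dataloader))  # mean loss for this epoch
    return history


# Toy stand-in for a PEFT setup: frozen backbone, small trainable head
torch.manual_seed(0)
backbone = nn.Linear(8, 16)
for p in backbone.parameters():
    p.requires_grad = False
backbone_before = backbone.weight.clone()
head = nn.Linear(16, 2)
toy_model = nn.Sequential(backbone, nn.ReLU(), head)

x = torch.randn(64, 8)
y = (x[:, 0] > 0).long()
loader = DataLoader(TensorDataset(x, y), batch_size=8, shuffle=True)
losses = train(toy_model, loader, num_epochs=5, lr=1e-2)
print(losses)
```

Passing only the trainable parameters to the optimizer keeps the optimizer state small, which is one of the practical savings of parameter-efficient fine-tuning.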
Once trained, switch the model to evaluation mode and generate a prediction for a sample from the test set.
model.eval()
i = 15
inputs = tokenizer(f'{text_column} : {dataset["test"][i]["Tweet text"]} Label : ', return_tensors="pt")
print(dataset["test"][i]["Tweet text"])
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
Assessing the performance of a fine-tuned machine learning model is an essential step. One common way to evaluate a model’s performance is by checking its accuracy on an evaluation dataset. You can refer to this GitHub repository to view the entire evaluation process, including the code for calculating these metrics.
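As a rough illustration of the accuracy check described above, the sketch below compares decoded predictions with reference labels using exact string matching. The function name `accuracy` and the sample strings are hypothetical; the referenced repository may compute its metrics differently.

```python
def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical decoded model outputs and ground-truth labels
preds = ["complaint", "no complaint", "complaint ", "no complaint"]
refs = ["complaint", "no complaint", "no complaint", "no complaint"]
score = accuracy(preds, refs)
print(f"accuracy = {score:.1f}%")
```

Stripping whitespace before comparing avoids counting a correct label as wrong because of stray spaces in the generated text.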
Few-shot in-context learning vs. parameter-efficient fine-tuning
Few-shot in-context learning (ICL) and parameter-efficient fine-tuning (PEFT) are two approaches for adapting natural language processing models to new tasks. Although both enable pre-trained language models to perform new tasks without extensive training, they work in technically different ways. The first approach, ICL, allows the model to perform a new task by inputting prompted examples without requiring gradient-based training. However, ICL incurs significant computational, memory, and storage costs. The second approach, PEFT, involves training a small number of added or selected parameters to enable a model to perform a new task with minimal updates.
ICL is an approach that aims to improve the few-shot performance of pre-trained language models by supplying contextual information at inference time rather than through fine-tuning. The model’s weights stay unchanged; instead, a handful of worked examples, or additional sentences or paragraphs that describe the task, are placed in the input ahead of the query. ICL relies on this contextual information to help the model generalize to new tasks, even when only a few examples are available.
On the other hand, parameter-efficient fine-tuning is an approach that aims to make fine-tuning pre-trained language models on downstream tasks more efficient by freezing most of the model’s parameters and updating only a small number of added or selected ones. The model is trained on a small amount of task data, and because the frozen parameters cannot drift, it retains more of its pre-trained knowledge, which helps limit overfitting and improves performance on downstream tasks with limited training data.
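To illustrate how contextual information can be supplied as input, here is a minimal sketch that packs a few labeled examples and a new query into a single prompt. The helper `build_few_shot_prompt`, the field names, and the sample tweets are hypothetical; real prompt formats vary by model and task.

```python
def build_few_shot_prompt(examples, query, text_field="Tweet text", label_field="Label"):
    """Concatenate labeled demonstrations, then the unlabeled query."""
    blocks = [f"{text_field} : {text} {label_field} : {label}" for text, label in examples]
    # The query ends with an empty label slot for the model to complete
    blocks.append(f"{text_field} : {query} {label_field} :")
    return "\n".join(blocks)


# Hypothetical demonstrations for a complaint-detection task
demos = [
    ("My order never arrived.", "complaint"),
    ("Thanks for the quick reply!", "no complaint"),
]
prompt = build_few_shot_prompt(demos, "The app keeps crashing on login.")
print(prompt)
```

Every demonstration is part of the input, so the whole prompt has to be processed again for each new query, which is where much of the inference cost of this approach comes from.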
Is PEFT more efficient than ICL?
Few-shot learning is an important setting for natural language processing applications, where models must quickly adapt to new tasks with limited training examples. In recent years, various approaches have been put forward to tackle this challenge, with ICL being one of the most popular techniques. However, a research paper published in 2022 (“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”) introduces a parameter-efficient few-shot fine-tuning recipe that outperforms ICL in terms of accuracy while requiring significantly fewer computational resources.
One of the main reasons PEFT outperforms ICL in that paper is its use of a novel scaling method called (IA)^3, which rescales inner activations with learned vectors. The paper reports that this technique performs better than fine-tuning the full model while introducing only a few additional parameters. In contrast, ICL updates no parameters at all and depends entirely on the examples placed in the prompt, which must be reprocessed for every prediction.
Another reason is the use of two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices. These loss terms help the model generalize better to new tasks and avoid overfitting.
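The idea of rescaling inner activations with a learned vector can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: it applies a single hypothetical vector `l_ff` to the hidden activation of a toy feed-forward block, uses ReLU as the nonlinearity, and initializes the vector to ones so that the adapted block starts out identical to the frozen one.

```python
import numpy as np


def ffn_with_rescaling(x, w1, w2, l_ff):
    """Feed-forward block whose hidden activation is rescaled element-wise."""
    hidden = np.maximum(x @ w1, 0.0)  # frozen first projection + ReLU
    return (hidden * l_ff) @ w2       # learned vector rescales each hidden unit


rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
w1 = rng.normal(size=(d_model, d_ff))  # frozen pre-trained weights (toy values)
w2 = rng.normal(size=(d_ff, d_model))  # frozen pre-trained weights (toy values)
x = rng.normal(size=(2, d_model))

l_ff = np.ones(d_ff)                   # the only trainable parameters
baseline = np.maximum(x @ w1, 0.0) @ w2
out = ffn_with_rescaling(x, w1, w2, l_ff)

trainable = l_ff.size
frozen = w1.size + w2.size
print(f"trainable: {trainable}, frozen: {frozen}")
```

In this toy block the learned vector adds 16 trainable values next to 128 frozen weights, which shows why the overhead of such a method stays small as the model grows.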
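The two loss terms can be sketched as follows. This is one interpretation of the description above, not the paper's reference code: it assumes that each answer choice is represented by its per-token log-probabilities, that the length-aware term is a cross-entropy over choices scored by their average token log-probability, and that the other term penalizes probability assigned to tokens of incorrect choices. All function names and numbers are hypothetical.

```python
import math


def length_normalized_score(token_log_probs):
    """Average per-token log-probability of one answer choice."""
    return sum(token_log_probs) / len(token_log_probs)


def length_normalization_loss(correct, incorrect):
    """Cross-entropy over choices, each scored by its length-normalized log-probability."""
    scores = [length_normalized_score(correct)]
    scores += [length_normalized_score(choice) for choice in incorrect]
    log_total = math.log(sum(math.exp(s) for s in scores))
    return log_total - scores[0]


def unlikelihood_loss(incorrect):
    """Penalize probability mass placed on the tokens of incorrect choices."""
    n_tokens = sum(len(choice) for choice in incorrect)
    penalty = sum(math.log(1.0 - math.exp(lp)) for choice in incorrect for lp in choice)
    return -penalty / n_tokens


# One correct single-token choice and one incorrect two-token choice
correct = [math.log(0.5)]
incorrect = [[math.log(0.8), math.log(0.2)]]

ln_loss = length_normalization_loss(correct, incorrect)
ul_loss = unlikelihood_loss(incorrect)
print(f"length-normalized loss: {ln_loss:.4f}, unlikelihood loss: {ul_loss:.4f}")
```

Averaging over tokens stops longer answer choices from being penalized simply for containing more tokens, while the second term pushes down the probability of the wrong answers directly.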
In addition to its superior performance, parameter-efficient fine-tuning is also more computationally efficient than ICL. The research paper found that PEFT uses over 1,000x fewer floating-point operations (FLOPs) during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU. This makes PEFT a more practical and scalable solution for real-world NLP applications.
Overall, the introduction of PEFT represents a significant advancement in the field of few-shot learning for NLP applications. Its use of (IA)^3 scaling, additional loss terms, and superior computational efficiency make it a better alternative to ICL for tasks that require rapid adaptation to new few-shot learning scenarios.
Pitfalls to avoid in parameter-efficient fine-tuning
When leveraging PEFT to enhance a pre-trained model, being aware of potential pitfalls is crucial to sidestep suboptimal performance. Here are essential considerations:
- Overfitting: PEFT’s reliance on fine-tuning a limited number of parameters raises the risk of overfitting to the training dataset. Employ regularization methods such as weight decay and dropout to mitigate this risk. Additionally, monitoring validation loss can provide early indicators of overfitting, enabling timely adjustments to training protocols.
- Adapter size selection: The efficacy of PEFT hinges on the appropriate sizing of adapter modules. An undersized adapter might fail to encapsulate necessary information, whereas an oversized adapter could foster overfitting. As a rough starting point, opt for an adapter size around 10% of the pre-trained model’s dimensions, then adjust it based on validation performance.
- Optimal learning rate: The learning rate is pivotal in PEFT. An excessively high rate risks model divergence, while a very low rate could slow down convergence excessively. Implementing a learning rate schedule that methodically decreases the rate over time can optimize training outcomes.
- Pre-trained model selection: The choice of a pre-trained model is fundamental. Models differ in their suitability for specific tasks based on factors like model size, the quality of the initial training data, and proven performance on analogous tasks. Selecting the right model is a strategic decision that can significantly impact the success of PEFT applications.
Pre-trained models commonly used in PEFT:
- BERT (Bidirectional Encoder Representations from Transformers)
- GPT-3 (Generative Pre-trained Transformer 3)
- T5 (Text-to-Text Transfer Transformer)
- RoBERTa (A Robustly Optimized BERT Pretraining Approach)
- XLNet
- ELECTRA
- ALBERT (A Lite BERT)
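Two of the pitfalls listed above, the learning rate and overfitting, lend themselves to a short sketch. The code below shows one possible learning rate schedule (linear warmup followed by linear decay) and a simple early-stopping check on validation loss. The function names, the warmup phase, and the patience value are assumptions made for illustration rather than recommended settings.

```python
def linear_schedule(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def should_stop(val_losses, patience=2):
    """True once validation loss has not improved for `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


# Learning rate at every step of a hypothetical 10-step run
rates = [linear_schedule(s, total_steps=10, base_lr=1e-3, warmup_steps=2) for s in range(11)]
print(rates)

# Validation loss that starts rising after the third evaluation
history = [0.90, 0.70, 0.65, 0.66, 0.70]
print(should_stop(history))
```

Checking validation loss after each epoch and stopping once it stops improving is a simple guard against training an adapter past the point where it starts to overfit.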
The process of parameter-efficient fine-tuning
The steps involved in parameter-efficient fine-tuning can vary depending on the specific implementation and the pre-trained model being used. However, here is a general outline of the steps involved in PEFT:
Pre-training: Initially, a large-scale model is pre-trained on a large dataset using a general task such as image classification or language modeling. This pre-training phase helps the model learn meaningful representations and features from the data.
Task-specific dataset: Gather or create a dataset that is specific to the target task you want to fine-tune the pre-trained model for. This dataset should be labeled and representative of the target task.
Parameter identification: Identify or estimate the importance or relevance of parameters in the pre-trained model for the target task. This step helps in determining which parameters should be prioritized during fine-tuning. Various techniques, such as importance estimation, sensitivity analysis, or gradient-based methods, can be used to identify important parameters.
Subset selection: Select a subset of the pre-trained model’s parameters based on their importance or relevance to the target task. The subset can be determined by setting certain criteria, such as a threshold on the importance scores or selecting the top-k most important parameters.
Fine-tuning: Initialize the selected subset of parameters with the values from the pre-trained model and freeze the remaining parameters. Fine-tune the selected parameters using the task-specific dataset. This involves training the model on the target task data, typically using techniques like Stochastic Gradient Descent (SGD) or Adam optimization.
Evaluation: Evaluate the performance of the fine-tuned PEFT model on a validation set or through other evaluation metrics relevant to the target task. This step helps assess the effectiveness of PEFT in achieving the desired performance while using fewer parameters.
Iterative refinement (optional): Depending on the performance and requirements, you may choose to iterate and refine the PEFT process by adjusting the criteria for parameter selection, exploring different subsets, or fine-tuning for additional epochs to optimize the model’s performance further.
Note that the specific implementation details and techniques used in PEFT can vary across research papers and applications. Even so, these steps provide a foundational framework for deploying a PEFT model effectively.
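The parameter identification, subset selection, and fine-tuning steps above can be sketched on a toy parameter vector. This example makes several assumptions: gradient magnitude stands in for the importance score, the top-k rule is used for selection, and a plain SGD step is applied through a mask. The helper names and numbers are hypothetical, and real implementations operate on full model tensors rather than a four-element array.

```python
import numpy as np


def top_k_mask(importance, k):
    """Boolean mask marking the k entries of a 1-D array with the highest importance."""
    mask = np.zeros(importance.shape, dtype=bool)
    mask[np.argsort(importance)[-k:]] = True
    return mask


def masked_sgd_step(params, grads, mask, lr=0.1):
    """Update only the selected parameters; the rest stay frozen."""
    return params - lr * grads * mask


params = np.array([1.0, 2.0, 3.0, 4.0])   # toy pre-trained parameters
grads = np.array([0.1, -2.0, 0.5, 3.0])   # toy gradients on the target task

mask = top_k_mask(np.abs(grads), k=2)     # keep the two most important entries
updated = masked_sgd_step(params, grads, mask)
print(mask, updated)
```

Only the two parameters with the largest gradient magnitude change, while the others keep their pre-trained values, which mirrors the freeze-and-fine-tune split described in the steps.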
Endnote
PEFT, or Parameter-efficient Fine-tuning, is a technique used to improve the performance of pre-trained language models on specific downstream tasks. It involves freezing most of the pre-trained model’s parameters and training only a small number of added or selected parameters for the downstream task. This technique is more beneficial than traditional fine-tuning in several ways, such as decreased computational and storage costs, reduced catastrophic forgetting, and comparable performance to full fine-tuning with a small number of trainable parameters. Overall, PEFT is a promising approach to improving the efficiency and effectiveness of NLP models in various applications.