Parameter-efficient Fine-tuning (PEFT) is a technique used in Natural Language Processing (NLP) to improve the performance of pre-trained language models on specific downstream tasks. It involves reusing the pre-trained model’s parameters and fine-tuning them on a smaller dataset, which saves computational resources and time compared to training the entire model from scratch.
A guide to Parameter-efficient Fine-tuning (PEFT)
leewayhertz.com/parameter-efficient-fine-tuning

Transfer learning plays a crucial role in the development of large language models such as GPT-3 and BERT. It is an ML technique in which a model trained on one task is used as the starting point for a distinct but similar task. The idea is that the knowledge a model gains from solving one problem can be leveraged to help solve another. One of the earliest examples of transfer learning was the use of pre-trained word embeddings, such as Word2Vec, to improve the performance of NLP models. More recently, with the emergence of large pre-trained language models such as BERT and GPT-3, the scope of transfer learning has extended remarkably.

Fine-tuning is one of the most popular methods used in transfer learning. It involves adapting a pre-trained model to a particular task by training it on a smaller set of task-specific labeled data. However, with the parameter count of large language models reaching hundreds of billions and beyond, fine-tuning the entire model has become computationally expensive and often impractical. In response, the focus has shifted towards in-context learning, where the model is given prompts containing examples of the task and makes predictions conditioned on them without any weight updates. However, in-context learning has to process the full prompt every time the model makes a prediction, and its accuracy is sometimes poor, which makes it a less favorable choice.

This is where Parameter-efficient Fine-tuning (PEFT) comes in as an alternative paradigm to prompting. PEFT fine-tunes only a small subset of the model’s parameters, aiming for performance comparable to full fine-tuning while significantly reducing computational requirements. This article discusses PEFT in detail, exploring its benefits and how it has become an efficient way to fine-tune LLMs on downstream tasks.
Contents
A glossary of important terms
What is PEFT?
What is the difference between fine-tuning and parameter-efficient fine-tuning?
Benefits of PEFT
PEFT: A better alternative to standard fine-tuning
Parameter-efficient fine-tuning techniques
Training your model using PEFT
Few-shot In-context Learning (ICL) vs. Parameter-efficient Fine-tuning (PEFT)
Is PEFT more efficient than ICL?
The process of parameter-efficient fine-tuning

A glossary of important terms

LLMs: Large Language Models, or LLMs, are machine learning models that learn the underlying structure and semantics of text data for NLP tasks. They do this by learning internal representations of the text’s high-level concepts and features. Essentially, LLMs try to capture what the text is about rather than focusing solely on which words are used.

Pre-trained models: Pre-trained models are machine learning models that have been trained on large amounts of data for a task such as image classification, speech recognition, or natural language processing. These models have already learned a useful set of weights for that task, so they can serve as a starting point for further training on new data or for use in other applications.

Parameters: Parameters are the values a model learns during training in order to make predictions or classifications on new data. In neural networks, parameters are usually the weights and biases, and they control how the input data is transformed into output predictions.

Transfer learning: Transfer learning refers to taking a pre-trained model developed for a specific task and reusing it as a starting point for a new, related task. The pre-trained model’s learned feature representations initialize a new model, which is then trained on a smaller dataset specific to the new task.
Fine-tuning: Fine-tuning is a specific type of transfer learning in which the pre-trained model’s weights are adjusted on a new task-specific dataset. The pre-trained model is the starting point, and its weights are updated during training to fit the new data better. How much fine-tuning is needed depends on the amount of available data and the similarity between the original and new tasks.

Padding: Padding is a common technique for handling variable-length input sequences when fine-tuning language models. It adds special tokens (typically a “padding” token) to an input sequence to bring it up to a fixed length.
Hidden representations: Hidden representations are the internal representations of the input data computed by the pre-trained model’s layers. These representations capture different levels of abstraction of the input and can be used as features to train a new model for the task at hand.

Few-shot learning: Few-shot learning is a machine learning setting in which a model must generalize to a new task from only a handful of labeled examples, sometimes a single one. Few-shot learning methods recognize novel objects, categories, or concepts from very few examples by leveraging prior knowledge from related tasks or domains.

What is PEFT?

Parameter-efficient Fine-tuning (PEFT) is a technique used in Natural Language Processing (NLP) to improve the performance of pre-trained language models on specific downstream tasks. It involves reusing the pre-trained model’s parameters and fine-tuning them on a smaller dataset, which saves computational resources and time compared to training the entire model from scratch.

PEFT achieves this efficiency by freezing most of the pre-trained model and training only a small number of parameters, either a selected subset of the existing ones (for example, the last few layers) or a small set of newly added ones. This way, the model can be adapted to new tasks with less computational overhead and fewer labeled examples.

Although PEFT is a relatively novel concept, updating only the last layer of a model has been common practice in computer vision since the introduction of transfer learning. Even in NLP, experiments with static and non-static word embeddings were carried out early on.

Parameter-efficient fine-tuning aims to improve the performance of pre-trained models, such as BERT and RoBERTa, on various downstream tasks, including sentiment analysis, named entity recognition, and question-answering.
It achieves this in low-resource settings with limited data and computational resources. Because it modifies only a small subset of model parameters, it is also less prone to overfitting.

What is the difference between fine-tuning and parameter-efficient fine-tuning?

Fine-tuning and parameter-efficient fine-tuning are two approaches used in machine learning to improve the performance of pre-trained models on a specific task.

Fine-tuning means taking a pre-trained model and training it further on a new task with new data. Usually the entire pre-trained model is trained, including all its layers and parameters. This process can be computationally expensive and time-consuming, especially for large models.
On the other hand, parameter-efficient fine-tuning is a method of fine-tuning that trains only a small subset of parameters. This involves either identifying the existing parameters that matter most for the new task or adding a small number of new ones, and updating only those during training. In doing so, PEFT can significantly reduce the computation required for fine-tuning.

Comparison of parameter-efficient fine-tuning (PEFT) and standard fine-tuning:

Goal
  PEFT: Improve the performance of a pre-trained model on a specific task with limited data and computation
  Standard fine-tuning: Improve the performance of a pre-trained model on a specific task with ample data and computation
Training data
  PEFT: Small dataset (fewer examples)
  Standard fine-tuning: Large dataset (many examples)
Training time
  PEFT: Faster than standard fine-tuning
  Standard fine-tuning: Longer than PEFT
Computational resources
  PEFT: Uses fewer computational resources
  Standard fine-tuning: Requires larger computational resources
Model parameters
  PEFT: Modifies only a small subset of model parameters
  Standard fine-tuning: Re-trains the entire model
Overfitting
  PEFT: Less prone to overfitting, as the model is not excessively modified
  Standard fine-tuning: More prone to overfitting, as the model is extensively modified
Training performance
  PEFT: Often close to full fine-tuning, though it can fall short
  Standard fine-tuning: Typically the strongest when data and compute are ample
Use cases
  PEFT: Ideal for low-resource settings or where large amounts of training data are not available
  Standard fine-tuning: Ideal for high-resource settings with ample training data and computational resources

Parameter-efficient fine-tuning can be particularly useful in scenarios where computational resources are limited or where large pre-trained models are involved. In such cases, PEFT can provide a more efficient way of fine-tuning the model without sacrificing much performance.
However, it’s important to note that PEFT may sometimes fall short of full fine-tuning, especially in cases where the pre-trained model requires significant modification to perform well on the new task.

Benefits of PEFT
Here we discuss the benefits of PEFT relative to traditional fine-tuning.

1. Decreased computational and storage costs: PEFT fine-tunes only a small number of extra model parameters while freezing most parameters of the pre-trained LLM, thereby reducing computational and storage costs significantly.
2. Overcoming catastrophic forgetting: During full fine-tuning of LLMs, catastrophic forgetting can occur, where the model forgets knowledge it learned during pre-training. PEFT mitigates this issue by updating only a few parameters.
3. Better performance in low-data regimes: PEFT approaches have been shown to perform better than full fine-tuning in low-data regimes and to generalize better to out-of-domain scenarios.
4. Portability: PEFT methods produce tiny checkpoints of a few MBs, compared to the large checkpoints of full fine-tuning. This makes the trained weights easy to deploy and to use for multiple tasks without replacing the entire model.
5. Performance comparable to full fine-tuning: PEFT can achieve performance comparable to full fine-tuning with only a small number of trainable parameters.

PEFT: A better alternative to standard fine-tuning

A standard fine-tuning process adjusts the hidden representations (h) extracted by transformer models to enhance their performance in downstream tasks. These hidden representations refer to any features the transformer architecture extracts, such as the output of a transformer layer or a self-attention layer.

[Figure: Before fine-tuning. The input “[CLS] This is a total waste of money” passes through the embedding layer and transformer layers 1 to N, producing the hidden representations h. Source: LeewayHertz]

To illustrate, suppose we have an input sentence, “This is a total waste of money.” Before fine-tuning, the transformer model computes the hidden representations (h) of each token in the sentence.
After fine-tuning, the model’s parameters are updated, and the updated
parameters will generate a different set of hidden representations, denoted by h’. Thus, the hidden representations generated by the pre-trained and fine-tuned models will differ even for the same sentence.

[Figure: After fine-tuning. The same input passes through the embedding layer and transformer layers 1 to N, now producing h’, which feeds a classifier head. Source: LeewayHertz]

In essence, fine-tuning modifies the pre-trained language model’s hidden representations to make them more suitable for downstream tasks. However, fine-tuning all the parameters in the model is not necessary to achieve this goal. Fine-tuning only a small fraction of the parameters is often sufficient to change the hidden representations from h to h’.

Parameter-efficient fine-tuning techniques

The following are among the most widely used PEFT methods. Research into new methods is ongoing.

Adapter

Adapters are small submodules that can be added to pre-trained language models to modify their hidden representations during fine-tuning. By inserting adapters after the multi-head attention and feed-forward layers in the transformer architecture, we can update only the parameters in the adapters during fine-tuning while keeping the rest of the model parameters frozen.

Adopting adapters is straightforward. All that is required is to add adapters to each transformer layer and place a classifier layer on top of the pre-trained model. By updating only the parameters of the adapters and the classifier head, we can improve the performance of the pre-trained model on a particular task without updating the entire model. This saves time and computational resources while still producing strong results.

How does fine-tuning using an adapter work?
The adapter module comprises two feed-forward projection layers connected by a non-linear activation layer. There is also a skip connection that bypasses the feed-forward layers.

If we take the adapter placed right after the multi-head attention layer, the input to the adapter is the hidden representation h calculated by the multi-head attention layer. Here, h takes two different paths through the adapter: one is the skip connection, which leaves the input unchanged, and the other goes through the feed-forward layers.

[Figure: Adapter module. The hidden representation h from multi-headed attention passes through a feed-forward down-projection, a nonlinearity, and a feed-forward up-projection; a skip connection adds h back, giving $h' = h + \Delta h$. Only the adapters are updated. Source: LeewayHertz]

The first feed-forward layer projects h into a low-dimensional space, whose dimension is smaller than that of h. The result is then passed through a non-linear activation function, and the second feed-forward layer projects it back up to the dimensionality of h. The outputs of the two paths are summed to obtain the final output of the adapter module.

The skip connection preserves the original input h of the adapter, while the feed-forward path generates an incremental change, Δh, based on the original h. By adding Δh to the original h from the previous layer, the adapter modifies the hidden representation calculated by the pre-trained model, thereby changing the model’s output for a specific task.

LoRA
Low-Rank Adaptation (LoRA) of large language models is another approach to fine-tuning models for specific tasks or domains. Like an adapter, LoRA is a small trainable submodule that can be inserted into the transformer architecture. It freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. The LoRA authors report that the method can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times while performing on par with or better than full fine-tuning on various tasks. LoRA also makes task switching more efficient, lowers the hardware barrier to entry, and, unlike adapters, adds no inference latency, because the low-rank update can be merged into the frozen weights after training.

How does it work?

In the setup illustrated here, LoRA is inserted in parallel to modules in the pre-trained transformer model, specifically in parallel to the feed-forward layers (it can equally be applied to the attention projection matrices). A feed-forward block has two projection layers with a non-linear layer between them; each projection maps its input vector to an output vector of a different dimensionality using an affine transformation. A LoRA module is inserted next to each of the two projection layers.

[Figure: LoRA layers placed in parallel to the feed-forward up-project and feed-forward down-project layers, with the nonlinearity between them; each LoRA output is added to the output of its projection layer. Source: LeewayHertz]
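Both the bottleneck adapter from the previous section and the parallel low-rank path described here boil down to computing a small trainable correction Δh that is added to a frozen representation h. The following is a minimal PyTorch sketch of that idea, not the implementation used by any particular library. The class names `Adapter` and `LoRALinear`, the sizes (512, 2048), the bottleneck width, the rank `r`, the `alpha` scaling factor, and the zero initialization are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: h' = h + up(act(down(h)))."""

    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # feed-forward down-project
        self.act = nn.ReLU()                        # nonlinearity
        self.up = nn.Linear(bottleneck, d_model)    # feed-forward up-project
        # Assumption: zero-init the up-projection so the adapter starts as identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        delta_h = self.up(self.act(self.down(h)))
        return h + delta_h  # skip connection preserves the original h


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable rank-r path added in parallel."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # d_in -> r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # r -> d_out
        nn.init.zeros_(self.lora_B.weight)  # delta_h is zero before training
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)                                   # frozen path
        delta_h = self.lora_B(self.lora_A(x)) * self.scaling  # low-rank path
        return h + delta_h


torch.manual_seed(0)
d_model, d_ffw = 512, 2048  # hypothetical dimensions
x = torch.randn(4, d_model)

adapter = Adapter(d_model)
adapter_out = adapter(x)

up_project = nn.Linear(d_model, d_ffw)  # stands in for a pre-trained layer
lora_layer = LoRALinear(up_project, r=8)
lora_out = lora_layer(x)

lora_trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
lora_frozen = sum(p.numel() for p in lora_layer.parameters() if not p.requires_grad)
print(lora_trainable, lora_frozen)
```

With these illustrative sizes, the LoRA path adds 20,480 trainable parameters next to 1,050,624 frozen ones, roughly 2% of the layer. Because the up-projections start at zero in this sketch, both modules initially reproduce the frozen model's output exactly and only drift away from it as training updates Δh.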
Now, let us consider the feed-forward up-project layer and the LoRA module next to it. The original parameters of the feed-forward layer take the output of the previous layer, with dimension $d_{\text{model}}$, and project it into $d_{\text{FFW}}$. Here, FFW is the abbreviation for feed-forward. The LoRA module placed next to it consists of two feed-forward layers. LoRA’s first feed-forward layer takes the same input as the feed-forward up-project layer and projects it into an $r$-dimensional vector, where $r$ is far smaller than $d_{\text{model}}$. Then, the second feed-forward layer projects that vector into another vector with a dimensionality of $d_{\text{FFW}}$. Finally, the two vectors are added together to form the final representation.

[Figure: LoRA next to the feed-forward up-project layer. The input of size $d_{\text{model}}$ is projected to $d_{\text{FFW}}$ by the frozen layer, giving h, and in parallel to $r$ and then to $d_{\text{FFW}}$ by LoRA, giving Δh; the sum is $h' = h + \Delta h$. Source: LeewayHertz]

As discussed earlier, fine-tuning changes the hidden representation h calculated by the original transformer model. In this case, the hidden representation calculated by the feed-forward up-project layer of the original transformer is h, and the vector calculated by LoRA is the incremental change Δh used to modify it. The sum of the original representation and the incremental change is the updated hidden representation h’.

By inserting LoRA modules next to the feed-forward layers and a classifier head on top of the pre-trained model, the task-specific parameters for each task are kept to a minimum.

Prefix tuning

Prefix-tuning is a lightweight alternative to fine-tuning large pre-trained language models for natural language generation tasks. Fine-tuning requires updating and storing all the model parameters for each task, which can be very expensive given the large size of current models. Prefix-tuning keeps the language model parameters frozen and optimizes
a small continuous task-specific vector called the prefix. In prefix-tuning, the prefix is a set of free parameters that are trained while the language model itself stays frozen. The goal of prefix-tuning is to find a context that steers the language model toward generating text that solves a particular task.

[Figure: Prefix tuning. Trainable prefix vectors are prepended at the embedding layer and at each transformer layer (1 to N) for the input “[BOS] This is a total waste of money”. Source: LeewayHertz]

The prefix can be seen as a sequence of “virtual tokens” that subsequent tokens can attend to. By learning only 0.1% of the parameters, prefix-tuning obtains performance comparable to fine-tuning in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

As with all the previously mentioned PEFT techniques, the end goal of prefix tuning is to reach h’. Prefix tuning uses prefixes to modify the hidden representations extracted by the original pre-trained language model. When the incremental change Δh is added to the original hidden representation h, we get the modified representation h’. Only the prefixes are updated, while the rest of the layers are fixed.

Prompt tuning

Prompt tuning is another PEFT technique for adapting pre-trained language models to specific downstream tasks. Unlike the traditional “model tuning” approach, where all the pre-trained model parameters are tuned for each task, prompt tuning learns soft prompts through backpropagation from labeled examples of the task. Prompt tuning outperforms the few-shot learning of GPT-3 and becomes more competitive with model tuning as the model size increases. It also improves robustness to domain transfer and enables efficient prompt ensembling.
It requires storing only a small task-specific prompt for each task, making it easy to reuse a single frozen model for multiple downstream tasks. Model tuning, by contrast, requires a task-specific copy of the entire pre-trained model for each task.
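A soft prompt of this kind is simply a small matrix of trainable vectors that is prepended to the frozen token embeddings. The following PyTorch sketch shows the idea under stated assumptions: the `SoftPrompt` class name, the vocabulary size, the embedding width, and the number of virtual tokens are hypothetical, and a real setup would pass the resulting embeddings on to frozen transformer layers.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepends trainable prompt vectors to frozen token embeddings."""

    def __init__(self, embedding: nn.Embedding, num_virtual_tokens: int = 20):
        super().__init__()
        self.embedding = embedding
        self.embedding.weight.requires_grad = False  # pre-trained embeddings stay frozen
        # The only trainable parameters: one vector per virtual token.
        self.prompt = nn.Parameter(
            torch.randn(num_virtual_tokens, embedding.embedding_dim) * 0.02
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tokens = self.embedding(input_ids)  # (batch, seq_len, dim)
        prompt = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([prompt, tokens], dim=1)  # (batch, virtual + seq_len, dim)


torch.manual_seed(0)
embedding = nn.Embedding(1000, 64)  # stands in for a pre-trained embedding layer
soft_prompt = SoftPrompt(embedding, num_virtual_tokens=20)

input_ids = torch.randint(0, 1000, (2, 7))  # batch of 2 sequences, 7 tokens each
embeds = soft_prompt(input_ids)
trainable_names = [n for n, p in soft_prompt.named_parameters() if p.requires_grad]
print(embeds.shape, trainable_names)
```

Storing a task then means storing only the `prompt` tensor (20 × 64 values in this sketch), which is why a single frozen model can serve many tasks by swapping prompts.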
How does it work?

Prompt tuning is a simpler variant of prefix tuning in which trainable vectors are prepended to the sequence only at the input layer. When presented with an input sentence, the embedding layer converts each token into its corresponding word embedding, and the prefix embeddings (the soft prompt) are prepended to the sequence of token embeddings. Next, the pre-trained transformer layers process the embedding sequence as they would any normal sequence. Only the prefix embeddings are adjusted during fine-tuning, while the rest of the transformer model is kept frozen.

[Figure: Prompt tuning. A trainable prefix embedding is prepended to the embedded input sequence “[BOS] ...” before it passes through transformer layers 1 to N. Source: LeewayHertz]

This technique has several advantages over traditional fine-tuning, including improved efficiency and reduced computational overhead. Additionally, because only the prefix embeddings are fine-tuned, there is a lower risk of overfitting to the training data, which tends to produce more robust and generalizable models.

P-tuning

P-tuning can improve the performance of language models such as GPTs on Natural Language Understanding (NLU) tasks. GPT-style models fine-tuned in the traditional way have tended to underperform on NLU, and P-tuning uses trainable continuous prompt embeddings to close that gap. The method has been tested on two NLU benchmarks, LAMA and SuperGLUE, and has shown significant improvements in precision and world knowledge recovery. P-tuning also reduces the need for prompt engineering and outperformed the state-of-the-art approaches of its time on the few-shot SuperGLUE benchmark.

P-tuning can be used to improve pre-trained language models on various tasks, including sentence classification and predicting a country’s capital. The technique replaces part of the input embeddings of the pre-trained language model with differentiable continuous embeddings generated from a prompt.
The continuous prompts can be optimized using a downstream loss function and a prompt encoder, which helps solve discreteness and association challenges.

Training your model using PEFT
In our example, we will use a LoRA adapter with a pre-trained sequence-to-sequence language model to generate labels for a specific task, in this case, the Twitter complaints dataset. The checkpoint loaded below has already been fine-tuned with LoRA, so the walkthrough focuses on loading the model, preparing the data, and running it.

Import the dependencies and define the variables

First, let us import all the necessary libraries, modules and other dependencies, like AutoModelForSeq2SeqLM, PeftModel, torch, datasets and AutoTokenizer, among others:

    import os

    import torch
    from datasets import load_dataset
    from peft import PeftConfig, PeftModel
    from torch.utils.data import DataLoader
    from tqdm import tqdm
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        default_data_collator,
        get_linear_schedule_with_warmup,
    )

Next, we need to define the name of the dataset, the text column name, the label column name, and the batch size.

    dataset_name = "twitter_complaints"
    text_column = "Tweet text"
    label_column = "text_label"
    batch_size = 8

Now, run the following commands to select the pre-trained PEFT model and load its configuration.

    peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
    config = PeftConfig.from_pretrained(peft_model_id)

In the code above, the ‘peft_model_id’ variable contains the ID of the pre-trained PEFT model and the ‘config’ variable holds that model’s configuration.
Now, set the maximum memory allowed for each device. Here, GPU 0 may use up to 6 GiB, GPUs 1 to 4 are given no memory, and the CPU may use up to 30 GB.

    max_memory = {0: "6GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu": "30GB"}

Load the base model of the pre-trained PEFT model specified by peft_model_id.

    model = AutoModelForSeq2SeqLM.from_pretrained(
        config.base_model_name_or_path, device_map="auto", max_memory=max_memory
    )

In the command above, the ‘AutoModelForSeq2SeqLM’ class identifies the type of base model, and its ‘from_pretrained’ method loads the pre-trained weights. The ‘device_map’ argument specifies how model components are mapped to devices, and the ‘max_memory’ argument specifies the maximum memory allowed for each device.

Next, load the full PEFT model specified by ‘peft_model_id’ using the following command:

    model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)

Preprocess the data

The preprocessing code below expects a ‘dataset’ object, which the original walkthrough never defines. It has to be loaded first:

    # Assumption: in the Hugging Face PEFT example this walkthrough follows,
    # "twitter_complaints" is a subset of the "ought/raft" dataset.
    dataset = load_dataset("ought/raft", dataset_name)

Map the dataset labels to human-readable class names:

The first step in preprocessing the data is to map the dataset labels to human-readable class names. For this, replace all the underscores with spaces in the label names of the training set.

    classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
    print(classes)

Then, run the following code to add a column holding the human-readable class name for each example.

    dataset = dataset.map(
        lambda x: {"text_label": [classes[label] for label in x["Label"]]},
        batched=True,
        num_proc=1,
    )
    print(dataset)
    dataset["train"][0]

Tokenization:
First, we need to load the pre-trained tokenizer that matches the base model from the transformers library. We also need to find the maximum length of the target labels by tokenizing each class label and taking the length of the longest resulting list of token IDs. This is used later to pad all labels to a consistent length. For this, run the following:

    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])

Run the following code to extract the text and target labels from the input examples, tokenize the text using the pre-trained tokenizer, and pad the labels to a consistent length.

    def preprocess_function(examples):
        inputs = examples[text_column]
        targets = examples[label_column]
        model_inputs = tokenizer(inputs, truncation=True)
        labels = tokenizer(
            targets,
            max_length=target_max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        labels = labels["input_ids"]
        # Replace padding token IDs with -100 so the loss ignores them.
        labels[labels == tokenizer.pad_token_id] = -100
        model_inputs["labels"] = labels
        return model_inputs

Apply the preprocessing function to the whole dataset to prepare it for the model.

    processed_datasets = dataset.map(
        preprocess_function,
        batched=True,
        num_proc=1,
        remove_columns=dataset["train"].column_names,
        load_from_cache_file=True,
        desc="Running tokenizer on dataset",
    )

Now, split the preprocessed dataset into separate training, evaluation, and test sets.

    train_dataset = processed_datasets["train"]
    # Assumes the dataset defines an "eval" split; a dataset that ships only
    # train and test splits would need one carved out of the training data.
    eval_dataset = processed_datasets["eval"]
    test_dataset = processed_datasets["test"]

Define a collate function:

Next, we need to define a collate function that gathers the preprocessed examples into batches and pads each batch to its longest sequence.

    def collate_fn(examples):
        return tokenizer.pad(examples, padding="longest", return_tensors="pt")

Next, define the data loaders for the training, evaluation, and test datasets.

    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True
    )
    eval_dataloader = DataLoader(eval_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True)
    test_dataloader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=batch_size, pin_memory=True)

Model training and evaluation

To train the model on the preprocessed dataset, first define the training specifications, such as the number of epochs, the optimizer, and the learning-rate schedule. Once the model is trained (or, as here, a fine-tuned checkpoint has been loaded), evaluate it on its intended purpose. The following runs the model on a single test example:

    model.eval()
    i = 15
    inputs = tokenizer(f'{text_column} : {dataset["test"][i]["Tweet text"]} Label : ', return_tensors="pt")
    print(inputs)
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        print(outputs)
        print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

Assessing the performance of a fine-tuned machine learning model is an essential step. One common way to evaluate a model’s performance is by checking its accuracy on an evaluation dataset. You can refer to this GitHub repository to view the entire evaluation process, including the code for calculating these metrics.

Few-shot In-context Learning (ICL) vs. Parameter-efficient Fine-tuning (PEFT)

Few-shot in-context learning and parameter-efficient fine-tuning are two approaches for getting natural language processing models to perform new tasks. Although both enable pre-trained language models to perform new tasks without extensive training, the methods are technically different. The first approach, ICL, allows the model to perform a new task by feeding it prompted examples, without any gradient-based training. However, ICL incurs significant computational, memory, and storage costs. The second approach, PEFT, trains a small number of added or selected parameters to enable a model to perform a new task with minimal updates.

ICL aims to improve the few-shot performance of pre-trained language models by supplying contextual information at inference time rather than by fine-tuning. A handful of labeled examples, optionally with additional sentences or paragraphs describing the task, are placed in the input ahead of the query. The model uses this context to generalize to the new task even though its weights never change.
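To make the contrast concrete, here is a minimal sketch of how a few-shot ICL prompt is assembled. Nothing is trained: the labeled demonstrations are simply concatenated ahead of the query. The helper name `build_few_shot_prompt` and the example tweets and labels are hypothetical; the “Tweet text : ... Label :” template mirrors the inference prompt used in the walkthrough above.

```python
def build_few_shot_prompt(demonstrations, query, text_field="Tweet text"):
    """Concatenate labeled demonstrations and an unlabeled query into one prompt."""
    blocks = [
        f"{text_field} : {text} Label : {label}" for text, label in demonstrations
    ]
    # The query uses the same template but leaves the label for the model to fill in.
    blocks.append(f"{text_field} : {query} Label :")
    return "\n".join(blocks)


# Hypothetical demonstrations, for illustration only.
demos = [
    ("My order arrived broken and nobody answers my emails.", "complaint"),
    ("Thanks for the quick delivery!", "no complaint"),
]
prompt = build_few_shot_prompt(demos, "Still waiting on my refund after three weeks.")
print(prompt)
```

Every prediction has to reprocess all of the demonstrations in the prompt, so inference cost grows with the number of examples. That recurring cost is what PEFT avoids by moving the task knowledge into a small set of trained parameters.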
On the other hand, parameter-efficient fine-tuning aims to make fine-tuning pre-trained language models on downstream tasks more efficient by freezing most of the model’s parameters and training only a small number of selected or added ones. The pre-trained model is fine-tuned on a small amount of data while the frozen parameters help prevent overfitting. Because most parameters are left untouched, the model retains more of its pre-trained knowledge, improving its performance on downstream tasks with limited training data.

Is PEFT more efficient than ICL?

Few-shot learning is an important setting for natural language processing applications, where models must quickly adapt to new tasks with limited training examples. In recent years, various approaches have been put forward to tackle
this challenge, with ICL being one of the most popular techniques. However, a research paper published in 2022 introduces a parameter-efficient few-shot fine-tuning recipe (called T-Few by its authors) that outperforms ICL in accuracy while requiring significantly fewer computational resources.

One of the main reasons PEFT outperforms ICL in that paper is its use of a novel method called (IA)^3, which rescales inner activations with learned vectors. In the paper’s experiments, this technique performed better than fine-tuning the full model while introducing only a few additional parameters. In contrast, ICL performs no weight updates at all: it relies on demonstrations packed into the prompt, which must be reprocessed for every prediction and which limits how much the model can adapt to the task.

Another reason is the use of two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices. These loss terms help the model generalize better to new tasks and avoid overfitting.

In addition to its superior performance, parameter-efficient fine-tuning is also more computationally efficient than ICL. The research paper found that its recipe uses over 1,000x fewer floating-point operations (FLOPs) during inference than few-shot ICL with GPT-3 and requires only 30 minutes to train on a single NVIDIA A100 GPU. This makes PEFT a more practical and scalable solution for real-world NLP applications.

Overall, this work represents a significant advancement in few-shot learning for NLP applications. Its use of (IA)^3 scaling, additional loss terms, and superior computational efficiency make it a better alternative to ICL for tasks that require rapid adaptation from few examples.

The process of parameter-efficient fine-tuning

The steps involved in parameter-efficient fine-tuning can vary depending on the specific implementation and the pre-trained model being used.
However, here is a general outline of the steps involved in PEFT:

Pre-training: Initially, a large-scale model is pre-trained on a large dataset using a general task such as image classification or language modeling. This pre-training phase helps the model learn meaningful representations and features from the data.

Task-specific dataset: Gather or create a dataset that is specific to the target task you want to fine-tune the pre-trained model for. This dataset should be labeled and representative of the target task.

Parameter identification: Identify or estimate the importance or relevance of parameters in the pre-trained model for the target task. This step helps in determining which parameters should be prioritized during fine-tuning. Various techniques, such as importance estimation, sensitivity analysis, or gradient-based methods, can be used to identify important parameters.

Subset selection: Select a subset of the pre-trained model’s parameters based on their importance or relevance to the target task. The subset can be determined by setting certain criteria, such as a threshold on the importance scores or selecting the top-k most important parameters.

Fine-tuning: Initialize the selected subset of parameters with the values from the pre-trained model and freeze the remaining parameters. Fine-tune the selected parameters using the task-specific dataset. This involves training the model on the target task data, typically using optimizers such as Stochastic Gradient Descent (SGD) or Adam.

Evaluation: Evaluate the performance of the fine-tuned model on a validation set or through other evaluation metrics relevant to the target task. This step helps assess the effectiveness of PEFT in achieving the desired performance while using fewer parameters.

Iterative refinement (optional): Depending on the performance and requirements, you may choose to iterate and refine the PEFT process by adjusting the criteria for parameter selection, exploring different subsets, or fine-tuning for additional epochs to optimize the model’s performance further.

Note that the specific implementation details and techniques used in PEFT can vary across research papers as well as applications.

Endnote

PEFT, or Parameter-efficient Fine-tuning, is a natural language processing technique used to improve the performance of pre-trained language models on specific downstream tasks. It involves freezing most of the pre-trained model and fine-tuning only a small set of parameters, for example the last few layers that are specific to the downstream task. This technique offers several advantages over traditional fine-tuning, such as lower computational and storage costs, less catastrophic forgetting, and performance comparable to full fine-tuning with only a small number of trainable parameters.
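
To make the freeze-and-fine-tune idea concrete, here is a minimal, framework-free sketch in Python. It is an illustration under stated assumptions, not the method of any particular paper or library: the "pre-trained" layer W1 is just random weights standing in for pre-trained values, the dataset and target function are synthetic, and all names (W1, w2, b2, features, predict) are hypothetical. Only the small task-specific head is updated, and the frozen layer is checked to be unchanged afterwards.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

IN_DIM, HIDDEN = 4, 8

# Stand-in for a pre-trained layer: random weights play the role of
# pre-trained values here (an assumption made purely for illustration).
W1 = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(HIDDEN)]

# Task-specific head: the only parameters that will be trained.
w2 = [0.0] * HIDDEN
b2 = 0.0


def features(x):
    """Frozen layer: tanh(W1 @ x). W1 is read but never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def predict(x):
    """Full model: frozen features followed by the trainable linear head."""
    return sum(w * h for w, h in zip(w2, features(x))) + b2


def mse(dataset):
    """Mean squared error of the model over (input, target) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in dataset) / len(dataset)


# Small synthetic task-specific dataset (hypothetical target function).
data = []
for _ in range(32):
    x = [random.uniform(-1, 1) for _ in range(IN_DIM)]
    data.append((x, x[0] - 0.5 * x[1]))

W1_before = [row[:] for row in W1]  # snapshot to confirm the layer stays frozen
loss_before = mse(data)

lr = 0.05
for _ in range(200):
    grad_w = [0.0] * HIDDEN
    grad_b = 0.0
    for x, y in data:
        h = features(x)
        err = sum(w * hj for w, hj in zip(w2, h)) + b2 - y
        for j in range(HIDDEN):
            grad_w[j] += 2 * err * h[j] / len(data)
        grad_b += 2 * err / len(data)
    # Gradient step on the head only; W1 receives no update at all.
    w2 = [w - lr * g for w, g in zip(w2, grad_w)]
    b2 -= lr * grad_b

loss_after = mse(data)

trainable = len(w2) + 1              # head weights + bias
total = trainable + HIDDEN * IN_DIM  # plus the frozen layer
print(f"trainable parameters: {trainable} of {total}")
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
print("frozen layer unchanged:", W1 == W1_before)
```

In this toy setup only 9 of 41 parameters are trained. In a deep-learning framework the same effect is usually achieved by marking the frozen parameters as not requiring gradients (for example, setting requires_grad to False in PyTorch) and passing only the remaining parameters to the optimizer.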
Overall, PEFT is a promising approach to improving the efficiency and effectiveness of NLP models in various applications. Ready to optimize your pre-trained models with PEFT? Look no further than LeewayHertz. Contact us today to boost your machine learning model’s capabilities.