How to train an open-source foundation model into a domain-specific LLM?
In the fast-paced world of corporate environments, technological innovations have always been a catalyst, driving business growth with remarkable velocity. One such technological advancement that has lately become a talking point and brought about a paradigm shift is generative AI. Today, every industry is being profoundly influenced by generative AI, which has heralded the dawn of a new era defined by the widespread use of Large Language Models (LLMs). Corporate entities across diverse sectors are increasingly recognizing the immense potential of customized large language models to enhance their business operations in unprecedented ways. Businesses are progressively pivoting away from generic, universal models due to growing needs for enhanced control, greater data privacy, and cost-efficient methodologies. As organizations increasingly leverage the power of customization, they open the doors to a world where intelligent systems stand poised to revolutionize everything from innovation and efficiency to the very core of their operational processes.
While general-purpose LLMs are trained on large and diverse datasets, domain-specific LLMs are designed for a particular domain and trained on curated datasets that cater to an organization’s specific requirements. General-purpose LLMs can be useful for many businesses, but domain-specific LLMs are better suited to situations that demand accurate, contextually appropriate outputs within a particular field. These highly targeted models can match or even outperform much larger general-purpose models such as GPT-3.5, the model underlying ChatGPT, on in-domain tasks, while potentially being more efficient.
This significant change is further propelled by the availability of numerous open-source foundation models, making it feasible for businesses to craft their personalized LLMs to meet their unique demands and achieve optimal results.
Domain-specific LLMs are witnessing widespread adoption as they can adapt to specific requirements, utilize available resources more efficiently, and deliver powerful language processing capabilities that address specific challenges and tasks. They are poised to bring about a new era of advanced language processing solutions with enhanced adaptability, resourcefulness, and potency.
This article provides detailed insights into domain-specific LLMs, covering aspects like their benefits, challenges involved in their development and how to create an industry-specific LLM, leveraging the potential of an open-source foundation model. All this will help you understand the intricacies of this fascinating progression in AI.
- What are foundation models?
- What are LLMs?
- What are domain-specific LLMs?
- What are the benefits of using domain-specific LLMs?
- Challenges of building domain-specific custom LLMs
- A detailed case study of BloombergGPT, a finance-specific LLM
- How to train an open-source foundation model into a domain-specific LLM?
What are foundation models?
The term “Foundation Models” was put forward by Stanford researchers to describe a new breed of machine learning models. Rather than being designed for a specific task such as image recognition, these models are trained on extensive, diverse datasets using self-supervised learning at scale, allowing them to be fine-tuned for various downstream tasks.
Contrary to what the name might suggest, Foundation Models (FMs) are not the bedrock of AI, nor are they suggestive of AGI (Artificial General Intelligence). These models are unique and remarkable in their own right, characterized by five main traits:
- Pre-training: FMs are pre-trained with vast data and substantial computing power, making them ready for use without further training.
- Generalization: Unlike traditional AI models, which are task-specific, FMs are versatile, and designed to tackle numerous tasks.
- Adaptability: FMs are adjustable through prompting, meaning they respond to user-defined inputs like text.
- Large-scale: FMs are large in both model size and data size. For instance, GPT-3 boasts 175 billion parameters and was trained on approximately 500 billion words.
- Self-supervision: FMs learn from data patterns without explicit labels.
Prominent examples of FMs include GPT-3 and DALL-E-2. These models enable everyday users and non-developers to carry out remarkable tasks by providing simple “prompts.”
Once trained, FMs can manage a plethora of data types and tasks, making them highly adaptable. They can automate complex workflows when paired with an appropriate operational chain. Beyond generating content such as text, images, audio, and video, FMs can also be employed for prediction and classification tasks, a concept known as discriminative modeling.
FMs have initiated a transformative era in knowledge work, fostering improvements in language-related tasks, reasoning, search, and computer vision, especially when combined with other machine learning approaches like Reinforcement Learning from Human Feedback (RLHF). The development of multi-modal FMs, such as CLIP and ViLBERT, is underway, and these models could significantly enhance robotics. By using approaches such as few-shot learning and domain specialization, FMs can address a multitude of challenges.
Despite their remarkable potential, FMs do have their limitations. These include fabricating answers (known as hallucination), issues with temporal shifting or continual adaptation, the requirement of large data sets and considerable computing resources for training, and the need for human evaluation for fine-tuning. Other hurdles include the tension between specialization and diversity in FM training data, uncertainty about optimal adaptation methods, and limited understanding of the model’s inner workings: what a model can do, why it produces certain behaviors, and how it accomplishes this.
Leverage LeewayHertz’s expertise in domain-specific LLMs!
We train your preferred foundation model on your proprietary data to create a domain-specific LLM that aligns perfectly with your business needs.
What are LLMs?
Large Language Models (LLMs) represent a subclass of Foundation Models designed to comprehend and generate text. By ingesting immense amounts of textual data and boasting billions of parameters, these models can perform a variety of tasks based on a user-provided prompt. Functions we encounter daily, such as autocomplete in search engines or Smart Compose in email platforms, are real-world applications of these language models. They excel at predicting the most likely subsequent word or phrase based on the provided text.
To understand LLMs, we should recognize that they are built upon advanced neural networks known as transformers. These transformers extract patterns from vast amounts of text data, ranging from strings to numbers to code. It is worth noting that the creation and utilization of LLMs involve substantial hurdles: massive datasets, considerable computational power, and specialized skills are prerequisites for training, fine-tuning, and extracting insights from these models. This explains why only a few models, like BERT and GPT-3, are readily accessible and why many models remain closed-source.
LLMs should not be conflated with Artificial General Intelligence (AGI) despite their potential. Current LLMs don’t inherently comprehend concepts and abstractions. They have limitations but are still the most effective tools in Natural Language Processing (NLP) for a broad range of tasks. Their potential applications span from coding to assisting with task completion to composing coherent and engaging narratives.
The output of an LLM typically undergoes rigorous filtering before a response is generated. This process is integral to ensuring the generated content is factual and safe, which is crucial considering the risks associated with these models. As businesses aim to leverage LLMs for specific products or systems, a high degree of oversight and governance becomes essential.
The historical journey of LLMs can be traced back to the early 2010s. Initially, transfer learning with annotated or labeled data was a prevalent practice. However, this approach was costly, challenging to implement consistently, and posed privacy concerns. The advent of transformer-based architectures in 2017 marked a significant stride, paving the way for models like GPT-3 and ChatGPT, which debuted later and quickly gained popularity.
In the post-ChatGPT world, LLMs continually show promise for various applications. They can explain complex subjects in simple terms, translate languages, generate creative content, brainstorm, assist with online tasks, provide customer service, and even perform technical functions. However, they are not without flaws. The reliability of answers and the risk of fabricated responses, often referred to as “model hallucination,” highlight the need for further refinement of these models.
Thus, while LLMs have transformative potential and are reshaping the complex AI landscape, they require careful management and continual refinement to ensure they show their full potential without compromising on safety and accuracy.
LLM Stack — Simple View
What are domain-specific LLMs?
A domain-specific language model constitutes a specialized subset of large language models dedicated to producing highly accurate results within a particular domain. Unlike a generalized LLM, which aims to process a diverse range of topics with satisfactory proficiency, a domain-specific language model focuses on optimizing its performance within a predefined sector, which could range from technical domains such as law or medicine to more casual areas like culinary arts.
The methodology of training a domain-specific language model hinges on acquiring a substantial volume of domain-specific data. This data serves as the training set, ensuring the model is immersed in the contextual intricacies and specific knowledge pertinent to the chosen domain.
There are two primary strategies to undertake domain-specific pre-training. The first strategy involves initializing a pre-trained LLM and fine-tuning it with the domain-specific data to adapt its capabilities to the new context. This method typically allows faster convergence and often superior performance. The alternative strategy is to construct a new LLM from scratch using the domain-specific data. The optimal strategy may vary depending on the application and the domain in question.
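To make the first strategy concrete, here is a minimal sketch of domain-adaptive fine-tuning using the Hugging Face Trainer API: a small open-source causal language model is further trained on a plain-text file of in-domain documents. The model name, file path, and hyperparameters are illustrative placeholders, not a prescription.

# Minimal sketch: continue pre-training an open-source causal LM on domain text
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # any open-source causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the domain corpus ("domain_corpus.txt" is a placeholder) and tokenize it
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling objective: the collator creates shifted labels (mlm=False)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="domain-llm",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=50,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()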
However, this specialization entails a fundamental trade-off. A domain-specific language model, such as a model trained exclusively on culinary recipes, can comprehend the subtleties of its specialized domain more proficiently than a generalized model. Nevertheless, its performance deteriorates significantly when faced with tasks outside its trained domain. On the other hand, a generic LLM possesses broader knowledge, albeit without the fine-grained expertise of a domain-specific language model.
This difference is similar to what we see in real-life expertise. It is exceedingly rare to encounter a professional chef who can also provide proficient stock market analysis or a mathematician who doubles as an accomplished artist. Correspondingly, developing an LLM that can seamlessly transition between domains while maintaining a high degree of proficiency is a challenge. The trade-off between breadth and depth of knowledge is intrinsic to the design of language models, necessitating judicious consideration in the development process.
Here is a table comparing general LLMs and domain-specific LLMs:
| Criteria | General LLMs (e.g., GPT-3) | Domain-specific LLMs (e.g., FoodUDT-1B) |
| --- | --- | --- |
| Purpose | Developed to understand and generate text across a broad range of topics and contexts. | Specifically designed to understand and generate text in a particular niche or domain. |
| Training Data | Trained on diverse internet text, encompassing a wide range of topics. | Trained on domain-specific data (in this case, 500K recipes). |
| Knowledge Depth | Has a general understanding of a wide range of topics, including domain-specific topics at a high level. | Has a deep understanding of a specific domain. |
| Representation of Context | Represents text and context based on a broad understanding of language and world knowledge. For example, “apple” is closer to “iPhone” than to “apple pie.” | Represents text and context based on deep, specialized knowledge. For example, “apple” is closer to “apple pie” than to “iPhone.” |
| Word Associations | Forms associations based on generic context. For example, “meat,” “sugar,” and “dessert” are equally related. | Forms associations based on domain-specific context. For example, “dessert” and “sugar” are more related than “meat” and “sugar.” |
| Advantages | Versatile and can handle a wide range of tasks and queries. | Highly accurate and efficient in its specific domain due to targeted training. |
| Limitations | May not possess in-depth understanding or generate accurate outputs in specialized domains. | Limited to its specific domain and may not generate accurate outputs outside that domain. |
| Use Cases | General conversational AI, summarization, translation, and other generic tasks. | Highly specialized tasks within the domain (in this case, recipe search and suggestions). |
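To make the “Representation of Context” comparison tangible, the short sketch below probes how a general-purpose encoder places “apple,” “apple pie,” and “iPhone” in embedding space by comparing cosine similarities of mean-pooled BERT embeddings. This is an illustrative probe only; the domain-specific FoodUDT-1B model is not reproduced here, and the exact numbers depend on the model and pooling choices.

# Probe word/phrase associations of a general-purpose encoder (illustrative only)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool the last hidden state into a single vector
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

apple, pie, iphone = embed("apple"), embed("apple pie"), embed("iPhone")
cos = torch.nn.functional.cosine_similarity
print("apple vs apple pie:", cos(apple, pie, dim=0).item())
print("apple vs iPhone:   ", cos(apple, iphone, dim=0).item())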
It is important to remember that while domain-specific LLMs can provide better performance within their domain, general LLMs are more versatile and can handle a wider variety of tasks. The choice between the two depends on the specific needs and objectives of the application.
What are the benefits of using domain-specific LLMs?
In the rapidly evolving field of artificial intelligence, domain-specific language models have emerged as a critical tool for enhancing task proficiency and model efficiency. Tailored to excel in specific fields, these models leverage the extensive knowledge base of large language models and fine-tune it to provide superior generalization capabilities and augmented interpretability in specialized domains.
Here are some benefits:
Enhanced task proficiency
Large language models, ingrained with extensive textual data, demonstrate a comprehensive understanding of language and an abundance of general knowledge, serving as an excellent springboard for domain-specific fine-tuning. While an LLM is suitable for generalized linguistic tasks, it may not perform optimally in specialized domains such as healthcare. Leveraging an LLM as a foundational model and fine-tuning it for specific domains can considerably improve task-related performance.
Amplified efficiency
Domain-specific models are highly tailored to execute a particular set of tasks and concentrate solely on the information relevant to those tasks. This specificity reduces computational overhead and processing requirements, thereby enhancing model efficiency and expediting task completion.
Superior generalization capabilities
Training models to comprehend the particular language and domain-specific knowledge allows them to generalize more proficiently to novel instances within the domain. This becomes a crucial asset when dealing with limited data sets or when the domain incorporates its own unique terminology and concepts. This ability is especially advantageous in rapidly evolving fields such as precision medicine in healthcare.
Augmented interpretability
The interpretability of domain-specific models is notably enhanced due to their task-specific focus and domain-specific training data. These models yield more relevant outcomes, regardless of the task, be it chat, text search, prediction, or information extraction, thus making them more comprehensible and meaningful.
Focused expertise
Domain-specific LLMs bring a focused expertise that goes beyond general linguistic understanding. Their training is tuned to the nuances and intricacies of a particular field, enabling them to provide specialized insights and solutions.
Reduced noise in outputs
Reducing noise in outputs refers to the process of refining the model to sift through irrelevant or extraneous information, ensuring that responses are streamlined and directly relevant to the user’s query. By tailoring the model to a specific domain, the outputs become more precise and focused, enhancing the signal-to-noise ratio in the model’s responses. This noise reduction ensures that the valuable information the model provides stands out more prominently amidst any potential distractions or inaccuracies, resulting in clearer, more useful responses for the user.
Alignment with industry standards
These models can be trained to align with industry standards, regulations, and best practices. This ensures that the generated content complies with specific requirements, making them more suitable for industries with stringent guidelines.
Mitigation of bias and ethical concerns
Fine-tuning a language model for a specific domain allows for careful consideration and mitigation of biases related to that domain. This targeted approach supports ethical AI practices and helps in building more responsible models.
Improved adaptation to dynamic environments
Domains that undergo rapid changes or frequent updates benefit from models that are skilled at adapting to dynamic environments. Domain-specific LLMs can quickly assimilate new information and adapt their understanding to stay relevant in fast-paced industries.
Challenges of building domain-specific custom LLMs
Creating custom large language models introduces various challenges for organizations, broadly segmented into data-, technical-, ethical-, and resource-related issues.
Data challenges
Organizations striving to establish custom LLMs grapple with data-associated challenges, including data acquisition, quality, and privacy. The procurement of substantial, niche-specific data could be demanding, especially when dealing with specialized or confidential data. Ensuring data integrity during collection is paramount, and so is addressing data privacy and protection, which requires deploying effective measures to anonymize data and safeguard it throughout training and deployment phases.
Technical challenges
Technical hurdles in the development of custom LLMs revolve around aspects such as model architecture, training, assessment, and validation. Selection of suitable architecture and parameters necessitates specialized knowledge, while training custom LLMs requires advanced competencies in machine learning. Evaluation becomes complex owing to the lack of established benchmarks for niche-specific tasks, and accuracy, safety, and compliance validation of model responses present additional challenges.
Ethical challenges
It is essential to confront ethical issues related to bias, fairness, content moderation, and safety while creating custom LLMs. LLMs may inadvertently incorporate and perpetuate biases from training data, making meticulous auditing and mitigation strategies a necessity. Robust content moderation mechanisms must be in place to prevent potentially inappropriate or harmful content generated by custom LLMs.
Resource challenges
The development of custom LLMs poses resource-related issues, predominantly in terms of computational resources and specialized skills. Training LLMs necessitates substantial computational resources, which could be expensive and inaccessible for some organizations. It also requires a team proficient in machine learning, natural language processing, and software engineering, which could be challenging to recruit and retain, thereby increasing the complexity and cost of the project.
Despite the inherent challenges, they are not unconquerable. Organizations can successfully develop and deploy custom LLMs with strategic planning, apt resources, and expert guidance to meet their unique requirements. As the market begins to witness the advent of commercially viable open-source foundation models, the trend of building domain-specific LLMs using these models is set to intensify.
In the following sections, we describe the widely discussed BloombergGPT model. By providing an overview of BloombergGPT, we help you understand the process, techniques, and strategies to be employed while developing your own domain-specific LLM using a foundation model.
A detailed case study of BloombergGPT, a finance-specific LLM
Bloomberg has unveiled BloombergGPT, a state-of-the-art large language model, primed with extensive financial data for superior performance in financial sector-specific NLP tasks. This powerful AI tool expedites the evaluation of financial data, assists in risk assessments, measures financial sentiment, and could potentially automate accounting and auditing functions.
BloombergGPT is a highly specialized language model developed for the financial domain. It’s a sophisticated construct, equipped with 50 billion parameters. The model’s training utilized a considerable volume of data – 363 billion tokens sourced from Bloomberg’s unique databases and an additional 345 billion tokens derived from generalized datasets. This combination of financial and general-purpose data has resulted in a model that performs exceptionally well in its specialized domain. BloombergGPT has demonstrated substantial improvements over existing models when tackling financial tasks. Notably, despite its specialization, the model sustains commendable performance levels on standard large language model benchmarks, asserting its versatile applicability.
According to Bloomberg, the financial industry’s intricate nature necessitates an AI specifically trained on financial data. In this context, BloombergGPT has been designed to interact with the Bloomberg Terminal, a software platform that provides real-time market data, breaking news, in-depth financial research, and sophisticated analytics to financial professionals.
BloombergGPT is a pioneering stride towards leveraging this novel technology for the financial sector. It is projected to enhance Bloomberg’s existing NLP functionalities, including sentiment analysis, named entity recognition, news classification, and question-answering. Moreover, the model’s capability to structure the vast data accessible on the Bloomberg Terminal can significantly improve client services and unlock AI’s immense potential within the financial industry.
Bloomberg’s research notes that although the wide-ranging capabilities of general models might seem to eliminate the need for specialized training, results from existing domain-specific models suggest that general models cannot fully replace them. Despite the financial domain being Bloomberg’s primary focus, the company supports diverse tasks that benefit from a general model. Hence, Bloomberg aimed to construct a model like BloombergGPT, which performs commendably on general-purpose LLM benchmarks, while also excelling in financial tasks.
BloombergGPT stands as a potent Large Language Model (LLM), specifically optimized for financial Natural Language Processing (NLP) tasks. This optimization is achieved by integrating both domain-specific and general-purpose data during its training phase.
Bloomberg Query Language (BQL) functions as a sophisticated query language, enabling access and analysis of financial data within Bloomberg’s platform. BQL, although a powerful tool, presents complexities and can be applied to diverse tasks including data search, data analysis, report generation, and insight derivation. BloombergGPT streamlines this process by converting natural language queries into valid BQL, thereby facilitating more intuitive interactions with financial data.
Additionally, BloombergGPT holds potential applications for news settings, serving as a tool for suggesting news headlines. This functionality proves valuable for news applications and assists journalists in curating newsletters. The model accepts paragraphs as inputs and proposes an appropriate title in return. This capability enhances the ease and speed of news article creation, contributing further to BloombergGPT’s utility in the areas of finance and news.
BloombergGPT has demonstrated a competitive performance compared to GPT-3 and other LLMs on general tasks, outpacing them in several finance-specific tasks. This introduction of BloombergGPT represents a significant breakthrough in AI-driven financial analysis, amplifying the application scope of NLP within the financial technology landscape.
What tasks can BloombergGPT perform?
- Financial tasks in natural language processing
When BloombergGPT approaches financial tasks within the NLP discipline, it handles them similarly to general NLP tasks. However, these financial tasks carry unique features and challenges that differentiate them from their general counterparts. A prime example of this dynamic can be seen in sentiment analysis.
Sentiment analysis is a common NLP task, which aims to identify the tone or emotional context of a given text. Take, for instance, a headline such as “COMPANY to cut 10,000 jobs”. In general, this would typically suggest a negative sentiment as it implies job loss and potential hardship for many individuals. However, when seen through the lens of financial context, the sentiment can be perceived differently.
In the world of finance, job cuts often indicate a move towards operational efficiency by a company. Such a move can increase the company’s stock price or bolster investor confidence in the company. This clearly demonstrates the multifaceted nature of sentiment analysis when applied within the financial domain, as understood and implemented by BloombergGPT.
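As a small illustration of how the same headline can read differently through a finance-tuned model, the sketch below runs a generic sentiment pipeline and a finance-specific one side by side. It assumes a finance-tuned checkpoint such as ProsusAI/finbert is available on the Hugging Face Hub; this is not BloombergGPT, only a stand-in to show the idea.

# Compare generic vs finance-tuned sentiment on the same headline (illustrative)
from transformers import pipeline

headline = "COMPANY to cut 10,000 jobs"

# Generic sentiment model (the pipeline's default checkpoint)
general = pipeline("sentiment-analysis")
# Finance-tuned sentiment model (assumed available on the Hugging Face Hub)
finance = pipeline("text-classification", model="ProsusAI/finbert")

print("General-purpose:", general(headline))
print("Finance-tuned:  ", finance(headline))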
- External financial benchmarks
When it comes to external financial benchmarks, BloombergGPT utilizes a variety of resources. This includes four tasks drawn from the FLUE benchmark as well as the ConvFinQA dataset. The performance of large language models on these financial tasks is a relatively unexplored territory, making the assessment complex.
BloombergGPT operates in a unique fashion, utilizing a few-shot learning approach when dealing with these tasks. This technique involves training the model on a small number of examples or ‘shots’. The purpose behind this method is to optimize the average performance across all models.
This approach demonstrates how BloombergGPT addresses the challenges of evaluating performance on tasks where there isn’t a broad consensus or standard testing framework. Despite the intricate nature of these tasks, BloombergGPT remains committed to performing efficiently, abiding by its main principle of choosing a number of shots that would maximize average performance.
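The sketch below shows the general shape of such a few-shot evaluation: a prompt is assembled from k labeled examples followed by the query, and a causal language model completes it. The model, examples, and number of shots are placeholders; BloombergGPT’s actual evaluation harness is not public.

# A minimal few-shot prompting sketch (model and examples are placeholders)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

examples = [
    ("Company beats earnings expectations.", "positive"),
    ("Regulator fines bank over reporting failures.", "negative"),
]

def few_shot_prompt(query, shots):
    # Concatenate k labeled demonstrations, then the unlabeled query
    demo = "\n".join(f"Headline: {t}\nSentiment: {s}" for t, s in shots)
    return f"{demo}\nHeadline: {query}\nSentiment:"

prompt = few_shot_prompt("COMPANY to cut 10,000 jobs", examples)
print(generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"])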
- Internal task: Sentiment analysis
When dealing with internal tasks, BloombergGPT focuses heavily on aspect-specific sentiment analysis, a practice common in financial literature. The process involves a methodical approach, structured around a distinct discovery phase.
The discovery phase is instrumental in setting up the groundwork for sentiment analysis. It establishes the annotation and sampling procedures, which are essential components of the process. Additionally, this phase helps to gauge the required number of annotators for each example.
The training requirements for the annotators are also determined during this discovery phase. BloombergGPT relies on this structured procedure to maintain a consistent and effective approach to sentiment analysis, ensuring accurate and relevant results. This method underscores BloombergGPT’s commitment to perform complex tasks with diligence and precision.
- Exploratory task: Named Entity Recognition (NER)
Named Entity Recognition (NER) is a well-established task in the broader scope of natural language processing. However, this area has not been deeply investigated when it comes to generative language learning models. Recognizing the relevance of NER in the financial sector, BloombergGPT embarked on an exploratory journey into this task, aiming to uncover new insights and approaches.
The team behind BloombergGPT reported preliminary results from their exploration into NER. As part of this endeavor, they utilized seven distinct NER datasets originating from various domains within Bloomberg’s internal resources.
By venturing into this relatively uncharted territory within generative LLMs, BloombergGPT is enhancing its capabilities and contributing valuable research and findings to the broader field of natural language processing.
- Evaluation on standard general-purpose NLP tasks
The performance of BloombergGPT was critically examined on a variety of standard general-purpose NLP tasks. This evaluation process utilized BIG-bench Hard, representing a subset of the most demanding tasks within the broader BIG-bench benchmark.
Although the main focus of BloombergGPT is distinctly finance-oriented tasks, the inclusion of general-purpose training data in its learning process was considered essential. This approach is based on the possibility that such comprehensive data might augment BloombergGPT’s efficacy. This could not only enhance its capabilities in dealing with financial tasks, but could potentially improve its performance on conventional NLP tasks as well.
Therefore, even as BloombergGPT is fundamentally finance-focused, it appreciates the value of a well-rounded NLP competence, actively seeking to excel in not just specialized, but also universal NLP tasks.
- Knowledge assessments
BloombergGPT’s knowledge capacity, specifically its capability to recall information learned during its training phase, was put under the microscope by the research team. This evaluation was conducted through various scenarios where the model was required to provide answers to posed questions, without the aid of additional context or resources.
These scenarios comprised multiple-choice questions, along with benchmarks based on reading comprehension. Thus, the assessment was designed to evaluate not just the factual knowledge retained by BloombergGPT during its training, but also its ability to interpret, understand and utilize that knowledge effectively. It provided a valuable insight into BloombergGPT’s understanding and recall capabilities, which are critical aspects for any language model, especially in the financial domain where accuracy and reliability of information is paramount.
- Linguistic tasks
In the evaluation of BloombergGPT, certain tasks were not directly linked to end-user applications. These were essentially linguistic tasks, put in place to measure the model’s proficiency in understanding language at its core.
Such tasks scrutinized BloombergGPT’s capability in tackling disambiguation, parsing grammar, and understanding entailment. These areas are fundamental to language understanding and were chosen to directly assess BloombergGPT’s linguistic competence. By examining these abilities, the team aimed to ensure that the model could adeptly navigate language complexities, a skill that would prove invaluable even in its primary focus area of financial language processing.
The architecture of a domain-specific LLM like BloombergGPT
The creation of BloombergGPT was a joint endeavor between the Machine Learning (ML) Product and Research group and the AI Engineering team at Bloomberg. Their objective was to compile one of the largest domain-specific datasets to date. This involved leveraging Bloomberg’s vast data creation, collection, and curation resources. The team utilized Bloomberg’s expansive archive of financial data, assembling a comprehensive dataset of 363 billion tokens composed of English financial documents. This corpus was further augmented with a public dataset of 345 billion tokens, resulting in a combined training corpus surpassing 700 billion tokens.
To operationalize this vast dataset, the team elected to train a 50-billion parameter decoder-only causal language model. This type of model is adept at generating text in a sequential manner, making it ideal for processing natural language data. The resulting model was then tested on several validation sets: finance-specific natural language processing benchmarks, Bloomberg’s internal benchmarks, and a variety of generic NLP tasks drawn from widely recognized benchmarks.
BloombergGPT demonstrated exceptional performance, surpassing existing open source models of comparable size on finance-specific tasks by substantial margins. Importantly, while its specialization didn’t impede its ability to tackle broader tasks, the model also matched or exceeded performance on the general NLP benchmarks. This underscores the potential of BloombergGPT to enhance financial NLP tasks without sacrificing its utility for more general applications.
BloombergGPT’s architecture is based on a language model called BLOOM, which operates as a causal or “decoder-based” model. “Causal” essentially means that the model generates text sequentially, predicting the next word in a sequence based on the words that came before it.
The core of the BLOOM model consists of 70 layers of transformer decoder blocks. Here’s a simplified explanation of what this means:
- Transformer decoder blocks: These are a type of model structure used in natural language processing. They are designed to handle sequential data, like sentences, where the order of the elements is important. The transformer decoder block uses the following components:
- SA (Self attention): This mechanism helps the model pay varying amounts of ‘attention’ to different words in the input when generating the output. It allows the model to focus on the important words and understand the context better.
- LN (Layer Normalization): This is a technique to stabilize and speed up the learning process of the model. It ensures that the calculations the model performs stay within a manageable range, preventing numerical issues.
- FFN (Feed-forward Network): This is a type of artificial neural network where information moves in one direction – from input to output. It does not have any cycles or loops, and each layer of neurons is fully connected to the next layer.
The model also ties the input token embeddings (the numerical representation of words) to the linear mapping that occurs before the final output is decided. This step helps reduce the model’s complexity and can improve the efficiency of the model’s learning process.
Finally, an extra layer normalization step is added after the token embeddings, on top of the normalization already applied to the initial token embedding. Including these two consecutive layer normalization steps improves the model’s stability and assists in efficient learning.
In summary, the architecture of the BloombergGPT model is designed to efficiently learn from and generate text data, with several features intended to enhance the model’s stability and learning speed.
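For intuition, here is a compact PyTorch sketch of a single pre-norm decoder block built from the three components described above: self-attention, layer normalization, and a feed-forward network with GELU. It is a simplified illustration of the general structure, not the actual BLOOM/BloombergGPT implementation, which additionally uses ALiBi attention biases, tied embeddings, and other details.

# Simplified transformer decoder block (illustrative; not the production BLOOM code)
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)               # LN before self-attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)               # LN before the feed-forward network
        self.ffn = nn.Sequential(                      # FFN with a single hidden layer and GELU
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask so each position can only attend to earlier positions
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                               # residual connection
        x = x + self.ffn(self.ln2(x))                  # residual connection
        return x

# A full model stacks many such blocks (70 in BLOOM/BloombergGPT)
block = DecoderBlock()
print(block(torch.randn(1, 10, 512)).shape)  # torch.Size([1, 10, 512])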
How was BloombergGPT trained?
Step 1: Constructing the FinPile dataset
The research team responsible for the creation of BloombergGPT made use of an extensive and diverse dataset titled “FinPile” for its training process. FinPile comprises various English financial documents from multiple sources, namely news articles, filings, press releases, and social media content, all from the extensive Bloomberg archives.
This wealth of financial data includes elements like company filings, financial news, and other information relevant to market trends. Some of this data is publicly accessible, while other parts can only be accessed through purchase or exclusive access via the Bloomberg Terminal. This data is meticulously cleaned to remove any unnecessary elements such as markup, special formatting, and templates. Moreover, each document is time-stamped, spanning from March 1, 2007, to July 31, 2022, indicating a gradual improvement in data quality and volume over time.
While the team can’t release the FinPile dataset, they aim to share their insights and experiences in building this highly domain-specific model. They also intend to leverage date information in future research work.
FinPile’s financial data is paired with public data commonly used for LLM training to create a balanced training corpus, having an equal representation of domain-specific and general-purpose text. The team undertook a de-duplication process for each dataset in order to maintain high data quality, which could result in differing statistical reports compared to other studies.
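As a rough illustration of the kind of de-duplication step mentioned above, the sketch below removes exact duplicates from a document list by hashing normalized text. Bloomberg’s actual pipeline is not public and is certainly more sophisticated (for example, near-duplicate detection), so treat this as a minimal stand-in.

# Minimal exact-duplicate removal by hashing normalized text (illustrative only)
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash identically
    return " ".join(text.lower().split())

def deduplicate(documents):
    seen, unique_docs = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["Company posts record profit.", "Company  posts record profit.", "Markets fall on rate fears."]
print(len(deduplicate(docs)))  # 2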
Step 2: Tokenization
The research team behind BloombergGPT decided to employ the Unigram tokenizer, in preference to other greedy merge-based sub-word tokenizers like BPE or Wordpiece. This decision was based on the Unigram tokenizer’s promising performance in prior research studies. They adopted a byte-based approach to data handling instead of the more common practice of treating data as a sequence of Unicode characters.
Additionally, the team employed a pre-tokenization process similar to that used in GPT-2, which breaks the input into distinct chunks. This method facilitates the learning of multi-word tokens, thereby increasing information density and reducing context lengths.
The team used the Pile dataset to train their tokenizer. Given its diverse content, The Pile was considered a suitable choice for their purposes. Handling such a large dataset required a parallel tokenizer training method, and the team adopted a split and merge strategy for this. They trained a Unigram tokenizer on each chunk of data and then merged them in a hierarchical manner to create the final tokenizer.
The resulting tokenizer consisted of 7 million tokens, which was further condensed to 131,072 tokens. The decision of vocabulary size was influenced by several factors, including the need to fit more information into the context window and considerations related to overhead. The team arrived at a vocabulary size of 131,072 tokens following experiments with different vocabulary sizes and the smallest encoded representation of the C4 dataset. Consequently, the tokenizer used for BloombergGPT is relatively larger than the standard vocabulary size of approximately 50,000 tokens.
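The sketch below shows how one might train a Unigram tokenizer with byte-level pre-tokenization using the Hugging Face tokenizers library, trained here on a single local text file for illustration. Bloomberg’s actual parallel split-and-merge training over The Pile is far larger in scale; the file name and special tokens are placeholders.

# Train a byte-level Unigram tokenizer (illustrative sketch, not Bloomberg's pipeline)
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()   # operate on bytes rather than Unicode characters

trainer = trainers.UnigramTrainer(
    vocab_size=131072,                                 # 2**17, the vocabulary size chosen for BloombergGPT
    special_tokens=["<unk>", "<|endoftext|>"],
    unk_token="<unk>",
)

# "corpus.txt" is a placeholder for the training text
tokenizer.train(files=["corpus.txt"], trainer=trainer)
print(tokenizer.get_vocab_size())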
Step 3: Building the model
The Bloomberg team builds the BloombergGPT model upon the structural foundation provided by the BLOOM model. This involves utilizing a decoder-only causal language model comprising 70 layers of transformer decoder blocks. These blocks consist of multi-head self-attention, layer normalization, and a feed-forward network with a single hidden layer. A GELU function is employed as the non-linear element within the network.
Additionally, ALiBi positional encoding, or Attention with Linear Biases, is implemented and the input token embeddings are connected to the linear mapping preceding the final softmax application. An additional layer normalization is executed following the token embeddings.
Leveraging the scaling laws from Chinchilla, the Bloomberg team determines the size of the BloombergGPT model. They set their total compute budget at 1.3 million GPU hours on 40GB A100 GPUs. The costs associated with activation checkpointing are accounted for and the Chinchilla equations are used to ascertain the optimum balance between the number of parameters and tokens.
Despite possessing a significant dataset of approximately 700 billion tokens, the Bloomberg team found it too limited for the optimal configuration given their compute budget. To circumnavigate this constraint, they opt for the largest possible model, having 50 billion parameters, while preserving around 30% of the total compute budget as a safeguard for potential challenges that may emerge.
To set the structure of the BloombergGPT model, which has 50 billion parameters, the Bloomberg team used recommendations from Levine et al. (2020). According to these recommendations, the ideal size (D) for the hidden layers depends on the model’s number of self-attention layers (L). The team chose 70 self-attention layers (L = 70) and a target size of 7510 for the hidden layers (D = 7510).
To enhance the performance of the model’s Tensor Core operations, the team used a design with 40 heads, each having a dimension of 192. This resulted in a total hidden size (D) of 7680, bringing the overall parameter count to 50.6 billion.
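A back-of-the-envelope version of this sizing exercise can be written down with the common C ≈ 6·N·D approximation (training FLOPs ≈ 6 × parameters × tokens). The numbers below reuse figures quoted in this section (1.3 million A100 GPU hours, roughly 102 TFLOPs sustained, a 30% reserve, 50 billion parameters) and are only a rough illustration, not the paper’s exact calculation.

# Rough Chinchilla-style sizing arithmetic (illustrative, not the paper's exact method)
gpu_hours = 1.3e6                 # total compute budget quoted for BloombergGPT
sustained_flops = 102e12          # ~102 TFLOPs per GPU, as reported for the final run
reserve = 0.30                    # fraction of the budget held back for contingencies

total_flops = gpu_hours * 3600 * sustained_flops * (1 - reserve)

n_params = 50e9                   # chosen model size
tokens_needed = total_flops / (6 * n_params)   # from C ≈ 6 * N * D
print(f"Tokens implied by the budget: {tokens_needed / 1e9:.0f}B")   # roughly 1,100B

# The available corpus was ~700B tokens, i.e., smaller than the budget would ideally use,
# which is why the team fixed the model size at 50B parameters and kept the compute reserve.

# Shape check: 40 attention heads x 192 dimensions per head = hidden size 7680
print(40 * 192)                   # 7680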
Step 4: Training the model
The BloombergGPT model, built on PyTorch, is trained following a standard causal language modeling objective, with a directionality from left to right. The team has strategically set the length of the training sequences at 2,048 tokens to make optimal use of GPU resources.
The AdamW optimizer is utilized in the optimization process, and the learning rate aligns with a cosine decay schedule that includes a linear warmup. Additionally, a batch size warmup is applied for enhanced performance.
The training and evaluation stages of the model are conducted on Amazon SageMaker, leveraging 64 p4d.24xlarge instances, each equipped with 8 NVIDIA 40GB A100 GPUs. To ensure efficient data access and high throughput, the team employs Amazon FSx for Lustre, which supports up to 1,000 MB/s of read and write throughput per TiB of storage.
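The sketch below shows what this optimizer and learning-rate configuration looks like in code, using the cosine schedule with linear warmup from Hugging Face Transformers. The specific learning rate, warmup length, and step count are placeholders, not the values used for BloombergGPT.

# AdamW with a cosine decay schedule and linear warmup (values are placeholders)
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)          # stand-in for the actual language model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=0.1)

total_steps = 100_000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,              # linear warmup
    num_training_steps=total_steps,      # cosine decay over the remaining steps
)

for step in range(total_steps):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    break  # illustration only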
Step 5: Large-scale optimization
To train the substantial BloombergGPT model within the constraints of GPU memory, the Bloomberg team implements a series of optimization techniques:
- ZeRO optimization (stage 3): This approach involves dividing the training state across a group of GPUs. In the case of BloombergGPT, the team deploys 128 GPUs for the purpose of sharding and maintains four replicas of the model throughout the training process.
- MiCS: This system is designed to lower the communication overhead and memory requirements for training clusters in the cloud. It utilizes hierarchical communication, a 2-hop gradient update mechanism, and a scale-aware model partitioning scheme.
- Activation checkpointing: This technique discards activations and recalculates intermediate tensors during backward passes, as and when necessary to minimize the memory consumed during the training process.
- Mixed precision training: This strategy helps save memory by executing forward and backward passes in BFloat16 precision, while storing and updating the parameters in full precision (FP32).
- Fused kernels: This method amalgamates multiple operations into a singular GPU operation. The result is a reduction in peak memory usage due to the avoidance of storage of intermediate results and an enhancement in speed.
By applying these optimizations, the Bloomberg team successfully manages to yield an average computational performance of 102 TFLOPs, with each training step taking around 32.5 seconds.
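The sketch below illustrates two of these techniques in plain PyTorch: BFloat16 mixed-precision forward passes (with parameters kept in FP32) and activation checkpointing, which recomputes intermediate activations during the backward pass instead of storing them. It is a generic illustration; the actual BloombergGPT training stack combined these with ZeRO sharding, MiCS, and fused kernels on a multi-node cluster.

# Mixed-precision (BF16) forward pass with activation checkpointing (generic illustration)
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"

# Parameters stay in FP32; only the compute inside autocast runs in BFloat16
blocks = nn.ModuleList([nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(8)]).to(device)
optimizer = torch.optim.AdamW(blocks.parameters(), lr=1e-4)

x = torch.randn(16, 1024, device=device)
target = torch.randn(16, 1024, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    h = x
    for block in blocks:
        # Recompute this block's activations in the backward pass instead of storing them
        h = checkpoint(block, h, use_reentrant=False)

# Cast back to FP32 for the loss so parameters and gradients stay in full precision
loss = nn.functional.mse_loss(h.float(), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()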
How are ethical and privacy issues handled in BloombergGPT?
BloombergGPT, designed for financial applications, is developed with a thorough focus on ethical and privacy considerations. These aspects are meticulously addressed in the development process through a series of well-structured measures.
- Risk assessment and compliance procedures
Recognizing the sensitivity of the financial sector, Bloomberg has established a robust risk assessment and testing process. Accuracy and factual information are deemed crucial for the firm’s reputation and the satisfaction of its clients, who are increasingly seeking to incorporate advanced technology into their operations.
This process involves stringent annotation guidelines, multi-level pre-launch reviews involving central risk and compliance units, and continuous post-launch monitoring. All research, development, and deployment procedures for BloombergGPT align with relevant regulations, maintaining high standards for compliance.
- Toxicity and bias control
Bloomberg exerts extraordinary effort to control any potential toxicity and bias in its content, produced either by humans or AI models. The team behind BloombergGPT maintains an ongoing interest in evaluating and quantifying potential generation of harmful language in various application contexts. Particular attention is given to studying the influence of the FinPile corpus, which is characterized by a lower incidence of biased or toxic language, on the model’s tendencies. As BloombergGPT is integrated into new products, these rigorous testing procedures and compliance controls will be applied to ensure safe usage.
- Model release and openness
The question of how to release LLMs is a subject of ongoing debate in the AI community. Given the potential for misuse of LLMs, especially ones like BloombergGPT that are trained on a wealth of sensitive data, Bloomberg employs a strategic approach to this issue. Various strategies are considered, ranging from releasing trained models under specific usage licenses, to granting API access without revealing underlying model parameters or detailed training data.
In light of the significant risk of data leakage attacks, Bloomberg has opted for a conservative approach, withholding the release of BloombergGPT model parameters. This decision is made in alignment with the company’s mission to provide access to data collected over decades, while also ensuring its protection from potential misuse.
- Contribution to the field
Despite not releasing the model, BloombergGPT’s development journey provides valuable insights to the broader NLP community, especially those creating domain-specific models. The experience gained in training and evaluating BloombergGPT can help better understand these models, contributing to the shared knowledge in the field of NLP and LLMs. Bloomberg acknowledges the influence of various other models and projects in shaping the development of BloombergGPT, and continues its efforts in contributing to this collective intelligence.
How to train an open-source foundation model into a domain-specific LLM?
The objective here is to develop a classification model leveraging the robust capabilities of BERT, a pre-trained large language model. To customize this model to fit a specific domain, it will be fine-tuned using domain-specific data, specifically the Toxic Comment Classification Challenge data, sourced from the realm of social media.
The dataset essential for executing this project is available on Kaggle. Only the ‘train.csv’ file among the available datasets will be utilized for this particular case. The complete code is available on GitHub.
Let’s code!
Step 1: Loading the data set
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast

# Specify the device: use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
df = pd.read_csv("train.csv")

# Keep only the comment text and the binary toxicity label, renamed to 'text' and 'label'
df = df.sample(6400)[['comment_text', 'toxic']]
df.columns = ['text', 'label']

# Print the data and take a look at it
df.head()

# Check the class distribution
print(df['label'].value_counts(normalize=True))

# Split the data into training and temporary (validation + test) sets
train_text, temp_text, train_labels, temp_labels = train_test_split(
    df['text'], df['label'],
    random_state=1234, test_size=0.3, stratify=df['label'])

# Split the temporary set created above into validation and test sets
val_text, test_text, val_labels, test_labels = train_test_split(
    temp_text, temp_labels,
    random_state=1234, test_size=0.5, stratify=temp_labels)

# This yields 70% training data, 15% validation, and 15% test
Step 2: Tokenization
# Load the pre-trained BERT model and its tokenizer
bert = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Get the length (in words) of all sentences in the training data
seq_len = [len(i.split()) for i in train_text]
pd.Series(seq_len).hist(bins=100)

# Select max_seq_len based on the distribution above - target the 85th-90th percentile of lengths
max_seq_len = 100

# Tokenize and encode the sequences in the training, validation, and test sets,
# padding or truncating each to the max length determined above
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(), max_length=max_seq_len,
    padding='max_length', truncation=True, return_token_type_ids=False)
tokens_val = tokenizer.batch_encode_plus(
    val_text.tolist(), max_length=max_seq_len,
    padding='max_length', truncation=True, return_token_type_ids=False)
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(), max_length=max_seq_len,
    padding='max_length', truncation=True, return_token_type_ids=False)

# Convert tokens and labels to tensors for the train, validation, and test sets
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())
Step 3: Creating a data loader to prepare the data for training
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Define a batch size (you may want to tweak this based on your sample size)
batch_size = 32

# Wrap tensors, create a random sampler, and prepare the DataLoader for the training data
train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Wrap tensors, create a sequential sampler, and prepare the DataLoader for the validation data
val_data = TensorDataset(val_seq, val_mask, val_y)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

# Freeze all BERT parameters by setting requires_grad = False,
# so only the new classification head will be trained
for param in bert.parameters():
    param.requires_grad = False
Step 4: Defining the model architecture: adding three dense layers and a binary output layer on top of BERT
class BERT_Arch(nn.Module):

    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert
        # dropout layer
        self.dropout = nn.Dropout(0.1)
        # ReLU activation function
        self.relu = nn.ReLU()
        # dense layer 1 (768 is fixed by BERT's hidden size; the sizes of the following layers are free choices)
        self.fc1 = nn.Linear(768, 512)
        # dense layer 2
        self.fc2 = nn.Linear(512, 128)
        # dense layer 3
        self.fc3 = nn.Linear(128, 32)
        # dense layer 4 (output layer with 2 classes)
        self.fc4 = nn.Linear(32, 2)
        # log-softmax activation function
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):
        # pass the inputs to BERT; cls_hs is the pooled [CLS] representation
        _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
        # 1st dense layer
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        # 2nd dense layer
        x = self.fc2(x)
        x = self.relu(x)
        x = self.dropout(x)
        # 3rd dense layer
        x = self.fc3(x)
        x = self.relu(x)
        x = self.dropout(x)
        # output layer
        x = self.fc4(x)
        # apply log-softmax activation
        x = self.softmax(x)
        return x
Step 5: Instantiating the model and defining the optimizer and loss function
# Instantiate the model with the pre-trained BERT and push it to the device
model = BERT_Arch(bert)
model = model.to(device)

# optimizer from Hugging Face transformers
from transformers import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-3)

# Compute class weights for the imbalanced labels and push them to the device
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight(class_weight="balanced",
                                     classes=np.unique(train_labels),
                                     y=train_labels)

# convert class weights to a tensor
weights = torch.tensor(class_weights, dtype=torch.float)
weights = weights.to(device)

# weighted negative log-likelihood loss (pairs with the LogSoftmax output)
cross_entropy = nn.NLLLoss(weight=weights)

# number of training epochs (high because only the small classification head is being trained)
epochs = 100

print(class_weights)
Step 6: Train the model
def train():
    # Put the model in training mode
    model.train()

    # Initialize the loss and accuracy to zero (they will be updated over the epoch)
    total_loss, total_accuracy = 0, 0

    # empty list to save model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        # push the batch to the device
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch

        # clear previously calculated gradients
        model.zero_grad()

        # get model predictions for the current batch
        preds = model(sent_id, mask)

        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)

        # add on to the total loss
        total_loss = total_loss + loss.item()

        # backward pass to calculate the gradients
        loss.backward()

        # clip the gradients to 1.0 to help prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # model predictions are stored on the device, so push them to the CPU
        preds = preds.detach().cpu().numpy()

        # append the model predictions
        total_preds.append(preds)

    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # predictions are of shape (no. of batches, batch size, no. of classes);
    # reshape them to (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    # return the loss and predictions
    return avg_loss, total_preds
Step 7: Evaluate the model
def evaluate():
    print("\nEvaluating...")

    # deactivate dropout layers
    model.eval()

    # Initialize the loss and accuracy to zero
    total_loss, total_accuracy = 0, 0

    # empty list to save the model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(val_dataloader):
        # push the batch to the device
        batch = [t.to(device) for t in batch]
        sent_id, mask, labels = batch

        # deactivate autograd
        with torch.no_grad():
            # model predictions
            preds = model(sent_id, mask)

            # compute the validation loss between actual and predicted values
            loss = cross_entropy(preds, labels)
            total_loss = total_loss + loss.item()

            preds = preds.detach().cpu().numpy()
            total_preds.append(preds)

    # compute the validation loss of the epoch
    avg_loss = total_loss / len(val_dataloader)

    # reshape the predictions to (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    return avg_loss, total_preds
Step 8: Run the pipeline to iteratively train and evaluate the model, reporting training and validation loss for every epoch
# set the initial "best" validation loss to infinity
best_valid_loss = float('inf')

# empty lists to store the training and validation loss of each epoch
train_losses = []
valid_losses = []

# for each epoch
for epoch in range(epochs):
    # train the model
    train_loss, _ = train()

    # evaluate the model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'Epoch {epoch + 1}/{epochs}\tTraining Loss: {train_loss:.3f}\tValidation Loss: {valid_loss:.3f}')

# load the weights of the best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))

# get predictions for the test data
with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
    preds = preds.detach().cpu().numpy()

# evaluate the model's performance on the test set
preds = np.argmax(preds, axis=1)
print(classification_report(test_y, preds))
pd.crosstab(test_y, preds)
Evaluation metrics for domain-specific LLMs
Evaluating the performance of domain-specific large language models (LLMs) is critical to ensuring their effectiveness in real-world applications. Various criteria and metrics are employed to assess the capabilities of these models, ranging from general metrics like precision, recall, and F1 score to domain-specific measures tailored to the nature of the targeted tasks.
Precision, recall, and F1 score:
- Precision: This metric measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions made by the model. In the context of domain-specific LLMs, precision is crucial for ensuring that the model provides accurate information within the specified domain.
- Recall: Recall, commonly referred to as sensitivity or the true positive rate, gauges the model’s capability to identify and capture all pertinent instances within the dataset. It is the ratio of true positive predictions to the total actual positive instances. In the context of domain-specific LLMs, high recall ensures that the model doesn’t miss important information related to the domain.
- F1 score: The F1 score, a measure derived from the harmonic mean of precision and recall, provides a balanced metric that harmonizes the evaluation of a model’s overall performance. For domain-specific LLMs, achieving a high F1 score is indicative of a model that can both accurately identify relevant information and avoid false positives.
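Continuing with the classifier trained in the walkthrough above, these three metrics can be computed directly from the test-set predictions, for example with scikit-learn as sketched below (the variable names match the earlier code; thresholds and per-class reporting are left to the application).

# Compute precision, recall, and F1 on the held-out test predictions
from sklearn.metrics import precision_score, recall_score, f1_score

# test_y and preds come from Step 8 of the walkthrough above
precision = precision_score(test_y, preds)
recall = recall_score(test_y, preds)
f1 = f1_score(test_y, preds)

print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")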
Domain-specific metrics:
- Task-specific accuracy: Depending on the application, domain-specific tasks may require task-specific accuracy metrics. For example, in a finance-specific LLM, accuracy in predicting financial terms or market trends could be crucial.
- Domain relevance: Evaluate the relevance of the generated text to the specified domain. This can involve human evaluators assessing the output based on domain-specific knowledge and context.
- Specialized knowledge metrics: Some domains may have specialized metrics to gauge the model’s understanding of specific terminology or concepts. For instance, in a medical LLM, metrics could assess the model’s grasp of medical terminology and its ability to generate contextually accurate information.
User satisfaction and feedback:
Collecting user feedback is an invaluable aspect of evaluating domain-specific LLMs. Assess user satisfaction, ease of interaction, and the practical utility of the generated content in real-world scenarios.
Generalization across tasks:
Evaluate how well the domain-specific LLM generalizes across various tasks within the specified domain. A robust model should demonstrate consistent performance across a range of related tasks.
Robustness to adversarial inputs:
Assess the model’s resilience to adversarial inputs or attempts to mislead it. This is particularly important in domains where misinformation or manipulation may be a concern.
The evaluation of domain-specific LLMs requires a multifaceted approach, combining traditional metrics like precision, recall, and F1 score with domain-specific measures that reflect the intricacies of the targeted tasks. A holistic evaluation ensures that these models not only perform well in a generic sense but also excel in their intended applications within specific domains.
Endnote
As we conclude, it’s clear that domain-specific large language models are not just beneficial, but indeed necessary for modern organizations. The shift from generic models to more tailored language processing models addresses unique industry needs, enhances performance, enables personalized interactions, and streamlines operations. Moreover, these custom models provide improved data privacy control and financial efficiency.
In the rapidly evolving world of AI and natural language processing, adopting domain-specific LLMs represents a crucial step towards staying competitive and innovative. These models, fine-tuned to understand specific industry language and context, transform how organizations interact with data and digital interfaces. The need for domain-specific LLMs is undeniable, and it is now a question of how quickly organizations can adapt and integrate this powerful tool into their processes to unlock its full potential.
Leverage the power of customized LLMs! Contact LeewayHertz to create your own domain-specific LLM for highly accurate results and unparalleled performance within your domain.