This document examines foundation models as a transformative AI paradigm, explaining how these models are trained on vast amounts of unstructured data to perform generative tasks and can be adapted to multiple applications. The discussion covers large language models, their advantages in performance and productivity, along with critical challenges in compute costs and trustworthiness.
Large language models such as ChatGPT have demonstrated remarkable capabilities, from creative writing to complex planning tasks. These models represent a step change in AI performance and in their potential to drive enterprise value. Large language models are in fact part of a broader class of models called foundation models.
The term “foundation models” was coined by a team from Stanford. They observed that the field of AI was converging on a new paradigm, one that represents a fundamental shift in how AI applications are built and deployed.
Before foundation models, AI applications were built by training a library of different AI models where each AI model was trained on very task-specific data to perform a very specific task. The conventional approach required separate models for each distinct application or use case.
The new paradigm centers on having a foundational capability or a foundation model that can drive all of these same use cases and applications. The same exact applications envisioned with conventional AI can now be powered by a single foundation model, and the same model can drive any number of additional applications.
The distinguishing characteristic of foundation models is their ability to be transferred to any number of tasks. This transferability gives these models the capability to perform multiple different functions across diverse applications.
The superpower enabling foundation models to transfer to multiple different tasks and perform multiple different functions comes from how they are trained. These models are trained on a huge amount of data in an unsupervised manner using unstructured data.
In the language domain, this training process involves feeding vast quantities of sentences to the model. The scale of data used for training these models is measured in terabytes, representing an unprecedented amount of training information.
The core training mechanism for foundation models in the language domain works through sentence completion. Consider a simple example where the start of a sentence might be “no use crying over spilled,” and the end of the sentence would be “milk.”
The model learns to predict the last word of the sentence based on the words it observed before. This generative capability of predicting and generating the next word based on previous words is fundamental to how foundation models operate.
This generative capability is why foundation models belong to the field of AI known as generative AI: they generate something new, in this case the next word in a sentence.
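The sentence-completion objective can be illustrated with a deliberately tiny sketch. It swaps in simple word-pair counting for the neural networks that real foundation models use, and the three-sentence corpus and the function names `train_next_word_model` and `predict_next_word` are hypothetical, chosen only for illustration. The point is the shape of the task: learn from raw, unlabeled sentences, then predict the next word from the words seen so far.

```python
from collections import Counter, defaultdict

def train_next_word_model(sentences):
    """Count, for each word, which words follow it in the training text."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            follow_counts[current][following] += 1
    return follow_counts

def predict_next_word(follow_counts, prefix):
    """Return the most frequent follower of the last word in the prefix."""
    words = prefix.lower().split()
    if not words or words[-1] not in follow_counts:
        return None  # nothing learned about this context
    return follow_counts[words[-1]].most_common(1)[0][0]

# Tiny hypothetical corpus; real models are trained on terabytes of text.
corpus = [
    "no use crying over spilled milk",
    "she spilled milk on the table",
    "he spilled milk again",
]
model = train_next_word_model(corpus)
print(predict_next_word(model, "no use crying over spilled"))  # prints "milk"
```

Note that no labels were supplied: the sentences themselves provide both the input (the preceding words) and the target (the next word). This toy looks only at the single previous word, whereas a real model conditions on the whole preceding context.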
Even though these models are trained, at their core, on a generation task (predicting the next word in a sentence), they can be adapted to perform traditional NLP tasks. Given a small amount of labeled data, they can be tuned for tasks such as classification or named entity recognition, which are not normally associated with generative models.
This adaptation process is called tuning. Through tuning, a foundation model can be customized by introducing a small amount of data. The process updates the parameters of the model, enabling it to perform a very specific natural language task tailored to particular requirements.
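The idea that tuning starts from existing parameters and updates them with a handful of labeled examples can be sketched as follows. This is a toy under stated assumptions, not how a production foundation model is tuned: a two-weight logistic classifier stands in for the model, the two-number feature vectors stand in for whatever representation the model has learned, and every name and value (`tune`, `predict`, `pretrained_weights`, the four labeled examples) is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability that the example belongs to the positive class."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def tune(weights, bias, labeled_examples, learning_rate=0.5, epochs=200):
    """Start from existing parameters and nudge them toward the labeled data."""
    weights = list(weights)  # copy so the starting parameters are left untouched
    for _ in range(epochs):
        for features, label in labeled_examples:
            # Gradient of the log loss with respect to the pre-sigmoid score.
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical starting parameters and a deliberately tiny labeled set (1 = positive).
pretrained_weights, pretrained_bias = [0.1, -0.1], 0.0
labeled_examples = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]

tuned_weights, tuned_bias = tune(pretrained_weights, pretrained_bias, labeled_examples)
print(round(predict(tuned_weights, tuned_bias, [1.0, 0.0]), 3))  # positive-like input
print(round(predict(tuned_weights, tuned_bias, [0.0, 1.0]), 3))  # negative-like input
```

The design point mirrored here is that training does not begin from scratch: the loop begins at `pretrained_weights` and only adjusts them, which is why a small labeled set can be enough.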
Foundation models work remarkably well even in domains with little labeled data. When labeled data is scarce or unavailable, there are still effective ways to leverage these models.
In a process called prompting or prompt engineering, these models can be applied to perform specific tasks without requiring labeled training data. This approach enables foundation models to be useful even when traditional supervised learning would be impractical due to data constraints.
Prompting a model to perform a classification task illustrates this capability. The model is given a sentence followed by a question such as “Does this sentence have a positive sentiment or a negative sentiment?”
The model then continues generating text, and the natural next word is the answer to the classification problem: “positive” or “negative,” depending on the model’s estimate of the sentence’s sentiment.
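The prompting pattern described above can be sketched in a few lines. The prompt layout, the function names, and the `generate` parameter are assumptions made for illustration: `generate` stands for any callable that takes a prompt string and returns the model's continuation, and `fake_generate` is a hard-coded stand-in, not a real model, included only so the sketch runs.

```python
def build_sentiment_prompt(sentence):
    """Frame classification as text completion: the answer is the next word."""
    return (
        f"Sentence: {sentence}\n"
        "Does this sentence have a positive sentiment or a negative sentiment?\n"
        "Answer:"
    )

def classify_sentiment(sentence, generate):
    """`generate` is any callable mapping a prompt string to generated text."""
    completion = generate(build_sentiment_prompt(sentence))
    words = completion.strip().lower().split()
    first_word = words[0].strip(".,!") if words else ""
    # Anything other than the two expected labels is reported as unknown.
    return first_word if first_word in {"positive", "negative"} else "unknown"

# Hypothetical stand-in for a real model call, used only to make the sketch runnable.
def fake_generate(prompt):
    return " Positive." if "love" in prompt else " Negative."

print(classify_sentiment("I love this product", fake_generate))  # prints "positive"
```

No labeled training data appears anywhere in this flow; the task is specified entirely by the wording of the prompt, and the classification is read off the first generated word.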
These models work surprisingly well when applied to these new settings and domains, demonstrating their adaptability and robustness across different tasks and contexts.
Foundation models offer significant benefits that make them attractive for enterprise applications. These advantages stem from their unique training approach and architecture.
The chief advantage of foundation models is their performance. These models have been exposed to enormous amounts of data, measured in terabytes, giving them extensive knowledge and pattern recognition capabilities.
By the time foundation models are applied to specific tasks, they can drastically outperform models trained on only a few data points. The comprehensive pre-training on massive datasets provides foundation models with a knowledge base that smaller, task-specific models cannot match.
The second major advantage of foundation models is the productivity gains they enable. Through prompting or tuning, far less labeled data is needed to achieve a task-specific model compared to starting from scratch.
The model draws on all the unlabeled data it processed during pre-training on the generative task. This means organizations can develop effective AI solutions much more quickly and with significantly less investment in data labeling and preparation.
While foundation models offer substantial advantages, they also present important challenges that must be considered, particularly for enterprise deployment.
The first major disadvantage relates to compute costs. The penalty for having these models process so much data is that they are very expensive to train. This high cost makes it difficult for smaller enterprises to train a foundation model on their own.
Once foundation models reach a size of several billion parameters, they also become very expensive to run for inference. Multiple GPUs may be required at once just to host a model and serve its predictions, making inference costlier than with traditional approaches.
The substantial computational requirements create barriers to entry and ongoing operational costs that organizations must carefully evaluate.
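A rough back-of-the-envelope calculation shows why hosting alone can require several GPUs. The figures below are assumptions picked for illustration, not measurements of any particular model: 20 billion parameters, 2 bytes per parameter (16-bit weights), and 16 GB of memory per GPU. The helper name `gpus_needed_for_weights` is likewise hypothetical.

```python
import math

def gpus_needed_for_weights(num_parameters, bytes_per_parameter, gpu_memory_gb):
    """Lower bound on GPUs needed just to hold the model weights in memory."""
    weight_bytes = num_parameters * bytes_per_parameter
    gpu_bytes = gpu_memory_gb * 1024**3
    return math.ceil(weight_bytes / gpu_bytes)

# Assumed values: 20 billion parameters, 2 bytes each (16-bit), 16 GB per GPU.
print(gpus_needed_for_weights(20_000_000_000, 2, 16))  # prints 3
```

Under these assumptions the weights alone occupy about 40 GB, which does not fit on two 16 GB devices. This is only a lower bound: the estimate ignores the working memory needed while generating text, so real deployments typically need more headroom.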
The second disadvantage of foundation models centers on trustworthiness. While the vast amount of data these models have processed is a huge advantage, it also comes at a cost, especially in the domain of language.
Many of these models are trained on language data scraped from the Internet. The volume of data is so enormous that even a whole team of human annotators could not vet every single data point to ensure it is free of bias, hate speech, and other toxic content.
This challenge is compounded by the fact that for many open source models that have been released, the exact datasets used for training are often unknown. This lack of transparency leads to trustworthiness issues that are particularly concerning in business settings where accountability and reliability are paramount.
IBM recognizes the tremendous potential of foundation model technologies while acknowledging the challenges they present. Research efforts at IBM focus on multiple innovations to address the limitations of these models.
IBM Research partners are working on innovations to improve the efficiency of foundation models. Parallel efforts address the trustworthiness and reliability of these models to make them more relevant and suitable for business settings.
These research initiatives aim to make foundation models more accessible to enterprises of all sizes while ensuring they meet the rigorous standards required for production deployment.
All the examples discussed so far have focused on the language domain, but foundation models can be applied to many other domains. The foundation model paradigm extends far beyond text processing.
Foundation models for vision have gained significant attention. Models such as DALL-E 2 take text data as input and use it to generate custom images. This capability opens up new possibilities for creative applications and visual content generation.
Foundation models for code have emerged as powerful tools for software development. Products like Copilot can help complete code as it is being authored, accelerating development and assisting programmers with complex coding tasks.
IBM is innovating across all of these domains and more. The company integrates foundation models into various products and services:
| Domain | Product Integration | Capability |
|---|---|---|
| Language | Watson Assistant and Watson Discovery | Natural language understanding and generation |
| Vision | Maximo Visual Inspection | Image analysis and defect detection |
| Code | Project Wisdom with Red Hat | Ansible code assistance and generation |
IBM’s foundation model work extends into specialized domains that address critical global challenges.
IBM recently published and released MolFormer, a foundation model designed to promote molecule discovery for different targeted therapeutics. This model demonstrates how foundation models can accelerate scientific discovery in pharmaceutical and chemical research.
Foundation models are being developed for climate change research. IBM is building earth science foundation models using geospatial data to improve climate research and modeling.
These specialized foundation models demonstrate the broad applicability of the foundation model paradigm across scientific and technical domains.
IBM’s work focuses on addressing the disadvantages of foundation models to make them suitable for enterprise deployment. This includes efforts to improve trustworthiness and efficiency.
Research initiatives focus on making foundation models more trustworthy by addressing bias, ensuring transparency in model behavior, and implementing governance frameworks. These efforts are critical for enterprise adoption where accountability and ethical AI practices are essential.
Work on improving efficiency aims to reduce the computational costs of both training and inference. More efficient foundation models can be deployed by a wider range of organizations and can operate more cost-effectively at scale.
Foundation models represent a paradigm shift in artificial intelligence, moving from task-specific models to versatile foundational capabilities that can be adapted to multiple applications. These models, trained on massive amounts of unstructured data through generative tasks, offer significant advantages in performance and productivity while presenting challenges in compute costs and trustworthiness. Large language models exemplify foundation models in the language domain, demonstrating remarkable capabilities through both tuning with small amounts of labeled data and prompt engineering with little to no labeled data. IBM’s multi-domain approach to foundation models spans language, vision, code, chemistry, and climate science, with ongoing research focused on improving efficiency and trustworthiness to make these powerful technologies more suitable for enterprise applications.
Match each aspect (A to D) with its description (1 to 4):
| Aspect | Description |
|---|---|
| A. Performance Advantage | 1. Models are very expensive to train and require multiple GPUs for inference |
| B. Productivity Advantage | 2. Difficult to vet all training data for bias and toxic content |
| C. Compute Cost Disadvantage | 3. Models drastically outperform those trained on few data points |
| D. Trustworthiness Disadvantage | 4. Far less labeled data needed compared to training from scratch |
Correct matches: A-3, B-4, C-1, D-2.
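Foundation models present two compute cost challenges: they are very expensive to train, and once they reach several billion parameters they are also expensive to run for inference, often requiring multiple GPUs.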
True or false: for most open source foundation models, the exact datasets used for training are publicly documented and fully transparent.
False. For many open source models that have been released, the exact datasets used for training are often unknown. This lack of transparency leads to trustworthiness issues that are particularly concerning in business settings where accountability and reliability are paramount.
(3) Many foundation models are trained on language data scraped from the Internet. The volume of data is so enormous that even with a whole team of human annotators, it would be impossible to vet every single data point to ensure it is not biased or does not contain hate speech or other toxic information. This creates significant trustworthiness concerns, especially for enterprise applications.
IBM is innovating across multiple foundation model domains, including language, vision, code, chemistry, and climate science.
IBM Research is working on multiple innovations to improve foundation models in two key areas: efficiency and trustworthiness.
(3) Autonomous vehicle navigation is not mentioned as a foundation model application domain in the document. The domains discussed include language, vision, code, chemistry, and climate/earth science.