Foundation Models

This document examines foundation models as a transformative AI paradigm. It explains how these models are trained on vast amounts of unstructured data to perform generative tasks and how they can be adapted to multiple applications. It also covers large language models, their advantages in performance and productivity, and the challenges they pose in compute costs and trustworthiness, particularly in enterprise applications.


The Emergence of Foundation Models

Large language models such as ChatGPT have demonstrated remarkable capabilities, from creative writing to complex planning tasks. These models represent a step change in AI performance and in their potential to drive enterprise value. Large language models are actually part of a broader class of models called foundation models.

Origins of the Foundation Model Paradigm

The term foundation models was first coined by a team from Stanford. They observed that the field of AI was converging to a new paradigm, representing a fundamental shift in how AI applications are built and deployed.

The Paradigm Shift

Before foundation models, AI applications were built by training a library of different AI models, each trained on task-specific data to perform one specific task. This conventional approach required a separate model for each distinct application or use case.

The new paradigm centers on having a foundational capability or a foundation model that can drive all of these same use cases and applications. The same exact applications envisioned with conventional AI can now be powered by a single foundation model, and the same model can drive any number of additional applications.


What Makes Foundation Models Different

The distinguishing characteristic of foundation models is their ability to be transferred to any number of tasks. This transferability gives these models the capability to perform multiple different functions across diverse applications.

Training on Massive Unstructured Data

The superpower that lets foundation models transfer to many different tasks and perform many different functions comes from how they are trained. These models are trained in an unsupervised manner on huge amounts of unstructured data.

In the language domain, this training process involves feeding vast quantities of sentences to the model. The scale of data used for training these models is measured in terabytes, representing an unprecedented amount of training information.

The Generative Training Task

The core training mechanism for foundation models in the language domain works through sentence completion. Consider a simple example where the start of a sentence might be “no use crying over spilled,” and the end of the sentence would be “milk.”

The model learns to predict the last word of the sentence based on the words it observed before. This generative capability of predicting and generating the next word based on previous words is fundamental to how foundation models operate.
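As a rough illustration of the sentence-completion task, the sketch below counts which word follows each word in a tiny made-up corpus and predicts the most frequent follower. It is a toy bigram counter, not how foundation models are built (they use large neural networks that condition on much longer context). The corpus and the function names `train_bigram_counts` and `predict_next_word` are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical three-sentence corpus; real training data is measured in terabytes.
CORPUS = [
    "no use crying over spilled milk",
    "the cat drank the spilled milk",
    "she wiped up the spilled water",
]


def train_bigram_counts(sentences):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers


def predict_next_word(followers, prefix):
    """Predict the next word from the last word of the prefix (None if unseen)."""
    last = prefix.split()[-1]
    if last not in followers:
        return None
    # Pick the follower seen most often after the last word.
    return followers[last].most_common(1)[0][0]


model = train_bigram_counts(CORPUS)
print(predict_next_word(model, "no use crying over spilled"))  # milk
```

No labels were needed here: the "answer" for each position is simply the next word already present in the text, which is why this kind of training can use unlabeled data.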

Connection to Generative AI

This generative capability, predicting the next word from the words that came before, is why foundation models belong to the field of AI called Generative AI. These models generate something new, specifically the next word in a sentence.


Beyond Generation: Versatility Through Tuning

Even though these models are trained, at their core, on a generation task (predicting the next word in a sentence), they can be adapted to perform traditional NLP tasks. With the introduction of a small amount of labeled data, these models can be tuned to perform tasks like classification or named entity recognition, which are not normally associated with generative models.

The Tuning Process

This adaptation process is called tuning. Through tuning, a foundation model can be customized by introducing a small amount of data. The process updates the parameters of the model, enabling it to perform a very specific natural language task tailored to particular requirements.
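The idea of updating existing parameters with a small labeled set can be sketched with a deliberately tiny stand-in. The example below is not a foundation model: it is a two-weight logistic classifier whose starting weights play the role of pre-trained parameters. The feature vectors, labels, starting weights, and hyperparameters are all invented for illustration.

```python
import math

# Hypothetical labeled set: (feature vector, label). Only four examples.
LABELED_EXAMPLES = [
    ([1.0, 0.0], 1),
    ([0.9, 0.2], 1),
    ([0.0, 1.0], 0),
    ([0.2, 0.8], 0),
]

# Stand-ins for parameters that already exist before tuning starts.
PRETRAINED_WEIGHTS = [-0.5, 0.5]
PRETRAINED_BIAS = 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return 1 if the model's probability for class 1 is at least 0.5."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(weights, bias, examples):
    correct = sum(predict(weights, bias, x) == y for x, y in examples)
    return correct / len(examples)


def tune(weights, bias, examples, learning_rate=0.5, epochs=100):
    """Update a copy of the parameters with gradient steps on the labeled data."""
    weights = list(weights)  # leave the original parameters untouched
    for _ in range(epochs):
        for features, label in examples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


tuned_weights, tuned_bias = tune(PRETRAINED_WEIGHTS, PRETRAINED_BIAS, LABELED_EXAMPLES)
print("accuracy before tuning:", accuracy(PRETRAINED_WEIGHTS, PRETRAINED_BIAS, LABELED_EXAMPLES))
print("accuracy after tuning: ", accuracy(tuned_weights, tuned_bias, LABELED_EXAMPLES))
```

The point of the sketch is the shape of the process: the parameters are not learned from scratch, only nudged by a handful of labeled examples toward the specific task.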


Working with Limited or No Labeled Data

Foundation models work remarkably well even in domains with little labeled data. When labeled data is scarce or unavailable, there are still effective ways to leverage these powerful models.

Prompt Engineering

In a process called prompting or prompt engineering, these models can be applied to perform specific tasks without requiring labeled training data. This approach enables foundation models to be useful even when traditional supervised learning would be impractical due to data constraints.

Classification Through Prompting

An example of prompting a model to perform a classification task demonstrates this capability. The process involves providing the model with a sentence and then asking it a question such as “Does this sentence have a positive sentiment or negative sentiment?”

The model attempts to continue generating words after the question, and the next natural word is the answer to the classification problem. The model responds either positive or negative, depending on its estimate of the sentence’s sentiment.
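The prompting flow described above can be sketched as follows. The prompt template is one possible wording, and `toy_next_word_scores` is a hypothetical stand-in for a real model's next-word scores: it only counts a few hand-picked cue words so that the example runs without any model. In practice that function would be replaced by a call to an actual foundation model.

```python
def build_sentiment_prompt(sentence):
    """Wrap the input sentence in a question whose natural next word is the label."""
    return (
        f"Sentence: {sentence}\n"
        "Does this sentence have a positive sentiment or negative sentiment?\n"
        "Answer:"
    )


def toy_next_word_scores(prompt):
    """Hypothetical stand-in for a foundation model's next-word scores.

    A real model would score every word in its vocabulary; this stub only
    counts a few cue words in the prompt text.
    """
    text = prompt.lower()
    positive_cues = ("great", "love", "excellent")
    negative_cues = ("terrible", "hate", "awful")
    return {
        "positive": sum(text.count(cue) for cue in positive_cues),
        "negative": sum(text.count(cue) for cue in negative_cues),
    }


def classify_by_prompting(sentence, next_word_scores=toy_next_word_scores):
    """Classify a sentence by asking which label word is the likelier next word."""
    prompt = build_sentiment_prompt(sentence)
    scores = next_word_scores(prompt)
    # Restrict the answer to the two label words and take the higher-scoring one.
    return max(("positive", "negative"), key=lambda label: scores[label])


print(classify_by_prompting("I love this phone, the battery is great"))
print(classify_by_prompting("The service was terrible and the food was awful"))
```

No labeled training data appears anywhere in this flow; the task is specified entirely by how the prompt is worded.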

These models work surprisingly well when applied to these new settings and domains, demonstrating their adaptability and robustness across different tasks and contexts.


Advantages of Foundation Models

Foundation models offer significant benefits that make them attractive for enterprise applications. These advantages stem from their unique training approach and architecture.

Superior Performance

The chief advantage of foundation models is their performance. These models have been exposed to enormous amounts of data, measured in terabytes, giving them extensive knowledge and pattern recognition capabilities.

By the time foundation models are applied to specific tasks, they can drastically outperform models trained on only a few data points. The comprehensive pre-training on massive datasets provides foundation models with a knowledge base that smaller, task-specific models cannot match.

Productivity Gains

The second major advantage of foundation models is the productivity gains they enable. Through prompting or tuning, far less labeled data is needed to achieve a task-specific model compared to starting from scratch.

The model takes advantage of all the unlabeled data it processed during pre-training on the generative task. This means organizations can develop effective AI solutions much more quickly and with significantly less investment in data labeling and preparation.


Disadvantages and Challenges

While foundation models offer substantial advantages, they also present important challenges that must be considered, particularly for enterprise deployment.

Compute Costs

The first major disadvantage relates to compute costs. The penalty for having these models process so much data is that they are very expensive to train. This high cost makes it difficult for smaller enterprises to train a foundation model on their own.

By the time foundation models reach several billion parameters, they also become very expensive to run for inference. Multiple GPUs may be needed at once just to host a model and serve requests, which makes them costlier to operate than traditional approaches.
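A back-of-the-envelope calculation shows why several GPUs can be needed just to host a model. The sketch below estimates only the memory for the weights, assuming 16-bit parameters (2 bytes each). The model sizes and the 80 GB GPU capacity are illustrative assumptions, and real deployments need additional memory for activations and other runtime state, so this is a lower bound.

```python
import math


def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def min_gpus_for_weights(num_parameters, gpu_memory_gb, bytes_per_parameter=2):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    needed = weight_memory_gb(num_parameters, bytes_per_parameter)
    return math.ceil(needed / gpu_memory_gb)


# Hypothetical model sizes, hosted on hypothetical 80 GB GPUs.
for params in (7_000_000_000, 70_000_000_000):
    print(
        f"{params / 1e9:.0f}B parameters:",
        f"{weight_memory_gb(params):.0f} GB of weights,",
        f"at least {min_gpus_for_weights(params, gpu_memory_gb=80)} GPU(s)",
    )
```

Under these assumptions, a 70-billion-parameter model needs 140 GB for its weights alone, which already exceeds a single 80 GB GPU.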

The substantial computational requirements create barriers to entry and ongoing operational costs that organizations must carefully evaluate.

Trustworthiness Concerns

The second disadvantage of foundation models centers on trustworthiness. While the vast amount of data these models have processed is a huge advantage, it also comes at a cost, especially in the domain of language.

Many of these models are trained on language data scraped from the Internet. The volume of data these models have been trained on is so enormous that even with a whole team of human annotators, it would be impossible to vet every single data point to ensure it is not biased and does not contain hate speech or other toxic information.

This challenge is compounded by the fact that for many open source models that have been released, the exact datasets used for training are often unknown. This lack of transparency leads to trustworthiness issues that are particularly concerning in business settings where accountability and reliability are paramount.


IBM’s Approach to Foundation Models

IBM recognizes the tremendous potential of foundation model technologies while acknowledging the challenges they present. Research efforts at IBM focus on multiple innovations to address the limitations of these models.

Focus Areas for Improvement

IBM Research partners are working on innovations to improve the efficiency of foundation models. Parallel efforts address the trustworthiness and reliability of these models to make them more relevant and suitable for business settings.

These research initiatives aim to make foundation models more accessible to enterprises of all sizes while ensuring they meet the rigorous standards required for production deployment.


Foundation Models Beyond Language

All the examples discussed so far have focused on the language domain, but foundation models can be applied to many other domains. The foundation model paradigm extends far beyond text processing.

Vision Foundation Models

Foundation models for vision have gained significant attention. Models such as DALL-E 2 take text data as input and use it to generate custom images. This capability opens up new possibilities for creative applications and visual content generation.

Code Foundation Models

Foundation models for code have emerged as powerful tools for software development. Products like Copilot can help complete code as it is being authored, accelerating development and assisting programmers with complex coding tasks.

IBM’s Multi-Domain Innovation

IBM is innovating across all of these domains and more. The company integrates foundation models into various products and services:

Domain     Product Integration                      Capability
Language   Watson Assistant and Watson Discovery    Natural language understanding and generation
Vision     Maximo Visual Inspection                 Image analysis and defect detection
Code       Project Wisdom with Red Hat              Ansible code assistance and generation

Specialized Foundation Models

IBM’s foundation model work extends into specialized domains that address critical global challenges.

Chemistry Foundation Models

IBM recently published and released MolFormer, a foundation model designed to promote molecule discovery for different targeted therapeutics. This model demonstrates how foundation models can accelerate scientific discovery in pharmaceutical and chemical research.

Climate and Earth Science Models

Foundation models are being developed for climate change research. IBM is building earth science foundation models using geospatial data to improve climate research and modeling.

These specialized foundation models demonstrate the broad applicability of the foundation model paradigm across scientific and technical domains.


Making Foundation Models Enterprise-Ready

IBM’s work focuses on addressing the disadvantages of foundation models to make them suitable for enterprise deployment. This includes efforts to improve trustworthiness and efficiency.

Improving Trustworthiness

Research initiatives focus on making foundation models more trustworthy by addressing bias, ensuring transparency in model behavior, and implementing governance frameworks. These efforts are critical for enterprise adoption where accountability and ethical AI practices are essential.

Enhancing Efficiency

Work on improving efficiency aims to reduce the computational costs of both training and inference. More efficient foundation models can be deployed by a wider range of organizations and can operate more cost-effectively at scale.


Conclusion

Foundation models represent a paradigm shift in artificial intelligence, moving from task-specific models to versatile foundational capabilities that can be adapted to multiple applications. These models, trained on massive amounts of unstructured data through generative tasks, offer significant advantages in performance and productivity while presenting challenges in compute costs and trustworthiness. Large language models exemplify foundation models in the language domain, demonstrating remarkable capabilities through both tuning with small amounts of labeled data and prompt engineering with little to no labeled data. IBM’s multi-domain approach to foundation models spans language, vision, code, chemistry, and climate science, with ongoing research focused on improving efficiency and trustworthiness to make these powerful technologies more suitable for enterprise applications.


FAQs


Foundation models are AI models with broad capabilities that can be adapted to create more specialized models or tools for specific use cases. They represent a paradigm shift where a single foundational capability can drive multiple applications rather than requiring separate models for each task.

The term foundation models was first coined by a team from Stanford who observed that the field of AI was converging to a new paradigm centered on foundational capabilities rather than task-specific models.

In conventional AI, applications were built by training a library of different AI models where each model was trained on very task-specific data to perform a very specific task. The foundation model paradigm uses a single foundational capability that can drive multiple use cases and applications, with the same model transferable to any number of tasks.

Foundation models are trained on massive amounts of data in an unsupervised manner using unstructured data. In the language domain, this involves training on terabytes of sentence data where the model learns to predict the next word based on previous words, giving it broad knowledge that can be applied to diverse tasks.

The model is fed vast quantities of sentences and learns to predict the last word based on the words it observed before. For example, given “no use crying over spilled,” the model learns to predict “milk.” This process of predicting and generating the next word based on previous words is the core training mechanism.

Foundation models are part of Generative AI because they generate something new, specifically the next word in a sentence. This capability of predicting and generating the next word from the words that came before is what categorizes them as generative models.

Tuning is the process of adapting a foundation model by introducing a small amount of labeled data to update the model’s parameters. This allows the foundation model to perform very specific natural language tasks like classification or named entity recognition, even though it was originally trained for generation.

Prompt engineering, also called prompting, is a process that allows foundation models to be applied to specific tasks without requiring labeled training data. It involves providing the model with a sentence and a question, and the model generates the next natural word as the answer to the task.

To perform classification through prompting, provide the model with a sentence and ask it a question such as “Does this sentence have a positive sentiment or negative sentiment?” The model attempts to continue generating words after the question, and the next natural word is the answer (either positive or negative), based on its estimate of the sentiment.

Match each aspect (A-D) with its description (1-4).

Aspect                            Description
A. Performance Advantage          1. Models are very expensive to train and require multiple GPUs for inference
B. Productivity Advantage         2. Difficult to vet all training data for bias and toxic content
C. Compute Cost Disadvantage      3. Models drastically outperform those trained on few data points
D. Trustworthiness Disadvantage   4. Far less labeled data needed compared to training from scratch

Answer: A-3, B-4, C-1, D-2.

The chief advantage of foundation models is their superior performance. Because these models have been exposed to enormous amounts of data (measured in terabytes), they can drastically outperform models trained on just a few data points when applied to specific tasks.

Foundation models achieve productivity gains because through prompting or tuning, far less labeled data is needed to create task-specific models compared to starting from scratch. The models take advantage of all the unlabeled data they processed during pre-training, allowing organizations to develop AI solutions more quickly with less investment in data preparation.

Foundation models present two compute cost challenges:

  • They are very expensive to train, making it difficult for smaller enterprises to train models on their own
  • Models with billions of parameters are very expensive to run for inference, often requiring multiple GPUs at once just to host and execute them, which makes them costlier than traditional approaches

True or false: For most open source foundation models, the exact datasets used for training are publicly documented and fully transparent.

False. For many open source models that have been released, the exact datasets used for training are often unknown. This lack of transparency leads to trustworthiness issues that are particularly concerning in business settings where accountability and reliability are paramount.

Which of the following is a trustworthiness concern with foundation models?

  1. Foundation models are too small to be reliable
  2. The models cannot perform classification tasks
  3. Many models are trained on data scraped from the Internet, and the volume is so enormous that it’s impossible to vet every data point for bias or toxic content
  4. Foundation models only work with structured data

Answer: (3) Many foundation models are trained on language data scraped from the Internet. The volume of data is so enormous that even with a whole team of human annotators, it would be impossible to vet every single data point to ensure it is not biased and does not contain hate speech or other toxic information. This creates significant trustworthiness concerns, especially for enterprise applications.

IBM is innovating across multiple foundation model domains including:

  • Language models integrated into Watson Assistant and Watson Discovery
  • Vision models in Maximo Visual Inspection
  • Code models through Project Wisdom with Red Hat for Ansible
  • Chemistry models like MolFormer for molecule discovery
  • Earth science models using geospatial data for climate research

DALL-E 2 is a foundation model for vision that takes text data as input and uses it to generate custom images. It demonstrates how foundation models can be applied beyond language to create visual content from textual descriptions.

MolFormer is a foundation model published and released by IBM that is designed to promote molecule discovery for different targeted therapeutics. It demonstrates how foundation models can accelerate scientific discovery in pharmaceutical and chemical research.

IBM Research is working on multiple innovations to improve foundation models in two key areas:

  • Efficiency improvements to reduce computational costs of both training and inference, making models more accessible to organizations of all sizes
  • Trustworthiness and reliability enhancements to address bias, ensure transparency, and implement governance frameworks suitable for enterprise deployment

Which of the following is not mentioned in this document as a foundation model application?

  1. Language processing with Watson Assistant
  2. Image generation with DALL-E 2
  3. Autonomous vehicle navigation
  4. Climate research with geospatial data

Answer: (3) Autonomous vehicle navigation is not mentioned as a foundation model application domain in the document. The domains discussed include language, vision, code, chemistry, and climate/earth science.