This document covers the complete AI application development journey, from ideation and model selection through building with RAG and fine-tuning to production deployment with MLOps best practices.
This document explores the three main phases of AI application development—ideation and experimentation, building with frameworks and techniques like RAG and fine-tuning, and operationalizing with MLOps—providing developers with practical guidance for creating production-ready AI-powered applications using open source tools and technologies.
Recent data from Gartner indicates that 80% of enterprises will have used some type of generative AI through models or APIs by 2026. While many developers have experience using AI through co-pilots in IDEs and popular large language models online, building applications that actually use AI represents a different challenge. The accessibility of AI development has improved significantly, making it easier than ever for developers to get started.
The journey from a simple proof of concept to a production application involves three main steps: ideation and experimentation, building the application, and development and operations. Open source tools and technologies are available to help throughout the inner loop development process of building, running, and testing applications with AI capabilities.
The first step in building an application that uses generative AI is ideating around exploration and proof of concepts. This phase can be broken down into several essential steps.
Use cases are specialized, requiring specialized models that can do the job effectively. The process begins with researching and evaluating models from popular repositories like Hugging Face or the open source community. Different factors must be considered during evaluation:
| Factor | Consideration |
|---|---|
| Model Size | Resource requirements and computational needs |
| Performance | Speed, accuracy, and task-specific capabilities |
| Benchmarking | Evaluation using popular benchmark tools |
Several general principles guide model selection decisions:
Self-hosting a large language model will generally be cheaper than a cloud-based service. Small language models (SLMs) versus large language models (LLMs) will generally perform better with lower latency and are specialized for specific tasks.
Understanding various prompting techniques is essential when working with models:
Zero-shot prompting involves asking a model a question without any examples of how to respond. This tests the model’s baseline capabilities without additional context.
Few-shot prompting provides a few different examples of how to respond, demonstrating the behavior desired from the LLM as work progresses with the AI.
Chain of thought prompting asks the model to explain its thinking process step by step, making the reasoning transparent and verifiable.
The different capabilities and limitations of models must be understood early in the process. Experimenting with data during this phase helps identify potential challenges that might arise as the AI journey progresses.
After evaluating models for the use case, the building phase begins. Similar to how databases and different services can be locally run on a machine, AI can also be served locally and accessed through its API from localhost.
Running models locally provides the added benefit of keeping data secure and private on premise. This approach becomes particularly important when working with sensitive or proprietary information.
RAG is a method for using data with large language models. The approach takes a pre-trained foundational model and supplements it with relevant and accurate data. This helps provide better and more accurate responses by grounding the model’s outputs in specific, up-to-date information.
Fine-tuning represents an alternative approach where the large language model includes the data within it. This process bakes information about desired behavior, styles, and intuition directly into the model itself. When the model is inferenced, it has domain-specific data available every time, without requiring external retrieval.
These represent just two approaches among many available methods for incorporating data into AI applications.
Having the right tools and frameworks, such as LangChain, simplifies development work. These frameworks enable focus on building new features for popular generative AI use cases:
The simplification comes through streamlining the different calls made through the model using sequences of prompts and model calls to accomplish more complex tasks. Problems are broken down into smaller, more manageable steps, with evaluation of flows during model calls in both development and production environments.
The final step involves deploying the AI-powered application to production to enable scaling. This falls under the umbrella of machine learning operations, or MLOps.
Infrastructure must be able to handle efficient model deployment and scaling. Technologies such as containers and orchestrators like Kubernetes help accomplish this by:
Production-ready runtime environments, such as vLLM, can be used for model serving to ensure optimal performance.
Organizations are increasingly taking a hybrid approach with both models and infrastructure:
Multi-model approach: Different models for different use cases, similar to a Swiss Army knife with specialized tools for specific tasks.
Hybrid infrastructure: A combination of on-premise and cloud infrastructure to maximize resource utilization and budget efficiency.
With AI-powered applications in production, the work continues. Ongoing requirements include:
Similar to how DevOps practices ensure smooth deployments, MLOps ensures models go into production in a controlled, reliable fashion.
Recent innovations in AI have made this topic much more accessible for developers. Numerous tools are available throughout the process to support development efforts. While AI represents new capabilities, it functions as another tool that can be added to the developer tool belt.
The complete process flows from ideation through building to deployment. These steps enable developers to make real impact with their work by leveraging generative AI capabilities in practical, production-ready applications.
The three-phase approach to AI application development provides a structured path forward:
The ideation and experimentation phase focuses on model research, evaluation, and understanding prompting techniques. The building phase leverages frameworks like LangChain and techniques like RAG or fine-tuning to create functional applications. The operationalization phase applies MLOps practices to deploy, scale, and monitor AI-powered applications in production environments.
Throughout all phases, open source tools and technologies provide the foundation for efficient development. Local model serving offers security and privacy benefits, while hybrid approaches to both models and infrastructure maximize flexibility and resource efficiency.
Building applications with generative AI follows a clear progression from ideation to production. The accessibility of modern tools and frameworks has lowered the barrier to entry for developers, making AI application development a practical skill to acquire. By understanding the three main phases—ideation and experimentation, building, and operationalization—developers can successfully create AI-powered applications that deliver real value. The key is to approach AI as another tool in the developer toolkit, using established development practices adapted for the unique requirements of machine learning operations.
(2) The three main phases of AI application development are ideation and experimentation, building the application, and development and operations (MLOps). These phases guide developers from proof of concept to production-ready AI-powered applications.
| Technique | Description |
|---|---|
| A. Zero-shot prompting | 1. Asking the model to explain its thinking process step by step |
| B. Few-shot prompting | 2. Asking a model without any examples of how to respond |
| C. Chain of thought prompting | 3. Providing several examples demonstrating desired behavior |
| D. Fine-tuning | 4. Baking domain-specific information directly into the model |
A-2, B-3, C-1, D-4.
Self-hosting a large language model is generally more expensive than using a cloud-based service.
False. Self-hosting a large language model will generally be cheaper than a cloud-based service. However, this comes with the trade-off of managing infrastructure and resources internally.
(2) RAG supplements a pre-trained model with relevant external data during inference, while fine-tuning bakes domain-specific information, desired behaviors, and styles directly into the model itself. This means RAG retrieves data externally while fine-tuning modifies the model’s parameters.
Experimenting with data early in the ideation phase helps identify potential challenges that might arise as the AI journey progresses. This early experimentation allows developers to:
Running models locally provides several benefits:
This approach becomes particularly important when working with sensitive or proprietary information.
(3) LangChain is mentioned as a framework that simplifies development work by enabling focus on building new features for popular generative AI use cases like chatbots, IT process automation, and data management through streamlined model calls and prompt sequences.
Once an AI-powered application is deployed to production, the development work is complete and no further monitoring is needed.
False. With AI-powered applications in production, the work continues. Ongoing requirements include benchmarking performance, monitoring behavior, and handling different exceptions from the application. MLOps ensures continuous management of production models.
| Technology | Purpose |
|---|---|
| A. Kubernetes | 1. Production-ready runtime for model serving |
| B. Containers | 2. Managing resource allocation and balancing traffic |
| C. vLLM | 3. Auto-scaling based on demand |
| D. Hybrid infrastructure | 4. Combining on-premise and cloud resources for efficiency |
A-3, B-2, C-1, D-4.
(3) Chain of thought prompting asks the model to explain its thinking process step by step, making the reasoning transparent and verifiable. This is essential for compliance purposes where understanding the model’s decision-making process is required.
(4) is incorrect. The document states that self-hosting a large language model will generally be cheaper than a cloud-based service. Both self-hosted and cloud-based models should be considered, with cost being one of several evaluation factors.
Popular generative AI use cases mentioned include:
Frameworks like LangChain help simplify development for these use cases by streamlining model calls and enabling sequences of prompts to accomplish more complex tasks.
Breaking down problems into smaller, manageable steps enables:
This approach aligns with the principle of using frameworks to simplify complex AI workflows.
(3) The document states that recent innovations in AI have made this topic much more accessible for developers, and that it has become easier than ever to get started. Open source tools and technologies provide the foundation for efficient development.