Bolstering enterprise LLMs with machine learning operations foundations
Generative AI, particularly large language models (LLMs), will play a crucial role in the future of customer and employee experiences, software development, and more. Building a solid foundation in machine learning operations (MLOps) will be critical for companies to effectively deploy and scale LLMs, and generative AI capabilities broadly. In this uncharted territory, improper management can lead to complexities organizations may not be equipped to handle.
Back to basics for emerging AI
To develop and scale enterprise-grade LLMs, companies should demonstrate five core characteristics of a successful MLOps program, starting with deploying ML models consistently. Standardized, consistent processes and controls should monitor production models for drift and for data and feature quality. Companies should be able to replicate and retrain ML models with confidence, moving them through quality assurance and governance processes to deployment without much manual work or rewriting. Lastly, they should ensure their ML infrastructure is resilient (with multiregional availability and failure recovery), consistently scanned for cyber vulnerabilities, and well managed.
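As one illustration of what monitoring for drift can look like, the sketch below computes a population stability index (PSI) between a baseline sample of a feature and a production sample. PSI is a commonly used drift statistic, but it is our choice for this example rather than a method the article prescribes; the function name, bin count, and sample data are all assumptions.

```python
import math

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample and a production sample of one feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width buckets over the baseline range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical data: a baseline feature and a shifted production sample.
baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]
```

A PSI of zero means the two samples fall into the buckets in identical proportions; larger values indicate a bigger shift. The alert threshold is a judgment call that varies by team and by feature.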
Once these components are in place, more complex LLM challenges will require nuanced approaches and considerations—from infrastructure to capabilities, risk mitigation, and talent.
Deploying LLMs as a backend
Inferencing with traditional ML models typically involves packaging a model object as a container and deploying it on an inferencing server. As the demands on the model increase (more requests and more customers require more run-time decisions, meaning higher queries per second, or QPS, within a latency bound), all it takes to scale the model is to add more containers and servers. In most enterprise settings, CPUs work fine for traditional model inferencing. But hosting LLMs is a much more complex process that requires additional considerations.
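The horizontal scaling described here comes down to simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a capacity-planning tool: it assumes each replica sustains a known QPS within the latency bound, and the function name and the headroom parameter are hypothetical.

```python
import math

def replicas_needed(target_qps, per_replica_qps, headroom=0.3):
    """Replicas required to serve target_qps while keeping spare capacity.

    per_replica_qps is the throughput one container sustains within the
    latency bound (an assumed, measured figure); headroom is the fraction
    of each replica's capacity deliberately left unused.
    """
    if per_replica_qps <= 0:
        raise ValueError("per_replica_qps must be positive")
    usable = per_replica_qps * (1.0 - headroom)
    return max(1, math.ceil(target_qps / usable))

# Illustrative numbers only: 1,000 QPS of demand, 50 QPS per container.
print(replicas_needed(1000, 50, headroom=0.0))  # 20
print(replicas_needed(1000, 50, headroom=0.5))  # 40
```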
LLMs operate on tokens, the basic units of text (whole words or fragments of words) that the model uses to generate human-like language. They generally make predictions on a token-by-token basis in an autoregressive manner, with each new token based on previously generated tokens, until a stop token is reached. The process can become cumbersome quickly: tokenization varies based on the model, task, language, and computational resources. Engineers deploying LLMs need not only infrastructure experience, such as deploying containers in the cloud, but also knowledge of the latest techniques to keep the inferencing cost manageable and meet performance SLAs.
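To make the token-by-token loop concrete, here is a minimal sketch of autoregressive generation. The "model" is a toy lookup table standing in for a real LLM, and every name in it (generate, toy_next_token, the "<eos>" stop marker) is hypothetical. The point is the shape of the loop: one model call per generated token, which is a large part of why LLM inference is costly.

```python
def generate(next_token, prompt_tokens, stop_token, max_new_tokens=32):
    """Append one predicted token at a time until the stop token appears."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each prediction is conditioned on everything generated so far.
        token = next_token(tokens)
        if token == stop_token:
            break
        tokens.append(token)
    return tokens

# Toy stand-in for a model: predicts the next token from the last one only.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")

print(generate(toy_next_token, ["the"], "<eos>"))  # ['the', 'cat', 'sat']
```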
Vector databases as knowledge repositories
Deploying LLMs in an enterprise context means vector databases and other knowledge bases must be established to work together in real time with document repositories and language models, producing reasonable, contextually relevant, and accurate outputs. For example, a retailer may use an LLM to power a conversation with a customer over a messaging interface. The model needs access to a database with real-time business data to call up accurate, up-to-date information about recent interactions, the product catalog, conversation history, company return policies, recent promotions and ads in the market, customer service guidelines, and FAQs. These knowledge repositories are increasingly developed as vector databases for fast retrieval against queries via vector search and indexing algorithms.
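The retrieval step can be sketched in a few lines. The example below ranks documents by cosine similarity between a query embedding and stored document embeddings. It is an exhaustive search over a tiny in-memory dictionary, whereas production vector databases rely on approximate indexing to stay fast at scale. The three-dimensional vectors and document names are invented for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vector, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings for a retailer's knowledge base.
INDEX = {
    "return_policy": [0.9, 0.1, 0.0],
    "spring_promotion": [0.1, 0.9, 0.0],
    "store_hours_faq": [0.0, 0.1, 0.9],
}

# A query embedding that sits close to the return-policy document.
print(top_k([1.0, 0.2, 0.0], INDEX, k=2))  # ['return_policy', 'spring_promotion']
```

The retrieved documents would then be passed to the LLM as context alongside the customer's message.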
Training and fine-tuning with hardware accelerators
LLMs have an additional challenge: fine-tuning for optimal performance against specific enterprise tasks. Large enterprise language models can have billions of parameters, which calls for more sophisticated approaches than traditional ML models require, including a persistent compute cluster with high-speed network interfaces and hardware accelerators such as GPUs (see below) for training and fine-tuning. Once trained, these large models also need multi-GPU nodes for inferencing, with memory optimizations and distributed computing enabled.
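A rough calculation shows why multi-GPU nodes come into play. The sketch below estimates the memory needed just to hold a model's weights, assuming 2 bytes per parameter (16-bit precision). The 70-billion-parameter model and the 80 GB accelerator are illustrative figures, not numbers from this article. The estimate is also a lower bound: it ignores activations, the attention cache used during generation, and, for training, gradients and optimizer state.

```python
import math

def weights_memory_gb(num_parameters, bytes_per_parameter=2):
    """Memory to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

def min_gpus_for_weights(num_parameters, gpu_memory_gb, bytes_per_parameter=2):
    """Fewest GPUs whose combined memory can hold the weights."""
    needed = weights_memory_gb(num_parameters, bytes_per_parameter)
    return math.ceil(needed / gpu_memory_gb)

# Illustrative: 70 billion parameters at 16-bit precision on 80 GB devices.
print(weights_memory_gb(70_000_000_000))         # 140.0
print(min_gpus_for_weights(70_000_000_000, 80))  # 2
```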
To meet computational demands, organizations will need to make more extensive investments in specialized GPU clusters or other hardware accelerators. These programmable hardware devices can be customized to accelerate specific computations such as matrix-vector operations. Public cloud infrastructure is an important enabler for these clusters.
A new approach to governance and guardrails
Risk mitigation is paramount throughout the entire lifecycle of the model. Observability, logging, and tracing are core components of MLOps processes, which help monitor models for accuracy, performance, data quality, and drift after their release. This is critical for LLMs too, but there are additional infrastructure layers to consider.
LLMs can “hallucinate,” occasionally producing false information. Organizations need proper guardrails (controls that enforce a specific format or policy) to ensure LLMs in production return acceptable responses. Traditional ML models rely on quantitative, statistical approaches to apply root cause analyses to model inaccuracy and drift in production. With LLMs, this is more subjective: it may involve qualitatively scoring the LLM’s outputs, then checking each output against an API with pre-set guardrails to ensure an acceptable answer.
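A guardrail in its simplest form is a check that runs on every response before it reaches the user. The sketch below enforces a hypothetical policy: responses must be non-empty, stay under a length limit, and contain nothing resembling a US Social Security number. The rules, names, and fallback message are assumptions made for illustration. Real guardrail layers are considerably richer, and a check like this does nothing to detect hallucinated facts.

```python
import re

# Hypothetical policy: block anything formatted like a US Social Security number.
BLOCKED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]

def check_response(text, max_chars=500):
    """Return the list of policy violations found in a model response."""
    violations = []
    if not text.strip():
        violations.append("empty")
    if len(text) > max_chars:
        violations.append("too_long")
    if any(pattern.search(text) for pattern in BLOCKED_PATTERNS):
        violations.append("blocked_pattern")
    return violations

def guarded_reply(text, fallback="Sorry, I can't help with that request."):
    """Pass the response through only if it violates no policy."""
    return fallback if check_response(text) else text

print(check_response("Your order ships Tuesday."))  # []
print(check_response("The number is 123-45-6789."))  # ['blocked_pattern']
```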
Governance of enterprise LLMs will be both an art and a science, and many organizations are still working out how to codify it into actionable risk thresholds. With new advances emerging rapidly, it’s wise to experiment with both open source and commercial solutions that can be tailored for specific use cases and governance requirements. This requires a very flexible ML platform, built on a control plane with high levels of abstraction. Such a foundation allows the platform team to add or subtract capabilities, and keep pace with the broader ecosystem, without impacting its users and applications. Capital One views building out a scaled, well-managed platform control plane with high levels of abstraction and multitenancy as critical to addressing these requirements.
Recruiting and retaining specialized talent
Depending on how much context the LLM is trained on and the tokens it generates, performance can vary significantly. Training or fine-tuning very large models and serving them in production at scale poses significant scientific and engineering challenges. This will require companies to recruit and retain a wide array of AI experts, engineers, and researchers.
For example, deploying LLMs and vector databases for a service agent assistant to tens of thousands of employees across a company means bringing together engineers experienced in a wide variety of domains, such as low-latency, high-throughput serving, distributed computing, GPUs, guardrails, and well-managed APIs. LLMs also need well-tailored prompts to provide accurate answers, which requires sophisticated prompt engineering expertise.
A deep bench of AI research experts is required to stay abreast of the latest developments, build and fine-tune models, and contribute research to the AI community. This virtuous cycle of open contribution and adoption is key to a successful AI strategy. Long-term success for any AI program will involve a diverse set of talent and experience combining data science, research, design, product, risk, legal, and engineering experts that keep the human user at the center.
Balancing opportunity with safeguards
While it is still early days for enterprise LLMs and new technical capabilities evolve on a daily basis, one of the keys to success is a solid foundational ML and AI infrastructure.
AI will continue accelerating rapidly, particularly in the LLM space. These advances promise transformation in ways that haven’t been possible before. As with any emerging technology, the potential benefits must be balanced with well-managed operational practices and risk management. A targeted MLOps strategy that considers the entire spectrum of models can offer a comprehensive approach to accelerating broader AI capabilities.
This content was produced by Capital One. It was not written by MIT Technology Review’s editorial staff.