Inferencing has emerged as one of the most exciting aspects of generative AI large language models (LLMs).
A quick explainer: In AI inferencing, organizations take an LLM that is pretrained to recognize relationships in large datasets and use it to generate new content based on input, such as text or images. Crunching mathematical calculations, the model then makes predictions based on what it has learned during training.
Inferencing crunches millions or even billions of data points, requiring a lot of computational horsepower. As with many data-hungry workloads, the instinct is to offload LLM applications into a public cloud, whose strengths include speedy time-to-market and scalability.
Yet the calculus may not be so simple when one considers the costs to operate there as well as the fact that GenAI systems sometimes produce outputs that even data engineers, data scientists, and other data-obsessed individuals struggle to understand.
Inferencing and… Sherlock Holmes???
Data-obsessed individuals such as Sherlock Holmes knew full well the importance of inferencing in making predictions, or in his case, solving mysteries.
Holmes, the detective populating the pages of Sir Arthur Conan Doyle’s 19th-century detective novels, put it this way: “It is a capital mistake to theorize before one has data.” Without data, Holmes’ argument proceeds, one begins to twist facts to suit theories, rather than theories to suit facts.
Just as Holmes gathers clues, parses evidence, and presents deductions he believes are logical, inferencing uses data to make predictions that power critical applications, including chatbots, image recognition, and recommendation engines.
To understand how inferencing works in the real world, consider recommendation engines. As people frequent e-commerce or streaming platforms, the AI models track the interactions, “learning” what people prefer to purchase or watch. The engines use this information to recommend content based on users’ preference history.
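The recommendation idea above can be sketched in a few lines of Python. This is a toy illustration, not how any production engine works: the catalog, titles, and the `recommend` function are all hypothetical, and "learning" is reduced to counting how often a user picked each genre.

```python
from collections import Counter

# Hypothetical catalog for illustration only: item title -> genre.
CATALOG = {
    "The Hound Files": "mystery",
    "Baker Street Nights": "mystery",
    "Starship Dawn": "sci-fi",
    "Orbit Zero": "sci-fi",
    "Laugh Track": "comedy",
}

def recommend(history, catalog=CATALOG, k=2):
    """Return up to k unseen titles, ranked by how often the user chose each genre."""
    # "Learn" preferences by counting the genres in the interaction history.
    genre_counts = Counter(catalog[title] for title in history if title in catalog)
    seen = set(history)
    unseen = [title for title in catalog if title not in seen]
    # Most-preferred genres first; ties are broken alphabetically by title.
    unseen.sort(key=lambda title: (-genre_counts[catalog[title]], title))
    return unseen[:k]

print(recommend(["The Hound Files", "Starship Dawn"]))
```

A real engine would replace the genre counts with a trained model, but the shape is the same: past interactions go in, ranked suggestions come out.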
An LLM is only as strong as its inferencing capabilities. Ultimately, it takes a combination of the trained model and new inputs working in near real-time to make decisions or predictions. Again—AI inferencing is like Holmes because it uses its data magnifying glass to detect patterns and insights—the clues—hidden in datasets.
As practiced at solving mysteries as Holmes was, he often relied on a faithful sleuthing sidekick, Dr. Watson. Similarly, organizations may benefit from help refining their inferencing outputs with context-specific information.
One such assistant—or Dr. Watson—comes in the form of retrieval-augmented generation (RAG), a technique for improving the accuracy of LLMs’ inferencing using corporate datasets, such as product specifications.
Inferencing funneled through RAG must be efficient, scalable, and optimized to make GenAI applications useful. This inferencing and RAG combination also helps curb inaccurate information, as well as biases and other inconsistencies that can prevent correct predictions. In much the same way, Holmes and Dr. Watson piece together clues that may solve the mystery underlying the data they collected.
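For readers who want to see the Watson role in miniature, here is a rough Python sketch of the retrieval step in RAG. It is a simplification under stated assumptions: the documents and product names are made up, retrieval is plain word overlap rather than the vector search real systems typically use, and no actual LLM is called. The sketch only builds the prompt an LLM would receive.

```python
import re

# Hypothetical corporate snippets standing in for product specifications.
DOCUMENTS = [
    "The Alpha laptop has a 14-inch display and 16 GB of memory.",
    "The Beta server supports up to two GPUs and 24 drive bays.",
    "Standard warranty coverage lasts three years from purchase.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents=DOCUMENTS):
    """Return the document sharing the most words with the question."""
    question_words = tokenize(question)
    return max(documents, key=lambda doc: len(question_words & tokenize(doc)))

def build_prompt(question, documents=DOCUMENTS):
    """Prepend the retrieved context so the model can ground its answer."""
    context = retrieve(question, documents)
    return f"Context: {context}\nQuestion: {question}\nAnswer using only the context."

print(build_prompt("How many GPUs does the Beta server support?"))
```

The design point is that the model is handed the relevant corporate fact alongside the question, instead of being left to guess from its training data alone.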
Cost-effective GenAI, on premises
Of course, here’s something that may not be mysterious for IT leaders: building, training, and augmenting AI stacks can consume large chunks of budget.
Because LLMs consume significant computational resources as model parameters expand, consideration of where to allocate GenAI workloads is paramount.
Because running LLMs in a public cloud can incur high compute, storage, and data transfer fees, the corporate datacenter has emerged as a sound option for controlling costs.
It turns out LLM inferencing with RAG running open-source models on-premises can be 38% to 75% more cost-effective than the public cloud, according to new research from Enterprise Strategy Group commissioned by Dell Technologies. The percentage varies as the size of the model and the number of users grow.
Cost concerns aren’t the only reason to conduct inferencing on premises. IT leaders understand that controlling their sensitive IP is critical. Thus, the ability to run a model held closely in one’s datacenter is an attractive value proposition for organizations for whom bringing AI to their data is key.
AI factories power next-gen LLMs
Many GenAI systems require significant compute and storage, as well as chips and hardware accelerators primed to handle AI workloads.
Servers equipped with multiple GPUs, which accommodate the parallel processing techniques that support large-scale inferencing, form the core of emerging AI factories, which include end-to-end solutions tailored to organizations’ unique AI requirements.
Orchestrating the right balance of platforms and tools requires an ecosystem of trusted partners. Dell Technologies is working closely with NVIDIA, Meta, HuggingFace, and others to provide solutions, tools, and validated reference designs that span compute, storage, and networking gear, as well as client devices.
True, sometimes the conclusions GenAI models arrive at remain mysterious. But IT leaders shouldn’t have to pretend to be Sherlock Holmes to figure out how to run them cost-effectively while delivering the desired outcomes.
Learn more about Dell Generative AI.