Business analytics and intelligence is the next AI application area most likely to make a business case, and the one that leads most enterprises to believe that they need to self-host AI in the first place. IBM accounts tend to rely on IBM’s watsonx strategy here and, of all enterprises, show the most confidence in their approach to selecting a model. Meta’s Llama is now the favored strategy for other enterprises, surpassing the BLOOM and Falcon models. But the shift was fairly recent, so Llama is still a bit behind in deployment, though ahead in planning.
Business users of chatbots in customer-facing missions, those in the healthcare vertical, and even many planning AI in business analytics are increasingly interested in small language models (SLMs) as opposed to LLMs. SLMs are smaller in terms of number of parameters, and they’re trained for a specific mission on specialized data, even your own data. This training scope radically reduces the risk of hallucinations and generates more useful results in specialized areas. Some SLMs are essentially LLMs adapted to special missions, so the best way to find one is to search for an LLM for the mission you’re looking to support. If you have a vendor you trust in AI strategy, talking with them about mission-specific SLMs is a wise step. Enterprises that have used specialized SLMs (14 overall) agree that the SLM was a smart move, and one that can save you a lot of money in hosting.
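To see why a smaller model can cut hosting costs, it helps to run the basic arithmetic: the memory needed just to hold a model's weights is roughly its parameter count times the bytes stored per parameter. The sketch below is illustrative only. The model sizes (3 billion and 70 billion parameters) and the 80 GB of memory per GPU are assumptions chosen for the example, not figures reported by enterprises, and the estimate ignores working memory for prompts and responses, so real requirements will be higher.

```python
import math


def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Memory needed to hold the model weights alone, in GB (10^9 bytes).

    bytes_per_parameter=2 corresponds to 16-bit weights.
    """
    return num_parameters * bytes_per_parameter / 1e9


def min_gpus_for_weights(
    num_parameters: float,
    gpu_memory_gb: float = 80.0,  # assumed memory per GPU, for illustration
    bytes_per_parameter: int = 2,
) -> int:
    """Lower bound on GPUs needed just to fit the weights in GPU memory."""
    needed_gb = weight_memory_gb(num_parameters, bytes_per_parameter)
    return math.ceil(needed_gb / gpu_memory_gb)


# Hypothetical model sizes, chosen only to show the contrast.
slm_parameters = 3e9
llm_parameters = 70e9

for label, params in (("SLM", slm_parameters), ("LLM", llm_parameters)):
    print(
        f"{label}: {weight_memory_gb(params):.0f} GB of weights, "
        f"at least {min_gpus_for_weights(params)} GPU(s)"
    )
```

Under these assumptions the smaller model fits comfortably on a single GPU while the larger one has to be split across at least two, before any allowance for throughput or redundancy.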
GPUs and Ethernet networks
How about hosting? Enterprises tend to think of Nvidia GPUs, but they actually buy servers with GPUs included, so companies like Dell, HPE, and Supermicro may dictate GPU policy for enterprises. The number of GPUs enterprises commit to hosting has varied from about 50 to almost 600, but two-thirds of enterprises with fewer than 100 GPUs have reported adding them during early testing, and some with over 500 say they now believe they have too many. Most enterprise self-hosting planners expect to deploy between 200 and 400, and only two enterprises said they thought they’d use more than 450.
Enterprises are unlikely to try to install GPU boards in computers themselves, and most aren’t in favor of buying GPU boards for standard servers. That links in part to their realization that you can’t put a Corvette engine into a stock 1958 Edsel and expect to win many races. Good GPUs need fast memory, a fast bus architecture, and fast I/O and network adapters.
Ah, networks. The old controversy over whether to use Ethernet or InfiniBand has been settled among the enterprises either using or planning for self-hosted AI. They agree that Ethernet is the answer, and they also agree it should be as fast as possible. Enterprises recommend 800G Ethernet with both Priority Flow Control and Explicit Congestion Notification, and it is even offered as a white-box device. Enterprises agree that AI shouldn’t be mixed with standard servers, so think of AI deployment as a new cluster with its own fast cluster network. It’s also important to have a fast connection to the data center for access to company data, either for training or prompts, and to the VPN for user access.
If you expect to have multiple AI applications, you may need more than one AI cluster. It’s possible to load an SLM or LLM onto a cluster as needed, but more complicated to have multiple models running at the same time in the same cluster while protecting the data. Some enterprises had thought they might pick one LLM tool, train it for customer support, financial analysis, and other applications, and then use it for them all in parallel. The problem, they report, is the difficulty in keeping the responses isolated. Do you want your support chatbot to answer questions about your financial strategy? If not, it’s probably not smart to mix missions within a model.
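One way to picture mission isolation is a thin routing layer that sends each request to a model dedicated to that mission and refuses anything else. The sketch below is a minimal illustration of that idea, not a description of any enterprise's deployment. Every name in it (the missions, the model and cluster labels, the user roles) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEndpoint:
    """A model dedicated to one mission, and the cluster that hosts it."""
    name: str
    cluster: str


# Hypothetical mapping: one mission, one model, so responses cannot cross over.
MISSION_ROUTES = {
    "customer_support": ModelEndpoint("support-slm", "ai-cluster-a"),
    "financial_analysis": ModelEndpoint("finance-slm", "ai-cluster-b"),
}

# Hypothetical policy: which missions each kind of user may reach.
ALLOWED_MISSIONS = {
    "support_agent": {"customer_support"},
    "finance_analyst": {"financial_analysis"},
}


def route_request(user_role: str, mission: str) -> ModelEndpoint:
    """Return the dedicated endpoint for a mission, or refuse the request."""
    if mission not in MISSION_ROUTES:
        raise KeyError(f"no model is deployed for mission {mission!r}")
    if mission not in ALLOWED_MISSIONS.get(user_role, set()):
        raise PermissionError(f"{user_role!r} may not use mission {mission!r}")
    return MISSION_ROUTES[mission]


endpoint = route_request("support_agent", "customer_support")
print(endpoint.name, "on", endpoint.cluster)
```

Because the support model here was never trained on financial data and the router never sends it a financial question, the support chatbot has nothing to reveal about financial strategy. Achieving the same separation inside a single shared model is the part enterprises report as difficult.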