Infrastructure optimizations in AI

Managing GPU requirements from Training to Inference

VERONICA NIGRO
Jul 23, 2024
The generative AI landscape is experiencing a seismic shift. As the focus moves from model training to model inference, new challenges arise, including GPU scarcity and the long-term sustainability of cloud-centric approaches. Concerns span cost, power consumption, latency, and the security and privacy of processed data.

Training AI models is a resource-heavy process that requires vast computational power and significant time investment. Once a model is trained, however, the focus shifts to inference: deploying it in real-world applications where it generates outputs from new data. Unlike the relatively steady demands of training, inference workloads are highly variable, with spikes in demand driven by real-time user interactions.
[Figure: Training and inference have different computational requirements]

The Challenge for GenAI Companies

For GenAI companies, the situation is particularly challenging. When an app or a new feature goes viral, the sudden influx of users can overwhelm existing infrastructure, leading to slow response times, increased costs, and frustrated users. Startups need flexible, scalable solutions that can handle these demand spikes without breaking the bank. Currently, due to GPU scarcity, companies are forced into fixed contracts of 12 months or more with hyperscalers like AWS and GCP. Such contracts, though expensive, may be viable for training given its constant requirements, but they are far from suitable for inference: companies are left without enough GPUs during demand spikes and with idle GPUs during dips, resulting in lost sales and wasted money.
[Figure: Hyperscalers force GenAI companies into 12+ month fixed contracts]
New startups have emerged alongside cloud providers to offer model inference. They aim to differentiate themselves through developer experience, product design, and more flexible terms, but they add premiums to cover their own costs from long-term contracts with hyperscalers. The problem goes beyond scarcity: even if affordable GPUs were available on demand for inference, inefficiency persists at another layer. Some users make far more computationally intensive queries than others, even against the same model (e.g. asking ChatGPT for a list of all Roman emperors vs. generating an image). As a result, the utilization of any single GPU is also very “spiky.” Over the course of a minute, real-world utilization of a single GPU might look like this:
[Figure: Individual GPUs operate on average at <15% utilization]
So, while an organization might know the maximum performance of its GPUs, and that 100% of them are in use, each individual GPU is on average operating below 15% utilization due to these spiky patterns. Companies like Run AI, recently acquired by Nvidia for $700 million, are addressing this by building solutions that share capacity across multiple users working on the same node. Implementing such sharing across multiple companies and hundreds of distributed nodes, however, is an entirely different challenge. Ideally, “true sharing” will soon be possible, letting many clients access chunks of a GPU remotely and simultaneously, flexing up to 100% of it when their workloads spike while leaving that capacity available to other clients the rest of the time.
[Figure: Simultaneous GPU sharing remains a challenge]
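To make the gap concrete, here is a toy simulation (illustrative only, not mkinf's scheduler): a fleet of clients whose load is idle most of the time and occasionally bursts to a full GPU, compared under dedicated GPUs versus a shared pool. The spike probability, idle load, and pool size are assumptions chosen to roughly match the “<15%” figure above.

```python
import random

# Toy model: each client is idle most of the time and occasionally
# bursts to a full GPU -- the "spiky" pattern described above.
# All parameters are illustrative assumptions, not measured data.
SPIKE_PROBABILITY = 0.10   # fraction of time slices a client is bursting
IDLE_LOAD = 0.04           # background load between bursts (fraction of a GPU)
CLIENTS = 100              # number of clients / workloads
POOL_GPUS = 20             # shared pool sized for aggregate, not peak, demand
TIME_SLICES = 10_000       # simulated one-second slices

random.seed(0)

def client_demand() -> float:
    """Demand of one client in one time slice, in fractions of a GPU."""
    return 1.0 if random.random() < SPIKE_PROBABILITY else IDLE_LOAD

dedicated_util = 0.0   # running sum: each client owns one GPU
pooled_util = 0.0      # running sum: clients share POOL_GPUS GPUs
unmet_slices = 0       # slices where the pool could not cover demand

for _ in range(TIME_SLICES):
    demands = [client_demand() for _ in range(CLIENTS)]
    total = sum(demands)
    # Dedicated: each GPU only ever sees its own client's demand.
    dedicated_util += total / CLIENTS
    # Pooled: aggregate demand is spread across the shared GPUs.
    pooled_util += min(total, POOL_GPUS) / POOL_GPUS
    if total > POOL_GPUS:
        unmet_slices += 1

print(f"avg utilization, one GPU per client: {dedicated_util / TIME_SLICES:.1%}")
print(f"avg utilization, shared pool       : {pooled_util / TIME_SLICES:.1%}")
print(f"time slices with unmet demand      : {unmet_slices / TIME_SLICES:.1%}")
```

Under these assumptions, the dedicated fleet averages around 14% utilization while the shared pool runs near 70% with only occasional unmet demand, which is exactly the gap that GPU-sharing schemes try to close.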
Directly optimizing an ML model is essential for high-performance inference, but the infrastructure used to serve that optimized model can have just as much impact (if not more) on how the model performs when delivered to the end user.

Tips for accessing more GPUs for Inference

To help companies find more GPUs for inference on flexible and affordable terms, consider these tips:
- Multi-region: Run workloads across multiple regions within a given cloud provider. AWS, for example, tends to have more on-demand GPUs available in its Virginia and Ohio data centers than in Dublin and Frankfurt. A multi-region approach can also reduce latency by placing GPUs closer to users (a minimal fallback sketch follows this list).
- Multi-cloud: Widen your selection by using multiple GPU cloud providers to avoid vendor lock-in and increase your chances of finding available resources when needed.
- Select the right GPUs: Test and choose GPUs that match your specific workload. H100s and A100s are extremely performant, but opting for a “less powerful card” for inference, like an L40S or an A6000, might only cost you <10% in performance while cutting your bill by 60% or more.
- Form tight partnerships with “smaller” providers: Partner with GPU providers for benefits hyperscalers can't offer. Besides shorter commitments, they can tell you in advance when new stock is arriving or when existing clients might not renew their commitments.
- Use free credits strategically: Diversify usage by switching to another cloud when credits run out or become too limiting due to quotas or cost inefficiencies.
- Adopt a distributed solution like mkinf: Leverage platforms that abstract away infrastructure decisions, running globally distributed infrastructure across cloud providers, independent data centers, and geographic regions on your behalf.
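As a minimal illustration of the multi-region tip (AWS-only for brevity, though the same loop generalizes to other providers), the sketch below walks a preference-ordered list of regions and instance types and falls back whenever EC2 reports a capacity or quota error. The region order, instance types, and placeholder AMI IDs are assumptions to adapt to your own account and workload; this is not an mkinf API.

```python
import boto3
from botocore.exceptions import ClientError

# Preference-ordered regions: US East regions tend to have more
# on-demand GPU capacity than European ones, as noted above.
REGIONS = ["us-east-1", "us-east-2", "eu-west-1", "eu-central-1"]
# A100 instances first, then a cheaper A10G fallback.
INSTANCE_TYPES = ["p4d.24xlarge", "g5.12xlarge"]
# Placeholder AMI IDs -- substitute a GPU-enabled image per region.
AMI_BY_REGION = {region: "ami-REPLACE_ME" for region in REGIONS}

def launch_gpu_instance():
    """Try each (region, instance type) pair until capacity is found."""
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        for instance_type in INSTANCE_TYPES:
            try:
                response = ec2.run_instances(
                    ImageId=AMI_BY_REGION[region],
                    InstanceType=instance_type,
                    MinCount=1,
                    MaxCount=1,
                )
                instance_id = response["Instances"][0]["InstanceId"]
                print(f"launched {instance_type} in {region}: {instance_id}")
                return region, instance_id
            except ClientError as err:
                code = err.response["Error"]["Code"]
                # Out of capacity or over quota: move on to the next
                # instance type or region instead of failing outright.
                if code in ("InsufficientInstanceCapacity", "VcpuLimitExceeded"):
                    print(f"{instance_type} unavailable in {region}: {code}")
                    continue
                raise

    raise RuntimeError("no GPU capacity found in any configured region")

if __name__ == "__main__":
    launch_gpu_instance()
```

The same structure extends to the multi-cloud tip: put each provider's client behind a common "try to launch" interface and iterate over providers in an outer loop.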

mkinf's Solution

At mkinf, we are committed to providing innovative solutions that empower startups and companies to navigate the challenges of AI compute. By aggregating buffer capacity from independent data centers and providers worldwide (now counting 33 and growing), we offer a scalable, cost-effective, and sustainable approach to AI compute. Our multi-cloud, multi-region strategy ensures high GPU availability, reduced latency, and significant cost savings.

For startups and companies facing the pressures of viral model deployment, mkinf offers a lifeline via its API (now in beta) and platform (soon to be released), enabling them to meet user demand without compromising performance while reducing compute costs by about 70%. We also rent capacity from companies with long-term contracts during their idle periods, optimizing resource utilization and providing them with an additional revenue stream.

Join us in redefining the future of AI inference and discover how our platform can help you scale efficiently in an increasingly competitive landscape. Let's innovate and optimize AI inference together, making the most of every GPU. For more information or to discuss how our solutions can benefit your organization, reach out to us at info@mkinf.io.
Don’t miss out on what’s coming: follow mkinf on X or join our Slack community.