Authored by Santhosh Sivan
At Velatura Public Benefit Corporation, we are committed to building efficient, scalable, and cost-effective AI infrastructure. Our engineering team recently evaluated Runpod, a platform that provides on-demand GPU compute for AI/ML workloads.
Runpod’s model is straightforward — pay only when you use GPU compute — much like “Uber for GPUs.” This makes it ideal for organizations that want to access cutting-edge compute power without maintaining expensive hardware.
Problems It Solves
One of the biggest challenges in AI development is the high cost and complexity of accessing high-end GPUs such as the NVIDIA A100 and H100. Building an internal GPU cluster means significant capital expense, ongoing maintenance, and continual scaling work.
Runpod effectively addresses these challenges by offering:
- A consumption-based model that scales seamlessly with project demand.
- Simplified infrastructure setup, removing configuration overhead.
- Flexibility to run both experimental and production workloads efficiently.
Use Case We Tried
Our team leveraged Runpod to deploy a less-common open-source model that had previously suffered from high latency and cold-start issues.
We initially explored Amazon Bedrock, but limitations such as the lack of streaming API support made it a poor fit for this workload. Runpod allowed us to customize the deployment environment to meet our performance needs.
Why Runpod Helped
Runpod’s support for vLLM and native Hugging Face integration enabled our engineers to quickly spin up a serverless API for model serving.
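For illustration, here is a minimal client sketch of how such an endpoint can be queried, assuming the OpenAI-compatible route that Runpod's vLLM serverless workers expose. The endpoint ID, API key variable, and model name are placeholders, not our production values.

```python
# Minimal client sketch: querying a Runpod serverless vLLM endpoint through
# its OpenAI-compatible route. ENDPOINT_ID and the model name are placeholders,
# not Velatura's actual deployment values.
import os
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"  # hypothetical serverless endpoint ID

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

# Stream tokens as they are generated -- the capability we were missing elsewhere.
stream = client.chat.completions.create(
    model="example-org/example-open-source-model",  # placeholder Hugging Face model name
    messages=[{"role": "user", "content": "Summarize today's intake notes."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```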
We made use of Pods, which helped minimize cold starts and keep costs down. Our team also developed a lightweight automation script to start and stop Pods dynamically, so we paid for compute only during active use.
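The script itself is internal, but a minimal sketch of the idea, built on the runpod Python SDK, looks roughly like the following. The Pod ID, the idle-detection stub, and the status field name are assumptions for illustration rather than our actual logic.

```python
# Sketch of start/stop automation for a Runpod Pod using the runpod Python SDK.
# POD_ID and is_workload_active() are placeholders; the real script ties
# "active use" to application request activity.
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

POD_ID = "your-pod-id"  # hypothetical Pod ID


def is_workload_active() -> bool:
    """Stand-in for real idle detection (e.g. recent API traffic or queue depth)."""
    return False


def ensure_pod_state() -> None:
    pod = runpod.get_pod(POD_ID)
    # Field name assumed from Runpod's Pod schema.
    running = pod.get("desiredStatus") == "RUNNING"

    if is_workload_active() and not running:
        # Resume the stopped Pod so requests can be served.
        runpod.resume_pod(POD_ID, gpu_count=1)
    elif not is_workload_active() and running:
        # Stop the Pod so we are not billed for idle GPU time.
        runpod.stop_pod(POD_ID)


if __name__ == "__main__":
    ensure_pod_state()
```

In practice a scheduler or cron job can invoke this check on a short interval, which is how the "pay only during active use" behavior falls out.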
Where Runpod Could Improve
While Runpod is a strong solution, there are opportunities to further optimize the enterprise experience:
- Support & Onboarding: A more guided onboarding journey would benefit teams scaling into production.
- GPU Availability: The community pool can experience queuing delays during peak times.
- Intelligent Recommendations: Auto-suggesting GPU types and model-specific optimizations would further streamline deployment.
Overall Conclusion
With guidance from Velatura’s Chief AI Officer, our evaluation concluded that Runpod is an excellent choice for serving specialized open-source models, offering a great balance of cost efficiency, scalability, and developer control.
Though there is room for improvement in availability and customer experience, Runpod has become a valuable part of Velatura’s AI infrastructure strategy.
What’s Next
At Velatura, we have also built an internal framework that monitors and orchestrates these GPU Pods — giving us real-time visibility, cost tracking, and automated scaling.
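As a rough illustration of the idea (not our actual implementation), a minimal visibility loop built on the runpod Python SDK might look like this; the cost field name and polling interval are assumptions for the example.

```python
# Toy visibility loop: periodically list Pods and log status and hourly cost.
# This illustrates the concept only; the costPerHr and desiredStatus field
# names are assumed from Runpod's Pod schema.
import os
import time
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

POLL_SECONDS = 60  # arbitrary polling interval for the example


def snapshot() -> None:
    for pod in runpod.get_pods():
        print(
            f"{pod.get('name', pod['id'])}: "
            f"status={pod.get('desiredStatus')} "
            f"cost/hr=${pod.get('costPerHr', 0):.2f}"
        )


if __name__ == "__main__":
    while True:
        snapshot()
        time.sleep(POLL_SECONDS)
```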
We are preparing to open-source this framework so other teams can benefit from what we’ve learned. Stay tuned for the upcoming repository release — we can’t wait to share it with the community!