
Batch Inference is Scaling, But It Doesn't Have to Be Scary
Inference at scale is evolving fast. GPU workloads are growing, costs are rising, and developer experience matters more than ever. Comparing self-hosted options shows trade-offs between performance and usability. Kubetorch brings the best of Kubernetes with fast iteration, strong observability, and zero added complexity, enabling efficient large-scale batch inference.

ML @ 🏃♀️Runhouse🏠
In this post, we cover three adjacent topics about inference. First, we share a few higher-level observations about trends in non-LLM inference. Then, we compare Kubetorch against a few of the systems we most commonly hear used for inference. Finally, we share measured throughput results for Kubetorch + GPUs with Knative or Ray, comparing them to show a) how any of these approaches destroys third-party OCR hosting costs and b) how a Kubernetes-native implementation is roughly as performant as Ray.
Batch Inference is Scaling
While LLMs have dominated headlines, they represent only a tiny sliver of use cases for inference. You probably already run large-scale batch inference workloads across traditional and generative AI use cases. Maybe it’s generating embeddings at scale, whether embeddings of multimodal data for search and AI agent memory or user, item, and content embeddings for recommendations. On the computer vision side of the world, some teams run millions of pages of OCR while others scan satellite images daily. Other teams might test billions of sample molecules for suitability for drug targets or billions of sub-nanosecond physics simulations.
Across all use cases, there are three notable trends affecting model inference:
- Model accuracy seems to magically improve with model and data scale, leading to larger models than before. This is obvious in the domain of LLMs, but has generally held true outside of language models as well. Netflix reports a nearly perfect log-linear relationship between parameter count and accuracy. Deployment is no longer universally trivial.
- Even for smaller models, optimized GPU inference for large batch sizes leads to dramatically better cost efficiency. In one benchmark we ran, we saw a 6.25x throughput per dollar improvement when using L4 GPUs versus CPUs on a 250 million parameter model. Beyond just training, teams are increasingly experimenting with the GPU capacity they have for inference.
- Widely available open-source pre-trained models can be easily deployed with open-source inference libraries like vLLM, SGLang, Triton, etc., dramatically lowering the barrier to entry for first-party model serving (a minimal vLLM sketch follows this list).
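
To make that last point concrete, here is a minimal sketch of offline batch inference with vLLM. The model id, prompts, and sampling settings are illustrative placeholders, not recommendations from this post.

```python
from vllm import LLM, SamplingParams

# Illustrative placeholders: swap in the open-source model and inputs you actually serve.
tickets = ["My order #1234 arrived damaged.", "I can't reset my password."]

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")       # downloads weights on first run
sampling = SamplingParams(temperature=0.0, max_tokens=256)

prompts = [f"Summarize the following support ticket:\n{t}" for t in tickets]
outputs = llm.generate(prompts, sampling)         # batches the whole list in one pass

for out in outputs:
    print(out.outputs[0].text)
```

The same handful of lines scales from a laptop-sized smoke test to a GPU-backed batch job, which is exactly why the barrier to entry for first-party serving has dropped so sharply.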
The result is that inference has quietly become extremely costly, since many teams run recurring inference at scale over millions to billions of inputs, frequently with GPU acceleration. To motivate the rest of this post with some real impact: we delivered 90% cheaper inference for a team previously spending more than $1MM a year on cloud OCR services, covering ~4 million pages of complex text extraction with a single offline batch inference service.
Fast, Cheap, or Easy: Pick Two
Building scaled applications capable of robustly handling millions of documents or billions of data points is hard. Teams now need to solve bin packing, batch sizing, and autoscaling, especially if accelerating inference with expensive GPUs. Worse, Kubernetes has emerged as the de facto compute substrate for ML, so developers are exposed to Kubernetes directly and forced to develop through deployment.
Comparing a few common ways to self-host scaled inference, it’s obvious that the deployment experience is not great:
- Cloud serverless offerings like AWS Lambda and Cloud Run work great until you run into rigidity and scale walls. These solutions are most suitable for small teams without a proper platform.
- Kubernetes is obviously good at serving at scale, but it imposes steep devX costs. The inability to iterate and test applications locally means redeploying to Kubernetes over and over, with 10-30 minutes (and lots of YAML) to test any minor code change.
- Ray is a popular option since it offers performance and scale while retaining Pythonic APIs. However, the adoption curve of Ray is typically steep, introducing new platform, debugging, and management challenges.
- Databricks Spark and other managed services abstract away the infrastructure, but impose steep markups on compute.
- Kubetorch was built to enable an extremely good developer experience on Kubernetes without limiting you. Use Python to deploy horizontally scaled resources at any scale, whether your program is regular Python classes and functions or a Ray program.
| Category | AWS Lambda (or “Serverless”) | Ray / KubeRay | Databricks Spark | Kubernetes with Inference Server | Kubernetes with Kubetorch |
|---|---|---|---|---|---|
| Developer Experience | Good, but limited; it's simple to write regular Python for simple tasks, but SDLC and dependency management are poor. | Mixed; push-button scale makes things easy, but the initial learning curve for Ray APIs is steep, and logs can be incredibly opaque. | Good; the process of spinning up distributed compute is abstracted away, and developers simply dispatch work into it. | Poor; YAML hell to wrap everything, and every dev loop takes 30 minutes to rebuild the image and redeploy to Kubernetes. | Good; simple Pythonic APIs to define and launch compute. No YAML. Iteration dev loops against a warm service are <2 seconds. |
| Debuggability | Good, but limited; funnels you into CloudWatch, which lacks the configurability of other solutions. | Poor; framework layering makes it hard to debug, workloads cannot be easily observed using standard platform tooling, and logging is not transparent. | Mixed; JVM stack traces and lazy computation introduce challenges, even as Databricks has added custom tooling to improve debugging. | Mixed; depends heavily on the observability already baked into the software platform. | Great; logs stream back locally during development, pods are accessible via SSH for direct debugging, and it integrates with any existing Kubernetes logging. |
| GPU Support | No GPU support. Significant limitations on CPU and memory scale as well. | Great; Ray was built with accelerated workloads in mind. | Mixed; RAPIDS enables better GPU support, but still at a limited plugin level. | Great; GPU support on Kubernetes is rich. | Great; GPU support in Kubetorch is rich. |
| Scale | Good, until you hit your quota limit of concurrent executions (1K-10K) | Good with occasional brittleness at extreme scale | Good with occasional brittleness at extreme scale | Good; Kubernetes is battle-tested at scale | Good; Kubetorch launches regular Kubernetes services. |
| Efficiency | Low; a cold start is paid on every single startup, and the compute is marked up. | Good; Ray is a distributed DSL designed for efficiently scheduling distributed jobs. | Bad; Databricks imposes a steep 100% markup on the underlying compute to run it for you. | Good; Kubernetes was built to saturate horizontally scaling services with many requests. | Good; Kubetorch does not add any framework layering or cluster-in-cluster structures. |
Whether you use Ray or regular deployments, Kubetorch is designed to be a zero-cost abstraction that gives you fast iteration and transparent debuggability without imposing any additional points of failure or disintermediating you from the compute.
Case Study: High-Quality OCR, Run Fast and Cheap
Context: For simple image-to-text, Google Cloud charges roughly $1.50 / 1,000 pages; complex tasks like table extraction and invoice parsing cost considerably more, from $10 to $30 per 1,000 pages. Previously, the markup was worth it; extracting structured information was difficult to implement first-party without a ton of expertise. However, a slate of recent improvements to open-source vision language models, including Baidu’s PaddleOCR-VL at 1B parameters and DeepSeek’s DeepSeek-OCR at 3B parameters, has made OCR incredibly accessible for most teams. The richness of VLMs enables highly customized extraction simply by playing around with the prompt.
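
To illustrate prompt-driven extraction, here is a minimal sketch using a generic chat-style VLM through Hugging Face transformers. The model id, file name, and message format are assumptions for illustration; PaddleOCR-VL and DeepSeek-OCR ship their own loading code and interfaces, which may differ.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumption: any chat-style VLM loadable via AutoModelForVision2Seq works here.
MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to("cuda")

# The "custom extraction" lives entirely in the prompt.
prompt = (
    "Extract every line item from this invoice as JSON with the fields "
    "description, quantity, unit_price, and total."
)
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
chat = processor.apply_chat_template(messages, add_generation_prompt=True)

page = Image.open("invoice_page_001.png")  # hypothetical input page
inputs = processor(text=chat, images=[page], return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Changing the extraction schema is just a matter of editing the prompt string, with no retraining or vendor-specific configuration.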
Approach: We deployed a DeepSeek-OCR service with Kubetorch to L4 GPUs (~$0.80 / hour on-demand in GCP & AWS per replica). We used Kubetorch both to deploy a regular autoscaling Kubernetes service and to launch a Ray cluster with the same resources (a hedged sketch of the deployment follows). We chose L4 GPUs because they are widely available (we almost never run into quota issues or even on-demand unavailability), and because for inference use cases their performance per dollar tends to be as good as A100s', while the lower VRAM and FLOPs per replica mean you run into fewer saturation issues.
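
As a hedged sketch only: the Kubetorch names below (kt.Compute, kt.Image, kt.cls, .to) and the GPU/image parameters are assumptions inferred from the workflow described in this post, not verbatim API; check the Kubetorch docs for exact signatures.

```python
import kubetorch as kt  # assumption: API names below are illustrative, not verbatim

class DeepSeekOCRService:
    """Loads the OCR VLM once per replica and serves batches of pages."""

    def __init__(self, model_id: str = "deepseek-ai/DeepSeek-OCR"):
        self.model_id = model_id
        self.model = None  # loaded on the replica, not on the client

    def extract(self, page_paths: list[str]) -> list[str]:
        # Run batched OCR over page images and return extracted text/JSON
        # (model loading and the forward pass are elided; see the sketch above).
        ...

if __name__ == "__main__":
    # One GPU per replica; the exact way to request an L4 node may differ.
    compute = kt.Compute(gpus=1, image=kt.Image(image_id="nvcr.io/nvidia/pytorch:24.08-py3"))
    ocr = kt.cls(DeepSeekOCRService).to(compute)
    print(ocr.extract(["page_0001.png", "page_0002.png"]))
```

The point of the structure is that the class is plain Python; the same code runs locally for iteration and lands on the cluster as a regular Kubernetes service.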
As part of this exercise, we also wanted to either prove or disprove the "Ray-FOMO" that many teams have; many people have asked whether Ray is worth adopting or not, and the answer is, "it depends."
- Why Ray: Ray Data has simplified APIs for batch data processing that wrap the process nicely. The same code may also work out-of-the-box across heterogeneous hardware and deployment sizes. Kubetorch is usable out of the box with Ray, so experimentation is easy.
- Why Not Ray: Avoiding Ray reduces framework layering; adding Ray in Kubernetes essentially creates a cluster-in-cluster structure that makes it more challenging to observe, autoscale, and schedule at a platform layer. It also adds debugging friction relative to a simpler implementation. Teams without existing Ray experience or looking for a more straightforward, regular, DSL-free implementation might avoid Ray.
Code Samples
- Deploying OCR as a horizontally scaling Kubernetes service: This example uses regular Python deployed to regular Kubernetes deployments, and achieves excellent performance without any framework layering or DSLs. The service we deploy here is already usable for online serving; the only advised change would be setting min_scale to 0 or 1 to enable scale-down behavior.
- Running OCR with Ray Data: We use Kubetorch to launch Ray clusters, keeping the same fast iteration and hot syncing of local code changes that we have with regular code, but this time using the Ray APIs to run the inference (a minimal sketch follows this list).
- Running OCR with Ray Data and Ray Serve: Ray Serve is the Ray ecosystem’s library for optimized inference, and we use it here to increase the throughput, though we add another layer, potentially adding more tracing and debugging challenges in production.
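
To make the Ray Data item concrete, here is a minimal sketch of the batch pass. The bucket paths are placeholders, and load_deepseek_ocr / model.extract are hypothetical helpers standing in for the model-loading and forward-pass code elided above.

```python
import ray

def load_deepseek_ocr():
    # Hypothetical helper: load the OCR VLM onto the actor's GPU once.
    raise NotImplementedError("load your OCR model here")

class OCRModel:
    def __init__(self):
        self.model = load_deepseek_ocr()  # runs once per actor replica

    def __call__(self, batch: dict) -> dict:
        # Ray Data hands us a dict of columns; run the model over page bytes.
        batch["text"] = [self.model.extract(img) for img in batch["bytes"]]
        return batch

# read_binary_files yields a "bytes" column, one row per page image.
ds = ray.data.read_binary_files("s3://my-bucket/pages/")

results = ds.map_batches(
    OCRModel,
    batch_size=8,      # pages per forward pass
    num_gpus=1,        # one L4 per model replica
    concurrency=4,     # size of the actor pool
)
results.write_parquet("s3://my-bucket/ocr-output/")
```

Ray Data handles the actor pool, batching, and backpressure; the trade-off, as noted above, is the extra framework layer sitting between you and Kubernetes.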
Results: All three approaches delivered roughly comparable throughput of approximately 0.22 pages / second per deployed GPU, so the choice to use Ray comes down entirely to your team’s familiarity with it.

In dollar terms: at ~0.22 pages / second, a single $0.80 / hour L4 replica processes roughly 790 pages per hour, which works out to about $1 / 1,000 pages of arbitrarily complex document extraction. This is a 10x improvement against the Google Cloud Document APIs for structured extraction and a 30x improvement against their invoice parser. The comparable cloud offerings from Azure and AWS are priced identically.

Implementation Recommendation: Several teams we talked to expressed hesitation about taking on the maintenance burden of deploying and managing their own services. In practice, this is not a huge concern, and you can always start with the following logic: wherever you call out to a third-party API today, simply try calling your first-party inference service instead, and upon exception or timeout, fall back to the third-party hosted service.
This way, you can access significant cost savings most of the time, without actually interrupting uptime or increasing debugging burden; your worst-case traversal is just the original code path.
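
A minimal sketch of that fallback logic is below. The endpoint URL and the two client functions are hypothetical stand-ins for your self-hosted service and the vendor API you already call.

```python
import requests  # assuming a simple HTTP client for the self-hosted service

def call_self_hosted_ocr(page_bytes: bytes, timeout_s: int) -> str:
    # Hypothetical endpoint for the first-party, Kubetorch-deployed service.
    resp = requests.post("http://ocr.internal/extract", data=page_bytes, timeout=timeout_s)
    resp.raise_for_status()
    return resp.text

def call_vendor_ocr(page_bytes: bytes) -> str:
    # Your existing third-party call stays exactly as it is today.
    ...

def extract_page(page_bytes: bytes) -> str:
    try:
        # Try the cheap self-hosted service first, with a hard timeout.
        return call_self_hosted_ocr(page_bytes, timeout_s=30)
    except Exception:
        # On any exception or timeout, fall back to the vendor path, so the
        # worst case is just the original code path.
        return call_vendor_ocr(page_bytes)
```
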