
Batch Inference is Scaling, But It Doesn't Have to Be Scary
Inference at scale is evolving fast. GPU workloads are growing, costs are rising, and developer experience matters more than ever. Comparing self-hosted options shows trade-offs between performance and usability. Kubetorch brings the best of Kubernetes with fast iteration, strong observability, and zero added complexity, enabling efficient large-scale batch inference.

ML @ 🏃♀️Runhouse🏠
In this post, we cover three adjacent topics about inference. First, we share a few higher-level observations about trends in non-LLM inference. Then, we compare Kubetorch with a few of the systems we most commonly hear used for inference. Finally, we share measured throughput results for batch inference on GPUs with Kubetorch, deployed both as Knative-backed services and on Ray; the comparison shows relatively little need for Ray and, across three different implementations, dramatic savings versus third-party OCR hosting.
Batch Inference is Scaling
While LLMs have dominated headlines, they represent only a tiny sliver of use cases for inference. You probably already run large-scale batch inference workloads across traditional and generative AI use cases. Maybe it’s generating embeddings at scale, whether embeddings of multimodal data for search and AI agent memory or user, item, and content embeddings for recommendations. On the computer vision side of the world, some teams run millions of pages of OCR while others scan satellite images daily. Other teams might test billions of sample molecules for suitability for drug targets or billions of sub-nanosecond physics simulations.
Across all use cases, there are three notable trends affecting model inference:
- Model accuracy seems to improve almost magically with model and data scale, leading to larger models than before. This is obvious in the domain of LLMs, but it has generally held true outside of language models as well; Netflix, for example, reports a nearly perfect log-linear relationship between parameter count and accuracy. Deployment is no longer universally trivial.
- Even for smaller models, optimized GPU inference at large batch sizes is dramatically more cost-efficient. In one benchmark we ran on a 250-million-parameter model, L4 GPUs delivered a 6.25x improvement in throughput per dollar over CPUs. Beyond training, teams are increasingly putting the GPU capacity they already have to work for inference.
- Widely available open-source pre-trained models can be deployed easily with open-source inference libraries like vLLM, SGLang, and Triton, dramatically lowering the barrier to entry for first-party models (see the short vLLM sketch just below).
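To give a flavor of how low that barrier now is, here is a minimal offline batch-generation sketch with vLLM. The model ID and prompts are illustrative placeholders, not the models discussed later in this post:

```python
# Minimal sketch: offline batch generation with vLLM.
# The model ID and prompts are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # any open model hosted on Hugging Face
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = [f"Summarize record {i} ..." for i in range(10_000)]
outputs = llm.generate(prompts, params)  # vLLM batches and schedules requests internally

for out in outputs[:3]:
    print(out.outputs[0].text)
```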
The result is that inference has quietly become a major cost center: many teams run recurring inference at scale over millions to billions of inputs, frequently with GPU acceleration. To motivate the rest of this post with some real impact, we delivered 90% cheaper inference for a team previously spending more than $1MM a year on cloud OCR services, covering ~4 million pages of complex text extraction with a single offline batch inference service.
Fast, Cheap, or Easy, Pick Two
Building scaled applications capable of robustly handling millions of documents or billions of data points is hard. Teams now need to solve bin packing, batch sizing, and autoscaling, especially if accelerating inference with expensive GPUs. Worse, Kubernetes has emerged as the de facto compute substrate for ML, so developers are exposed to Kubernetes directly and forced to iterate by redeploying.
Comparing a few common ways to self-host scaled inference, it’s obvious that the deployment experience is not great:
- Cloud serverless options like AWS Lambda and Cloud Run work great until you run into rigidity and scale walls. These solutions are most suitable for small teams without a proper platform.
- Kubernetes is obviously good at serving at scale, but it imposes steep devX costs. The inability to iterate and test applications locally means redeploying to Kubernetes over and over, with 10-30 minutes to test any minor code change (and lots of YAML).
- Ray is a popular option since it offers performance and scale while retaining Pythonic APIs. However, the adoption curve of Ray is typically steep, introducing new platform, debugging, and management challenges.
- Databricks Spark and other managed services abstract away the infrastructure, but they impose steep markups on compute.
- Kubetorch was built to enable an extremely good developer experience on Kubernetes without limiting you. Use Python to deploy horizontally scaled resources at any scale, whether your program is regular Python classes and functions or a Ray program.
| Category | AWS Lambda (or “Serverless”) | Ray / KubeRay | Databricks Spark | Kubernetes with Inference Server | Kubernetes with Kubetorch |
|---|---|---|---|---|---|
| Developer Experience | Good, but limited; writing regular Python for simple tasks is easy, but SDLC and dependency management are poor. | Mixed; push-button scale makes things easy, but the initial learning curve for Ray APIs is steep, and logs can be incredibly opaque. | Good; spinning up distributed compute is abstracted away, and developers simply dispatch work into it. | Poor; YAML hell to wrap everything, and every dev loop takes ~30 minutes to rebuild the image and redeploy to Kubernetes. | Good; simple Pythonic APIs to define and launch compute. No YAML. Iteration dev loops against a warm service are <2 seconds. |
| Debuggability | Good, but limited; funnels you into CloudWatch, which lacks the configurability of other solutions. | Poor; framework layering makes it hard to debug, workloads cannot be easily observed with standard platform tooling, and logging is not transparent. | Mixed; JVM stack traces and lazy computation introduce challenges, even as Databricks has added custom tooling to improve debugging. | Mixed; depends heavily on the observability already baked into the software platform. | Great; logs stream back to local during development, pods are accessible over SSH for direct debugging, and it integrates with any existing Kubernetes logging. |
| GPU Support | No GPU support, and significant limitations on CPU and memory scale as well. | Great; Ray was built with accelerated workloads in mind. | Mixed; RAPIDS enables better GPU support, but still at a limited, plugin level. | Great; GPU support on Kubernetes is rich. | Great; GPU support in Kubetorch is rich. |
| Scale | Good, until you hit your quota limit of concurrent executions (1K-10K) | Good with occasional brittleness at extreme scale | Good with occasional brittleness at extreme scale | Good; Kubernetes is battle-tested at scale | Good; Kubetorch launches regular Kubernetes services. |
| Efficiency | Low; a cold start is paid on every startup, and the compute is marked up. | Good; Ray is a distributed DSL designed for efficiently scheduling distributed jobs. | Bad; Databricks imposes a steep 100% markup on the underlying compute to run it for you. | Good; Kubernetes was built to saturate horizontally scaling services with many requests. | Good; Kubetorch does not add any framework layering or cluster-in-cluster structures. |
Whether you use Ray or regular deployments, Kubetorch is designed to be a zero-cost abstraction that gives you fast iteration and transparent debuggability without imposing any additional points of failure or disintermediating you from the compute.
Case Study: High-Quality OCR, Run Fast and Cheap
Context: For simple image-to-text, Google Cloud charges roughly $1.50 per 1,000 pages, but complex tasks like table extraction and invoice parsing cost considerably more, from $10 to $30 per 1,000 pages. Previously, that markup was worth it: structured extraction was difficult to implement first-party without a ton of expertise. However, a slate of recent improvements to open-source vision language models, including Baidu's PaddleOCR-VL at 1B parameters and DeepSeek's DeepSeek-OCR at 3B parameters, has made OCR incredibly accessible for most teams. The richness of VLMs enables highly customized extraction simply by adjusting the prompt.
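To make that concrete, here is a purely illustrative sketch of how a single VLM-backed OCR service can be steered toward different extraction tasks with nothing but the prompt; the task names and fields below are hypothetical, not the prompts used in this case study:

```python
# Hypothetical prompts showing how one VLM OCR service can cover plain
# transcription, table extraction, and invoice parsing without code changes.
EXTRACTION_PROMPTS = {
    "plain_text": "Transcribe all text on this page in reading order.",
    "tables": "Extract every table on this page as Markdown, preserving headers.",
    "invoice": (
        "Return a JSON object with keys: vendor, invoice_number, date, currency, "
        "line_items (description, quantity, unit_price), and total."
    ),
}
```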
Approach: We deployed a DeepSeek-OCR service with Kubetorch to L4 GPUs (~$0.80/hour on demand per replica). We used Kubetorch both to deploy a regular autoscaling Kubernetes service and to launch a Ray cluster with the same resources. Avoiding Ray reduces framework layering and keeps the deployments Kubernetes-native and significantly more manageable with a regular software platform; using Ray simplifies the code and may deliver better throughput with less tuning across heterogeneous hardware and deployment sizes.
Code Samples
- Deploying OCR as a horizontally scaling Kubernetes service: This example uses plain Python deployed as standard Kubernetes deployments and achieves excellent performance without any framework layering or DSLs. The service we deploy here is already usable for online serving; the only advised change would be setting min_scale to 0 or 1 to enable scale-down behavior.
- Running OCR with Ray Data: We use Kubetorch to launch Ray clusters, keeping the same fast iteration and hot syncing of local code changes that we have with regular code, but this time using the Ray APIs to run the inference (a condensed sketch of this path follows the list).
- Running OCR with Ray Data and Ray Serve: Ray Serve is the Ray ecosystem's library for scalable model serving, and we use it here to increase throughput, though it adds another layer that can introduce tracing and debugging challenges in production.
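To give a flavor of the Ray Data path, here is a condensed, illustrative sketch of batch OCR with `map_batches`. The model (a generic transformers image-to-text pipeline rather than the exact DeepSeek-OCR setup), the bucket paths, and the sizing parameters are all stand-ins:

```python
# Illustrative sketch of the Ray Data path: a pool of GPU actors runs an
# image-to-text model over a dataset of page images. The model ID, bucket
# paths, and sizing are placeholders, not the exact setup from this post.
import io

import ray
from PIL import Image
from transformers import pipeline


class OCRModel:
    def __init__(self):
        # Stand-in OCR model; swap in DeepSeek-OCR or PaddleOCR-VL as needed.
        self.pipe = pipeline(
            "image-to-text", model="microsoft/trocr-base-printed", device=0
        )

    def __call__(self, batch: dict) -> dict:
        images = [Image.open(io.BytesIO(b)).convert("RGB") for b in batch["bytes"]]
        preds = self.pipe(images)
        return {
            "path": batch["path"],
            "text": [p[0]["generated_text"] for p in preds],
        }


ds = ray.data.read_binary_files("s3://my-bucket/pages/", include_paths=True)
results = ds.map_batches(
    OCRModel,
    batch_size=16,
    num_gpus=1,      # one GPU per model replica
    concurrency=8,   # number of replicas in the actor pool
)
results.write_parquet("s3://my-bucket/ocr-output/")
```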
Results: We saw approximately 0.22 pages/second of throughput per deployed GPU. All three implementations delivered roughly comparable throughput, so the choice to use Ray comes down to your team's familiarity with it. Kubetorch is usable out of the box with Ray, so experimentation is easy.

In dollar terms: about $1 per 1,000 pages of arbitrarily complex document extraction. This is a 10x improvement over the Google Cloud Document APIs for structured extraction and a 30x improvement over their invoice parser. The equivalent cloud offerings from Azure and AWS are priced identically.
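That figure follows directly from the measured throughput and the on-demand L4 price quoted above; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope cost per 1,000 pages from the measured throughput.
pages_per_second_per_gpu = 0.22          # measured throughput above
l4_price_per_hour = 0.80                 # approximate on-demand price per replica

pages_per_hour = pages_per_second_per_gpu * 3600      # ~792 pages/hour/GPU
cost_per_1000_pages = l4_price_per_hour / pages_per_hour * 1000

print(f"${cost_per_1000_pages:.2f} per 1,000 pages")  # ~ $1.01
```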
