Kubetorch Introduction

Kubetorch allows any ML developer to rapidly build scalable AI/ML applications on cloud compute in regular Python.

It consists of a Python library and a Kubernetes operator, which together provide a simple, powerful, and flexible way to build, deploy, and manage AI/ML applications. You can use Kubetorch incrementally within an existing ML stack and codebase, or as a complete replacement for training, batch processing, inference, HPO, and pipelining tools in systems like Kubeflow, SageMaker, Vertex, etc.

Key features include:

  • Fast iteration loops of 1-2 seconds, compared to 30-60 minutes for systems like Kubeflow, SageMaker, etc.
  • Pythonic APIs which abstract infrastructure complexity from ML Engineers, Data Scientists, and Researchers
  • General purpose, not tied to specific AI/ML frameworks, libraries, methods, or tools, or how you organize your code
  • Works anywhere you run Python (IDEs, notebooks, CI, FastAPI apps, orchestrators), no specialized environment required
  • Push-button deployment and scaling of ML applications on Kubernetes, no "translation for production" required
  • Installs on your own Kubernetes cluster, without restricting how you choose to manage or configure it
  • Powerful fault-tolerance, reliability, and observability tooling out of the box

Perhaps most importantly, Kubetorch eliminates human error and the need for manual processes in the ML lifecycle by moving them into code, improving reproducibility and allowing you to focus on building great AI/ML applications.