Cluster

A Cluster is a Runhouse primitive used for abstracting a particular hardware configuration. This can be either an on-demand cluster (requires valid cloud credentials or a local Kube config if launching on Kubernetes), or a BYO (bring-your-own) cluster (requires IP address and ssh creds).

A cluster is assigned a name, through which it can be accessed and reused later on.

Cluster Factory Methods

Builds an instance of Cluster.

If Cluster with same name is found in Den and load_from_den is True, load it down from Den
If arguments corresponding to ondemand clusters are provided, arguments are fed through to rh.ondemand_cluster factory function
If arguments are mismatched with loaded Cluster, return a new Cluster with the provided args

Parameters:

name (str) – Name for the cluster.
host (str or List[str], optional) – Hostname (e.g. domain or name in .ssh/config), IP address, or list of IP addresses for the cluster (the first of which is the head node). (Default: None).
ssh_creds (Dict or str, optional) – SSH credentials, passed as dictionary or the name of an SSHSecret object. Example: ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'} (Default: None).
ssh_port (int, optional) – Port to use for ssh. If not provided, will default to 22.
client_port (int, optional) – Port to use for the client. If not provided, will default to the server port.
server_port (int, optional) – Port to use for the server. If not provided, will use 80 for a server_connection_type of none, 443 for tls, and 32300 for all other SSH connection types.
server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument passed to runhouse server start on the cluster). Defaults to "0.0.0.0" unless connecting to the server with an SSH connection, in which case localhost is used. (Default: None).
server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. ssh will start the server via an SSH tunnel. tls will start the server with HTTPS on port 443 using TLS certs without an SSH tunnel. none will start the server with HTTP without an SSH tunnel. (Default: None).
ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS. (Default: None).
ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS. (Default: None).
domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster. (Default: None).
image (Image, optional) – Default image containing setup steps to run during cluster setup. See Image. (Default: None)
den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: None).
load_from_den (bool) – Whether to try loading the Cluster resource from Den. (Default: True)
dryrun (bool) – Whether to create the Cluster if it doesn’t exist, or load a Cluster object as a dryrun. (Default: False)

Returns:

The resulting cluster.

Return type:

Union[Cluster, OnDemandCluster]

Example

>>> # using private key
>>> gpu = rh.cluster(host='<hostname>',
>>>                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
>>>                  name='rh-a10x').save()

>>> # using password
>>> gpu = rh.cluster(host='<hostname>',
>>>                  ssh_creds={'ssh_user': '...', 'password':'*****'},
>>>                  name='rh-a10x').save()

>>> # using the name of an SSHSecret object
>>> gpu = rh.cluster(host='<hostname>',
>>>                  ssh_creds="my_ssh_secret",
>>>                  name='rh-a10x').save()

>>> # Load cluster from above
>>> reloaded_cluster = rh.cluster(name="rh-a10x")
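The documented defaults for server_port and server_host depend only on the connection type. The sketch below restates that rule in plain Python. default_server_port and default_server_host are hypothetical helpers written for illustration, not Runhouse functions, and the sketch assumes the connection type arrives as a lowercase string with "ssh" as the only SSH-based type.

```python
def default_server_port(connection_type: str) -> int:
    """Documented default port: 80 for 'none', 443 for 'tls', 32300 otherwise."""
    if connection_type == "none":
        return 80
    if connection_type == "tls":
        return 443
    # Assumption: every remaining connection type is SSH-based.
    return 32300


def default_server_host(connection_type: str) -> str:
    """Documented default host: localhost over SSH, otherwise all interfaces."""
    return "localhost" if connection_type == "ssh" else "0.0.0.0"


for conn in ("none", "tls", "ssh"):
    print(conn, default_server_port(conn), default_server_host(conn))
```

Passing server_port or server_host explicitly to the factory overrides these defaults.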

Builds an instance of OnDemandCluster.

If Cluster with same name is found in Den and load_from_den is True, load it down from Den
If launch arguments are mismatched with loaded Cluster, return a new Cluster with the provided args. These args are passed through to SkyPilot’s Resource constructor: instance_type, num_nodes, provider, use_spot, region, memory, disk_size, num_cpus, gpus (accelerators), open_ports, autostop_mins, sky_kwargs.
If runhouse related arguments are mismatched with loaded Cluster, override those Cluster properties

Parameters:

name (str) – Name for the cluster, to re-use later on.
instance_type (str, optional) – Type of cloud VM to use for the cluster, e.g. “r5d.xlarge”. Optional, as you may instead choose to specify resource requirements (e.g. memory, disk_size, num_cpus, gpus).
num_nodes (int, optional) – Number of nodes to use for the cluster.
provider (str, optional) – Cloud provider to use for the cluster.
autostop_mins (int, optional) – Number of minutes to keep the cluster up after inactivity, or -1 to keep cluster up indefinitely. (Default: 60).
use_spot (bool, optional) – Whether or not to use spot instance. (Default: False)
region (str, optional) – The region to use for the cluster. (Default: None)
memory (int or str, optional) – Amount of memory to use for the cluster, e.g. 16 or “16+”. (Default: None)
disk_size (int, optional) – Amount of disk space to use for the cluster in GiB, e.g. 100. (Default: None)
num_cpus (int or str, optional) – Number of CPUs to use for the cluster, e.g. 4 or “4+”. (Default: None)
gpus (int or str, optional) – Type and number of GPU to use for the cluster e.g. “A101” or “L4:8”. (Default: None)
open_ports (int or str or List[int], optional) – Ports to open in the cluster’s security group. Note that you are responsible for ensuring that the applications listening on these ports are secure. (Default: None)
vpc_name (str, optional) – Specific VPC used for launching the cluster. If not specified, cluster will be launched in the default VPC.
sky_kwargs (dict, optional) – Additional keyword arguments to pass to the SkyPilot Resource or launch APIs. Should be a dict of the form {“resources”: {<resources_kwargs>}, “launch”: {<launch_kwargs>}}, where resources_kwargs and launch_kwargs will be passed to the SkyPilot Resources API (See SkyPilot docs) and launch API (See SkyPilot docs), respectively. Duplicating arguments passed to the ondemand_cluster factory method will raise an error. (Default: None)
kube_namespace (str, optional) – Namespace for kubernetes cluster, if applicable. (Default: None)
kube_config_path (str, optional) – Path to the kube_config, for a kubernetes cluster. (Default: None)
kube_context (str, optional) – Context for kubernetes cluster, if applicable. (Default: None)
server_port (int, optional) – Port to use for the server. If not provided, will use 80 for a server_connection_type of none, 443 for tls, and 32300 for all other SSH connection types. (Default: None)
server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument passed to runhouse server start on the cluster). Defaults to “0.0.0.0” unless connecting to the server with an SSH connection, in which case localhost is used. (Default: None)
server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. ssh will start the server via an SSH tunnel. tls will start the server with HTTPS on port 443 using TLS certs without an SSH tunnel. none will start the server with HTTP without an SSH tunnel. (Default: None)
launcher (LauncherType or str, optional) – Method for launching the cluster. If set to local, will launch locally via Sky. If set to den, launching will be handled by Runhouse. If not provided, will be set to your configured default launcher, which defaults to local. (Default: None)
ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS. (Default: None)
ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS. (Default: None)
domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster. (Default: None)
image (Image, optional) – Default image containing setup steps to run during cluster setup. See Image. (Default: None)
den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: None).
load_from_den (bool) – Whether to try loading the Cluster resource from Den. (Default: True)
dryrun (bool) – Whether to create the Cluster if it doesn’t exist, or load a Cluster object as a dryrun. (Default: False)

Returns:

The resulting cluster.

Return type:

OnDemandCluster

Example

>>> # On-Demand SkyPilot Cluster (OnDemandCluster)
>>> gpu = rh.ondemand_cluster(name='rh-4-a100s',
>>>                  gpus='A100:4',
>>>                  provider='gcp',
>>>                  autostop_mins=-1,
>>>                  use_spot=True,
>>>                  region='us-east1',
>>>                  ).save()

>>> # Load cluster from above
>>> reloaded_cluster = rh.ondemand_cluster(name="rh-4-a100s")
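The sky_kwargs parameter expects a dict with optional "resources" and "launch" entries, and duplicating an argument already passed to the factory raises an error. The sketch below illustrates that shape and a duplicate check. check_sky_kwargs is a hypothetical helper, the use of ValueError is an assumption, and the zone and retry_until_up keys are only illustrative SkyPilot options.

```python
from typing import Any, Dict


def check_sky_kwargs(factory_args: Dict[str, Any], sky_kwargs: Dict[str, Dict[str, Any]]) -> None:
    """Reject resource kwargs that repeat an argument already given to the factory."""
    resources_kwargs = sky_kwargs.get("resources", {})
    duplicated = sorted(
        key for key in resources_kwargs if factory_args.get(key) is not None
    )
    if duplicated:
        # Assumption: the real error type may differ.
        raise ValueError(f"Arguments passed twice: {duplicated}")


factory_args = {"region": "us-east-1", "use_spot": True}
sky_kwargs = {
    "resources": {"zone": "us-east-1a"},    # forwarded to the SkyPilot Resources API
    "launch": {"retry_until_up": True},     # forwarded to the SkyPilot launch API
}
check_sky_kwargs(factory_args, sky_kwargs)  # no overlap, so nothing is raised
```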

Cluster Class

class runhouse.Cluster(name: str | None = None, ips: List[str] = None, creds: Secret = None, server_host: str = None, server_port: int = None, ssh_port: int = None, client_port: int = None, server_connection_type: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, ssh_properties: Dict = None, den_auth: bool = False, dryrun: bool = False, image: Image | None = None, **kwargs)[source]

__init__(name: str | None = None, ips: List[str] = None, creds: Secret = None, server_host: str = None, server_port: int = None, ssh_port: int = None, client_port: int = None, server_connection_type: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, ssh_properties: Dict = None, den_auth: bool = False, dryrun: bool = False, image: Image | None = None, **kwargs)[source]

The Runhouse cluster, or system. This is where you can run Functions and access or transfer data. You can BYO (bring-your-own) cluster by providing the cluster IP and ssh_creds, or use an on-demand cluster that is spun up and down through SkyPilot using your cloud credentials.

Note

To build a cluster, please use the factory method cluster().

call(module_name: str, method_name: str, *args, stream_logs: bool = True, run_name: str = None, remote: bool = False, run_async: bool = False, save: bool = False, **kwargs)[source]

Call a method on a module that is in the cluster’s object store.

Parameters:

module_name (str) – Name of the module saved on system.
method_name (str) – Name of the method.
stream_logs (bool, optional) – Whether to stream logs from the method call. (Default: True)
run_name (str, optional) – Name for the run. (Default: None)
remote (bool, optional) – Return a remote object from the function, rather than the result proper. (Default: False)
run_async (bool, optional) – Run the method asynchronously and return an awaitable. (Default: False)
save (bool, optional) – Whether or not to save the call. (Default: False)
*args – Positional arguments to pass to the method.
**kwargs – Keyword arguments to pass to the method.

Example

>>> cluster.call("my_module", "my_method", arg1, arg2, kwarg1=kwarg1)

clear()[source]: Clear the cluster’s object store.

connect_dask(port: int = 8786, scheduler_options: Dict = None, worker_options: Dict = None, client_timeout: str = '3s')[source]

Connect to Dask client.

Parameters:

port (int, optional) – Port to connect Dask. (Default: 8786)
scheduler_options (Dict, optional) – Dict of scheduler options. (Default: None)
worker_options (Dict, optional) – Dict of worker options. (Default: None)
client_timeout (str, optional) – Timeout, in string representation. (Default: 3s)

create_conda_env(conda_env_name: str, conda_config: Dict)[source]

Create a new Conda Env on the cluster.

Parameters:

conda_env_name (str) – Name of the conda env to create.
conda_config (Dict) – Dict representing conda config yaml, used to construct the conda environment. Name in conda config must match conda_env_name.
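The requirement that the name in conda_config matches conda_env_name can be checked before calling create_conda_env. The following is a minimal sketch: check_conda_config is a hypothetical helper (not part of Runhouse), and the dict mirrors the usual conda environment YAML layout.

```python
from typing import Dict


def check_conda_config(conda_env_name: str, conda_config: Dict) -> None:
    """Raise if the config's name does not match the requested env name."""
    config_name = conda_config.get("name")
    if config_name != conda_env_name:
        # Assumption: ValueError is used here only for illustration.
        raise ValueError(
            f"conda_config name {config_name!r} does not match {conda_env_name!r}"
        )


conda_config = {
    "name": "my_conda_env",
    "channels": ["conda-forge"],
    "dependencies": ["python=3.10", "pip"],
}
check_conda_config("my_conda_env", conda_config)  # names match, nothing raised
```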

create_process(name: str, env_vars: Dict | None = None, compute: Dict | None = None, runtime_env: Dict | None = None) → str[source]

Create a new process on the cluster with the given arguments. If the process already exists on the cluster with the same arguments, it is a no-op. If the process exists but was constructed with different arguments, will throw an error.

Parameters:

name (str) – Name to give the process.
env_vars (Dict, optional) – Dict of env vars to set on the process.
compute (Dict, optional) – Mapping of node index or compute resources (e.g. {"GPU": 1}, {"node_idx": 1})
runtime_env (Dict, optional) – Runtime env to be used for the process.
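The create_process contract described above (no-op for identical arguments, error for conflicting ones) can be modeled with a small in-memory registry. This is a toy stand-in rather than Runhouse code: ProcessRegistry is hypothetical, and both the ValueError and the choice to return the process name are assumptions.

```python
from typing import Dict, Optional


class ProcessRegistry:
    """Toy in-memory model of the processes that exist on a cluster."""

    def __init__(self) -> None:
        self._processes: Dict[str, Dict] = {}

    def create_process(
        self,
        name: str,
        env_vars: Optional[Dict] = None,
        compute: Optional[Dict] = None,
        runtime_env: Optional[Dict] = None,
    ) -> str:
        args = {"env_vars": env_vars, "compute": compute, "runtime_env": runtime_env}
        existing = self._processes.get(name)
        if existing is None:
            self._processes[name] = args  # first creation
        elif existing != args:
            # Same name, different arguments: documented as an error.
            raise ValueError(f"Process {name!r} exists with different arguments")
        # Same name and same arguments: no-op.
        return name


registry = ProcessRegistry()
registry.create_process("gpu_worker", compute={"GPU": 1})
registry.create_process("gpu_worker", compute={"GPU": 1})  # no-op
```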

delete(keys: None | str | List[str])[source]

Delete the given items from the cluster’s object store. To delete all items, use cluster.clear()

Parameters:: keys (str or List[str]) – key or list of keys to delete from the object store.

delete_configs(delete_creds: bool = False)[source]: Delete configs for the cluster

disable_den_auth()[source]: Disable Den auth on the cluster.

disconnect()[source]

Disconnect the RPC tunnel.

Example

>>> cluster.disconnect()

download_cert()[source]: Download certificate from the cluster (Note: user must have access to the cluster)

enable_den_auth(flush: bool = True)[source]

Enable Den auth on the cluster.

Parameters:: flush (bool, optional) – Whether to flush the auth cache. (Default: True)

endpoint(external: bool = False)[source]

Endpoint for the cluster’s Daemon server.

Parameters:: external (bool, optional) – If True, will only return the external url, and will return None otherwise (e.g. if a tunnel is required). If set to False, will either return the external url if it exists, or will set up the connection (based on connection_type) and return the internal url (including the local connected port rather than the server port). If cluster is not up, returns None. (Default: False)

ensure_process_created(name: str, env_vars: Dict | None = None, compute: Dict | None = None, runtime_env: Dict | None = None) → str[source]

Retrieve the process with the given name on the cluster if it already exists, or create a new process on the cluster with the given arguments.

Parameters:

name (str) – Name to give the process.
env_vars (Dict, optional) – Dict of env vars to set on the process.
compute (Dict, optional) – Mapping of node index or compute resources (e.g. {"GPU": 1}, {"node_idx": 1})
runtime_env (Dict, optional) – Runtime env to be used for the process.

classmethod from_config(config: Dict, dryrun: bool = False, _resolve_children: bool = True)[source]

Load or construct resource from config.

Parameters:

config (Dict) – Resource config.
dryrun (bool, optional) – Whether to construct resource or load as dryrun (Default: False)

classmethod from_name(name: str, load_from_den: bool = True, dryrun: bool = False, _resolve_children: bool = True)[source]

Load existing Resource via its name.

Parameters:

name (str) – Name of the resource to load.
load_from_den (bool, optional) – Whether to try loading the resource from Den. (Default: True)
dryrun (bool, optional) – Whether to construct the object or load as dryrun. (Default: False)

get(key: str, default: Any = None, remote=False)[source]

Get the result for a given key from the cluster’s object store.

Parameters:

key (str) – Key to get from the cluster’s object store.
default (Any, optional) – What to return if the key is not found. To raise an error, pass in KeyError. (Default: None)
remote (bool, optional) – Whether to get the remote object, rather than the object in full. (Default: False)
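The default argument follows a sentinel convention: any value is returned as-is when the key is missing, except KeyError, which requests an exception instead. A minimal sketch of that convention, using a plain dict in place of the cluster's object store (get_with_default is a hypothetical helper):

```python
from typing import Any, Dict


def get_with_default(store: Dict[str, Any], key: str, default: Any = None) -> Any:
    """Return store[key]; if missing, return default or raise when default is KeyError."""
    if key in store:
        return store[key]
    if default is KeyError:
        # Passing the KeyError class itself asks for an error on a missing key.
        raise KeyError(key)
    return default


store = {"model": "weights-v1"}
print(get_with_default(store, "model"))       # found
print(get_with_default(store, "missing"))     # falls back to None
print(get_with_default(store, "missing", 0))  # falls back to 0
```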

property head_ip: Head IP

install_packages(reqs: List[Package | str], node: str | None = None, conda_env_name: str | None = None, force_sync_local: bool = False)[source]

Install the given packages on the cluster.

Parameters:

reqs (List[Package or str]) – List of packages to install on cluster.
node (str, optional) – Cluster node to install the package on. If specified, will use ssh to install the package. (Default: None)
conda_env_name (str, optional) – Name of conda env to install the package in, if relevant. If left empty, defaults to base environment. (Default: None)
force_sync_local (bool, optional) – If the package exists both locally and remotely, whether to override the remote version with the local version. By default, the local version will be installed only if the package does not already exist on the cluster. (Default: False)

Example

>>> cluster.install_packages(reqs=["accelerate", "diffusers"])
>>> cluster.install_packages(reqs=["accelerate", "diffusers"], conda_env_name="my_conda_env", force_sync_local=True)

property internal_ips: Internal cluster IPs

property ips: Cluster IPs

is_connected()[source]

Whether the RPC tunnel is up.

Example

>>> connected = cluster.is_connected()

is_up() → bool[source]

Check if the cluster is up.

Example

>>> rh.cluster("rh-cpu").is_up()

keys(process: str = None)[source]

List all keys in the cluster’s object store.

Parameters:: process (str, optional) – Process for which to list the keys.

kill_dask()[source]: Kill Dask client connection.

kill_process(process: str)[source]

Kill a process on the cluster.

Parameters:: process (str) – Process to kill.

Example

>>> cluster.create_process("my_process")
>>> cluster.kill_process("my_process")

classmethod list(show_all: bool = False, since: str | None = None, status: str | ClusterStatus | None = None, force: bool = False) → Dict[str, List[Dict]][source]

Loads Runhouse clusters saved in Den and locally via Sky. If filters are provided, only clusters matching the filters are returned. If no filters are provided, all running clusters are returned.

Parameters:

show_all (bool, optional) – Whether to list all clusters saved in Den. Maximum of 200 will be listed. (Default: False).
since (str, optional) – Clusters that were active in the specified time period will be returned. Value can be in seconds, minutes, hours or days.
status (str or ClusterStatus, optional) – Clusters with the provided status will be returned. Options include: running, terminated, initializing, unknown.
force (bool, optional) – Whether to force a status update for all relevant clusters, or load the latest values. (Default: False).

Examples

>>> Cluster.list(since="75s")
>>> Cluster.list(since="3m")
>>> Cluster.list(since="2h", status="running")
>>> Cluster.list(since="7d")
>>> Cluster.list(show_all=True)
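The since values in the examples above combine an integer with a unit suffix for seconds, minutes, hours, or days. The sketch below shows one way such a string could be converted to seconds. since_to_seconds is a hypothetical helper, and the accepted grammar (digits followed by one of s, m, h, d) is an assumption inferred from the examples.

```python
import re

# Seconds per unit suffix, inferred from the documented examples.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def since_to_seconds(since: str) -> int:
    """Convert strings such as '75s', '3m', '2h' or '7d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", since.strip())
    if match is None:
        raise ValueError(f"Unsupported time period: {since!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


for period in ("75s", "3m", "2h", "7d"):
    print(period, since_to_seconds(period))
```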

list_processes()[source]: List all processes on the cluster.

notebook(persist: bool = False, sync_package_on_close: str | None = None, port_forward: int = 8888)[source]

Tunnel into and launch notebook from the cluster.

Example

>>> rh.cluster("test-cluster").notebook()

on_this_cluster()[source]: Whether this function is being called on the same cluster.

pause_autostop()[source]: Context manager to temporarily pause autostop. Only for OnDemand clusters. There is no autostop for static clusters.

put(key: str, obj: Any, process: str = None)[source]

Put the given object on the cluster’s object store at the given key.

Parameters:

key (str) – Key to assign the object in the object store.
obj (Any) – Object to put in the object store
process (str, optional) – Process of the object store to put the object in. (Default: None)

put_resource(resource: Resource, state: Dict = None, dryrun: bool = False, process: str | None = None)[source]

Put the given resource on the cluster’s object store. Returns the key (important if name is not set).

Parameters:

resource (Resource) – Resource to put in the object store.
state (Dict, optional) – Dict of resource attributes to override. (Default: None)
dryrun (bool, optional) – Whether to put the resource in dryrun mode or not. (Default: False)
process (str, optional) – Process of the object store to put the object in. (Default: None)

remove_conda_env(conda_env_name: str)[source]

Remove conda env from the cluster.

Parameters:: conda_env_name (str) – Name of conda env to remove from the cluster.

Example

>>> rh.ondemand_cluster("rh-cpu").remove_conda_env("my_conda_env")

rename(old_key: str, new_key: str)[source]

Rename a key in the cluster’s object store.

Parameters:

old_key (str) – Original key to rename.
new_key (str) – Name to reassign the object.

restart_server(_rh_install_url: str = None, resync_rh: bool | None = None, restart_ray: bool = True, restart_proxy: bool = False)[source]

Restart the RPC server.

Parameters:

resync_rh (bool) – Whether to resync Runhouse. If False, will not resync Runhouse onto the cluster. If not specified, will sync if Runhouse is not installed on the cluster or if it is installed locally as editable. (Default: None)
restart_ray (bool) – Whether to restart Ray. (Default: True)
restart_proxy (bool) – Whether to restart Caddy on the cluster, if configured. (Default: False)

Example

>>> rh.cluster("rh-cpu").restart_server()

rsync(source: str, dest: str, up: bool = True, node: str = None, src_node: str = None, contents: bool = False, filter_options: str = None, stream_logs: bool = False, ignore_existing: bool = False, parallel: bool = False)[source]

Sync the contents of the source directory into the destination.

Parameters:

source (str) – The source path.
dest (str) – The target path.
up (bool) – The direction of the sync. If True, will rsync from local to cluster. If False will rsync from cluster to local.
node (str, optional) – Specific cluster node to rsync to. If not specified will use the address of the cluster’s head node.
src_node (str, optional) – Specific cluster node to rsync from, for node-to-node rsyncs.
contents (bool, optional) – Whether the contents of the source directory or the directory itself should be copied to destination. If True, the contents of the source directory are copied to the destination, and the source directory itself is not created at the destination. If False, the source directory along with its contents is copied to the destination, creating an additional directory layer at the destination. (Default: False).
filter_options (str, optional) – The filter options for rsync.
stream_logs (bool, optional) – Whether to stream logs to the stdout/stderr. (Default: False).
ignore_existing (bool, optional) – Whether the rsync should skip updating files that already exist on the destination. (Default: False).

Note

Ending source with a slash will copy the contents of the directory into dest, while omitting it will copy the directory itself (adding a directory layer).
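The effect of contents and of a trailing slash on where files land can be worked through with a small path calculation. This is a sketch of the documented behavior only: synced_location is a hypothetical helper, it assumes POSIX-style paths, and it treats a trailing slash on source the same as contents=True.

```python
import posixpath


def synced_location(source: str, dest: str, contents: bool = False) -> str:
    """Return the directory under which the files from source end up."""
    dest = dest.rstrip("/")
    if contents or source.endswith("/"):
        # Only the contents are copied: no extra directory layer.
        return dest
    # The directory itself is copied: one extra layer named after source.
    return posixpath.join(dest, posixpath.basename(source))


print(synced_location("local/data", "/remote/dir"))                 # extra layer
print(synced_location("local/data/", "/remote/dir"))                # contents only
print(synced_location("local/data", "/remote/dir", contents=True))  # contents only
```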

run_bash(commands: str | List[str], node: int | str | None = None, process: str | None = None, stream_logs: bool = True, require_outputs: bool = True)[source]

Run bash commands on the cluster through the Runhouse server. These are run via subprocess.run in the relevant Python process or node on the cluster. If neither process nor node is specified, the commands run on the head node.

Parameters:

commands (str or List[str]) – Commands to run on the cluster.
node (int, str or None) – Node to run the command on. Node can be an int referring to the node index, a string referring to the node's IP, or “all” to run on all nodes. (Default: None)
process (str or None) – Process to run the command on. (Default: None)
stream_logs (bool) – Whether to stream logs. (Default: True)
require_outputs (bool) – Whether to return stdout/stderr in addition to status code. (Default: True)
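The node argument accepts three forms: an index, an IP string, or "all". The sketch below shows how those forms could map onto a cluster's IP list, with the head node being the first IP. resolve_nodes is a hypothetical helper, and zero-based indexing is an assumption.

```python
from typing import List, Union


def resolve_nodes(ips: List[str], node: Union[int, str, None] = None) -> List[str]:
    """Map a node argument onto the list of IPs the command would target."""
    if node is None:
        return [ips[0]]      # default: head node
    if node == "all":
        return list(ips)     # every node in the cluster
    if isinstance(node, int):
        return [ips[node]]   # assumption: zero-based node index
    if node in ips:
        return [node]        # explicit IP address
    raise ValueError(f"Unknown node: {node!r}")


ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
print(resolve_nodes(ips))
print(resolve_nodes(ips, 1))
print(resolve_nodes(ips, "all"))
```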

run_bash_over_ssh(commands: str | List[str], node: int | str | None = None, stream_logs: bool = True, require_outputs: bool = True, _ssh_mode: str = 'interactive', conda_env_name: str | None = None)[source]

Run bash commands on the cluster over SSH. Will not work directly on the cluster, works strictly over ssh.

Parameters:

commands (str or List[str]) – Commands to run on the cluster.
node (int, str or None) – Node to run the command on. Node can be an int referring to the node index, a string referring to the node's IP, or “all” to run on all nodes. If not specified, run the command on the head node. (Default: None)
stream_logs (bool) – Whether to stream logs. (Default: True)
require_outputs (bool) – Whether to return stdout/stderr in addition to status code. (Default: True)
conda_env_name (str or None) – Name of conda env to run the command in, if applicable. (Default: None)

run_python(commands: List[str], conda_env_name: str | None = None, stream_logs: bool = True, node: str = None)[source]

Run a list of python commands on the cluster, or a specific cluster node if its IP is provided.

Parameters:

commands (List[str]) – List of commands to run.
conda_env_name (str, optional) – Name of conda env to run the commands in, if applicable. (Default: None)
stream_logs (bool, optional) – Whether to stream logs. (Default: True)
node (str, optional) – Node to run commands on. If not specified, runs on head node. (Default: None)

Example

>>> cpu.run_python(['import numpy', 'print(numpy.__version__)'])
>>> cpu.run_python(["print('hello')"])
>>> cpu.run_python(["print('hello')"], node="3.89.174.234")

Note

Running Python commands with nested quotes can be finicky. If using nested quotes, try to wrap the outer quote with double quotes (") and the inner quotes with single quotes (').
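To see why nested quotes matter, suppose the commands are joined with "; " and wrapped in double quotes for python3 -c. That is an assumption made for illustration, and Runhouse's actual command construction may differ. The sketch uses shlex.split to show what a POSIX shell would hand to Python; naive_python_command is a hypothetical helper.

```python
import shlex
from typing import List


def naive_python_command(commands: List[str]) -> str:
    """Join commands and wrap them in double quotes for `python3 -c` (illustrative only)."""
    return 'python3 -c "{}"'.format("; ".join(commands))


# Single quotes inside the command survive the outer double quotes.
parsed_ok = shlex.split(naive_python_command(["print('hello')"]))

# Double quotes inside the command close the outer quotes early, so the
# shell strips them and Python would receive print(hello) instead.
parsed_bad = shlex.split(naive_python_command(['print("hello")']))

print(parsed_ok[-1])
print(parsed_bad[-1])
```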

save(name: str = None, overwrite: bool = True, folder: str = None)[source]

Overrides the default resource save() method in order to also update the cluster config on the cluster itself.

Parameters:

name (str, optional) – Name to save the cluster as, if different from its existing name. (Default: None)
overwrite (bool, optional) – Whether to overwrite the existing saved resource, if it exists. (Default: True)
folder (str, optional) – Folder to save the config in, if saving locally. If None and saving locally, will be saved in the ~/.rh directory. (Default: None)

property server_address: Address to use in the requests made to the cluster. If creating an SSH tunnel with the cluster, this will be set to localhost; otherwise it will use the cluster’s domain (if provided), or its public IP address.
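The precedence described for server_address (tunnel first, then domain, then public IP) can be written out as a short function. This is a sketch of the documented rule rather than the property's implementation. resolve_server_address and its parameters are hypothetical names, and the IP shown is a documentation-range placeholder.

```python
from typing import Optional


def resolve_server_address(
    uses_ssh_tunnel: bool, domain: Optional[str], public_ip: Optional[str]
) -> Optional[str]:
    """Pick the address for requests: localhost over a tunnel, else domain, else IP."""
    if uses_ssh_tunnel:
        return "localhost"
    return domain or public_ip


print(resolve_server_address(True, "example.com", "203.0.113.10"))
print(resolve_server_address(False, "example.com", "203.0.113.10"))
print(resolve_server_address(False, None, "203.0.113.10"))
```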

set_process_env_vars(name: str, env_vars: Dict)[source]

Set the env vars for a process on the cluster.

Parameters:

name (str) – Name of the process
env_vars (Dict) – Env vars and values to set on the process.

share(users: str | List[str] = None, access_level: ResourceAccess | str = ResourceAccess.READ, visibility: ResourceVisibility | str | None = None, notify_users: bool = True, headers: Dict | None = None) → Tuple[Dict[str, ResourceAccess], Dict[str, ResourceAccess]][source]

Grant access to the cluster for a single user or list of users. By default, the user(s) will receive an email notification of access (if they have a Runhouse account) or instructions on creating an account to access the cluster. If visibility is set to public, users will not be notified.

Parameters:

users (Union[str, list], optional) – Single user or list of user emails and / or Runhouse account usernames. If none are provided and visibility is set to public, cluster will be made publicly available to all users. (Default: None)
access_level (ResourceAccess, optional) – Access level to provide for the cluster. Note that for clusters only read access is currently supported.
visibility (ResourceVisibility, optional) – Type of visibility to provide for the shared resource. By default, the visibility is private. (Default: None)
notify_users (bool, optional) – Whether to send an email notification to users who have been given access. (Default: True)
headers (Dict, optional) – Request headers to provide for the request to Den. Contains the user’s auth token. Example: {"Authorization": f"Bearer {token}"}

Returns:

added_users:: Users who already have a Runhouse account and have been granted access to the cluster.
new_users:: Users who do not have Runhouse accounts and received notifications via their emails.
valid_users:: Set of valid usernames and emails from users parameter.

Return type:

Tuple(Dict, Dict, Set)

Example

>>> # Visibility will be set to private (users can search for and view resource in Den dashboard)
>>> cluster.share(users=["username1", "user2@gmail.com"])
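The users argument accepts either a single string or a list, and headers carries a bearer token in the documented format. A minimal sketch of preparing both values before calling share follows. den_auth_headers and normalize_users are hypothetical helpers, and the token value is a placeholder.

```python
from typing import Dict, List, Union


def den_auth_headers(token: str) -> Dict[str, str]:
    """Build request headers in the documented bearer-token format."""
    return {"Authorization": f"Bearer {token}"}


def normalize_users(users: Union[str, List[str], None]) -> List[str]:
    """Accept a single user, a list of users, or None, and return a list."""
    if users is None:
        return []
    if isinstance(users, str):
        return [users]
    return list(users)


print(den_auth_headers("<token>"))
print(normalize_users("username1"))
print(normalize_users(["username1", "user2@gmail.com"]))
```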

ssh()[source]

SSH into the cluster

Example

>>> rh.cluster("rh-cpu").ssh()

start_server(_rh_install_url: str = None, resync_rh: bool | None = None, restart_ray: bool = True, restart_proxy: bool = False)[source]

Start the RPC server.

Parameters:

resync_rh (bool) – Whether to resync Runhouse. If False, will not resync Runhouse onto the cluster. If not specified, will sync if Runhouse is not installed on the cluster or if it is installed locally as editable. (Default: None)
restart_ray (bool) – Whether to restart Ray. (Default: True)
restart_proxy (bool) – Whether to restart Caddy on the cluster, if configured. (Default: False)

Example

>>> rh.cluster("rh-cpu").start_server()

status(send_to_den: bool = False)[source]

Load the status of the Runhouse daemon running on a cluster.

Parameters:: send_to_den (bool, optional) – Whether to send and update the status in Den. Only applies to clusters that are saved to Den. (Default: False)

stop_server(stop_ray: bool = False, cleanup_actors: bool = True, conda_env_name: str | None = None)[source]

Stop the RPC server.

Parameters:

stop_ray (bool, optional) – Whether to stop Ray. (Default: False)
process (str, optional) – Specified process to stop the server on. (Default: None)
cleanup_actors (bool, optional) – Whether to kill all Ray actors. (Default: True)

sync_secrets(providers: List[str] | None = None, process: str = None)[source]

Send secrets for the given providers.

Parameters:

providers (List[str] or None, optional) – List of providers to send secrets for. If None, all providers configured in the environment will be sent. (Default: None)
process (str, optional) – Process to sync secrets into, if setting env vars. (Default: None)

Example

>>> cpu.sync_secrets(providers=["aws", "lambda"])

up_if_not(verbose: bool = True)[source]

Bring up the cluster if it is not up. No-op if cluster is already up. This only applies to on-demand clusters, and has no effect on self-managed clusters.

Parameters:: verbose (bool, optional) – Whether to stream logs from Den if the cluster is being launched. Only relevant if launching via Den. (Default: True)

Example

>>> rh.cluster("rh-cpu").up_if_not()

Cluster Hardware Setup

No additional setup is required. You will just need to have the IP address for the cluster and the path to SSH credentials ready to be used for the cluster initialization.

OnDemandCluster Class

An OnDemandCluster is a cluster that uses SkyPilot functionality underneath to handle various cluster properties.

class runhouse.OnDemandCluster(name, instance_type: str = None, num_nodes: int = None, provider: str = None, dryrun: bool = False, autostop_mins: int = None, use_spot: bool = False, memory: int | str = None, disk_size: int = None, num_cpus: int | str = None, gpus: str = None, open_ports: int | str | List[int] = None, server_host: int = None, server_port: int = None, server_connection_type: str = None, launcher: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, den_auth: bool = False, region: str = None, vpc_name: str = None, sky_kwargs: Dict = None, **kwargs)[source]

__init__(name, instance_type: str = None, num_nodes: int = None, provider: str = None, dryrun: bool = False, autostop_mins: int = None, use_spot: bool = False, memory: int | str = None, disk_size: int = None, num_cpus: int | str = None, gpus: str = None, open_ports: int | str | List[int] = None, server_host: int = None, server_port: int = None, server_connection_type: str = None, launcher: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, den_auth: bool = False, region: str = None, vpc_name: str = None, sky_kwargs: Dict = None, **kwargs)[source]

On-demand SkyPilot Cluster.

Note

To build a cluster, please use the factory method cluster().

async a_up(capture_output: bool | str = True)[source]

Bring up the cluster asynchronously in another process, so that launches can be parallelized and logs can be captured cleanly.

Parameters:

capture_output (bool or str) – If True, suppress the output of the cluster creation process. If False, print the output normally. If a string, write the output to the file at that path. (Default: True)
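Because a_up is a coroutine, several launches can be awaited together. The following is a minimal sketch, assuming each object exposes a name attribute and the a_up(capture_output=...) coroutine documented above. The helper launch_all and the per-cluster log file layout are hypothetical and not part of Runhouse.

```python
import asyncio
from pathlib import Path


async def launch_all(clusters, log_dir=None):
    """Launch every cluster concurrently via its a_up() coroutine.

    `clusters` is any iterable of objects with a `name` attribute and an
    async `a_up(capture_output=...)` method (e.g. on-demand clusters).
    If `log_dir` is given, each launch is asked to write its output to
    <log_dir>/<cluster name>.log; otherwise output is suppressed.
    """
    tasks = []
    for cluster in clusters:
        if log_dir is None:
            capture = True  # suppress launch output
        else:
            capture = str(Path(log_dir) / f"{cluster.name}.log")
        tasks.append(cluster.a_up(capture_output=capture))
    # gather() runs the launches concurrently and preserves input order
    return await asyncio.gather(*tasks)
```

In real use the clusters would come from rh.ondemand_cluster(...), and the helper would be driven with asyncio.run(launch_all(clusters, log_dir="logs")).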

static cluster_ssh_key(path_to_file: Path)[source]

Retrieve SSH key for the cluster.

Parameters: path_to_file (Path) – Path of the private key associated with the cluster.

Example

>>> ssh_priv_key = rh.ondemand_cluster("rh-cpu").cluster_ssh_key("~/.ssh/id_rsa")

endpoint(external: bool = False)[source]

Endpoint for the cluster’s Daemon server.

Parameters: external (bool, optional) – If True, returns only the external URL, or None if there is none (e.g. if a tunnel is required). If False, returns the external URL if it exists; otherwise sets up the connection (based on connection_type) and returns the internal URL (using the locally connected port rather than the server port). If the cluster is not up, returns None. (Default: False)

get_instance_type()[source]: Returns instance type of the cluster.

property internal_ips: Internal cluster IPs

property ips: Cluster IPs

is_up() → bool[source]

Whether the cluster is up.

Example

>>> rh.ondemand_cluster("rh-cpu").is_up()

keep_warm(mins: int = -1)[source]

Keep the cluster warm for given number of minutes after inactivity.

Parameters: mins (int) – Amount of time (in minutes) to keep the cluster warm after inactivity. If set to -1, keep the cluster warm indefinitely. (Default: -1)

num_cpus()[source]: Return the number of CPUs for a CPU cluster.

pause_autostop()[source]

Context manager to temporarily pause autostop.

Example

>>> cluster = rh.ondemand_cluster("rh-cpu")
>>> with cluster.pause_autostop():
>>>     cluster.run_bash(["python train.py"])

ssh(node: str = None)[source]

SSH into the cluster.

Parameters: node (str, optional) – Node to SSH into. If no node is specified, will SSH onto the head node. (Default: None)

Example

>>> rh.ondemand_cluster("rh-cpu").ssh()
>>> rh.ondemand_cluster("rh-cpu").ssh(node="3.89.174.234")

teardown(verbose: bool = True)[source]

Tear down the cluster.

Parameters: verbose (bool, optional) – Whether to stream logs from Den when the cluster is being downed. Only relevant when tearing down via Den. (Default: True)

Example

>>> rh.ondemand_cluster("rh-cpu").teardown()
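To make sure a cluster does not keep running after a failed job, up_if_not() and teardown() can be paired in a context manager. This is a minimal sketch rather than a Runhouse feature: temporary_cluster is a hypothetical helper, and it only assumes the object passed in exposes the two methods documented in this section.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cluster(cluster):
    """Bring `cluster` up if needed, yield it, and always tear it down.

    `cluster` is any object exposing up_if_not() and teardown(),
    such as the result of rh.ondemand_cluster(...).
    """
    cluster.up_if_not()
    try:
        yield cluster
    finally:
        # Runs even if the body raises, so the cluster is not left running
        cluster.teardown()
```

A caller would then write `with temporary_cluster(rh.ondemand_cluster("rh-cpu")) as cpu:` and run work on cpu inside the block.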

teardown_and_delete(verbose: bool = True)[source]

Tear down the cluster and delete it from configs.

Parameters: verbose (bool, optional) – Whether to stream logs from Den when the cluster is being downed. Only relevant when tearing down via Den. (Default: True)

Example

>>> rh.ondemand_cluster("rh-cpu").teardown_and_delete()

up(verbose: bool = True, force: bool = False, start_server: bool = True)[source]

Up the cluster.

Parameters:

verbose (bool, optional) – Whether to stream logs from Den when the cluster is being launched. Only relevant if launching via Den. (Default: True)
force (bool, optional) – Whether to launch the cluster even if one with the same configs already exists. Only relevant if launching via Den. (Default: False)

Example

>>> rh.ondemand_cluster("rh-cpu").up()

OnDemandCluster Hardware Setup

On-Demand clusters use SkyPilot to automatically spin up and down clusters on the cloud. You will need to first set up cloud access on your local machine:

Run sky check to see which cloud providers are enabled, and how to set up cloud credentials for each of the providers.

$ sky check

For a more in-depth tutorial on setting up individual cloud credentials, you can refer to the SkyPilot setup docs.

Specifying a VPC

If you would like to launch an on-demand cluster within a specific VPC, you can specify its name in your local ~/.sky/config.yaml in the following format:

<cloud-provider>:
  vpc: <vpc-name>

See the SkyPilot docs for more details on configuring a VPC.

Cluster Authentication & Verification

Runhouse provides a couple of options to manage the connection to the Runhouse API server running on a cluster.

Server Connection

The below options can be specified with the server_connection_type parameter when initializing a cluster. By default the Runhouse API server will be started on the cluster on port 32300.

ssh: Connects to the cluster via an SSH tunnel, by default on port 32300.
tls: Connects to the cluster via HTTPS (by default on port 443) using either a provided certificate, or creating a new self-signed certificate just for this cluster. You must open the needed ports in the firewall, such as via the open_ports argument in the OnDemandCluster, or manually in the compute itself or cloud console.
none: Does not use any port forwarding or enforce any authentication. Connects to the cluster with HTTP by default on port 80. This is useful when connecting to a cluster within a VPC, or creating a tunnel manually on the side with custom settings.
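The default ports listed above can be summarized as a small lookup. This is only an illustrative sketch of the documented defaults, not Runhouse's internal logic: DEFAULT_SERVER_PORTS and default_server_port are hypothetical names, and only the three connection types described here are covered.

```python
# Default ports per connection type, as listed above.
DEFAULT_SERVER_PORTS = {"ssh": 32300, "tls": 443, "none": 80}


def default_server_port(server_connection_type: str) -> int:
    """Return the documented default port for a connection type."""
    try:
        return DEFAULT_SERVER_PORTS[server_connection_type]
    except KeyError:
        # Only the three types described in this section are handled here
        raise ValueError(
            f"Unknown server_connection_type: {server_connection_type!r}"
        ) from None
```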

Note

The tls connection type is the most secure and is recommended for production use if you are not running inside of a VPC. However, be mindful that you must secure the cluster with authentication (see below) if you open it to the public internet.

Server Authentication

If desired, Runhouse provides out-of-the-box authentication via the user’s Runhouse token (generated when logging in and set locally at ~/.rh/config.yaml). This is crucial if the cluster has ports open to the public internet, as would usually be the case when using the tls connection type. You may also set up your own authentication manually inside of your own code, but you should likely still enable Runhouse authentication to ensure that even your non-user-facing endpoints into the server are secured.

When initializing a cluster, you can set the den_auth parameter to True to enable token authentication. Calls to the cluster server can then be made using an auth header with the format: {"Authorization": "Bearer <cluster-token>"}. The Runhouse Python library adds this header to its calls automatically, so your users do not need to worry about it after logging into Runhouse.

Note

Runhouse never uses your default Runhouse token for anything other than requests made to Runhouse Den. Your token will never be exposed or shared with anyone else.

TLS Certificates

Enabling TLS and Runhouse Den Dashboard Auth for the API server makes it incredibly fast and easy to stand up a microservice with standard token authentication, allowing you to easily share Runhouse resources with collaborators, teams, customers, etc.

Let’s illustrate this with a simple example:

import runhouse as rh

def concat(a: str, b: str):
    return a + b

# Launch a cluster with TLS and Den Auth enabled
cpu = rh.ondemand_cluster(instance_type="m5.xlarge",
                          provider="aws",
                          name="rh-cluster",
                          den_auth=True,
                          open_ports=[443],
                          server_connection_type="tls").up_if_not()

# Remote function stub which lives on the cluster
remote_func = rh.function(concat).to(cpu)

# Save to Runhouse Den
remote_func.save()

# Give read access to the function to another user - this will allow them to call this service remotely
# and view the function metadata in Runhouse Den
remote_func.share("user1@gmail.com", access_level="read")

# This other user (user1) can then call the function remotely from any python environment
res = remote_func("run", "house")
print(res)  # runhouse

We can also call the function via an HTTP request, making it easy for other users to call the function with a Runhouse cluster token (Note: this assumes the user has been granted access to the function or write access to the cluster):

$ curl -X GET "https://<DOMAIN>/concat/call?a=run&b=house" \
    -H "Content-Type: application/json" -H "Authorization: Bearer <CLUSTER-TOKEN>"
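The same request can be assembled in Python with only the standard library. The sketch below mirrors the URL shape and headers of the cURL call above; build_call_request is a hypothetical helper, example.com stands in for your domain, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_call_request(domain: str, func_name: str, token: str, **kwargs) -> Request:
    """Build (but do not send) a GET request to a function's call endpoint."""
    query = urlencode(kwargs)  # e.g. a=run&b=house
    url = f"https://{domain}/{func_name}/call?{query}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return Request(url, headers=headers, method="GET")


# example.com and the token are placeholders, not real values
req = build_call_request("example.com", "concat", "<CLUSTER-TOKEN>", a="run", b="house")
print(req.full_url)  # https://example.com/concat/call?a=run&b=house
```

Passing the resulting object to urllib.request.urlopen would send it, provided the caller has been granted access as described above.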

Caddy

Runhouse gives you the option of using Caddy as a reverse proxy for the Runhouse API server, which is a FastAPI app launched with Uvicorn. This is a safer and more conventional approach: the FastAPI app runs on a higher, non-privileged port (such as 32300, the default Runhouse port), and Caddy forwards requests to it from the HTTP port (default: 80) or the HTTPS port (default: 443).

Caddy also enables generating and auto-renewing self-signed certificates, making it easy to secure your cluster with HTTPS right out of the box.

Note

Caddy is enabled by default when you launch a cluster with the server_port set to either 80 or 443.

Generating Certs

Runhouse offers two options for enabling TLS/SSL on a cluster with Caddy:

Using existing certs: provide the path to the cert and key files with the ssl_certfile and ssl_keyfile arguments. These certs will be used by Caddy as specified in the Caddyfile on the cluster. If no cert paths are provided and no domain is specified, Runhouse will issue self-signed certificates to use for the cluster. These certs will not be verified by a CA.
Using Caddy to generate CA-verified certs: provide the domain argument. Caddy will then obtain certificates from Let’s Encrypt on demand when a client connects for the first time.
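The rules above amount to a three-way choice, sketched below. This is an illustration of the documented behavior, not Runhouse's actual code: cert_strategy and its return labels are hypothetical, and giving provided cert files precedence over a domain (and requiring both files) is an assumption the text does not spell out.

```python
from typing import Optional


def cert_strategy(ssl_certfile: Optional[str] = None,
                  ssl_keyfile: Optional[str] = None,
                  domain: Optional[str] = None) -> str:
    """Summarize which certificate source the rules above would select."""
    # Assumption: existing certs win when both files are provided
    if ssl_certfile and ssl_keyfile:
        return "existing"       # user-provided cert and key files
    if domain:
        return "lets_encrypt"   # Caddy obtains CA-verified certs
    return "self_signed"        # generated certs, not verified by a CA
```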

Using a Custom Domain

Runhouse supports using custom domains for deploying your apps and services. You can provide the domain ahead of time before launching the cluster by specifying the domain argument:

cluster = rh.cluster(name="rh-serving-cpu",
                     domain="<your domain>",
                     instance_type="m5.xlarge",
                     server_connection_type="tls",
                     open_ports=[443]).up_if_not()

Note

After the cluster is launched, make sure to add the relevant A record to your domain’s DNS settings to point this domain to the cluster’s public IP address.

You’ll also need to ensure that the relevant ports (e.g. 443) are open in the security group settings of the cluster. Runhouse will also automatically set up a TLS certificate for the domain via Caddy.

If you have an existing cluster, you can also configure a domain by including the IP and domain when initializing the Runhouse cluster object:

cluster = rh.cluster(name="rh-serving-cpu",
                     host=["<public IP>"],
                     domain="<your domain>",
                     server_connection_type="tls",
                     open_ports=[443]).up_if_not()

Now we can send modules or functions to our cluster and seamlessly create endpoints which we can then share and call from anywhere.

Let’s take a look at an example of how to deploy a simple LangChain RAG app.

Once the app has been created and sent to the cluster, we can call it via HTTP directly:

import requests

resp = requests.get("https://<domain>/basic_rag_app/invoke?user_prompt=<prompt>")
print(resp.json())

Or via cURL:

$ curl "https://<domain>/basic_rag_app/invoke?user_prompt=<prompt>"
