Cluster

A Cluster is a Runhouse primitive used for abstracting a particular hardware configuration. This can be an on-demand cluster (requires valid cloud credentials), a BYO (bring-your-own) cluster (requires an IP address and SSH credentials), or a SageMaker cluster (requires an IAM role ARN).

A cluster is assigned a name, through which it can be accessed and reused later on.

Cluster Factory Methods

runhouse.cluster(name: str, host: str | List[str] | None = None, ssh_creds: dict | None = None, server_port: int | None = None, server_host: str | None = None, server_connection_type: ServerConnectionType | str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, dryrun: bool = False, **kwargs) Cluster | OnDemandCluster | SageMakerCluster[source]

Builds an instance of Cluster.

Parameters:
  • name (str) – Name for the cluster, to re-use later on.

  • host (str or List[str], optional) – Hostname (e.g. domain or name in .ssh/config), IP address, or list of IP addresses for the cluster (the first of which is the head node).

  • ssh_creds (dict, optional) – Dictionary mapping SSH credentials. Example: ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'}

  • server_port (int, optional) – Port to use for the server. If not provided, will use 80 for a server_connection_type of none, 443 for tls, and 32300 for all other connection types.

  • server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument that runhouse start is run with on the cluster). Defaults to "0.0.0.0" unless connecting to the server with an SSH connection, in which case localhost is used.

  • server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. ssh will start the server and connect to it via an SSH tunnel. tls will start the server with HTTPS on port 443 using TLS certs, without an SSH tunnel. none will start the server with HTTP, without an SSH tunnel. aws_ssm will start the server with HTTP using AWS SSM port forwarding. paramiko will use Paramiko to create an SSH tunnel to the cluster.

  • ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS.

  • ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS.

  • domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster.

  • den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: False).

  • dryrun (bool) – Whether to create the Cluster if it doesn’t exist, or load a Cluster object as a dryrun. (Default: False)

Returns:

The resulting cluster.

Return type:

Union[Cluster, OnDemandCluster, SageMakerCluster]

Example
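A minimal sketch of constructing a static (BYO) cluster; the host and key path below are placeholders:

    import runhouse as rh

    # Connect to an existing machine by IP address using local SSH credentials
    cpu = rh.cluster(
        name="cpu-cluster",
        host="<ip-address>",
        ssh_creds={"ssh_user": "ubuntu", "ssh_private_key": "~/.ssh/id_rsa"},
    )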

runhouse.ondemand_cluster(name: str, instance_type: str | None = None, num_instances: int | None = None, provider: str | None = None, autostop_mins: int | None = None, use_spot: bool = False, image_id: str | None = None, region: str | None = None, memory: int | str | None = None, disk_size: int | str | None = None, open_ports: int | str | List[int] | None = None, server_port: int | None = None, server_host: int | None = None, server_connection_type: ServerConnectionType | str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool | None = None, dryrun: bool = False, **kwargs) OnDemandCluster[source]

Builds an instance of OnDemandCluster. Note that image_id, region, memory, disk_size, and open_ports are all passed through to SkyPilot’s Resource constructor.

Parameters:
  • name (str) – Name for the cluster, to re-use later on.

  • instance_type (str, optional) – Type of cloud instance to use for the cluster. This could be a Runhouse built-in type, or your choice of instance type.

  • num_instances (int, optional) – Number of instances to use for the cluster.

  • provider (str, optional) – Cloud provider to use for the cluster.

  • autostop_mins (int, optional) – Number of minutes to keep the cluster up after inactivity, or -1 to keep cluster up indefinitely.

  • use_spot (bool, optional) – Whether or not to use a spot instance.

  • image_id (str, optional) – Custom image ID for the cluster.

  • region (str, optional) – The region to use for the cluster.

  • memory (int or str, optional) – Amount of memory to use for the cluster, e.g. “16” or “16+”.

  • disk_size (int or str, optional) – Amount of disk space to use for the cluster, e.g. “100” or “100+”.

  • open_ports (int or str or List[int], optional) – Ports to open in the cluster’s security group. Note that you are responsible for ensuring that the applications listening on these ports are secure.

  • server_port (int, optional) – Port to use for the server. If not provided, will use 80 for a server_connection_type of none, 443 for tls, and 32300 for all other connection types.

  • server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument that runhouse start is run with on the cluster). Defaults to "0.0.0.0" unless connecting to the server with an SSH connection, in which case localhost is used.

  • server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. ssh will start the server and connect to it via an SSH tunnel. tls will start the server with HTTPS on port 443 using TLS certs, without an SSH tunnel. none will start the server with HTTP, without an SSH tunnel. aws_ssm will start the server with HTTP using AWS SSM port forwarding. paramiko will use Paramiko to create an SSH tunnel to the cluster.

  • ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS.

  • ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS.

  • domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster.

  • den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: False).

  • dryrun (bool) – Whether to create the Cluster if it doesn’t exist, or load a Cluster object as a dryrun. (Default: False)

Returns:

The resulting cluster.

Return type:

OnDemandCluster

Example
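A minimal sketch, with an illustrative provider and SkyPilot-style instance type:

    import runhouse as rh

    # Request a 4x A100 cluster on GCP that autostops after 60 minutes of inactivity
    gpu = rh.ondemand_cluster(
        name="rh-4-a100s",
        instance_type="A100:4",
        provider="gcp",
        autostop_mins=60,
    )
    gpu.up_if_not()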

runhouse.sagemaker_cluster(name: str, role: str = None, profile: str = None, ssh_key_path: str = None, instance_id: str = None, instance_type: str = None, num_instances: int = None, image_uri: str = None, autostop_mins: int = None, connection_wait_time: int = None, estimator: sagemaker.estimator.EstimatorBase | Dict = None, job_name: str = None, server_port: int = None, server_host: int = None, server_connection_type: ServerConnectionType | str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, den_auth: bool = False, dryrun: bool = False, **kwargs) SageMakerCluster[source]

Builds an instance of SageMakerCluster. See SageMaker Hardware Setup section for more specific instructions and requirements for providing the role and setting up the cluster.

Parameters:
  • name (str) – Name for the cluster, to re-use later on.

  • role (str, optional) – An AWS IAM role (either name or full ARN). Can be passed in explicitly as an argument or provided via an estimator. If not specified will try using the profile attribute or environment variable AWS_PROFILE to extract the relevant role ARN. More info on configuring an IAM role for SageMaker here.

  • profile (str, optional) – AWS profile to use for the cluster. If provided instead of a role, will lookup the role ARN associated with the profile in the local AWS credentials. If not provided, will use the default profile.

  • ssh_key_path (str, optional) – Path (relative or absolute) to private SSH key to use for connecting to the cluster. If not provided, will look for the key in path ~/.ssh/sagemaker-ssh-gw. If not found will generate new keys and upload the public key to the default s3 bucket for the Role ARN.

  • instance_id (str, optional) – ID of the AWS instance to use for the cluster. SageMaker does not expose the IP addresses of its instances, so we use an instance ID as a unique identifier for the cluster.

  • instance_type (str, optional) –

    Type of AWS instance to use for the cluster. More info on supported instance options here. (Default: ml.m5.large.)

  • num_instances (int, optional) – Number of instances to use for the cluster. (Default: 1.)

  • image_uri (str, optional) – Image to use for the cluster instead of using the default SageMaker image which will be based on the framework_version and py_version. Can be an ECR url or dockerhub image and tag.

  • estimator (Union[str, sagemaker.estimator.EstimatorBase], optional) –

    Estimator to use for a dedicated training job. Leave as None if launching the compute without running a dedicated job. More info on creating an estimator here.

  • autostop_mins (int, optional) – Number of minutes to keep the cluster up after inactivity, or -1 to keep cluster up indefinitely. Note: this will keep the cluster up even if a dedicated training job has finished running or failed.

  • connection_wait_time (int, optional) – Amount of time to wait inside the SageMaker cluster before continuing with normal execution. Useful if you want to connect before a dedicated job starts (e.g. training). If you don’t want to wait, set it to 0. If no estimator is provided, will default to 0.

  • job_name (str, optional) – Name to provide for a training job. If not provided will generate a default name based on the image name and current timestamp (e.g. pytorch-training-2023-08-28-20-57-55-113).

  • server_port (int, optional) – Port to use for the server (Default: 32300).

  • server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument that runhouse start is run with on the cluster). Note: For SageMaker, since we connect to the Runhouse API server via an SSH tunnel, the only valid host is localhost.

  • server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. Note: For SageMaker, aws_ssm is currently the only valid server connection type.

  • ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS.

  • ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS.

  • domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster.

  • den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: False).

  • dryrun (bool) – Whether to create the SageMakerCluster if it doesn’t exist, or load a SageMakerCluster object as a dryrun. (Default: False)

Returns:

The resulting cluster.

Return type:

SageMakerCluster

Example
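A minimal sketch; the role ARN and instance type below are placeholders:

    import runhouse as rh

    # Launch a SageMaker instance using an IAM role with SageMaker permissions
    sm = rh.sagemaker_cluster(
        name="rh-sagemaker",
        role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
        instance_type="ml.p3.2xlarge",
    )
    sm.up_if_not()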

Cluster Class

class runhouse.Cluster(name: str | None = None, ips: List[str] | None = None, ssh_creds: Dict | None = None, server_host: str | None = None, server_port: int | None = None, ssh_port: int | None = None, client_port: int | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, use_local_telemetry: bool = False, dryrun=False, **kwargs)[source]
__init__(name: str | None = None, ips: List[str] | None = None, ssh_creds: Dict | None = None, server_host: str | None = None, server_port: int | None = None, ssh_port: int | None = None, client_port: int | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, use_local_telemetry: bool = False, dryrun=False, **kwargs)[source]

The Runhouse cluster, or system. This is where you can run Functions and access or transfer data. You can BYO (bring-your-own) a cluster by providing the cluster IP and ssh_creds, or this can be an on-demand cluster that is spun up/down through SkyPilot using your cloud credentials.

Note

To build a cluster, please use the factory method cluster().

add_secrets(provider_secrets: List[str], env: str | Env = None)[source]

Copy secrets from the current environment onto the cluster.

call(module_name, method_name, *args, stream_logs=True, run_name=None, remote=False, run_async=False, save=False, **kwargs)[source]

Call a method on a module that is in the cluster’s object store.

Parameters:
  • module_name (str) – Name of the module saved on system.

  • method_name (str) – Name of the method.

  • stream_logs (bool) – Whether to stream logs from the method call.

  • run_name (str) – Name for the run.

  • remote (bool) – Return a remote object from the function, rather than the result proper.

  • run_async (bool) – Run the method asynchronously and return an awaitable.

  • *args – Positional arguments to pass to the method.

  • **kwargs – Keyword arguments to pass to the method.

Example
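For instance, assuming a module named "my_model" with a "predict" method has already been put on the cluster (hypothetical names):

    # Call my_model.predict(x=[1, 2, 3]) remotely and stream its logs locally
    result = cluster.call("my_model", "predict", x=[1, 2, 3])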

clear()[source]

Clear the cluster’s object store.

delete(keys: None | str | List[str])[source]

Delete the given items from the cluster’s object store. To delete all items, use cluster.clear()

disconnect()[source]

Disconnect the RPC tunnel.

Example

download_cert()[source]

Download the certificate from the cluster (Note: the user must have access to the cluster).

enable_den_auth(flush=True)[source]

Enable Den auth on the cluster.

endpoint(external=False)[source]

Endpoint for the cluster's daemon server. If external is True, will only return the external URL, and will return None otherwise (e.g. if a tunnel is required). If external is False, will either return the external URL if it exists, or will set up the connection (based on connection_type) and return the internal URL (including the locally connected port rather than the server port). If the cluster is not up, returns None.

get(key: str, default: Any | None = None, remote=False)[source]

Get the result for a given key from the cluster’s object store. To raise an error if the key is not found, use cluster.get(key, default=KeyError).
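For instance, a put followed by a get (the key and value are illustrative):

    cluster.put("my_list", [1, 2, 3])                   # store an object under a key
    assert cluster.get("my_list") == [1, 2, 3]          # retrieve it
    missing = cluster.get("not_there", default=None)    # returns None instead of raising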

install_packages(reqs: List[Package | str], env: Env | str = None)[source]

Install the given packages on the cluster.

Parameters:
  • reqs (List[Package or str]) – List of packages to install on the cluster and env.

  • env (Env or str) – Environment to install package on. If left empty, defaults to base environment. (Default: None)

Example
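For instance:

    # Install packages into the cluster's default env
    cluster.install_packages(["numpy", "pandas"])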

is_connected()[source]

Whether the RPC tunnel is up.

Example

is_up() bool[source]

Check if the cluster is up.

Example

keys(env=None)[source]

List all keys in the cluster’s object store.

notebook(persist: bool = False, sync_package_on_close: str | None = None, port_forward: int = 8888)[source]

Tunnel into and launch notebook from the cluster.

Example

on_this_cluster()[source]

Whether this function is being called on the same cluster.

pause_autostop()[source]

Context manager to temporarily pause autostop. Mainly relevant for on-demand clusters; BYO clusters have no autostop.

put(key: str, obj: Any, env=None)[source]

Put the given object on the cluster’s object store at the given key.

put_resource(resource: Resource, state: Dict | None = None, dryrun: bool = False, env=None)[source]

Put the given resource on the cluster’s object store. Returns the key (important if name is not set).

remove_conda_env(env: str | CondaEnv)[source]

Remove conda env from the cluster.

Example

rename(old_key: str, new_key: str)[source]

Rename a key in the cluster’s object store.

restart_server(_rh_install_url: str = None, resync_rh: bool = True, restart_ray: bool = False, env: str | Env = None, restart_proxy: bool = False)[source]

Restart the RPC server.

Parameters:
  • resync_rh (bool) – Whether to resync runhouse. (Default: True)

  • restart_ray (bool) – Whether to restart Ray. (Default: True)

  • env (str or Env, optional) – Specified environment to restart the server on. (Default: None)

  • restart_proxy (bool) – Whether to restart Caddy on the cluster, if configured. (Default: False)

Example

run(commands: List[str], env: Env | str = None, stream_logs: bool = True, port_forward: None | int | Tuple[int, int] = None, require_outputs: bool = True, node: str | None = None, run_name: str | None = None) list[source]

Run a list of shell commands on the cluster. If run_name is provided, the commands will be sent over to the cluster before being executed and a Run object will be created.

Example
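For instance (the commands are illustrative):

    # Run shell commands on the head node, streaming logs back locally
    cluster.run(["pip install numpy", "python --version"])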

run_python(commands: List[str], env: Env | str = None, stream_logs: bool = True, node: str = None, port_forward: int | None = None, run_name: str | None = None)[source]

Run a list of Python commands on the cluster, or on a specific cluster node if its IP is provided.

Example
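For instance:

    # Execute Python statements remotely, in order
    cluster.run_python(["import numpy", "print(numpy.__version__)"])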

Note

Running Python commands with nested quotes can be finicky. If using nested quotes, try to wrap the outer quotes with double quotes (") and the inner quotes with single quotes (').

save(name: str | None = None, overwrite: bool = True)[source]

Overrides the default resource save() method in order to also update the cluster config on the cluster itself.

property server_address

Address to use in the requests made to the cluster. If creating an SSH tunnel with the cluster, this will be set to localhost, otherwise will use the cluster's domain (if provided), or its public IP address.

ssh()[source]

SSH into the cluster.

Example

property ssh_creds

Retrieve SSH credentials.

stop_server(stop_ray: bool = True, env: str | Env = None)[source]

Stop the RPC server.

Parameters:
  • stop_ray (bool) – Whether to stop Ray. (Default: True)

  • env (str or Env, optional) – Specified environment to stop the server on. (Default: None)

sync_secrets(providers: List[str] | None = None, env: str | Env = None)[source]

Send secrets for the given providers.

Parameters:

providers (List[str] or None) – List of providers to send secrets for. If None, all providers configured in the environment will be sent.

Example
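For instance (assuming AWS and GCP secrets are configured locally):

    # Send locally configured provider secrets to the cluster
    cluster.sync_secrets(providers=["aws", "gcp"])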

up_if_not()[source]

Bring up the cluster if it is not up. No-op if cluster is already up. This only applies to on-demand clusters, and has no effect on self-managed clusters.

Example
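For instance:

    # Bring the cluster up only if it isn't already running
    cluster.up_if_not()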

Cluster Hardware Setup

No additional setup is required. You will just need the cluster's IP address and the path to your SSH credentials ready for cluster initialization.

OnDemandCluster Class

An OnDemandCluster is a cluster that uses SkyPilot functionality underneath to handle various cluster properties.

class runhouse.OnDemandCluster(name, instance_type: str | None = None, num_instances: int | None = None, provider: str | None = None, dryrun=False, autostop_mins=None, use_spot=False, image_id=None, memory=None, disk_size=None, open_ports=None, server_host: str | None = None, server_port: int | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, region=None, sky_state=None, live_state=None, **kwargs)[source]
__init__(name, instance_type: str | None = None, num_instances: int | None = None, provider: str | None = None, dryrun=False, autostop_mins=None, use_spot=False, image_id=None, memory=None, disk_size=None, open_ports=None, server_host: str | None = None, server_port: int | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, region=None, sky_state=None, live_state=None, **kwargs)[source]

On-demand SkyPilot Cluster.

Note

To build a cluster, please use the factory method cluster().

static cluster_ssh_key(path_to_file)[source]

Retrieve SSH key for the cluster.

Example

is_up() bool[source]

Whether the cluster is up.

Example

keep_warm(autostop_mins: int = -1)[source]

Keep the cluster warm for the given number of minutes after inactivity.

Parameters:

autostop_mins (int) – Amount of time (in min) to keep the cluster warm after inactivity. If set to -1, keep cluster warm indefinitely. (Default: -1)

pause_autostop()[source]

Context manager to temporarily pause autostop.

Example
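For instance (remote_fn is a hypothetical function already sent to the cluster):

    # Suspend autostop while a long-running call completes
    with cluster.pause_autostop():
        remote_fn()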

ssh(node: str | None = None)[source]

SSH into the cluster. If no node is specified, will SSH onto the head node.

Example

property ssh_creds

Retrieve SSH creds for the cluster.

Example

teardown()[source]

Teardown cluster.

Example

teardown_and_delete()[source]

Teardown cluster and delete it from configs.

Example

up()[source]

Up the cluster.

Example

OnDemandCluster Hardware Setup

On-Demand clusters use SkyPilot to automatically spin up and down clusters on the cloud. You will need to first set up cloud access on your local machine:

Run sky check to see which cloud providers are enabled, and how to set up cloud credentials for each of the providers.

For a more in depth tutorial on setting up individual cloud credentials, you can refer to SkyPilot setup docs.

SageMakerCluster Class

Note

SageMaker support is in alpha and under active development. Please report any bugs or let us know of any feature requests.

A SageMakerCluster is a cluster that uses a SageMaker instance under the hood.

Runhouse currently supports two core usage paths for SageMaker clusters:

  • Compute backend: You can use SageMaker as a compute backend, just as you would a BYO (bring-your-own) or an on-demand cluster. Runhouse will handle launching the SageMaker compute and creating the SSH connection to the cluster.

  • Dedicated training jobs: You can use a SageMakerCluster class to run a training job on SageMaker compute. To do so, you will need to provide an estimator.

Note

Runhouse requires an AWS IAM role (either name or full ARN) whose credentials have adequate permissions to create SageMaker endpoints and access AWS resources.

Please see SageMaker Hardware Setup for more specific instructions and requirements for providing the role and setting up the cluster.

class runhouse.SageMakerCluster(name: str, role: str | None = None, profile: str | None = None, region: str | None = None, ssh_key_path: str | None = None, instance_id: str | None = None, instance_type: str | None = None, num_instances: int | None = None, image_uri: str | None = None, autostop_mins: int | None = None, connection_wait_time: int | None = None, estimator: EstimatorBase | Dict | None = None, job_name: str | None = None, server_host: str | None = None, server_port: int | None = None, domain: str | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, den_auth: bool = False, dryrun=False, **kwargs)[source]
__init__(name: str, role: str | None = None, profile: str | None = None, region: str | None = None, ssh_key_path: str | None = None, instance_id: str | None = None, instance_type: str | None = None, num_instances: int | None = None, image_uri: str | None = None, autostop_mins: int | None = None, connection_wait_time: int | None = None, estimator: EstimatorBase | Dict | None = None, job_name: str | None = None, server_host: str | None = None, server_port: int | None = None, domain: str | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, den_auth: bool = False, dryrun=False, **kwargs)[source]

The Runhouse SageMaker cluster abstraction. This is where you can use SageMaker as a compute backend, just as you would an on-demand cluster (i.e. cloud VMs) or a BYO (i.e. on-prem) cluster. Additionally supports running dedicated training jobs using SageMaker Estimators.

Note

To build a cluster, please use the factory method sagemaker_cluster().

property connection_wait_time

Amount of time the SSH helper will wait inside SageMaker before it continues normal execution.

property default_bucket

Default bucket to use for storing the cluster’s authorized public keys.

is_up() bool[source]

Check if the cluster is up.

Example

keep_warm(autostop_mins: int = -1)[source]

Keep the cluster warm for the given number of minutes after inactivity.

Parameters:

autostop_mins (int) – Amount of time (in minutes) to keep the cluster warm after inactivity. If set to -1, keep cluster warm indefinitely. (Default: -1)

pause_autostop()[source]

Context manager to temporarily pause autostop.

restart_server(_rh_install_url: str = None, resync_rh: bool = True, restart_ray: bool = True, env: str | Env = None, restart_proxy: bool = False)[source]

Restart the RPC server on the SageMaker instance.

Parameters:
  • resync_rh (bool) – Whether to resync runhouse. (Default: True)

  • restart_ray (bool) – Whether to restart Ray. (Default: True)

  • env (str or Env) – Env to restart the server from. If not provided will use default env on the cluster.

  • restart_proxy (bool) – Whether to restart nginx on the cluster, if configured. (Default: False)

Example

ssh(interactive: bool = True)[source]

SSH into the cluster.

Parameters:

interactive (bool) – Whether to start an interactive shell or not (Default: True).

Example

property ssh_key_path

Relative path to the private SSH key used to connect to the cluster.

status() dict[source]

Get status of SageMaker cluster.

Example

teardown()[source]

Teardown the SageMaker instance.

Example

teardown_and_delete()[source]

Teardown the SageMaker instance and delete from RNS configs.

Example

up()[source]

Up the cluster.

Example

up_if_not()[source]

Bring up the cluster if it is not up. No-op if cluster is already up.

Example

SageMaker Hardware Setup

IAM Role

SageMaker clusters require the AWS CLI V2 and a SageMaker IAM role configured with the AWS Systems Manager.

In order to launch a cluster, you must grant SageMaker the necessary permissions with an IAM role, which can be provided either by name or by full ARN. You can also specify a profile explicitly or with the AWS_PROFILE environment variable.

For example, let’s say your local ~/.aws/config file contains:
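    # Illustrative profile entry; the region and source_profile values are assumptions
    [profile sagemaker]
    region = us-east-1
    role_arn = arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142
    source_profile = default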

There are several ways to provide the necessary credentials when initializing the cluster:

  • Providing the AWS profile name: profile="sagemaker"

  • Providing the AWS Role ARN directly: role="arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142"

  • Environment Variable: setting AWS_PROFILE to "sagemaker"

Note

If no role or profile is provided, Runhouse will try using the default profile. Note if this default AWS identity is not a role, then you will need to provide the role or profile explicitly.

Tip

If you are providing an estimator, you must provide the role ARN explicitly as part of the estimator object. More info on estimators here.

Please see the AWS docs for further instructions on creating and configuring an ARN Role.

AWS CLI V2

The SageMaker SDK uses AWS CLI V2, which must be installed on your local machine. Doing so requires one of two steps:

To confirm the installation succeeded, run aws --version in the command line. You should see something like:
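    $ aws --version
    aws-cli/2.x.x Python/3.x.x ...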

If you are still seeing the V1 version, first try uninstalling V1 in case it is still present (e.g. pip uninstall awscli).

You may also need to add the V2 executable to the PATH of your python environment. For example, if you are using conda, it’s possible the conda env will try using its own version of the AWS CLI located at a different path (e.g. /opt/homebrew/anaconda3/bin/aws), while the system wide installation of AWS CLI is located somewhere else (e.g. /opt/homebrew/bin/aws).

To find the global AWS CLI path:
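    $ which aws
    /opt/homebrew/bin/aws

(The exact path will vary by system; /opt/homebrew/bin/aws is the example path from above.)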

To ensure that the global AWS CLI version is used within your python environment, you’ll need to adjust the PATH environment variable so that it prioritizes the global AWS CLI path.
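For example, assuming the global CLI lives at /opt/homebrew/bin/aws as above, you might prepend that directory before initializing the cluster:

    $ export PATH=/opt/homebrew/bin:$PATH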

SSM Setup

The AWS Systems Manager service is used to create SSH tunnels with the SageMaker cluster.

To install the AWS Session Manager Plugin, please see the AWS docs or SageMaker SSH Helper. The SSH Helper package simplifies the process of creating SSH tunnels with SageMaker clusters. It is installed by default if you are installing Runhouse with the SageMaker dependency: pip install runhouse[sagemaker].

You can also install the Session Manager by running the CLI command:

To configure your SageMaker IAM role with the AWS Systems Manager, please refer to these instructions.

Cluster Authentication & Verification

Runhouse provides a couple of options to manage the connection to the Runhouse API server running on a cluster.

Server Connection

The below options can be specified with the server_connection_type parameter when initializing a cluster. By default the Runhouse API server will be started on the cluster on port 32300.

  • ssh: Connects to the cluster via an SSH tunnel, by default on port 32300.

  • tls: Connects to the cluster via HTTPS (by default on port 443) using either a provided certificate, or creating a new self-signed certificate just for this cluster. You must open the needed ports in the firewall, such as via the open_ports argument in the OnDemandCluster, or manually in the compute itself or cloud console.

  • none: Does not use any port forwarding or enforce any authentication. Connects to the cluster with HTTP by default on port 80. This is useful when connecting to a cluster within a VPC, or creating a tunnel manually on the side with custom settings.

  • aws_ssm: Uses the AWS Systems Manager to create an SSH tunnel to the cluster, by default on port 32300. Note: this is currently only relevant for SageMaker Clusters.

  • paramiko: Uses Paramiko to create an SSH tunnel to the cluster. This is relevant if you are using a cluster which requires a password to authenticate.

Note

The tls connection type is the most secure and is recommended for production use if you are not running inside of a VPC. However, be mindful that you must secure the cluster with authentication (see below) if you open it to the public internet.

Server Authentication

If desired, Runhouse provides out-of-the-box authentication via users' Runhouse tokens (generated when logging in and set locally at ~/.rh/config.yaml). This is crucial if the cluster has ports open to the public internet, as would usually be the case when using the tls connection type. You may also set up your own authentication manually inside of your own code, but you should likely still enable Runhouse authentication to ensure that even your non-user-facing endpoints into the server are secured.

When initializing a cluster, you can set the den_auth parameter to True to enable token authentication. Calls to the cluster's server can then be made with an auth header of the format: {"Authorization": "Bearer <token>"}. The Runhouse Python library adds this header to its calls automatically, so your users do not need to worry about it after logging into Runhouse.

TLS Certificates

Enabling TLS and Runhouse Den Dashboard Auth for the API server makes it incredibly fast and easy to stand up a microservice with standard token authentication, allowing you to easily share Runhouse resources with collaborators, teams, customers, etc.

Let’s illustrate this with a simple example:
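A sketch of what this could look like; the cluster settings and the np_array function below are illustrative:

    import numpy as np
    import runhouse as rh

    # Launch a cluster that serves the Runhouse API over HTTPS with Den auth enabled
    cluster = rh.ondemand_cluster(
        name="rh-cluster",
        instance_type="CPU:2+",
        provider="aws",
        open_ports=[443],
        server_connection_type="tls",
        den_auth=True,
    )
    cluster.up_if_not()

    def np_array(num_list: list):
        return np.array(num_list)

    # Send the function to the cluster; anyone granted access can now call it
    remote_np_array = rh.function(np_array).to(cluster, env=["numpy"])
    res = remote_np_array([1, 2, 3])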

We can also call the function via an HTTP request, making it easy for other users to call the function with a Runhouse token (Note: this assumes the user has been granted access to the function or write access to the cluster):
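A request along these lines could work; the endpoint path, certificate location, and payload shape here are assumptions rather than a verbatim API reference:

    curl --cacert /path/to/certs/rh-cluster/rsa.cert \
      -X POST "https://<cluster-address>/np_array/call?serialization=pickle" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <your-runhouse-token>" \
      -d '{"args": [[1, 2, 3]]}'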

Here we use --cacert and the full path to the local cert for verification (assuming we are using self-signed certs generated by Runhouse), and serialization=pickle in order to return the pickled numpy array from the np_array function.

If you are using a domain and have obtained a CA verified certificate, you can use the domain name in place of the IP address and the --cacert flag is not needed. See below for more info on automatically obtaining CA verified certs with Caddy.

Tip

For more examples on using clusters and functions see the Compute Guide.

Caddy

Runhouse gives you the option of using Caddy as a reverse proxy for the Runhouse API server, which is a FastAPI app launched with Uvicorn. Using Caddy provides a safer and more conventional approach: the FastAPI app runs on a higher, non-privileged port (such as 32300, the default Runhouse port), while Caddy acts as a reverse proxy, forwarding requests from the HTTP port (default: 80) or the HTTPS port (default: 443).

Caddy also enables generating and auto-renewing self-signed certificates, making it easy to secure your cluster with HTTPS right out of the box.

Note

Caddy is enabled by default when you launch a cluster with the server_port set to either 80 or 443.

Generating Certs

Runhouse offers two options for enabling TLS/SSL on a cluster with Caddy:

  1. Using existing certs: provide the path to the cert and key files with the ssl_certfile and ssl_keyfile arguments. These certs will be used by Caddy as specified in the Caddyfile on the cluster. If no cert paths are provided and no domain is specified, Runhouse will issue self-signed certificates to use for the cluster. These certs will not be verified by a CA.

  2. Using Caddy to generate CA verified certs: Provide the domain argument. Caddy will then obtain certificates from Let’s Encrypt on-demand when a client connects for the first time.