Autoscaling
Autoscaler (beta)¶
Ray Autoscaler integration is beta since KubeRay 0.3.0 and Ray 2.0.0. While autoscaling functionality is stable, the details of autoscaler behavior and configuration may change in future releases.
See the official Ray documentation for even more information about Ray autoscaling on Kubernetes.
Prerequisite¶
Start by deploying the latest stable version of the KubeRay operator:
kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v0.4.0&timeout=90s"
Deploy a cluster with autoscaling enabled¶
Next, to deploy a sample autoscaling Ray cluster, run
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/release-0.3/ray-operator/config/samples/ray-cluster.autoscaler.yaml
See the above config file for details on autoscaling configuration.
Note
Ray container resource requests and limits in the example configuration above are too small to be used in production. For typical use-cases, you should use large Ray pods. If possible, each Ray pod should be sized to take up its entire K8s node. We don't recommend allocating less than 8 gigabytes of memory for Ray containers running in production. For an autoscaling configuration more suitable for production, see ray-cluster.autoscaler.large.yaml.
The output of kubectl get pods
should indicate the presence of
a Ray head pod with two containers,
the Ray container and the autoscaler container.
You should also see a Ray worker pod with a single Ray container.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
raycluster-autoscaler-head-mgwwk 2/2 Running 0 4m41s
raycluster-autoscaler-worker-small-group-fg4fv 1/1 Running 0 4m41s
Check the autoscaler container's logs to confirm that the autoscaler is healthy. Here's an example of logs from a healthy autoscaler.
kubectl logs -f raycluster-autoscaler-head-mgwwk autoscaler
2022-03-10 07:51:22,616 INFO monitor.py:226 -- Starting autoscaler metrics server on port 44217
2022-03-10 07:51:22,621 INFO monitor.py:243 -- Monitor: Started
2022-03-10 07:51:22,824 INFO node_provider.py:143 -- Creating KuberayNodeProvider.
2022-03-10 07:51:22,825 INFO autoscaler.py:282 -- StandardAutoscaler: {'provider': {'type': 'kuberay', 'namespace': 'default', 'disable_node_updaters': True, 'disable_launch_config_check': True}, 'cluster_name': 'raycluster-autoscaler', 'head_node_type': 'head-group', 'available_node_types': {'head-group': {'min_workers': 0, 'max_workers': 0, 'node_config': {}, 'resources': {'CPU': 1}}, 'small-group': {'min_workers': 1, 'max_workers': 300, 'node_config': {}, 'resources': {'CPU': 1}}}, 'max_workers': 300, 'idle_timeout_minutes': 5, 'upscaling_speed': 1, 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': [], 'auth': {}, 'head_node': {}, 'worker_nodes': {}}
2022-03-10 07:51:23,027 INFO autoscaler.py:327 --
======== Autoscaler status: 2022-03-10 07:51:23.027271 ========
Node status
---------------------------------------------------------------
Healthy:
1 head-group
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/1.0 CPU
0.00/0.931 GiB memory
0.00/0.200 GiB object_store_memory
Demands:
(no resource demands)
Notes¶
-
To enable autoscaling, set your RayCluster CR's
spec.enableInTreeAutoscaling
field to true. The operator will then automatically inject a preconfigured autoscaler container to the head pod. The service account, role, and role binding needed by the autoscaler will be created by the operator out-of-box. The operator will also configure an empty-dir logging volume for the Ray head pod. The volume will be mounted into the Ray and autoscaler containers; this is necessary to support the event logging introduced in Ray PR #13434.spec: enableInTreeAutoscaling: true
-
If your RayCluster CR's
spec.rayVersion
field is at least2.0.0
, the autoscaler container will use the same image as the Ray container. For Ray versions older than 2.0.0, the imagerayproject/ray:2.0.0
will be used to run the autoscaler. -
Autoscaling functionality is supported only with Ray versions at least as new as 1.11.0. Autoscaler support is beta as of Ray 2.0.0 and KubeRay 0.3.0; while autoscaling functionality is stable, the details of autoscaler behavior and configuration may change in future releases.
Test autoscaling¶
Let's now try out the autoscaler. We can run the following command to get a Python interpreter in the head pod:
kubectl exec `kubectl get pods -o custom-columns=POD:metadata.name | grep raycluster-autoscaler-head` -it -c ray-head -- python
In the Python interpreter, run the following snippet to scale up the cluster:
import ray
ray.init()
ray.autoscaler.sdk.request_resources(num_cpus=4)
You should then see two extra Ray nodes (pods) scale up to satisfy the 4 CPU demand.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
raycluster-autoscaler-head-mgwwk 2/2 Running 0 4m41s
raycluster-autoscaler-worker-small-group-4d255 1/1 Running 0 40s
raycluster-autoscaler-worker-small-group-fg4fv 1/1 Running 0 4m41s
raycluster-autoscaler-worker-small-group-qzhvg 1/1 Running 0 40s