RayCluster
A Helm chart for deploying the RayCluster with the kuberay operator.
Homepage: https://github.com/ray-project/kuberay
Introduction
RayCluster is a custom resource definition (CRD). The KubeRay operator watches RayCluster resource events and creates the corresponding Kubernetes resources (e.g. Pods and Services). This guide therefore requires the CRDs to be registered and the KubeRay operator to be installed.
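To make the relationship between the CRD, the custom resource, and the objects the operator creates more concrete, the following sketch lists each of them with `kubectl`. It assumes the release from the end-to-end example below, so the RayCluster is named `raycluster-kuberay` and its head Service is `raycluster-kuberay-head-svc`. The `run` helper, the `inspect_raycluster` function, and the `DRY_RUN` variable are hypothetical names introduced for this illustration only.

```shell
#!/usr/bin/env bash
# Sketch: list the objects behind a RayCluster.
# Assumption: RayCluster "raycluster-kuberay" in the current namespace.
CLUSTER_NAME="${CLUSTER_NAME:-raycluster-kuberay}"

# Print each command; execute it unless DRY_RUN=1 (a convention of this sketch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

inspect_raycluster() {
  # The CRD registered together with the KubeRay operator.
  run kubectl get crd rayclusters.ray.io
  # The RayCluster custom resource itself.
  run kubectl get raycluster "$CLUSTER_NAME"
  # Pods created by the operator, selected by the ray.io/cluster label.
  run kubectl get pods --selector="ray.io/cluster=$CLUSTER_NAME"
  # The head Service, named after the cluster.
  run kubectl get service "${CLUSTER_NAME}-head-svc"
}
```

Calling `inspect_raycluster` after the installation steps runs the four queries; setting `DRY_RUN=1` first only prints them.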
Prerequisites
See kuberay-operator/README.md for more details.
- Helm
- Custom resource definitions and the KubeRay operator installed (covered by the following end-to-end example)
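A quick way to check the tooling before starting is sketched below. Only Helm is listed as a prerequisite here; `kubectl` and `kind` are included on the assumption that the end-to-end example will be followed as written. The function name `missing_tools` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report which CLI tools are missing from PATH.
# Prints one missing tool per line; prints nothing when all are present.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Assumption: kubectl and kind are needed because the example calls them.
missing="$(missing_tools helm kubectl kind)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All prerequisites found."
fi
```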
End-to-end example
# Step 1: Create a KinD cluster
kind create cluster
# Step 2: Register a Helm chart repo
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
# Step 3: Install both CRDs and KubeRay operator v1.1.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0
# Step 4: Install a RayCluster custom resource
# (For x86_64 users)
helm install raycluster kuberay/ray-cluster --version 1.1.0
# (For arm64 users, e.g. Mac M1)
# See here for all available arm64 images: https://hub.docker.com/r/rayproject/ray/tags?page=1&name=aarch64
helm install raycluster kuberay/ray-cluster --version 1.1.0 --set image.tag=nightly-aarch64
# Step 5: Verify the installation of KubeRay operator and RayCluster
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6fcbb94f64-gkpc9 1/1 Running 0 89s
# raycluster-kuberay-head-qp9f4 1/1 Running 0 66s
# raycluster-kuberay-worker-workergroup-2jckt 1/1 Running 0 66s
# Step 6: Forward the Dashboard port to localhost
kubectl port-forward svc/raycluster-kuberay-head-svc 8265:8265
# Step 7: Check 127.0.0.1:8265 for the Dashboard
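# Step 8: Log in to Ray head Pod and execute a job.
# (Assumes a single RayCluster in the current namespace; KubeRay labels the head Pod with ray.io/node-type=head.)
export RAYCLUSTER_HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it ${RAYCLUSTER_HEAD_POD} -- bash
python -c "import ray; ray.init(); print(ray.cluster_resources())" # (in Ray head Pod)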
# Step 9: Check 127.0.0.1:8265/#/job. The status of the job should be "SUCCEEDED".
# Step 10: Uninstall RayCluster
helm uninstall raycluster
# Step 11: Verify that RayCluster has been removed successfully
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-6fcbb94f64-gkpc9 1/1 Running 0 9m57s
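The example stops once the RayCluster is removed, so the operator release from Step 3 and the KinD cluster from Step 1 are still running. A possible teardown is sketched below. It assumes the release name `kuberay-operator` and the default KinD cluster used above; the `run` helper, the `teardown` function, and the `DRY_RUN` variable are hypothetical names for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: remove what Steps 1 and 3 created.

# Print each command; execute it unless DRY_RUN=1 (a convention of this sketch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

teardown() {
  # Remove the operator release installed in Step 3.
  run helm uninstall kuberay-operator
  # Delete the default KinD cluster created in Step 1.
  run kind delete cluster
}
```

Running `teardown` with `DRY_RUN=1` set prints the two commands without executing them, which is a cheap way to confirm the release name before deleting anything.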
Values
Key | Type | Default | Description |
---|---|---|---|
image.repository | string | "rayproject/ray" | Image repository. |
image.tag | string | "2.46.0" | Image tag. |
image.pullPolicy | string | "IfNotPresent" | Image pull policy. |
nameOverride | string | "kuberay" | String to partially override release name. |
fullnameOverride | string | "" | String to fully override release name. |
imagePullSecrets | list | [] | Secrets with credentials to pull images from a private registry |
gcsFaultTolerance.enabled | bool | false | |
common.containerEnv | list | [] | containerEnv specifies environment variables for the Ray head and worker containers. Follows standard K8s container env schema. |
head.initContainers | list | [] | Init containers to add to the head pod |
head.labels | object | {} | Labels for the head pod |
head.serviceAccountName | string | "" | |
head.restartPolicy | string | "" | |
head.containerEnv | list | [] | |
head.envFrom | list | [] | envFrom to pass to head pod |
head.resources.limits.cpu | string | "1" | |
head.resources.limits.memory | string | "2G" | |
head.resources.requests.cpu | string | "1" | |
head.resources.requests.memory | string | "2G" | |
head.annotations | object | {} | Extra annotations for head pod |
head.nodeSelector | object | {} | Node labels for head pod assignment |
head.tolerations | list | [] | Node tolerations for head pod scheduling to nodes with taints |
head.affinity | object | {} | Head pod affinity |
head.podSecurityContext | object | {} | Head pod security context. |
head.securityContext | object | {} | Ray container security context. |
head.volumes[0].name | string | "log-volume" | |
head.volumes[0].emptyDir | object | {} | |
head.volumeMounts[0].mountPath | string | "/tmp/ray" | |
head.volumeMounts[0].name | string | "log-volume" | |
head.sidecarContainers | list | [] | |
head.command | list | [] | |
head.args | list | [] | |
head.headService | object | {} | |
head.topologySpreadConstraints | list | [] | |
head.rayStartParams | object | {} | |
worker.groupName | string | "workergroup" | The name of the workergroup |
worker.replicas | int | 1 | The number of replicas for the worker pod |
worker.minReplicas | int | 1 | The minimum number of replicas for the worker pod |
worker.maxReplicas | int | 3 | The maximum number of replicas for the worker pod |
worker.labels | object | {} | Labels for the worker pod |
worker.serviceAccountName | string | "" | |
worker.restartPolicy | string | "" | |
worker.initContainers | list | [] | Init containers to add to the worker pod |
worker.containerEnv | list | [] | |
worker.envFrom | list | [] | envFrom to pass to worker pod |
worker.resources.limits.cpu | string | "1" | |
worker.resources.limits.memory | string | "1G" | |
worker.resources.requests.cpu | string | "1" | |
worker.resources.requests.memory | string | "1G" | |
worker.annotations | object | {} | Extra annotations for worker pod |
worker.nodeSelector | object | {} | Node labels for worker pod assignment |
worker.tolerations | list | [] | Node tolerations for worker pod scheduling to nodes with taints |
worker.affinity | object | {} | Worker pod affinity |
worker.podSecurityContext | object | {} | Worker pod security context. |
worker.securityContext | object | {} | Ray container security context. |
worker.volumes[0].name | string | "log-volume" | |
worker.volumes[0].emptyDir | object | {} | |
worker.volumeMounts[0].mountPath | string | "/tmp/ray" | |
worker.volumeMounts[0].name | string | "log-volume" | |
worker.sidecarContainers | list | [] | |
worker.command | list | [] | |
worker.args | list | [] | |
worker.topologySpreadConstraints | list | [] | |
worker.rayStartParams | object | {} | |
additionalWorkerGroups.smallGroup.disabled | bool | true | |
additionalWorkerGroups.smallGroup.replicas | int | 0 | The number of replicas for the additional worker pod |
additionalWorkerGroups.smallGroup.minReplicas | int | 0 | The minimum number of replicas for the additional worker pod |
additionalWorkerGroups.smallGroup.maxReplicas | int | 3 | The maximum number of replicas for the additional worker pod |
additionalWorkerGroups.smallGroup.labels | object | {} | Labels for the additional worker pod |
additionalWorkerGroups.smallGroup.serviceAccountName | string | "" | |
additionalWorkerGroups.smallGroup.restartPolicy | string | "" | |
additionalWorkerGroups.smallGroup.containerEnv | list | [] | |
additionalWorkerGroups.smallGroup.envFrom | list | [] | envFrom to pass to additional worker pod |
additionalWorkerGroups.smallGroup.resources.limits.cpu | int | 1 | |
additionalWorkerGroups.smallGroup.resources.limits.memory | string | "1G" | |
additionalWorkerGroups.smallGroup.resources.requests.cpu | int | 1 | |
additionalWorkerGroups.smallGroup.resources.requests.memory | string | "1G" | |
additionalWorkerGroups.smallGroup.annotations | object | {} | Extra annotations for additional worker pod |
additionalWorkerGroups.smallGroup.nodeSelector | object | {} | Node labels for additional worker pod assignment |
additionalWorkerGroups.smallGroup.tolerations | list | [] | Node tolerations for additional worker pod scheduling to nodes with taints |
additionalWorkerGroups.smallGroup.affinity | object | {} | Additional worker pod affinity |
additionalWorkerGroups.smallGroup.podSecurityContext | object | {} | Additional worker pod security context. |
additionalWorkerGroups.smallGroup.securityContext | object | {} | Ray container security context. |
additionalWorkerGroups.smallGroup.volumes[0].name | string | "log-volume" | |
additionalWorkerGroups.smallGroup.volumes[0].emptyDir | object | {} | |
additionalWorkerGroups.smallGroup.volumeMounts[0].mountPath | string | "/tmp/ray" | |
additionalWorkerGroups.smallGroup.volumeMounts[0].name | string | "log-volume" | |
additionalWorkerGroups.smallGroup.sidecarContainers | list | [] | |
additionalWorkerGroups.smallGroup.command | list | [] | |
additionalWorkerGroups.smallGroup.args | list | [] | |
additionalWorkerGroups.smallGroup.topologySpreadConstraints | list | [] | |
additionalWorkerGroups.smallGroup.rayStartParams | object | {} | |
service.type | string | "ClusterIP" | |
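Any key in the table can be overridden at install or upgrade time, either inline with `--set` (as the arm64 step does for `image.tag`) or through a values file. The sketch below writes a small override file that raises the head memory and the worker count. The file name `ray-values-override.yaml` and the chosen numbers are illustrative assumptions, not recommendations from the chart.

```shell
#!/usr/bin/env bash
# Sketch: write an override file using keys from the Values table.
# The file name and the numbers below are illustrative assumptions.
VALUES_FILE="${VALUES_FILE:-ray-values-override.yaml}"

cat > "$VALUES_FILE" <<'EOF'
# Keys not listed here keep the chart defaults.
head:
  resources:
    limits:
      cpu: "1"
      memory: "4G"
    requests:
      cpu: "1"
      memory: "4G"
worker:
  replicas: 2
  minReplicas: 1
  maxReplicas: 3
EOF

echo "Wrote $VALUES_FILE"
# Apply it to the release from the end-to-end example (run manually):
#   helm upgrade raycluster kuberay/ray-cluster --version 1.1.0 -f "$VALUES_FILE"
# A single key can also be overridden inline:
#   helm install raycluster kuberay/ray-cluster --version 1.1.0 --set worker.replicas=2
```

A values file is usually easier to review and keep in version control than a long chain of `--set` flags, which is why the sketch prefers it for anything beyond a single key.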