Ray Job (alpha)¶
Note: This is the alpha version of RayJob support in KubeRay. RayJob will continue to be improved in future releases.
Prerequisites:
- Ray 1.10 or higher
- KubeRay v0.3.0 or higher (v0.4.0 is recommended)
What is a RayJob?¶
A RayJob manages two things:
- Ray Cluster: Manages resources in a Kubernetes cluster.
- Job: Manages jobs in a Ray Cluster.
What does the RayJob provide?¶
- Kubernetes-native support for Ray clusters and Ray Jobs. You can use a Kubernetes config to define a Ray cluster and job, and use kubectl to create them. The cluster can be deleted automatically once the job is finished.
Make sure your KubeRay operator version is at least v0.3.0. The latest released KubeRay version (v0.4.0) is recommended. For installation instructions, please follow the documentation.
Run an example Job¶
An example config file for deploying a RayJob is included here: ray_v1alpha1_rayjob.yaml
# Create a RayJob.
$ kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml

# List running RayJobs.
$ kubectl get rayjob
NAME            AGE
rayjob-sample   7s

# The RayJob sample also creates a RayCluster.
# The RayCluster creates a few resources, including pods and services. Use the following commands to check them:
$ kubectl get rayclusters
$ kubectl get pod
A RayJob spec supports the following fields:
- entrypoint - The shell command to run for this job.
- jobId - (Optional) Job ID to specify for the job. If not provided, one will be generated.
- metadata - Arbitrary user-provided metadata for the job.
- runtimeEnv - Base64-encoded string of the runtime environment JSON string.
- shutdownAfterJobFinishes - Whether to recycle the cluster after the job finishes.
- ttlSecondsAfterFinished - TTL to clean up the cluster. This only works if shutdownAfterJobFinishes is set.
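Because runtimeEnv expects a base64 string rather than plain JSON, it can help to generate the value with a small script. The following is a minimal sketch, not the project's own tooling: the helper name `encode_runtime_env`, the runtime environment contents, the entrypoint path, and the field values are all illustrative assumptions, and the cluster definition that a complete RayJob manifest also needs is left out.

```python
import base64
import json


def encode_runtime_env(runtime_env: dict) -> str:
    """Serialize a runtime environment dict to JSON, then base64-encode it."""
    raw = json.dumps(runtime_env).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Hypothetical runtime environment; the package and variable are illustrative only.
runtime_env = {
    "pip": ["requests==2.26.0"],
    "env_vars": {"counter_name": "test_counter"},
}

# Partial RayJob spec built from the fields listed above.
# All values are assumptions made for this example.
rayjob_spec = {
    "entrypoint": "python /tmp/code/script.py",
    "jobId": "example-job",
    "metadata": {"owner": "example"},
    "runtimeEnv": encode_runtime_env(runtime_env),
    "shutdownAfterJobFinishes": True,
    "ttlSecondsAfterFinished": 10,
}

print(json.dumps(rayjob_spec, indent=2))
```

Decoding the runtimeEnv value with `base64.b64decode` and `json.loads` should give back the original dictionary, which is a quick way to check a value before putting it into a config file.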
You can use kubectl logs to check the operator logs or the head/worker node logs.
You can also use kubectl describe rayjobs rayjob-sample to check the states and event logs of your RayJob instance:
Status:
  Dashboard URL:          rayjob-sample-raycluster-vnl8w-head-svc.ray-system.svc.cluster.local:8265
  End Time:               2022-07-24T02:04:56Z
  Job Deployment Status:  Complete
  Job Id:                 test-hehe
  Job Status:             SUCCEEDED
  Message:                Job finished successfully.
  Ray Cluster Name:       rayjob-sample-raycluster-vnl8w
  Ray Cluster Status:
    Available Worker Replicas:  1
    Endpoints:
      Client:          32572
      Dashboard:       32276
      Gcs - Server:    30679
    Last Update Time:  2022-07-24T02:04:43Z
    State:             ready
  Start Time:          2022-07-24T02:04:49Z
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Created    90s   rayjob-controller  Created cluster rayjob-sample-raycluster-vnl8w
  Normal  Submitted  82s   rayjob-controller  Submit Job test-hehe
  Normal  Deleted    15s   rayjob-controller  Deleted cluster rayjob-sample-raycluster-vnl8w
If the job doesn't run successfully, the above describe command will provide information about that too:
Status:
  Dashboard URL:          rayjob-sample-raycluster-nrdm8-head-svc.ray-system.svc.cluster.local:8265
  End Time:               2022-07-24T02:01:39Z
  Job Deployment Status:  Complete
  Job Id:                 test-hehe
  Job Status:             FAILED
  Message:                Job failed due to an application error, last available logs:
python: can't open file '/tmp/code/script.ppy': [Errno 2] No such file or directory
  Ray Cluster Name:       rayjob-sample-raycluster-nrdm8
  Ray Cluster Status:
    Available Worker Replicas:  1
    Endpoints:
      Client:          31852
      Dashboard:       32606
      Gcs - Server:    32436
    Last Update Time:  2022-07-24T02:01:30Z
    State:             ready
  Start Time:          2022-07-24T02:01:38Z
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Created    2m9s  rayjob-controller  Created cluster rayjob-sample-raycluster-nrdm8
  Normal  Submitted  2m    rayjob-controller  Submit Job test-hehe
  Normal  Deleted    58s   rayjob-controller  Deleted cluster rayjob-sample-raycluster-nrdm8
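The status shown by the describe command can also be checked from a script. The sketch below is one possible approach, not part of KubeRay: it assumes that `kubectl get rayjob <name> -o json` exposes the "Job Status" and "Message" fields under the keys `status.jobStatus` and `status.message`, and that STOPPED is a terminal status alongside the SUCCEEDED and FAILED values seen above. The helper names `summarize_rayjob` and `fetch_rayjob` are hypothetical.

```python
import json
import subprocess

# Statuses after which the job is not expected to change state again.
# SUCCEEDED and FAILED appear in the output above; STOPPED is an assumption.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def summarize_rayjob(rayjob: dict) -> dict:
    """Extract the job status and message from a RayJob object."""
    status = rayjob.get("status", {})
    job_status = status.get("jobStatus", "")  # assumed key for "Job Status"
    return {
        "job_status": job_status,
        "message": status.get("message", ""),  # assumed key for "Message"
        "finished": job_status in TERMINAL_STATUSES,
        "succeeded": job_status == "SUCCEEDED",
    }


def fetch_rayjob(name: str) -> dict:
    """Read a RayJob with kubectl; requires access to the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "rayjob", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Offline example that mirrors the failed run shown above.
sample = {
    "status": {
        "jobStatus": "FAILED",
        "message": "Job failed due to an application error",
    }
}
print(summarize_rayjob(sample))
```

Against a live cluster, `summarize_rayjob(fetch_rayjob("rayjob-sample"))` would return the same summary for the sample RayJob, provided the assumed key names match the installed KubeRay version.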
Delete the RayJob instance¶
$ kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml