Ray Crash Course - Ray Clusters and the Ray CLI
© 2019-2021, Anyscale. All Rights Reserved
In the previous lessons, we let ray.init() start a mini-cluster on your laptop or connect to the running Ray cluster in the Anyscale hosted platform. This lesson discusses using the Ray CLI command ray to create and manage Ray clusters. We won't cover all the subcommands ray supports. Try ray --help and see the Ray CLI documentation for more details.
Tip: If any of the CLI commands used here print a lot of output, right click on the output and select Enable Scrolling for Outputs.
Notes:
- The Anyscale hosted platform has its own CLI command, anyscale, which integrates the ray CLI and provides other capabilities for managing and running Ray projects and sessions, including automated cluster integration, synchronization of code to your local development environment, etc. Further information on this service will be available soon. Contact us for details.
- Ray can now be used with Docker. You can find the published Docker images here. For more details, see the documentation here and here.
ray --help
The typical help information is available with --help or with no arguments:
!ray --help
Some of these commands are aliases, e.g., down and teardown, get-head-ip and get_head_ip, etc. kill-random-node looks strange, but it is useful for Chaos Engineering purposes.
For more details on a particular command, use ray <command> --help:
!ray start --help
ray --version
Show the version of Ray you are using.
!ray --version
ray stat
If Ray is running on this node, the output can be very long. It shows the status of the nodes, running worker processes and various other Python processes being executed, and Redis processes, which are used as part of the distributed object store for Ray. We discuss these services in greater detail in the Advanced Ray tutorial.
If there are multiple Ray instances running on this node, you'll have to specify the correct address. Run ray stat to see a list of those addresses, then pick the correct one:
ray stat --address IP:PORT
ray stat returns the exit code 0 if Ray is running locally or a nonzero value if it isn't. The following command exploits this feature and starts a head node for Ray:
ray stat > /dev/null 2>&1 || ray start --head
All output of ray stat is sent to /dev/null (which discards it), and if the exit code is nonzero, the command after the ||, ray start --head, is executed.
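Because this idiom depends only on exit codes, you can try it without Ray at all. In the sketch below, false stands in for a failing ray stat and echo stands in for ray start --head (both stand-ins are illustrative only):

```shell
# '||' runs its right-hand command only when the left-hand command
# exits with a nonzero status. Here 'false' stands in for a failing
# 'ray stat', and 'echo' stands in for 'ray start --head'.
false > /dev/null 2>&1 || echo "left side failed: starting head node"

# When the left-hand command succeeds (exit code 0), the right-hand
# command is skipped entirely.
true > /dev/null 2>&1 || echo "this is never printed"
```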
You can also get cluster information inside your application using API methods.
See Inspect the Cluster State for details.
ray start and ray stop
As shown in the previous cell, ray start is used to start the Ray processes on a node. When the --head flag is used, this node becomes the head node that bootstraps the cluster.
When you want to stop Ray running on a particular node, use ray stop.
WARNING: Running ray stop will impact any Ray applications currently running on this node, including all other lesson notebooks currently running Ray. If you intend to stop Ray, first save your work, close those notebooks, and stop their processes using the Running tab on the left of the Jupyter Lab UI. The tab might be labelled with a white square surrounded by a dark circle instead of Running.
We won't actually run ray start or ray stop in what follows, to avoid causing problems for other lessons. We'll just describe what they do and the output they print.
When you run ray start --head, you see output like the following (unless an error occurs):
$ ray start --head
2020-05-23 07:47:47,469 INFO scripts.py:357 -- Using IP address 192.168.1.149 for this node.
2020-05-23 07:47:47,489 INFO resource_spec.py:212 -- Starting Ray with 4.3 GiB memory available for workers and up to 2.17 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-23 07:47:47,865 INFO services.py:1170 -- View the Ray dashboard at localhost:8265
2020-05-23 07:47:47,912 INFO scripts.py:387 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --address='192.168.1.149:10552' --redis-password='5241590000000000'
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(address='auto', redis_password='5241590000000000')
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
(You'll see a different IP address.)
The output includes a line like this:
ray start --address='192.168.1.149:10552' --redis-password='5241590000000000'
This is the ray start command you would use on the other machines where you want to start Ray and have them join the same cluster.
Note also the instructions for code to add to your application.
import ray
ray.init(address='auto', ignore_reinit_errors=True, redis_password='5241590000000000')
The redis_password shown is the default value. We didn't specify this argument when we called ray.init() in other notebooks.
You can actually call ray start --head multiple times on the same node to create separate clusters. This may appear at first to be a bug, but it is actually useful for testing purposes.
The ray stop command usually prints no output. Add the --verbose flag for details.
Warning: ray stop stops all running Ray processes on this node. There is no command-line option to specify which one to stop.
ray memory
A new feature of the Ray CLI is the memory command, which prints a snapshot of the current state of actors and tasks in memory in the cluster. It is useful for debugging issues and understanding how Ray has distributed work around your cluster.
Here is an example captured on a laptop while the first two lessons in this tutorial were evaluating their cells:
$ ray memory
2020-06-26 06:08:55,158 INFO scripts.py:1042 -- Connecting to Ray instance at 192.168.1.149:6379.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 06:08:55.163417 90759 489258432 global_state_accessor.cc:25] Redis server address = 192.168.1.149:6379, is test flag = 0
I0626 06:08:55.164857 90759 489258432 redis_client.cc:141] RedisClient connected.
I0626 06:08:55.167277 90759 489258432 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0626 06:08:55.168231 90759 489258432 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.
-----------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================
; driver pid=89861
ffffffffffffffff6ec7e2960c0000c001000000 LOCAL_REFERENCE ? (actor call) <ipython-input-7-a62036e0309c>:<module>:7
55be66b7df500ad56ec7e2960c0000c003000000 LOCAL_REFERENCE 23 (actor call) <ipython-input-7-a62036e0309c>:<module>:8
55be66b7df500ad56ec7e2960c0000c002000000 LOCAL_REFERENCE 15 (actor call) <ipython-input-7-a62036e0309c>:<module>:8
ffffffffffffffffffffffff0c00008001000000 LOCAL_REFERENCE 27 (put object) <ipython-input-9-57253d54e26a>:<module>:1
0f8aa561996c6719ffffffff0c0000c001000000 LOCAL_REFERENCE 88 (task call) <ipython-input-6-9667649da5b7>:<module>:13
55be66b7df500ad56ec7e2960c0000c001000000 LOCAL_REFERENCE 16 (actor call) <ipython-input-7-a62036e0309c>:<module>:8
; driver pid=90154
aa0e49cf6481351dffffffff100000c001000000 LOCAL_REFERENCE 23 (task call) <ipython-input-17-f5cad4404199>:<module>:1
082755fdfe469abcffffffff100000c001000000 LOCAL_REFERENCE ? (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
57c6dbda70012254ffffffff100000c001000000 LOCAL_REFERENCE ? (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
fab196f393a5de36ffffffff100000c001000000 LOCAL_REFERENCE 88 (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
10473efa8f620095ffffffff100000c001000000 LOCAL_REFERENCE 88 (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
dc7dc79e27e8e5b7ffffffff100000c001000000 LOCAL_REFERENCE 23 (task call) <ipython-input-19-e197d2c09385>:<listcomp>:1
16053fa58b987ab5ffffffff100000c001000000 LOCAL_REFERENCE ? (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
852d61559823797effffffff100000c001000000 LOCAL_REFERENCE 23 (task call) <ipython-input-19-e197d2c09385>:<listcomp>:1
2e1f2a844f6b2fd4ffffffff100000c001000000 LOCAL_REFERENCE 23 (task call) <ipython-input-19-e197d2c09385>:<listcomp>:1
a52080f6c7937c01ffffffff100000c001000000 LOCAL_REFERENCE ? (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
a1e6529f26e2773cffffffff100000c001000000 LOCAL_REFERENCE ? (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
9991ac8b6172b3f2ffffffff100000c001000000 LOCAL_REFERENCE 23 (task call) <ipython-input-18-a0b7fb747444>:<module>:1
3cdffb6f345ef8f3ffffffff100000c001000000 LOCAL_REFERENCE 88 (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
0a51ce9438517c13ffffffff100000c001000000 LOCAL_REFERENCE ? (task call) <ipython-input-31-dd50cc550d0b>:<listcomp>:3
-----------------------------------------------------------------------------------------------------
All references are local because this is the output for a single machine. There are tasks and actors running in the workers, all of which are associated with driver processes that originate with the ipython processes used by the notebooks.
ray status
A new feature of the Ray CLI is the status command for printing various status information about the cluster.