Ray Crash Course - Why Ray?¶
© 2019-2021, Anyscale. All Rights Reserved
The first two lessons explored using Ray for task and actor concurrency. This lesson takes a step back and explains the challenges that led to the creation of Ray and the Ray ecosystem. The end of this lesson also has links for more information on Ray and Anyscale.
Ray is a system for scaling Python applications from your laptop to a cluster with relative ease. It emerged from the RISELab at Berkeley in response to the problems researchers faced writing advanced ML applications and libraries that could easily scale to a cluster. These researchers found that none of the existing solutions were flexible enough and easy enough to use for their needs. Hence, Ray was born.
Tip: For more about Ray, see ray.io or the Ray documentation.
Just Six API Methods¶
Almost everything you do with Ray is done with just six API methods:
ray.init()
¶
Description: Initialize Ray in your application.
Example:
ray.init() # Many optional arguments discussed in lesson 06.
@ray.remote
¶
Description: Decorate a function to make it a remote task. Decorate a class to make it a remote actor.
Example:
@ray.remote  # Define a task
def train_model(source):
    ...

@ray.remote  # Define an actor
class ActivityTracker:
    def record(self, event):
        ...
        return count
x.remote()
¶
Description: Construct an actor instance or asynchronously run a task or an actor method.
Example:
m_id = train_model.remote(...) # Invoke a task
tracker = ActivityTracker.remote() # Construct an actor instance
tr_id = tracker.record.remote(...) # Invoke an actor method
ray.put()
¶
Description: Put a value in the distributed object store.
Example:
put_id = ray.put(my_object)
ray.get()
¶
Description: Get an object from the distributed object store, either placed there explicitly by ray.put() or by a task or actor method, blocking until the object is available.
Example:
model = ray.get(m_id) # Retrieve result of train_model task invocation
count = ray.get(tr_id) # Retrieve result of tracker.record method call
thing = ray.get(put_id) # Retrieve "my_object"
ray.wait()
¶
Description: Wait on a list of ids until one of the corresponding objects is available (e.g., the task completes). Return two lists, one with ids for the available objects and the other with ids for the still-running tasks or method calls.
Example:
finished, running = ray.wait([m_id, tr_id])
These six API methods are the essence of Ray. They provide Ray's concision, flexibility, and power.
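If you have used Python's standard-library concurrent.futures, the pattern behind these six calls will feel familiar: submit work, receive a handle immediately, and block on the handle only when you need the result. Here is a minimal sketch of that pattern using only the standard library (the train_model function is an illustrative stand-in, not part of Ray):

```python
import concurrent.futures as cf

def train_model(source):
    # Stand-in for an expensive task; returns a "model".
    return f"model trained on {source}"

with cf.ThreadPoolExecutor() as pool:
    # Like train_model.remote(...): returns a future (handle) immediately.
    future = pool.submit(train_model, "data.csv")

    # Like ray.wait([...]): block until at least one handle is ready.
    done, not_done = cf.wait([future], return_when=cf.FIRST_COMPLETED)

    # Like ray.get(handle): block for, and retrieve, the actual value.
    model = future.result()

print(model)  # model trained on data.csv
```

Ray generalizes this handle-based pattern beyond a single process: handles can refer to work running anywhere in a cluster, actors add distributed state, and ray.put()/ray.get() add a shared object store.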
There are other API methods for various administrative and informational purposes. See 06 Exploring Ray API Calls.
Why Do We Need Ray?¶
Consider the following charts:
ML/AI model sizes have grown enormously in recent years, roughly 35x every 18 months, which is considerably faster than Moore's Law! Hence, this growth is far outstripping the growth of hardware capabilities. The only way to meet the need for sufficient compute power is to go distributed, as Ion Stoica recently wrote.
At the same time, the use of Python is growing rapidly, because it is a very popular language for data science. Many of the ML/AI toolkits are Python-based. Hence, there is a pressing need for powerful, yet easy-to-use tools for scaling Python applications horizontally. This is the motivation for Ray. You saw Ray in action in lessons 01 and 02.
Why are tools needed? First, the Python interpreter itself is not designed for massive scalability and high performance. Many Python libraries with these requirements use C/C++ backends to work around Python's limitations, such as the global interpreter lock (GIL), which effectively makes Python scripts single-threaded for CPU-bound work.
Some of the most popular, general-purpose tools for this purpose include the following:
- asyncio for async/await-style (coroutine) concurrency.
- multiprocessing.Pool for creating a pool of asynchronous workers.
- joblib for creating lightweight pipelines.
However, while all of them make it easier to exploit all the CPU cores on your machine, they don't provide distributed computing beyond the boundaries of your machine. In fact, Ray also provides implementations of these APIs, so you are no longer limited to the boundaries of a single machine, as we'll see in the next lesson, 04 Ray Multiprocessing.
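To make the drop-in idea concrete, here is the standard Pool API at work (sketched with the thread-backed multiprocessing.pool.ThreadPool so it runs anywhere; the process-backed multiprocessing.Pool has the same interface). Ray's replacement, ray.util.multiprocessing.Pool, exposes the same map interface but schedules the workers across a Ray cluster, so in principle only the import line changes:

```python
from multiprocessing.pool import ThreadPool  # same Pool API, thread-backed

def square(x):
    return x * x

# With Ray, this would become: from ray.util.multiprocessing import Pool
with ThreadPool(processes=4) as pool:
    results = pool.map(square, range(8))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```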
Consider this image:
It shows the major tasks required in ML-based application development and deployment, all of which typically require distributed implementations to handle the compute and data load in a timely manner:
- Featurization: Features are the data "attributes" that appear to be most useful for modeling the domain.
- Streaming: New data often arrives in realtime and may be processed in realtime, too.
- Hyperparameter Tuning: What are the best kinds of models for this domain? When using neural networks, what is the ideal architecture for the network? This model "metadata" is also called the hyperparameters. Since discovering the hyperparameters can be an expensive process of training lots of candidates, specialized techniques in their own right have emerged for this purpose, as we'll learn in the Ray Tune module.
- Training: Once the best (or at least good enough) hyperparameters are chosen, the model has to be trained on real data and sometimes retrained periodically as new data arrives.
- Simulation: An important part of many reinforcement learning applications is running a simulator, such as a game engine or robot simulation, against which the RL system is trained to maximize the "reward" when operating in that environment or the real analog. The simulator is one example of a compute pattern that is quite a bit different from the normal "dataflow" or query-like patterns that many big data tools support well. Also, this simulator may be run many, many times as part of the hyperparameter tuning or training process, requiring efficient, cluster-wide execution.
- Model Serving: Finally, when the model is trained, it needs to be served, so that it can be applied to new data, sometimes with low latency requirements.
Here is the Ray vision:
The core Ray system, which we'll explore in this module, provides the cluster-wide scheduling of work (which we'll call tasks) and management of distributed state, another important requirement in real-world distributed systems.
On top of Ray, a growing family of domain-specific libraries support many of the functions we've discussed, like the ones shown in the image. Other tutorial modules in this repo explore those libraries.
- Ray Tune: For hyperparameter tuning. Tune integrates several optimization algorithms and integrates with many ML frameworks.
- Ray SGD: For stochastic gradient descent (SGD). This is a relatively new library, currently supporting PyTorch, with support for other systems forthcoming.
- Ray RLlib: For reinforcement learning. Many of the widely-used and recent algorithms are implemented. RL often involves running and interoperating with a simulator for the environment (e.g., an actual game engine).
- Ray Serve: Primarily targeted at model serving, but flexible enough for many scalable web service scenarios.
All leverage Ray for cluster-wide scalability. All will be covered in depth in forthcoming tutorial modules. Many Ray users will never actually use the core Ray API, but instead use one or more of these domain-specific APIs. You might be one of those people ;) If you need to implement distributed applications, the current Ray Core tutorial module will help you understand Ray, how it gives you the tools you need for most requirements, and how it works.
Even if you never need to write code with the Ray API directly, this module will help you appreciate how Ray makes the Ray-based libraries you use work, and how to understand and fix performance issues when they arise.
For More Information on Ray and Anyscale¶
- ray.io: The Ray website. In particular:
- Documentation: The full Ray documentation
- Blog: The Ray blog
- GitHub: The source code for Ray
- anyscale.com: The company developing Ray and these tutorials. In particular:
- Blog: The Anyscale blog
- Events: Online events, Ray Summit, and meetups
- Academy: Training for Ray and Anyscale products - What you're looking at!
- Jobs: Yes, we're hiring!
- Community:
- Ray Slack (Click here to join): The best forum for help on Ray. Use the #tutorials channel to ask for help on these tutorials!
- ray-dev mailing list
- @raydistributed
- @anyscalecompute
The next lesson, Ray Multiprocessing, discusses Ray's drop-in replacements for common parallelism APIs, multiprocessing.Pool and joblib, and Ray's integration with asyncio.