This directory contains examples, guides, and best practices for running Ray on Google Kubernetes Engine (GKE). Most examples use the applications/ray Terraform module to install KubeRay and deploy RayCluster resources.
We highly recommend using the infrastructure Terraform module to create your GKE cluster.
Edit templates/workloads.tfvars with your environment-specific variables and configuration.
The following variables require configuration:
- project_id
- cluster_name
- cluster_location
If you need a new cluster, you can set create_cluster = true in your tfvars file.
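For reference, a minimal workloads.tfvars could look like the sketch below. The values are placeholders for your environment, and any other variables depend on the version of the templates you are using:

# templates/workloads.tfvars -- illustrative values only
project_id       = "my-gcp-project"   # your GCP project ID
cluster_name     = "my-gke-cluster"   # target GKE cluster name
cluster_location = "us-central1"      # region or zone of the cluster

# Uncomment to have the template create the GKE cluster for you:
# create_cluster = true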
Run the following commands to install KubeRay and deploy a Ray cluster in your existing GKE cluster.
cd templates/
terraform init
terraform apply --var-file=workloads.tfvars
Validate that the RayCluster is ready:
$ kubectl get raycluster
NAME                  DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
ray-cluster-kuberay   1                 1                   ready    3m41s
See the tfvars examples to explore different configuration options for the Ray cluster using the Terraform templates.
Ensure Ray is installed in your environment. See Installing Ray for more details.
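If Ray is not installed yet, it typically comes from pip; the default extras below include the dashboard and the ray job CLI used in this example (adjust to your needs):

$ pip install -U "ray[default]"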
To submit a Ray job, first establish a connection to the Ray head. For this example we'll use kubectl port-forward
to connect to the Ray head via localhost.
$ kubectl -n ai-on-gke port-forward service/ray-cluster-kuberay-head-svc 8265 &
Submit a Ray job that prints resources available in your Ray cluster:
$ ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
Job submission server address: http://localhost:8265

-------------------------------------------------------
Job 'raysubmit_4JBD9mLhh9sjqm8g' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_4JBD9mLhh9sjqm8g
  Query the status of the job:
    ray job status raysubmit_4JBD9mLhh9sjqm8g
  Request the job to be stopped:
    ray job stop raysubmit_4JBD9mLhh9sjqm8g

Tailing logs until the job exits (disable with --no-wait):
2024-03-19 20:46:28,668 INFO worker.py:1405 -- Using address 10.80.0.19:6379 set in the environment variable RAY_ADDRESS
2024-03-19 20:46:28,668 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.80.0.19:6379...
2024-03-19 20:46:28,677 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 10.80.0.19:8265
{'node:__internal_head__': 1.0, 'object_store_memory': 2295206707.0, 'memory': 8000000000.0, 'CPU': 4.0, 'node:10.80.0.19': 1.0}
Handling connection for 8265

------------------------------------------
Job 'raysubmit_4JBD9mLhh9sjqm8g' succeeded
------------------------------------------
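Jobs can also be submitted programmatically. The sketch below uses Ray's job submission SDK over the same port-forward; the file name is hypothetical:

# submit_job.py -- a minimal sketch; assumes the port-forward to localhost:8265 is still running
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")

# Submit the same entrypoint as the CLI example above.
job_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"'
)

# Poll the job's status by its submission ID.
print(client.get_job_status(job_id))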
The Ray Client API enables Python scripts to interactively connect to remote Ray clusters. See Ray Client for more details.
To use the client, first establish a connection to the Ray head. For this example we'll use kubectl port-forward to connect to the Ray head Service via localhost.
$ kubectl -n ai-on-gke port-forward service/ray-cluster-kuberay-head-svc 10001 &
Next, define a Python script containing remote code you want to run on your Ray cluster. Similar to the previous example, this remote function will print the resources available in the cluster:
# cluster_resources.py
import ray

ray.init("ray://localhost:10001")

@ray.remote
def cluster_resources():
    return ray.cluster_resources()

print(ray.get(cluster_resources.remote()))
Run the Python script:
$ python cluster_resources.py
{'CPU': 4.0, 'node:__internal_head__': 1.0, 'object_store_memory': 2280821145.0, 'node:10.80.0.22': 1.0, 'memory': 8000000000.0}
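Remote functions can also declare the resources they need, which Ray uses when scheduling tasks. A small sketch in the same style as the script above (the file and function names are illustrative):

# heavy_task.py -- illustrative sketch; reuses the Ray Client connection from above
import ray

ray.init("ray://localhost:10001")

# Request 2 CPUs per task invocation.
@ray.remote(num_cpus=2)
def heavy_task(x):
    return x * x

# With 4 CPUs in the cluster, at most two of these tasks run concurrently.
print(ray.get([heavy_task.remote(i) for i in range(4)]))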
See the following guides and tutorials for running Ray applications on GKE:
- Getting Started with KubeRay
- Serve an LLM on L4 GPUs with Ray
- Logging & Monitoring for Ray clusters
- TPU Guide
- Priority Scheduling with RayJob and Kueue
- Gang Scheduling with RayJob and Kueue
- RayTrain with GCSFuse CSI driver
- Configuring KubeRay to use Google Cloud Storage Buckets in GKE
- Example Notebooks with Ray
- Example templates for Ray clusters