Ray on Golem troubleshooting
Ray on Golem's log files
Problem
When you encounter issues while running one of the various Ray commands that utilize Ray on Golem, it may not always be immediately obvious what the problem is from the output of the command itself.
Where possible, we try to display a meaningful error message - e.g. when ray up fails to start the Ray on Golem webserver, we display the last few lines of the logs.
If that's not enough to suggest a way to fix the issue, an investigation of the logs may prove more useful.
Solution
Ray on Golem writes its logs to the following files:
- /tmp/ray_on_golem/webserver.log - cluster manager log - basic Ray on Golem logs.
- /tmp/ray_on_golem/webserver_debug.log - cluster manager debug log - more detailed Ray on Golem logs.
- /tmp/ray_on_golem/yagna.log - Golem node (yagna) logs.
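To take a quick look at the most recent entries in any of these, a plain tail is enough (a generic shell example, nothing Ray on Golem-specific):
tail -n 50 /tmp/ray_on_golem/webserver.log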
Given these, you can either:
- Look at the webserver.log yourself - our aim is for you to be able to diagnose and resolve the most straightforward issues on your own.
- Share the webserver_debug.log and yagna.log with us on the #Ray on Golem Discord channel - we will be more than happy to assist.
Lack of a complete cleanup on shutdown
Problem
It may happen that some of Ray on Golem's components are still up after the successful completion of ray down. While it's usually not a problem in itself, you might wish to start with a clean slate on consecutive ray up runs.
Solution
To perform a cleanup, first check whether there are indeed any orphaned components still running:
ps axc | grep -v grep | grep -E 'yagna|ray-on-golem'
It produces an output like this:
71257 ? Ssl 0:02 ray-on-golem
71258 ? Sl 2:35 yagna
The above shows that the ray-on-golem webserver and the yagna daemon are running.
The surest way to stop them is to kill them (using the PID numbers as shown in the first column):
kill -9 71257 71258
Once that's done, rerunning the above command should show no more hanging processes:
ps axc | grep -v grep | grep -E 'yagna|ray-on-golem'
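If you prefer not to copy PIDs by hand, killing the processes by name also works (a convenience alternative to the kill command above - make sure the pattern doesn't match anything else you care about):
pkill -9 'ray-on-golem|yagna'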
It might also be a good idea to clean up Ray's configuration cache files:
rm /tmp/ray-config-*
How can I resolve the node not found error?
Problem
Sometimes, ray up might time out with a note saying that there is no node available.
It might look like this:
ray_on_golem.client.exceptions.RayOnGolemClientError: Couldn't create node: {"error": "Creating activities timeout reached"}
Solution
This means that there are not enough providers available on the network.
If you are running Ray on Golem on the testnet (the network: "goerli" property in the cluster yaml), most likely all the nodes are busy with other users' requests.
We are preparing a tool to check providers' availability. Another solution would be to move to mainnet - we are also working on enabling this option.
For now, the best course of action is to report the situation on the #Ray on Golem Discord channel.
Usually, the testnet isn't busy for long - it might be enough to wait a couple of minutes.
Additional packages required on the cluster
Problem
Your app might need non-standard packages to run on a Ray on Golem cluster. The default VM image includes nothing besides the bare minimum.
In such a case, the output of ray submit may look like this:
Traceback (most recent call last):
  File "/root/dds-with-ray.py", line 49, in <module>
    results = get_lots_of_deals()
  File "/root/dds-with-ray.py", line 42, in get_lots_of_deals
    results = ray.get(result_ids)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::get_deal() (pid=871, ip=192.168.0.3)
  File "/root/dds-with-ray.py", line 30, in get_deal
    from endplay.dds import calc_dd_table
ModuleNotFoundError: No module named 'endplay'
Shared connection to 192.168.0.3 closed.
Error: Command failed:
Solution
Note that even if you have the needed dependencies installed and your app runs in your local environment, you still need to tell the Ray on Golem cluster that the packages are needed.
The best way to do it is to add an appropriate pip install command to the setup_commands section of the cluster yaml file. Check out the cluster yaml reference for more information.
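For the ModuleNotFoundError above, the relevant fragment of the cluster yaml could look roughly like this (a sketch only - endplay is just the package from the example traceback, and your yaml will contain many other keys besides this one):
setup_commands:
  # executed on each cluster node during setup; add one entry per missing package
  - pip install endplay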
Passing arguments to your Ray script fails
Problem
Often, you need to pass arguments to the Ray script you are submitting:
ray submit golem-cluster.yaml examples/simple-task.py -n 20
While it looks correct, in this particular case it won't have the intended effect of passing the parameter to the simple-task.py script. Instead, it will cause Ray to seemingly require a password for the node, i.e.:
Checking External environment settings
Ray On Golem webserver
Not starting webserver, as it's already running
Fetched IP: 192.168.0.3
root@192.168.0.3's password:
Solution
ray submit and other ray commands have their own arguments. If your script's arguments are named the same, Ray will get confused.
In this example, -n stands for the cluster name, which we don't support yet - hence the unexpected request for the SSH password.
The solution is to precede your arguments with the double-dash symbol (--):
ray submit golem-cluster.yaml examples/simple-task.py -- -n 20
This informs Ray that everything after the double dash is not to be interpreted, but instead passed as-is to the executed script.
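On the script side nothing special is needed - everything after -- arrives as ordinary command-line arguments. As a sketch only (a hypothetical stand-in, not the actual contents of examples/simple-task.py), the script could read the -n value like this:
import argparse

import ray

# Hypothetical example - parse the arguments passed after `--` in `ray submit ... -- -n 20`
parser = argparse.ArgumentParser()
parser.add_argument("-n", "--num-tasks", type=int, default=10)
args = parser.parse_args()

ray.init()

@ray.remote
def square(x):
    return x * x

# run `num-tasks` remote tasks and print their results
print(ray.get([square.remote(i) for i in range(args.num_tasks)]))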