
Ray on Golem troubleshooting

Ray on Golem's log files

Problem

When you encounter issues while running one of the various Ray commands that utilize Ray on Golem, it may not always be immediately obvious what the problem is from the output of the command itself.

Where possible, we try to display a meaningful error message - e.g. when ray up fails to start the Ray on Golem webserver, we display the last few lines of its logs.

If that's not enough to point to a fix, inspecting the log files themselves may prove more useful.

Solution

Ray on Golem writes its logs to the following files:

  • /tmp/ray_on_golem/webserver.log - cluster manager log - basic Ray on Golem logs.
  • /tmp/ray_on_golem/webserver_debug.log - cluster manager debug log - more detailed Ray on Golem logs.
  • /tmp/ray_on_golem/yagna.log - Golem node (yagna) logs.

Given these, you can either:

  • Look at the webserver.log yourself - our aim is for you to be able to diagnose and resolve the most straightforward issues on your own (see the example below).
  • Share the webserver_debug.log and yagna.log with us on the #Ray on Golem Discord channel - we will be more than happy to assist.
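
For example, to quickly inspect the most recent entries of the basic log (the line count here is arbitrary):

tail -n 50 /tmp/ray_on_golem/webserver.log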

Lack of a complete cleanup on shutdown

Problem

It may happen that some of Ray on Golem's components are still running after ray down completes successfully. While that's usually not a problem in itself, you might wish to start with a clean slate before subsequent ray up runs.

Solution

To perform a cleanup, let's first check whether any orphaned components are indeed still running:

ps axc | grep -v grep | grep -E 'yagna|ray-on-golem'

It produces output like this:

  71257 ?        Ssl    0:02 ray-on-golem
  71258 ?        Sl     2:35 yagna

The above shows that the ray-on-golem webserver and the yagna daemon are running.

The surest way to stop them is to kill them (using the PID numbers as shown in the first column):

kill -9 71257 71258
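
Alternatively, if you prefer not to copy the PIDs by hand, a pkill one-liner matching the full command line achieves the same - note that it will also terminate any other processes on your machine whose command line happens to match these names:

pkill -9 -f 'yagna|ray-on-golem'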

Once that is done, the above command should show no remaining processes:

ps axc | grep -v grep | grep -E 'yagna|ray-on-golem'

It might also be a good idea to clean up Ray's configuration cache files:

rm /tmp/ray-config-*

How can I resolve the node not found error?

Problem

Sometimes, ray up might time out with a note saying that there is no node available.

It might look like this:

ray_on_golem.client.exceptions.RayOnGolemClientError: Couldn't create node: {"error": "Creating activities timeout reached"}

Solution

This means that there are not enough providers available on the network.

If you are running Ray on Golem on the testnet (property network: "goerli" in the cluster yaml), most likely all the nodes are busy with requests from other users.
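
For reference, that property usually lives in the provider section of the cluster yaml; the sketch below is only an illustration - the exact nesting of the surrounding keys follows the cluster yaml reference and may differ in your version:

provider:
  parameters:
    network: "goerli"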

We are preparing a tool to check providers' availability. Another solution would be to move to mainnet - we are also working on enabling this option.

For now, the best course of action would be to report the situation on the #Ray on Golem Discord channel.

Usually, the testnet doesn't stay busy for long - waiting a couple of minutes might be enough.

Additional packages required on the cluster

Problem

Your app might need non-standard packages to run on a Ray on Golem cluster. The default VM image includes nothing besides the bare minimum.

In such a case, the output of ray submit may look like this:

Traceback (most recent call last):
  File "/root/dds-with-ray.py", line 49, in <module>
    results = get_lots_of_deals()
  File "/root/dds-with-ray.py", line 42, in get_lots_of_deals
    results = ray.get(result_ids)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::get_deal() (pid=871, ip=192.168.0.3)
  File "/root/dds-with-ray.py", line 30, in get_deal
    from endplay.dds import calc_dd_table
ModuleNotFoundError: No module named 'endplay'
Shared connection to 192.168.0.3 closed.
Error: Command failed:

Solution

Note that even if you have the needed dependencies installed, and your app runs in your local environment, you still need to tell the Ray on Golem cluster which packages it needs.

The best way to do that is to add an appropriate pip install command to setup_commands in the cluster yaml file, as shown below. Check out the cluster yaml reference for more information.
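
For instance, to make the endplay module from the traceback above available on every node, a setup_commands entry along these lines could be added (the exact packages depend on your app):

setup_commands:
  - pip install endplay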

Passing arguments to your Ray script fails

Problem

Often, you need to pass arguments to the Ray script you are submitting:

ray submit golem-cluster.yaml examples/simple-task.py -n 20

While it looks correct, in this particular case it won't have the intended effect of passing the parameter to the simple-task.py script. Instead, it will cause Ray to seemingly require a password for the node:

Checking External environment settings
Ray On Golem webserver
  Not starting webserver, as it's already running
Fetched IP: 192.168.0.3
root@192.168.0.3's password:

Solution

ray submit and other ray commands have their own arguments. If your script's arguments have the same names, Ray will interpret them as its own.

In this example, -n stands for the cluster name, which we don't support yet - hence the unexpected request for the SSH password.

The solution is to precede your arguments with the double-dash symbol (--):

ray submit golem-cluster.yaml examples/simple-task.py -- -n 20

This informs Ray that everything after the double dash should not be interpreted as its own options, but passed as-is to the executed script.
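
On the receiving end, your script parses those arguments itself. Below is a minimal sketch of how a script like simple-task.py might read an -n flag with argparse - the flag name and its meaning here are purely illustrative, not the actual contents of the example script:

import argparse

import ray

# Parse the script's own arguments - everything after "--" on the
# ray submit command line ends up in this script's sys.argv.
parser = argparse.ArgumentParser()
parser.add_argument("-n", "--num-tasks", type=int, default=10,
                    help="number of tasks to run (illustrative)")
args = parser.parse_args()

ray.init()

@ray.remote
def square(x):
    return x * x

# Run the requested number of tasks on the cluster and collect the results.
results = ray.get([square.remote(i) for i in range(args.num_tasks)])
print(results)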