Description
What happened + What you expected to happen
1. The bug
When writing a Ray dataset to a BigQuery table using the dataset's "write_bigquery()" method, my job fails with an exception from the GCP API (google.api_core.exceptions.TooManyRequests) indicating that I reached a GCP quota on the number of update operations for the table. It seems that no retries are attempted before the exception is raised.
2. Expected behavior
I would expect this kind of quota-related exception to be retried.
The trace:
RayTaskError(TooManyRequests): ray::Write() (pid=354836, ip=172.19.29.23)
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/home/project/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
yield from self._block_fn(input, ctx)
File "/home/project/lib/python3.10/site-packages/ray/data/_internal/planner/plan_write_op.py", line 26, in fn
write_result = datasink_or_legacy_datasource.write(blocks, ctx)
File "/home/project/lib/python3.10/site-packages/ray/data/_internal/datasource/bigquery_datasink.py", line 125, in write
ray.get(
ray.exceptions.RayTaskError(TooManyRequests): ray::_write_single_block() (pid=314835, ip=172.19.29.23)
File "/home/project/lib/python3.10/site-packages/ray/data/_internal/datasource/bigquery_datasink.py", line 96, in _write_single_block
logger.info(job.result())
File "/home/project/lib/python3.10/site-packages/google/cloud/bigquery/job/base.py", line 1003, in result
return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
File "/home/project/lib/python3.10/site-packages/google/api_core/future/polling.py", line 261, in result
raise self._exception
google.api_core.exceptions.TooManyRequests: 429 Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas; reason: rateLimitExceeded, location: table.write, message: Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
3. Useful information
This behavior appeared after I updated the version of google-cloud-bigquery.
I believe it comes from this google-cloud-bigquery PR, released in version 3.26.0.
From what I understand, when the user hits a rateLimitExceeded condition, Google now raises a google.api_core.exceptions.TooManyRequests (HTTP 429) instead of a google.api_core.exceptions.Forbidden (HTTP 403).
The issue is that Ray retries the writing of single blocks only for Forbidden exceptions (and TooManyRequests is not a subclass of Forbidden, see the quick check after the snippet). In Ray's bigquery_datasink.py:
except exceptions.Forbidden as e:
    retry_cnt += 1
    if retry_cnt > self.max_retry_cnt:
        break
    logger.info(
        "A block write encountered a rate limit exceeded error"
        + f" {retry_cnt} time(s). Sleeping to try again."
    )
    logging.debug(e)
    time.sleep(RATE_LIMIT_EXCEEDED_SLEEP_TIME)
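A quick way to confirm that the current handler cannot catch the new exception (only google.api_core needed, no Ray; ClientError is the common parent of both classes):

from google.api_core import exceptions

# 429 and 403 are sibling ClientError subclasses, so the existing
# "except exceptions.Forbidden" clause never sees TooManyRequests.
assert issubclass(exceptions.Forbidden, exceptions.ClientError)
assert issubclass(exceptions.TooManyRequests, exceptions.ClientError)
assert not issubclass(exceptions.TooManyRequests, exceptions.Forbidden)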
A fix could be:
except (exceptions.Forbidden, exceptions.TooManyRequests) as e:
Versions / Dependencies
python=3.10.13
ray=2.34.0
google-cloud-bigquery=3.25.0
os: Debian GNU/Linux 12 (bookworm)
Reproduction script
import pandas as pd
import random
import string

import ray

PROJECT_ID = "project_id"
DATASET = "dataset_name.table_name"

def random_string(length=5):
    return ''.join(random.choices(string.ascii_letters, k=length))

data = {
    "col1": [random_string() for _ in range(100_000_000)],
    "col2": [random_string() for _ in range(100_000_000)],
    "col3": [random_string() for _ in range(100_000_000)],
}
df = pd.DataFrame(data)

ray.init()
ray_dataset = ray.data.from_pandas(df)
ray_dataset.write_bigquery(
    project_id=PROJECT_ID,
    dataset=DATASET,
    overwrite_table=True
)
Issue Severity
Medium: It is a significant difficulty but I can work around it.
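As a possible interim workaround until a datasink-level fix lands, the whole write can be retried from the caller side. This is only a sketch reusing the names from the reproduction script; it assumes the 429 reaches the driver wrapped in a RayTaskError that also subclasses TooManyRequests (as the "RayTaskError(TooManyRequests)" in the traceback suggests) and it re-runs the entire write on each attempt:

import time

import ray
from google.api_core import exceptions

MAX_ATTEMPTS = 5       # hypothetical values, tune to the table's quota
BACKOFF_SECONDS = 30

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        ray_dataset.write_bigquery(
            project_id=PROJECT_ID,
            dataset=DATASET,
            overwrite_table=True
        )
        break
    except ray.exceptions.RayTaskError as e:
        # Assumption: the rate-limit error surfaces as a RayTaskError that is
        # also an instance of TooManyRequests; anything else is re-raised.
        if not isinstance(e, exceptions.TooManyRequests):
            raise
        if attempt == MAX_ATTEMPTS:
            raise
        time.sleep(BACKOFF_SECONDS * attempt)  # crude linear backoff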