8000 Icu autoloading on overloads by carlopi · Pull Request #13716 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Icu autoloading on overloads #13716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

carlopi
Copy link
Contributor
@carlopi carlopi commented Sep 3, 2024

Current autoloading logic works like:

  1. do a lookup for a function name
  2. if empty result set, lookup in EXTENSION_FUNCTIONS (or EXTENSION_COPY_FUNCTIONS, or EXTENSION_TYPES, or EXTENSION_COLLATIONS, depending on type)
  3. do another lookup

This fails for functions where overloads are provided in extensions, since the first step would succeed, leading to no autoloading.

The problem is inherently not super nice, since either we blindly perform autoloading (and might do so N times, paying potentially lag), or when we will eventually realize we might need overloads we will miss context (and we might also not be able to decide on the overload in the same way before and after load).

Only proper solution is performing Autoloading in the same place it's done now also for some overloads.

Solving ICU autoloading problems

The most visible source of problems here, and the one meant to be solved by this PR, are ICU overloads.

This is both since ICU is one of the few autoloadable extensions to provide a lots of autoloads, and since ICU is statically linked-in by default in most platforms, but NOT shipped by default in DuckDB-Wasm.

Current solution to support same syntax locally and on the Web is, in most cases, just force a load of ICU on startup. While this works, it's unnecessary slow AND partially defeats the purpose of autoloading.

Solution

A list of functions in the ICU extensions that provide overloads to already existing functions has been compiled by a logic like:

CREATE TABLE OLD as FROM SELECT count(*), function_name FROM duckdb_functions() GROUP BY function_name;
INSTALL icu;
LOAD icu;
CREATE TABLE NEW as FROM SELECT count(*), function_name FROM duckdb_functions() GROUP BY function_name;
FROM OLD,NEW WHERE OLD.function_name = NEW.function_name AND #1 != #3;

Then when such a function is found, TryAutoLoading is performed on ICU.
Given no information is given back to the user on failures, this is only attempted once per instance. This is to mitigate performing one INSTALL request for every query with a timezone (say on a commit with no ICU available remotely).

On failure to find an overloads, the error message will show that autoloading has been attempted (at some previous point), but given context on the actual error has been lost, a generic error message inviting to explicitly INSTALL and LOAD will be shown.

Demo

Current behaviour

D select now() at time zone 'PST';
Binder Error: No function matches the given name and argument types 'timezone(STRING_LITERAL, TIMESTAMP WITH TIME ZONE)'. You might need to add explicit type casts.
	Candidate functions:
	timezone(DATE) -> BIGINT
	timezone(TIMESTAMP) -> BIGINT
	timezone(INTERVAL) -> BIGINT
	timezone(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE

LINE 1: select now() at time zone 'PST';
                     ^

Current when autoloading works:

D select now() at time zone 'PST';
┌─────────────────────────────┐
│ main.timezone('PST', now()) │
│          timestamp          │
├─────────────────────────────┤
│ 2024-09-03 05:21:24.425     │
└─────────────────────────────┘

Current when autoloading fails:

D select now() at time zone 'PST';
Binder Error: No function matches the given name and argument types 'timezone(STRING_LITERAL, TIMESTAMP WITH TIME ZONE)'. You might need to add explicit type casts.
	Candidate functions:
	timezone(DATE) -> BIGINT
	timezone(TIMESTAMP) -> BIGINT
	timezone(INTERVAL) -> BIGINT
	timezone(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE

Implicit installing and loading of 'icu' failed.
Explcitly `INSTALL icu; LOAD icu;` might provide relevant overloads.

LINE 1: select now() at time zone 'PST';
                     ^

Limitiations

The main limitation is that this currently works only for ICU. Generalizing this is viable, requires both compiling list of overloads (ideally to be done dynamically) and to provide a more general mechanism to register extension for which TryAutoLoading has been tried in ClientContext (removing tried_autoloading_icu and using a map from autoloadable extensions names)
The same mechanism can also be used to improve other cases where TryAutoLoading is used, so that repeated failures are avoided. For example, I think the spatial+parquet interaction might incur in 1 request per parqu 8000 et file if the requests are failing. This has not been raised yet, but eventually might need fixing. Similar cases might exist elsewhere, but haven't looked too hard.

How to move forward

Goal of this PR is offer a way forward for duckdb-wasm, so the target on ICU. This also means has basically no effect on distributed binaries that links in ICU, given in that case ICU will already be available in any case.

I would like to use this in duckdb-wasm, one way would be to merge this in duckdb, unsure whether main or feature.
I can also work with this merged into feature, and applying this as a patch on top of duckdb v1.1.0.

Once this logic will be moved to duckdb-wasm (either as part of duckdb or as a patch), this solves duckdb/duckdb-wasm#1770 and various similar issues raised over time.

@carlopi carlopi requested a review from samansmink September 3, 2024 12:33
@carlopi carlopi added the Draft label Sep 3, 2024
@carlopi
Copy link
8000 Contributor Author
carlopi commented Sep 3, 2024

Noticed that in our tests we might be relying on the fact that this should NOT work, like:

LOCAL_EXTENSION_REPO='build/release/repository' ./build/release/test/unittest test/sql/function/timetz/test_date_part.test

there is a select date_part('timezone', '10:00:00'::TIMETZ); that has 0 as expected result, and work only since it's tested only in limited cases.

This also means that it would require some more care / probably can't be merged in main while changing test results.

Results by duckdb available binaries:

  • DuckDB with ICU statically loaded
    duckdb -c "select date_part('timezone', '10:00:00'::TIMETZ);" -> 7200

  • DuckDB builds with no ICU loaded
    duckdb -c "select date_part('timezone', '10:00:00'::TIMETZ);" -> 0

  • DuckDB with ICU available (after this PR)
    duckdb -c "select date_part('timezone', '10:00:00'::TIMETZ);" -> 7200

  • DuckDB-Wasm shell, with no ICU
    select date_part('timezone', '10:00:00'::TIMETZ); -> 0

  • DuckDB-Wasm shell, after ICU loading
    select date_part('timezone', '10:00:00'::TIMETZ); -> 7200

@Mytherin
Copy link
Collaborator
Mytherin commented Sep 3, 2024

Thanks for the PR!

I think this is a good idea, but I think the implementation here is a bit hacky.

I would suggest the following implementation:

  • When starting the database, we implement stubs for the ICU functions that are overloaded. These functions have the correct signature for the overload that ICU adds (e.g. age(TIMESTAMP_TZ, TIMESTAMP_TZ)) - but only have a Bind callback.
  • The bind callback in the stub contains the logic for the auto-loading, e.g. something like this:
unique_ptr<FunctionData> BindAutoLoadableFunction(ClientContext &context, ScalarFunction &bound_function, vector<unique_ptr<Expression>> &arguments) {
    // auto-load the extension
    if (!ExtensionHelper::TryAutoLoadExtension(context, "icu")) {
        throw BinderException("Function %s exists in extension ICU - but the ICU extension cannot be auto-loaded.", bound_function.name);
    }
    // perform the lookup of the function in the catalog
    auto &system_catalog = Catalog::GetSystemCatalog(context);
    auto &function = system_catalog.GetEntry<ScalarFunctionCatalogEntry>(context, DEFAULT_SCHEMA, bound_function.name);
    // bind the function using the argument set
    auto extension_function = function.functions.GetFunctionByArguments(context, bound_function.arguments);
    // override the bound function
    bound_function = std::move(extension_function);
    // call bind again
    return bound_function.bind(context, bound_function, arguments);
}
  • The ICU extension, rather than adding overloads, instead replaces the stubs with the real implementation of the function.

I think that implementation is a lot cleaner as:

(1) we don't require this hacky work-around in the ClientContext to prevent for infinite recursion
(2) We only autoload if the overload is actually used
(3) The present overloads are visible in the catalog
(4) We can even detect that all stubs are correctly registered when loading ICU, as replacing a stub will fail if the stub is not there. This ensures these stubs are kept up-to-date

@carlopi
Copy link
Contributor Author
carlopi commented Sep 3, 2024

I did explore the BindReplace, but had not properly managed. I might ask for help and get that out.
Autoload on usage is way more sane, and solves one big problem (at least for the common case that is ICU) that is the fact that "silent autoloading" might not be visible to the user, and while in the happy path is done only once (since then stuff is loaded), it can be done N times, that is not super user friendly (say you try to fetch an extension over the network, failing every time, and keep trying on every query).

PR is a bit rough, to say the least, I was trying to keep changes to a minimum to see whether this was viable. Thanks for the suggestion, I will look at fixing that up (probably is not really viable to merge for v1.1.0 either way).

@Mytherin
Copy link
Collaborator
Mytherin commented Sep 3, 2024

Autoload on usage is way more sane, and solves one big problem (at least for the common case that is ICU) that is the fact that "silent autoloading" might not be visible to the user, and while in the happy path is done only once (since then stuff is loaded), it can be done N times, that is not super user friendly (say you try to fetch an extension over the network, failing every time, and keep trying on every query).

We can always keep some cache of attempted_autoload_extensions to ensure we only try to autoload the same extension once.

@carlopi
Copy link
Contributor Author
carlopi commented Sep 3, 2024

We can always keep some cache of attempted_autoload_extensions to ensure we only try to autoload the same extension once.

I think this does makes sense to have in general, to be fixed post v1.1.0.

I will close this PR, it's not really useful as-is, thanks for the feedback!

@carlopi carlopi closed this Sep 3, 2024
@carlopi carlopi deleted the icu_autoloading_on_overloads branch February 4, 2025 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0