-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Icu autoloading on overloads #13716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Icu autoloading on overloads #13716
Conversation
Noticed that in our tests we might be relying on the fact that this should NOT work, like:
there is a This also means that it would require some more care / probably can't be merged in main while changing test results. Results by duckdb available binaries:
|
Thanks for the PR! I think this is a good idea, but I think the implementation here is a bit hacky. I would suggest the following implementation:
unique_ptr<FunctionData> BindAutoLoadableFunction(ClientContext &context, ScalarFunction &bound_function, vector<unique_ptr<Expression>> &arguments) {
// auto-load the extension
if (!ExtensionHelper::TryAutoLoadExtension(context, "icu")) {
throw BinderException("Function %s exists in extension ICU - but the ICU extension cannot be auto-loaded.", bound_function.name);
}
// perform the lookup of the function in the catalog
auto &system_catalog = Catalog::GetSystemCatalog(context);
auto &function = system_catalog.GetEntry<ScalarFunctionCatalogEntry>(context, DEFAULT_SCHEMA, bound_function.name);
// bind the function using the argument set
auto extension_function = function.functions.GetFunctionByArguments(context, bound_function.arguments);
// override the bound function
bound_function = std::move(extension_function);
// call bind again
return bound_function.bind(context, bound_function, arguments);
}
I think that implementation is a lot cleaner as: (1) we don't require this hacky work-around in the ClientContext to prevent for infinite recursion |
I did explore the BindReplace, but had not properly managed. I might ask for help and get that out. PR is a bit rough, to say the least, I was trying to keep changes to a minimum to see whether this was viable. Thanks for the suggestion, I will look at fixing that up (probably is not really viable to merge for v1.1.0 either way). |
We can always keep some cache of |
I think this does makes sense to have in general, to be fixed post v1.1.0. I will close this PR, it's not really useful as-is, thanks for the feedback! |
Current autoloading logic works like:
This fails for functions where overloads are provided in extensions, since the first step would succeed, leading to no autoloading.
The problem is inherently not super nice, since either we blindly perform autoloading (and might do so N times, paying potentially lag), or when we will eventually realize we might need overloads we will miss context (and we might also not be able to decide on the overload in the same way before and after load).
Only proper solution is performing Autoloading in the same place it's done now also for some overloads.
Solving ICU autoloading problems
The most visible source of problems here, and the one meant to be solved by this PR, are ICU overloads.
This is both since ICU is one of the few autoloadable extensions to provide a lots of autoloads, and since ICU is statically linked-in by default in most platforms, but NOT shipped by default in DuckDB-Wasm.
Current solution to support same syntax locally and on the Web is, in most cases, just force a load of ICU on startup. While this works, it's unnecessary slow AND partially defeats the purpose of autoloading.
Solution
A list of functions in the ICU extensions that provide overloads to already existing functions has been compiled by a logic like:
Then when such a function is found, TryAutoLoading is performed on ICU.
Given no information is given back to the user on failures, this is only attempted once per instance. This is to mitigate performing one INSTALL request for every query with a
timezone
(say on a commit with no ICU available remotely).On failure to find an overloads, the error message will show that autoloading has been attempted (at some previous point), but given context on the actual error has been lost, a generic error message inviting to explicitly INSTALL and LOAD will be shown.
Demo
Current behaviour
Current when autoloading works:
Current when autoloading fails:
Limitiations
The main limitation is that this currently works only for ICU. Generalizing this is viable, requires both compiling list of overloads (ideally to be done dynamically) and to provide a more general mechanism to register extension for which TryAutoLoading has been tried in ClientContext (removing tried_autoloading_icu and using a map from autoloadable extensions names)
The same mechanism can also be used to improve other cases where TryAutoLoading is used, so that repeated failures are avoided. For example, I think the spatial+parquet interaction might incur in 1 request per parqu 8000 et file if the requests are failing. This has not been raised yet, but eventually might need fixing. Similar cases might exist elsewhere, but haven't looked too hard.
How to move forward
Goal of this PR is offer a way forward for duckdb-wasm, so the target on ICU. This also means has basically no effect on distributed binaries that links in ICU, given in that case ICU will already be available in any case.
I would like to use this in duckdb-wasm, one way would be to merge this in duckdb, unsure whether main or feature.
I can also work with this merged into feature, and applying this as a patch on top of duckdb v1.1.0.
Once this logic will be moved to duckdb-wasm (either as part of duckdb or as a patch), this solves duckdb/duckdb-wasm#1770 and various similar issues raised over time.