Logging #15119
Conversation
Rewrite test function to scalar
A slightly bigger fix was to rewrite the test function for logging to a ScalarFunction. This makes the function generally useful because it can also be used to write to the log through SQL, for example:
```sql
SELECT write_log('this is my log message', level := 'warn', scope := 'database', log_type := 'some.log.type');
```
Note that the return value of `write_log` is NULL by default, but it can be defined using the `return_value` parameter:
```sql
SELECT write_log('logging: ' || i::VARCHAR, return_value := i) AS i FROM range(0,10) tbl(i);
```
I've rewritten the benchmarks and tests to use this function now.
Pluggable LogStorage
The log storage is now pluggable, see:
```C++
auto log_storage = make_shared_ptr<MyLogStorage>();
db.instance->GetLogManager().RegisterLogStorage("my_log_storage", log_storage);
```
This may require some more thought later, but the basics should work.
Minor fixes
@Mytherin I think this one is good to go; the failure is a regression test that fails on the old run because the feature is new.
Thanks!
throw NotImplementedException("File log storage is not yet implemented");
} else if (registered_log_storages.find(storage_name_to_lower) != registered_log_storages.end()) {
	log_storage = registered_log_storages[storage_name_to_lower];
}
@samansmink: should there be a:
else {
throw NotImplementedException("Log Storage %s not implemented and not registered", storage_name_to_lower);
}
Currently:
SET logging_storage='abc';
succeeds but does not actually change log_storage.
…#15677) The metadata size was being calculated for all the groups in the segment, but we were adding this to the total `space_used`. We now split the `space_used` into `data_size` and `metadata_size`: `metadata_size` gets recalculated with every flushed container, whereas `data_size` is added onto for every container. We make use of the new [logging](#15119) system to test this! 🥳
This PR follows up on #15119 with:
- Switching the HTTP Logger to use the new logging infrastructure by default (logging to a file is still possible using the old setting)
- Adding a new mechanism for FileHandle logging
- Refactoring the loggers to support structured logging

My apologies for the slightly chaotic PR here. It took a few tries to understand what makes sense here and I still feel like we are not 100% there yet. However, this *should* be a decent step in the right direction.

## FileHandle logging
This PR adds logging infra for filehandles. Basically what I wanted is to create a unified way to log the FileSystem APIs. I think this can be very useful, for example in testing/optimizing Iceberg/Delta workloads, but also to easily analyze IO patterns of our various readers like the parquet reader, json reader, or avro readers. Since the log messages will be in JSON format, they are easy to parse and do analysis on. Here's a demo:
```
D set enable_logging=true;
D set logging_level='trace';
D COPY (SELECT 1 as a) TO './test.csv';
D FROM "./test.csv";
┌───────┐
│   a   │
│ int64 │
├───────┤
│     1 │
└───────┘
D SELECT message FROM duckdb_logs WHERE type = 'FileSystem';
┌───────────────────────────────────────────────────────────────────────┐
│                                message                                 │
│                                varchar                                 │
├───────────────────────────────────────────────────────────────────────┤
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}               │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"2"}  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}              │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}               │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"4"}   │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"0"}   │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}              │
└────────────────────────────────────────────────────────────────────────┘
```

### Implementation
While conceptually simple, the complexity lies in the fact that we want to generally log to the ClientContext to ensure the log entries have connection-level context info. This is done by changing what was previously a `unique_ptr<Logger>` to a `shared_ptr<Logger>` in the ClientContext and copying the pointer to the logger into the filehandles on creation. What this means is that a filehandle will log using the logging configuration of the client context that created it. Because for now filehandles will not outlive the ClientContext that created them, this is fine, but in theory this can create some confusion if we were to share filehandles between connections or even between queries. I think we probably just want to ensure we don't keep filehandles open between queries.

I've created some macros for standardization of the common filehandle operations that we want to log:
```C++
#define DUCKDB_LOG_FILE_SYSTEM_READ(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "READ", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_WRITE(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "WRITE", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_OPEN(HANDLE) DUCKDB_LOG_FILE_HANDLE(HANDLE, "OPEN");
#define DUCKDB_LOG_FILE_SYSTEM_CLOSE(HANDLE) DUCKDB_LOG_FILE_HANDLE(HANDLE, "CLOSE");
```
Then in the code, for example in `LocalFileSystem::Read()`, we can easily call the logger in an efficient way with the centrally defined log type and level for the FileHandle logging:
```C++
DUCKDB_LOG_FILE_SYSTEM_READ(handle, bytes_to_read, location);
```
which will ensure that extensions that implement their own filesystems will be able to easily adhere to the logging convention.

## Logging refactor
To neatly support writing structured log messages, this PR makes a few conceptual overhauls to the logging. First of all, we remove the log types as manually passed strings. In the initial logging implementation, you would log like:
```C++
DUCKDB_LOG_<level>(context, "<log type string>", "message");
```
However, these manually provided strings are not really practical and likely to result in utter chaos. We change this now to:
```C++
// Default log type (empty string for now)
DUCKDB_LOG_<level>(context, "message");
// Predefined log type
DUCKDB_LOG(context, LogTypeClassName, "message");
```
The `DUCKDB_LOG` macro is defined as
```C++
DUCKDB_LOG_INTERNAL(SOURCE, LOG_TYPE_CLASS::NAME, LOG_TYPE_CLASS::LEVEL, LOG_TYPE_CLASS::ConstructLogMessage(__VA_ARGS__))
```
which will ensure that the log types can only be created using the corresponding log message construction methods. The `LogType` class will then also contain the logic to deserialize the log message string into a predefined datatype for easy parsing. What this allows us to do is to easily enable a specific log type, and let DuckDB automatically deserialize and unnest the resulting data:
```SQL
PRAGMA enable_logging('FileSystem');
FROM './test.csv';
SELECT fs, path, bytes, pos FROM duckdb_logs_parsed('FileSystem');
```
which yields:
```
LocalFileSystem	test.csv	OPEN	NULL	NULL
LocalFileSystem	test.csv	READ	4	0
LocalFileSystem	test.csv	READ	0	4
LocalFileSystem	test.csv	CLOSE	NULL	NULL
```
Note that `duckdb_logs_parsed` is simply a table macro for:
```SQL
SELECT * EXCLUDE (message), UNNEST(parse_duckdb_log_message(log_type, message)) FROM duckdb_logs WHERE type = log_type
```

## TODOs
Some general follow-ups:
- Add logging to azure
- Add logging to fsspec
- Add logging to #16463
- More benchmarking
- Make HTTPLogType a structured type

A more complex follow-up is to think about how to make this performant. Currently, enabling `PRAGMA enable_logging('FileSystem');` could become quite expensive because it needs to string-compare the logging type for every single `TRACE` level call. For now we actually get away with it because we only add the logger to the filehandle if, at the time of filehandle creation, we are logging for `FileSystem`. However, as soon as we were to add more TRACE-level calls that are hit a lot during execution, things will likely slow down fast.
This draft PR introduces a Logger to DuckDB. The logger will enable the communication of various types of information to both users and developers.
The logger works without any globals and should be fast enough to have log statements compiled into release builds. By default, logging is disabled, but it should be configurable through a runtime global setting.
How to use
First of all, the logger currently has 6 levels: `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR` and `FATAL`.
Secondly, the logger uses no globals; a log statement always requires some sort of context. We currently support `ThreadContext`, `ExecutionContext`, `ClientContext`, `FileOpener` and `DatabaseInstance`.
Basics
This PR includes a bunch of templating magic to make this work, so our basic log statement looks like:
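A minimal sketch of what such a call could look like, based on the macro form described in the follow-up discussion earlier in this thread (`DUCKDB_LOG_<level>(<context>, "<log type string>", "<message>")`); the exact signature in this draft may differ slightly, and the message text is illustrative:

```C++
// Sketch only: log an INFO-level entry against a ClientContext, using one of the
// log types listed further down in this description.
DUCKDB_LOG_INFO(context, "duckdb.ClientContext.BeginQuery", "beginning query");
```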
We can also use other log levels or contexts:
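For instance, hedging the same way, a different level against a different context object (`db` here stands in for whichever of the supported context types is available):

```C++
// Sketch only: a WARN-level entry logged against a DatabaseInstance instead of a ClientContext.
DUCKDB_LOG_WARN(db, "duckdb.Extensions.ExtensionAutoLoaded", "autoloading extension");
```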
String construction
Because log statements are intended for release binaries too, we should ensure that we can cheaply run the log statements without doing any more work than necessary when logging is disabled. For this reason, I have also added format-string log statements:
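A sketch of the format-string variant, assuming a printf-style overload so that the arguments are only rendered into the message when logging is enabled (`path` and `bytes_read` are illustrative variables):

```C++
// Sketch: the format arguments are only stringified if this log call is actually enabled.
DUCKDB_LOG_TRACE(context, "duckdb.FileSystem.LocalFileSystem.OpenFile", "opened '%s', read %llu bytes", path, bytes_read);
```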
and callback-based log statements:
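And a sketch of the callback-based variant, assuming the macro accepts a lambda so that expensive message construction is skipped entirely when logging is disabled (`ExpensiveDebugString` is a hypothetical helper):

```C++
// Sketch: the lambda producing the message only runs when this log statement is enabled.
DUCKDB_LOG_DEBUG(context, "duckdb.ClientContext.BeginQuery", [&]() { return ExpensiveDebugString(); });
```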
Enabling logging
Logging is disabled by default, but can be enabled using:
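For example, using the settings shown in the demo from the follow-up PR earlier in this thread:

```sql
SET enable_logging=true;
SET logging_level='trace';
```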
Log types
Log entries have a type that is currently set to "default" everywhere. However, you can specify a custom type using:
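In this draft the type is passed as a plain string argument to the log macro (the follow-up refactor earlier in this thread later replaces these strings with predefined log type classes):

```C++
// The type string is one of the namespaced types listed in the naming-convention section below.
DUCKDB_LOG_INFO(context, "duckdb.Extensions.ExtensionLoaded", "loaded extension");
```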
These types can be used to filter using an inclusion list:
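A sketch of what that could look like; the setting name here is hypothetical, since the exact name is not spelled out in this description:

```sql
-- Hypothetical setting name: keep only log entries whose type appears in this list.
SET enabled_log_types='duckdb.Extensions.ExtensionLoaded,duckdb.ClientContext.BeginQuery';
```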
or an exclusion list:
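And the inverse, again with a hypothetical setting name:

```sql
-- Hypothetical setting name: log everything except entries of these types.
SET disabled_log_types='duckdb.FileSystem.LocalFileSystem.OpenFile';
```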
Log output
In the current draft state the log entries are written to a ColumnDataCollection stored in the LogManager in the DatabaseInstance. We should, however, be able to switch between that, writing to stdout directly, and appending to a file.
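The collected entries can then be inspected through the duckdb_logs table, as used in the demo in the follow-up PR earlier in this thread:

```sql
-- type and message are the columns used in that demo; further columns are not shown here.
SELECT type, message FROM duckdb_logs;
```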
Implementation
The main driver behind this implementation is that log settings should be configurable at runtime. The consequence of this is that we need to take special care to avoid needing to lock some global state or check atomics on every log statement. This is achieved by caching the logging configuration at various points.
3 Layers of logging
Currently, there are 3 places where loggers are stored:
Besides caching the logging config, these loggers can also store additional context information such as the thread_id and the transaction_id.
Also, the logger in the ThreadContext is thread-local, which we can use later to optimize performance since it can buffer log entries in a thread-local cache, whereas the higher-level loggers require a global lock for writing log entries.
Performance
I've added some benchmarks that are also added to the regression tester to ensure we aren't introducing any measurable slowdown by leaving log statements in release binaries. The most important benchmarks are `benchmark/micro/logger/disabled/*.benchmark`, since these measure the overhead of a Log statement when logging is disabled. The results on my M1 Max MacBook Pro are:
In these benchmarks, a single-threaded table function is called which runs a `range()` table function that generates a single INT column with values from 0 to 50000000 in steps of 1. In the inner loop of this range function, the logger is called once for each tuple. For the `logging_disabled_reference` benchmark, the log statement is not called at all, serving as a reference for the other benchmarks.
The results show that the Log call is cheap, but also not free, when logging is disabled. Furthermore, we can see that the global logger is the most expensive, which makes sense as it needs to use atomics instead of const values. The FileOpener logger is also slightly more expensive, which is caused by one extra pointer that needs to be followed to check whether logging is enabled. The conclusion here is that the Log calls are very cheap and can be left in the code basically everywhere except for the performance-critical tight inner loops.
Log Type naming convention
Log entries can contain a `log_type`. I propose the following namespaced naming scheme for the log types to keep them sane: `<codebase>.<NameSpaceName>.<NameSpaceName>...<TypeName>`
For `<codebase>`, this will be any of:
- `duckdb` for logs coming from duckdb itself
- `<extension_name>` for logs coming from extensions
- `duckdb-<client>` for logs coming from duckdb client code (python, node, etc)
For example, I have currently added: `duckdb.ClientContext.BeginQuery`, `duckdb.FileSystem.LocalFileSystem.OpenFile`, `duckdb.Extensions.ExtensionLoaded`, and `duckdb.Extensions.ExtensionAutoLoaded`.
Remaining work
There are a few things still remaining, though these may potentially be split across follow-up PRs:
Prefix matching Log Types
Currently the log types are added to a set and then (inefficiently) string matched when the LogMode is set to `DISABLE_SELECTED` or `ENABLE_SELECTED`. However, with namespaced log type names, what we would actually like is to be able to do prefix matching. For example, in the sketch below, we would like to have DuckDB log all entries with log types that start with one of these prefixes.
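As an illustrative sketch, reusing the hypothetical inclusion-list setting from the "Log types" section above:

```sql
-- Hypothetical setting name: with prefix matching, these two prefixes would cover
-- every duckdb.FileSystem.* and duckdb.Extensions.* log type.
SET enabled_log_types='duckdb.FileSystem,duckdb.Extensions';
```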
There are basically 2 ways I see of doing this. One is to use a trie, perhaps using our own ART. I have discussed this with @taniabogatsch and we have concluded that we would need to make some changes to the ART to make it lightweight enough to fulfil this task. The other approach would be to create some sort of mapping where log types are pre-registered and then passed to the log calls as numeric values. The latter would be the fastest for sure, but would also require registering log types separately and keeping some sort of state.