Logging #15119
Conversation
Rewrite test function to scalar
A slightly bigger fix was to rewrite the test function for logging to a ScalarFunction. This makes the function generally useful because it can also be used to write to the log through SQL, for example:
```sql
SELECT write_log('this is my log message', level := 'warn', scope := 'database', log_type := 'some.log.type');
```
Note that the return value of `write_log` is NULL by default, but it can be defined using the `return_value` parameter:
```sql
SELECT write_log('logging: ' || i::VARCHAR, return_value := i) AS i FROM range(0,10) tbl(i);
```
I've rewritten the benchmarks and tests to use this function now.
Pluggable LogStorage
The log storage is now pluggable, see:
```C++
auto log_storage = make_shared_ptr<MyLogStorage>();
db.instance->GetLogManager().RegisterLogStorage("my_log_storage", log_storage);
```
This may require some more thought later, but the basics should work.
Minor fixes
@Mytherin I think this one is good to go; the failure is a regression test that fails on the old run because the feature is new.
Thanks!
throw NotImplementedException("File log storage is not yet implemented");
} else if (registered_log_storages.find(storage_name_to_lower) != registered_log_storages.end()) {
	log_storage = registered_log_storages[storage_name_to_lower];
}
@samansmink: should there be a:
else {
throw NotImplementedException("Log Storage %s not implemented and not registered", storage_name_to_lower);
}
Currently:
SET logging_storage='abc';
succeeds but does not actually change log_storage.
…#15677) The metadata size was being calculated for all the groups in the segment, but we were adding this to the total `space_used`. We now split the `space_used` into `data_size` and `metadata_size`: `metadata_size` gets recalculated with every flushed container, whereas `data_size` is added onto for every container. We make use of the new [logging](#15119) system to test this! 🥳
This PR follows up on #15119 with:
- Switching the HTTP Logger to use the new logging infrastructure by default (logging to a file is still possible using the old setting)
- Adding a new mechanism for FileHandle logging
- Refactoring the loggers to support structured logging

My apologies for the slightly chaotic PR here. It took a few tries to understand what makes sense here and I still feel like we are not 100% there yet. However, this *should* be a decent step in the right direction.

## FileHandle logging
This PR adds logging infra for filehandles. Basically what I wanted is to create a unified way to log the FileSystem APIs. I think this can be very useful, for example in testing/optimizing Iceberg/Delta workloads, but also to easily analyze IO patterns of our various readers like the parquet reader, json reader, or avro readers. Since the log messages will be in JSON format, they are easy to parse and do analysis on. Here's a demo:
```
D set enable_logging=true;
D set logging_level='trace';
D COPY (SELECT 1 as a) TO './test.csv';
D FROM "./test.csv";
┌───────┐
│   a   │
│ int64 │
├───────┤
│     1 │
└───────┘
D SELECT message FROM duckdb_logs WHERE type = 'FileSystem';
┌───────────────────────────────────────────────────────────────────────┐
│                                message                                 │
│                                varchar                                 │
├───────────────────────────────────────────────────────────────────────┤
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}               │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"2"}  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}              │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}               │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"4"}   │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"0"}   │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}              │
└────────────────────────────────────────────────────────────────────────┘
```

### Implementation
While conceptually simple, the complexity lies in the fact that we want to generally log to the ClientContext to ensure the log entries have connection-level context info. This is done by changing what was previously a `unique_ptr<Logger>` to a `shared_ptr<Logger>` in the ClientContext and copying the pointer to the logger into the filehandles on creation. What this means is that a filehandle will log using the logging configuration of the client context that created it. Because for now filehandles will not outlive the ClientContext that created them, this is fine, but in theory this can create some confusion if we were to share filehandles between connections or even between queries. I think we probably just want to ensure we don't keep filehandles open between queries.

I've created some macros for standardization of the common filehandle operations that we want to log:
```C++
#define DUCKDB_LOG_FILE_SYSTEM_READ(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "READ", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_WRITE(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "WRITE", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_OPEN(HANDLE) DUCKDB_LOG_FILE_HANDLE(HANDLE, "OPEN");
#define DUCKDB_LOG_FILE_SYSTEM_CLOSE(HANDLE) DUCKDB_LOG_FILE_HANDLE(HANDLE, "CLOSE");
```
Then in the code, for example in `LocalFileSystem::Read()`, we can easily call the logger in an efficient way with the centrally defined log type and level for the FileHandle logging:
```C++
DUCKDB_LOG_FILE_SYSTEM_READ(handle, bytes_to_read, location);
```
which will ensure that extensions that implement their own filesystems will be able to easily adhere to the logging convention.

## Logging refactor
To neatly support writing structured log messages, this PR makes a few conceptual overhauls to the logging. First of all, we remove the log types as manually passed strings. In the initial logging implementation, you would log like:
```C++
DUCKDB_LOG_<level>(context, "<log type string>", "message");
```
However, these manually provided strings are not really practical and likely to result in utter chaos. We change this now to:
```C++
// Default log type (empty string for now)
DUCKDB_LOG_<level>(context, "message");
// Predefined log type
DUCKDB_LOG(context, LogTypeClassName, "message");
```
The `DUCKDB_LOG` macro is defined as
```C++
DUCKDB_LOG_INTERNAL(SOURCE, LOG_TYPE_CLASS::NAME, LOG_TYPE_CLASS::LEVEL, LOG_TYPE_CLASS::ConstructLogMessage(__VA_ARGS__))
```
which will ensure that the log types can only be created using the corresponding log message construction methods. The `LogType` class will then also contain the logic to deserialize the log message string into a predefined datatype for easy parsing. What this allows us to do is to easily enable a specific log type, and let DuckDB automatically deserialize and unnest the resulting data:
```SQL
PRAGMA enable_logging('FileSystem');
FROM './test.csv';
SELECT fs, path, bytes, pos FROM duckdb_logs_parsed('FileSystem');
```
which yields:
```
LocalFileSystem	test.csv	OPEN	NULL	NULL
LocalFileSystem	test.csv	READ	4	0
LocalFileSystem	test.csv	READ	0	4
LocalFileSystem	test.csv	CLOSE	NULL	NULL
```
Note that `duckdb_logs_parsed` is simply a table macro for:
```SQL
SELECT * EXCLUDE (message), UNNEST(parse_duckdb_log_message(log_type, message)) FROM duckdb_logs WHERE type = log_type
```

## TODOs
Some general follow-ups:
- Add logging to azure
- Add logging to fsspec
- Add logging to #16463
- More benchmarking
- Make HTTPLogType a structured type

A more complex follow-up is to think about how to make this performant. Currently, enabling `PRAGMA enable_logging('FileSystem');` could become quite expensive because it needs to string-compare the logging type for every single `TRACE` level call. For now we actually get away with it because we only add the logger to the filehandle if, at the time of filehandle creation, we are logging for `FileSystem`. However, as soon as we were to add more TRACE-level calls that are hit a lot during execution, things will likely slow down fast.
This draft PR introduces a Logger to DuckDB. The logger will enable the communication of various types of information to both users and developers.
The logger works without any globals and should be fast enough to have log statements compiled into release builds. By default, logging is disabled, but it should be configurable through a runtime global setting.
How to use
First of all, the logger currently has 6 levels: `TRACE`, `DEBUG`, `INFO`, `WARN`, `ERROR` and `FATAL`.
Secondly, the logger uses no globals; a log statement always requires some sort of context. We currently support `ThreadContext`, `ExecutionContext`, `ClientContext`, `FileOpener` and `DatabaseInstance`.
Basics
This PR includes a bunch of templating magic to make this work, so our basic log statement looks like:
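A minimal sketch of what such a call could look like, based on the macro form described in the follow-up discussion earlier in this thread (`DUCKDB_LOG_<level>(<context>, "<log type string>", "<message>")`); the exact signature in this draft may differ slightly, and the message text is illustrative:

```C++
// Sketch only: log an INFO-level entry against a ClientContext, using one of the
// log types listed further down in this description.
DUCKDB_LOG_INFO(context, "duckdb.ClientContext.BeginQuery", "beginning query");
```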
We can also use other log levels or contexts:
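For instance, hedging the same way, a different level against a different context object (`db` here stands in for whichever of the supported context types is available):

```C++
// Sketch only: a WARN-level entry logged against a DatabaseInstance instead of a ClientContext.
DUCKDB_LOG_WARN(db, "duckdb.Extensions.ExtensionAutoLoaded", "autoloading extension");
```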
String construction
Because log statements are intended for release binaries too, we should ensure that we can cheaply run the log statements without doing any more work than necessary when logging is disabled. For this reason, I have also added format-string log statements:
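A sketch of the format-string variant, assuming a printf-style overload so that the arguments are only rendered into the message when logging is enabled (`path` and `bytes_read` are illustrative variables):

```C++
// Sketch: the format arguments are only stringified if this log call is actually enabled.
DUCKDB_LOG_TRACE(context, "duckdb.FileSystem.LocalFileSystem.OpenFile", "opened '%s', read %llu bytes", path, bytes_read);
```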
and callback-based log statements:
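And a sketch of the callback-based variant, assuming the macro accepts a lambda so that expensive message construction is skipped entirely when logging is disabled (`ExpensiveDebugString` is a hypothetical helper):

```C++
// Sketch: the lambda producing the message only runs when this log statement is enabled.
DUCKDB_LOG_DEBUG(context, "duckdb.ClientContext.BeginQuery", [&]() { return ExpensiveDebugString(); });
```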
Enabling logging
Logging is disabled by default, but can be enabled using:
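For example, using the settings shown in the demo from the follow-up PR earlier in this thread:

```sql
SET enable_logging=true;
SET logging_level='trace';
```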
Log types
Log entries have a type that is currently set to "default" everywhere. However, you can specify a custom type using:
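In this draft the type is passed as a plain string argument to the log macro (the follow-up refactor earlier in this thread later replaces these strings with predefined log type classes):

```C++
// The type string is one of the namespaced types listed in the naming-convention section below.
DUCKDB_LOG_INFO(context, "duckdb.Extensions.ExtensionLoaded", "loaded extension");
```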
These types can be used to filter using an inclusion list:
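A sketch of what that could look like; the setting name here is hypothetical, since the exact name is not spelled out in this description:

```sql
-- Hypothetical setting name: keep only log entries whose type appears in this list.
SET enabled_log_types='duckdb.Extensions.ExtensionLoaded,duckdb.ClientContext.BeginQuery';
```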
or an exclusion list:
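And the inverse, again with a hypothetical setting name:

```sql
-- Hypothetical setting name: log everything except entries of these types.
SET disabled_log_types='duckdb.FileSystem.LocalFileSystem.OpenFile';
```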
Log output
In the current draft state the log entries are written to a ColumnDataCollection stored in the LogManager in the DatabaseInstance. We should, however, be able to switch between that, writing to stdout directly, and appending to a file.
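The collected entries can then be inspected through the duckdb_logs table, as used in the demo in the follow-up PR earlier in this thread:

```sql
-- type and message are the columns used in that demo; further columns are not shown here.
SELECT type, message FROM duckdb_logs;
```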
Implementation
The main driver behind this implementation is that log settings should be configurable at runtime. The consequence of this is that we need to take special care to avoid needing to lock some global state or check atomics on every log statement. This is achieved by caching the logging configuration at various points.
3 Layers of logging
Currently, there are 3 places where loggers are stored:
Besides caching the logging config, these loggers can also store additional context information such as the thread_id and the transaction_id.
Also, the logger in the ThreadContext is thread-local, which we can use later to optimize performance since it can buffer log entries in a thread-local cache, whereas the higher-level loggers require a global lock for writing log entries.
Performance
I've added some benchmarks that are also added to the regression tester to ensure we aren't introducing any measurable slowdown by leaving log statements in release binaries. The most important benchmarks are `benchmark/micro/logger/disabled/*.benchmark`, since these measure the overhead of a Log statement when logging is disabled. The results on my M1 Max MacBook Pro are:
In these benchmarks, a single-threaded table function is called which runs a `range()` table function that generates a single INT column with values from 0 to 50000000 in steps of 1. In the inner loop of this range function, the logger is called once for each tuple. For the `logging_disabled_reference` benchmark, the log statement is not called at all, serving as a reference for the other benchmarks.
The results show that the Log call is cheap, but also not free, when logging is disabled. Furthermore, we can see that the global logger is the most expensive, which makes sense as it needs to use atomics instead of const values. The FileOpener logger is also slightly more expensive, which is caused by one extra pointer that needs to be followed to check whether logging is enabled. The conclusion here is that the Log calls are very cheap and can be left in the code basically everywhere except for the performance-critical tight inner loops.
Log Type naming convention
Log entries can contain a `log_type`. I propose the following namespaced naming scheme for the log types to keep them sane: `<codebase>.<NameSpaceName>.<NameSpaceName>...<TypeName>`
For `<codebase>`, this will be any of:
- `duckdb` for logs coming from duckdb itself
- `<extension_name>` for logs coming from extensions
- `duckdb-<client>` for logs coming from duckdb client code (python, node, etc)
For example, I have currently added: `duckdb.ClientContext.BeginQuery`, `duckdb.FileSystem.LocalFileSystem.OpenFile`, `duckdb.Extensions.ExtensionLoaded`, and `duckdb.Extensions.ExtensionAutoLoaded`.
Remaining work
There are a few things still remaining, though these may potentially be split across follow-up PRs:
Prefix matching Log Types
Currently the log types are added to a set and then (inefficiently) string matched when the LogMode is set to `DISABLE_SELECTED` or `ENABLE_SELECTED`. However, with namespaced log type names, what we would actually like is to be able to do prefix matching. For example, in the sketch below, we would like to have DuckDB log all entries with log types that start with one of these prefixes.
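As an illustrative sketch, reusing the hypothetical inclusion-list setting from the "Log types" section above:

```sql
-- Hypothetical setting name: with prefix matching, these two prefixes would cover
-- every duckdb.FileSystem.* and duckdb.Extensions.* log type.
SET enabled_log_types='duckdb.FileSystem,duckdb.Extensions';
```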
There are basically 2 ways I see of doing this. One is to use a trie, perhaps using our own ART. I have discussed this with @taniabogatsch and we have concluded that we would need to make some changes to the ART to make it lightweight enough to fulfil this task. The other approach would be to create some sort of mapping where log types are pre-registered and then passed to the log calls as numeric values. The latter would be the fastest for sure, but would also require registering log types separately and keeping some sort of state.