Logging by samansmink · Pull Request #15119 · duckdb/duckdb · GitHub

Logging #15119

Merged: 68 commits merged into duckdb:main on Jan 12, 2025

Conversation

@samansmink (Contributor) commented Dec 3, 2024

This draft PR introduces a Logger to DuckDB. The logger enables communicating various kinds of information to both users and developers.

The logger works without any globals and should be fast enough to have log statements compiled into release builds. By default, logging is disabled, but it can be enabled and configured through runtime global settings.

How to use

First of all, the logger currently has 6 levels: TRACE, DEBUG, INFO, WARN, ERROR and FATAL.

Secondly, the logger uses no globals: a log statement always requires some sort of context. We currently support ThreadContext, ExecutionContext, ClientContext, FileOpener, and DatabaseInstance.

Basics

This PR includes a bunch of templating magic to make this work, so our basic log statement looks like:

Logger::Warn(db_instance, "hi i'm a log entry");

We can also use other log levels or contexts:

Logger::Info(client_context, "this goes to the client_context local logger");
Logger::Fatal(thread_context, "this goes to the thread local logger");

String construction

Because log statements are intended for release binaries too, we should ensure that we can run them cheaply, without doing any more work than necessary when logging is disabled. For this reason I have also added format-string log statements:

Logger::Warn(db, "look at my pretty log: '%s'", some_string);

and callback based log statements:

Logger::Info(db, [](){
	return SomeExpensiveStringConstructionFunction();
});
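
The point of the callback overload is that the callable only runs after the enabled check, so the expensive string is never built when logging is off. Below is a minimal, self-contained sketch of that idea using a stand-in logger (not DuckDB's actual Logger class):

```C++
// Stand-in logger, only to illustrate why the callback overload is cheap when
// logging is disabled: the callable is invoked only after the enabled check,
// so the expensive string is never constructed on the disabled path.
#include <iostream>
#include <string>

struct SketchLogger {
	bool enabled = false; // a cached config value in the real logger

	template <class F>
	void Info(F &&construct_message) {
		if (!enabled) {
			return; // message construction is skipped entirely
		}
		std::cout << construct_message() << "\n";
	}
};

int main() {
	SketchLogger logger;
	logger.Info([]() { return std::string("expensive message that is never built"); });
	logger.enabled = true;
	logger.Info([]() { return std::string("this one is built and printed"); });
}
```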

Enabling logging

Logging is disabled by default, but can be enabled using

set enable_logging=true;
set logging_level='warn';

Log types

Log entries have a type that is currently set to "default" everywhere. However, you can specify a custom type using:

Logger::Info("my_log_type", client_context, "this goes to the client_context local logger");

These types can be used to filter using an inclusion list:

set enable_logging=true;
set enabled_loggers='my_log_type1,my_log_type2'
set logging_mode='enable_selected'

or an exclusion list:

set disable_logging=true;
set disabled_loggers='my_log_type1,my_log_type2'
set logging_mode='disable_selected'

Log output

In the current draft state, the log entries are written to a ColumnDataCollection stored in the LogManager in the DatabaseInstance. We should, however, be able to switch between that, writing to stdout directly, and appending to a file.

Implementation

The main driver behind this implementation is that log settings should be configurable at runtime. The consequence of this is that we need to take special care to avoid needing to lock some global state or check atomics on every log statement. This is achieved by caching the logging configuration at various points.

3 Layers of logging

Currently, there are 3 places where loggers are stored:

  • In the LogManager (which lives in the DatabaseInstance)
    • uses atomics to check if logging is enabled
  • In the ClientContext
    • uses const values to check if logging is enabled
    • refreshed on every QueryStart and QueryEnd
  • In the ThreadContext
    • uses const values to check if logging is enabled

Besides caching the logging config, these loggers can also store additional context information such as the thread_id and the transaction_id.

Also, the logger in the ThreadContext is thread-local, which we can use later to optimize performance: it can buffer log entries in a thread-local cache, whereas the higher-level loggers require a global lock for writing log entries.
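
As a rough illustration of the caching idea described above (the class names below are made up for the sketch and are not DuckDB's actual types): the global manager owns an atomic flag, while per-query or per-thread loggers snapshot it into a const member, so the hot path avoids atomic loads.

```C++
// Illustrative sketch only: a global manager with an atomic flag, and a
// per-query logger that snapshots that flag into a const bool at creation
// (i.e. at QueryStart in the real design), so log calls check a plain bool.
#include <atomic>
#include <iostream>

struct GlobalLogManager {
	std::atomic<bool> enabled{false};
	bool Enabled() const {
		return enabled.load(std::memory_order_relaxed); // atomic check on every call
	}
};

struct CachedLogger {
	const bool enabled; // snapshot of the global setting; plain bool on the hot path
	explicit CachedLogger(const GlobalLogManager &manager) : enabled(manager.Enabled()) {
	}
};

int main() {
	GlobalLogManager manager;
	manager.enabled = true;
	CachedLogger query_logger(manager); // would be refreshed per query
	std::cout << (query_logger.enabled ? "log" : "skip") << "\n";
}
```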

Performance

I've added some benchmarks, which are also added to the regression tester, to ensure we aren't introducing any measurable slowdown by leaving log statements in release binaries. The most important benchmarks are benchmark/micro/logger/disabled/*.benchmark, since these measure the overhead of a Log statement when logging is disabled. The results on my M1 Max MacBook Pro are:

| name | result | relative_to_reference |
| --- | --- | --- |
| benchmark/micro/logger/disabled/logging_disabled_global.benchmark | 0.926281 | 2.210263585898307 |
| benchmark/micro/logger/disabled/logging_disabled_reference.benchmark | 0.419082 | 1.0 |
| benchmark/micro/logger/disabled/logging_disabled_client_context.benchmark | 0.60807 | 1.4509584768216974 |
| benchmark/micro/logger/disabled/logging_disabled_thread_local.benchmark | 0.60227 | 1.4371349159686324 |
| benchmark/micro/logger/disabled/logging_disabled_file_opener.benchmark | 0.64255 | 1.533238872811405 |

In these benchmarks, a single-threaded range() table function is run, generating a single INT column with values from 0 to 50000000 in steps of 1. In the inner loop of this range function, the logger is called once for each tuple. For the logging_disabled_reference benchmark, the log statement is not called at all, serving as a reference for the other benchmarks. The results show that the Log call is cheap, but also not free when logging is disabled. Furthermore, we can see that the global logger is the most expensive, which makes sense as it needs to use atomics instead of const values. The FileOpener logger is also slightly more expensive, which is caused by one extra pointer that needs to be followed to check whether logging is enabled.

The conclusion here is that the Log calls are very cheap and can be left in the code basically everywhere except for performance-critical tight inner loops.
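
For reference, the shape of what the disabled-logging benchmarks measure is roughly the following: a per-tuple log call inside a tight loop, compared against the same loop without the call. This is a self-contained stand-in to show the shape only, not the actual benchmark harness under benchmark/micro/logger/, and it will not reproduce the numbers above.

```C++
// Stand-in for the benchmark pattern: one log call per tuple in a tight loop.
// The disabled path's cost is what the real benchmarks isolate; the reference
// run simply omits the log call.
#include <chrono>
#include <cstdint>
#include <iostream>

struct StubLogger {
	bool enabled = false;
	void Info(const char *message) {
		if (!enabled) {
			return; // the cost of reaching this early-out is what gets measured
		}
		std::cout << message << "\n";
	}
};

int main() {
	constexpr int64_t kTuples = 50000000;
	StubLogger logger;
	int64_t sum = 0;

	auto start = std::chrono::steady_clock::now();
	for (int64_t i = 0; i < kTuples; i++) {
		sum += i;
		logger.Info("per-tuple log statement"); // removed entirely in the reference run
	}
	auto elapsed = std::chrono::steady_clock::now() - start;

	std::cout << "sum=" << sum << " took "
	          << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count() << "ms\n";
}
```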

Log Type naming convention

Log entries can contain a log_type. I propose the following namespaced naming scheme for the log types to keep them sane:
<codebase>.<NameSpaceName>.<NameSpaceName>...<TypeName>

For <codebase>, this will be any of:

  • duckdb for logs coming from duckdb itself
  • <extension_name> for logs coming from extensions
  • duckdb-<client> for logs coming from duckdb client code (python, node, etc)

For example, I have currently added:
duckdb.ClientContext.BeginQuery, duckdb.FileSystem.LocalFileSystem.OpenFile, duckdb.Extensions.ExtensionLoaded, and duckdb.Extensions.ExtensionAutoLoaded
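
Assuming the type-string overload shown earlier also accepts format arguments (and with db_instance and path as placeholder variables for this illustration), a namespaced log type would be used like this:

```C++
// Hypothetical combination of the type-string and format-string overloads
// shown earlier; only the namespaced type string is prescribed by the convention.
Logger::Info("duckdb.FileSystem.LocalFileSystem.OpenFile", db_instance, "opened file '%s'", path);
```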

Remaining work

There are a few things still remaining, though these may potentially be split across follow-up PRs:

  • Make LogStorage pluggable for extensions
  • determine guidelines for default logging behaviour in duckdb/duckdb and core extensions
  • add C API calls

Prefix matching Log Types

Currently the log types are added to a set and then (inefficiently) string matched when the LogMode is set to DISABLE_SELECTED or ENABLE_SELECTED. However, with namespaced log type names, what we would actually like is to be able to do prefix matching. For example in:

set enabled_loggers='duckdb.FileSystem,httpfs.FileSystem,duckdb.Extensions'
set logging_mode='enable_selected'

we would like to have DuckDB log all entries with log types that start with one of these prefixes.

There are basically two ways I see of doing this. One is to use a trie, perhaps using our own ART. I have discussed this with @taniabogatsch and we have concluded that we would need to make some changes to the ART to make it lightweight enough to fulfil this task. The other approach would be to create some sort of mapping where log types are pre-registered and then passed to the log calls as numeric identifiers. The latter would be the fastest for sure, but would also require registering log types separately and keeping some sort of state.
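
For context, the naive prefix match being discussed would look roughly like the sketch below. This is purely illustrative; a trie (or pre-registered numeric ids) would replace the linear scan.

```C++
// Self-contained sketch of naive prefix matching of log types against the
// enabled prefixes; this shows the intended semantics, not DuckDB code.
#include <iostream>
#include <string>
#include <vector>

static bool TypeMatchesAnyPrefix(const std::string &log_type, const std::vector<std::string> &prefixes) {
	for (auto &prefix : prefixes) {
		// compare the first prefix.size() characters of log_type against prefix
		if (log_type.compare(0, prefix.size(), prefix) == 0) {
			return true;
		}
	}
	return false;
}

int main() {
	std::vector<std::string> enabled = {"duckdb.FileSystem", "httpfs.FileSystem", "duckdb.Extensions"};
	std::cout << TypeMatchesAnyPrefix("duckdb.FileSystem.LocalFileSystem.OpenFile", enabled) << "\n"; // 1
	std::cout << TypeMatchesAnyPrefix("duckdb.ClientContext.BeginQuery", enabled) << "\n";            // 0
}
```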

@samansmink (Contributor, Author) commented:

Rewrite test function to scalar

A slightly bigger fix was to rewrite the test function for logging to a ScalarFunction. This makes the function generally useful, because it can also be used to write to the log through SQL, for example:

SELECT write_log('this is my log message', level := 'warn', scope := 'database', log_type := 'some.log.type' )

Note that the return value of write_log is NULL by default, but it can be defined using the return_value parameter:

SELECT write_log('logging: ' || i::VARCHAR, return_value := i) as i from range(0,10) tbl(i)

I've rewritten the benchmarks and tests to use this function now.

Pluggable LogStorage

The LogStorage is now pluggable; see test/logging/test_logging.cpp:

auto log_storage = make_shared_ptr<MyLogStorage>();
db.instance->GetLogManager().RegisterLogStorage("my_log_storage", log_storage);

This may require some more thought later, but the basics should work.
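
As a self-contained analog of the registration flow above (stub types only; this is not DuckDB's actual LogStorage interface), the pattern is a name-to-implementation registry that a setting like logging_storage can then select from:

```C++
// Stand-in types illustrating the pluggable-storage idea: implementations are
// registered by name, and the active storage is looked up from that registry.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct StubLogStorage {
	virtual ~StubLogStorage() = default;
	virtual void WriteEntry(const std::string &message) = 0;
};

struct MemoryLogStorage : StubLogStorage {
	void WriteEntry(const std::string &message) override {
		std::cout << "stored: " << message << "\n";
	}
};

int main() {
	std::map<std::string, std::shared_ptr<StubLogStorage>> registered;
	registered["my_log_storage"] = std::make_shared<MemoryLogStorage>();
	registered.at("my_log_storage")->WriteEntry("hello from a custom backend");
}
```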

Minor fixes

  • LogLevel Enum is now prefixed with LOG_. The prefix is removed for EnumUtil::ToString
  • the join between the log_entries and log_contexts is now done in the view, not in a bindreplace
  • query log is restored for now, will remove once file log is added
  • remove weird config copy

@samansmink samansmink marked this pull request as ready for review January 9, 2025 16:40
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 10, 2025 15:58
@samansmink samansmink marked this pull request as ready for review January 10, 2025 16:00
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 11, 2025 13:00
@samansmink samansmink marked this pull request as ready for review January 11, 2025 13:03
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 12, 2025 02:55
@samansmink samansmink marked this pull request as ready for review January 12, 2025 02:55
@samansmink (Contributor, Author) commented:

@Mytherin I think this one is good to go; the failure is a regression test that fails on the old run because the feature is new.

@Mytherin Mytherin merged commit 16d1d64 into duckdb:main Jan 12, 2025
48 of 49 checks passed
@Mytherin (Collaborator) commented:

Thanks!

throw NotImplementedException("File log storage is not yet implemented");
} else if (registered_log_storages.find(storage_name_to_lower) != registered_log_storages.end()) {
log_storage = registered_log_storages[storage_name_to_lower];
}
A reviewer (Contributor) commented on the snippet above:

@samansmink: should there be a:

else {
    throw NotImplementedException("Log Storage %s not implemented and not registered", storage_name_to_lower);
}

Currently:

SET logging_storage='abc';

succeeds but does not actually change the log_storage.

Mytherin added a commit that referenced this pull request Jan 13, 2025
…#15677)

The metadata size was being calculated for all the groups in the
segment, but we were adding this to the total `space_used`.
We now split the `space_used` into `data_size` and `metadata_size`.

`metadata_size` gets recalculated with every flushed container, whereas
`data_size` is added onto for every container.
We make use of the new [logging](#15119) system to test this! 🥳
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Feb 2, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Feb 2, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Feb 2, 2025
@samansmink samansmink mentioned this pull request Mar 20, 2025
Mytherin added a commit that referenced this pull request May 15, 2025
This PR follows up on #15119 with:
- Switching the HTTP Logger to use the new logging infrastructure by
default (logging to a file is still possible using the old setting)
- Adds a new mechanism for FileHandle logging
- Refactoring the loggers to support structured logging

My apologies for the slightly chaotic PR here. It took a few tries to
understand what makes sense here and I still feel like we are not 100%
there yet. However, this *should* be a decent step in the right
direction.

## FileHandle logging

This PR adds logging infra for filehandles. Basically, what I wanted is
to create a unified way to log the FileSystem APIs. I think this can
be very useful, for example in testing/optimizing Iceberg/Delta
workloads, but also to easily analyze the IO patterns of our various readers
like the Parquet, JSON, or Avro readers. Since the log messages will be in
JSON format, they are easy to parse and analyze.

Here's a demo:

```
D set enable_logging=true;
D set logging_level='trace';
D COPY (SELECT 1 as a) TO './test.csv';
D FROM "./test.csv";
┌───────┐
│   a   │
│ int64 │
├───────┤
│   1   │
└───────┘
D SELECT message FROM duckdb_logs WHERE type = 'FileSystem';
┌───────────────────────────────────────────────────────────────────────────┐
│                                  message                                  │
│                                  varchar                                  │
├───────────────────────────────────────────────────────────────────────────┤
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}                  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"2"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}                 │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}                  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"4"}      │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"0"}      │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}                 │
└───────────────────────────────────────────────────────────────────────────┘
```

### Implementation
While conceptually simple, the complexity lies in the fact that we generally
want to log to the ClientContext to ensure the log entries have
connection-level context info. This is done by changing what was
previously a `unique_ptr<Logger>` to a `shared_ptr<Logger>` in the
ClientContext and copying the pointer to the logger into the
filehandles on creation. What this means is that a filehandle will log
using the logging configuration of the client context that created it.
Because, for now, filehandles will not outlive the ClientContext that
created them, this is fine, but in theory this can create some confusion
if we were to share filehandles between connections or even between
queries. I think we probably just want to ensure we don't keep
filehandles open between queries.
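
Below is a self-contained sketch of that ownership change (stub types, not the real ClientContext/FileHandle): the context holds a shared_ptr to its logger and each handle copies that pointer at creation time, so the handle logs with the configuration of the context that created it.

```C++
// Stand-in types illustrating the shared_ptr<Logger> ownership described above.
#include <iostream>
#include <memory>
#include <string>

struct StubLogger {
	bool enabled = true;
	void Trace(const std::string &message) {
		if (enabled) {
			std::cout << message << "\n";
		}
	}
};

struct StubClientContext {
	std::shared_ptr<StubLogger> logger = std::make_shared<StubLogger>();
};

struct StubFileHandle {
	std::shared_ptr<StubLogger> logger; // copied from the creating context
	explicit StubFileHandle(StubClientContext &context) : logger(context.logger) {
	}
};

int main() {
	StubClientContext context;
	StubFileHandle handle(context); // handle now logs with the context's configuration
	handle.logger->Trace(R"({"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"})");
}
```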

I've created some macros for standardization of the common filehandle
operations that we want to log:
```C++
#define DUCKDB_LOG_FILE_SYSTEM_READ(HANDLE, ...)  DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "READ", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_WRITE(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "WRITE", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_OPEN(HANDLE)       DUCKDB_LOG_FILE_HANDLE(HANDLE, "OPEN");
#define DUCKDB_LOG_FILE_SYSTEM_CLOSE(HANDLE)      DUCKDB_LOG_FILE_HANDLE(HANDLE, "CLOSE");

```

Then in the code, for example in `LocalFileSystem::Read()`, we can
easily call the logger in an efficient way, with the centrally defined
log type and level for the FileHandle logging:

```C++
DUCKDB_LOG_FILE_SYSTEM_READ(handle, bytes_to_read, location);

```

which will ensure that extensions that implement their own filesystems
can easily adhere to the logging convention.

## Logging refactor
To neatly support writing structured log messages, this PR makes a few
conceptual overhauls to the logging.

First of all, we remove the log types as manually passed string. In the
initial logging implementation, you would log like:

```C++
DUCKDB_LOG_<level>(context, "<log type string>", "message");
```
However, these manually provided strings are not really practical and are
likely to result in utter chaos.

We change this now to:

```C++
// Default log type (empty string for now)
DUCKDB_LOG_<level>(context, "message");

// Predefined log type
DUCKDB_LOG(context, LogTypeClassName, "message");
```

The `DUCKDB_LOG` macro is defined as 

```C++
DUCKDB_LOG_INTERNAL(SOURCE, LOG_TYPE_CLASS::NAME, LOG_TYPE_CLASS::LEVEL, LOG_TYPE_CLASS::ConstructLogMessage(__VA_ARGS__))
```

This will ensure that log types can only be created using the
corresponding log message construction methods. The `LogType` class will
then also contain the logic to deserialize the log message string into a
predefined datatype for easy parsing.
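
For illustration, a predefined log type that would satisfy the `DUCKDB_LOG` macro above could look like the sketch below. The `NAME`/`LEVEL`/`ConstructLogMessage` member names mirror what the macro expects; the stand-in `LogLevel` enum, the choice of trace level, and the message layout are assumptions for the sketch, not the PR's actual `FileSystem` log type.

```C++
#include <string>

// Stand-in for DuckDB's log level enum, just for this sketch (the real enum
// values are prefixed with LOG_, as noted earlier in this thread).
enum class LogLevel { LOG_TRACE, LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR, LOG_FATAL };

struct FileSystemLogType {
	static constexpr const char *NAME = "FileSystem";
	static constexpr LogLevel LEVEL = LogLevel::LOG_TRACE;

	// In the real type this produces the structured (JSON) message that
	// duckdb_logs_parsed later deserializes; here it is simple concatenation.
	static std::string ConstructLogMessage(const std::string &fs, const std::string &path, const std::string &op) {
		return "{\"fs\":\"" + fs + "\",\"path\":\"" + path + "\",\"op\":\"" + op + "\"}";
	}
};
```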

What this allows us to do is to easily enable a specific log type, and
let DuckDB automatically deserialize and unnest the resulting data:

```SQL
PRAGMA enable_logging('FileSystem');
FROM './test.csv'
SELECT fs, path, bytes, pos FROM duckdb_logs_parsed('FileSystem');
```
which yields:
```
LocalFileSystem	test.csv	OPEN	NULL	NULL
LocalFileSystem	test.csv	READ	4	0
LocalFileSystem	test.csv	READ	0	4
LocalFileSystem	test.csv	CLOSE	NULL	NULL
```

Note that `duckdb_logs_parsed` is simply a table macro for:

```SQL
SELECT * EXCLUDE (message), UNNEST(parse_duckdb_log_message(log_type, message))
FROM duckdb_logs
WHERE type = log_type
```

## TODOs
Some general follow ups:
- Add logging to azure
- Add logging to fsspec 
- Add logging to #16463
- More benchmarking
- Make HTTPLogType a structured type

A more complex follow-up is to think about how to make this performant.
Currently, enabling `PRAGMA enable_logging('FileSystem');` could become
quite expensive, because it needs to string-compare the logging type
for every single `TRACE`-level call. For now we actually get away with
it because we only add the logger to the filehandle if, at the time of
filehandle creation, we are logging for `FileSystem`. However, as soon as
we add more `TRACE`-level calls that are hit a lot during execution,
things will likely slow down fast.
Labels: Needs Documentation (use for issues or PRs that require changes in the documentation)
6 participants