8000 FileHandle Logging by samansmink · Pull Request #16758 · duckdb/duckdb · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

FileHandle Logging #16758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 49 commits into from
May 15, 2025
Merged

FileHandle Logging #16758

merged 49 commits into from
May 15, 2025

Conversation

samansmink
Copy link
Contributor
@samansmink samansmink commented Mar 20, 2025

This PR follows up on #15119 with:

  • Rework logging type mechanism into structured logging
  • Adds a new mechanism for FileSystem logging, using structured logging
  • Switch HTTP log to use structured logging

My apologies for the slightly chaotic PR here. It took a few tries to understand what makes sense here and I still feel like we are not 100% there yet. However, this should be a decent step in the right direction.

Logging refactor: structured logging

To neatly support writing structured log messages, this PR makes a few conceptual overhauls to the logging.

First of all, we remove the log types as manually passed string. In the initial logging implementation, you would log like:

DUCKDB_LOG_<level>(context, "<log type string>", "message");

However these manually provided strings are not really practical and likely to result in utter chaos.

We change this now to:

// Default log type (empty string for now)
DUCKDB_LOG_<level>(context, "message");

// Predefined log type
DUCKDB_LOG(context, LogTypeClassName, "message");

The DUCKDB_LOG macro is defined as

DUCKDB_LOG_INTERNAL(SOURCE, LOG_TYPE_CLASS::NAME, LOG_TYPE_CLASS::LEVEL, LOG_TYPE_CLASS::ConstructLogMessage(__VA_ARGS__))

Which will ensure that the logs types can only be created using the corresponding log message construction methods. The LogType class will then also contain the logic to deserialize the log message string into a predefined datatype for easy parsing.

What this allows us to do is to easily enable a specific log type, and let DuckDB automatically deserialize and unnest the resulting data:

PRAGMA enable_logging('FileSystem');
FROM './test.csv'
SELECT fs, path, bytes, pos FROM duckdb_logs_parsed('FileSystem');

which yields:

LocalFileSystem	test.csv	OPEN	NULL	NULL
LocalFileSystem	test.csv	READ	4	0
LocalFileSystem	test.csv	READ	0	4
LocalFileSystem	test.csv	CLOSE	NULL	NULL

Note that duckdb_logs_parsed is simply a table macro for:

SELECT * EXCLUDE (message), UNNEST(parse_duckdb_log_message(log_type, message))
FROM duckdb_logs
WHERE type = log_type

FileHandle logging


This PR adds logging infra for filehandles. Basically what I wanted is to create a unified way to log of the FileSystem APIs. I think this can be very useful, for example in testing/optimizing Iceberg/Delta workloads, but also to easily analyze IO patterns of our various readers like the parquet reader, json reader, or avro readers. Since the log messages will be in JSON format, they are easy to parse and do analysis on.

Heres a demo:

D set enable_logging=true;
D set logging_level='trace';
D COPY (SELECT 1 as a) TO './test.csv';
D FROM "./test.csv";
┌───────┐
│   a   │
│ int64 │
├───────┤
│   1   │
└───────┘
D SELECT message FROM duckdb_logs WHERE type = 'FileSystem';
┌───────────────────────────────────────────────────────────────────────────┐
│                                  message                                  │
│                                  varchar                                  │
├───────────────────────────────────────────────────────────────────────────┤
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}                  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"2"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}                 │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}                  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"4"}      │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"0"}      │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}                 │
└───────────────────────────────────────────────────────────────────────────┘

Implementation

While conceptually simple, the complexity lies in the fact that we want to generally log to the ClientContext to ensure the log entries have connection level context info. This is done by changing what was previously a unique_ptr<Logger> to a shared_ptr<Logger> in the Client context and copying the pointer to the logger into the filehandles on creation. What this means is that a filehandle will log using the logging configuration of the client context that created it. because for now filehandles will not outlive the clientcontext that created it, this is fine, but in theory this can create some confusion if we were to share filehandles between connections or even between queries. I think we probably just want to ensure we don't keep filehandles open between queries.

I've created some macros for standardization of the common filehandle operations that we want to log:

#define DUCKDB_LOG_FILE_SYSTEM_READ(HANDLE, ...)  DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "READ", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_WRITE(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "WRITE", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_OPEN(HANDLE)       DUCKDB_LOG_FILE_HANDLE(HANDLE, "OPEN");
#define DUCKDB_LOG_FILE_SYSTEM_CLOSE(HANDLE)      DUCKDB_LOG_FILE_HANDLE(HANDLE, "CLOSE");


Then in the code, for example in the LocalFileSystem::Read() we can easily call the logger in an efficient way with the centrally defined log type and level for the FileHandle logging:

DUCKDB_LOG_FILE_SYSTEM_READ(handle, bytes_to_read, location);


which will ensure that extension that implement their own filesystems will be able to easily adhere to the logging convention.

HTTP Logging

The old http logging mechanism would print http request information straight to stdout or to a file. In this PR we add a new structured logging type HTTP:

D pragma enable_logging('HTTP');
D FROM 'https://github.com/duckdb/duckdb/raw/main/data/csv/who.csv.gz';
D select request.type, request.url, request.headers from duckdb_logs_parsed('HTTP');
┌─────────┬───────────────────────────────────────────────────────────────┬──────────────────────────┐
│  type   │                              url                              │         headers          │
│ varchar │                            varchar                            │  map(varchar, varchar)   │
├─────────┼───────────────────────────────────────────────────────────────┼──────────────────────────┤
│ HEAD    │ https://github.com/duckdb/duckdb/raw/main/data/csv/who.csv.gz │ {}                       │
│ GET     │ https://github.com/duckdb/duckdb/raw/main/data/csv/who.csv.gz │ {Range='bytes=0-135283'} │
└─────────┴───────────────────────────────────────────────────────────────┴──────────────────────────┘

TODOs

Some general follow ups:

A more complex follow up is to think about how to make this performant. Currently, enabling PRAGMA enable_logging('FileSystem'); coulld become quite expensive be cause it needs to string compare the the logging type for every single TRACE level call. For now we actually get away with it because we only add the logger to the filehandle if that the time of filehandle creation we are logging for FileSystem. However, as soon as we were to add more Trace level calls that are hit a lot during execution, things will likely slow down fast.

@samansmink samansmink marked this pull request as ready for review March 24, 2025 13:42
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 27, 2025 13:12
@samansmink samansmink marked this pull request as ready for review May 9, 2025 10:41
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 9, 2025 15:26
@samansmink samansmink marked this pull request as ready for review May 9, 2025 15:29
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 12, 2025 12:26
@samansmink samansmink marked this pull request as ready for review May 12, 2025 12:43
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 13, 2025 08:12
@samansmink samansmink marked this pull request as ready for review May 14, 2025 19:33
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 15, 2025 09:46
@samansmink samansmink marked this pull request as ready for review May 15, 2025 10:09
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 15, 2025 14:14
@samansmink samansmink marked this pull request as ready for review May 15, 2025 14:19
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 15, 2025 15:38
@samansmink samansmink marked this pull request as ready for review May 15, 2025 15:44
@Mytherin Mytherin merged commit 44d0856 into duckdb:main May 15, 2025
50 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0