FileHandle Logging #16758

samansmink · 2025-03-20T18:37:23Z

This PR follows up on #15119 with:

Rework logging type mechanism into structured logging
Adds a new mechanism for FileSystem logging, using structured logging
Switch HTTP log to use structured logging

My apologies for the slightly chaotic PR here. It took a few tries to understand what makes sense here and I still feel like we are not 100% there yet. However, this should be a decent step in the right direction.

Logging refactor: structured logging

To neatly support writing structured log messages, this PR makes a few conceptual overhauls to the logging.

First of all, we remove the log types as manually passed strings. In the initial logging implementation, you would log like this:

DUCKDB_LOG_<level>(context, "<log type string>", "message");

However, these manually provided strings are not really practical and are likely to result in utter chaos.

We change this now to:

// Default log type (empty string for now)
DUCKDB_LOG_<level>(context, "message");

// Predefined log type
DUCKDB_LOG(context, LogTypeClassName, "message");

The DUCKDB_LOG macro is defined as

DUCKDB_LOG_INTERNAL(SOURCE, LOG_TYPE_CLASS::NAME, LOG_TYPE_CLASS::LEVEL, LOG_TYPE_CLASS::ConstructLogMessage(__VA_ARGS__))

This ensures that log types can only be created using the corresponding log message construction methods. The LogType class also contains the logic to deserialize the log message string into a predefined datatype for easy parsing.
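To illustrate the idea, here is a minimal, self-contained sketch of a log type class that bundles its name, level, and message constructor, together with a macro that dispatches through it. All names here (ToyLogger, ToyFileSystemLogType, TOY_LOG, TOY_LOG_INTERNAL) are hypothetical stand-ins and not the classes or macros from this PR; the message layout is only an assumption modelled on the demo output.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not DuckDB's real classes.
enum class LogLevel : uint8_t { Trace, Debug, Info };

struct LogEntry {
	std::string type;
	LogLevel level;
	std::string message;
};

// A toy sink that simply collects entries in memory.
struct ToyLogger {
	std::vector<LogEntry> entries;
	void Write(const char *type, LogLevel level, std::string message) {
		entries.push_back({type, level, std::move(message)});
	}
};

// A log type bundles its name, its level, and the only way to build its message.
struct ToyFileSystemLogType {
	static constexpr const char *NAME = "FileSystem";
	static constexpr LogLevel LEVEL = LogLevel::Trace;

	static std::string ConstructLogMessage(const std::string &fs, const std::string &path, const std::string &op) {
		return "{\"fs\":\"" + fs + "\",\"path\":\"" + path + "\",\"op\":\"" + op + "\"}";
	}
};

#define TOY_LOG_INTERNAL(SOURCE, TYPE, LEVEL, MESSAGE) (SOURCE).Write(TYPE, LEVEL, MESSAGE)
// The caller names a log type class; name, level and message format all come from that class.
#define TOY_LOG(SOURCE, LOG_TYPE_CLASS, ...)                                                                           \
	TOY_LOG_INTERNAL(SOURCE, LOG_TYPE_CLASS::NAME, LOG_TYPE_CLASS::LEVEL,                                              \
	                 LOG_TYPE_CLASS::ConstructLogMessage(__VA_ARGS__))
```

With this shape, a call such as TOY_LOG(logger, ToyFileSystemLogType, "LocalFileSystem", "./test.csv", "OPEN") cannot pass a free-form type string, and a call with the wrong arguments for the log type fails to compile.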

What this allows us to do is to easily enable a specific log type, and let DuckDB automatically deserialize and unnest the resulting data:

PRAGMA enable_logging('FileSystem');
FROM './test.csv';
SELECT fs, path, op, bytes, pos FROM duckdb_logs_parsed('FileSystem');

which yields:

LocalFileSystem	test.csv	OPEN	NULL	NULL
LocalFileSystem	test.csv	READ	4	0
LocalFileSystem	test.csv	READ	0	4
LocalFileSystem	test.csv	CLOSE	NULL	NULL

Note that duckdb_logs_parsed is simply a table macro for:

SELECT * EXCLUDE (message), UNNEST(parse_duckdb_log_message(log_type, message))
FROM duckdb_logs
WHERE type = log_type

FileHandle logging 

This PR adds logging infra for filehandles. Basically, what I wanted is a unified way to log calls to the FileSystem APIs. I think this can be very useful, for example in testing/optimizing Iceberg/Delta workloads, but also to easily analyze the IO patterns of our various readers like the parquet, json, or avro readers. Since the log messages are in JSON format, they are easy to parse and analyze.

Here's a demo:

D set enable_logging=true;
D set logging_level='trace';
D COPY (SELECT 1 as a) TO './test.csv';
D FROM "./test.csv";
┌───────┐
│   a   │
│ int64 │
├───────┤
│   1   │
└───────┘
D SELECT message FROM duckdb_logs WHERE type = 'FileSystem';
┌───────────────────────────────────────────────────────────────────────────┐
│                                  message                                  │
│                                  varchar                                  │
├───────────────────────────────────────────────────────────────────────────┤
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}                  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"2"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"WRITE","bytes":"1"}     │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}                 │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"OPEN"}                  │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"4"}      │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"READ","bytes":"0"}      │
│ {"fs":"LocalFileSystem","path":"./test.csv","op":"CLOSE"}                 │
└───────────────────────────────────────────────────────────────────────────┘

Implementation

While conceptually simple, the complexity lies in the fact that we generally want to log to the ClientContext to ensure the log entries have connection-level context info. This is done by changing what was previously a unique_ptr<Logger> to a shared_ptr<Logger> in the ClientContext and copying the pointer to the logger into the filehandles on creation. What this means is that a filehandle will log using the logging configuration of the client context that created it. Because for now filehandles will not outlive the ClientContext that created them, this is fine, but in theory this can create some confusion if we were to share filehandles between connections or even between queries. I think we probably just want to ensure we don't keep filehandles open between queries.
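The ownership change described above can be sketched roughly as follows. This is a simplified illustration under assumed names (ToyLogger, ToyClientContext, ToyFileHandle); DuckDB's real Logger, ClientContext and FileHandle classes look different.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; DuckDB's real Logger, ClientContext and FileHandle differ.
struct ToyLogger {
	std::string connection_id;
	std::vector<std::string> messages;
	void Log(const std::string &msg) {
		// Every entry carries the connection-level context of the owning client context.
		messages.push_back(connection_id + ": " + msg);
	}
};

struct ToyClientContext {
	// shared_ptr rather than unique_ptr, so that file handles can hold a reference too.
	std::shared_ptr<ToyLogger> logger;
	explicit ToyClientContext(std::string id) : logger(std::make_shared<ToyLogger>()) {
		logger->connection_id = std::move(id);
	}
};

struct ToyFileHandle {
	std::string path;
	std::shared_ptr<ToyLogger> logger; // copied from the creating context
	ToyFileHandle(const ToyClientContext &context, std::string path_p)
	    : path(std::move(path_p)), logger(context.logger) {
	}
	void Read() {
		if (logger) {
			logger->Log("READ " + path);
		}
	}
};
```

Because the handle holds its own reference, the logger stays alive as long as the handle does, and the handle keeps logging with the configuration of the context that created it, which is exactly where the confusion could arise if handles were shared across connections.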

I've created some macros for standardization of the common filehandle operations that we want to log:

#define DUCKDB_LOG_FILE_SYSTEM_READ(HANDLE, ...)  DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "READ", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_WRITE(HANDLE, ...) DUCKDB_LOG_FILE_HANDLE_VA(HANDLE, "WRITE", __VA_ARGS__);
#define DUCKDB_LOG_FILE_SYSTEM_OPEN(HANDLE)       DUCKDB_LOG_FILE_HANDLE(HANDLE, "OPEN");
#define DUCKDB_LOG_FILE_SYSTEM_CLOSE(HANDLE)      DUCKDB_LOG_FILE_HANDLE(HANDLE, "CLOSE");

Then in the code, for example in LocalFileSystem::Read(), we can efficiently call the logger with the centrally defined log type and level for FileHandle logging:

DUCKDB_LOG_FILE_SYSTEM_READ(handle, bytes_to_read, location);

This ensures that extensions that implement their own filesystems can easily adhere to the logging convention.
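As a rough sketch of how such convenience macros could be layered, and how a filesystem from an extension might call them, consider the following. The TOY_* macros, ToyFileMessage, ToyFileHandle and ToyRemoteRead are hypothetical names made up for this example, and the message fields are an assumption based on the demo output; the actual macros in the PR are the DUCKDB_LOG_FILE_SYSTEM_* ones shown above.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct ToyLogger {
	std::vector<std::string> messages;
};

struct ToyFileHandle {
	std::string fs_name;
	std::string path;
	std::shared_ptr<ToyLogger> logger; // null when FileSystem logging is off
};

// Builds a JSON-like message for operations without extra fields (OPEN, CLOSE).
inline std::string ToyFileMessage(const ToyFileHandle &h, const char *op) {
	return "{\"fs\":\"" + h.fs_name + "\",\"path\":\"" + h.path + "\",\"op\":\"" + op + "\"}";
}

// Overload for reads and writes, which also report a byte count and a position.
inline std::string ToyFileMessage(const ToyFileHandle &h, const char *op, int64_t bytes, uint64_t pos) {
	std::string msg = ToyFileMessage(h, op);
	msg.pop_back(); // drop the closing brace, then append the extra fields
	return msg + ",\"bytes\":\"" + std::to_string(bytes) + "\",\"pos\":\"" + std::to_string(pos) + "\"}";
}

// do/while(0) keeps the macro safe inside an unbraced if/else.
#define TOY_LOG_FILE_HANDLE(HANDLE, ...)                                                                               \
	do {                                                                                                               \
		if ((HANDLE).logger) {                                                                                         \
			(HANDLE).logger->messages.push_back(ToyFileMessage(HANDLE, __VA_ARGS__));                                  \
		}                                                                                                              \
	} while (0)
#define TOY_LOG_FILE_SYSTEM_READ(HANDLE, ...) TOY_LOG_FILE_HANDLE(HANDLE, "READ", __VA_ARGS__)
#define TOY_LOG_FILE_SYSTEM_OPEN(HANDLE)      TOY_LOG_FILE_HANDLE(HANDLE, "OPEN")

// A filesystem implemented in an extension would call the same macro in its own Read().
inline void ToyRemoteRead(ToyFileHandle &handle, int64_t nr_bytes, uint64_t location) {
	TOY_LOG_FILE_SYSTEM_READ(handle, nr_bytes, location);
	// ... the actual read would happen here ...
}
```

The message is only constructed when the handle actually has a logger attached, so call sites stay a single line while the formatting stays in one central place.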

HTTP Logging

The old HTTP logging mechanism printed HTTP request information straight to stdout or to a file. In this PR we add a new structured logging type, HTTP:

D pragma enable_logging('HTTP');
D FROM 'https://github.com/duckdb/duckdb/raw/main/data/csv/who.csv.gz';
D select request.type, request.url, request.headers from duckdb_logs_parsed('HTTP');
┌─────────┬───────────────────────────────────────────────────────────────┬──────────────────────────┐
│  type   │                              url                              │         headers          │
│ varchar │                            varchar                            │  map(varchar, varchar)   │
├─────────┼───────────────────────────────────────────────────────────────┼──────────────────────────┤
│ HEAD    │ https://github.com/duckdb/duckdb/raw/main/data/csv/who.csv.gz │ {}                       │
│ GET     │ https://github.com/duckdb/duckdb/raw/main/data/csv/who.csv.gz │ {Range='bytes=0-135283'} │
└─────────┴───────────────────────────────────────────────────────────────┴──────────────────────────┘

TODOs

Some general follow-ups:

Add logging to azure
Add logging to fsspec
Add logging to External File Cache #16463
More benchmarking

A more complex follow-up is to think about how to make this performant. Currently, enabling PRAGMA enable_logging('FileSystem'); could become quite expensive because it needs to string compare the logging type for every single TRACE level call. For now we actually get away with it because we only add the logger to the filehandle if, at the time of filehandle creation, we are logging for FileSystem. However, as soon as we add more TRACE level calls that are hit a lot during execution, things will likely slow down fast.
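The trade-off can be made concrete with a small sketch: the string comparison happens once, when the handle is created, and every later call only checks a pointer. ToyLogger, ToyFileHandle and ToyOpenFile are hypothetical names for this illustration, and the counters exist only to make the cost visible; this is not how the PR's code is structured.

```cpp
#include <memory>
#include <string>
#include <unordered_set>

// Hypothetical stand-ins for illustration only.
struct ToyLogger {
	std::unordered_set<std::string> enabled_types;
	int string_lookups = 0; // counts how often the expensive string-based check runs
	int entries = 0;        // counts how many log entries were written

	bool ShouldLog(const std::string &type) {
		string_lookups++;
		return enabled_types.count(type) > 0;
	}
};

struct ToyFileHandle {
	std::shared_ptr<ToyLogger> logger; // stays null when FileSystem logging is off
	void Read() {
		if (logger) { // per-call cost is a pointer check, not a string comparison
			logger->entries++;
		}
	}
};

// The log type is checked once, at handle creation time.
inline ToyFileHandle ToyOpenFile(const std::shared_ptr<ToyLogger> &context_logger) {
	ToyFileHandle handle;
	if (context_logger && context_logger->ShouldLog("FileSystem")) {
		handle.logger = context_logger;
	}
	return handle;
}
```

The downside of this approach is that a handle created before logging was enabled stays silent for its whole lifetime, and that TRACE level calls without such a handle-like object to cache the decision would still pay for the comparison on every call.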

Mytherin · 2025-05-15T19:59:18Z

Thanks!

FileHandle Logging (duckdb/duckdb#16758)

samansmink added 13 commits March 21, 2025 14:21

switch http logger to new logging infra

83afae4

remove aws require from test

2b3a738

cleanup comments

bc66b5a

add option to test

cd03672

format

33f0123

remove old test

a1fc4ea

remove http logger from client data

4d9c1c0

fix potential lifetime issues

1ce807d

patch httpfs

5607dce

first version of filehandle logging

14d06f8

switch filehandle logging to use shared ptrs

777e952

remove pos col from test, it's not platform independent

3e2d59f

fix ranges for unix handle, also allow logging to global logger for a…

bcdaaf8

…ttached dbs

samansmink force-pushed the logging-cleanup branch from a22cfac to bcdaaf8 Compare March 21, 2025 15:03

samansmink added 5 commits March 21, 2025 16:42

remove json dependency for filehandle logging tests

33b5458

add logging patch for httpfs, add to test

55fabfb

fix ci: format, noforcestorage

a1f2e00

fix incorrect macro invoke on windows

4568454

format

b1e76cc

samansmink marked this pull request as ready for review March 24, 2025 13:42

add missing include

2024087

MacOS reviewed Mar 26, 2025

samansmink added 2 commits March 27, 2025 12:27

wip: further logging refactor

2873672

format

9604b1e

duckdb-draftbot marked this pull request as draft March 27, 2025 13:12

samansmink added 5 commits March 27, 2025 14:15

fix includes

2fec053

add missing file

ead328b

clang tidy

cb2f0d9

fix test after changed log format

bac7802

disable delta for now

b1d02d1

samansmink marked this pull request as ready for review May 9, 2025 10:41

skip with alternative verify

beddf8b

duckdb-draftbot marked this pull request as draft May 9, 2025 15:26

samansmink marked this pull request as ready for review May 9, 2025 15:29

fix httpfs patch, file location was not counted properly

e86f125

duckdb-draftbot marked this pull request as draft May 12, 2025 12:26

samansmink marked this pull request as ready for review May 12, 2025 12:43

make test single-threaded

1fce07c

duckdb-draftbot marked this pull request as draft May 13, 2025 08:12

samansmink added 2 commits May 14, 2025 16:27

Merge branch 'main' into logging-cleanup

59943f5

fix merge issue

094c865

samansmink marked this pull request as ready for review May 14, 2025 19:33

samansmink added 2 commits May 15, 2025 09:21

Merge branch 'main' into logging-cleanup

2b92988

fix merge issues, add structured http logging

9b183a9

duckdb-draftbot marked this pull request as draft May 15, 2025 09:46

add test for enable_http_logging setting

5248db3

samansmink marked this pull request as ready for review May 15, 2025 10:09

Mytherin mentioned this pull request May 15, 2025

Bump httpfs, remove patch #17496

Closed

fix test from httpfs

5b83131

duckdb-draftbot marked this pull request as draft May 15, 2025 14:14

samansmink marked this pull request as ready for review May 15, 2025 14:19

fix log test failure

1ae97d8

duckdb-draftbot marked this pull request as draft May 15, 2025 15:38

samansmink marked this pull request as ready for review May 15, 2025 15:44

Mytherin merged commit 44d0856 into duckdb:main May 15, 2025
50 checks passed

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025

vendor: Update vendored sources to duckdb/duckdb@44d0856

6122e3f

FileHandle Logging (duckdb/duckdb#16758)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025

vendor: Update vendored sources to duckdb/duckdb@44d0856

a2a16a6

FileHandle Logging (duckdb/duckdb#16758)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025

vendor: Update vendored sources to duckdb/duckdb@44d0856

77114d2

FileHandle Logging (duckdb/duckdb#16758)

Tishj mentioned this pull request May 26, 2025

Local Build Logging not working duckdb/duckdb-iceberg#261

Closed
