could not enqueue records into the ring buffer #9906

Open

pawel-lmcb opened this issue Feb 2, 2025 · 5 comments

Comments

@pawel-lmcb commented Feb 2, 2025

Bug Report

Describe the bug

We've got a fluent-bit aggregator VM running 3.2.5.

The node is behaving strangely. Throughput used to be about 80MB/s, 40 in and 40 out; now it's doing less than 10MB/s in and 0 out. The only time there is any output is right after I restart the process, even though resources are readily available.

It was working fine, but all of a sudden the traffic came crashing down and output went to 0, almost as if it hit a race condition.

After a process restart, network traffic spikes up to 70-80MB/s and then drops back down to about 10MB/s.

We're seeing the following errors pop up once every second:

[2025/02/02 03:13:53] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:13:54] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:13:55] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:13:56] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:13:57] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:13:58] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:13:59] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:14:00] [error] [input:forward:forward.0] could not enqueue records into the ring buffer
[2025/02/02 03:14:01] [error] [input:forward:forward.0] could not enqueue records into the ring buffer

To Reproduce

  • Steps to reproduce the problem:

Install fluent-bit 3.2.5 with the following config:

[root@localhost fluent-bit]# cat /etc/fluent-bit/fluent-bit.conf 
[SERVICE]
    Flush                   1
    Log_Level               info
    Log_File                /var/log/fluent-bit/fluentbit-kafka.log
    # https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
    # curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
    HTTP_Server             on
    HTTP_Listen             0.0.0.0
    HTTP_Port               2020
    storage.path            /var/log/fluent-bit/
    storage.sync            full
    storage.checksum        off
    Storage.metrics         on
    scheduler.base          1
    scheduler.cap           20

[INPUT]
    Name                    forward
    Listen                  0.0.0.0
    Port                    24224
    Buffer_Chunk_Size       64MB
    Buffer_Max_Size         256MB
    Threaded                true
    storage.type            filesystem

[OUTPUT]
    Name                    kafka
    Alias                   kafka-app.analytics_vmwaredatacenter.cloudadmin
    Match                   app.analytics_vmwaredatacenter.cloudadmin
    Brokers                 192.168.100.77:9092,192.168.100.87:9092,192.168.100.72:9092
    Topics                  analytics_development
    Retry_Limit             5
    rdkafka.compression.type gzip

[OUTPUT]
    Name                    kafka
    Alias                   kafka-app.aws_billing_vmwaredatacenter.cloudadmin
    Match                   app.aws_billing_vmwaredatacenter.cloudadmin
    Brokers                 192.168.100.77:9092,192.168.100.87:9092,192.168.100.72:9092
    Topics                  aws_billing_development
    Retry_Limit             False
    Workers                 8

We have 5 VMs acting as forwarders, each sending a 2.2M-line CSV file, which the aggregator ingests, writes out to disk, and sends to Redpanda (Kafka).

There is NO compression on the forwarders and NO compression going to Redpanda (Kafka).
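
For reference, a minimal forwarder-side sketch matching this setup (the file path, tag, and aggregator address below are placeholders, not the exact values used) would be:

[SERVICE]
    Flush                   1
    Log_Level               info

[INPUT]
    # tail the exported CSV file (placeholder path)
    Name                    tail
    Path                    /data/export.csv
    Tag                     app.analytics_vmwaredatacenter.cloudadmin

[OUTPUT]
    # forward everything to the aggregator's forward input on port 24224
    Name                    forward
    Match                   *
    Host                    192.168.100.10
    Port                    24224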

Update

So I realized that when fluent-bit's network performance decreases, all cores also seem to stop working and only one core stays busy, primarily on the forward input.

[Screenshots: per-core CPU utilization while in this state]

This can best be seen in the screenshots above: something puts the process into this odd single-core state, even though, as the config shows, the input is threaded and the outputs have multiple workers.
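
Since HTTP_Server and storage.metrics are already enabled in the config above, one way to confirm whether the forward input is dropping records while in this state is to poll the built-in monitoring endpoints (assuming the default 127.0.0.1:2020 address from the config):

# overall record/byte counters per input and output plugin
curl -s http://127.0.0.1:2020/api/v1/metrics
# filesystem chunk state per input (exposed when storage.metrics is on)
curl -s http://127.0.0.1:2020/api/v1/storage

If the input counters keep climbing while the output counters stay flat, records are being accepted but never flushed; if neither moves, the input itself has stalled.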

Expected behavior

Having run a dozen tests, we expect throughput of about 80MB/s (40 in and 40 out), with the disk sustaining 40MB/s writes.

Your Environment

VMware ESXi 7.0.3, build 21424296
Hardware is a Dell PowerEdge R720XD: 512GB of RAM, Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz, NVMe drives, and 10G networking.
RHEL 9.5 (Plow)

The VM for the fluent-bit aggregator has 16 vCPUs, 16GB of memory, and 300GB on NVMe with 78% free disk space.

@edsiper (Member) commented Feb 4, 2025

please attach your full Fluent Bit log file

@pawel-lmcb (Author) commented

@edsiper do you want me to enable a higher debug level than info and re-run this?
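
For reference, that change would just mean bumping Log_Level in the [SERVICE] section of the config above, e.g. (a sketch; the rest of the section stays as posted):

[SERVICE]
    Flush                   1
    # raise verbosity from info to debug
    Log_Level               debug
    Log_File                /var/log/fluent-bit/fluentbit-kafka.log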

@vpshibin commented Feb 26, 2025

I'm seeing the same issue in fluent-bit 3.2.1.

It happens with multiple inputs; see the log entries below for the TCP and prometheus_remote_write inputs. If I change the inputs to threaded: false, the error goes away. However, that puts all inputs in the main thread, which is not good for performance.
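
As a sketch in classic config syntax, the workaround just flips the threaded flag on the affected inputs (the port below is the plugin default, not necessarily the exact setup):

[INPUT]
    Name        tcp
    Listen      0.0.0.0
    Port        5170
    # run in the main event loop instead of a dedicated input thread (workaround)
    Threaded    false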

I have also seen the issue below, which mentions the same error; the root cause was supposed to have been fixed in an earlier version.
#7071

[2025/02/27 08:11:13] [error] [input:tcp:tcp.1] could not enqueue records into the ring buffer
[2025/02/27 08:12:52] [error] [input:tcp:tcp.1] could not enqueue records into the ring buffer
[2025/02/27 08:12:53] [error] [input:tcp:tcp.1] could not enqueue records into the ring buffer


[2025/02/27 08:55:23] [error] [input:prometheus_remote_write:prometheus_remote_write.2] could not enqueue records into the ring buffer
[2025/02/27 08:55:24] [error] [input:prometheus_remote_write:prometheus_remote_write.2] could not enqueue records into the ring buffer
[2025/02/27 08:55:25] [error] [input:prometheus_remote_write:prometheus_remote_write.2] could not enqueue records into the ring buffer
[2025/02/27 08:55:26] [error] [input:prometheus_remote_write:prometheus_remote_write.2] could not enqueue records into the ring buffer

@naegelin

Same issue in v3.1.7 here

@cdancy commented May 1, 2025

Issue is still happening on 4.0.1. Turning off threading for our tail plugins gets things working again. I put fluent-bit into debug mode, but no extra logs connected to this were produced, other than lots of:

[2025/05/01 19:34:08] [debug] [input:tail:tail.0] failed buffer write, retries=0
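
For anyone else hitting this with tail, the workaround amounts to something like the sketch below (the path is a placeholder):

[INPUT]
    Name        tail
    Path        /var/log/app/*.log
    # disable the dedicated input thread as a workaround for the ring buffer errors
    Threaded    off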
