seg fault under high load whilst tailing a log · Issue #9864 · fluent/fluent-bit · GitHub

seg fault under high load whilst tailing a log #9864

Open
sorran opened this issue Jan 24, 2025 · 5 comments

sorran commented Jan 24, 2025

Bug Report

Describe the bug

When running under high load we encounter seg faults:

#0  0xffff98bcd810      in  ???() at ???:0
#1  0xffff98bcffa3      in  ???() at ???:0
#2  0xffff98bd0d4f      in  ???() at ???:0
#3  0x5f884f            in  msgpack_sbuffer_write() at lib/msgpack-c/include/msgpack/sbuffer.h:81
#4  0x5b8367            in  msgpack_pack_map() at lib/msgpack-c/include/msgpack/pack_template.h:753
#5  0x5bc78b            in  flb_mp_map_header_init() at src/flb_mp.c:326
#6  0x5f8a83            in  flb_log_event_encoder_dynamic_field_scope_enter() at src/flb_log_event_encoder_dynamic_field.c:70
#7  0x5f8b7b            in  flb_log_event_encoder_dynamic_field_begin_map() at src/flb_log_event_encoder_dynamic_field.c:117
#8  0x5ee66f            in  flb_log_event_encoder_begin_record() at src/flb_log_event_encoder.c:250
#9  0xafdcc3            in  apply_modifying_rules() at plugins/filter_modify/modify.c:1414
#10 0xafe127            in  cb_modify_filter() at plugins/filter_modify/modify.c:1526
#11 0x4dcc53            in  flb_filter_do() at src/flb_filter.c:161
#12 0x4d25e7            in  input_chunk_append_raw() at src/flb_input_chunk.c:1608
#13 0x4d2e4f            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1929
#14 0x5d2923            in  input_log_append() at src/flb_input_log.c:71
#15 0x5d29ab            in  flb_input_log_append() at src/flb_input_log.c:90
#16 0x744ccb            in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:412
#17 0x745ceb            in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#18 0x574883            in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1516
#19 0x571e13            in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#20 0x5fd42f            in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#21 0x746cb7            in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1256
#22 0x74898b            in  check_purge_deleted_file() at plugins/in_tail/tail_file.c:1936
#23 0x748d07            in  flb_tail_file_purge() at plugins/in_tail/tail_file.c:1992
#24 0x4cbb13            in  flb_input_collector_fd() at src/flb_input.c:1982
#25 0x50f3af            in  flb_engine_handle_event() at src/flb_engine.c:577
#26 0x50f3af            in  flb_engine_start() at src/flb_engine.c:960
#27 0x4ad693            in  flb_lib_worker() at src/flb_lib.c:835
#28 0xffff98bc0933      in  ???() at ???:0
#29 0xffff98b64e5b      in  ???() at ???:0
#30 0xffffffffffffffff  in  ???() at ???:0
#0  0xffffb400e810      in  ???() at ???:0
#1  0xffffb4010fa3      in  ???() at ???:0
#2  0xffffb4011d4f      in  ???() at ???:0
#3  0x1219a63           in  msgpack_unpacker_init() at lib/msgpack-c/src/unpack.c:372
#4  0xafd9e3            in  apply_modifying_rules() at plugins/filter_modify/modify.c:1372
#5  0xafe127            in  cb_modify_filter() at plugins/filter_modify/modify.c:1526
#6  0x4dcc53            in  flb_filter_do() at src/flb_filter.c:161
#7  0x4d25e7            in  input_chunk_append_raw() at src/flb_input_chunk.c:1608
#8  0x4d2e4f            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1929
#9  0x5d2923            in  input_log_append() at src/flb_input_log.c:71
#10 0x5d29ab            in  flb_input_log_append() at src/flb_input_log.c:90
#11 0x744ccb            in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:412
#12 0x745ceb            in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#13 0x574883            in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1516
#14 0x571e13            in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#15 0x5fd42f            in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#16 0x746cb7            in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1256
#17 0x74898b            in  check_purge_deleted_file() at plugins/in_tail/tail_file.c:1936
#18 0x748d07            in  flb_tail_file_purge() at plugins/in_tail/tail_file.c:1992
#19 0x4cbb13            in  flb_input_collector_fd() at src/flb_input.c:1982
#20 0x50f3af            in  flb_engine_handle_event() at src/flb_engine.c:577
#21 0x50f3af            in  flb_engine_start() at src/flb_engine.c:960
#22 0x4ad693            in  flb_lib_worker() at src/flb_lib.c:835
#23 0xffffb4001933      in  ???() at ???:0
#24 0xffffb3fa5e5b      in  ???() at ???:0
#25 0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)

Indicates some memory corruption around:

buffer = (char*)malloc(initial_buffer_size);

tmp = realloc(sbuf->data, nsize);

Valgrind logs:

valgrindx-1.log
valgrindx.log

Any advice on how to troubleshoot further would be well received.

To Reproduce

  • Rubular link if applicable:
  • Example log message if applicable:
2025-01-24 07:12:17,692 [B] [107465] [homecontainer]  WARN [Thread-92(sf-worker)] (Debugger.java:120) - Can't find respawn variables (LogicGameObjectManager:2313:2062 LogicLevel:5550 LogicGameMode:1085:1029:296)
2025-01-24 07:12:17,692 [C] [267970] [homecontainer]  WARN [Thread-104(sf-worker)] (Debugger.java:120) - Can't find respawn variables (LogicGameObjectManager:2313:2062 LogicLevel:5550 LogicGameMode:1085:1029:296)
2025-01-24 07:12:17,692 [C] [488726] [homecontainer]  WARN [Thread-81(sf-worker)] (Debugger.java:120) - Can't find respawn variables (LogicGameObjectManager:2313:2062 LogicLevel:5550 LogicGameMode:1085:1029:296)
2025-01-24 07:12:17,692 [C] [256683] [battlecontainer]  WARN [Thread-103(sf-worker)] (Debugger.java:120) - Can't find respawn variables (LogicGameObjectManager:2313:2062 LogicLevel:5550 LogicGameMode:1085:1029:296)
  • Steps to reproduce the problem:

Seems to occur under high stress. We can hit it within 1-2 minutes on a c7g.large EC2 instance tailing a log that is producing 50k lines/s. At 25k lines/s performance appears stable. The log is a Java log that rotates once it hits 500 MB; rotated files are deleted once there are more than 3 of them. At 50k lines/s we'd expect the log producer to be running at a higher rate than fluent-bit is able to consume.
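To make the setup reproducible, here is a minimal sketch of a producer like the one described above (the file names, rates, and sizes are illustrative placeholders, not our actual producer):

```python
import os
import time


def spit_logs(log_dir, base_name="app.log", line_rate=50000,
              rotate_bytes=500 * 1024 * 1024, keep=3, duration_s=120):
    """Write roughly line_rate lines/s to log_dir/base_name, rotating the
    file at rotate_bytes and deleting the oldest rotated file beyond keep."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, base_name)
    line = ("2025-01-24 07:12:17,692 [B] [107465] [homecontainer]  WARN "
            "[Thread-92(sf-worker)] (Debugger.java:120) - "
            "Can't find respawn variables\n")
    rotated, n = [], 0
    f = open(path, "w")
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        for _ in range(line_rate):
            f.write(line)
        f.flush()
        if f.tell() >= rotate_bytes:
            # Rotate: rename current file, drop the oldest beyond `keep`.
            f.close()
            rotated.append(f"{path}.{n}")
            n += 1
            os.rename(path, rotated[-1])
            if len(rotated) > keep:
                os.remove(rotated.pop(0))
            f = open(path, "w")
        # Pace the loop to roughly line_rate lines per second.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))
    f.close()
```

Running this against a tailed directory with the rates above should approximate the rotate-and-delete-under-load pattern we see the crash under.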

Expected behavior

Performance should degrade gracefully without a seg fault crash.

Screenshots

Your Environment

  • Version used: v3.2.4
  • Configuration:

config.zip

  • Environment name and version (e.g. Kubernetes? What version?): EC2
  • Server type and version: AWS Linux c7g.large
  • Operating System and version: Amazon Linux
  • Filters and plugins: tail input, java_multiline, java_capture, modify, http output

Additional context

Stress testing fluent-bit, attempting to understand its performance limitations. Possibly we need to throttle fluent-bit, but it is unclear whether that would actually resolve the seg fault.
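If throttling turns out to be the route to try, Fluent Bit ships a throttle filter; a minimal sketch (the Rate/Window/Interval values below are illustrative, not tuned for this workload):

```
[FILTER]
    Name     throttle
    Match    *
    Rate     25000
    Window   5
    Interval 1s
```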

patrick-stephens commented Jan 24, 2025

@sorran could you paste in the flat config just to save having to download, extract and open potentially malicious files?

It looks like it is tail input with loki and http output, but it would be good to get the full config as flat text.

I wrote this a while back when I had monstrous includes to help: https://github.com/couchbase/couchbase-fluent-bit/blob/main/tools/flatten-config.sh

sorran commented Apr 16, 2025

Still a problem for us. We are testing different outputs; normally we'll test against one of opensearch, victorialogs, or loki, and each has the same problem. It feels like the issue is in the tail input: the input file is being written and rolled over at a very high rate.

inputs.d/spitter.conf:

[INPUT]
    Name                tail
    Path                /var/log/spitter/*.log
    Tag                 java
    Refresh_Interval    1
    Multiline.Parser    java_multiline
    Mem_Buf_Limit       128MB

fluent-bit.conf:

[SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File /etc/fluent-bit/parsers.conf
    HTTP_Server    On
    HTTP_Listen    0.0.0.0
    HTTP_Port      2020
    # Configure retries in case of back pressure (retries 4 times: after 1, 2, 4 and 8 minutes)
    scheduler.base   30
    scheduler.cap    300

@INCLUDE inputs.d/*
@INCLUDE outputs.d/*

# Set static tags on all records
[FILTER]
    Name modify
    Match *
    Add hostname ${HOSTNAME}
    Add environment ${S_ENVIRONMENT}
    Add service ${S_SERVICE}

    
#Parse Java logs
[FILTER]
    Name            parser
    Match           *
    Key_Name        log
    Parser          java_capture
    Reserve_Data    On

parsers.conf:

[MULTILINE_PARSER]
    Name          java_multiline
    Type          regex
    Flush_Timeout 5000
    rule      "start_state"   "/(\d*-\d*-\d* \d*:\d*:\d*,\d*)(.*)/"  "cont"
    rule      "cont"          "/^(?!(\d*-\d*-\d* \d*:\d*:\d*,\d*))(.*)/"  "cont"    
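For what it's worth, the two rules above amount to: a record starts at a timestamped line, and any line not starting with a timestamp continues the current record. A rough Python re-implementation of that grouping (illustrative only, not Fluent Bit's actual multiline engine):

```python
import re

# Same timestamp shape as the start_state rule above.
START = re.compile(r"\d*-\d*-\d* \d*:\d*:\d*,\d*")


def group_multiline(lines):
    """Group raw lines into records: a new record begins at a line matching
    the timestamp pattern; anything else continues the current record."""
    records, current = [], []
    for line in lines:
        if START.match(line) and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records
```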


[PARSER]
    Name        java_capture
    Format      regex
    Regex       /^(?<timestamp>(\d*-\d*-\d* \d*:\d*:\d*,\d*))\s+\[(?<product>.*)\]\s+\[(?<playerId>.*)\]\s+\[(?<role>.*)\]\s+(?<logLevel>(INFO|ERROR|WARN|DEBUG))\s+\[(?<thread>.*)\]\s+\((?<class>.*):(?<line>\d*)\)\s+-\s+(?<message>[^\n]*)\n(?<stackTrace>.*)/m
    Time_Key    timestamp
    Time_Format %Y-%m-%d %H:%M:%S,%L
    # Time_Keep just keeps the original "timestamp" field around, which is redundant
    Time_Keep   On
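A quick way to sanity-check java_capture outside Fluent Bit is to port it to Python (`(?<name>)` becomes `(?P<name>)`, and Onig's /m corresponds to `re.DOTALL`); the stack-trace line below is made up for illustration:

```python
import re

# Python port of the java_capture regex above.
JAVA_CAPTURE = re.compile(
    r"^(?P<timestamp>\d*-\d*-\d* \d*:\d*:\d*,\d*)\s+"
    r"\[(?P<product>.*)\]\s+\[(?P<playerId>.*)\]\s+\[(?P<role>.*)\]\s+"
    r"(?P<logLevel>INFO|ERROR|WARN|DEBUG)\s+\[(?P<thread>.*)\]\s+"
    r"\((?P<class>.*):(?P<line>\d*)\)\s+-\s+"
    r"(?P<message>[^\n]*)\n(?P<stackTrace>.*)",
    re.DOTALL,
)

# One of the sample lines above, plus a fabricated stack-trace line so the
# mandatory "\n(?P<stackTrace>.*)" tail has something to match.
sample = (
    "2025-01-24 07:12:17,692 [B] [107465] [homecontainer]  WARN "
    "[Thread-92(sf-worker)] (Debugger.java:120) - "
    "Can't find respawn variables\n"
    "java.lang.Exception: boom"
)

m = JAVA_CAPTURE.match(sample)
if m:
    print(m.group("logLevel"), m.group("class"), m.group("line"))
```

Note the regex only matches records that contain a newline after the message, i.e. multiline-assembled records; a single-line record with no stack trace would not match as written.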

We have reproduced it across various outputs.

One example output is below, but every output we have tried hits the same issue (the problem is probably in the tail input).

outputs.d/opensearch.conf:

[OUTPUT]
    Name  http
    Match *
    Host  log-aggregation-test-....us-east-1.osis.amazonaws.com
    Port  443
    URI /logs
    Format json
    aws_auth true
    aws_region us-east-1
    aws_service osis
    Log_Level warn
    tls On

@patrick-stephens

Is this with the latest version (4.x)?

@ncorreia

Hi @patrick-stephens, this was using fluent-bit 3.2.10-1.

@leonardo-albertovich

As far as I can see release 3.2.10 does not include the required patch I made in PR 10251.

Would you by chance be able to build your own fluent-bit 3.2 from source?
