Uptick in dropped events from disk buffer InvalidProtobufPayload errors #18130
Comments
Hi @sbalmos, that could be related to protobuf request size limits.
For this instance, which is mainly a message router to different destinations, it's not that interesting. I've updated the Configuration section of the original post.
Found the original issue referencing the 4 MB limit.
Can you share a sample input that I can use in my local Kafka producer in order to trigger this error?
I don't have one, since I can't backtrack which input causes the error and thus ends up being dropped by the buffer/sink. My guess is that you could trigger it by creating a massive input event of some sort, something that will definitely end up over 4 MB in size whether encoded in protobuf or JSON.
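For example, a crude way to manufacture such an oversized record against a local Kafka broker might look like the sketch below. The topic name, broker address, and sizes are placeholders rather than values from this report, and it assumes the broker/topic has been configured to accept messages this large.

```sh
# Build a single ~5 MiB JSON line and publish it to a local Kafka topic.
# Topic name, broker address, and sizes are placeholder values.
printf '{"message":"%s"}\n' "$(head -c 5242880 /dev/zero | tr '\0' 'a')" > big_event.json

# Raise the producer's max request size so the client will send it; the broker/topic
# must also allow messages this large (message.max.bytes / max.message.bytes).
kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic vector-test \
  --producer-property max.request.size=10485760 \
  < big_event.json
```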
I think that 4 MB limit was only added for the tonic gRPC server when decoding incoming requests, not for protobuf decoding generally 🤔 @sbalmos, it sounds like you are seeing more of these errors in 0.31.0 vs 0.29.0? If so, I'm wondering if we could try to bisect down to identify a specific commit that causes the issue.
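If it comes to that, a bisect between the two release tags could look roughly like this; `reproduce.sh` is a hypothetical placeholder for whatever script builds vector and exits non-zero when the decode errors appear.

```sh
# Rough git bisect sketch between the two releases. "reproduce.sh" is a
# hypothetical script that builds/runs vector and returns non-zero when the
# InvalidProtobufPayload errors show up.
git clone https://github.com/vectordotdev/vector.git
cd vector
git bisect start v0.31.0 v0.29.0   # bad release first, then the known-good one
git bisect run ./reproduce.sh
git bisect reset
```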
Looking at this again, it sounds like it could be the case that events written to a buffer by v0.29.0 couldn't be read by v0.31.0. It'd be worth trying that as a stand-alone test case.
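A stand-alone check along those lines might look like the sketch below; the config values, data_dir, and the versioned binary names are assumptions for illustration, not taken from this issue. The idea is to fill a disk buffer with v0.29.0 against an unreachable sink, then start v0.31.0 on the same data_dir and watch for buffer read errors.

```sh
# Sketch of a cross-version disk buffer test. Assumes two vector binaries are
# available as ./vector-0.29.0 and ./vector-0.31.0; all config values are examples.
cat > buffer-test.toml <<'EOF'
data_dir = "/tmp/vector-buffer-test"

[sources.demo]
type = "demo_logs"
format = "json"

[sinks.out]
type = "http"
inputs = ["demo"]
uri = "http://127.0.0.1:1"        # unreachable on purpose, so events accumulate in the buffer
encoding.codec = "json"

[sinks.out.buffer]
type = "disk"
max_size = 268435488              # minimum allowed disk buffer size (~256 MiB)
EOF

mkdir -p /tmp/vector-buffer-test
./vector-0.29.0 --config buffer-test.toml &   # write some records into the disk buffer
OLD_PID=$!
sleep 30
kill "$OLD_PID" && wait "$OLD_PID"

# If the buffer format is the problem, this should log
# "Error encountered during buffer read ... InvalidProtobufPayload".
./vector-0.31.0 --config buffer-test.toml
```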
Gah, haven't really had time yet to get back to this, totally my fault. We've since gone all-0.32.1 and the issue's still present, so it's not a disk buffer format backwards incompatibility. I've got it on my todo list to trace back the usages of InvalidProtobufPayload in the codebase.
If you're still seeing the problem on fresh installs of 0.32, with fresh disk buffers, then I would agree that it's not related to compatibility issues of the disk buffer files between versions. Technically speaking, the error is most commonly expected due to something like a compatibility issue (a mismatch between the data and the Protocol Buffers definition, etc.), but it can also be triggered purely from the perspective of "is this even valid Protocol Buffers data at all?" It sounds like you're experiencing the latter. If there's any more information you can share, it would be appreciated. Things like the number of events it reports as dropping in a single go (which would be a proxy for the record size in the buffer itself), the size of input events if you're able to roughly calculate that, etc.
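For anyone trying to collect those numbers, something along these lines may help; the journald unit name and the sample input path are placeholders.

```sh
# Count how many buffer-read decode errors vector has logged recently.
# "vector" as the unit name and the sample input path are placeholders.
journalctl -u vector --since "1 hour ago" \
  | grep -c "Error encountered during buffer read"

# Rough view of input event sizes: the longest lines (in bytes) from a sample of the raw input.
awk '{ print length($0) }' /path/to/sample-input.log | sort -rn | head
```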
I just hit this error on multiple of my vector instances after a restart, which then fail to start:
Is there a way to at least work around this? E.g. to determine which buffer is broken and delete it? I tried a few things; the best I got is this:
Boot being stuck due to buffer corruption is the worst-case scenario. In my environment it is better to wipe out a broken buffer than to keep failing.
Another workaround is to add this into the entrypoint before starting the vector process:

```sh
now=$(date +%s)
last_startup=$(cat /var/lib/vector/startup || echo 0)
last_startup_age=$(( ($now - $last_startup) / 60 ))
if find /var/lib/vector/buffer -type f -name buffer.db -mmin +$last_startup_age | grep buffer.db >/dev/null; then
  log_error "Cleaning vector buffer as it was not updated since last startup ($last_startup_age minutes) to fix startup in case of buffer corruption"
  rm -rf /var/lib/vector/buffer
fi
echo "$now" > /var/lib/vector/startup
```

It will simply delete the buffer directory if buffer.db has not been modified since the last startup.
Also having the same issue here; not sure if there are any updates or solutions? Our setup is pretty simple: vector deployed as a sidecar with a pod that writes to a PVC, and vector parses and sends this to S3. We have pretty large log lines, some being many MBs in size. I constantly get the following errors. Vector version: 0.46.1
Which then leads to
Lastly, after a while I end up with my pods not being able to start, the same as @fpytloun, with
My vector config looks like
I don't know of any way to tell which event is causing this, as we have many hundreds per second, and the only fix when it can't start anymore is to delete the buffer file. It has crashed a few times, and each time it was just the one buffer file I deleted.
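If it helps, rather than wiping the whole buffer directory it should be possible to remove only the affected sink's buffer while vector is stopped. A sketch, assuming disk buffers live under `<data_dir>/buffer/v2/<sink_id>/` (the data_dir and sink id below are placeholders):

```sh
# Remove only the corrupted sink buffer while vector is stopped.
# DATA_DIR and SINK_ID are placeholders; the buffer/v2/<sink_id> layout is an
# assumption about where disk buffers are stored, so verify with `ls` first.
DATA_DIR=/vector-data-dir
SINK_ID=my_s3_sink

ls "$DATA_DIR/buffer/v2/"                  # one directory per disk-buffered sink
rm -rf "$DATA_DIR/buffer/v2/$SINK_ID"      # drops the broken buffer; its queued events are lost
```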
Problem
Since upgrading to 0.31 (from 0.29 in the case of this instance), there has been a marked uptick in dropped events to my splunk_hec_logs sink, which is backed by a disk buffer. The error indicates the events are dropped due to an InvalidProtobufPayload error reading from disk.
2023-08-01T13:18:46.116488Z ERROR sink{component_kind="sink" component_id=staging_splunk_hec component_type=splunk_hec_logs component_name=staging_splunk_hec}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
I can't find it at the moment, but I seem to remember another issue or discussion where the underlying protobuf library was now implementing a 4 MB size limit and potentially truncating messages larger than that. Maybe that is also related?
Configuration
Version
vector 0.31.0 (x86_64-unknown-linux-gnu 0f13b22 2023-07-06 13:52:34.591204470)
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response