mongo: handle large mongo document by jgao54 · Pull Request #3108 · PeerDB-io/peerdb · GitHub

mongo: handle large mongo document #3108

Open · jgao54 wants to merge 2 commits into main from large-event-handling

Conversation

@jgao54 jgao54 (Contributor) commented Jun 27, 2025

This PR handles two issues with respect to large events (separated by commits; I recommend reviewing them separately):

  1. Mongo limits document size to 16MB; trying to insert even one byte more results in an error:

[object to insert too large. size in bytes: xxx, max size: xxx]

With this information I was able to generate a test document that is exactly the maximum size we can write to MongoDB -- and we should support up to this size in our CDC connector without errors.

Currently we seem to be imposing Snowflake's 15MB limit in the shared MarshalJSONWithOptions function, so the e2e test added in this PR was failing because the JSON got set to "{}". With this PR, Mongo is made an exception (although I think the current implementation is sloppy, since it's based on the source data type rather than the destination data type). Looking for feedback on a better solution, or on whether we can get rid of the 15MB limit in MarshalJSONWithOptions altogether, given that QValueToAvro seems to handle Snowflake's edge cases already -- but I'm not sure if I'm missing something.
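For illustration, here is a minimal sketch of how such a maximum-size test document can be built; this is not the PR's actual test code, and the helper name, package name, and `payload` field are illustrative. Because each byte appended to a BSON string field grows the encoded document by exactly one byte, padding by the remaining budget after marshaling a skeleton lands exactly on the limit:

```go
package mongotest

import (
	"strings"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
)

// maxBSONSize is MongoDB's documented per-document limit (16MB).
const maxBSONSize = 16 * 1024 * 1024

// maxSizeDocument builds a document whose marshaled BSON size is exactly maxBSONSize.
func maxSizeDocument() (bson.D, error) {
	doc := bson.D{
		{Key: "_id", Value: primitive.NewObjectID()},
		{Key: "payload", Value: ""},
	}
	// Marshal the skeleton to learn how many bytes of budget remain,
	// then pad the payload string by that amount.
	skeleton, err := bson.Marshal(doc)
	if err != nil {
		return nil, err
	}
	doc[1].Value = strings.Repeat("a", maxBSONSize-len(skeleton))
	return doc, nil
}
```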

  2. With the above change, insert/replace/delete events were processed successfully during CDC, but update events were still failing because they were double the size of the other events, and the change stream also has a 16MB limit on event size (the value was present at least twice in the event: once in the fullDocument field and once in the updateDescription field, not to mention it could also appear in the fullDocumentBeforeChange field if available). So the PR:
  • always sets fullDocumentBeforeChange to off, since we are not using it
  • makes the pipeline filter out the updateDescription field, which we are not using
  • restricts the stream to only the event types we care about so far (insert/update/delete/replace) and only the fields we need (operationType/clusterTime/documentKey/fullDocument/ns); see the sketch below.

Intentionally not using $changeStreamSplitLargeEvent for now, as it seems discouraged by the MongoDB docs -- we can support it if the current implementation is no longer sufficient.
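For reference, a minimal sketch of a change stream opened with such a trimmed pipeline, using the official Go driver. This is not the PR's exact code: the function name, the client-level Watch, and the updateLookup choice for post-images are assumptions.

```go
package mongocdc

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// openTrimmedChangeStream opens a change stream that keeps events small:
// it matches only the operation types we replicate and projects the event down
// to the fields the CDC path reads, dropping updateDescription entirely.
// The stream could equally be opened on a Database or Collection instead of the Client.
func openTrimmedChangeStream(ctx context.Context, client *mongo.Client) (*mongo.ChangeStream, error) {
	pipeline := mongo.Pipeline{
		{{Key: "$match", Value: bson.D{
			{Key: "operationType", Value: bson.D{{Key: "$in", Value: bson.A{"insert", "update", "delete", "replace"}}}},
		}}},
		{{Key: "$project", Value: bson.D{
			{Key: "operationType", Value: 1},
			{Key: "clusterTime", Value: 1},
			{Key: "documentKey", Value: 1},
			{Key: "fullDocument", Value: 1},
			{Key: "ns", Value: 1},
		}}},
	}

	opts := options.ChangeStream().
		SetFullDocument(options.UpdateLookup).   // assumption: post-images fetched via updateLookup
		SetFullDocumentBeforeChange(options.Off) // pre-images are unused, so never request them

	return client.Watch(ctx, pipeline, opts)
}
```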

Test: e2e test passing with this change.

@jgao54 jgao54 requested a review from serprex June 27, 2025 03:59
@jgao54 jgao54 force-pushed the large-event-handling branch from b91eaa1 to 190a9c7 on June 27, 2025 04:15
@jgao54 jgao54 force-pushed the large-event-handling branch from 190a9c7 to 46f7f3e on June 27, 2025 04:18
@jgao54 jgao54 requested a review from heavycrystal June 27, 2025 06:06
@@ -25,12 +25,14 @@ func ItemsToJSON(items Items) (string, error) {

 // encoding/gob cannot encode unexported fields
 type RecordItems struct {
-	ColToVal map[string]types.QValue
+	ColToVal   map[string]types.QValue
+	NoTruncate bool
Member commented:

This could be a size limit instead of a bool. Let's go larger than 15MB for non-SF connectors.
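For concreteness, one hypothetical shape of this suggestion (the field name and the zero-value convention are illustrative, not from the PR):

```go
// encoding/gob cannot encode unexported fields
type RecordItems struct {
	ColToVal      map[string]types.QValue
	TruncateBytes int // hypothetical: 0 means no truncation; Snowflake call sites would pass 15 * 1024 * 1024
}
```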

@jgao54 jgao54 (Contributor, Author) replied Jun 27, 2025:

In this PR I only set the call site in Mongo to true and everything else to false, to maintain backwards compatibility. If we go larger than 15MB for all non-SF connectors, we'll break backwards compatibility (in the sense that data that was previously truncated would now be available) -- is this okay?

@@ -18,7 +18,7 @@ func AttachToStream(ls *lua.LState, lfn *lua.LFunction, stream *model.QRecordStr
 	}
 	output.SetSchema(schema)
 	for record := range stream.Records {
-		row := model.NewRecordItems(len(record))
+		row := model.NewRecordItems(len(record), false)
Member commented:

This can be true; it makes it clearer that the target stream is responsible.

@jgao54 jgao54 (Contributor, Author) replied:

At each of the call sites for NewRecordItems, the information about the target DWH is not always available. Is there an easy way to wire this information through to the call site?
