Open
Description
Project:
snowplow/enricher/common
Version:
master (latest)
Expected behavior:
Generate a universally unique ID across the entire pipeline every time.
Actual behavior:
Generates duplicate event_id values at the enrichment stage.
Steps to reproduce:
Create an enriched event using the setupEnrichedEvent method of com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala, and follow the case where an EnrichedEvent is returned.
- Because of this problem with unique ID generation, our team has to build deduplication logic on our side, after the collector and enrichment stages are done, and there doesn't seem to be an easy way to fix it because of the "at least once" processing policy for those events. I am curious why UUID-based event_id generation was chosen and why there is no custom configuration for it. I would also like to propose using the Twitter Snowflake strategy to create those IDs. As I understand it, the UUID strategy is mostly tied to the MAC address of the network interface (that is the case for version 1 UUIDs, at least), whereas Twitter Snowflake includes a machine ID, which I think could resolve the duplication issue. I might be wrong, though, and would like to know the reasoning behind the UUID strategy.
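
To make the proposal concrete, here is a minimal sketch of a Snowflake-style generator in Scala. It is not Snowplow code: the class name `SnowflakeIdGenerator`, the 41/10/12 bit split, the custom epoch, and the injectable `clock` parameter are all assumptions chosen for illustration, following the commonly described Snowflake layout (timestamp, machine ID, per-millisecond sequence).

```scala
// Hypothetical Snowflake-style 64-bit ID generator (illustration only, not part of Snowplow).
// Assumed layout: 41 bits of milliseconds since a custom epoch | 10 bits machine ID | 12 bits sequence.
object SnowflakeIdGenerator {
  val CustomEpoch: Long    = 1577836800000L // 2020-01-01T00:00:00Z, an arbitrary assumption
  val MachineIdBits: Int   = 10
  val SequenceBits: Int    = 12
  val MaxMachineId: Long   = (1L << MachineIdBits) - 1 // 1023
  val MaxSequence: Long    = (1L << SequenceBits) - 1  // 4095
  val MachineIdShift: Int  = SequenceBits
  val TimestampShift: Int  = SequenceBits + MachineIdBits
}

final class SnowflakeIdGenerator(
    machineId: Long,
    clock: () => Long = () => System.currentTimeMillis()
) {
  import SnowflakeIdGenerator._

  require(
    machineId >= 0 && machineId <= MaxMachineId,
    s"machineId must be in [0, $MaxMachineId]"
  )

  private var lastTimestamp = -1L
  private var sequence      = 0L

  /** Returns the next ID; IDs from one generator are strictly increasing. */
  def nextId(): Long = synchronized {
    var now = clock()
    if (now < lastTimestamp)
      throw new IllegalStateException("Clock moved backwards; refusing to generate an ID")

    if (now == lastTimestamp) {
      // Same millisecond as the previous call: bump the sequence counter.
      sequence = (sequence + 1) & MaxSequence
      if (sequence == 0L) {
        // All 4096 sequence values used in this millisecond: wait for the next one.
        while (now <= lastTimestamp) now = clock()
      }
    } else {
      sequence = 0L
    }

    lastTimestamp = now
    ((now - CustomEpoch) << TimestampShift) | (machineId << MachineIdShift) | sequence
  }
}
```

Each enrich worker would need its own `machineId` for IDs to be distinct across machines; how that ID would be assigned is left open here. Note also that a generator like this only guarantees that every call returns a different value. An event that is delivered twice under "at least once" processing would still be seen twice, so this sketch may not remove the need for deduplication by itself.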