8000 [PM-17562] Add support for retries on event integrations by brant-livefront · Pull Request #5795 · bitwarden/server · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[PM-17562] Add support for retries on event integrations #5795

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occa 8000 sionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 27, 2025

Conversation

brant-livefront
Copy link
Contributor
@brant-livefront brant-livefront commented May 9, 2025

🎟️ Tracking

PM-17562

📔 Objective

This PR adds a new layer of AMQP messaging (for RabbitMQ now, but ASB will be added in another PR). This allows us to queue each individual integration message to allow for retries and dead-lettering.

The flow of events with this new change (assuming the feature flag is On and Rabbit is configured):

  1. Event comes into the system
  2. Event gets published to the main exchange, which is a fanout exchange. This means it is immediately published to the event-integration queues (1 per integration, currently webhook and slack), and the event repository queue
  3. Listener/Handler pairs on each queue receive the message and run
    • For the event repository handler, it writes the event to the EventRepository.
  4. For the event-integration queues, the EventIntegrationHandler pulls the current configurations for this Organization, Event Type, and IntegrationType. For each configuration:
    • It parses the configuration into a typed configuration object
    • It processes the Template string and fills in tokens with the contents of EventMessage or any other fetched properties
    • It bundles all of this up into an IntegrationMessage
    • And it publishes that message to another exchange - the Integration Exchange - with the routing key of the IntegrationType.
  5. The Integration Exchange is not fanout. It publishes direct to exactly one queue depending on the routing key. The queues are one per integration (i.e. slack, webhook).
  6. The specific queue receives the message and another Listener/Handler pair runs. These are directly tied to the integration and take the IntegrationMessage (which has the necessary configuration and message) and publish it to the integration.
  7. The handlers return a result that indicates a success/failure and (if failure) if the message can be retried. The listener checks the result. If it's a success, it's simply acknowledged and the process ends. If it's retryable, it is published to the retry queue for this integration. If it is not retryable, or it has exceeded the configurable number of retries, it is sent to the dead letter queue.

In the RabbitMQ implementation, we are using a very simple retry queue. It has a TTL of 30 seconds with a dead-letter pointing back to its original queue. This means that messages are put in the retry queue and no handlers ever pick them up. When the TTL expires, Rabbit dead-letters them back to the main integration queue, where they are picked up and reprocessed. Integration messages contain the number of retries, so we don't do this endlessly.

Although this implementation adds things like NotBeforeUtc to support scheduling messages and obeying Retry-After headers, we ignore that for the RabbitMQ implementation. Azure Service Bus will use these directly since it supports scheduling out of the box. RabbitMQ needs a special package to add delay scheduling which adds complexity that self-hosted instances probably do not need.

We are also using one dead-letter-queue for all integrations. We could make this per-integration if we wanted to, it just adds more complexity in the configuration / global settings. Since messages in the DLQ are really just for information purposes (i.e. we're not planning on programmatically re-queueing them), it seemed cleaner to put them all in one place where they could be viewed.

Lastly, this PR also cleans up some of the Startup.cs files with some methods in our ServiceCollectionExtensions.

⏰ Reminders before review

  • Contributor guidelines followed
  • All formatters and local linters executed and passed
  • Written new unit and / or integration tests where applicable
  • Protected functional changes with optionality (feature flags)
  • Used internationalization (i18n) for all UI strings
  • CI builds passed
  • Communicated to DevOps any deployment requirements
  • Updated any necessary documentation (Confluence, contributing docs) or informed the documentation team

🦮 Reviewer guidelines

  • 👍 (:+1:) or similar for great changes
  • 📝 (:memo:) or ℹ️ (:information_source:) for notes or general info
  • ❓ (:question:) for questions
  • 🤔 (:thinking:) or 💭 (:thought_balloon:) for more open inquiry that's not quite a confirmed issue and could potentially benefit from discussion
  • 🎨 (:art:) for suggestions / improvements
  • ❌ (:x:) or ⚠️ (:warning:) for more significant problems or concerns needing attention
  • 🌱 (:seedling:) or ♻️ (:recycle:) for future improvements or indications of technical debt
  • ⛏ (:pick:) for minor or nitpick changes

@brant-livefront brant-livefront requested a review from a team as a code owner May 9, 2025 14:03
@brant-livefront brant-livefront requested a review from eliykat May 9, 2025 14:03
Copy link
Contributor
github-actions bot commented May 9, 2025

Logo
Checkmarx One – Scan Summary & Details355c188f-8db0-4046-926e-d9d12a23f109

Great job, no security vulnerabilities found in this Pull Request

Copy link
codecov bot commented May 9, 2025

Codecov Report

Attention: Patch coverage is 35.51797% with 305 lines in your changes missing coverage. Please review.

Project coverage is 47.49%. Comparing base (198d96e) to head (9617326).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...lementations/RabbitMqIntegrationListenerService.cs 0.00% 144 Missing ⚠️
...SharedWeb/Utilities/ServiceCollectionExtensions.cs 17.24% 115 Missing and 5 partials ⚠️
...es/Implementations/RabbitMqIntegrationPublisher.cs 0.00% 29 Missing ⚠️
...ervices/Implementations/EventIntegrationHandler.cs 90.74% 2 Missing and 3 partials ⚠️
src/Core/Settings/GlobalSettings.cs 84.61% 2 Missing ⚠️
...ole/Models/Data/Integrations/IntegrationMessage.cs 94.44% 0 Missing and 1 partial ⚠️
...es/Implementations/RabbitMqEventListenerService.cs 0.00% 1 Missing ⚠️
...vices/Implementations/RabbitMqEventWriteService.cs 0.00% 1 Missing ⚠️
...vices/Implementations/WebhookIntegrationHandler.cs 96.77% 0 Missing and 1 partial ⚠️
src/EventsProcessor/Startup.cs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5795      +/-   ##
==========================================
- Coverage   50.83%   47.49%   -3.34%     
==========================================
  Files        1651     1660       +9     
  Lines       74866    75186     +320     
  Branches     6741     6756      +15     
==========================================
- Hits        38055    35711    -2344     
- Misses      35291    38012    +2721     
+ Partials     1520     1463      -57     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@withinfocus withinfocus changed the title [PM-17562] Add support for retires on event integrations [PM-17562] Add support for retries on event integrations May 12, 2025
Copy link
Contributor
@withinfocus withinfocus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little things and questions.

The one thing I have been sorta staring at:

In the RabbitMQ implementation, we are using a very simple retry queue. It has a TTL of 30 seconds with a dead-letter pointing back to its original queue. This means that messages are put in the retry queue and no handlers ever pick them up. When the TTL expires, Rabbit dead-letters them back to the main integration queue, where they are picked up and reprocessed.

Is this standard / normal setup with Rabbit? And per your other notes, we can't use the NotBeforeUtc instead of the 30s that's hardcoded? I am trying to think about how we can compensate for some of these gaps if this is all Rabbit can do.

Copy link
Contributor
@withinfocus withinfocus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there.

withinfocus
withinfocus previously approved these changes May 13, 2025
Copy link
Member
@eliykat eliykat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the detailed PR description, it provided much needed context. I'm mostly just following along here and relying on @withinfocus' review for the detail. The issues below are pretty minor, happy to discuss if needed.

Copy link

@brant-livefront brant-livefront merged commit f3e637c into main May 27, 2025
53 checks passed
@brant-livefront brant-livefront deleted the brant/pm-17562-add-retries-to-webhooks branch May 27, 2025 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0