ENG-41 Adj OS bulk import error handling #3802

leite08 · 2025-05-08T13:07:07Z

Dependencies

upstream: none
downstream: ENG-268 Consolidated search by keyword #3814

Description

Improve OpenSearch's bulk import error handling.

group similar errors w/ count when sending to Sentry (cloud env)
store in NDJSON when running local from utils/scripts to help debugging

Testing

Local
- Errors to ingest are converted correctly
- Errors in bulk ingestion are sent to Slack when running in cloud
- Errors in bulk ingestion are stored on file when local
Staging
- none
Sandbox
- none
Production
- none

Release Plan

Merge this

Summary by CodeRabbit

New Features
- Enhanced error handling for OpenSearch bulk operations with detailed error extraction and callback support.
- Semantic ingestion errors are now logged to timestamped NDJSON files for easier troubleshooting.
- Added Sentry initialization for improved error monitoring during development.
Improvements
- Increased bulk ingestion chunk size and timeout, with retry logic for more reliable processing.
- Error reporting now includes counts by error type for clearer diagnostics.
Refactor
- Streamlined internal error processing and ingestion flow for better clarity and maintainability.
Chores
- Reorganized output folder setup to ensure consistent file management.

linear · 2025-05-08T13:07:10Z

ENG-41 MVP OpenSearch for Semantic Search

coderabbitai · 2025-05-08T13:07:14Z

Walkthrough

The changes introduce new utilities for extracting and handling errors from OpenSearch bulk responses, refactor bulk ingestion logic for improved error tracking and reporting, and propagate granular error handling through the ingestion pipeline. Error callbacks are now supported during semantic ingestion, enabling errors to be logged to NDJSON files. Sentry initialization is added for error monitoring.

Changes

File(s)	Change Summary
`packages/core/src/external/opensearch/bulk.ts`	New module providing typed utilities to extract and process errors from OpenSearch bulk responses, including types for operations and errors, and functions for counting and stringifying errors.
`packages/core/src/external/opensearch/semantic/ingest.ts`	Updated `ingestSemantic` function to accept and forward an optional `onItemError` callback for per-item error handling during ingestion.
`packages/core/src/external/opensearch/text-ingestor-direct.ts`	Refactored and enhanced bulk ingestion: increased chunk size and timeout, added retry logic, introduced granular error tracking via callback, updated types, and improved logging/reporting of errors. Added helper functions for building requests, handling errors, and logging.
`packages/utils/src/open-search/semantic-ingest.ts`	Added error callback to capture and write ingestion errors as NDJSON files in a dedicated output folder. Integrated Sentry initialization for error monitoring. Updated control flow to support per-item error logging.
`packages/utils/src/open-search/semantic-text-from-consolidated.ts`	Reordered folder initialization and file path construction to ensure output directory is set up before file operations. No logic changes to data processing.
`packages/utils/src/shared/sentry.ts`	New module providing a utility function to initialize Sentry for error tracking, using environment variables and configuration suitable for development/testing.

Sequence Diagram(s)

sequenceDiagram
    participant Script as semantic-ingest.ts
    participant Sentry as Sentry
    participant SemanticIngest as ingestSemantic
    participant TextIngestor as OpenSearchTextIngestorDirect
    participant OpenSearch as OpenSearch API

    Script->>Sentry: initSentry()
    Script->>SemanticIngest: ingestSemantic({ patient, onItemError })
    SemanticIngest->>TextIngestor: ingestBulk({ cxId, patientId, resources, onItemError })
    loop For each chunk of resources
        TextIngestor->>OpenSearch: bulk API request
        OpenSearch-->>TextIngestor: bulk response
        TextIngestor->>TextIngestor: processErrorsFromBulkResponse(response, operation, onItemError)
        alt If error in item
            TextIngestor->>Script: onItemError(error) (writes to NDJSON)
        end
    end

sequenceDiagram
    participant TextIngestor as OpenSearchTextIngestorDirect
    participant BulkUtil as bulk.ts
    participant OpenSearch as OpenSearch API

    TextIngestor->>OpenSearch: bulk API request
    OpenSearch-->>TextIngestor: bulk response
    TextIngestor->>BulkUtil: processErrorsFromBulkResponse(response, operation, onItemError)
    BulkUtil-->>TextIngestor: error count
    alt If onItemError provided
        BulkUtil->>Script: onItemError(error)
    end

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

npm warn config production Use --omit=dev instead.
npm error code ERR_SSL_WRONG_VERSION_NUMBER
npm error errno ERR_SSL_WRONG_VERSION_NUMBER
npm error request to https://10.0.0.28:4873/punycode/-/punycode-2.3.1.tgz failed, reason: C02C74F56D7F0000:error:0A00010B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:354:
npm error
npm error A complete log of this run can be found in: /.npm/_logs/2025-05-23T14_18_58_436Z-debug-0.log

Note

⚡️ AI Code Reviews for VS Code, Cursor, Windsurf

CodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback.
Learn more here.

✨ Finishing Touches

📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit. 8000
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (7)

packages/utils/src/shared/sentry.ts (1)
15-23: Consider adding return type and error handling.

The function implementation follows best practices by using environment-specific configuration. However, consider adding a return type annotation and potentially some error handling in case Sentry initialization fails.
-export function initSentry() {
+export function initSentry(): void {
   Sentry.init({
     dsn: sentryDsn,
     enabled: sentryDsn != null,
     environment: getEnvType(),
     sampleRate: 1.0, // error sample rate
     tracesSampleRate: 0, // transaction sample rate
   });
}
packages/utils/src/open-search/semantic-ingest.ts (1)
39-44: Consider adding error handling for file operations.

The file path setup code is well-structured, but there's no error handling for potential file system operation failures. Consider adding try/catch blocks around file operations to handle potential errors gracefully.
 const outputFolderName = `semantic-ingest`;
-initRunsFolder(outputFolderName);
-const getFolderName = buildGetDirPathInside(outputFolderName);
-const outputFilePrefix = getFolderName(`${cxId}_${patientId}`);
-const outputFilePath = outputFilePrefix + ".ndjson";
+try {
+  initRunsFolder(outputFolderName);
+  const getFolderName = buildGetDirPathInside(outputFolderName);
+  const outputFilePrefix = getFolderName(`${cxId}_${patientId}`);
+  const outputFilePath = outputFilePrefix + ".ndjson";
+} catch (error) {
+  console.error(`Failed to initialize output folder: ${error}`);
+  process.exit(1);
+}
packages/core/src/external/opensearch/text-ingestor-direct.ts (5)
19-19: Fix grammatical issue in comment

The unit should be plural since the value is 2.
-const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minute").asMilliseconds();
+const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minutes").asMilliseconds();
29-29: Consider adding a configuration option for bulk chunk size

The bulk chunk size has been increased significantly from what appears to be a previous value to 10,000. While the comments justify this based on average resource size, it would be better to make this configurable to handle varying resource sizes across different environments.

Consider adding a configuration parameter for the bulk chunk size:
-const bulkChunkSize = 10_000;
+const DEFAULT_BULK_CHUNK_SIZE = 10_000;

 export type OpenSearchTextIngestorDirectConfig = OpenSearchIngestorConfig & {
   endpoint: string;
   username: string;
   password: string;
   settings?: OpenSearchFileIngestorDirectSettings;
+  bulkChunkSize?: number;
 };

 // Then in constructor:
 constructor(readonly config: OpenSearchTextIngestorDirectConfig) {
   // ...
   this.bulkChunkSize = config.bulkChunkSize ?? DEFAULT_BULK_CHUNK_SIZE;
 }

 // Use this.bulkChunkSize in the ingestBulk method
44-44: Add JSDoc for the onItemError parameter in IngestBulkRequest

For consistency with your other documentation, you should add a JSDoc comment for the onItemError parameter in the IngestBulkRequest type.
 export type IngestBulkRequest = {
   cxId: string;
   patientId: string;
   resources: IngestRequestResource[];
+  /** Function to call for each item error during bulk ingestion. See buildOnItemError() for the default implementation. */
   onItemError?: OnBulkItemError | undefined;
 };
175-183: Extract flatMap operation to a more explicit implementation

The flatMap operation can be hard to understand at a glance. Consider using a more explicit approach with map + flat for better readability, especially for developers who may not 8000 be familiar with flatMap.
-    const bulkRequest = resources.flatMap(resource =>
-      resourceToBulkRequest({
-        cxId,
-        patientId,
-        resource,
-        operation,
-      })
-    );
+    const bulkRequest = resources
+      .map(resource =>
+        resourceToBulkRequest({
+          cxId,
+          patientId,
+          resource,
+          operation,
+        })
+      )
+      .flat();
244-245: Fix typo in comment and clarify the assumption

There's a typo in the comment and the assumption should be clarified to avoid potential misunderstandings.
-/**
- * Builds a function that will add the error to the list if it doesn't already exist.
- *
- *
- * NOTE: Assumes all requeests are for the same index and operation!
- *
+/**
+ * Builds a function that will add the error to the list if it doesn't already exist.
+ *
+ *
+ * NOTE: Assumes all requests being processed share the same index and operation type!
+ * This function only tracks error counts by type, not by index or operation.
+ *

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9546a22 and 7eca2da.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to narrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/semantic/ingest.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts
packages/utils/src/shared/sentry.ts
packages/utils/src/open-search/semantic-ingest.ts
packages/core/src/external/opensearch/bulk.ts
packages/core/src/external/opensearch/text-ingestor-direct.ts

🧬 Code Graph Analysis (4)

packages/core/src/external/opensearch/semantic/ingest.ts (1)

packages/core/src/external/opensearch/bulk.ts (1)

OnBulkItemError (23-23)

packages/utils/src/shared/sentry.ts (1)

packages/api/src/shared/config.ts (1)

getEnvVar (10-10)

packages/utils/src/open-search/semantic-ingest.ts (5)

packages/utils/src/shared/folder.ts (2)

initRunsFolder (9-27)

buildGetDirPathInside (37-43)

packages/utils/src/shared/sentry.ts (1)

initSentry (15-23)

packages/core/src/external/opensearch/semantic/ingest.ts (1)

ingestSemantic (14-49)

packages/utils/src/shared/duration.ts (1)

elapsedTimeAsStr (8-13)

packages/core/src/external/opensearch/bulk.ts (1)

BulkResponseErrorItem (15-21)

packages/core/src/external/opensearch/text-ingestor-direct.ts (3)

packages/core/src/external/opensearch/file-ingestor.ts (1)

IngestRequest (3-7)

packages/core/src/external/opensearch/index.ts (1)

contentFieldName (7-7)

packages/core/src/external/opensearch/bulk.ts (4)

OnBulkItemError (23-23)

getCountErrorsFromBulkResponse (33-55)

BulkOperation (3-3)

BulkResponseErrorItem (15-21)

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: check-pr / lint-build-test
GitHub Check: check-pr / lint-build-test
GitHub Check: Analyze (javascript)

🔇 Additional comments (12)

packages/utils/src/shared/sentry.ts (2)

1-5: Good import organization and constant declaration.

The imports are well-organized with the external library separated from internal imports. The constant is declared at the module level which is appropriate.

6-14: Comprehensive JSDoc documentation.

The comment block clearly explains the purpose of the function and includes important warnings about not enabling Sentry in production or scripts that might generate excessive events. The references to existing implementations are also helpful.

packages/core/src/external/opensearch/semantic/ingest.ts (3)

5-5: Clean import addition.

The import of OnBulkItemError is appropriately placed with other imports from the same module.

14-20: Well-structured parameter interface.

The function signature update properly incorporates the optional error callback parameter using TypeScript's optional property syntax in the parameter object. This follows the project's coding guidelines of preferring destructured parameters.

44-44: Proper propagation of error callback.

The optional error callback is correctly forwarded to the ingestBulk method, allowing for granular error handling during semantic ingestion without modifying the core ingestion logic.

packages/utils/src/open-search/semantic-ingest.ts (5)

6-6: Appropriate type import.

The import of BulkResponseErrorItem type from the bulk module aligns with the error handling improvements in this PR.

12-15: Well-organized imports for file operations and Sentry.

The imports are properly organized, adding necessary utilities for file operations and error reporting.

19-27: Clear documentation update.

The comment has been expanded to accurately describe the new error handling functionality, which helps future developers understand the purpose and usage of the script.

48-48: Good integration of Sentry initialization.

Adding Sentry initialization at the beginning of the main function ensures errors are properly captured during execution.

54-54: Proper usage of the error callback.

The onItemError callback is correctly passed to the ingestSemantic function, enabling granular error handling during semantic ingestion.

packages/utils/src/open-search/semantic-text-from-consolidated.ts (1)

36-39: Good reorganization of folder initialization.

Moving the folder initialization and file path setup to the beginning of the main function ensures that these operations are completed before any processing begins. This matches the pattern used in the semantic-ingest.ts file and improves code consistency.

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

115-154: Good implementation of error tracking with the map data structure

The refactoring to use a Map for error tracking is excellent. This approach effectively groups similar errors and provides counts, which is more useful than a flat list of all errors. The addition of the buildOnItemError helper function creates a clean default implementation.

packages/utils/src/open-search/semantic-ingest.ts

packages/core/src/external/opensearch/bulk.ts

packages/core/src/external/opensearch/text-ingestor-direct.ts

packages/utils/src/open-search/semantic-ingest.ts

packages/core/src/external/opensearch/bulk.ts

lucasdellabella

fixnship, everything i left is basically a nit

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

267-271: Improve error capturing and context in Sentry

When sending errors to Sentry, it's better to provide more structured context rather than JSON stringifying the error map. Sentry can handle structured data better than stringified JSON.

 function captureErrors({
   cxId,
   patientId,
   errors,
   log,
 }: {
   cxId: string;
   patientId: string;
   errors: Map<string, number>;
   log: typeof console.log;
 }) {
   const errorMapToObj = Object.fromEntries(errors.entries());
   log(`Errors: `, () => JSON.stringify(errorMapToObj));
   capture.error("Errors ingesting resources into OpenSearch", {
-    extra: { cxId, patientId, countPerErrorType: JSON.stringify(errorMapToObj, null, 2) },
+    extra: { 
+      cxId, 
+      patientId, 
+      countPerErrorType: errorMapToObj,
+      totalErrorCount: Array.from(errors.values()).reduce((sum, count) => sum + count, 0),
+    },
   });
 }

🧹 Nitpick comments (2)

packages/core/src/external/opensearch/text-ingestor-direct.ts (2)
139-150: Consider parallel processing for better performance

The current implementation processes chunks sequentially using a for-of loop. While this makes error tracking straightforward, it might be slower than processing chunks in parallel.

You could maintain parallel processing while still tracking errors accurately:
- for (const resourceChunk of chunks) {
-   const errorCountChunk = await this.ingestBulkInternal({
-     cxId,
-     patientId,
-     resources: resourceChunk,
-     indexName,
-     client,
-     onItemError: _onItemError,
-   });
-   errorCount += errorCountChunk;
- }
+ const errorCounts = await Promise.all(
+   chunks.map(resourceChunk => 
+     this.ingestBulkInternal({
+       cxId,
+       patientId,
+       resources: resourceChunk,
+       indexName,
+       client,
+       onItemError: _onItemError,
+     })
+   )
+ );
+ errorCount = errorCounts.reduce((sum, count) => sum + count, 0);
245-245: Fix typo in comment

There's a typo in the comment: "requeests" should be "requests".
- NOTE: Assumes all requeests are for the same index and operation!
+ NOTE: Assumes all requests are for the same index and operation!

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 7eca2da and 6eef068.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

✅ Files skipped from review due to trivial changes (1)

packages/utils/src/shared/sentry.ts

🚧 Files skipped from review as they are similar to previous changes (4)

packages/utils/src/open-search/semantic-ingest.ts
packages/core/src/external/opensearch/semantic/ingest.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts
packages/core/src/external/opensearch/bulk.ts

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to n 8000 arrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/text-ingestor-direct.ts

🔇 Additional comments (7)

packages/core/src/external/opensearch/text-ingestor-direct.ts (7)

19-20: 👍 Good improvement to timeout and retry settings

Increasing the bulk ingestion timeout from 1 to 2 minutes and adding a maximum retry count of 3 for bulk requests improves reliability for larger batch operations.

29-29: 👍 Appropriate bulk chunk size increase

Increasing the bulk chunk size from 200 to 10,000 aligns with OpenSearch/Elasticsearch recommendations for bulk operations (5-15MB). With resources typically around 250 bytes, this is approximately 2.5MB per chunk, which is reasonable.

31-45: Good refactoring of request types

The separation of IngestRequestResource from IngestRequest improves modularity, and adding the optional onItemError callback supports the new error handling workflow.

135-138: 👍 Effective error tracking implementation

The implementation of error tracking via a Map with error types as keys and counts as values is an excellent approach to group similar errors, which aligns perfectly with the PR's objective.

175-182: 👍 Good extraction of resource to bulk request logic

Extracting the bulk request building logic to a separate function improves readability and maintainability.

191-194: 👍 Clean error handling with helper function

Using the getCountErrorsFromBulkResponse helper function with the error callback provides a cleaner way to handle errors from bulk responses.

249-254: 👍 Effective error grouping implementation

The buildOnItemError function efficiently groups errors by type and counts occurrences, fulfilling the PR objective of grouping similar errors with their counts.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

256-272: Improve error capturing and context in Sentry.

When sending errors to Sentry, it's better to provide more structured context rather than JSON stringifying the error map. Sentry can handle structured data better than stringified JSON.

function captureErrors({
  cxId,
  patientId,
  errors,
  log,
}: {
  cxId: string;
  patientId: string;
  errors: Map<string, number>;
  log: typeof console.log;
}) {
  const errorMapToObj = Object.fromEntries(errors.entries());
  log(`Errors: `, () => JSON.stringify(errorMapToObj));
  capture.error("Errors ingesting resources into OpenSearch", {
-    extra: { cxId, patientId, countPerErrorType: JSON.stringify(errorMapToObj, null, 2) },
+    extra: { 
+      cxId, 
+      patientId, 
+      countPerErrorType: errorMapToObj,
+      totalErrorCount: Array.from(errors.values()).reduce((sum, count) => sum + count, 0),
+    },
  });
}

🧹 Nitpick comments (3)

packages/core/src/external/opensearch/text-ingestor-direct.ts (3)
19-20: Consider adding a comment explaining the timeout increase.

The bulk ingestion timeout has been doubled from 1 to 2 minutes, but there's no explanation for this change. Adding a comment would help future developers understand the reasoning behind this value.
-const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minute").asMilliseconds();
+// Increased from 1 to 2 minutes to accommodate larger bulk operations
+const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minute").asMilliseconds();
241-254: Fix typo in comment.

There's a typo in the comment for the buildOnItemError function.
-/**
- * Builds a function that will add the error to the list if it doesn't already exist.
- *
- *
- * NOTE: Assumes all requeests are for the same index and operation!
- *
+/**
+ * Builds a function that will add the error to the list if it doesn't already exist.
+ *
+ *
+ * NOTE: Assumes all requests are for the same index and operation!
+ *
249-254: Consider optimizing Map operation.

The current implementation reads and writes to the Map in separate operations. This can be optimized using a single operation.
function buildOnItemError(errors: Map<string, number>): OnBulkItemError {
  return (error: BulkResponseErrorItem) => {
-    const count = errors.get(error.type) ?? 0;
-    errors.set(error.type, count + 1);
+    errors.set(error.type, (errors.get(error.type) ?? 0) + 1);
  };
}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 6eef068 and 6bef7a4.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

packages/utils/src/open-search/semantic-ingest.ts
packages/utils/src/shared/sentry.ts
packages/core/src/external/opensearch/semantic/ingest.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts
packages/core/src/external/opensearch/bulk.ts

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to narrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/text-ingestor-direct.ts

🧬 Code Graph Analysis (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (3)

packages/core/src/external/opensearch/index.ts (1)

contentFieldName (7-7)

packages/core/src/external/opensearch/bulk.ts (4)

OnBulkItemError (23-23)

processErrorsFromBulkResponse (33-55)

BulkOperation (3-3)

BulkResponseErrorItem (15-21)

packages/lambdas/src/shared/capture.ts (1)

capture (18-103)

🔇 Additional comments (7)

packages/core/src/external/opensearch/text-ingestor-direct.ts (7)

8-13: Well-structured imports for bulk operation functionality.

The imported types and functions from the bulk module provide a good separation of concerns and improve code organization.

29-29: Approve the significantly increased bulk chunk size.

The chunk size increase (from 200 to 10,000) is well-justified in the comment above. The calculation based on average resource size and Elasticsearch's recommendation supports this change.

34-39: Good type refactoring for better code organization.

Extracting the common resource fields into a separate IngestRequestResource type improves code organization and reusability.

43-44: Good addition of error handling capability.

Adding the optional onItemError callback to the IngestBulkRequest type enables granular error handling during bulk operations, which aligns with the PR objectives.

115-155: Improved bulk ingestion with better error handling.

The rewritten ingestBulk method now properly processes resources in chunks, tracks errors by type, and provides better error reporting. This implementation aligns well with the PR objectives of improving error handling during bulk operations.

157-195: Refactored ingestBulkInternal with improved error handling.

The method now uses the processErrorsFromBulkResponse function from the bulk module and returns the error count, which is a cleaner approach than returning raw errors.

206-227: Well-structured helper function for building bulk requests.

The resourceToBulkRequest function encapsulates the logic for building bulk request bodies, improving code readability and maintainability.

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

157-194: 🛠️ Refactor suggestion

Consider adding proper error handling for network or client failures.

While the method handles OpenSearch response errors via processErrorsFromBulkResponse, there's no try-catch block to handle connection issues or other exceptions that might occur during the client.bulk() call.

    const startedAt = Date.now();
-   const response = await client.bulk(
-     { index: indexName, body: bulkRequest },
-     { requestTimeout: DEFAULT_BULK_INGESTION_TIMEOUT, maxRetries: MAX_BULK_RETRIES }
-   );
-   const time = Date.now() - startedAt;
-
-   const errorCount = processErrorsFromBulkResponse(response, operation, onItemError);
+   try {
+     const response = await client.bulk(
+       { index: indexName, body: bulkRequest },
+       { requestTimeout: DEFAULT_BULK_INGESTION_TIMEOUT, maxRetries: MAX_BULK_RETRIES }
+     );
+     const time = Date.now() - startedAt;
+
+     const errorCount = processErrorsFromBulkResponse(response, operation, onItemError);
+     debug(`${operation}ed ${resources.length} resources in ${time} ms, ${errorCount} errors`);
+     return errorCount;
+   } catch (error) {
+     debug(`Failed to bulk ${operation} ${resources.length} resources: ${error}`);
+     if (onItemError) {
+       // Handle all resources as failed with the same error
+       resources.forEach(() => 
+         onItemError({
+           status: 500,
+           type: error instanceof Error ? error.name : 'UnknownError',
+           reason: error instanceof Error ? error.message : String(error),
+         })
+       );
+     }
+     return resources.length; // All resources failed
+   }
-   debug(`${operation}ed ${resources.length} resources in ${time} ms, ${errorCount} errors`);
-
-   return errorCount;

♻️ Duplicate comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)
267-271: Improve error capturing and context in Sentry.

When sending errors to Sentry, it's better to provide more structured context rather than JSON stringifying the error map. Sentry can handle structured data better than stringified JSON.
  const errorMapToObj = Object.fromEntries(errors.entries());
  log(`Errors: `, () => JSON.stringify(errorMapToObj));
  capture.error("Errors ingesting resources into OpenSearch", {
-    extra: { cxId, patientId, countPerErrorType: JSON.stringify(errorMapToObj, null, 2) },
+    extra: { 
+      cxId, 
+      patientId, 
+      countPerErrorType: errorMapToObj,
+      totalErrorCount: Array.from(errors.values()).reduce((sum, count) => sum + count, 0),
+    },
  });

🧹 Nitpick comments (2)

packages/core/src/external/opensearch/text-ingestor-direct.ts (2)
19-19: Fix grammatical error in constant definition.

The time unit "minute" should be plural when specifying multiple time units.
-const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minute").asMilliseconds();
+const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minutes").asMilliseconds();
206-227: Add return type to resourceToBulkRequest function.

The function is missing a return type definition, which would improve type safety and documentation.
function resourceToBulkRequest({
  cxId,
  patientId,
  resource,
  operation,
}: {
  cxId: string;
  patientId: string;
  resource: IngestRequestResource;
  operation: BulkOperation;
-}) {
+}): [Record<BulkOperation, { _id: string }>, IngestRequest] {
  const entryId = getEntryId(cxId, patientId, resource.resourceId);
  const document: IngestRequest = {
    cxId,
    patientId,
    resourceType: resource.resourceType,
    resourceId: resource.resourceId,
    content: resource.content,
  };
  const cmd = { [operation]: { _id: entryId } };
  return [cmd, document];
}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 6bef7a4 and b39caab.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

packages/utils/src/open-search/semantic-ingest.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts
packages/utils/src/shared/sentry.ts
packages/core/src/external/opensearch/semantic/ingest.ts
packages/core/src/external/opensearch/bulk.ts

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to narrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/text-ingestor-direct.ts

🔇 Additional comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

124-154: LGTM! Error handling implementation looks solid.

The bulk ingestion implementation properly tracks errors and aggregates them by type, which is aligned with the PR objective to group similar errors. The fallback error handler is a nice touch that ensures errors are always captured even when no custom handler is provided.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

256-272: Consider structured error data in Sentry.

While the implementation works, consider the previous review suggestion to provide structured data to Sentry rather than stringified JSON for better error exploration.

 function captureErrors({
   cxId,
   patientId,
   errors,
   log,
 }: {
   cxId: string;
   patientId: string;
   errors: Map<string, number>;
   log: typeof console.log;
 }) {
   const errorMapToObj = Object.fromEntries(errors.entries());
   log(`Errors: `, () => JSON.stringify(errorMapToObj));
   capture.error("Errors ingesting resources into OpenSearch", {
-    extra: { cxId, patientId, countPerErrorType: JSON.stringify(errorMapToObj, null, 2) },
+    extra: { 
+      cxId, 
+      patientId, 
+      countPerErrorType: errorMapToObj,
+      totalErrorCount: Array.from(errors.values()).reduce((sum, count) => sum + count, 0),
+    },
   });
 }

🧹 Nitpick comments (2)

packages/core/src/external/opensearch/text-ingestor-direct.ts (2)
135-138: Consider clearer naming for the error handler variable.

The variable name _onItemError is very similar to the parameter name onItemError which could be confusing for future developers.
-const _onItemError = onItemError ?? buildOnItemError(errors);
+const effectiveOnItemError = onItemError ?? buildOnItemError(errors);
241-254: Error aggregation helper with typo in comment.

The buildOnItemError function effectively aggregates errors by type, but there's a typo in the comment.
-* NOTE: Assumes all requeests are for the same index and operation!
+* NOTE: Assumes all requests are for the same index and operation!

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between b39caab and f3c604b.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

packages/utils/src/shared/sentry.ts
packages/core/src/external/opensearch/semantic/ingest.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts
packages/utils/src/open-search/semantic-ingest.ts
packages/core/src/external/opensearch/bulk.ts

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to narrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/text-ingestor-direct.ts

🧬 Code Graph Analysis (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (3)

packages/core/src/external/opensearch/index.ts (1)

contentFieldName (7-7)

packages/core/src/external/opensearch/bulk.ts (4)

OnBulkItemError (23-23)

processErrorsFromBulkResponse (33-55)

BulkOperation (3-3)

BulkResponseErrorItem (15-21)

packages/lambdas/src/shared/capture.ts (1)

capture (18-103)

🔇 Additional comments (10)

packages/core/src/external/opensearch/text-ingestor-direct.ts (10)

8-13: LGTM: New imports added for bulk operations.

These imports from the new bulk.ts module properly support the enhanced error handling capabilities being implemented.

34-39: Good type refactoring for improved code structure.

Extracting IngestRequestResource as a separate type and adding the optional onItemError callback parameter improve the interface design and support the new error handling functionality.

Also applies to: 43-44

115-124: Well-documented function with clear parameters.

The JSDoc comments for ingestBulk clearly explain the purpose and parameters of the function, which is helpful for maintainability.

140-150: LGTM: Properly handles chunks and tracks error count.

The refactored loop correctly processes chunks of resources and accumulates the error count.

153-154: Good error reporting with proper log context.

The log includes the total resources processed, time taken, and error count, followed by error capture when errors exist.

175-182: Improved bulk request creation with dedicated helper function.

Extracting the bulk request creation logic to the resourceToBulkRequest helper function improves readability and maintainability.

186-187: Enhanced bulk operation configuration.

Using the increased timeout and setting a maximum retry count improves the robustness of bulk operations.

191-194: Simplified error handling with bulk response processor.

Using the new processErrorsFromBulkResponse utility function from bulk.ts makes the error handling code more maintainable and consistent.

206-227: Well-structured helper function for bulk request formatting.

The resourceToBulkRequest function properly formats resources for OpenSearch bulk operations, making the main code cleaner.

19-20:
❓ Verification inconclusive

Configuration adjustments for improved performance.

The changes to these constants look good:

Increasing timeout from 1 to 2 minutes allows more time for larger bulk operations

Adding MAX_BULK_RETRIES ensures failed operations are retried

Increasing bulkChunkSize from 200 to 10,000 aligns with the documentation comment about optimal bulk sizes

Consider verifying that these increased values don't cause memory pressure issues in your environment:

Also applies to: 29-29

🏁 Script executed:
#!/bin/bash
# Check if there are any timeout errors in the logs related to OpenSearch bulk operations
rg -i "timeout|opensearch.*error|econnreset" --glob "*.log"
Length of output: 210
Approve code changes – updated timeout and retry settings

The adjustments look solid:

Timeout increased from 1 to 2 minutes for larger bulk operations

Added MAX_BULK_RETRIES to transparently retry failures

Bulk chunk size raised to 10,000 per the optimal-size comment

I ran rg -i 'timeout|opensearch.*error|econnreset' --glob '*.log', but no log files were present in the repo. Please verify in your deployment environment that these increased settings don’t introduce memory pressure or timeout issues during OpenSearch bulk ingestion.

Relevant location:

packages/core/src/external/opensearch/text-ingestor-direct.ts (lines 19–20, 29)

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

256-272: Improve error capturing in Sentry.

When sending errors to Sentry, it would be better to provide structured data rather than stringifying the error map. This was previously suggested in a past review comment.

function captureErrors({
  cxId,
  patientId,
  errors,
  log,
}: {
  cxId: string;
  patientId: string;
  errors: Map<string, number>;
  log: typeof console.log;
}) {
  const errorMapToObj = Object.fromEntries(errors.entries());
  log(`Errors: `, () => JSON.stringify(errorMapToObj));
  capture.error("Errors ingesting resources into OpenSearch", {
-    extra: { cxId, patientId, countPerErrorType: JSON.stringify(errorMapToObj, null, 2) },
+    extra: { 
+      cxId, 
+      patientId, 
+      countPerErrorType: errorMapToObj,
+      totalErrorCount: Array.from(errors.values()).reduce((sum, count) => sum + count, 0),
+    },
  });
}

🧹 Nitpick comments (2)

packages/core/src/external/opensearch/text-ingestor-direct.ts (2)
19-20: Consider using plural form for time unit.

The singular form "minute" in dayjs.duration(2, "minute") should be "minutes" for consistency with dayjs API conventions.
-const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minute").asMilliseconds();
+const DEFAULT_BULK_INGESTION_TIMEOUT = dayjs.duration(2, "minutes").asMilliseconds();
241-254: Fix typo in comment.

There's a typo in the function comment.
-* NOTE: Assumes all requeests are for the same index and operation!
+* NOTE: Assumes all requests are for the same index and operation!

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between f3c604b and 9bbda99.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

packages/utils/src/open-search/semantic-ingest.ts
packages/core/src/external/opensearch/semantic/ingest.ts
packages/utils/src/shared/sentry.ts
packages/core/src/external/opensearch/bulk.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to narrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/text-ingestor-direct.ts

🧬 Code Graph Analysis (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (4)

packages/core/src/external/opensearch/file-ingestor.ts (1)

IngestRequest (3-7)

packages/core/src/external/opensearch/index.ts (1)

contentFieldName (7-7)

packages/core/src/external/opensearch/bulk.ts (4)

OnBulkItemError (23-23)

processErrorsFromBulkResponse (33-55)

BulkOperation (3-3)

BulkResponseErrorItem (15-21)

packages/lambdas/src/shared/capture.ts (1)

capture (18-103)

🔇 Additional comments (6)

packages/core/src/external/opensearch/text-ingestor-direct.ts (6)

8-13: Well-organized imports for new bulk error handling capabilities.

The new imports from the "./bulk" module bring in the necessary types and functions for enhanced error handling in bulk operations, which aligns well with the PR objective of improving OpenSearch bulk import error handling.

29-29: Significant increase in bulk chunk size is well-documented.

The increase from 200 to 10,000 is substantial but justified by the detailed comment above explaining the rationale based on Elasticsearch's guidelines and observed resource sizes.

34-45: Good refactoring of types to support error handling.

The extraction of IngestRequestResource from IngestRequest and the addition of the optional onItemError callback in IngestBulkRequest improve type safety and enable granular error handling, which is essential for the PR's goals.

115-155: Improved bulk ingestion with robust error tracking.

The refactored ingestBulk method now properly tracks errors by type, provides clear documentation, and includes error reporting through Sentry. This implementation effectively addresses the PR objective of grouping similar errors with their counts.

157-195: Enhanced internal bulk ingestion with error handling.

The ingestBulkInternal method now uses the processErrorsFromBulkResponse function for better error extraction and propagates errors through the callback mechanism. The code is more modular and easier to maintain.

206-227: Well-structured helper function for building bulk requests.

The resourceToBulkRequest function cleanly separates the logic for creating bulk request entries, making the code more maintainable and easier to test.

Ref eng-41 Signed-off-by: Rafael Leite <2132564+leite08@users.noreply.github.com>

- group OS bulk errors to Sentry - store individual errors in NDJON when run from utils Ref eng-41 Signed-off-by: Rafael Leite <2132564+leite08@users.noreply.github.com>

Ref eng-268 Signed-off-by: Rafael Leite <2132564+leite08@users.noreply.github.com>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)
256-272: Consider structured data in Sentry

While the current implementation works, consider providing structured data to Sentry instead of JSON stringifying the error map. This would make errors more searchable and filterable in Sentry.
  const errorMapToObj = Object.fromEntries(errors.entries());
  log(`Errors: `, () => JSON.stringify(errorMapToObj));
  capture.error("Errors ingesting resources into OpenSearch", {
-    extra: { cxId, patientId, countPerErrorType: JSON.stringify(errorMapToObj, null, 2) },
+    extra: { 
+      cxId, 
+      patientId, 
+      countPerErrorType: errorMapToObj,
+      totalErrorCount: Array.from(errors.values()).reduce((sum, count) => sum + count, 0),
+    },
  });

🧹 Nitpick comments (1)

packages/core/src/external/opensearch/text-ingestor-direct.ts (1)

140-150: Consider adding error handling for the chunk processing loop

While the error counting mechanism works well, the code could benefit from try/catch handling around the chunk processing loop to ensure that errors in one chunk don't prevent processing of subsequent chunks.

    const chunks = chunk(resources, bulkChunkSize);
+   let errorsInChunks = false;
    for (const resourceChunk of chunks) {
+     try {
        const errorCountChunk = await this.ingestBulkInternal({
          cxId,
          patientId,
          resources: resourceChunk,
          indexName,
          client,
          onItemError: _onItemError,
        });
        errorCount += errorCountChunk;
+     } catch (error) {
+       errorsInChunks = true;
+       log(`Error processing chunk: ${error.message}`);
+       capture.error("Error processing chunk in OpenSearch bulk import", {
+         extra: { cxId, patientId, error },
+         cause: error,
+       });
+     }
    }
+   
+   if (errorsInChunks) {
+  
1036C
   log("Some chunks failed to process, see errors above");
+   }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ad3fc7 and 23c850f.

📒 Files selected for processing (6)

packages/core/src/external/opensearch/bulk.ts (1 hunks)
packages/core/src/external/opensearch/semantic/ingest.ts (3 hunks)
packages/core/src/external/opensearch/text-ingestor-direct.ts (6 hunks)
packages/utils/src/open-search/semantic-ingest.ts (2 hunks)
packages/utils/src/open-search/semantic-text-from-consolidated.ts (1 hunks)
packages/utils/src/shared/sentry.ts (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (5)

packages/utils/src/open-search/semantic-ingest.ts
packages/utils/src/open-search/semantic-text-from-consolidated.ts
packages/core/src/external/opensearch/semantic/ingest.ts
packages/utils/src/shared/sentry.ts
packages/core/src/external/opensearch/bulk.ts

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.ts`: - Use the Onion Pattern to organize a package's code in layers - Try to use immutable code and avoid sharing state across different functions, objects, and systems - Try...

**/*.ts: - Use the Onion Pattern to organize a package's code in layers

Try to use immutable code and avoid sharing state across different functions, objects, and systems

Try to build code that's idempotent whenever possible

Prefer functional programming style functions: small, deterministic, 1 input, 1 output

Minimize coupling / dependencies

Avoid modifying objects received as parameter

Only add comments to code to explain why something was done, not how it works

Naming

classes, enums: PascalCase

constants, variables, functions: camelCase

file names: kebab-case

table and column names: snake_case

Use meaningful names, so whoever is reading the code understands what it means

Don’t use negative names, like notEnabled, prefer isDisabled

For numeric values, if the type doesn’t convey the unit, add the unit to the name

Typescript

Use types

Prefer const instead of let

Avoid any and casting from any to other types

Type predicates: only applicable to narrow down the type, not to force a complete type conversion

Prefer deconstructing parameters for functions instead of multiple parameters that might be of
the same type

Don’t use null inside the app, only on code interacting with external interfaces/services,
like DB and HTTP; convert to undefined before sending inwards into the code

Use async/await instead of .then()

Use the strict equality operator ===, don’t use abstract equality operator ==

When calling a Promise-returning function asynchronously (i.e., not awaiting), use .catch() to
handle errors (see processAsyncError and emptyFunction depending on the case)

Date and Time

Always use buildDayjs() to create dayjs instances

Prefer dayjs.duration(...) to create duration consts and keep them as duration

Prefer Nullish Coalesce (??) than the OR operator (||) to provide a default value

Avoid creating arrow functions

Use truthy syntax instead of in - i.e., if (data.link) not if ('link' in data)

Error handling

Pass the original error as the new one’s cause so the stack trace is persisted

Error messages should have a static message - add dynamic data to MetriportError's additionalInfo prop

Avoid sending multiple events to Sentry for a single error

Global constants and variables

Move literals to constants declared after imports when possible (avoid magic numbers)

Avoid shared, global objects

Avoid using console.log and console.error in packages other than utils, infra and shared,
and try to use out().log instead

Avoid multi-line logs

don't send objects as a second parameter to console.log() or out().log()

don't create multi-line strings when using JSON.stringify()

Use eslint to enforce code style

Use prettier to format code

max column length is 100 chars

multi-line comments use /** */

scripts: top-level comments go after the import

packages/core/src/external/opensearch/text-ingestor-direct.ts

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: check-pr / lint-build-test
GitHub Check: check-pr / lint-build-test
GitHub Check: Analyze (javascript)

🔇 Additional comments (13)

packages/core/src/external/opensearch/text-ingestor-direct.ts (13)

8-13: Good use of structured imports for error handling

The addition of these imports from the bulk module provides better organization and separation of concerns for error handling functionality.

19-20: Improved configuration for bulk operations

Increasing the timeout to 2 minutes and adding a maximum retry count of 3 enhances the robustness of bulk operations, particularly for larger datasets.

29-29: Significant increase in bulk chunk size

The increase to 10,000 items per chunk aligns well with the explanatory comment about optimal bulk sizes. This should improve throughput while staying within recommended limits.

34-39: Good type refactoring

Splitting IngestRequest into two types enhances clarity and follows the principle of separation of concerns. This makes the code more maintainable.

43-44: Improved flexibility with optional error callback

Adding the optional error callback aligns with the PR objective of enhancing error handling during bulk operations.

115-124: Well-documented method signature

The JSDoc comments clearly explain the purpose and parameters of the method, which aids in maintainability and follows coding guidelines.

135-138: Effective error tracking implementation

Initializing an error count and map with a default error handler allows for grouping similar errors, directly addressing the PR objectives.

153-154: Good error reporting approach

Only capturing errors when they exist and providing a count in the log message aligns with the PR objectives of improving error reporting.

175-182: Improved code organization with helper function

Extracting the bulk request creation logic to a helper function follows the functional programming principle from the coding guidelines and makes the code more maintainable.

186-187: Enhanced reliability with timeout and retries

Setting an appropriate timeout and adding retries for bulk operations improves reliability, especially for larger datasets or when experiencing transient network issues.

191-194: Effective error processing and logging

Using the specialized error processing function and returning the error count provides clear visibility into operation success/failure rates.

206-227: Well-structured helper function

The resourceToBulkRequest function follows functional programming principles by being small, deterministic, and having clear inputs and outputs as recommended in the coding guidelines.

249-254: Effective error counting implementation

The buildOnItemError function provides a clean way to count errors by type, supporting the PR objective of grouping similar errors.

leite08 force-pushed the eng-41-improve-bulk-ingest branch 2 times, most recently from e78869e to 781f6bf Compare May 8, 2025 17:53

leite08 changed the title ~~ENG-265 Adj OS bulk import error handling - WIP~~ ENG-265 Adj OS bulk import error handling May 9, 2025

leite08 marked this pull request as ready for review May 9, 2025 01:33

leite08 requested review from Goncharo, Orta21, RamilGaripov, thomasyopes and lucasdellabella as code owners May 9, 2025 01:33

coderabbitai bot reviewed May 9, 2025

View reviewed changes

lucasdellabella reviewed May 9, 2025

View reviewed changes

packages/utils/src/open-search/semantic-ingest.ts Show resolved Hide resolved

lucasdellabella reviewed May 9, 2025

View reviewed changes

packages/core/src/external/opensearch/bulk.ts Outdated Show resolved Hide resolved

lucasdellabella approved these changes May 9, 2025

View reviewed changes

leite08 force-pushed the eng-41-improve-bulk-ingest branch from 7eca2da to 6eef068 Compare May 10, 2025 17:15

coderabbitai bot reviewed May 10, 2025

View reviewed changes

leite08 changed the title ~~ENG-265 Adj OS bulk import error handling~~ ENG-41 Adj OS bulk import error handling May 10, 2025

leite08 marked this pull request as draft May 12, 2025 13:22

leite08 mentioned this pull request May 12, 2025

ENG-268 Consolidated search by keyword #3814

Merged

6 tasks

leite08 force-pushed the eng-41-improve-bulk-ingest branch 4 times, most recently from e6b29ee to 358f4a6 Compare May 16, 2025 20:24

leite08 marked this pull request as ready for review May 17, 2025 00:54

leite08 force-pushed the eng-41-improve-bulk-ingest branch from 358f4a6 to 6bef7a4 Compare May 17, 2025 00:54

coderabbitai bot reviewed May 17, 2025

View reviewed changes

leite08 force-pushed the eng-41-improve-bulk-ingest branch from 6bef7a4 to b39caab Compare May 17, 2025 22:13

coderabbitai bot reviewed May 17, 2025

View reviewed changes

leite08 force-pushed the eng-41-improve-bulk-ingest branch from b39caab to f3c604b Compare May 20, 2025 13:29

coderabbitai bot reviewed May 20, 2025

View reviewed changes

leite08 force-pushed the eng-41-improve-bulk-ingest branch from f3c604b to 9bbda99 Compare May 21, 2025 13:01

coderabbitai bot reviewed May 21, 2025

View reviewed changes

leite08 force-pushed the eng-41-improve-bulk-ingest branch from 9bbda99 to 7ad3fc7 Compare May 22, 2025 20:07

leite08 added 3 commits May 23, 2025 10:51

refactor: adj OS bulk import error handling

680ee2a

Ref eng-41 Signed-off-by: Rafael Leite <2132564+leite08@users.noreply.github.com>

refactor: group OS bulk errors to Sentry +

fe79602

- group OS bulk errors to Sentry - store individual errors in NDJON when run from utils Ref eng-41 Signed-off-by: Rafael Leite <2132564+leite08@users.noreply.github.com>

refactor(core): small adj from PR comments

23c850f

Ref eng-268 Signed-off-by: Rafael Leite <2132564+leite08@users.noreply.github.com>

leite08 force-pushed the eng-41-improve-bulk-ingest branch from 7ad3fc7 to 23c850f Compare May 23, 2025 14:17

coderabbitai bot reviewed May 23, 2025

View reviewed changes

leite08 added this pull request to the merge queue May 23, 2025

Merged via the queue into develop with commit 41fb768 May 23, 2025
15 checks passed

leite08 deleted the eng-41-improve-bulk-ingest branch May 23, 2025 20:43

leite08 mentioned this pull request May 23, 2025

RELEASE ENG-268 Updates to GlobalSearch Plumbing / ENG-198 ECW Backend #3892

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ENG-41 Adj OS bulk import error handling #3802

ENG-41 Adj OS bulk import error handling #3802

Uh oh!

Uh oh!

Uh oh!

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ENG-41 Adj OS bulk import error handling #3802

ENG-41 Adj OS bulk import error handling #3802

Uh oh!

Conversation

Uh oh!

Dependencies

Description

Testing

Release Plan

Summary by CodeRabbit

Summary by CodeRabbit

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

CodeRabbit Configuration File (`.coderabbit.yaml`)