retry connection lost by giurgiur99 · Pull Request #1213 · oceanprotocol/ocean-node

giurgiur99 · 2026-02-19T15:31:56Z

Fixes #1211

Changes proposed in this PR:

Retry es connection when db is down
Fix guards for fees object
Allow envs with no free resources
Add debounce for multiple fired events
Stop indexer on lost connection
Adjust config timeouts

alexcos20 · 2026-02-20T11:05:01Z

/run-security-scan

alexcos20

AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request significantly enhances the resilience of the Ocean Node's Elasticsearch connection. It introduces an event-driven mechanism to detect and react to database connection loss and restoration. The OceanIndexer now listens for these events, gracefully stopping and restarting chain indexers as the database connection fluctuates. The ElasticsearchClientSingleton has been refactored to remove its explicit retry loop in favor of continuous health monitoring that attempts single reconnections and emits events. Default Elasticsearch configuration parameters (timeouts, health check intervals) have also been adjusted to be more aggressive, leading to quicker detection of connection issues.

Comments:
• [ERROR][bug] The variable numCrawlAttempts is used here (numCrawlAttempts = 0) but does not appear to be defined or imported within this file's scope. This will lead to a runtime error. It is likely defined in src/components/Indexer/utils.ts and needs to be explicitly imported if it's a mutable shared variable. Please ensure numCrawlAttempts is correctly defined or imported.
• [WARNING][bug] The logic for sniffOnStart has been changed from process.env.ELASTICSEARCH_SNIFF_ON_START !== 'false' to process.env.ELASTICSEARCH_SNIFF_ON_START === 'false'. This inverts the default behavior. Previously, sniffOnStart was true by default unless explicitly set to 'false'. With the current change, if ELASTICSEARCH_SNIFF_ON_START is undefined or any value other than the exact string 'false', sniffOnStart will become false. This might unintentionally disable sniffing on start, which can impact the client's ability to discover available nodes. Please confirm if this inversion is intended; if not, consider reverting to !== 'false' or using process.env.ELASTICSEARCH_SNIFF_ON_START === 'true' for explicit opt-in.
• [WARNING][bug] Similar to sniffOnStart, the sniffInterval logic has changed from defaulting to 30000 (unless explicitly 'false') to process.env.ELASTICSEARCH_SNIFF_INTERVAL === 'true' ? 30000 : false. This means sniffInterval will be false by default if the environment variable is not explicitly set to 'true'. This also inverts the previous default behavior, potentially disabling periodic sniffing of cluster state. Please confirm if this change in default behavior is intentional.
• [INFO][other] When !isHealthy, the code attempts this.client.close() within a try-catch block that silences any errors. While client.close() is generally robust, it's good practice to log any errors that might occur during the close operation, even if they are ultimately handled by setting client = null. This can aid in debugging unexpected behavior.
• [INFO][performance] The removal of the startRetryConnection method with its exponential backoff and maxRetries in favor of a continuous fixed-interval retry within startHealthMonitoring represents a significant change in the reconnection strategy. The new approach continuously attempts reconnection at the healthCheckInterval rate until success, which is generally robust for transient and prolonged outages. However, it means the system will keep trying indefinitely without ever giving up, which could potentially consume resources during unrecoverable errors. This is a valid design choice, but it changes the resource consumption profile and error handling for persistent connection failures.

giurgiur99 · 2026-02-20T12:47:47Z

/run-security-scan

alexcos20

AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This PR significantly refactors Elasticsearch connection management to enhance robustness and introduce an event-driven mechanism. It replaces a previous explicit retry loop with a continuous health monitoring system that emits CONNECTION_LOST and CONNECTION_RESTORED events. The OceanIndexer now subscribes to these events to gracefully stop and restart all chain indexers upon database connectivity changes. Default Elasticsearch client configuration values (timeouts, max retries, health check interval) have also been adjusted for faster detection and response to connection issues.

Comments:
• [INFO][other] The numCrawlAttempts = 0 reset upon CONNECTION_RESTORED is a good addition to ensure the indexer gets a fresh start. It implicitly assumes numCrawlAttempts is a global variable or part of a shared state that needs to be cleared for proper re-initialization. Could add a comment clarifying its purpose if it's not immediately obvious to a new contributor.
• [INFO][performance] The getDatabase(true) call here is crucial for ensuring the Indexer uses a fresh database client instance after a reconnection. This leverages the new singleton behavior in ElasticsearchConfigHelper.ts that allows re-initialization. The comment for ES_CONNECTION_EVENTS.on(DB_EVENTS.CONNECTION_RESTORED) clarifies this in the other file. Adding a small comment here too, explaining why true is passed to getDatabase in this context, would make the code clearer.
• [WARNING][style] Good use of parseInt(..., 10) || 30000 for sniffInterval to ensure a valid number and handle potential NaN from parseInt. For consistency and robustness, consider applying this pattern (parseInt(process.env.VAR || 'default_value', 10) || default_value) to other parseInt calls within this DEFAULT_ELASTICSEARCH_CONFIG object as well (e.g., requestTimeout, pingTimeout, maxRetries, healthCheckInterval). This ensures that if an environment variable is set to a non-numeric string, it gracefully falls back to the default instead of resulting in NaN.
• [INFO][other] The comment explaining why an extra ping is skipped is very helpful. This prevents potential false errors and redundant operations, especially during rapid reconnect cycles.
• [INFO][performance] Changing return this.startRetryConnection(...) to throw new Error(...) is a key part of the new design. It ensures that initial calls to getClient() fail fast if the database is unhealthy, allowing the new event-driven health monitoring to take over reconnection. This is a sound change given the new architecture.
• [INFO][other] The if (this.isRetrying) { return } check within the setInterval callback is crucial for preventing multiple concurrent reconnection attempts if a health check fires while a reconnection is already in progress. This helps maintain a single source of truth for the connection state.
• [INFO][other] The connectionLostEmitted flag is well-placed to ensure that DB_EVENTS.CONNECTION_LOST is emitted only once per actual connection loss, avoiding redundant events for components listening to them.

This reverts commit 325a29f.

This reverts commit f90c3c8.

giurgiur99 · 2026-02-24T19:08:28Z

/run-security-scan

alexcos20

AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request introduces a significant refactor of the Elasticsearch connection management, transitioning to an event-driven health monitoring and reconnection strategy. The Indexer component now subscribes to these events to dynamically stop and restart indexers based on DB connection status. Additionally, it includes an ethers dependency update and improves robustness in C2D fee and resource handling by making configurations more resilient to missing or optional values.

Comments:
• [INFO][other] Updated ethers dependency from ^6.8.1 to ^6.16.0. This is a routine update, but please ensure that all related blockchain interactions and contract calls are still functioning as expected after the upgrade, as minor version bumps can sometimes introduce subtle breaking changes or behavior modifications.
• [INFO][other] The new setupDbConnectionListeners method introduces a robust, event-driven mechanism for handling Elasticsearch connection losses and restorations. This is a significant improvement for the Indexer's resilience. Ensure that the stopAllChainIndexers and startAllChainIndexers methods are idempotent and can be called safely multiple times.
• [WARNING][style] There's a contradiction in the comment for interval. It says default 2 secs but the value is 5000 (5 seconds). Please clarify or correct the comment to match the code's default value.
• [INFO][other] The introduction of EventEmitter and DB_EVENTS for Elasticsearch connection events is a clean architectural improvement, enabling better separation of concerns and reactive handling of DB status changes in other components like the Indexer.
• [WARNING][performance] The default ELASTICSEARCH_REQUEST_TIMEOUT has been reduced from 60000 (60s) to 30000 (30s), ELASTICSEARCH_PING_TIMEOUT from 5000 to 3000, ELASTICSEARCH_MAX_RETRIES from 5 to 3, and ELASTICSEARCH_HEALTH_CHECK_INTERVAL from 60000 to 15000. These changes significantly alter the default behavior under connection issues and load. Please ensure these new, more aggressive timeouts/retries are appropriate for typical deployment environments and won't lead to premature disconnections under transient network stress.
• [INFO][other] The modification in getClient() to return early if this.isMonitoring is true helps prevent redundant health checks during re-initialization, which could otherwise cause race conditions or unnecessary load, especially during recovery phases. This is a good optimization.
• [INFO][other] The refactor of startHealthMonitoring to directly manage single reconnection attempts and emit CONNECTION_LOST/CONNECTION_RESTORED events is a fundamental change to the retry mechanism. This moves from a blocking, exponential backoff retry to a continuous monitoring approach. Extensive testing of various DB outage scenarios (brief, prolonged, intermittent) is crucial to validate its effectiveness and stability.
• [INFO][other] The connectionLostEmitted flag is a good addition to prevent duplicate CONNECTION_LOST events from being emitted during a continuous outage, ensuring that subscribers react only once per distinct disconnection event.
• [INFO][other] Making the fees property optional in C2DDockerConfigSchema and updating the refine condition to require either fees or free configuration is a logical improvement. This allows for more flexible compute environment definitions where 'free' compute might be the primary offering.

giurgiur99 added 4 commits February 19, 2026 14:50

retry connection lost

4f2c968

temporary debug check

4394210

revert nonce comment

6b84b8d

Merge branch 'main' into 1211-enhance-retry-es-connection

17f09cf

giurgiur99 marked this pull request as ready for review February 20, 2026 09:14

giurgiur99 requested review from alexcos20, andreip136, bogdanfazakas, denisiuriet and ndrpp as code owners February 20, 2026 09:14

alexcos20 reviewed Feb 20, 2026

View reviewed changes

revert sniff on start logic and log err

b64ed20

alexcos20 reviewed Feb 20, 2026

View reviewed changes

giurgiur99 added 2 commits February 24, 2026 12:37

log errs

47ac976

free only env and allow no fees object

36bbd8c

giurgiur99 added the Status: DoNotMerge label Feb 24, 2026

giurgiur99 added 12 commits February 24, 2026 14:48

rpc err log

104a1d2

timeout rpc

325a29f

Revert "timeout rpc"

687200d

This reverts commit 325a29f.

log provider

f90c3c8

Revert "log provider"

469f2ff

This reverts commit f90c3c8.

start api before indexer

e32aeef

start indexer later

fabc740

timeout rpc

912c071

catch no rpcs

2f719d9

update deps

d205ca1

start indexer in bg

4b727df

revert unused changes

0b5183d

revert try catch network height

e14a719

giurgiur99 removed the Status: DoNotMerge label Feb 24, 2026

alexcos20 reviewed Feb 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

retry connection lost#1213

retry connection lost#1213
giurgiur99 wants to merge 20 commits intomainfrom
1211-enhance-retry-es-connection

giurgiur99 commented Feb 19, 2026 •

edited

Loading

Uh oh!

alexcos20 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Uh oh!

giurgiur99 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Uh oh!

giurgiur99 commented Feb 24, 2026

Uh oh!

alexcos20 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

giurgiur99 commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcos20 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Choose a reason for hiding this comment

Uh oh!

giurgiur99 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Choose a reason for hiding this comment

Uh oh!

giurgiur99 commented Feb 24, 2026

Uh oh!

alexcos20 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

giurgiur99 commented Feb 19, 2026 •

edited

Loading