
perf(notion_datasource): speed up get_authorized_pages for large work…#3171

Open
Kota-Maeda wants to merge 3 commits into langgenius:main from Kota-Maeda:perf/notion-datasource-large-workspace

Conversation

@Kota-Maeda
Contributor

Summary

Fixes #3170.

langgenius/notion_datasource@0.1.18 cannot enumerate pages on Notion workspaces with more than ~1k shared items. get_authorized_pages() runs three phases serially: search pages, search databases, then resolve every parent block (recursively and without a cache). On large workspaces this routinely exceeds the plugin-daemon SSE deadline (PLUGIN_MAX_EXECUTION_TIMEOUT, default 600s), and the request is killed by timeout.

This PR rewrites the hot path without changing the external behavior:

  • Parallel + memoized parent resolution. Resolve block_id parents in a ThreadPoolExecutor (size configurable via the env var NOTION_PARENT_RESOLVE_WORKERS, default 8, clamped to [1, 32]). Results are memoized in a thread-safe dict so sibling pages that share an ancestor don't re-issue the same HTTP request.
  • Single /v1/search loop. The previous two filtered loops (one for object="page", one for object="database") are replaced with a single un-filtered pass; items are dispatched by object type. Halves the number of search round-trips.
  • Unified retries. The three direct requests.post/get call sites are routed through _make_request, which already handles 429 / transient 5xx with backoff. _make_request gains an allow_status parameter so the existing "404 on a parent block → treat as root" behavior is preserved without re-introducing a direct request.
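To illustrate the first bullet, here is a minimal sketch of parallel, memoized parent resolution. It is not the code in this PR: `ParentResolver` and `fetch_parent` are hypothetical names, and `fetch_parent` stands in for whatever HTTP call looks up a block's parent.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ParentResolver:
    """Resolve block parents concurrently and cache the answers across threads."""

    def __init__(self, fetch_parent, max_workers=8):
        # fetch_parent is a hypothetical callable: block_id -> parent id.
        self._fetch_parent = fetch_parent
        self._max_workers = max_workers
        self._cache = {}
        self._lock = threading.Lock()

    def resolve(self, block_id):
        # Check the shared cache under the lock so reads and writes never interleave.
        with self._lock:
            if block_id in self._cache:
                return self._cache[block_id]
        # The slow call runs outside the lock so other workers are not blocked.
        # Two workers that miss on the same id at the same moment can both
        # fetch it; this sketch does not coalesce in-flight lookups.
        parent = self._fetch_parent(block_id)
        with self._lock:
            self._cache[block_id] = parent
        return parent

    def resolve_many(self, block_ids):
        block_ids = list(block_ids)
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            # pool.map returns results in input order, whatever the completion order.
            return dict(zip(block_ids, pool.map(self.resolve, block_ids)))
```

Holding the lock only around dictionary access keeps the network calls concurrent, at the cost of occasional duplicate fetches for the same id.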

Backwards compatibility:

  • notion_page_search, notion_database_search, notion_block_parent_page_id are kept as thin wrappers delegating to the new internals, so any external caller keeps working.
  • OnlineDocumentPage field values are unchanged. The only observable difference is the ordering of the returned list (parallel completion order rather than insertion order), which downstream code does not rely on.

Change Type

  • Documentation / non-plugin change
  • Non-LLM plugin (tools, extensions, datasource, etc.)
  • LLM plugin

Screenshots / Videos

This PR is an internal performance fix with no observable UI change.
Before/After behavior is shown below as plugin-daemon log excerpts.

Before — request killed by the 600s SSE deadline:

plugin_daemon | ERROR ... PluginDaemonInternalServerError error="killed by timeout"
plugin_daemon | HTTP request POST /plugin/<tenant>/dispatch/datasource/get_online_document_pages
              | status=200 latency_ms=600003
api           | ERROR Exception on /console/api/notion/pre-import/pages
              |   httpcore.ReadTimeout: timed out
              |   httpx.ReadTimeout: timed out

After — same workspace, same credential, completes well inside the deadline:

plugin_daemon | HTTP request POST /plugin/<tenant>/dispatch/datasource/get_online_document_pages
              | status=200 latency_ms=63151   # ← 63 s, no timeout

LLM Plugin Checklist

Not applicable — this is a datasource plugin.

Version

  • Bumped top-level version in manifest.yaml (0.1.18 → 0.1.19, not the one under meta)
  • dify_plugin>=0.5.0 is declared in pyproject.toml and locked in uv.lock

A note on the template wording: the template suggests the literal range dify_plugin>=0.3.0,<0.6.0, but adding an upper bound of <0.6.0 makes the lock unsatisfiable in this plugin — dify_plugin 0.5.x pins werkzeug<3.1.dev0, while this plugin already required werkzeug>=3.1.7 before this PR. The current spec (>=0.5.0, no upper bound) matches the convention used by ~all other datasource plugins in this repo (confluence, dropbox, github, gitlab, onedrive, sharepoint, etc.). uv.lock is left as-is on upstream/main.

Patch-level bump rationale:

  • public API of the plugin is unchanged
  • no new manifest.yaml capabilities or provider/notion_datasource.yaml fields
  • purely an internal performance/reliability fix

Testing

  • Local deployment — Dify version: 1.14.1 (self-hosted Docker, plugin-daemon 0.6.0-local)

Verified manually against a multi-thousand-item Notion workspace that was previously hitting the 600s SSE deadline. After this change the same enumeration completes in ~60 seconds end to end with the default worker count. No regressions observed on a small (< 100 items) workspace.

Configuration

| Variable | Default | Range | Effect |
| --- | --- | --- | --- |
| NOTION_PARENT_RESOLVE_WORKERS | 8 | 1–32 (out-of-range values fall back / are clamped) | Worker count for the parallel parent-block resolution. 1 reproduces the old serial behavior. Increase cautiously: Notion's public rate limit is roughly 3 req/s, and exceeding it just amplifies 429 backoff. |
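As a rough illustration of how such a setting could be read, the sketch below parses the variable and clamps it to the documented range. The helper name `parent_resolve_workers` is hypothetical, and falling back to the default on an unparsable value is an assumption; the PR text only says out-of-range values "fall back / are clamped".

```python
import os

DEFAULT_WORKERS = 8
MIN_WORKERS, MAX_WORKERS = 1, 32


def parent_resolve_workers(environ=os.environ):
    """Return the worker count from NOTION_PARENT_RESOLVE_WORKERS, clamped to [1, 32]."""
    raw = environ.get("NOTION_PARENT_RESOLVE_WORKERS")
    if raw is None:
        return DEFAULT_WORKERS
    try:
        value = int(raw.strip())
    except ValueError:
        # Assumption: a value that is not an integer falls back to the default.
        return DEFAULT_WORKERS
    # Clamp instead of failing, so a typo cannot disable the datasource.
    return max(MIN_WORKERS, min(MAX_WORKERS, value))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.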

@dosubot dosubot Bot added the size:L label May 20, 2026
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 20, 2026 12:01 — with GitHub Actions Inactive
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Notion client to optimize workspace enumeration by combining page and database searches into a single pass and resolving parent IDs concurrently using a thread pool with memoization. Feedback highlights that error handling in the parent resolution logic should be broadened to prevent crashes from network errors, and the memoization implementation currently allows redundant I/O due to a race condition. Additionally, the shift to in-memory filtering for specific search methods may lead to performance regressions in large workspaces.

Comment thread datasources/notion_datasource/datasources/utils/notion_client.py
Comment thread datasources/notion_datasource/datasources/utils/notion_client.py Outdated
Comment thread datasources/notion_datasource/datasources/utils/notion_client.py Outdated
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 20, 2026 12:17 — with GitHub Actions Inactive
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from 7fb2b2d to 48d9e1c Compare May 21, 2026 06:36
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 21, 2026 06:37 — with GitHub Actions Inactive
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from 48d9e1c to 2536d6e Compare May 25, 2026 01:00
@Kota-Maeda Kota-Maeda had a problem deploying to datasources/notion_datasource May 25, 2026 01:01 — with GitHub Actions Failure
Kota-Maeda added a commit to Kota-Maeda/dify-official-plugins that referenced this pull request May 25, 2026
PR langgenius#3192 already published 0.1.19 to the marketplace, so the
"Check If Version Exists" CI step on PR langgenius#3171 fails because the
plugin's manifest still claims 0.1.19. Bumping to 0.1.20 reserves a
fresh version slot for the perf changes in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 25, 2026 01:32 — with GitHub Actions Inactive
Kota-Maeda added a commit to Kota-Maeda/dify-official-plugins that referenced this pull request May 25, 2026
PR langgenius#3192 already published 0.1.19 to the marketplace, so the
"Check If Version Exists" CI step on PR langgenius#3171 fails because the
plugin's manifest still claims 0.1.19. Bumping to 0.1.20 reserves a
fresh version slot for the perf changes in this PR.
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from dec6b61 to 8861f7c Compare May 25, 2026 01:34
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 25, 2026 01:34 — with GitHub Actions Inactive
Kota-Maeda added a commit to Kota-Maeda/dify-official-plugins that referenced this pull request May 26, 2026
PR langgenius#3192 already published 0.1.19 to the marketplace, so the
"Check If Version Exists" CI step on PR langgenius#3171 fails because the
plugin's manifest still claims 0.1.19. Bumping to 0.1.20 reserves a
fresh version slot for the perf changes in this PR.
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from 8861f7c to 380f5cb Compare May 26, 2026 01:25
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 26, 2026 01:26 — with GitHub Actions Inactive
@Kota-Maeda
Contributor Author

@cazziwork Could you confirm this PR?

@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from 380f5cb to 2c60d54 Compare May 28, 2026 00:57
@Kota-Maeda Kota-Maeda had a problem deploying to datasources/notion_datasource May 28, 2026 00:57 — with GitHub Actions Failure
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 28, 2026 01:05 — with GitHub Actions Inactive
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from dd34325 to e3bf9d1 Compare May 28, 2026 01:10
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 28, 2026 01:10 — with GitHub Actions Inactive
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from e3bf9d1 to 1c553f8 Compare May 28, 2026 09:19
@Kota-Maeda Kota-Maeda temporarily deployed to datasources/notion_datasource May 28, 2026 09:20 — with GitHub Actions Inactive
…spaces

- Parallelize parent block resolution with ThreadPoolExecutor (configurable via NOTION_PARENT_RESOLVE_WORKERS, default 8) and memoize lookups with a thread-safe cache so shared ancestors are not refetched.
- Replace the two filtered /v1/search loops (one for pages, one for databases) with a single un-filtered pass dispatched by object type, halving the search round-trips.
- Route the three direct requests.* call sites through _make_request so that 429 / transient 5xx are retried uniformly. _make_request now accepts allow_status to preserve the existing 404 -> root fallback for inaccessible parent blocks.
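The allow_status idea in the last bullet can be sketched as follows. This is not the plugin's _make_request: `request_with_retry`, `send`, the retryable status set, and the backoff schedule are all assumptions chosen for illustration.

```python
import time

# Assumption: which statuses count as transient is a guess, not taken from the PR.
RETRYABLE = {429, 500, 502, 503, 504}


def request_with_retry(send, allow_status=(), max_attempts=4, base_delay=0.5,
                       sleep=time.sleep):
    """Call send() until it succeeds, returns an allowed status, or attempts run out.

    send is a hypothetical zero-argument callable returning an object
    with a .status_code attribute (for example a requests.Response).
    """
    for attempt in range(max_attempts):
        response = send()
        status = response.status_code
        # Statuses listed in allow_status are handed back to the caller,
        # e.g. a 404 on a parent block that the caller treats as "root".
        if status < 400 or status in allow_status:
            return response
        if status in RETRYABLE and attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
            continue
        raise RuntimeError(f"request failed with status {status}")
    raise RuntimeError("max_attempts must be at least 1")
```

With this shape, a caller resolving a parent block could pass `allow_status={404}` and inspect the response itself, while every other call site keeps the default strict behavior.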

Public method signatures (notion_page_search, notion_database_search, notion_block_parent_page_id) are preserved as thin wrappers so existing callers keep working.
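For the single un-filtered /v1/search pass described in this commit message, a minimal sketch might look like the following. `enumerate_workspace` and `search` are hypothetical names; `search` stands in for a POST to /v1/search and is assumed to return the usual paginated body with `results`, `has_more`, and `next_cursor`.

```python
def enumerate_workspace(search):
    """Walk every search result once and split items by their object type.

    search is a hypothetical callable: request payload (dict) -> response body (dict).
    """
    pages, databases = [], []
    cursor = None
    while True:
        payload = {"page_size": 100}
        if cursor:
            payload["start_cursor"] = cursor
        response = search(payload)
        for item in response.get("results", []):
            kind = item.get("object")
            if kind == "page":
                pages.append(item)
            elif kind == "database":
                databases.append(item)
            # Any other object type is ignored in this sketch.
        if not response.get("has_more"):
            break
        cursor = response.get("next_cursor")
        if not cursor:
            # Defensive stop: has_more without a cursor would otherwise loop forever.
            break
    return pages, databases
```

Because no filter is sent, one paginated walk replaces the two filtered walks, and the split by type happens locally.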

Bump plugin version to 0.1.19.

- Widen the except clause in _build_page_entry to also catch
  requests.exceptions.RequestException, so a single HTTP error during
  parent resolution skips that item instead of aborting the whole
  enumeration (preserves the behaviour established by PR langgenius#2891).
- Coalesce concurrent parent-block lookups via a thread-safe in-flight
  tracker (threading.Event per block_id), so multiple workers asking for
  the same ancestor share one HTTP request instead of racing past the
  cache miss and amplifying 429 backoff.
- Restore API-side filtering in notion_page_search and
  notion_database_search by routing them through a new _search_filtered
  helper. They previously fetched everything and filtered in memory,
  which doubled the data transferred when callers used them independently.
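The in-flight coalescing described in the second bullet can be sketched like this. It is an illustration under assumptions rather than the PR's implementation: `CoalescingCache` and `fetch` are hypothetical names, and the retry-after-failure behavior for waiting threads is a design choice made here.

```python
import threading


class CoalescingCache:
    """Cache where concurrent requests for the same key share one fetch."""

    def __init__(self, fetch):
        self._fetch = fetch        # hypothetical callable: key -> value
        self._lock = threading.Lock()
        self._results = {}         # key -> finished value
        self._in_flight = {}       # key -> Event set when the fetch finishes

    def get(self, key):
        with self._lock:
            if key in self._results:
                return self._results[key]
            event = self._in_flight.get(key)
            leader = event is None
            if leader:
                # First caller for this key becomes the leader and does the fetch.
                event = threading.Event()
                self._in_flight[key] = event
        if not leader:
            # Followers block until the leader finishes, then re-check the cache.
            # If the leader failed, the recursive call elects a new leader.
            event.wait()
            return self.get(key)
        try:
            value = self._fetch(key)
            with self._lock:
                self._results[key] = value
            return value
        finally:
            # Always release the waiters, even if the fetch raised.
            with self._lock:
                self._in_flight.pop(key, None)
            event.set()
```

The result is stored before the in-flight entry is removed, so a thread arriving at any point sees either the pending event or the cached value, never neither.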
PR langgenius#3207 published 0.1.20 to the marketplace, so the
"Check If Version Exists" CI step rejects this PR again. Bumping to
0.1.21 reserves a fresh version slot for the perf changes in this PR.
@Kota-Maeda Kota-Maeda force-pushed the perf/notion-datasource-large-workspace branch from 1c553f8 to 54b564a Compare May 29, 2026 02:13
@Kota-Maeda Kota-Maeda deployed to datasources/notion_datasource May 29, 2026 02:14 — with GitHub Actions Active

Labels

size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Notion datasource times out enumerating pages on large workspaces

1 participant