fix: handle URL length limits in Typesense engine for large courses by blarghmatey · Pull Request #263 · openedx/edx-search

blarghmatey · 2026-04-17T17:38:13Z

Summary

When indexing a large course with many content blocks (e.g. 500+), the Typesense search engine can fail with URL-length errors. This PR fixes three scale/robustness issues and separately applies three improvements identified from a review of the Typesense filtering and data-types documentation.

Scale / URL-length fixes

Problem 1 – `search()` raises `httpx.InvalidURL` (reported bug)

httpx.InvalidURL: URL component 'query' too long

An existing fallback for RequestMalformed ("Query string exceeds max allowed length") covered the server-side case but not this client-side one. Both are now caught and routed to the multi-search POST endpoint, which has no URL-length constraint. The httpx.InvalidURL catch is narrowed to messages containing "too long" so that unrelated URL errors are still raised normally.
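The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `InvalidURL` and `RequestMalformed` are local stand-ins for `httpx.InvalidURL` and the Typesense client's `RequestMalformed` exception so that the example runs without either library, and `get_search` / `post_search` are hypothetical callables representing the GET search request and the multi-search POST request. Narrowing the `RequestMalformed` case by message is an assumption here.

```python
class InvalidURL(Exception):
    """Stand-in for httpx.InvalidURL (assumption: keeps the sketch dependency-free)."""


class RequestMalformed(Exception):
    """Stand-in for the Typesense client's RequestMalformed exception."""


def search_with_fallback(get_search, post_search, params):
    """Try the GET-style search first; fall back to POST on URL-length errors.

    get_search and post_search are hypothetical callables that take the
    search parameters and return the search response.
    """
    try:
        return get_search(params)
    except InvalidURL as exc:
        # Client-side failure: only swallow the "too long" variant so that
        # unrelated URL errors still propagate to the caller.
        if "too long" not in str(exc):
            raise
    except RequestMalformed as exc:
        # Server-side failure: matching on the message is an assumption.
        if "exceeds max allowed length" not in str(exc):
            raise
    # Both length errors end up here; a POST body has no URL-length limit.
    return post_search(params)
```

Taking the two requests as callables keeps the error-routing logic testable without a running Typesense server.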

Problem 2 – `remove()` has the same URL-length risk

remove() sent a single DELETE request with filter_by: id: [id1,id2,...] for every stale document. The Typesense Python client passes filter_by as a URL query parameter, so a large stale-item list triggers the same failure. Documents are now deleted in batches of at most _MAX_IDS_PER_DELETE_BATCH (100) IDs per request.
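A minimal sketch of the batching idea, assuming document IDs are plain strings. `delete_in_batches` and `delete_fn` are hypothetical names used only for illustration; `delete_fn` stands in for whatever call issues one DELETE request with a `filter_by` parameter.

```python
_MAX_IDS_PER_DELETE_BATCH = 100  # batch size named in the PR description


def delete_in_batches(delete_fn, doc_ids, batch_size=_MAX_IDS_PER_DELETE_BATCH):
    """Issue one delete call per slice of at most batch_size IDs.

    delete_fn is a hypothetical callable that receives the request
    parameters for a single DELETE request.
    """
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        # Same filter shape as before, but bounded in length per request.
        delete_fn({"filter_by": "id: [" + ",".join(batch) + "]"})
```

An empty ID list results in no requests at all, and a list whose length is an exact multiple of the batch size produces no trailing empty batch.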

Problem 3 – `get_search_params()` crashes when `exclude_dictionary=None`

The function signature documents exclude_dictionary=None as the default, but the loop body iterated unconditionally, raising AttributeError: 'NoneType' object has no attribute 'items'. A None guard has been added.
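The shape of such a guard can be illustrated as below. This is a sketch only: `iter_exclusions` is a hypothetical helper, not the real `get_search_params()`, and it simply yields the field/value pairs that a caller would turn into filters.

```python
def iter_exclusions(exclude_dictionary=None):
    """Yield (field, values) pairs, treating None as "nothing to exclude"."""
    if exclude_dictionary is None:
        # Without this guard, calling .items() on None raises AttributeError.
        return
    for field, values in exclude_dictionary.items():
        yield field, list(values)
```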

Typesense best-practices improvements (from docs review)

2 – `range_index: True` on datetime fields

The schema parameters reference documents range_index as enabling an optimised index for range-based numerical filtering — exactly how date fields are queried (e.g. "active courses only", "enrollment window"). Added to start, end, enrollment_start, enrollment_end, and start_date.
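As an illustration, the schema entries for these fields might look like the following. The field names come from the description above; the `int64` type and the `optional` flag are assumptions (dates stored as Unix timestamps), and `date_field_schema` is a hypothetical helper.

```python
DATE_FIELDS = ["start", "end", "enrollment_start", "enrollment_end", "start_date"]


def date_field_schema(name):
    """Build one schema field entry with the range index enabled."""
    return {
        "name": name,
        "type": "int64",      # assumption: dates indexed as Unix timestamps
        "optional": True,     # assumption: not every document has every date
        "range_index": True,  # optimised index for range filters
    }


date_fields = [date_field_schema(name) for name in DATE_FIELDS]
```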

3 – Strip HTML from `content.*` fields before indexing

The Typesense data-types guide explicitly recommends stripping HTML markup before indexing so that tags are not treated as searchable tokens and do not bloat the in-memory index. XBlock index_dictionary() methods can return raw HTML in content fields. A new _strip_html / _strip_html_from_content helper is applied in process_document() to the nested content object only; other fields (course keys, org names, etc.) are left unmodified.
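One possible shape for such helpers, using only the standard library, is sketched below. The PR's actual `_strip_html` / `_strip_html_from_content` implementations are not shown on this page, so the names `strip_html` and `strip_html_from_content` and the regex-based approach are assumptions; a regex is a deliberately naive way to drop tags and would not cope with every malformed document.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove tags, unescape entities, and collapse runs of whitespace."""
    without_tags = _TAG_RE.sub(" ", text)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


def strip_html_from_content(value):
    """Recursively clean strings inside nested dicts and lists.

    Non-string leaves (numbers, None, booleans) are returned unchanged.
    """
    if isinstance(value, str):
        return strip_html(value)
    if isinstance(value, dict):
        return {key: strip_html_from_content(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_html_from_content(item) for item in value]
    return value
```

A caller would apply `strip_html_from_content` only to the document's nested content object and leave identifier fields untouched, matching the scope described above.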

Testing

search/tests/test_typesense.py is a new test module covering all six changes:

- TypesenseEngine.search(): normal path, httpx.InvalidURL fallback, RequestMalformed fallback, and a check that unrelated errors are still re-raised
- TypesenseEngine.remove(): empty list, small batch, exact batch size, large batch (multiple requests), and correct per-batch ID slicing
- _strip_html and _strip_html_from_content helpers: tag removal, entity unescaping, whitespace normalisation, nested structures, non-string values
- TypesenseEngine.process_document(): HTML stripping in the content field, non-content fields untouched, missing content field handled

All new tests pass; pre-existing failures in the Elasticsearch and Meilisearch suites are unrelated to these changes.

When indexing a large course with many content blocks, the Typesense search engine could hit two related URL-length failures:

1. search(): Building an exclusion filter like 'id:!=[id1, id2, ...]' for hundreds of item IDs produced a URL query string that exceeded httpx's internal limit, raising:

   httpx.InvalidURL: URL component 'query' too long

   An existing fallback for RequestMalformed ('Query string exceeds max allowed length') did not cover this client-side failure.

   Fix: also catch httpx.InvalidURL (narrowed to 'too long' message) and route both cases to the multi-search POST endpoint, which has no URL length constraint.

2. remove(): Deleting many stale documents built a single 'id: [id1,id2,...]' filter sent as a DELETE query parameter, which is equally subject to URL length limits.

   Fix: chunk deletions into batches of _MAX_IDS_PER_DELETE_BATCH (100) IDs per request.

Also fix a latent bug in get_search_params(): the exclude_dictionary loop iterated unconditionally, raising AttributeError when the caller passed exclude_dictionary=None (the documented default).

Tests added for all three fixes in search/tests/test_typesense.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

openedx-webhooks · 2026-04-17T17:38:20Z

Thanks for the pull request, @blarghmatey!

This repository is currently maintained by @Ali-Salman29.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

- Dependencies: This PR must be merged before / after / at the same time as ...
- Blockers: This PR is waiting for OEP-1234 to be accepted.
- Timeline information: This PR must be merged by XX date because ...
- Partner information: This is for a course on edx.org.
- Supporting documentation
- Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

Three improvements based on a review of the Typesense filtering and data-types documentation:

1. Use the documented '![]' (is-not-any-of) filter operator instead of '!=[...]'. The operators table in https://typesense.org/docs/guide/tips-for-filtering.html#available-operators documents '![]' as the canonical multi-value exclusion operator; '!=[...]' combines the single-value inequality operator with an array argument and is not an officially documented form.

2. Add range_index: True to all datetime fields (start, end, enrollment_start, enrollment_end, start_date). The schema-parameters reference documents this flag as enabling an optimised index for range-based numerical filtering, which is exactly how date fields are queried (e.g. 'show only active courses').

3. Strip HTML from content sub-fields before indexing. The Typesense data-types guide explicitly recommends storing plain text rather than raw HTML markup so that tags are not indexed as searchable tokens, wasting in-memory index space and polluting search results. A new _strip_html / _strip_html_from_content helper pair is applied inside process_document() only to the nested 'content' object, since other fields (course keys, org names, etc.) are clean identifiers that should not be modified.

Tests added for all three changes in search/tests/test_typesense.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bradenmacdonald · 2026-05-05T00:07:27Z

> The Typesense data-types guide explicitly recommends stripping HTML markup before indexing so that tags are not treated as searchable tokens and do not bloat the in-memory index. XBlock index_dictionary() methods can return raw HTML in content fields. A new _strip_html / _strip_html_from_content helper is applied in process_document() to the nested content object only; other fields (course keys, org names, etc.) are left unmodified.

this is great, but really should be done for all search engines, and/or we should change index_dictionary to not accept HTML, because I don't see why we'd ever want HTML markup included in any search index.

bradenmacdonald

Really nice work! Thanks so much.

I tested this and it seems to be working well, although I needed open-craft/tutor-contrib-typesense#8 to get it working with Tutor and #267 to get the "Discover Courses" search working. Neither of those issues is related to this PR, however.

openedx-webhooks added the open-source-contribution label (PR author is not from Axim or 2U) Apr 17, 2026

openedx-webhooks added this to Contributions Apr 17, 2026

github-project-automation Bot moved this to Needs Triage in Contributions Apr 17, 2026

blarghmatey requested review from bradenmacdonald and feanil April 17, 2026 17:47

feanil requested a review from Ali-Salman29 April 21, 2026 19:07

mphilbrick211 moved this from Needs Triage to Ready for Review in Contributions Apr 22, 2026

bradenmacdonald approved these changes May 5, 2026

bradenmacdonald merged commit aa82cd0 into openedx:master May 5, 2026
7 checks passed

github-project-automation Bot moved this from Ready for Review to Done in Contributions May 5, 2026

bradenmacdonald merged 2 commits into openedx:master from mitodl:fix/typesense-url-length-scale
