feat: improve docs search #1277

Open
cjbell wants to merge 6 commits into main from cursor/docs-search-content-indexing-c357
cjbell commented Jan 22, 2026

Description

This PR introduces a new indexing strategy for the docs, which will now include headings and content within a page in addition to the page title/tags that we previously indexed.

Note: right now we're not indexing API content within this result set, but I can easily change that if we'd like!

Phase 1 of Algolia search improvements:

- Add EnhancedDocsSearchItem type with new fields:
  - pageTitle: Always the parent page title
  - description: From frontmatter (page-level only)
  - content: Text content (truncated ~2000 chars)
  - headingLevel: 0 for page, 2 for H2, 3 for H3
  - isPageLevel: True if page-level record (not a heading)

- Create scripts/indexDocsForSearch.ts:
  - Parses all MDX/MD content files
  - Extracts frontmatter using remark
  - Creates page-level records with intro content
  - Extracts H2/H3 headings with surrounding content
  - Creates heading-level records with anchor links
  - Batches uploads to Algolia (1000 per batch)
  - Gracefully handles missing Algolia credentials

- Update package.json:
  - Add 'index-docs' script
  - Run new indexer in prebuild before index-apis
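
The record shape and batching described above could be sketched roughly as follows. This is an illustration only, not the PR's actual code: the `path` and `title` fields, the exact optionality of each field, and the `truncateContent` and `chunk` helpers are assumptions. The remaining field names and the 2000/1000 limits come from the description.

```typescript
// Hypothetical sketch of the record type described in this PR.
// Fields marked "assumed" are not named in the description.
interface EnhancedDocsSearchItem {
  objectID: string; // Algolia record ID
  path: string; // assumed: URL path, plus #anchor for heading records
  title: string; // assumed: page or heading title
  pageTitle: string; // always the parent page title
  description?: string; // from frontmatter, page-level records only
  content: string; // text content, truncated to roughly 2000 chars
  headingLevel: 0 | 2 | 3; // 0 for page, 2 for H2, 3 for H3
  isPageLevel: boolean; // true for page-level records, false for headings
}

const MAX_CONTENT_LENGTH = 2000;
const BATCH_SIZE = 1000;

// Cut content down to the maximum length (assumed: a plain character cut).
function truncateContent(text: string, max: number = MAX_CONTENT_LENGTH): string {
  return text.length <= max ? text : text.slice(0, max);
}

// Split records into batches so each upload stays at or below BATCH_SIZE.
function chunk<T>(items: T[], size: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch returned by `chunk` would then be uploaded in its own request; the upload call itself is left out here because the description does not say which Algolia client method the script uses.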

This enables:
- Deep linking to specific sections via #anchor URLs
- Better relevance for specific queries
- Smaller, more focused search records
- Content-based search (not just titles)

Co-authored-by: chris <chris@knock.app>
vercel bot commented Jan 22, 2026

The latest updates on your projects.

Project: docs
Deployment: Ready
Updated (UTC): Feb 24, 2026 3:58am

Co-authored-by: chris <chris@knock.app>
cjbell changed the title from "Docs search content indexing" to "feat: improve docs search" on Jan 23, 2026
cjbell marked this pull request as ready for review on January 23, 2026 22:09
matches.push({
  index: match.index,
  level: match[1].length,
  title: match[2].trim(),

Heading titles retain raw markdown formatting characters

Medium Severity

Heading titles captured by the regex at line 183 are stored directly, without any markdown cleanup. Heading content is cleaned via extractTextContent() at line 200, but the title field never is. Headings like ## Using `config` variables or ## **Important** notes will display with literal backticks, asterisks, or link syntax visible in search results. The title needs the same markdown cleanup applied to it.
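
One way to apply that cleanup to titles is sketched below. `cleanHeadingTitle` is a hypothetical helper, not the PR's extractTextContent(), and its patterns only cover images, links, inline code, and asterisk emphasis. Underscore emphasis is deliberately left alone so identifiers such as `user_id` survive.

```typescript
// Minimal sketch: strip common inline markdown from a heading title.
// Hypothetical helper; the patterns are assumptions, not the PR's own regexes.
function cleanHeadingTitle(raw: string): string {
  return raw
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1") // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1") // links -> link text
    .replace(/`([^`]*)`/g, "$1") // inline code -> plain text
    .replace(/\*\*(.+?)\*\*/g, "$1") // **bold** -> bold
    .replace(/\*(.+?)\*/g, "$1") // *italic* -> italic
    .trim();
}
```

In the snippet under review, this would mean storing something like `cleanHeadingTitle(match[2])` in the title field instead of `match[2].trim()`.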


@cjbell cjbell requested a review from samseely January 27, 2026 19:50
@samseely

@cjbell should I be able to test this on the preview link? I'm trying it out but not getting results when querying for headings or content.

Resolve conflicts:
- package.json: Combined split-specs from main with index-docs from this branch
- Autocomplete.tsx: Used main's refactored content variable pattern,
  added EnhancedDocsSearchItem import, and added pageTitle display
  for heading-level search results only (not page-level to avoid
  redundant title display)

Co-authored-by: Chris Bell <chris@cjbell.co>
Move image removal before link removal to prevent images from being
partially processed. Since images use ![alt](url) which contains the
link pattern [alt](url), the link regex was matching first and leaving
behind '!alt text' in the indexed content.

Co-authored-by: Chris Bell <chris@cjbell.co>
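
The ordering issue in the commit above can be reproduced with a small sketch. The two patterns and both function names are assumptions for illustration; the script's real regexes may differ.

```typescript
// Assumed patterns for markdown images and links (not taken from the PR).
const IMAGE_PATTERN = /!\[[^\]]*\]\([^)]*\)/g;
const LINK_PATTERN = /\[([^\]]*)\]\([^)]*\)/g;

// Fixed order: drop images first, then reduce links to their text.
function stripImagesThenLinks(markdown: string): string {
  return markdown
    .replace(IMAGE_PATTERN, "") // removes the whole image, including the "!"
    .replace(LINK_PATTERN, "$1"); // keeps only the link text
}

// Buggy order: the link pattern matches the [alt](url) part of an image
// first, so the leading "!" and the alt text are left behind.
function stripLinksThenImages(markdown: string): string {
  return markdown
    .replace(LINK_PATTERN, "$1")
    .replace(IMAGE_PATTERN, "");
}
```

With the input `See ![alt text](img.png) and [docs](/docs).`, the buggy order leaves `!alt text` in the output, while the fixed order removes the image completely and keeps only `docs` from the link.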
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

  isPageLevel: true,
};
await queueItem(pageRecord);
pageCount++;

Old page indexing not removed, causing duplicate records

High Severity

The new indexDocsForSearch.ts script creates page-level records with objectID: "page-${urlPath}", but the old indexing mechanism in lib/content.server.ts (generateAlgoliaIndex) still runs during getStaticProps in pages/[...slug].tsx and creates records with objectID: frontmatter.id (the raw path). Both write to the same Algolia index. Since the objectIDs differ, every docs page gets two records, resulting in duplicate search results.


return filePath
  .replace(CONTENT_DIR, "")
  .replace(/\.mdx?$/, "")
  .replace("/index", "")

Non-anchored /index replacement corrupts URL paths

Medium Severity

filePathToUrlPath uses .replace("/index", "") which is a plain string replace matching the first occurrence of /index anywhere in the path. If any content file lives under a directory or has a filename containing "index" as a substring (e.g., /indexing/, /reindex), the path would be silently corrupted. This needs to be anchored to the end of the path (e.g., a regex like /\/index$/).
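
A sketch of the anchored variant follows. The `CONTENT_DIR` value is a hypothetical stand-in for the script's real constant, and the rest mirrors the snippet under review.

```typescript
// Hypothetical stand-in for the script's CONTENT_DIR constant.
const CONTENT_DIR = "content";

function filePathToUrlPath(filePath: string): string {
  return filePath
    .replace(CONTENT_DIR, "")
    .replace(/\.mdx?$/, "")
    .replace(/\/index$/, ""); // only strip a trailing "/index" segment
}
```

With the original plain-string replace, `content/indexing/overview.mdx` would become `ing/overview`; with the anchored regex it stays `/indexing/overview`, while `content/guides/index.mdx` still maps to `/guides`.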

  });

  return files;
}

Duplicate utility function already exists in codebase

Low Severity

getAllFilesInDir is reimplemented here, but an identical exported function already exists in lib/content.server.ts. This duplication means bug fixes or behavior changes to one copy won't propagate to the other.

    level: match[1].length,
    title: match[2].trim(),
  });
}

Heading regex matches inside fenced code blocks

Medium Severity

extractHeadings runs the heading regex /^(#{2,3})\s+(.+)$/gm on the full content before code blocks are stripped. Lines like ## some comment inside fenced code blocks (common in shell/Python examples) will be falsely detected as headings. This splits the real heading's content at the wrong boundary and creates spurious heading-level search records. The extractTextContent function already strips code blocks, but it's only applied to the content between the already-detected headings, not before heading detection runs.
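
One possible fix is to mask fenced code blocks before running the heading regex, as sketched below. `maskFencedCode` and the `HeadingMatch` type are hypothetical names. The sketch only handles unindented fences made of three backticks or three tildes, and it leaves an unclosed fence unmasked.

```typescript
interface HeadingMatch {
  index: number;
  level: number;
  title: string;
}

// Replace every character inside fenced code blocks with a space, keeping
// newlines, so match indices still line up with the original content.
function maskFencedCode(content: string): string {
  return content.replace(/^(`{3}|~{3})[^\n]*\n[\s\S]*?^\1[^\n]*$/gm, (block) =>
    block.replace(/[^\n]/g, " ")
  );
}

// Same heading regex as in the review comment, run on the masked content.
function extractHeadings(content: string): HeadingMatch[] {
  const masked = maskFencedCode(content);
  const headingPattern = /^(#{2,3})\s+(.+)$/gm;
  const matches: HeadingMatch[] = [];
  let match: RegExpExecArray | null;
  while ((match = headingPattern.exec(masked)) !== null) {
    matches.push({
      index: match.index,
      level: match[1].length,
      title: match[2].trim(),
    });
  }
  return matches;
}
```

Masking rather than deleting the code blocks keeps each `index` valid against the original string, so the content between headings can still be sliced from the unmodified source and passed through extractTextContent as before.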
