Phase 1 of Algolia search improvements:

- Add EnhancedDocsSearchItem type with new fields:
  - pageTitle: Always the parent page title
  - description: From frontmatter (page-level only)
  - content: Text content (truncated ~2000 chars)
  - headingLevel: 0 for page, 2 for H2, 3 for H3
  - isPageLevel: True if page-level record (not a heading)
- Create scripts/indexDocsForSearch.ts:
  - Parses all MDX/MD content files
  - Extracts frontmatter using remark
  - Creates page-level records with intro content
  - Extracts H2/H3 headings with surrounding content
  - Creates heading-level records with anchor links
  - Batches uploads to Algolia (1000 per batch)
  - Gracefully handles missing Algolia credentials
- Update package.json:
  - Add 'index-docs' script
  - Run new indexer in prebuild before index-apis

This enables:

- Deep linking to specific sections via #anchor URLs
- Better relevance for specific queries
- Smaller, more focused search records
- Content-based search (not just titles)

Co-authored-by: chris <chris@knock.app>
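As a rough illustration of the record shape and batching described in the commit message above, here is a minimal sketch. The five new fields come from the commit message; `objectID`, `title`, and `path` are assumptions about the existing record, and `toBatches` and `truncateContent` are hypothetical helper names, not the functions in scripts/indexDocsForSearch.ts.

```typescript
// Sketch only: the real EnhancedDocsSearchItem may differ.
type EnhancedDocsSearchItem = {
  objectID: string; // assumed: Algolia record id
  title: string; // assumed: page or heading title
  path: string; // assumed: URL path, with #anchor for headings
  pageTitle: string; // always the parent page title
  description?: string; // from frontmatter, page-level only
  content: string; // text content, truncated to roughly 2000 chars
  headingLevel: 0 | 2 | 3; // 0 for page, 2 for H2, 3 for H3
  isPageLevel: boolean; // true for page-level records
};

// Hypothetical helper: cap content length before indexing.
function truncateContent(text: string, max = 2000): string {
  return text.length > max ? text.slice(0, max) : text;
}

// Hypothetical helper: split records into upload batches of at most `size`.
function toBatches<T>(items: T[], size = 1000): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch would then be passed to whatever upload call the indexer uses; that call is left out here because the source does not show it.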
Co-authored-by: chris <chris@knock.app>
    matches.push({
      index: match.index,
      level: match[1].length,
      title: match[2].trim(),
Heading titles retain raw markdown formatting characters
Medium Severity
Heading titles captured by the regex at line 183 are stored directly without cleaning markdown formatting. While heading content is properly cleaned via extractTextContent() at line 200, the title field never is. Headings like ## Using `config` variables or ## Important notes will display with literal backticks, asterisks, or link syntax visible in search results. The title needs the same markdown cleanup applied to it.
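One way to apply that cleanup to the title is sketched below. `cleanHeadingTitle` is a hypothetical helper, not the PR's extractTextContent(), and it covers only the three cases the comment names (backticks, asterisks, link syntax). Underscore emphasis is deliberately left alone so identifiers like `snake_case` survive.

```typescript
// Hypothetical helper: strip inline markdown from a heading title.
function cleanHeadingTitle(raw: string): string {
  return raw
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1") // images -> alt text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // links -> link text
    .replace(/`([^`]*)`/g, "$1") // inline code -> its contents
    .replace(/\*\*(.+?)\*\*/g, "$1") // **bold** -> bold
    .replace(/\*(\S(?:[^*]*\S)?)\*/g, "$1") // *emphasis* -> emphasis
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

console.log(cleanHeadingTitle("Using `config` variables"));
console.log(cleanHeadingTitle("See [the guide](/docs/guide)"));
```

In the snippet under review, this would replace `match[2].trim()` with something like `cleanHeadingTitle(match[2])`.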
@cjbell should I be able to test this on the preview link? I'm trying it out but not getting results when querying for headers or content.
Resolve conflicts:

- package.json: Combined split-specs from main with index-docs from this branch
- Autocomplete.tsx: Used main's refactored content variable pattern, added EnhancedDocsSearchItem import, and added pageTitle display for heading-level search results only (not page-level to avoid redundant title display)

Co-authored-by: Chris Bell <chris@cjbell.co>
Move image removal before link removal to prevent images from being partially processed. Since images use ![alt](url), which contains the link pattern [alt](url), the link regex was matching first and leaving behind '!alt text' in the indexed content.

Co-authored-by: Chris Bell <chris@cjbell.co>
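The ordering problem this commit describes can be reproduced with two small regexes. This is a sketch under assumptions: `IMAGE_RE`, `LINK_RE`, and both function names are illustrative, and the PR's actual patterns may differ.

```typescript
// Illustrative patterns; not necessarily the ones used in the PR.
const IMAGE_RE = /!\[([^\]]*)\]\([^)]*\)/g;
const LINK_RE = /\[([^\]]+)\]\([^)]*\)/g;

// Old order: the link pattern also matches the tail of an image, leaving "!alt".
function stripLinksThenImages(text: string): string {
  return text.replace(LINK_RE, "$1").replace(IMAGE_RE, "");
}

// Fixed order: remove whole images first, then unwrap the remaining links.
function stripImagesThenLinks(text: string): string {
  return text.replace(IMAGE_RE, "").replace(LINK_RE, "$1");
}

const sampleText = "See ![diagram](/img/a.png) and [the guide](/docs/guide).";
console.log(stripLinksThenImages(sampleText)); // See !diagram and the guide.
console.log(stripImagesThenLinks(sampleText)); // See  and the guide.
```

The fixed order leaves a double space where the image was, so a whitespace-collapsing pass afterwards may still be useful.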
Cursor Bugbot has reviewed your changes and found 4 potential issues.
      isPageLevel: true,
    };
    await queueItem(pageRecord);
    pageCount++;
Old page indexing not removed, causing duplicate records
High Severity
The new indexDocsForSearch.ts script creates page-level records with objectID: "page-${urlPath}", but the old indexing mechanism in lib/content.server.ts (generateAlgoliaIndex) still runs during getStaticProps in pages/[...slug].tsx and creates records with objectID: frontmatter.id (the raw path). Both write to the same Algolia index. Since the objectIDs differ, every docs page gets two records, resulting in duplicate search results.
    return filePath
      .replace(CONTENT_DIR, "")
      .replace(/\.mdx?$/, "")
      .replace("/index", "")
Non-anchored /index replacement corrupts URL paths
Medium Severity
filePathToUrlPath uses .replace("/index", ""), a plain string replace that matches only the first occurrence of /index, wherever it appears in the path. If any directory or filename in the path starts with "index" (e.g., /indexing/), the path is silently corrupted. The replacement needs to be anchored to the end of the path (e.g., with a regex like /\/index$/).
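A minimal sketch of the anchored version follows, assuming `CONTENT_DIR` is a simple path prefix. The value "content" and the example file paths are hypothetical; only the function name and the replace chain come from the snippet above.

```typescript
// Hypothetical value; the real CONTENT_DIR is defined elsewhere in the script.
const CONTENT_DIR = "content";

function filePathToUrlPath(filePath: string): string {
  return filePath
    .replace(CONTENT_DIR, "")
    .replace(/\.mdx?$/, "")
    .replace(/\/index$/, ""); // anchored: only strips a trailing /index
}

console.log(filePathToUrlPath("content/guides/index.mdx")); // /guides
console.log(filePathToUrlPath("content/indexing/overview.md")); // /indexing/overview
```

With the unanchored `.replace("/index", "")`, the second path would come out as `ing/overview`.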
      });

      return files;
    }
        level: match[1].length,
        title: match[2].trim(),
      });
    }
Heading regex matches inside fenced code blocks
Medium Severity
extractHeadings runs the heading regex /^(#{2,3})\s+(.+)$/gm on the full content before code blocks are stripped. Lines like ## some comment inside fenced code blocks (common in shell/Python examples) will be falsely detected as headings. This splits the real heading's content at the wrong boundary and creates spurious heading-level search records. The extractTextContent function already strips code blocks, but it's only applied to the content between the already-detected headings, not before heading detection runs.
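One possible fix is to scan line by line and ignore anything between fence markers. The sketch below reuses the heading pattern from the comment but is otherwise an assumption: the `Heading` shape mirrors the fields in the diff snippet, and only backtick fences are tracked (tilde fences and indented code blocks are not handled).

```typescript
type Heading = { index: number; level: number; title: string };

// Built at runtime so this example does not contain a literal fence marker.
const FENCE = "`".repeat(3);

function extractHeadings(content: string): Heading[] {
  const headings: Heading[] = [];
  let inFence = false;
  let offset = 0; // character offset of the current line within content

  for (const line of content.split("\n")) {
    if (line.trim().indexOf(FENCE) === 0) {
      inFence = !inFence; // opening or closing fence line
    } else if (!inFence) {
      const match = /^(#{2,3})\s+(.+)$/.exec(line);
      if (match) {
        headings.push({
          index: offset,
          level: match[1].length,
          title: match[2].trim(),
        });
      }
    }
    offset += line.length + 1; // +1 for the newline removed by split
  }
  return headings;
}
```

Because `index` is still an offset into the original content, the existing logic that slices content between consecutive headings could stay as it is.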


Description
This PR introduces a new indexing strategy for the docs, which will now include headings and content within a page in addition to the page title/tags that we previously indexed.
Note: right now we're not indexing API content within this result set, but I can easily change that if we'd like!