[Tock Studio] add Markdown ZIP export with GFM-aware heading demotion and RAG-oriented structure (Dercbot 1836) #2016

Draft
rkuffer wants to merge 3 commits into theopenconversationkit:master from CreditMutuelArkea:DERCBOT-1836-Markdown-export-of-faqs

rkuffer commented Mar 17, 2026

Closes #2015


Summary

This PR adds Markdown as a third export format to the DataExportComponent, alongside the existing JSON and CSV options. Note that DataExportComponent is currently only used for FAQ export — this context drove the design decisions around heading structure, GFM detection, and RAG-oriented output.
Each entry in the exported collection produces an individual .md file; all files are bundled into a .zip archive. The implementation is oriented toward RAG pipeline ingestion and handles the case where field values already contain Markdown or GFM content.


Changes

data-export.component.ts

  • Added Formats.md to the Formats enum
  • Added @Input() defaultColumns?: string[] — optional ordered list of pre-checked columns; columns listed here appear first and are checked by default, remaining columns are appended unchecked
  • Added @Input() mdHeadingPrefix?: string — optional prefix for the H1 heading of each generated file (# {prefix}: {title}, # {prefix}, or # {id} depending on available data; no H1 if none available)
  • Refactored column FormGroup construction in ngOnInit to use an intermediate Map, enabling defaultColumns ordering to be applied after all groups are built
  • Fixed canIncreaseColumnIndex bound: now uses this.columns.length instead of this.columnNames.length, which was incorrectly capping reordering for virtual columns produced by deepening strategies
  • Added buildMarkdownFileName(): cross-OS safe filename sanitization (diacritics, forbidden characters, Windows reserved names, 200-char max length), combining title and id with index-based fallback when neither yields a usable string after sanitization
  • Added buildMarkdownHeading(): builds the H1 line from mdHeadingPrefix and/or title/id, returns null when no data is available (no H1 emitted)
  • Added looksLikeMarkdown(): remark/GFM-based AST detection with a tiered confidence model — high (code, link, table), medium (heading, blockquote, strikethrough), low (list, strong) — requiring at least one high signal, two medium, or one medium + one low
  • Added demoteMarkdownHeadings(): AST-based heading demotion via remark-parse + remark-gfm + visit — increments all heading depths by 1 (capped at 6); code block contents are opaque AST nodes and are never affected
  • Added serializeValueToMarkdown(): async serialization dispatcher — bullet list for arrays, **key**: value for objects, Markdown detection + heading demotion for strings, passthrough otherwise; empty results skip their section heading entirely
  • Added generateMarkdownForEntry(): async, builds the full .md content for one entry
  • Markdown ZIP generation uses Promise.all for concurrent entry processing, chained with zip.generateAsync, with a .catch that resets loading and surfaces a danger toast on failure
  • Added two shared readonly remark pipeline instances (remarkParser, remarkSerializer) to avoid per-call unified() instantiation
  • Column validator (validateColumnsFormArray) now also activates for Formats.md
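
For reviewers, the string branch described above (detect Markdown, then demote headings) can be sketched without any dependency. This is only an illustration under stated assumptions, not the code in this PR: it works on a hand-written mdast-shaped tree instead of calling remark-parse + remark-gfm, uses a small recursive `walk` in place of unist-util-visit, and counts distinct node types per tier (the PR may count occurrences instead). The names `MdNode`, `walk`, `hasMarkdownSignals` and `demoteHeadings` are hypothetical.

```typescript
// Minimal mdast-shaped node. Real trees would come from remark-parse + remark-gfm.
interface MdNode {
  type: string;
  depth?: number;       // set on heading nodes
  value?: string;       // set on literal nodes such as code or text
  children?: MdNode[];
}

// Tier membership as listed in the PR description.
// Assumption: GFM strikethrough appears as the mdast `delete` node type.
const HIGH = new Set(['code', 'link', 'table']);
const MEDIUM = new Set(['heading', 'blockquote', 'delete']);
const LOW = new Set(['list', 'strong']);

// Depth-first traversal; a hand-rolled stand-in for unist-util-visit.
function walk(node: MdNode, fn: (n: MdNode) => void): void {
  fn(node);
  for (const child of node.children ?? []) {
    walk(child, fn);
  }
}

// Tiered rule: one high signal, two medium signals, or one medium plus one low.
// Assumption: signals are counted as distinct node types, not occurrences.
function hasMarkdownSignals(tree: MdNode): boolean {
  const seen = new Set<string>();
  walk(tree, (n) => {
    seen.add(n.type);
  });
  const count = (tier: Set<string>): number =>
    Array.from(seen).filter((t) => tier.has(t)).length;
  const high = count(HIGH);
  const medium = count(MEDIUM);
  const low = count(LOW);
  return high >= 1 || medium >= 2 || (medium >= 1 && low >= 1);
}

// Increments every heading depth by 1, capped at 6. The tree is mutated in place.
// Code nodes carry their text in `value` and have no children, so a line such as
// "# not a heading" inside a code block is never visited as a heading.
function demoteHeadings(tree: MdNode): MdNode {
  walk(tree, (n) => {
    if (n.type === 'heading' && n.depth !== undefined) {
      n.depth = Math.min(n.depth + 1, 6);
    }
  });
  return tree;
}
```

With this rule, a value containing only bold text (one low signal) is treated as plain text, while a value with a heading and a list (one medium plus one low) is treated as Markdown and has its headings demoted.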

data-export.component.html

  • Added Markdown radio option to the format selector
  • Extracted the column selection block from the CSV-only *ngIf into a shared ng-container conditioned on csv || md
  • CSV delimiter options remain isolated in their own conditional block

Dependencies

Added the jszip dependency.
The following packages were already present and are now imported:
unified, remark-parse, remark-stringify, remark-gfm, unist-util-visit (transitive; explicit declaration recommended).


Testing notes

  • Export a collection where some entries have no content for selected columns → empty sections should be absent from output
  • Export a collection where answer contains GFM (headings, tables, lists) → heading levels should be demoted, structure preserved
  • Export a collection where entries have titles with accented characters, special characters only, or no title/id at all → filenames should be valid and unique
  • Export with defaultColumns containing a locale-specific column (e.g. answer.i18n.en) on an app that does not support that locale → column silently absent, no error
  • Simulate a ZIP generation failure → loading resets, danger toast visible
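
To make the filename case in the notes above concrete, here is a minimal sketch of the kind of sanitization buildMarkdownFileName() is described as doing. It is not the PR's implementation: the helper names (`sanitizePart`, `buildFileName`), the `_` separator, the `entry-N` fallback pattern, and applying the 200-character limit to the base name rather than the full filename are all assumptions made for illustration.

```typescript
// Device names that Windows reserves regardless of case.
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])$/i;
// Assumption: the limit applies to the base name, before the ".md" extension.
const MAX_LENGTH = 200;

// Hypothetical helper: reduces one field (title or id) to a filename-safe fragment.
function sanitizePart(raw: string | undefined): string {
  return (raw ?? '')
    .normalize('NFD')                          // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, '')           // drop the combining diacritics
    .replace(/[<>:"/\\|?*\u0000-\u001f]/g, '') // drop characters forbidden on Windows, plus control chars
    .trim()
    .replace(/\s+/g, '-')                      // collapse whitespace runs into a single dash
    .replace(/^[.\-]+|[.\-]+$/g, '');          // no leading or trailing dots or dashes
}

// Hypothetical helper: combines title and id, falling back to the entry index
// when neither field survives sanitization.
function buildFileName(
  title: string | undefined,
  id: string | undefined,
  index: number,
): string {
  const parts = [sanitizePart(title), sanitizePart(id)].filter((p) => p.length > 0);
  let base = parts.join('_').slice(0, MAX_LENGTH).replace(/[.\-]+$/, '');
  if (base.length === 0) {
    base = `entry-${index + 1}`;               // index-based fallback
  }
  if (WINDOWS_RESERVED.test(base)) {
    base = `_${base}`;                         // avoid reserved device names
  }
  return `${base}.md`;
}

console.log(buildFileName('Présentation générale', '42', 0)); // Presentation-generale_42.md
console.log(buildFileName('???', undefined, 2));              // entry-3.md
console.log(buildFileName('CON', undefined, 0));              // _CON.md
```

Including the id in the name is what keeps two entries with the same title from colliding. In this sketch the index fallback only guarantees uniqueness when both title and id are unusable.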
