[Tock Studio] add Markdown ZIP export with GFM-aware heading demotion and RAG-oriented structure (Dercbot 1836) #2016
Draft
rkuffer wants to merge 3 commits into theopenconversationkit:master from
Closes #2015
Summary
This PR adds Markdown as a third export format to the DataExportComponent, alongside the existing JSON and CSV options. Note that DataExportComponent is currently only used for FAQ export — this context drove the design decisions around heading structure, GFM detection, and RAG-oriented output.
Each entry in the exported collection produces an individual `.md` file; all files are bundled into a `.zip` archive. The implementation is oriented toward RAG pipeline ingestion and handles the case where field values already contain Markdown or GFM content.
Changes
data-export.component.ts
- `Formats.md` added to the `Formats` enum
- `@Input() defaultColumns?: string[]`: optional ordered list of pre-checked columns; columns listed here appear first and are checked by default, remaining columns are appended unchecked
- `@Input() mdHeadingPrefix?: string`: optional prefix for the H1 heading of each generated file (`# {prefix}: {title}`, `# {prefix}`, or `# {id}` depending on available data; no H1 if none available)
- `FormGroup` construction in `ngOnInit` reworked to use an intermediate `Map`, enabling `defaultColumns` ordering to be applied after all groups are built
- `canIncreaseColumnIndex` bound: now uses `this.columns.length` instead of `this.columnNames.length`, which was incorrectly capping reordering for virtual columns produced by deepening strategies
- `buildMarkdownFileName()`: cross-OS safe filename sanitization (diacritics, forbidden characters, Windows reserved names, 200-char max length), combining `title` and `id` with index-based fallback when neither yields a usable string after sanitization
- `buildMarkdownHeading()`: builds the H1 line from `mdHeadingPrefix` and/or `title`/`id`, returns `null` when no data is available (no H1 emitted)
- `looksLikeMarkdown()`: remark/GFM-based AST detection with a tiered confidence model: high (`code`, `link`, `table`), medium (`heading`, `blockquote`, `strikethrough`), low (`list`, `strong`); requires at least one high signal, two medium, or one medium + one low
- `demoteMarkdownHeadings()`: AST-based heading demotion via `remark-parse` + `remark-gfm` + `visit`; increments all heading depths by 1 (capped at 6); code block contents are opaque AST nodes and are never affected
- `serializeValueToMarkdown()`: async serialization dispatcher: bullet list for arrays, `**key**: value` for objects, Markdown detection + heading demotion for strings, passthrough otherwise; empty results skip their section heading entirely
- `generateMarkdownForEntry()`: async, builds the full `.md` content for one entry
- `Promise.all` for concurrent entry processing, chained with `zip.generateAsync`, with a `.catch` that resets `loading` and surfaces a danger toast on failure
- `readonly` remark pipeline instances (`remarkParser`, `remarkSerializer`) to avoid per-call `unified()` instantiation
- `validateColumnsFormArray` now also activates for `Formats.md`
data-export.component.html
- Markdown radio option added to the format selector
- `*ngIf` moved into a shared `ng-container` conditioned on `csv || md`
Dependencies
Added the `jszip` dependency. The following already-present packages are now imported: `unified`, `remark-parse`, `remark-stringify`, `remark-gfm`, `unist-util-visit` (transitive; explicit declaration recommended).
Testing notes
- `answer` contains GFM (headings, tables, lists) → heading levels should be demoted, structure preserved
- No `title`/`id` at all → filenames should be valid and unique
- `defaultColumns` containing a locale-specific column (e.g. `answer.i18n.en`) on an app that does not support that locale → column silently absent, no error
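For reviewers checking the filename case, here is a minimal sketch of the sanitization rules listed for `buildMarkdownFileName()` (diacritics, forbidden characters, Windows reserved names, 200-char cap, index fallback). It is not the PR's code: `sanitizePart`, `buildFileName`, the `_` separator, the `entry-{index}` fallback and the `_` prefix for reserved names are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the rules described for buildMarkdownFileName().
// All names, separators and fallbacks below are assumptions, not the PR's implementation.
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])$/i;
const MAX_LENGTH = 200;

function sanitizePart(value: string | undefined): string {
  if (!value) return '';
  return value
    .normalize('NFD')                              // split letters from their diacritics
    .replace(/[\u0300-\u036f]/g, '')               // drop the combining marks
    .replace(/[<>:"\/\\|?*\u0000-\u001f]/g, '')    // characters forbidden on Windows or POSIX
    .replace(/\s+/g, '-')                          // collapse whitespace runs into a dash
    .replace(/^[.\-]+|[.\-]+$/g, '');              // no leading or trailing dots/dashes
}

function buildFileName(title: string | undefined, id: string | undefined, index: number): string {
  const parts = [sanitizePart(title), sanitizePart(id)].filter((p) => p.length > 0);
  // Truncate first, then trim again in case the cut left a trailing dot or dash.
  let base = parts.join('_').slice(0, MAX_LENGTH).replace(/[.\-]+$/g, '');
  if (base.length === 0) base = `entry-${index}`;      // index-based fallback
  // Windows treats reserved device names as reserved even with an extension.
  if (WINDOWS_RESERVED.test(base)) base = `_${base}`;
  return `${base}.md`;
}
```

Uniqueness in this sketch relies on `id` or the index being distinct per entry; two entries with identical `title` and `id` would still collide, so a real implementation would need an extra de-duplication step.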
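The threshold rule described for `looksLikeMarkdown()` (one high signal, two medium, or one medium plus one low) can be sketched independently of the remark parsing step. The version below assumes the caller has already walked the AST and collected signal labels as plain strings; `SIGNAL_TIERS` and `meetsMarkdownThreshold` are hypothetical names. Counting distinct signal kinds rather than occurrences is also an assumption, since the description does not say which is meant.

```typescript
// Hypothetical sketch of the tiered confidence rule; not the PR's implementation.
// Labels mirror the PR description; mapping real AST node types to them is left to the caller.
type Tier = 'high' | 'medium' | 'low';

const SIGNAL_TIERS: Record<string, Tier> = {
  code: 'high',
  link: 'high',
  table: 'high',
  heading: 'medium',
  blockquote: 'medium',
  strikethrough: 'medium',
  list: 'low',
  strong: 'low',
};

function meetsMarkdownThreshold(signals: string[]): boolean {
  const counts: Record<Tier, number> = { high: 0, medium: 0, low: 0 };
  // Assumption: each kind of signal counts once, however often it occurs.
  new Set(signals).forEach((signal) => {
    const tier = SIGNAL_TIERS[signal];
    if (tier) counts[tier] += 1; // unknown labels are ignored
  });
  return (
    counts.high >= 1 ||
    counts.medium >= 2 ||
    (counts.medium >= 1 && counts.low >= 1)
  );
}
```

Under this rule a lone `heading` or a `list` plus `strong` does not qualify, which keeps plain text containing a stray `#` or `*` from being treated as Markdown.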