
Commit 9ccdab9

Updates
1 parent eb342b0 commit 9ccdab9

40 files changed: +4352 −30 lines changed

README.md

Lines changed: 200 additions & 1 deletion
@@ -1,3 +1,202 @@

# go-tokenizer

-General Tokenizer and Abstract Syntax Tree Generator
+A general-purpose tokenizer and Markdown parser with HTML rendering for Go.

[![Go Reference](https://pkg.go.dev/badge/github.com/mutablelogic/go-tokenizer.svg)](https://pkg.go.dev/github.com/mutablelogic/go-tokenizer)

## Features

- **Lexical Scanner**: Tokenizes text into identifiers, numbers, strings, operators, and punctuation
- **Markdown Parser**: Converts Markdown text into an Abstract Syntax Tree (AST)
- **HTML Renderer**: Renders Markdown AST to HTML with proper escaping
- **Configurable**: Optional features like comment parsing, newline handling, and float parsing

## Installation

```bash
go get github.com/mutablelogic/go-tokenizer
```

Requires Go 1.23 or later.

## Quick Start

### Tokenizing Text

```go
package main

import (
	"fmt"
	"strings"

	"github.com/mutablelogic/go-tokenizer"
)

func main() {
	scanner := tokenizer.NewScanner(strings.NewReader("hello world 123"), tokenizer.Pos{})
	for {
		tok := scanner.Next()
		if tok.Kind == tokenizer.EOF {
			break
		}
		fmt.Printf("%s: %q\n", tok.Kind, tok.Value)
	}
}
```

Output:

```text
Ident: "hello"
Space: " "
Ident: "world"
Space: " "
NumberInteger: "123"
```

### Parsing Markdown

```go
package main

import (
	"fmt"
	"strings"

	"github.com/mutablelogic/go-tokenizer"
	"github.com/mutablelogic/go-tokenizer/pkg/markdown"
	"github.com/mutablelogic/go-tokenizer/pkg/markdown/html"
)

func main() {
	input := `# Hello World

This is **bold** and _italic_ text.

- Item 1
- Item 2
- Item 3
`
	doc := markdown.Parse(strings.NewReader(input), tokenizer.Pos{})
	output := html.RenderString(doc)
	fmt.Println(output)
}
```

Output:

```html
<h1>Hello World</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>
```

## Packages

### `tokenizer` (root package)

The lexical scanner that breaks input text into tokens.

**Token Types:**

- `Ident` - Identifiers (hello, world)
- `NumberInteger`, `NumberFloat`, `NumberHex`, `NumberOctal`, `NumberBinary` - Numbers
- `String`, `QuotedString` - String literals
- `Hash`, `Asterisk`, `Underscore`, `Backtick`, `Tilde` - Special characters
- `Space`, `Newline` - Whitespace
- `Comment` - Comments (when enabled)
- And more...

**Scanner Features:**

```go
// Enable features with bitwise OR
scanner := tokenizer.NewScanner(r, pos,
	tokenizer.HashComment | // # style comments
	tokenizer.LineComment | // // style comments
	tokenizer.BlockComment | // /* */ style comments
	tokenizer.NewlineToken | // Emit newlines as separate tokens
	tokenizer.UnderscoreToken | // Emit underscores as separate tokens
	tokenizer.NumberFloatToken, // Parse floating point numbers
)
```

### `pkg/ast`

Defines the AST node types and tree traversal.

```go
// Node interface
type Node interface {
	Kind() Kind
	Children() []Node
}

// Walk the AST
ast.Walk(doc, func(node ast.Node, depth int) error {
	fmt.Printf("%s%s\n", strings.Repeat(" ", depth), node.Kind())
	return nil
})
```

### `pkg/markdown`

Parses Markdown text into an AST.

**Supported Syntax:**

- Headings: `# H1` through `###### H6`
- Paragraphs: Text separated by blank lines
- Emphasis: `_italic_` or `*italic*`
- Strong: `__bold__` or `**bold**`
- Strikethrough: `~~deleted~~`
- Inline code: `` `code` ``
- Code blocks: ` ```language ... ``` `
- Links: `[text](url)` or `<url>`
- Images: `![alt](url)`
- Blockquotes: `> quoted text`
- Unordered lists: `- item`, `* item`, or `+ item`
- Ordered lists: `1. item` or `1) item`
- Horizontal rules: `---`, `***`, or `___`
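
As a rough sketch (not taken from the repository's examples), the same `Parse` and `RenderString` calls shown in the Quick Start handle a document that combines several of these constructs:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/mutablelogic/go-tokenizer"
	"github.com/mutablelogic/go-tokenizer/pkg/markdown"
	"github.com/mutablelogic/go-tokenizer/pkg/markdown/html"
)

func main() {
	// A document that mixes a heading, a blockquote containing a link,
	// an ordered list and a horizontal rule.
	input := "## Notes\n\n> A [link](https://example.com) in a quote.\n\n1. First\n2. Second\n\n---\n"

	// Same calls as in the Parsing Markdown example above.
	doc := markdown.Parse(strings.NewReader(input), tokenizer.Pos{})
	fmt.Println(html.RenderString(doc))
}
```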

### `pkg/markdown/html`

Renders Markdown AST to HTML.

```go
// Render to string
output := html.RenderString(doc)

// Render to io.Writer with indentation
renderer := html.NewRenderer(w).WithIndent(true)
err := renderer.Render(doc)
```

**Features:**

- Proper HTML escaping for XSS prevention
- Optional indented output for readability
- Language classes on code blocks: `<code class="language-go">`

## AST Node Types

| Kind | Description | HTML Output |
|------|-------------|-------------|
| `Document` | Root node | (container) |
| `Paragraph` | Text block | `<p>...</p>` |
| `Heading` | H1-H6 | `<h1>...</h1>` |
| `Text` | Plain text | (escaped text) |
| `Emphasis` | Italic | `<em>...</em>` |
| `Strong` | Bold | `<strong>...</strong>` |
| `Strikethrough` | Deleted | `<del>...</del>` |
| `Code` | Inline code | `<code>...</code>` |
| `CodeBlock` | Fenced code | `<pre><code>...</code></pre>` |
| `Link` | Hyperlink | `<a href="...">...</a>` |
| `Image` | Image | `<img src="..." alt="..."/>` |
| `Blockquote` | Quote | `<blockquote>...</blockquote>` |
| `List` | Ordered/Unordered | `<ol>...</ol>` or `<ul>...</ul>` |
| `ListItem` | List item | `<li>...</li>` |
| `HorizontalRule` | Divider | `<hr/>` |
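
For illustration (the `pkg/ast` import path below is assumed from the package layout, not confirmed by this commit), walking a parsed document prints these kinds in tree order:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/mutablelogic/go-tokenizer"
	"github.com/mutablelogic/go-tokenizer/pkg/ast"
	"github.com/mutablelogic/go-tokenizer/pkg/markdown"
)

func main() {
	doc := markdown.Parse(strings.NewReader("# Title\n\nSome _text_ here.\n"), tokenizer.Pos{})

	// Print each node's Kind, indented by depth, using the Walk helper
	// from the pkg/ast section above.
	_ = ast.Walk(doc, func(node ast.Node, depth int) error {
		fmt.Printf("%s%s\n", strings.Repeat("  ", depth), node.Kind())
		return nil
	})
}
```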

## License

Apache 2.0 - see [LICENSE](LICENSE) for details.

doc.go

Lines changed: 31 additions & 1 deletion
@@ -1,4 +1,34 @@

/*
-The `tokenizer` package implements a generic expression scanner for tokens
+Package tokenizer implements a generic lexical scanner for tokenizing text input.

The tokenizer breaks input text into tokens such as identifiers, numbers, strings,
operators, and punctuation. It supports various number formats (integer, float,
hex, octal, binary) and can be configured with optional features like comment
parsing and newline handling.

# Basic Usage

	scanner := tokenizer.NewScanner(strings.NewReader("hello world"), tokenizer.Pos{})
	for {
		tok := scanner.Next()
		if tok.Kind == tokenizer.EOF {
			break
		}
		fmt.Println(tok)
	}

# Features

The scanner supports optional features that can be enabled:

- HashComment: Enable # style single-line comments
- LineComment: Enable // style single-line comments
- BlockComment: Enable block comments
- UnderscoreToken: Emit underscores as separate tokens (for markdown parsing)
- NewlineToken: Emit newlines as separate tokens instead of whitespace

Features are combined using bitwise OR:

	scanner := tokenizer.NewScanner(r, pos, tokenizer.HashComment|tokenizer.LineComment)
*/
package tokenizer
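
As a rough usage sketch of the comment feature described above (whether a Comment token's value includes the leading `#` is an assumption, not taken from this commit):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/mutablelogic/go-tokenizer"
)

func main() {
	// Enable # comments so the scanner emits Comment tokens.
	scanner := tokenizer.NewScanner(strings.NewReader("count 123 # trailing comment\n"),
		tokenizer.Pos{}, tokenizer.HashComment)
	for {
		tok := scanner.Next()
		if tok.Kind == tokenizer.EOF {
			break
		}
		if tok.Kind == tokenizer.Comment {
			fmt.Printf("comment: %q\n", tok.Value)
		}
	}
}
```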

pkg/ast/kind.go

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@

package ast

///////////////////////////////////////////////////////////////////////////////
// TYPES

// Kind identifies the type of an AST node.
// Each node in the syntax tree has a Kind that describes what it represents,
// such as a document, paragraph, heading, or inline formatting like emphasis.
type Kind uint

///////////////////////////////////////////////////////////////////////////////
// GLOBALS

const (
	Any Kind = iota
	Attr
	Block
	BlockList
	Ident
	Label
	List
	Map
	Ref
	String

	// Markdown kinds
	Document
	Paragraph
	Heading
	CodeBlock
	Blockquote
	ListItem
	Text
	Emphasis       // *italic* or _italic_
	Strong         // **bold** or __bold__
	Strikethrough  // ~~deleted~~
	Code           // `code`
	Link           // [text](url)
	Image          // ![alt](url)
	HorizontalRule // --- or *** or ___
	LineBreak
)

///////////////////////////////////////////////////////////////////////////////
// STRINGIFY

func (k Kind) String() string {
	switch k {
	case Any:
		return "Any"
	case Attr:
		return "Attr"
	case Block:
		return "Block"
	case BlockList:
		return "BlockList"
	case Ident:
		return "Ident"
	case Label:
		return "Label"
	case List:
		return "List"
	case Map:
		return "Map"
	case Ref:
		return "Ref"
	case String:
		return "String"
	case Document:
		return "Document"
	case Paragraph:
		return "Paragraph"
	case Heading:
		return "Heading"
	case CodeBlock:
		return "CodeBlock"
	case Blockquote:
		return "Blockquote"
	case ListItem:
		return "ListItem"
	case Text:
		return "Text"
	case Emphasis:
		return "Emphasis"
	case Strong:
		return "Strong"
	case Strikethrough:
		return "Strikethrough"
	case Code:
		return "Code"
	case Link:
		return "Link"
	case Image:
		return "Image"
	case HorizontalRule:
		return "HorizontalRule"
	case LineBreak:
		return "LineBreak"
	default:
		return "[?? Invalid Kind value]"
	}
}

pkg/ast/node.go

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@

// Package ast defines the abstract syntax tree node types used by parsers.
// It provides a common Node interface that all AST nodes implement,
// allowing uniform traversal of syntax trees.
package ast

///////////////////////////////////////////////////////////////////////////////
// TYPES

// Node is the interface implemented by all AST nodes.
// It provides methods for inspecting the node's type and accessing child nodes.
type Node interface {
	// Kind returns the type of this node (e.g., Document, Paragraph, Text).
	Kind() Kind

	// Children returns the immediate child nodes of this node.
	// Leaf nodes (like Text) return nil.
	Children() []Node
}

// WalkFunc is the function signature for the callback used by Walk.
// It receives the current node and its depth in the tree (0 for root).
// Return an error to stop walking, or nil to continue.
type WalkFunc func(node Node, depth int) error

///////////////////////////////////////////////////////////////////////////////
// PUBLIC FUNCTIONS

// Walk traverses the AST starting from the given node, calling fn for each node
// in depth-first pre-order (parent before children). The depth parameter indicates
// how deep in the tree the current node is (0 for the root).
// If fn returns an error, walking stops and that error is returned.
func Walk(node Node, fn WalkFunc) error {
	return walk(node, fn, 0)
}

///////////////////////////////////////////////////////////////////////////////
// PRIVATE FUNCTIONS

func walk(node Node, fn WalkFunc, depth int) error {
	if node == nil {
		return nil
	}
	// Visit this node
	if err := fn(node, depth); err != nil {
		return err
	}
	// Visit children
	for _, child := range node.Children() {
		if err := walk(child, fn, depth+1); err != nil {
			return err
		}
	}
	return nil
}
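
A brief usage sketch of the early-exit behaviour documented on WalkFunc (the sentinel error and the input document are illustrative only, not part of this commit):

```go
package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/mutablelogic/go-tokenizer"
	"github.com/mutablelogic/go-tokenizer/pkg/ast"
	"github.com/mutablelogic/go-tokenizer/pkg/markdown"
)

// errStop is a sentinel used only to short-circuit the walk in this sketch.
var errStop = errors.New("stop")

func main() {
	doc := markdown.Parse(strings.NewReader("# Title\n\nBody text.\n"), tokenizer.Pos{})

	// Stop walking at the first Heading node by returning an error.
	var heading ast.Node
	err := ast.Walk(doc, func(node ast.Node, depth int) error {
		if node.Kind() == ast.Heading {
			heading = node
			return errStop // returning an error stops the traversal
		}
		return nil
	})
	if err != nil && !errors.Is(err, errStop) {
		fmt.Println("walk failed:", err)
		return
	}
	if heading != nil {
		fmt.Println("found node:", heading.Kind())
	}
}
```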
