
Commit 3baa356

fix(dashboard): add macOS entitlements for notarization
A missing entitlements.plist was causing Apple's notarization service to hang indefinitely on submissions.

Added:

- `entitlements.plist` with JIT, unsigned-memory, dyld, and library-validation entitlements (matching OpenCode's working config)
- a `macOS.entitlements` reference in `tauri.conf.json`
- `macOSPrivateApi: true` for WebKit private API access

Also adds LongMemEval benchmark runner scaffolding.
1 parent ed6ce3b commit 3baa356

File tree

9 files changed: +1805 −4 lines changed

packages/dashboard/src-tauri/Cargo.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -12,7 +12,7 @@ crate-type = ["lib", "cdylib", "staticlib"]
 tauri-build = { version = "2", features = [] }

 [dependencies]
-tauri = { version = "2", features = ["tray-icon"] }
+tauri = { version = "2", features = ["macos-private-api", "tray-icon"] }
 serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 rusqlite = { version = "0.31", features = ["bundled"] }
```
packages/dashboard/src-tauri/entitlements.plist

Lines changed: 16 additions & 0 deletions (new file)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.security.cs.allow-jit</key>
	<true/>
	<key>com.apple.security.cs.allow-unsigned-executable-memory</key>
	<true/>
	<key>com.apple.security.cs.disable-executable-page-protection</key>
	<true/>
	<key>com.apple.security.cs.allow-dyld-environment-variables</key>
	<true/>
	<key>com.apple.security.cs.disable-library-validation</key>
	<true/>
</dict>
</plist>
```

packages/dashboard/src-tauri/tauri.conf.json

Lines changed: 6 additions & 2 deletions

```diff
@@ -29,7 +29,8 @@
     },
     "security": {
       "csp": "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; connect-src 'self' https:"
-    }
+    },
+    "macOSPrivateApi": true
   },
   "bundle": {
     "active": true,
@@ -41,7 +42,10 @@
       "icons/128x128@2x.png",
       "icons/icon.icns",
       "icons/icon.ico"
-    ]
+    ],
+    "macOS": {
+      "entitlements": "./entitlements.plist"
+    }
   },
   "plugins": {
     "updater": {
```
Lines changed: 133 additions & 0 deletions (new file)

# LongMemEval benchmark runner

This runner executes LongMemEval against a live OpenCode instance with the magic-context plugin enabled.

It uses the real OpenCode HTTP API via `@opencode-ai/sdk`: it creates real sessions, replays dataset user turns, asks the benchmark question in a new session, and judges the answer with the official LongMemEval prompt logic.

## What it benchmarks

- One OpenCode session per haystack conversation session
- All sessions share the same project/directory, so magic-context shares project identity and memory state
- The final benchmark question is asked in a new session to test cross-session memory retrieval
- No dataset assistant responses are injected
- All user messages go through the real OpenCode API

## Files

- `runner.ts` — end-to-end orchestration, replay, resume, logging, summary
- `judge.ts` — official LongMemEval judge prompt templates and GPT-4o evaluation
- `types.ts` — dataset, state, and result types
- `config.ts` — CLI parsing and runtime configuration

## Prerequisites

1. OpenCode running locally with magic-context enabled
2. The dataset JSON downloaded locally
3. A judge API key, read from `OPENAI_API_KEY` by default

Default OpenCode URL: `http://127.0.0.1:21354`

## Run

From `packages/plugin/`:

```bash
# Full run
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json

# Subset by index
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --start 0 --end 10

# Specific categories
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --types temporal-reasoning,knowledge-update

# Resume an interrupted run
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --resume

# Fast mode
bun run scripts/longmemeval/runner.ts --dataset ./longmemeval_s_cleaned.json --fast
```

## Important flags

```bash
--parallel <n>              # questions in parallel, default 1
--cleanup                   # delete OpenCode sessions after judging
--output-dir <path>         # override run artifact directory
--opencode-url <url>        # override OpenCode base URL
--turn-delay-ms <ms>        # delay between replayed user turns
--session-delay-ms <ms>     # delay between haystack sessions
--final-delay-ms <ms>       # delay before asking the final question
--question-ids <a,b>        # explicit question ID filter
--max-attempts <n>          # retry attempts for request failures
--retry-base-delay-ms <ms>  # base delay for exponential backoff
```

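The two retry flags suggest a standard exponential-backoff loop. A minimal sketch of how `--max-attempts` and `--retry-base-delay-ms` might combine (the helper name and exact backoff curve are assumptions, not the runner's actual code):

```typescript
// Hypothetical sketch: retry a flaky async operation with exponential backoff.
// `maxAttempts` and `baseDelayMs` correspond to --max-attempts and
// --retry-base-delay-ms; the runner's real implementation may differ.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Delay doubles each attempt: base, 2*base, 4*base, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

With `--max-attempts 5 --retry-base-delay-ms 500`, a persistently failing request would wait roughly 0.5s, 1s, 2s, and 4s between attempts before giving up.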
## Resumability

The runner is crash-safe at per-question execution granularity and persists state after each meaningful step.

Saved state includes:

- created OpenCode session IDs
- the current haystack session index
- the current turn index within the active haystack session
- whether the per-session banner/date marker was sent
- whether the final question session exists
- whether the final question banner was sent
- whether the final question was asked
- the captured hypothesis text
- accumulated OpenCode token usage and costs
- accumulated judge token usage and costs
- the last error snapshot

Artifacts written under the run output directory:

- `runner-state.json` — resumable in-progress state
- `results.jsonl` — append-only completed results
- `summary.json` — latest aggregated summary
- `runner.log` — timestamped runner log

On `--resume`, the runner validates that the saved selection matches the current dataset/filter selection.

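Crash safety of this kind usually hinges on writing state atomically. A minimal sketch of the common pattern (write to a temp file, then rename), assuming a plain JSON state object — not necessarily how `runner.ts` persists `runner-state.json`:

```typescript
import { readFileSync, renameSync, writeFileSync } from "node:fs";

// Hypothetical sketch: persist state so a crash mid-write never leaves a
// truncated state file. rename() within one filesystem is atomic on POSIX,
// so readers see either the old file or the new one, never a partial write.
function saveState(path: string, state: unknown): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(state, null, 2));
  renameSync(tmp, path);
}

function loadState<T>(path: string): T {
  return JSON.parse(readFileSync(path, "utf8")) as T;
}
```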
## Question prompt

The final benchmark question is wrapped as:

```text
Based on our previous conversations, please answer this question. Do not search files or use tools — answer purely from what you remember about our past interactions.

Question: {question}
```

The final question also sends a system instruction reinforcing that the model must answer from conversational memory only.

## Judge behavior

`judge.ts` implements the official LongMemEval category-specific judge prompts for:

- `single-session-user`
- `single-session-assistant`
- `single-session-preference`
- `multi-session`
- `temporal-reasoning`
- `knowledge-update`
- `_abs` abstention questions

Scoring follows the official evaluator behavior: the label is `true` if the judge response contains `yes`, case-insensitively.

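That scoring rule reduces to a single substring check. A minimal sketch, assuming the judge reply arrives as plain text (hypothetical helper name; `judge.ts` may structure this differently):

```typescript
// Hypothetical sketch of the official LongMemEval scoring rule: the answer
// is labeled correct iff the judge's reply contains "yes", case-insensitively.
function scoreJudgeResponse(response: string): boolean {
  return response.toLowerCase().includes("yes");
}
```

Note the deliberate looseness: any reply containing `yes` anywhere counts as a pass, which matches the official evaluator rather than strict exact-match parsing.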
## Cost tracking

The runner records:

- the actual OpenCode cost reported by assistant messages
- the estimated OpenCode cost from optional CLI pricing inputs
- the estimated judge cost from configured judge pricing

If the OpenCode pricing flags are omitted, the estimated OpenCode cost defaults to zero, while the actual OpenCode cost still uses the API-reported value.

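Estimated cost from pricing inputs is plain per-token arithmetic. A sketch under the assumption that prices are quoted per million tokens (the unit the runner's pricing flags actually use is not stated here):

```typescript
// Hypothetical sketch: estimate cost from token counts and per-million-token
// prices. Whether the runner's CLI pricing flags use this unit is an assumption.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

function estimateCost(
  usage: Usage,
  inputPricePerMTok: number,
  outputPricePerMTok: number,
): number {
  return (
    (usage.inputTokens / 1_000_000) * inputPricePerMTok +
    (usage.outputTokens / 1_000_000) * outputPricePerMTok
  );
}
```

For example, 2M input tokens and 0.5M output tokens at $3/$15 per million would estimate to $13.50.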
## Notes

- This runner builds the benchmark harness; it does not modify plugin runtime code.
- It does not inject dataset assistant messages.
- It does not bypass the OpenCode API.
