diff --git a/.gitignore b/.gitignore index 6cf9326..fea65f3 100644 --- a/.gitignore +++ b/.gitignore @@ -5,6 +5,8 @@ # Testing artefacts .temp-profile +tests/.env +tests/.env.local # logs geckodriver.log diff --git a/docs/test-plan.md b/docs/test-plan.md new file mode 100644 index 0000000..a4265eb --- /dev/null +++ b/docs/test-plan.md @@ -0,0 +1,162 @@ +# Selenium Test Harness — Improvement Plan + +Date: 2026-04-30 + +Overview + +This document captures an actionable plan to improve the Selenium-based integration tests in `tests/test.py` for the Zeeschuimer Firefox extension. The goals are to: + +- Make profile handling reliable and reusable (so logged-in sessions persist across runs). +- Preserve and export captured data per platform for offline analysis and for passing to 4CAT. +- Add optional automated upload to a 4CAT instance for mapping/validation tests. +- Reduce fragility caused by popups and interactive dialogs (pausing/dismissal patterns). +- Improve robustness, error handling, and machine-readable results. + +Scope + +All changes are confined to the test harness and test metadata (`tests/test.py` and `tests/tests.json`) and to this planning document. No changes are required in the extension source for the planned items (the test harness will interact with the extension's UI pages and background DB). + +Phases & Changes + +Phase 1 — Profile management + +- Problem: copying an entire profile can race with a running Firefox and the current ignore rule hides potentially useful session data. +- Changes: + - Detect if the selected profile directory appears locked (presence of `lock` or `.parentlock`) and warn if Firefox is running. + - Replace the naive ignore lambda used in `shutil.copytree` with a function that only excludes `storage`, `extensions`, and `signedInUser.json` at the profile root. 
+ - Add CLI flags: `--profile-name NAME` (choose profile by display name from `profiles.ini`), `--save-profile PATH` (save the temp profile for reuse), and `--no-cleanup` (do not remove `.temp-profile` after run). + +Implementation note (copytree ignore example): + +```python +def _profile_ignore(root, names): + # Only ignore these entries in the root profile dir + if os.path.abspath(root) == os.path.abspath(profile_dir): + return {"storage", "extensions", "signedInUser.json"} + return set() + +shutil.copytree(profile_dir, profile_file, ignore=_profile_ignore) +``` + +Phase 2 — Data preservation & export + +- Problem: `reset-all` wipes the DB before each URL; no artifacts are kept for post-mortem or mapping tests. +- Decision: export a single combined NDJSON file per platform containing items collected while testing that platform. +- Changes: + - Add CLI `--export-dir PATH` (default `./zeeschuimer-exports/{timestamp}/`). + - Before clicking `reset-all` for each URL, read the current DB contents from the extension background page (Dexie) via `execute_async_script` and append those items to a per-platform in-memory list in Python. After all URLs for a platform are done, write `{export-dir}/{platform}.ndjson`. + - Optionally add `--no-reset` to skip the `reset-all` call entirely (default behavior remains to reset before each URL). + +Execute_async_script pattern (example): + +```python +script = ''' +const cb = arguments[0]; +background.db.items.toArray().then(items => cb(JSON.stringify(items))).catch(e => cb(JSON.stringify({error: String(e)}))); +''' +items_json = driver.execute_async_script(script) +items = json.loads(items_json) +``` + +Phase 3 — 4CAT integration (optional) + +- Problem: mapping tests live in 4CAT and need NDJSON input. +- Changes: + - Add CLI flags: `--4cat-url URL` and `--4cat-key KEY` (API key). Require both for upload. 
+ - After writing the per-platform NDJSON, POST it to `{4cat_url.rstrip('/')}/api/import-dataset/` with header `X-Zeeschuimer-Platform: {platform}` and `Authorization: {key}` (confirm header with your 4CAT instance; alternative is to trigger the extension UI upload button when cookie-based auth is required). + - Do not fail the test run on 4CAT errors — print status and continue. + +Example upload with `requests`: + +```python +import requests +with open(ndjson_path, 'rb') as f: + headers = { + 'X-Zeeschuimer-Platform': platform, + 'Authorization': f'{fourcat_key}' + } + r = requests.post(f"{fourcat_url.rstrip('/')}/api/import-dataset/", headers=headers, data=f) + # check r.status_code and r.text for details +``` + +Phase 4 — Interactive controls & popup dismissals + +- Problem: cookie banners, paywall prompts, and other popups frequently interfere with automated navigation and can cause false failures. +- Decision: pause by default **once per platform** (not before every URL) so the tester can clear residual prompts; provide opt-out and finer-grained options. +- Changes: + - CLI flags: `--no-interactive` (disable all pauses), `--pause-before-url` (pause before each URL), `--pause-on-fail` (pause on failure), `--extra-wait N` (add N seconds to every wait), `--screenshot-dir PATH` (capture screenshots on fail/warning). + - Add a `dismiss-selectors` optional field in `tests.json` per URL: a list of CSS selectors to click to dismiss known popups. Example: + +```json +"dismiss-selectors": ["button.cookie-accept", ".modal .close"] +``` + + - Add per-URL `timeout` (page load timeout override). + +Phase 5 — Runner robustness & reporting + +- Problem: unhandled exceptions abort the run; final runtime is calculated incorrectly; no machine-readable results. +- Changes: + - Wrap each URL test body in try/except, increment `failed` on exceptions, and continue. 
+ - Move the global `start_time = time.time()` to before the outer platform loop so the final elapsed time is for the full run. + - Add CLI flags: `--results-file PATH` (write JSON summary), `--resume-from PLATFORM` (skip earlier platforms), and `--screenshot-dir PATH` (as noted). + - Fix small test metadata issues (e.g., `more-after-scrolll` typo in `tests.json`). + +tests.json schema additions + +- Per-URL optional fields: + - `dismiss-selectors`: array of CSS selectors to click after page load + - `timeout`: numeric page load timeout seconds for this URL + - `extra-wait`: per-URL additional wait seconds + +CLI flags (summary) + +- `--profiledir PATH` — explicit profile path (existing) +- `--profile-name NAME` — choose Firefox profile by display name +- `--save-profile PATH` — persist the copied profile for reuse +- `--no-cleanup` — keep `.temp-profile` +- `--export-dir PATH` — where to write NDJSON exports +- `--no-reset` — do not click `reset-all` between URLs +- `--4cat-url URL` — base URL for 4CAT server +- `--4cat-key KEY` — API key for 4CAT uploads +- `--4cat-per-url` — upload per URL instead of per platform (optional) +- `--no-interactive` — disable pausing (default is to pause per-platform) +- `--pause-before-url` — pause before each URL +- `--pause-on-fail` — pause when a test fails +- `--extra-wait N` — add N seconds to every URL wait +- `--screenshot-dir PATH` — save screenshots on fail/warning +- `--results-file PATH` — write machine-readable results JSON +- `--resume-from PLATFORM` — resume a run from a platform + +Verification checklist + +1. `python tests/test.py --sources instagram.com --export-dir ./exports` -> `exports/instagram.com.ndjson` exists and contains NDJSON with captured items. +2. `python tests/test.py --save-profile .saved-profile --login` -> create a saved profile that can be reused with `--profiledir .saved-profile`. +3. Run with default interactive behavior and confirm one pause per platform. +4. 
`python tests/test.py --results-file results.json` -> JSON summary produced with per-URL status and counts. +5. Test 4CAT upload using a local mock server and `--4cat-url http://localhost:8000 --4cat-key KEY`. + +Implementation steps (recommended order) + +1. Docs and small fixes (this document + tests.json typo fix). +2. Profile management changes (`--profile-name`, improved copy ignore, `--save-profile`, lock detection). +3. Export behavior: `--export-dir` + `execute_async_script` collection and NDJSON write. +4. Runner robustness: try/except around URL loop, `--results-file`, fix `start_time` placement. +5. Interactive and dismissal features (`dismiss-selectors`, pause flags, screenshots). +6. 4CAT upload integration (optional, requires confirmation of auth header). + +Estimated effort: 6–10 hours of focused work to implement and test everything end-to-end; can be split into 3-4 incremental PRs. + +Open questions / confirmations needed + +- Confirm 4CAT API key header format (currently suggested: `Authorization: {key}`). If your 4CAT requires cookie-based auth, we should emulate the extension upload button via Selenium instead. +- Confirm desired default for interactive mode. (Current recommendation: pause once per platform by default; provide `--no-interactive` to run fully headless.) + +Next steps + +- I have created a matching TODO list in the session tracker and written this document to `docs/test-plan.md`. +- If you want, I can start implementing Phase 1 (profile management) in `tests/test.py` now and submit incremental changes. + +--- + +Requested file: `docs/test-plan.md` diff --git a/js/lib.js b/js/lib.js index e38430e..c618a6a 100644 --- a/js/lib.js +++ b/js/lib.js @@ -57,6 +57,12 @@ class MissingMappedField { toString() { return `${this.value}`; } + + // Mirror 4CAT's API serialization so JSON.stringify produces the same + // tagged form on both sides. See docs/4cat-map-item-api.md. 
+ toJSON() { + return { __missing: true, value: this.value }; + } } /** diff --git a/modules/package.json b/modules/package.json new file mode 100644 index 0000000..3dbc1ca --- /dev/null +++ b/modules/package.json @@ -0,0 +1,3 @@ +{ + "type": "module" +} diff --git a/tests/.env.example b/tests/.env.example new file mode 100644 index 0000000..2e021bb --- /dev/null +++ b/tests/.env.example @@ -0,0 +1,9 @@ +# 4CAT API config for the map_item comparison tests. +# Copy this file to .env in this directory and fill in real values. +# .env is gitignored; .env.example is the committed template. + +# Base URL of the 4CAT instance to hit. No trailing slash. +FOURCAT_URL=http://localhost + +# API key for that 4CAT instance. Get one from the 4CAT UI; tied to your user. +FOURCAT_API_KEY=your-api-key-here diff --git a/tests/__pycache__/test.cpython-39.pyc b/tests/__pycache__/test.cpython-39.pyc new file mode 100644 index 0000000..745e2b4 Binary files /dev/null and b/tests/__pycache__/test.cpython-39.pyc differ diff --git a/tests/_module-info.js b/tests/_module-info.js new file mode 100644 index 0000000..e261e4e --- /dev/null +++ b/tests/_module-info.js @@ -0,0 +1,45 @@ +/** + * Shared helper for the map_item test drivers. + * + * Pre-validates a module by: + * 1. Running `node --check` on its file (syntax check; avoids the + * worker-killing experimental-ESM crash when a syntax error reaches + * the dynamic importer). + * 2. Dynamically importing it and checking for a `map_item` export. 
+ * + * Returns one of four states the test driver can branch on: + * { state: 'ok', map_item: } + * { state: 'no_map_item' } + * { state: 'syntax_error', error: } + * { state: 'import_error', error: } + */ + +import { spawnSync } from 'node:child_process'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const MODULES_ROOT = join(__dirname, '..', 'modules'); + +function check_module_syntax(module_name) { + const module_path = join(MODULES_ROOT, `${module_name}.js`); + const result = spawnSync(process.execPath, ['--check', module_path], { encoding: 'utf8' }); + if (result.status === 0) return null; + return (result.stderr || result.stdout || `exit code ${result.status}`).trim(); +} + +export async function inspect_module(module_name) { + const syntax_error = check_module_syntax(module_name); + if (syntax_error) { + return { state: 'syntax_error', error: syntax_error }; + } + try { + const mod = await import(`../modules/${module_name}.js`); + if (typeof mod.map_item !== 'function') { + return { state: 'no_map_item' }; + } + return { state: 'ok', map_item: mod.map_item }; + } catch (e) { + return { state: 'import_error', error: e }; + } +} diff --git a/tests/duplicate-behavior.test.js b/tests/duplicate-behavior.test.js index 031f663..9f0662b 100644 --- a/tests/duplicate-behavior.test.js +++ b/tests/duplicate-behavior.test.js @@ -5,8 +5,9 @@ * update or merge behaviors to duplicates across navigation boundaries. 
*/ +import 'fake-indexeddb/auto'; + let Dexie; -require('fake-indexeddb/auto'); // Mock browser extension APIs global.browser = { diff --git a/tests/fixtures/.gitignore b/tests/fixtures/.gitignore new file mode 100644 index 0000000..8e89a83 --- /dev/null +++ b/tests/fixtures/.gitignore @@ -0,0 +1,5 @@ +# Ignore everything in this directory +* +# Except these files +!.gitignore +!README.md \ No newline at end of file diff --git a/tests/fixtures/README.md b/tests/fixtures/README.md new file mode 100644 index 0000000..d24fe06 --- /dev/null +++ b/tests/fixtures/README.md @@ -0,0 +1,29 @@ +# Test fixtures for `map_item` + +Real captured items used to exercise each module's auto-generated `map_item` +function. + +## Layout + +``` +tests/fixtures/ + / + .ndjson + .ndjson +``` + +`` matches the filename in `modules/` without `.js` — +e.g. `tiktok/` → `modules/tiktok.js`, `pinterest/` → `modules/pinterest.js`. +You can drop multiple `.ndjson` files in a module folder; each gets its own +`describe` block and each line becomes its own `test`. + +Filenames are free-form — the auto-export filename from the popup +(`zeeschuimer-export--.ndjson`) is fine. + +## Privacy / committing + +These files contain real captured platform data — usernames, post +content, URLs, sometimes images and other PII. + +If we want to create test exports or anonymize real exports, add them to +.gitignore. 
\ No newline at end of file diff --git a/tests/jest.config.js b/tests/jest.config.cjs similarity index 64% rename from tests/jest.config.js rename to tests/jest.config.cjs index 7dd5b02..ea72b10 100644 --- a/tests/jest.config.js +++ b/tests/jest.config.cjs @@ -3,6 +3,7 @@ module.exports = { testMatch: ['**/*.test.js'], transform: {}, moduleFileExtensions: ['js', 'json'], - collectCoverageFrom: ['duplicate-behavior.test.js'], + collectCoverageFrom: ['*.test.js'], + setupFiles: ['/setup-globals.cjs'], verbose: true }; diff --git a/tests/map_item.test.js b/tests/map_item.test.js new file mode 100644 index 0000000..2dc1bb6 --- /dev/null +++ b/tests/map_item.test.js @@ -0,0 +1,121 @@ +/** + * Smoke test driver for module `map_item` functions. + * + * Convention: + * tests/fixtures//*.ndjson + * + * matches a file in modules/ (e.g. "tiktok" maps to modules/tiktok.js). + * Each .ndjson line is one Zeeschuimer-stored item exported from the popup. + * + * Each item is wrapped via wrap_for_map_item to mirror how 4CAT's importer + * presents items to a map_item function, then run through the module's + * map_item. Tests assert: function returns a non-null object, and any fields + * listed in REQUIRED_NON_EMPTY for that module are present and non-empty. 
+ * + * Module-level state is determined upfront by inspect_module(): + * - 'ok' → register per-item tests + * - 'no_map_item' → register a single skipped test (not applicable) + * - 'syntax_error' → register a single failing test pointing at the line + * - 'import_error' → register a single failing test with the message + */ + +import { readdirSync, readFileSync, statSync, existsSync } from 'node:fs'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { inspect_module } from './_module-info.js'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const FIXTURE_ROOT = join(__dirname, 'fixtures'); + +const REQUIRED_NON_EMPTY = { + tiktok: ['id', 'author', 'unix_timestamp'], +}; + +/** + * Local mirror of wrap_for_map_item from js/lib.js. lib.js is loaded by + * the browser as a plain script and so cannot be imported from Node; this + * three-line mirror is cheaper than restructuring lib.js into a module. + */ +function wrap_for_map_item(stored_item) { + const { data, ...meta } = stored_item; + return { ...data, __import_meta: meta }; +} + +function list_module_dirs() { + if (!existsSync(FIXTURE_ROOT)) return []; + return readdirSync(FIXTURE_ROOT).filter(name => { + try { return statSync(join(FIXTURE_ROOT, name)).isDirectory(); } + catch { return false; } + }); +} + +const module_dirs = list_module_dirs(); + +// Pre-pass: determine each module's state up front, before any tests are +// registered, so we can branch on it at describe/test registration time. +// Top-level await is supported in Jest's experimental-vm-modules mode. 
+const module_info = {}; +for (const module_name of module_dirs) { + module_info[module_name] = await inspect_module(module_name); +} + +let total_fixtures = 0; + +for (const module_name of module_dirs) { + const fixture_dir = join(FIXTURE_ROOT, module_name); + const fixture_files = readdirSync(fixture_dir).filter(f => f.endsWith('.ndjson')); + if (fixture_files.length === 0) continue; + total_fixtures += fixture_files.length; + + const info = module_info[module_name]; + + if (info.state === 'no_map_item') { + describe(`map_item: ${module_name}`, () => { + test.skip(`modules/${module_name}.js does not export a map_item function — nothing to smoke test`, () => {}); + }); + continue; + } + + if (info.state === 'syntax_error' || info.state === 'import_error') { + const msg = info.state === 'syntax_error' + ? `syntax error:\n${info.error}` + : `import failed: ${info.error.message}`; + describe(`map_item: ${module_name}`, () => { + test(`module loads`, () => { throw new Error(msg); }); + }); + continue; + } + + // state === 'ok' — register per-item tests + const map_item = info.map_item; + + describe(`map_item: ${module_name}`, () => { + for (const fixture_file of fixture_files) { + const lines = readFileSync(join(fixture_dir, fixture_file), 'utf8') + .split('\n') + .filter(line => line.trim().length > 0); + + describe(fixture_file, () => { + lines.forEach((line, i) => { + test(`item ${i} maps without throwing`, () => { + const stored_item = JSON.parse(line); + const mapped = map_item(wrap_for_map_item(stored_item)); + expect(mapped).not.toBeNull(); + expect(typeof mapped).toBe('object'); + for (const field of REQUIRED_NON_EMPTY[module_name] ?? 
[]) { + expect(mapped[field]).toBeDefined(); + expect(mapped[field]).not.toBe(''); + expect(mapped[field]).not.toBeNull(); + } + }); + }); + }); + } + }); +} + +if (total_fixtures === 0) { + describe('map_item', () => { + test.skip('no fixtures found under tests/fixtures//*.ndjson', () => {}); + }); +} diff --git a/tests/map_item_compare.test.js b/tests/map_item_compare.test.js new file mode 100644 index 0000000..37e3e4c --- /dev/null +++ b/tests/map_item_compare.test.js @@ -0,0 +1,283 @@ +/** + * @jest-environment node + * + * This file runs in Node test environment (not jsdom) because undici's + * fetch implementation uses Node-internal APIs (`clearImmediate`, + * `markResourceTiming`, fast-now timers, etc.) that jsdom shadows or + * doesn't expose. Polyfilling them into jsdom is whack-a-mole; node env + * has them all natively. + * + * Trade-off: no DOMParser in node env. The four modules that use + * `strip_tags` (gab, pinterest, rednote, truth) will need a DOMParser + * polyfill (e.g. via linkedom) before the comparator can run against + * them. Other modules (including instagram) work as-is. + */ +/** + * Compare JS map_item output against 4CAT's Python map_item via the API. + * + * For every line in every fixture, runs the JS map_item locally AND sends + * the same stored item to 4CAT's /api/map-item// endpoint, then + * diffs the two outputs field-by-field. Each item is its own Jest test — + * failures point at exactly which item and which fields diverge. + * + * Skips itself entirely if FOURCAT_URL / FOURCAT_API_KEY aren't set, so + * `npm test` keeps working without 4CAT configuration. Drop real values in + * tests/.env to enable. + * + * Datasource id mapping: tests/zeeschuimer-to-4cat.json (Zeeschuimer + * module filename → 4CAT datasource id, for the few names that diverge). 
+ * + * Module-level state is determined upfront by inspect_module() (no + * map_item / syntax errors / import errors are handled before tests are + * registered, so they appear once per module, not once per item). + */ + +import 'dotenv/config'; +import { jest } from '@jest/globals'; +import { readdirSync, readFileSync, statSync, existsSync } from 'node:fs'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; +import { inspect_module } from './_module-info.js'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); + +const FOURCAT_URL = process.env.FOURCAT_URL?.replace(/\/$/, ''); +const FOURCAT_API_KEY = process.env.FOURCAT_API_KEY; +const HAS_4CAT = Boolean( + FOURCAT_URL && FOURCAT_API_KEY && FOURCAT_API_KEY !== 'your-api-key-here' +); + +// When true (default), once any item in a module fails, subsequent items +// in that same module skip the HTTP + map_item work and fail fast with a +// "halted" message. Saves time when generator output is broken at the top. +// Set FAIL_FAST=0 in env to run all items regardless. +// Trim because cmd.exe's `set FAIL_FAST=0 && ...` includes the trailing +// space in the variable value, which would otherwise defeat `!== '0'`. +const FAIL_FAST = (process.env.FAIL_FAST ?? '').trim() !== '0'; +const halted_modules = new Set(); + +const FIXTURE_ROOT = join(__dirname, 'fixtures'); +const ID_MAP_PATH = join(__dirname, 'zeeschuimer-to-4cat.json'); +const ID_MAP = existsSync(ID_MAP_PATH) + ? 
JSON.parse(readFileSync(ID_MAP_PATH, 'utf8')) + : {}; + +function wrap_for_map_item(stored_item) { + const { data, ...meta } = stored_item; + return { ...data, __import_meta: meta }; +} + +async function call_4cat_map_item(datasource_id, item) { + const res = await fetch(`${FOURCAT_URL}/api/map-item/${datasource_id}/`, { + method: 'POST', + headers: { + // 4CAT accepts the raw key without a `Bearer ` prefix, per probe + 'Authorization': FOURCAT_API_KEY, + 'Content-Type': 'application/json', + }, + body: JSON.stringify({ item }), + }); + const text = await res.text(); + if (!res.ok) { + throw new Error(`HTTP ${res.status} from 4CAT: ${text}`); + } + return JSON.parse(text); +} + +// Round-trip a value through JSON so MappedItem, MissingMappedField, etc. +// become plain JSON-compatible objects matching what 4CAT emits. +function normalize(value) { + return JSON.parse(JSON.stringify(value)); +} + +// Recursive structural equality. Doesn't care about object key order, which +// matters for nested values like {__missing: true, value: ""} where JS and +// Python might emit keys in different orders. +function deep_equal(a, b) { + if (a === b) return true; + if (a === null || b === null) return a === b; + if (typeof a !== typeof b) return false; + if (typeof a !== 'object') return false; + if (Array.isArray(a) !== Array.isArray(b)) return false; + if (Array.isArray(a)) { + if (a.length !== b.length) return false; + return a.every((v, i) => deep_equal(v, b[i])); + } + const a_keys = Object.keys(a); + const b_keys = Object.keys(b); + if (a_keys.length !== b_keys.length) return false; + return a_keys.every(k => k in b && deep_equal(a[k], b[k])); +} + +function diff_objects(js_obj, py_obj) { + const diffs = []; + const keys = new Set([...Object.keys(js_obj ?? {}), ...Object.keys(py_obj ?? 
{})]); + for (const key of keys) { + const in_js = js_obj && key in js_obj; + const in_py = py_obj && key in py_obj; + if (!in_js) { + diffs.push({ key, kind: 'only_python', python: py_obj[key] }); + } else if (!in_py) { + diffs.push({ key, kind: 'only_js', js: js_obj[key] }); + } else if (!deep_equal(js_obj[key], py_obj[key])) { + diffs.push({ key, kind: 'mismatch', js: js_obj[key], python: py_obj[key] }); + } + } + return diffs; +} + +function format_diffs(diffs) { + return diffs.map(d => { + if (d.kind === 'only_js') { + return ` + only in JS: ${d.key} = ${JSON.stringify(d.js)}`; + } + if (d.kind === 'only_python') { + return ` - only in Python: ${d.key} = ${JSON.stringify(d.python)}`; + } + return ` ~ ${d.key}\n JS: ${JSON.stringify(d.js)}\n Python: ${JSON.stringify(d.python)}`; + }).join('\n'); +} + +// Pull out the first few module-frame lines from an error's stack so the +// failure message points at where in modules/.js the throw happened. +function format_error_with_location(err) { + if (!err) return String(err); + const message = err.message || String(err); + const stack = err.stack || ''; + const module_frames = stack.split('\n') + .filter(l => l.includes('/modules/') || l.includes('\\modules\\')) + .slice(0, 3) + .map(l => l.trim()); + return module_frames.length + ? `${message}\n ${module_frames.join('\n ')}` + : message; +} + +function list_module_dirs() { + if (!existsSync(FIXTURE_ROOT)) return []; + return readdirSync(FIXTURE_ROOT).filter(name => { + try { return statSync(join(FIXTURE_ROOT, name)).isDirectory(); } + catch { return false; } + }); +} + +// Per-test timeout: each test does one HTTP round-trip to 4CAT. Jest's +// default 5s is tight under load. 
+jest.setTimeout(30000); + +if (!HAS_4CAT) { + describe('map_item compare (JS vs 4CAT Python)', () => { + test.skip('FOURCAT_URL / FOURCAT_API_KEY not configured — set them in tests/.env to enable', () => {}); + }); +} else { + const module_dirs = list_module_dirs(); + + // Pre-pass: determine each module's state up front so we can branch + // on it at registration time. + const module_info = {}; + for (const module_name of module_dirs) { + module_info[module_name] = await inspect_module(module_name); + } + + let any_fixtures = false; + + for (const module_name of module_dirs) { + const fixture_dir = join(FIXTURE_ROOT, module_name); + const fixture_files = readdirSync(fixture_dir).filter(f => f.endsWith('.ndjson')); + if (fixture_files.length === 0) continue; + any_fixtures = true; + + const datasource_id = ID_MAP[module_name] ?? module_name; + const info = module_info[module_name]; + + if (info.state === 'no_map_item') { + // eslint-disable-next-line no-console + console.log(`[compare] skipping ${module_name}: modules/${module_name}.js does not export a map_item`); + continue; + } + + if (info.state === 'syntax_error' || info.state === 'import_error') { + const msg = info.state === 'syntax_error' + ? 
`syntax error:\n${info.error}` + : `import failed: ${info.error.message}`; + describe(`map_item compare: ${module_name}`, () => { + test(`module loads`, () => { throw new Error(msg); }); + }); + continue; + } + + // state === 'ok' — register per-item comparison tests + const map_item = info.map_item; + + describe(`map_item compare: ${module_name} (4CAT id: ${datasource_id})`, () => { + for (const fixture_file of fixture_files) { + const lines = readFileSync(join(fixture_dir, fixture_file), 'utf8') + .split('\n') + .filter(line => line.trim().length > 0); + + describe(fixture_file, () => { + lines.forEach((line, i) => { + test(`item ${i}`, async () => { + if (FAIL_FAST && halted_modules.has(module_name)) { + throw new Error( + '[halted after prior failure in this module — set FAIL_FAST=0 to run all items]' + ); + } + try { + const stored_item = JSON.parse(line); + + // 4CAT side + const response = await call_4cat_map_item(datasource_id, stored_item); + + // JS side + let js_result; + let js_error; + try { + js_result = map_item(wrap_for_map_item(stored_item)); + } catch (e) { + js_error = e; + } + + if (response.status === 'mapped') { + if (js_error) { + throw new Error( + `4CAT mapped this item but JS threw: ${format_error_with_location(js_error)}` + ); + } + const js_obj = normalize(js_result); + const py_obj = normalize(response.item); + const diffs = diff_objects(js_obj, py_obj); + if (diffs.length > 0) { + throw new Error( + `${diffs.length} field(s) differ between JS and 4CAT:\n${format_diffs(diffs)}` + ); + } + } else if (response.status === 'skipped') { + if (!js_error) { + throw new Error( + `4CAT skipped this item ("${response.reason}") but JS produced a result` + ); + } + // Both rejected — good. Skip reasons may differ in wording. 
+ } else if (response.status === 'error') { + throw new Error(`4CAT errored on this item: ${response.message}`); + } else { + throw new Error(`unexpected 4CAT response status: ${JSON.stringify(response)}`); + } + } catch (e) { + if (FAIL_FAST) halted_modules.add(module_name); + throw e; + } + }); + }); + }); + } + }); + } + + if (!any_fixtures) { + describe('map_item compare (JS vs 4CAT Python)', () => { + test.skip('no fixtures under tests/fixtures//*.ndjson', () => {}); + }); + } +} diff --git a/tests/package-lock.json b/tests/package-lock.json index cc8f457..7758e9f 100644 --- a/tests/package-lock.json +++ b/tests/package-lock.json @@ -9,9 +9,11 @@ "version": "1.0.0", "devDependencies": { "dexie": "^3.2.4", + "dotenv": "^16.4.5", "fake-indexeddb": "^5.0.1", "jest": "^29.7.0", - "jest-environment-jsdom": "^29.7.0" + "jest-environment-jsdom": "^29.7.0", + "undici": "^6.20.0" } }, "node_modules/@babel/code-frame": { @@ -1758,6 +1760,19 @@ "node": ">=12" } }, + "node_modules/dotenv": { + "version": "16.6.1", + "resolved": "https://registry.npmjs.org/dotenv/-/dotenv-16.6.1.tgz", + "integrity": "sha512-uBq4egWHTcTt33a72vpSG0z3HnPuIl6NqYcTrKEg2azoEyl2hpW0zqlxysq2pK9HlDIHyHyakeYaYnSAwd8bow==", + "dev": true, + "license": "BSD-2-Clause", + "engines": { + "node": ">=12" + }, + "funding": { + "url": "https://dotenvx.com" + } + }, "node_modules/dunder-proto": { "version": "1.0.1", "resolved": "https://registry.npmjs.org/dunder-proto/-/dunder-proto-1.0.1.tgz", @@ -4183,6 +4198,16 @@ "url": "https://github.com/sponsors/sindresorhus" } }, + "node_modules/undici": { + "version": "6.26.0", + "resolved": "https://registry.npmjs.org/undici/-/undici-6.26.0.tgz", + "integrity": "sha512-4yqz8a3n5HmGTlsbADNtr/dJlhkh/55Rq798G6ibiULcXbDtaLpTl1pvdqcbFfeoj3iSi52lePFM7h9H21cw/A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=18.17" + } + }, "node_modules/undici-types": { "version": "7.16.0", "resolved": "https://registry.npmjs.org/undici-types/-/undici-types-7.16.0.tgz", 
diff --git a/tests/package.json b/tests/package.json index dc3654c..390fdd3 100644 --- a/tests/package.json +++ b/tests/package.json @@ -2,14 +2,18 @@ "name": "zeeschuimer-db-tests", "version": "1.0.0", "description": "Unit tests for Zeeschuimer duplicate handling logic", + "type": "module", "scripts": { - "test": "jest", - "test:watch": "jest --watch" + "test": "node --experimental-vm-modules node_modules/jest/bin/jest.js", + "test:watch": "node --experimental-vm-modules node_modules/jest/bin/jest.js --watch", + "probe": "node probe-4cat.mjs" }, "devDependencies": { "dexie": "^3.2.4", + "dotenv": "^16.4.5", "fake-indexeddb": "^5.0.1", "jest": "^29.7.0", - "jest-environment-jsdom": "^29.7.0" + "jest-environment-jsdom": "^29.7.0", + "undici": "^6.20.0" } } diff --git a/tests/probe-4cat.mjs b/tests/probe-4cat.mjs new file mode 100644 index 0000000..0bf4e4d --- /dev/null +++ b/tests/probe-4cat.mjs @@ -0,0 +1,140 @@ +/** + * Manually exercise 4CAT's /api/map-item/ endpoint against a fixture item. + * + * Usage: + * node probe-4cat.mjs [] [--index N] + * + * is the Zeeschuimer module filename without `.js` (e.g. + * "tiktok", "pinterest"). If is omitted, the first + * .ndjson in tests/fixtures// is used. --index selects which + * line of the fixture to send (default 0). + * + * Requires tests/.env with FOURCAT_URL and FOURCAT_API_KEY. 
+ */ + +import 'dotenv/config'; +import { readFileSync, existsSync, readdirSync } from 'node:fs'; +import { join, dirname } from 'node:path'; +import { fileURLToPath } from 'node:url'; + +const __dirname = dirname(fileURLToPath(import.meta.url)); + +const FOURCAT_URL = process.env.FOURCAT_URL?.replace(/\/$/, ''); +const FOURCAT_API_KEY = process.env.FOURCAT_API_KEY; + +if (!FOURCAT_URL || !FOURCAT_API_KEY || FOURCAT_API_KEY === 'your-api-key-here') { + console.error('error: FOURCAT_URL and FOURCAT_API_KEY must be set in tests/.env'); + console.error(' (copy tests/.env.example to tests/.env and fill in real values)'); + process.exit(1); +} + +const ID_MAP_PATH = join(__dirname, 'zeeschuimer-to-4cat.json'); +const ID_MAP = existsSync(ID_MAP_PATH) + ? JSON.parse(readFileSync(ID_MAP_PATH, 'utf8')) + : {}; + +function auth_headers() { + return { 'Authorization': `${FOURCAT_API_KEY}` }; +} + +async function list_datasources() { + const res = await fetch(`${FOURCAT_URL}/api/datasources/`, { headers: auth_headers() }); + if (!res.ok) { + throw new Error(`GET /api/datasources/ → ${res.status}: ${await res.text()}`); + } + const body = await res.json(); + return body.datasources ?? 
[]; +} + +async function map_item(datasource_id, item) { + const res = await fetch(`${FOURCAT_URL}/api/map-item/${datasource_id}/`, { + method: 'POST', + headers: { ...auth_headers(), 'Content-Type': 'application/json' }, + body: JSON.stringify({ item }), + }); + const text = await res.text(); + let body; + try { body = JSON.parse(text); } catch { body = { raw: text }; } + return { status_code: res.status, body }; +} + +function parse_args(argv) { + const args = { module: null, fixture: null, index: 0 }; + const positional = []; + for (let i = 2; i < argv.length; i++) { + if (argv[i] === '--index') { + args.index = parseInt(argv[++i], 10); + } else if (argv[i].startsWith('--index=')) { + args.index = parseInt(argv[i].split('=')[1], 10); + } else { + positional.push(argv[i]); + } + } + args.module = positional[0]; + args.fixture = positional[1]; + return args; +} + +async function main() { + const args = parse_args(process.argv); + if (!args.module) { + console.error('Usage: node probe-4cat.mjs <module> [<fixture>] [--index N]'); + process.exit(1); + } + + const datasource_id = ID_MAP[args.module] ?? args.module; + const fixture_dir = join(__dirname, 'fixtures', args.module); + + if (!existsSync(fixture_dir)) { + console.error(`error: no fixture dir at ${fixture_dir}`); + process.exit(1); + } + + const candidates = readdirSync(fixture_dir).filter(f => f.endsWith('.ndjson')); + if (candidates.length === 0) { + console.error(`error: no .ndjson fixtures under ${fixture_dir}`); + process.exit(1); + } + const fixture_name = args.fixture ?? 
candidates[0]; + const fixture_path = join(fixture_dir, fixture_name); + if (!existsSync(fixture_path)) { + console.error(`error: fixture ${fixture_path} not found`); + process.exit(1); + } + + const lines = readFileSync(fixture_path, 'utf8').split('\n').filter(l => l.trim().length > 0); + if (args.index >= lines.length) { + console.error(`error: --index ${args.index} but fixture has ${lines.length} items`); + process.exit(1); + } + const item = JSON.parse(lines[args.index]); + + console.log(`Module: ${args.module}`); + console.log(`Datasource id: ${datasource_id}${ID_MAP[args.module] ? ' (mapped via zeeschuimer-to-4cat.json)' : ''}`); + console.log(`URL: ${FOURCAT_URL}/api/map-item/${datasource_id}/`); + console.log(`Fixture: ${fixture_name}, item ${args.index} (item_id=${item.item_id ?? item.id})`); + console.log(''); + + const { status_code, body } = await map_item(datasource_id, item); + console.log(`HTTP ${status_code}`); + console.log(JSON.stringify(body, null, 2)); + + if (status_code === 404) { + console.error(''); + console.error('Hint: datasource id may be wrong. Available Zeeschuimer-origin datasources:'); + try { + const datasources = await list_datasources(); + datasources + .filter(d => d.is_from_zeeschuimer && d.has_map_item) + .forEach(d => console.error(` - ${d.id} (${d.name})`)); + } catch (e) { + console.error(` (couldn't fetch list: ${e.message})`); + } + process.exit(2); + } +} + +main().catch(e => { + console.error(`probe failed: ${e.message}`); + process.exit(2); +}); diff --git a/tests/setup-globals.cjs b/tests/setup-globals.cjs new file mode 100644 index 0000000..6793cc0 --- /dev/null +++ b/tests/setup-globals.cjs @@ -0,0 +1,53 @@ +/** + * Make js/lib.js's helpers available as globals inside the Jest test + * environment, mirroring how the browser sees them after the manifest + * loads lib.js as a plain script. 
+ * + * map_item bodies reference these as free identifiers (MappedItem, + * MissingMappedField, strip_tags, normalize_url_encoding, ...). Without this + * shim they'd hit ReferenceError as soon as a test invokes map_item. + * + * Approach: read lib.js, wrap it in a new Function() body that returns the + * named helpers, call the function, and assign the returned object onto + * globalThis. (Earlier attempt with vm.runInThisContext failed because in + * the jsdom env the vm context's global differs from jsdom's window.) + * + * If a new helper is added to lib.js, append its name to EXPOSED_NAMES. + */ + +const fs = require('node:fs'); +const path = require('node:path'); + +const EXPOSED_NAMES = [ + 'traverse_data', + 'MappedItem', + 'MissingMappedField', + 'MapItemException', + 'wrap_for_map_item', + 'strip_tags', + 'normalize_url_encoding', + 'formatUtcTimestamp', +]; + +const lib_source = fs.readFileSync( + path.join(__dirname, '..', 'js', 'lib.js'), + 'utf8', +); + +const factory = new Function(` +${lib_source} +return { ${EXPOSED_NAMES.join(', ')} }; +`); + +Object.assign(globalThis, factory()); + +// Note on fetch: jsdom doesn't expose fetch, and Jest's jsdom env shadows +// Node's global fetch, so the comparator can't hit 4CAT from a jsdom test. +// Tests that use fetch (e.g. map_item_compare.test.js) therefore declare +// `@jest-environment node` at the top of the file; the Node env has fetch +// natively. Don't try to polyfill fetch into jsdom from the npm `undici` +// package (a separately installable HTTP client, distinct from the undici +// bundled internally by Node, which isn't require()-able by name): its +// internals use Node-specific globals that jsdom shadows (clearImmediate, +// markResourceTiming, fast timers), and polyfilling them all is brittle. 
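The `new Function()` approach described in the header comment of `setup-globals.cjs` can be tried in isolation. The following is a minimal sketch, assuming a small stand-in source string instead of the real `js/lib.js`; the names `demo_source`, `DEMO_NAMES`, `strip_tags_demo` and `DemoItem` are hypothetical and exist only for this illustration.

```javascript
// Stand-in for the contents of js/lib.js: a plain script that declares
// top-level functions and classes without exporting them. Hypothetical names.
const demo_source = `
function strip_tags_demo(html) {
  return html.replace(/<[^>]*>/g, '');
}
class DemoItem {
  constructor(data) { this.data = data; }
}
`;

const DEMO_NAMES = ['strip_tags_demo', 'DemoItem'];

// Wrap the script in a function body whose final statement returns the
// named declarations, lifting them out of the function's local scope.
const demo_factory = new Function(`
${demo_source}
return { ${DEMO_NAMES.join(', ')} };
`);

// Copy the returned helpers onto globalThis so later code can reference
// them as free identifiers, the way map_item bodies do.
Object.assign(globalThis, demo_factory());

console.log(strip_tags_demo('<b>hi</b>'));
console.log(new DemoItem({ id: 1 }).data.id);
```

Because the helper names are interpolated into the `return { ... }` statement, a name that the source does not declare makes the factory call throw a `ReferenceError`, which is a useful early signal when the name list and the library drift apart.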
diff --git a/tests/translation-errors.md b/tests/translation-errors.md new file mode 100644 index 0000000..fcc160d --- /dev/null +++ b/tests/translation-errors.md @@ -0,0 +1,430 @@ +# Auto-generator translation errors + +Patterns of incorrect Python → JavaScript translation observed in +auto-generated `modules/*.js` files. Each entry has a search pattern so +this doc doubles as a checklist when reviewing a new auto-generator PR. + +When an entry is fixed at the generator level (no longer appears in +fresh output), mark it `[fixed]` and keep the entry around — useful +history when something regresses. + +## How to use + +- Found a new pattern? Add an entry below following the template. +- Reviewing a generator PR? `grep` each `Search pattern` against the + changed module files. Anything that hits is worth a manual look. +- Iterating on the generator prompt? The "Why" lines are the + feedback to add — they describe the exact Python-vs-JS semantic + difference the LLM keeps missing. + +## Template + +``` +### + +**Status:** open | fixed in generator | accepted + +**Why it happens:** + +**Wrong JS:** +```js + +``` + +**Correct JS:** +```js + +``` + +**Example:** `modules/.js:` + +**Search pattern:** `` +``` + +--- + +## Observed patterns + +### `in` operator on strings + +**Status:** open + +**Why it happens:** In Python, `"x" in some_string` is a substring check. +In JavaScript, the `in` operator only works on **objects** and checks for +property/key existence; using it with a string on the right-hand side +throws `TypeError: cannot use 'in' operator to search for "x" in `. + +**Wrong JS:** +```js +const is_polaris = '__typename' in item && 'polaris' in item.__typename.toLowerCase(); +``` + +**Correct JS:** +```js +const is_polaris = '__typename' in item && item.__typename.toLowerCase().includes('polaris'); +``` + +**Example:** `modules/instagram.js:513` + +**Search pattern:** `'[^']+' in [a-zA-Z_$][\w$]*\.` — quoted string followed +by `in` followed by a method call. 
Quick rough check: `grep -E "' in [a-zA-Z]" modules/` + +**Watch out for partial fixes:** seen as `'polaris' in (item.__typename ?? '').toLowerCase()` +— adding `?? ''` guards against `undefined` but the `in` operator itself +still throws on the resulting *string*. The fix is `.includes()`, not just +defaulting the operand. + +--- + +### Python f-string syntax left in single-quoted JS strings + +**Status:** open + +**Why it happens:** Python `f"... {var} ..."` interpolates. JS uses +template literals (backticks) with `${var}`. The auto-generator leaves the +`{var}` notation in a regular single- or double-quoted JS string, which is +just literal text — no interpolation happens. + +**Wrong JS:** +```js +throw new MapItemException('Unable to parse item: different user {user.id} and owner {owner.id}'); +``` + +**Correct JS:** +```js +throw new MapItemException(`Unable to parse item: different user ${user.id} and owner ${owner.id}`); +``` + +**Example:** `modules/instagram.js:754` + +**Search pattern:** `'[^']*\{[a-zA-Z_$][\w$.]*\}[^']*'` or `"[^"]*\{[a-zA-Z_$][\w$.]*\}[^"]*"` +— a non-template-literal string containing `{identifier}` or `{identifier.path}`. +Quick check: `grep -nE "['\"][^'\"]*\{[a-zA-Z_][a-zA-Z0-9_.]*\}[^'\"]*['\"]" modules/` + +--- + +### `?? {}` default that defeats subsequent truthy checks + +**Status:** open + +**Why it happens:** When porting Python's `node.get('user') or {}` (which is +intended to make subsequent code safe to call), the generator emits +`node.user ?? {}`. That's a *valid* Python-equivalent, **but** any following +`if (user && owner) { ... }` guard then never short-circuits because both +`{}` references are truthy. The check ends up reading "if user and owner +*objects* exist" when the intent was "if user and owner data exist." +Subsequent property accesses then compare real ids/usernames against +`undefined` on the missing side, often throwing. + +**Wrong JS:** +```js +const user = node.user ?? {}; +const owner = node.owner ?? 
{}; +if (user && owner) { + if (user.id === owner.id) { /* … */ } + else if (user.username !== owner.username) { + throw new MapItemException('different user and owner'); + } +} +``` + +**Correct JS** (depending on intent — pick one): +```js +// (a) drop the defaults so truthy guard means "both present" +const user = node.user; +const owner = node.owner; +if (user && owner) { /* compare */ } +``` +```js +// (b) check for actual content, not just object identity +const user = node.user ?? {}; +const owner = node.owner ?? {}; +if (Object.keys(user).length && Object.keys(owner).length) { /* compare */ } +``` + +**Example:** `modules/instagram.js:748-756` + +**Search pattern:** `\?\?\s*\{\s*\}` — any `?? {}` occurrence is worth a +review of subsequent guards. Quick check: `grep -nE "\?\?\s*\{\s*\}" modules/` + +--- + +### Bare relative path as a statement (junk auto-imports section) + +**Status:** open + +**Why it happens:** The generator emits an "auto-generated imports" marker +block at the top of the module but writes the import target as a bare +relative path on its own line (`../js/lib.js`) instead of a real `import` +statement. JS parses that as `..` then `.` then `/js/lib.js` — syntax error. + +**Wrong JS:** +```js +// === auto-generated imports for map_item — DO NOT EDIT BY HAND === +../js/lib.js +// === end auto-generated imports === +``` + +**Correct JS** (one of): +```js +// === auto-generated imports — DO NOT EDIT BY HAND === +// Provided as globals by js/lib.js (loaded via manifest.json): +// MappedItem, MissingMappedField, MapItemException, traverse_data, +// strip_tags, normalize_url_encoding, formatUtcTimestamp +// === end auto-generated imports === +``` + +Or, if a real import is intended, an ESM import with named bindings: +```js +import { MappedItem, MissingMappedField } from '../js/lib.js'; +``` + +**Example:** seen historically in `modules/tiktok.js:2` + +**Search pattern:** `^\.\./` at the start of a line in module files. 
+Quick check: `grep -nE "^\.\." modules/*.js` + +--- + +### Key-existence check (`'X' in obj`) used where Python intended value-truthiness (`obj.get('X')`) + +**Status:** open + +**Why it happens:** Python's `if node.get('usertags'):` is a *truthy check on +the value* — returns False if the key is missing **or** if the value is +`None`/empty/falsy. The generator translates this to `if ('usertags' in +node)`, which in JS is a *key-existence check* — returns True even when +the value is `null`. Subsequent property accesses on the null value then +throw `Cannot read properties of null`. + +**Wrong JS:** +```js +const usertags = 'usertags' in node ? node.usertags.in.map(...).join(',') : ''; +// node.usertags can be null → .in.map blows up +``` + +**Correct JS:** +```js +const usertags = node.usertags ? node.usertags.in.map(...).join(',') : ''; +``` + +**Example:** `modules/instagram.js:777` + +**Search pattern:** `'[^']+' in [a-zA-Z_$][\w$]*\s*\?` — quoted-string `in` +identifier followed by `?` (ternary). Quick check: +`grep -nE "'[^']+' in [a-zA-Z_]+ \?" modules/` + +--- + +### Datetime serialization format mismatch + +**Status:** open + +**Why it happens:** Python's `datetime.utcfromtimestamp(t).strftime('%Y-%m-%d %H:%M:%S')` +produces `"2026-05-13 21:27:31"` — space-separated, no timezone marker. JS's +`new Date(t * 1000).toISOString()` produces `"2026-05-13T21:27:31.000Z"` — T +separator, milliseconds, Z. The generator emits the JS `.toISOString()` form +instead of using the existing `formatUtcTimestamp` helper from lib.js that +mimics Python's output exactly. 
+ +**Wrong JS:** +```js +collected_at = new Date(node.taken_at * 1000).toISOString(); +``` + +**Correct JS:** +```js +collected_at = formatUtcTimestamp(node.taken_at); +// formatUtcTimestamp is defined in js/lib.js as: +// new Date(unixSeconds * 1000).toISOString().replace('T', ' ').slice(0, 19) +``` + +**Example:** `modules/instagram.js:782` + +**Search pattern:** `new Date\([^)]+\)\.toISOString\(\)` — any use of +`.toISOString()`. The helper should be used instead. Quick check: +`grep -nE "\.toISOString\(\)" modules/` + +--- + +### `re.findall` capture groups vs JS `.match` with /g flag + +**Status:** open + +**Why it happens:** Python's `re.findall(r'#(\w+)', s)` returns the **capture +group contents**: `['lotr', 'woodart']`. JS's `s.match(/#(\w+)/g)` (with the +global flag) returns the **full matches**: `['#lotr', '#woodart']` — capture +groups are ignored. The generator translates the regex literally without +adjusting for this semantic difference, so the resulting strings keep +prefixes/wrappers that Python would have stripped. + +**Wrong JS:** +```js +hashtags: caption.match(/#([^\s!@#$%^&*()_+{}:"|<>?;',./`~]+)/g)?.join(',') +// produces "#lotr,#woodart" +``` + +**Correct JS:** +```js +// Option A: strip the literal prefix from each full match +hashtags: caption.match(/#([^\s...]+)/g)?.map(h => h.slice(1)).join(',') ?? '' +// Option B: use matchAll to get capture groups properly +hashtags: [...caption.matchAll(/#([^\s...]+)/g)].map(m => m[1]).join(',') ?? '' +``` + +**Example:** `modules/instagram.js:812` (also 766, 870 — three copies) + +**Search pattern:** `\.match\(/[^/]*\([^/]*\)[^/]*/g\)` — any `.match()` with +a global-flag regex containing a capture group. 
Quick check: +`grep -nE "\.match\(/.*\(.*\).*\/g\)" modules/` + +--- + +### `undefined` field values get dropped from JSON, but Python's `None` becomes `null` + +**Status:** open + +**Why it happens:** When `JSON.stringify` encounters an object property whose +value is `undefined`, it **omits the key entirely** from the output. Python's +`json.dumps` serializes `None` as `null`, keeping the key. The generator +writes assignments like `location.city = node.location.city` where the +right-hand side can be `undefined`, producing missing keys in JS output +that show up as `only in Python: = null` diffs against 4CAT. + +**Wrong JS:** +```js +location.city = node.location.city; // undefined if .city missing +// JSON.stringify({location_city: undefined}) → "{}" (key omitted) + +body: caption, // null if no caption — Python returns "" here, not null +``` + +**Correct JS:** +```js +// Whichever fallback Python uses for that specific field: +location.city = node.location.city ?? null; // some fields → null +body: caption ?? '', // other fields → "" +``` + +**Example:** `modules/instagram.js:745, 853` (`null` flavor), +559, 648, 798 (`""` flavor for `body`) + +**Note:** Python's choice of `None` vs `""` is per-field — there's no +universal rule. When the comparator reports `~ X JS: null Python: ""` use +`?? ''`. When it reports `- only in Python: X = null` use `?? null`. The +distinction matters because the JS output should match Python's choice +exactly for that field. + +**Search pattern:** harder to grep automatically — any property assignment +where the RHS could be `undefined`/`null` and the resulting field is +expected to appear in the mapped output. Look at "only in Python: X = null" +and "~ X JS: null Python: \"\"" diffs in the comparator output to find +specific cases. 
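The serialization difference described in this entry is easy to reproduce outside the comparator. A minimal sketch, assuming a hypothetical `node` object in which `city` was never set and `region` is explicitly `null` (the field names are illustrative, not taken from a real module):

```javascript
// Hypothetical input: `city` is absent, `region` is explicitly null.
const node = { location: { region: null } };

// Naive translation: a missing property reads as undefined.
const naive = {
  location_city: node.location.city,
  location_region: node.location.region,
};

// Explicit fallback: undefined is normalized to null before serializing.
const explicit = {
  location_city: node.location.city ?? null,
  location_region: node.location.region ?? null,
};

const naive_json = JSON.stringify(naive);
const explicit_json = JSON.stringify(explicit);

console.log(naive_json);    // {"location_region":null}
console.log(explicit_json); // {"location_city":null,"location_region":null}
```

Only the second object keeps the `location_city` key, which is the shape a Python mapper returning `None` for that field would serialize to.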
+ +--- + +### Object-reference inequality used as type check + +**Status:** open + +**Why it happens:** The generator emits `caption !== new MissingMappedField('')` +to mean "caption is not a missing-marker", but `new MissingMappedField('')` +creates a fresh object every time, and `!==` on objects compares references. +The expression is **always true**, so the conditional never takes the +"missing" branch. Likely originates from Python idioms like `caption != ""` +or `caption is not None`, mistranslated through the MissingMappedField +abstraction. + +**Wrong JS:** +```js +hashtags: caption !== new MissingMappedField('') ? caption.match(...) : '', +// !== between two different object references is always true +``` + +**Correct JS:** +```js +// If the intent was "if caption has content", just truthy-check it: +hashtags: caption ? caption.match(...) : '', +// If the intent was "if caption is not a MissingMappedField instance": +hashtags: !(caption instanceof MissingMappedField) ? caption.match(...) : '', +``` + +**Example:** `modules/instagram.js:812` (and two other copies) + +**Search pattern:** `!== new [A-Z]` or `=== new [A-Z]` — any equality +comparison with a freshly-constructed object. Quick check: +`grep -nE "(!==|===) new [A-Z]" modules/` + +--- + +### `.method()` chain on potentially-null result + +**Status:** open + +**Why it happens:** In Python, calling a method on `None` raises +`AttributeError`, which 4CAT sometimes catches. In JS, calling a method on +`null`/`undefined` throws `TypeError: Cannot read properties of null +(reading '')`. The generator emits the same dotted chain without +optional-chaining (`?.`) protection. + +**Wrong JS:** +```js +hashtags: caption !== new MissingMappedField('') + ? caption.match(/#([^\s!@#$%^&*()_+{}:"|<>?;',./`~]+)/g)?.join(',') + : '', +``` +(here `caption` is allowed to be `null`, so `caption.match(...)` blows up +on null caption) + +**Correct JS:** +```js +hashtags: caption + ? 
caption.match(/#([^\s!@#$%^&*()_+{}:"|<>?;',./`~]+)/g)?.join(',') ?? '' + : '', +``` + +**Example:** `modules/instagram.js:809` + +**Search pattern:** harder to grep — needs reading. Worth manual review of +any field that uses `caption.match`, `something.split`, `something.join` +without `?.` on a value that could be null/undefined. + +--- + +## Generator prompt feedback (running list) + +Concrete things to fold into the generator's prompt over time: + +1. **Python `x in y` where `y` is a string** → use `y.includes(x)` in JS, + never `x in y`. +2. **Python f-strings** → use JS template literals (backticks) with + `${...}` syntax. Never leave `{...}` in single- or double-quoted strings. +3. **`?? {}` after a `.get(...) or {}` translation** → only use this if the + following code does property-access. If the following code does a + truthy guard (`if (x && y)`), drop the default and use just `node.user`. +4. **Method chains on possibly-null values** → use `?.` (optional + chaining) instead of `.` whenever the receiver could be null/undefined. +5. **The auto-imports header block** → emit either real `import { ... }` + statements with valid relative paths, or a comment-only header. + Never emit bare paths as JS statements. +6. **Python `node.get('X')` truthy check** → in JS, use `node.X` (or + `node.X != null`), not `'X' in node`. The `in` operator checks key + existence, which is True even for explicit-null values. +7. **Datetime serialization** → use the `formatUtcTimestamp` helper from + lib.js (which mimics Python's `strftime('%Y-%m-%d %H:%M:%S')` format), + not `new Date(...).toISOString()` (which has a different output shape: + T separator, milliseconds, Z suffix). +8. **`re.findall` with capture groups** → in JS, `.match(/.../g)` returns + full matches, NOT capture groups. To get capture-group behavior, use + either `[...s.matchAll(/.../g)].map(m => m[1])` or post-process the + full matches with `.map(...)` to strip the literal parts. +9. 
**Object-reference equality (`!== new X(...)`)** → never. Creating an + object with `new` produces a fresh reference; `===`/`!==` compares + identity. Use `instanceof X` for type checks, or compare values + directly. The MissingMappedField "is this missing?" check should be + `caption instanceof MissingMappedField` or just truthy-check the value. +10. **Python `None` → JSON `null` vs JS `undefined` → omitted** — when a + field's value could be missing and Python returns `None` (serialized as + `null`) for it, JS must explicitly assign `null` (not leave the value as + `undefined`). `JSON.stringify` silently drops keys whose value is + `undefined`. Use `value ?? null` when the field is expected to appear in + the mapped output. diff --git a/tests/zeeschuimer-to-4cat.json b/tests/zeeschuimer-to-4cat.json new file mode 100644 index 0000000..f7de942 --- /dev/null +++ b/tests/zeeschuimer-to-4cat.json @@ -0,0 +1,7 @@ +{ + "_comment": "Maps Zeeschuimer module filenames (without .js) to 4CAT datasource ids when they differ. Default behavior is identity — only include entries where the two diverge. Discovered via http://localhost/api/datasources/.", + "9gag": "ninegag", + "truth": "truthsocial", + "rednote": "xiaohongshu", + "rednote-comments": "xiaohongshu-comments" +}
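The mapping file is consumed in `probe-4cat.mjs` as a plain lookup with identity fallback (`ID_MAP[args.module] ?? args.module`). A minimal sketch of that behavior, assuming an inline copy of the four entries; the helper name `to_datasource_id` is hypothetical, since the probe does the lookup inline:

```javascript
// Inline copy of the entries from tests/zeeschuimer-to-4cat.json
// (the "_comment" key is left out here for brevity).
const id_map = {
  '9gag': 'ninegag',
  'truth': 'truthsocial',
  'rednote': 'xiaohongshu',
  'rednote-comments': 'xiaohongshu-comments',
};

// Hypothetical helper: return the 4CAT datasource id for a Zeeschuimer
// module name, falling back to the module name itself when unmapped.
function to_datasource_id(module_name) {
  return id_map[module_name] ?? module_name;
}

console.log(to_datasource_id('9gag'));   // ninegag
console.log(to_datasource_id('tiktok')); // tiktok
```

Since the real file is parsed as a whole, its `_comment` key is also part of the lookup object; that only matters if a module were ever named `_comment`.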