diff --git a/.agents/skills/building-dashboards/.meta/.gitkeep b/.agents/skills/building-dashboards/.meta/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/.agents/skills/building-dashboards/README.md b/.agents/skills/building-dashboards/README.md new file mode 100644 index 00000000..5cfd110e --- /dev/null +++ b/.agents/skills/building-dashboards/README.md @@ -0,0 +1,91 @@ +# building-dashboards + +Designs and builds Axiom dashboards via API. Covers chart types, APL patterns, SmartFilters, layout, and configuration options. + +## What It Does + +- **Dashboard Design** - Blueprint structure: at-a-glance stats, trends, breakdowns, evidence +- **Chart Types** - Statistic, TimeSeries, Table, Pie, LogStream, Heatmap, SmartFilter, Note +- **APL + Metrics/MPL Patterns** - Golden signals, percentiles, error rates, and metrics chart queries via `query.apl` +- **Layout Composition** - Grid-based layouts with section templates +- **Deployment** - Scripts to validate, create, update, and manage dashboards + +## Installation + +```bash +npx skills add axiomhq/skills +``` + +## Prerequisites + +- `axiom-sre` skill (for API access and schema discovery) +- `query-metrics` skill (for metrics dataset/metric/tag discovery; also vendored locally in `scripts/metrics/`) +- Tools: `jq`, `curl` + +The install command above includes all skill dependencies. + +## Configuration + +Create `~/.axiom.toml` with your Axiom deployment(s): + +```toml +[deployments.prod] +url = "https://api.axiom.co" +token = "xaat-your-api-token" +org_id = "your-org-id" +``` + +- **`org_id`** - The organization ID. Get it from Settings → Organization. +- **`token`** - Use an advanced API token with minimal privileges. + +**Tip:** Run `scripts/setup` from the `axiom-sre` skill for interactive configuration. 
+ +## Usage + +```bash +# Setup and check requirements +scripts/setup + +# Create dashboard from template +scripts/dashboard-from-template service-overview "my-service" "my-dataset" ./dashboard.json + +# Validate dashboard JSON +scripts/dashboard-validate ./dashboard.json + +# Deploy dashboard +scripts/dashboard-create ./dashboard.json + +# List, update, delete +scripts/dashboard-list +scripts/dashboard-update +scripts/dashboard-chart-patch --version +scripts/dashboard-delete +``` + +## Scripts + +| Script | Purpose | +|--------|---------| +| `dashboard-create` | Deploy new dashboard | +| `dashboard-validate` | Validate JSON structure | +| `dashboard-list` | List all dashboards | +| `dashboard-get` | Fetch dashboard JSON | +| `dashboard-update` | Update existing dashboard | +| `dashboard-chart-patch` | Patch one chart in an existing dashboard | +| `dashboard-copy` | Clone a dashboard | +| `dashboard-delete` | Delete with confirmation | +| `dashboard-from-template` | Generate from template | + +## Templates + +Pre-built templates in `reference/templates/`: +- `service-overview.json` - Single service oncall dashboard +- `service-overview-with-filters.json` - With SmartFilter dropdowns +- `api-health.json` - HTTP API health dashboard +- `blank.json` - Minimal skeleton + +## Related Skills + +- `axiom-sre` - Schema discovery and query exploration +- `query-metrics` - Discover metric names, tags, and tag values for MPL queries +- `spl-to-apl` - Translate Splunk dashboards to Axiom diff --git a/.agents/skills/building-dashboards/SKILL.md b/.agents/skills/building-dashboards/SKILL.md new file mode 100644 index 00000000..83dbb32a --- /dev/null +++ b/.agents/skills/building-dashboards/SKILL.md @@ -0,0 +1,655 @@ +--- +name: building-dashboards +description: Designs and builds Axiom dashboards via API. Covers chart types, APL and metrics/MPL query patterns, SmartFilters, layout, and configuration options. 
Use when creating dashboards, migrating from Splunk, or configuring chart options. +--- + +# Building Dashboards + +You design dashboards that help humans make decisions quickly. Dashboards are products: audience, questions, and actions matter more than chart count. + +## Philosophy + +1. **Decisions first.** Every panel answers a question that leads to an action. +2. **Overview → drilldown → evidence.** Start broad, narrow on click/filter, end with raw logs. +3. **Rates and percentiles over averages.** Averages hide problems; p95/p99 expose them. +4. **Simple beats dense.** One question per panel. No chart junk. +5. **Validate with data.** Never guess fields—discover schema first. + +--- + +## Entry Points + +Choose your starting point: + +| Starting from | Workflow | +|---------------|----------| +| **Vague description** | Intake → check dataset kind → design blueprint (APL or MPL) → queries per panel → deploy | +| **Template** | Pick template → customize dataset/service/env → deploy | +| **Splunk dashboard** | Extract SPL → translate via spl-to-apl → map to chart types → deploy | +| **Exploration** | Use axiom-sre to discover schema/signals → productize into panels | + +--- + +## Intake: What to Ask First + +Before designing, clarify: + +1. **Audience & decision** + - Oncall triage? (fast refresh, error-focused) + - Team health? (daily trends, SLO tracking) + - Exec reporting? (weekly summaries, high-level) + +2. **Scope** + - Service, environment, region, cluster, endpoint? + - Single service or cross-service view? + +3. **Dataset kind (mandatory first step)** + - Run `scripts/metrics/datasets ` to identify each dataset's `kind` + - **If `kind` is `otel:metrics:v1`** → this is a metrics dataset. Follow the **Metrics path** below. + - **Otherwise** → this is an events/logs dataset. Follow the **APL path** below. 
+ + > **⚠️ NEVER run `getschema` on a metrics dataset.** APL queries against `otel:metrics:v1` datasets return 0 rows without error — you will waste calls widening time ranges before realizing it's the wrong discovery method. + + **APL path** (events/logs datasets): + - Discover fields with `getschema`: + ```apl + ['dataset'] | where _time between (ago(1h) .. now()) | getschema + ``` + - Continue to steps 4–5 below. + + **Metrics path** (`otel:metrics:v1` datasets): + - Run `scripts/metrics/metrics-spec ` — **mandatory before composing any MPL query** + - Discover available metrics: `scripts/metrics/metrics-info metrics` + - Discover tags: `scripts/metrics/metrics-info tags` + - Explore tag values: `scripts/metrics/metrics-info tags values` + - If discovery returns empty results, retry with `--start` set to 7 days ago — sparse metrics (sensors, batch jobs, crons) may not have data in the default 24h window + - `find-metrics ` searches **tag values**, not metric names — use it only when you know a specific entity name (service, host, device) to find which metrics are associated with it + - Skip to the **Metrics/MPL Blueprint** below for panel design. + +4. **Golden signals** (APL path) + - Traffic: requests/sec, events/min + - Errors: error rate, 5xx count + - Latency: p50, p95, p99 duration + - Saturation: CPU, memory, queue depth, connections + +5. **Drilldown dimensions** (APL path) + - What do users filter/group by? (service, route, status, pod, customer_id) + +--- + +## Dashboard Blueprint + +Choose the blueprint that matches your dataset kind (identified in Intake step 3). + +### APL Blueprint (events/logs datasets) + +#### 1. At-a-Glance (Statistic panels) +Single numbers that answer "is it broken right now?" +- Error rate (last 5m) +- p95 latency (last 5m) +- Request rate (last 5m) +- Active alerts (if applicable) + +#### 2. Trends (TimeSeries panels) +Time-based patterns that answer "what changed?" 
+- Traffic over time +- Error rate over time +- Latency percentiles over time +- Stacked by status/service for comparison + +#### 3. Breakdowns (Table/Pie panels) +Top-N analysis that answers "where should I look?" +- Top 10 failing routes +- Top 10 error messages +- Worst pods by error rate +- Request distribution by status + +#### 4. Evidence (LogStream + SmartFilter) +Raw events that answer "what exactly happened?" +- LogStream filtered to errors +- SmartFilter for service/env/route +- Key fields projected for readability + +### Metrics/MPL Blueprint (metrics datasets) + +> **Prerequisite:** You MUST have run `scripts/metrics/metrics-spec` and `scripts/metrics/metrics-info` before designing panels. Never guess MPL syntax or metric/tag names. + +> **🚨 ALIGNMENT RULE — non-negotiable for dashboard panels:** Always align to the dashboard-supplied variable `$__interval`, not a fixed window. The dashboard runtime substitutes `$__interval` based on the active time range and panel width, so the same chart stays usable from a 5-minute to a 30-day view. Hard-coding `align to 1m` (or any constant) over-resolves long ranges and under-resolves short ones. +> +> ```mpl +> | align to $__interval using avg ✅ dashboard panels +> | align to 1m using avg ❌ fixed window — wrong granularity at most time ranges +> ``` +> +> **No `param` declaration needed in the chart `query.apl`** — the dashboard runtime injects `param $__interval: Duration;` automatically. (The Grafana datasource does the same via a preamble; the Axiom-native dashboard runtime behaves identically — verified against working production dashboards.) +> +> **Exceptions:** If you are pre-validating a query through `scripts/metrics/metrics-query` (which has no dashboard runtime), substitute a concrete duration for the test call only — do NOT commit that to the chart JSON. For genuinely sparse metrics where `$__interval` would round to an empty bucket (sensors, batch jobs, crons), a fixed wider window (e.g. 
`1h`) is acceptable; document why in the chart description. + +#### 1. At-a-Glance (Statistic panels) +Current values for key metrics — answer "what's the state right now?" +- Latest value of primary metrics (e.g., current temperature, power draw) +- Use `group using avg` or `group using last` depending on metric type (gauge vs counter) + +#### 2. Trends (TimeSeries panels) +Metric trends over time — answer "what changed?" +- Primary metrics over time, grouped by key dimension +- Use `align to $__interval using avg|sum|last` for proper time bucketing — `$__interval` is supplied by the dashboard runtime +- Group by low-cardinality tags only (≤10 series per chart) + +#### 3. Breakdowns (TimeSeries or Table panels) +Per-entity detail — answer "where should I look?" +- Metrics broken down by entity (room, host, pod, service) +- Filter by tag values to keep series count manageable +- Use separate panels per dimension rather than one overloaded chart + +#### 4. Entity State (TimeSeries or Table panels) +Boolean/state metrics — answer "what is on/off/active?" +- Use `align to $__interval using last` for state metrics +- Sparse metrics may need wider **fixed** align intervals (1h+) to show data — this is the documented exception to the `$__interval` rule + +--- + +## Layout Auto-Normalization + +The console uses `react-grid-layout` which requires `minH`, `minW`, `moved`, and `static` on every layout entry. The `dashboard-create` and `dashboard-update` scripts auto-fill these if omitted, so layout entries only need `i`, `x`, `y`, `w`, `h`. + +--- + +## Required Chart Structure + +**Every chart MUST have a unique `id` field.** Every layout entry's `i` field MUST reference a chart `id`. Missing or mismatched IDs will corrupt the dashboard in the UI (blank state, unable to save/revert). + +```json +{ + "charts": [ + { + "id": "error-rate", + "name": "Error Rate", + "type": "Statistic", + "query": { "apl": "..." 
} + } + ], + "layout": [ + {"i": "error-rate", "x": 0, "y": 0, "w": 3, "h": 2} + ] +} +``` + +Use descriptive kebab-case IDs (e.g. `error-rate`, `p95-latency`, `traffic-rps`). The `dashboard-validate` and deploy scripts enforce this automatically. + +--- + +## Metrics/MPL Chart Contract + +Metrics-backed charts require both `query.apl` (the MPL pipeline string) and `query.metricsDataset` (the dataset name). The `metricsDataset` field is what tells the backend to interpret `apl` as MPL rather than APL — omitting it causes the chart to misbehave even if the pipeline string is well-formed. + +> **CRITICAL:** Run `scripts/metrics/metrics-spec ` before composing your first MPL query in a session. NEVER guess MPL syntax. +> +> **API gotcha:** Set `query.metricsDataset` to the dataset name (e.g. `"otel-metrics"`). The create API rejects `query.mpl` even though GET responses for existing metrics dashboards may include it — put the MPL string in `query.apl` instead. + +```json +{ + "type": "TimeSeries", + "query": { + "apl": "`otel-metrics`:`http.server.duration`\n| where `service.name` == \"api\"\n| align to $__interval using avg\n| group by `service.name` using avg", + "metricsDataset": "otel-metrics" + } +} +``` + +Validate queries with `scripts/metrics/metrics-query` before embedding in dashboard JSON. + +See `reference/metrics-mpl.md` for the full contract and discovery scripts. + +--- + +## Chart Types + +**Note:** Dashboard queries inherit time from the UI picker—no explicit `_time` filter needed. + +**Validation:** TimeSeries, Statistic, Table, Pie, LogStream, Note, MonitorList are fully validated by `dashboard-validate`. Heatmap, ScatterPlot, SmartFilter work but may trigger warnings. + +### Statistic +**When:** Single KPI, current value, threshold comparison. 
+ +```apl +['logs'] +| where service == "api" +| summarize + total = count(), + errors = countif(status >= 500) +| extend error_rate = round(100.0 * errors / total, 2) +| project error_rate +``` + +**Pitfalls:** Don't use for time series; ensure query returns single row. + +### TimeSeries +**When:** Trends over time, before/after comparison, rate changes. + +```apl +// Single metric - use bin_auto for automatic sizing +['logs'] +| summarize ['req/min'] = count() by bin_auto(_time) + +// Latency percentiles - use percentiles_array for proper overlay +['logs'] +| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time) +``` + +**Best practices:** +- Use `bin_auto(_time)` instead of fixed `bin(_time, 1m)` — auto-adjusts to time window +- Use `percentiles_array()` instead of multiple `percentile()` calls — renders as one chart +- Too many series = unreadable; use `top N` or filter + +### Table +**When:** Top-N lists, detailed breakdowns, exportable data. + +```apl +['logs'] +| where status >= 500 +| summarize errors = count() by route, error_message +| top 10 by errors +| project route, error_message, errors +``` + +**Pitfalls:** +- Always use `top N` to prevent unbounded results +- Use `project` to control column order and names + +### Pie +**When:** Share-of-total for LOW cardinality dimensions (≤6 slices). + +```apl +['logs'] +| summarize count() by status_class = case( + status < 300, "2xx", + status < 400, "3xx", + status < 500, "4xx", + "5xx" + ) +``` + +**Pitfalls:** +- Never use for high cardinality (routes, user IDs) +- Prefer tables for >6 categories +- Always aggregate to reduce slices + +### LogStream +**When:** Raw event inspection, debugging, evidence gathering. 
+ +```apl +['logs'] +| where service == "api" and status >= 500 +| project-keep _time, trace_id, route, status, error_message, duration_ms +| take 100 +``` + +**Pitfalls:** +- Always include `take N` (100-500 max) +- Use `project-keep` to show relevant fields only +- Filter aggressively—raw logs are expensive + +### Heatmap +**When:** Distribution visualization, latency patterns, density analysis. + +```apl +['logs'] +| summarize histogram(duration_ms, 15) by bin_auto(_time) +``` + +**Best for:** Latency distributions, response time patterns, identifying outliers. + +### Scatter Plot +**When:** Correlation between two metrics, identifying patterns. + +```apl +['logs'] +| summarize avg(duration_ms), avg(resp_size_bytes) by route +``` + +**Best for:** Response size vs latency correlation, resource usage patterns. + +### SmartFilter (Filter Bar) +**When:** Interactive filtering for the entire dashboard. + +SmartFilter is a **chart type** that creates dropdown/search filters. Requires: +1. A `SmartFilter` chart with filter definitions +2. `declare query_parameters` in each panel query + +**Filter types:** +- `selectType: "apl"` — Dynamic dropdown from APL query +- `selectType: "list"` — Static dropdown with predefined options +- `type: "search"` — Free-text input + +**Panel query pattern:** +```apl +declare query_parameters (country_filter:string = ""); +['logs'] | where isempty(country_filter) or ['geo.country'] == country_filter +``` + +See `reference/smartfilter.md` for full JSON structure and cascading filter examples. + +### Monitor List +**When:** Display monitor status on operational dashboards. + +No APL needed—select monitors from the UI. Shows: +- Monitor status (normal/triggered/off) +- Run history (green/red squares) +- Dataset, type, notifiers + +### Note +**When:** Context, instructions, section headers. 
+ +Use GitHub Flavored Markdown for: +- Dashboard purpose and audience +- Runbook links +- Section dividers +- On-call instructions + +--- + +## Chart Configuration + +Charts support JSON configuration options beyond the query. See `reference/chart-config.md` for full details. + +**Quick reference:** + +| Chart Type | Key Options | +|------------|-------------| +| Statistic | `colorScheme`, `customUnits`, `unit`, `showChart` (sparkline), `errorThreshold`/`warningThreshold` | +| TimeSeries | `aggChartOpts`: `variant` (line/area/bars), `scaleDistr` (linear/log), `displayNull` | +| LogStream/Table | `tableSettings`: `columns`, `fontSize`, `highlightSeverity`, `wrapLines` | +| Pie | `hideHeader` | +| Note | `text` (markdown), `variant` | + +**Common options (all charts):** +- `overrideDashboardTimeRange`: boolean +- `overrideDashboardCompareAgainst`: boolean +- `hideHeader`: boolean + +--- + +## APL Patterns + +### Time Filtering in Dashboards vs Ad-hoc Queries + +**Dashboard panel queries do NOT need explicit time filters.** The dashboard UI time picker automatically scopes all queries to the selected time window. + +```apl +// DASHBOARD QUERY — no time filter needed +['logs'] +| where service == "api" +| summarize count() by bin_auto(_time) +``` + +**Ad-hoc queries (Axiom Query tab, axiom-sre exploration) MUST have explicit time filters:** + +```apl +// AD-HOC QUERY — always include time filter +['logs'] +| where _time between (ago(1h) .. now()) +| where service == "api" +| summarize count() by bin_auto(_time) +``` + +### Bin Size Selection + +**Prefer `bin_auto(_time)`** — it automatically adjusts to the dashboard time window. 
+ +Manual bin sizes (only when auto doesn't fit your needs): + +| Time window | Bin size | +|-------------|----------| +| 15m | 10s–30s | +| 1h | 1m | +| 6h | 5m | +| 24h | 15m–1h | +| 7d | 1h–6h | + +### Cardinality Guardrails +Prevent query explosion: + +```apl +// GOOD: bounded +| summarize count() by route | top 10 by count_ + +// BAD: unbounded high-cardinality grouping +| summarize count() by user_id // millions of rows +``` + +### Field Escaping +Fields with dots need bracket notation: + +```apl +| where ['kubernetes.pod.name'] == "frontend" +``` + +Fields with dots IN the name (not hierarchy) need escaping: + +```apl +| where ['kubernetes.labels.app\\.kubernetes\\.io/name'] == "frontend" +``` + +### Golden Signal Queries + +**Traffic:** +```apl +| summarize requests = count() by bin_auto(_time) +``` + +**Errors (as rate %):** +```apl +| summarize total = count(), errors = countif(status >= 500) by bin_auto(_time) +| extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0) +| project _time, error_rate +``` + +**Latency (use percentiles_array for proper chart overlay):** +```apl +| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time) +``` + +--- + +## Layout Composition + +### Grid Principles +- Dashboard width = 12 units +- Typical panel: w=3 (quarter), w=4 (third), w=6 (half), w=12 (full) +- Stats row: 4 panels × w=3, h=2 +- TimeSeries row: 2 panels × w=6, h=4 +- Tables: w=6 or w=12, h=4–6 +- LogStream: w=12, h=6–8 + +### Section Layout Pattern + +``` +Row 0-1: [Stat w=3] [Stat w=3] [Stat w=3] [Stat w=3] +Row 2-5: [TimeSeries w=6, h=4] [TimeSeries w=6, h=4] +Row 6-9: [Table w=6, h=4] [Pie w=6, h=4] +Row 10+: [LogStream w=12, h=6] +``` + +### Naming Conventions +- Use question-style titles: "Error rate by route" not "Errors" +- Prefix with context if multi-service: "[API] Error rate" +- Include units: "Latency (ms)", "Traffic (req/s)" + +--- + +## Dashboard Settings + +### Refresh Rate +Dashboard auto-refreshes at 
the configured interval. Options: 15s, 30s, 1m, 5m, etc. + +**⚠️ Query cost warning:** Short refresh (15s) + long time range (90d) = expensive queries running constantly. + +Recommendations: +| Use case | Refresh rate | +|----------|-------------| +| Oncall/real-time | 15s–30s | +| Team health | 1m–5m | +| Executive/weekly | 5m–15m | + +### Sharing +All dashboards created via API tokens are shared with everyone in the org (`owner: "X-AXIOM-EVERYONE"`). Private dashboards are not supported with API tokens. + +Data visibility is still governed by dataset permissions: users only see data from datasets they can access. + +### URL Time Range Parameters + +`?t_qr=24h` (quick range), `?t_ts=...&t_te=...` (custom), `?t_against=-1d` (comparison) + +--- + +## Setup + +Run `scripts/setup` to check requirements (curl, jq, ~/.axiom.toml). + +Config in `~/.axiom.toml` (shared with axiom-sre): +```toml +[deployments.prod] +url = "https://api.axiom.co" +token = "xaat-your-token" +org_id = "your-org-id" +``` + +--- + +## Deployment + +### Scripts + +| Script | Usage | +|--------|-------| +| `scripts/dashboard-list ` | List all dashboards | +| `scripts/dashboard-get ` | Fetch dashboard JSON | +| `scripts/dashboard-validate ` | Validate JSON structure | +| `scripts/dashboard-create ` | Create dashboard | +| `scripts/dashboard-update ` | Update (needs version) | +| `scripts/dashboard-chart-patch (--version \| --overwrite)` | Patch one chart | +| `scripts/dashboard-copy ` | Clone dashboard | +| `scripts/dashboard-link ` | Get shareable URL | +| `scripts/dashboard-delete ` | Delete (with confirm) | +| `scripts/axiom-api ` | **Dashboard/app API only** (rewrites to `app.*`). 
For data/metrics endpoints use `scripts/metrics/axiom-api` | +| `scripts/metrics/axiom-api ` | **Data/metrics API** (supports `AXIOM_URL_OVERRIDE` for edge routing) | +| `scripts/metrics/datasets ` | List datasets with `kind` and edge deployment | +| `scripts/metrics/metrics-spec ` | Fetch MPL query specification | +| `scripts/metrics/metrics-info ...` | Discover metrics, tags, and values | +| `scripts/metrics/metrics-query ` | Execute a metrics query | + +> **⚠️ Two `axiom-api` scripts exist with different behaviors.** `scripts/axiom-api` rewrites URLs for the dashboard app API (`app.*`). `scripts/metrics/axiom-api` uses raw URLs and supports edge deployment routing. Using the wrong one will produce 404 errors. + +### Targeted Chart Updates + +Use `scripts/dashboard-chart-patch` when changing one existing chart and the dashboard layout, metadata, and other charts should remain untouched. It calls `PATCH /v2/dashboards/uid/{uid}/charts/{chartId}` with a JSON Merge Patch under the `chart` request field. + +Patch files contain only the chart fields to change: + +```json +{ + "name": "Error Rate (5m)", + "query": { "apl": "['logs'] | summarize errors=countif(status >= 500)" }, + "config": { "stale": null } +} +``` + +`null` removes an existing field. Nested objects merge recursively. If `id` is present in the patch, it must match the `` path argument. The server validates the resulting full dashboard before saving. + +Use `--version ` for optimistic concurrency after fetching the dashboard with `dashboard-get`. Use `--overwrite` only when last-write-wins behavior is intended. Continue using `dashboard-update` for layout changes, multi-chart edits, dashboard metadata, owner, refresh interval, or time window updates. + +### Workflow + +**⚠️ CRITICAL: Always validate queries BEFORE deploying.** + +**APL workflow:** +1. Design dashboard (sections + panels) +2. Write APL for each panel +3. Build JSON (from template or manually) +4. 
**Validate queries** using axiom-sre with explicit time filter +5. `dashboard-validate` to check structure +6. `dashboard-create` or `dashboard-update` to deploy +7. **`dashboard-link` to get URL** — NEVER construct Axiom URLs manually (org IDs and base URLs vary per deployment) +8. Share link with user + +**Metrics/MPL workflow:** +1. Run `scripts/metrics/metrics-spec` to learn MPL syntax +2. Run `scripts/metrics/metrics-info` to discover metrics and tags +3. Design dashboard using the Metrics/MPL Blueprint +4. Write MPL for each panel +5. **Validate queries** with `scripts/metrics/metrics-query` using explicit time range +6. Build JSON: put the full MPL string in `query.apl` AND set `query.metricsDataset` to the dataset name (required — denotes the chart as MPL). Do not set `query.mpl` (rejected by create API). +7. `dashboard-validate` to check structure +8. `dashboard-create` or `dashboard-update` to deploy +9. **`dashboard-link` to get URL** +10. Share link with user + +--- + +## Sibling Skill Integration + +**spl-to-apl:** Translate Splunk SPL → APL. Map `timechart` → TimeSeries, `stats` → Statistic/Table. See `reference/splunk-migration.md`. + +**axiom-sre:** Discover schema with `getschema`, explore baselines, identify dimensions, then productize into panels. + +**query-metrics:** Discover metrics datasets, metric names, tags, and tag values. Metrics discovery scripts are also vendored locally in `scripts/metrics/`. 
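Step 6 of the Metrics/MPL workflow and the chart `id` rules from "Required Chart Structure" can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of `dashboard-validate`: the helper names `metrics_chart` and `check_ids` are hypothetical, and the dashboard dict only contains the `charts` and `layout` keys shown earlier in this file.

```python
import json

# MPL pipeline taken from the chart contract example; $__interval is left
# for the dashboard runtime to substitute.
MPL = "\n".join([
    "`otel-metrics`:`http.server.duration`",
    '| where `service.name` == "api"',
    "| align to $__interval using avg",
    "| group by `service.name` using avg",
])


def metrics_chart(chart_id: str, name: str, dataset: str, mpl: str) -> dict:
    """Build a TimeSeries chart that follows the Metrics/MPL chart contract."""
    return {
        "id": chart_id,
        "name": name,
        "type": "TimeSeries",
        "query": {
            "apl": mpl,                 # MPL string goes in query.apl, never query.mpl
            "metricsDataset": dataset,  # marks the query as MPL rather than APL
        },
    }


def check_ids(dashboard: dict) -> list:
    """Return a list of id/layout problems (empty when the rules hold)."""
    problems = []
    ids = [chart.get("id") for chart in dashboard.get("charts", [])]
    if any(not chart_id for chart_id in ids):
        problems.append("every chart needs a non-empty id")
    duplicates = sorted({i for i in ids if i and ids.count(i) > 1})
    if duplicates:
        problems.append(f"duplicate chart ids: {duplicates}")
    for entry in dashboard.get("layout", []):
        if entry.get("i") not in ids:
            problems.append(f"layout entry {entry.get('i')!r} has no matching chart")
    return problems


chart = metrics_chart("api-latency", "API latency (ms)", "otel-metrics", MPL)
dashboard = {
    "charts": [chart],
    "layout": [{"i": "api-latency", "x": 0, "y": 0, "w": 6, "h": 4}],
}
print(check_ids(dashboard))
print(json.dumps(dashboard, indent=2))
```

A local check like this is only a convenience before running `dashboard-validate`; the script and the server remain the authority on what is accepted.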
+ +--- + +## Templates + +Pre-built templates in `reference/templates/`: + +| Template | Use case | +|----------|----------| +| `service-overview.json` | Single service oncall dashboard with Heatmap | +| `service-overview-with-filters.json` | Same with SmartFilter (route/status dropdowns) | +| `api-health.json` | HTTP API with traffic/errors/latency | +| `blank.json` | Minimal skeleton | + +**Placeholders:** `{{service}}`, `{{dataset}}` + +**Usage:** +```bash +scripts/dashboard-from-template service-overview "my-service" "my-dataset" ./dashboard.json +scripts/dashboard-validate ./dashboard.json +scripts/dashboard-create prod ./dashboard.json +``` + +**⚠️ Templates assume field names** (`service`, `status`, `route`, `duration_ms`). Discover your schema first and use `sed` to fix mismatches. + +--- + +## Common Pitfalls + +| Problem | Cause | Solution | +|---------|-------|----------| +| "unable to find dataset" errors | Dataset name doesn't exist in your org | Check available datasets in Axiom UI | +| "creating private dashboards" 403 | API tokens can only create shared dashboards | Use `owner: "X-AXIOM-EVERYONE"` (the default) | +| All panels show errors | Field names don't match your schema | Discover schema first, use sed to fix field names | +| Dashboard shows no data | Service filter too restrictive | Remove or adjust `where service == 'x'` filters | +| Queries time out | Missing time filter or too broad | Dashboard inherits time from picker; ad-hoc queries need explicit time filter | +| Wrong org in dashboard URL | Manually constructed URL | **Always use `dashboard-link `** — never guess org IDs or base URLs | +| `getschema` returns 0 rows | Dataset is `otel:metrics:v1`, not events | Run `scripts/metrics/datasets ` to check kind; use `scripts/metrics/metrics-info` for metrics discovery | +| Metrics discovery returns empty | Sparse metrics (sensors, batch, cron) outside default 24h window | Retry with `--start` set to 7 days ago; some metrics only report 
intermittently | +| 404 from metrics API calls | Used `scripts/axiom-api` (dashboard) instead of `scripts/metrics/axiom-api` (data) | Use `scripts/metrics/axiom-api` for all `/v1/query/`, `/v1/datasets` paths | +| `find-metrics` returns unexpected results | It searches tag values, not metric names | Use `metrics-info metrics` to list metric names; `find-metrics` finds metrics associated with a known tag value | +| Metrics chart renders blank or wrong values | Missing `query.metricsDataset` — backend treats `apl` as APL, not MPL | Set `query.metricsDataset` to the dataset name alongside `query.apl` | +| `query.mpl` rejected on create | GET may return `query.mpl` for existing metrics charts, but create expects `query.apl` | Move/copy the MPL string into `query.apl` before deploy | +| `decimals` rejected on create | Create API does not accept chart-level `decimals` even though GET may return it | Omit `decimals` from create payloads | + +--- + +## Reference + +- `reference/chart-config.md` — All chart configuration options (JSON) +- `reference/metrics-mpl.md` — Metrics/MPL chart contract and discovery scripts +- `reference/smartfilter.md` — SmartFilter/FilterBar full configuration +- `reference/chart-cookbook.md` — APL patterns per chart type +- `reference/layout-recipes.md` — Grid layouts and section blueprints +- `reference/splunk-migration.md` — Splunk panel → Axiom mapping +- `reference/design-playbook.md` — Decision-first design principles +- `reference/templates/` — Ready-to-use dashboard JSON files + +For APL syntax: https://axiom.co/docs/apl/introduction diff --git a/.agents/skills/building-dashboards/reference/chart-config.md b/.agents/skills/building-dashboards/reference/chart-config.md new file mode 100644 index 00000000..30974821 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/chart-config.md @@ -0,0 +1,285 @@ +# Chart Configuration Options + +Charts support JSON configuration options beyond the query. These are set at the chart level. 
+ +## Common Options (All Charts) + +```json +{ + "overrideDashboardTimeRange": false, + "overrideDashboardCompareAgainst": false, + "hideHeader": false +} +``` + +## Metrics/MPL Query (MetricsDB Charts) + +Metrics charts require both `query.apl` (the MPL pipeline string) and `query.metricsDataset` (the dataset name, e.g. `"otel-metrics"`). The `metricsDataset` field is what flags the chart as MPL; without it the backend treats `apl` as APL and the chart misbehaves. Do not send `query.mpl` — the create API rejects it. Run `scripts/metrics/metrics-spec` to learn the full syntax before composing queries. + +### Minimal Metrics Query + +```json +{ + "type": "TimeSeries", + "query": { + "apl": "`otel-metrics`:`system.cpu.utilization`", + "metricsDataset": "otel-metrics" + } +} +``` + +### Metrics Query with Filters and Transformations + +```json +{ + "type": "TimeSeries", + "query": { + "apl": "`otel-metrics`:`http.server.duration`\n| where `service.name` == \"api\"\n| where `deployment.environment` == \"prod\"\n| align to $__interval using avg\n| group by `service.name` using avg", + "metricsDataset": "otel-metrics" + } +} +``` + +For full contract details, see `reference/metrics-mpl.md`. + +## Statistic Options + +```json +{ + "type": "Statistic", + "colorScheme": "Blue", + "customUnits": "req/s", + "unit": "Auto", + "showChart": true, + "hideValue": false, + "errorThreshold": "Above", + "errorThresholdValue": "100", + "warningThreshold": "Above", + "warningThresholdValue": "50", + "invertTheme": false +} +``` + +> **API gotcha:** `decimals` is returned by GET and may appear in existing dashboards, but the create API rejects it. Omit `decimals` from create payloads. 
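When a dashboard fetched with GET is reused as a create payload, the two rejections noted in this file (`decimals` and `query.mpl`) can be handled in one pass. The Python sketch below is a hedged illustration: `prepare_for_create` is a hypothetical helper, the payload shape (a top-level `charts` list) is assumed from the examples in this skill, and keeping an existing `query.apl` over `query.mpl` when both are present is a choice made here, not documented behavior.

```python
import copy


def prepare_for_create(dashboard: dict) -> dict:
    """Return a copy of a fetched dashboard without fields the create API rejects."""
    cleaned = copy.deepcopy(dashboard)  # leave the fetched payload untouched
    for chart in cleaned.get("charts", []):
        # Chart-level decimals may appear in GET responses but is rejected on create.
        chart.pop("decimals", None)
        query = chart.get("query")
        if isinstance(query, dict) and "mpl" in query:
            # Move the MPL string into query.apl; keep an existing apl if present.
            mpl = query.pop("mpl")
            query.setdefault("apl", mpl)
    return cleaned


# Hypothetical GET payload fragment used only to exercise the helper.
fetched = {
    "charts": [
        {
            "id": "cpu",
            "type": "Statistic",
            "decimals": 2,
            "query": {
                "mpl": "`otel-metrics`:`system.cpu.utilization`",
                "metricsDataset": "otel-metrics",
            },
        }
    ]
}

payload = prepare_for_create(fetched)
print(payload["charts"][0])
```

Working on a deep copy keeps the original GET response available for comparison if the create call is rejected for another reason.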
+ +| Option | Values | Description | +|--------|--------|-------------| +| `colorScheme` | Blue, Orange, Red, Purple, Teal, Yellow, Green, Pink, Grey, Brown | Color theme | +| `customUnits` | string | Unit suffix (e.g., "ms", "req/s") | +| `unit` | Auto, Abbreviated, Byte, KB, MB, GB, TimeMS, TimeSec, Percent, etc. | Value formatting | +| `decimals` | number | Decimal places in readback/GET payloads; omit on create because the API rejects it | +| `showChart` | boolean | Show sparkline | +| `hideValue` | boolean | Hide the main value | +| `errorThreshold` | Above, AboveOrEqual, Below, BelowOrEqual, AboveOrBelow | Error condition | +| `errorThresholdValue` | string | Error threshold value | +| `warningThreshold` | same as error | Warning condition | +| `warningThresholdValue` | string | Warning threshold value | +| `invertTheme` | boolean | Invert colors | + +### Available Units + +- **Numbers**: `Auto`, `Abbreviated` +- **Data**: `Byte`, `Kilobyte`, `Megabyte`, `Gigabyte` +- **Data rates**: `BitsSec`, `BytesSec`, `KilobitsSec`, `KilobytesSec`, `MegabitsSec`, `MegabytesSec`, `GigabitsSec`, `GigabytesSec` +- **Time**: `TimeNS`, `TimeUS`, `TimeMS`, `TimeSec`, `TimeMin`, `TimeHour`, `TimeDay` +- **Percent**: `Percent` (0-1), `Percent100` (0-100) +- **Currency**: `CurrencyUSD`, `CurrencyEUR`, `CurrencyGBP`, `CurrencyCAD`, `CurrencyAUD`, `CurrencyJPY`, `CurrencyINR`, `CurrencyCZK`, `CurrencyPLN` +- **Date**: `DateDateTime`, `DateFromNow`, `DateYYYYMMDDHHmmss` + +## TimeSeries Options + +TimeSeries chart options are stored in `query.queryOptions.aggChartOpts` as a JSON string. + +### Key Formats + +**Important:** The `"*"` wildcard is unreliable. Always use the specific key format derived from your query. 
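Because `aggChartOpts` is a JSON string whose keys are themselves JSON-encoded objects, escaping it by hand is error-prone. A minimal Python sketch of generating it follows; the helper names `series_key` and `agg_chart_opts` are hypothetical, and the compact separators and the `alias`, `field`, `op` key order are assumptions chosen to match the examples in this file rather than a documented requirement.

```python
import json
from typing import Optional


def series_key(alias: str, op: str, field: Optional[str] = None) -> str:
    """Serialize one per-series key compactly, e.g. {"alias":"count_","op":"count"}."""
    key = {"alias": alias}
    if field is not None:
        key["field"] = field  # only present for computed columns
    key["op"] = op
    return json.dumps(key, separators=(",", ":"))


def agg_chart_opts(per_series: dict) -> str:
    """Serialize {series_key: options} into the string stored in aggChartOpts."""
    return json.dumps(per_series, separators=(",", ":"))


opts = agg_chart_opts({series_key("count_", "count"): {"variant": "bars"}})

chart = {
    "type": "TimeSeries",
    "query": {
        "apl": "['logs'] | summarize count() by bin_auto(_time)",
        "queryOptions": {"aggChartOpts": opts},
    },
}

# Dumping the chart escapes the string a second time, which produces the
# heavily backslashed form seen in dashboard JSON files.
print(json.dumps(chart, indent=2))
```

Building the string in two `json.dumps` steps mirrors the two levels of encoding, so each level can be inspected with `json.loads` when a chart option does not take effect.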
+ +#### Deriving the Key + +The key format depends on how the column is computed: + +| Query Pattern | Key Format | +|---------------|------------| +| `summarize count()` | `{"alias":"count_","op":"count"}` | +| `summarize sum(field)` | `{"alias":"sum_field","op":"sum"}` | +| `summarize ['Name'] = sum(field) / 1000` | `{"alias":"Name","field":"field","op":"computed"}` | +| `summarize ['Name'] = round(sum(field), 1)` | `{"alias":"Name","field":"field","op":"computed"}` | + +**Rule:** If the column uses any expression (math, `round()`, etc.), use `"op":"computed"` and include the source `"field"`. + +#### Simple Aggregation Example + +```json +{ + "type": "TimeSeries", + "query": { + "apl": "['logs'] | summarize count() by bin_auto(_time)", + "queryOptions": { + "aggChartOpts": "{\"{\\\"alias\\\":\\\"count_\\\",\\\"op\\\":\\\"count\\\"}\":{\"variant\":\"bars\"}}" + } + } +} +``` + +#### Computed Column Example + +For `['Ingest GB'] = round(sum(['properties.hourly_ingest_bytes']) / 1e9, 1)`: + +```json +{ + "aggChartOpts": "{\"{\\\"alias\\\":\\\"Ingest GB\\\",\\\"field\\\":\\\"properties.hourly_ingest_bytes\\\",\\\"op\\\":\\\"computed\\\"}\":{\"variant\":\"bars\",\"displayNull\":\"auto\"}}" +} +``` + +**Note:** The `field` value is the source field name with the `['...']` brackets removed. Keep any dotted prefix such as `properties.` exactly as written in the query. + +### View Mode (timeSeriesView) + +Controls what the TimeSeries panel displays. Set in `query.queryOptions.timeSeriesView`. 
+ +| Value | Description | +|-------|-------------| +| `charts` | Chart only (default) | +| `resultsTable` | Summary totals table only | +| `charts\|resultsTable` | Chart with totals table below — shows both the time series and an aggregated summary | + +```json +{ + "type": "TimeSeries", + "query": { + "apl": "['logs'] | summarize count() by bin_auto(_time), service", + "queryOptions": { + "timeSeriesView": "charts|resultsTable" + } + } +} +``` + +### Per-Series Options (inside aggChartOpts) + +| Option | Values | Description | +|--------|--------|-------------| +| `variant` | `line`, `area`, `bars` | Chart display mode | +| `scaleDistr` | `linear`, `log` | Y-axis scale | +| `displayNull` | `auto`, `null`, `span`, `zero` | Missing data handling | + +### displayNull Values + +- `auto`: Best representation based on chart type +- `null`: Skip/ignore missing values (gaps in chart) +- `span`: Join adjacent values across gaps +- `zero`: Fill missing with zeros + +## LogStream / Table Options + +```json +{ + "type": "LogStream", + "tableSettings": { + "columns": [ + {"name": "_time", "width": 150}, + {"name": "message", "width": 400} + ], + "settings": { + "fontSize": "12px", + "highlightSeverity": true, + "showRaw": true, + "showEvent": true, + "showTimestamp": true, + "wrapLines": true, + "hideNulls": true + } + } +} +``` + +| Option | Type | Description | +|--------|------|-------------| +| `columns` | array | Column order and widths (objects with `name` and `width`) | +| `fontSize` | string | Font size (e.g., "12px") | +| `highlightSeverity` | boolean | Color-code by log level | +| `showRaw` | boolean | Show raw JSON | +| `showEvent` | boolean | Show event column | +| `showTimestamp` | boolean | Show timestamp column | +| `wrapLines` | boolean | Wrap long lines | +| `hideNulls` | boolean | Hide null values | + +## Pie Options + +```json +{ + "type": "Pie", + "hideHeader": false +} +``` + +## Note Options + +```json +{ + "type": "Note", + "text": "## Section 
Header\n\nMarkdown content here.", + "variant": "default" +} +``` + +Note content supports GitHub Flavored Markdown. + +## Heatmap Options + +Heatmap charts use the default options. The color scheme is fixed to a blue gradient. + +```json +{ + "type": "Heatmap", + "query": { + "apl": "['logs'] | summarize histogram(duration_ms, 15) by bin_auto(_time)" + } +} +``` + +## Annotations + +Display deployment markers, incidents, or custom events on charts. + +Annotations are managed via the Axiom API `/v2/annotations` endpoint. The `Authorization` header is double-quoted so the shell expands `$AXIOM_TOKEN`: + +```bash +curl -X 'POST' 'https://api.axiom.co/v2/annotations' \ + -H "Authorization: Bearer $AXIOM_TOKEN" \ + -H 'Content-Type: application/json' \ + -d '{ + "time": "2024-03-18T08:39:28.382Z", + "type": "deploy", + "datasets": ["http-logs"], + "title": "Production deployment", + "description": "Deploy v2.1.0", + "url": "https://github.com/org/repo/releases/tag/v2.1.0" + }' +``` + +Or use GitHub Actions: +```yaml +- name: Add annotation + uses: axiomhq/annotation-action@v0.1.0 + with: + axiomToken: ${{ secrets.AXIOM_TOKEN }} + datasets: http-logs + type: "deploy" + title: "Production deployment" +``` + +## Comparison Period (Against) + +Compare current time range against a historical period: +- `-1D`: Same time yesterday +- `-1W`: Same time last week +- Custom offset + +Use in dashboard URL: `?t_qr=24h&t_against=-1d` + +## Custom Time Range per Panel + +Individual panels can override the dashboard time range: +- Set `overrideDashboardTimeRange: true` in chart config +- Via UI: Edit panel → Time range → Custom diff --git a/.agents/skills/building-dashboards/reference/chart-cookbook.md b/.agents/skills/building-dashboards/reference/chart-cookbook.md new file mode 100644 index 00000000..a91d283a --- /dev/null +++ b/.agents/skills/building-dashboards/reference/chart-cookbook.md @@ -0,0 +1,472 @@ +# Chart Cookbook + +Detailed APL patterns for each chart type with 
real-world examples. 
+ +> **Note:** Dashboard panel queries inherit time from the UI picker—no explicit `_time` filter needed. The examples below show ad-hoc query patterns with time filters for testing in the Query tab. Remove the `where _time between (...)` line when using these in dashboards. + +--- + +## Statistic + +Single-value panels for KPIs and current state. + +### Error Rate (Percentage) +```apl +['http-logs'] +| where _time between (ago(5m) .. now()) +| where service == "api-gateway" +| summarize + total = count(), + errors = countif(status >= 500) +| extend error_rate = round(100.0 * errors / total, 2) +| project error_rate +``` + +### Current p95 Latency +```apl +['http-logs'] +| where _time between (ago(5m) .. now()) +| where service == "api-gateway" +| summarize p95 = percentile(duration_ms, 95) +``` + +### Request Rate (per second) +```apl +['http-logs'] +| where _time between (ago(5m) .. now()) +| where service == "api-gateway" +| summarize requests = count() +| extend rps = round(requests / 300.0, 1) // 300 seconds = 5 min +| project rps +``` + +### Active Errors (count) +```apl +['http-logs'] +| where _time between (ago(5m) .. now()) +| where status >= 500 +| summarize error_count = count() +``` + +### Comparison to Baseline +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize + last_5m = countif(_time >= ago(5m) and status >= 500), + prev_55m = countif(_time < ago(5m) and status >= 500) +| extend change_pct = round(100.0 * (last_5m - prev_55m/11) / (prev_55m/11 + 0.001), 1) +| project last_5m, change_pct +``` + +--- + +## TimeSeries + +Time-based trends with proper binning. + +### Traffic Over Time +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| where service == "api-gateway" +| summarize requests = count() by bin(_time, 1m) +``` + +### Error Rate Over Time +```apl +['http-logs'] +| where _time between (ago(1h) .. 
now()) +| where service == "api-gateway" +| summarize + total = count(), + errors = countif(status >= 500) + by bin(_time, 1m) +| extend error_rate = 100.0 * errors / total +| project _time, error_rate +``` + +### Latency Percentiles Over Time +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| where service == "api-gateway" +| summarize + p50 = percentile(duration_ms, 50), + p95 = percentile(duration_ms, 95), + p99 = percentile(duration_ms, 99) + by bin(_time, 1m) +``` + +### Traffic by Status Class (Stacked) +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| extend status_class = case( + status < 300, "2xx", + status < 400, "3xx", + status < 500, "4xx", + "5xx" + ) +| summarize count() by bin(_time, 1m), status_class +``` + +### Multi-Service Comparison +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| where service in ("api-gateway", "auth-service", "payment-service") +| summarize requests = count() by bin(_time, 1m), service +``` + +### Rate of Change (Derivative) +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize requests = count() by bin(_time, 1m) +| order by _time asc +| extend prev = prev(requests) +| extend rate_change = requests - prev +| where isnotnull(prev) +``` + +--- + +## Table + +Top-N breakdowns and detailed lists. + +### Top 10 Failing Routes +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| where status >= 500 +| summarize errors = count() by route +| top 10 by errors +| project Route = route, Errors = errors +``` + +### Top Error Messages +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| where status >= 500 +| summarize count = count() by error_message +| top 10 by count +| project Message = error_message, Count = count +``` + +### Worst Pods by Error Rate +```apl +['http-logs'] +| where _time between (ago(1h) .. 
now()) +| summarize + total = count(), + errors = countif(status >= 500) + by pod = ['kubernetes.pod.name'] +| extend error_rate = round(100.0 * errors / total, 2) +| where total >= 100 // minimum sample size +| top 10 by error_rate +| project Pod = pod, ['Error Rate %'] = error_rate, Total = total, Errors = errors +``` + +### Latency by Route +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize + requests = count(), + p50 = percentile(duration_ms, 50), + p95 = percentile(duration_ms, 95), + p99 = percentile(duration_ms, 99) + by route +| top 10 by p95 +| project Route = route, Requests = requests, ['p50 (ms)'] = p50, ['p95 (ms)'] = p95, ['p99 (ms)'] = p99 +``` + +### Recent Errors with Details +```apl +['http-logs'] +| where _time between (ago(15m) .. now()) +| where status >= 500 +| top 20 by _time +| project Time = _time, Route = route, Status = status, Message = error_message, TraceID = trace_id +``` + +### Customer Impact Summary +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| where status >= 500 +| summarize + errors = count(), + affected_requests = dcount(trace_id) + by customer_id +| top 10 by errors +| project Customer = customer_id, Errors = errors, ['Affected Requests'] = affected_requests +``` + +--- + +## Pie + +Share-of-total for low-cardinality dimensions only. + +### Status Code Distribution +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| extend status_class = case( + status < 300, "2xx Success", + status < 400, "3xx Redirect", + status < 500, "4xx Client Error", + "5xx Server Error" + ) +| summarize count() by status_class +``` + +### Traffic by Region +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize count() by region +| top 6 by count_ // Limit slices +``` + +### Error Types Distribution +```apl +['http-logs'] +| where _time between (ago(1h) .. 
now()) +| where status >= 500 +| extend error_type = case( + status == 500, "Internal Error", + status == 502, "Bad Gateway", + status == 503, "Service Unavailable", + status == 504, "Gateway Timeout", + "Other 5xx" + ) +| summarize count() by error_type +``` + +### Request Method Mix +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize count() by method +``` + +**Warning:** If dimension has >6 values, use a Table instead. + +--- + +## LogStream + +Raw event inspection with focused fields. + +### Recent Errors +```apl +['http-logs'] +| where _time between (ago(15m) .. now()) +| where status >= 500 +| project-keep _time, trace_id, service, route, status, error_message, duration_ms +| order by _time desc +| take 100 +``` + +### Slow Requests +```apl +['http-logs'] +| where _time between (ago(15m) .. now()) +| where duration_ms > 5000 +| project-keep _time, trace_id, service, route, duration_ms, status +| order by duration_ms desc +| take 100 +``` + +### Authentication Failures +```apl +['auth-logs'] +| where _time between (ago(1h) .. now()) +| where event_type == "login_failed" +| project-keep _time, user_id, ip_address, failure_reason, user_agent +| order by _time desc +| take 100 +``` + +### Kubernetes Events +```apl +['k8s-events'] +| where _time between (ago(1h) .. now()) +| where type in ("Warning", "Error") +| project-keep _time, type, reason, ['involvedObject.name'], message +| order by _time desc +| take 100 +``` + +### Filtered by Trace ID +```apl +['http-logs'] +| where _time between (ago(24h) .. 
now()) +| where trace_id == "abc123xyz" +| project-keep _time, service, route, status, duration_ms, error_message +| order by _time asc +``` + +--- + +## SmartFilter + +No APL needed—configure these fields for interactive filtering: + +### Recommended Filter Fields +- `service` — Which service to focus on +- `environment` — prod/staging/dev +- `region` — Geographic region +- `route` — API endpoint +- `status` — HTTP status code +- `customer_id` — For multi-tenant systems +- `kubernetes.namespace` — K8s namespace +- `kubernetes.pod.name` — Specific pod + +### Configuration Tips +- Place SmartFilter at top of dashboard +- Include 3–5 most useful filter dimensions +- Avoid high-cardinality fields as primary filters (trace_id, request_id) + +--- + +## Note + +Markdown panels for context and navigation. + +### Dashboard Header +```markdown +# API Gateway - Oncall Dashboard + +**Purpose:** Quick triage for API-related incidents. + +**Escalation:** If error rate > 5%, page #platform-oncall. + +**Runbook:** [API Incident Response](https://wiki.example.com/api-runbook) +``` + +### Section Divider +```markdown +--- +## Error Analysis +``` + +### Instructions +```markdown +### How to Use This Dashboard + +1. Check the error rate stat (top-left) +2. If elevated, check the "Top Failing Routes" table +3. Click a route to filter logs below +4. Copy trace_id for detailed investigation +``` + +--- + +## Heatmap + +Visualize distributions and density patterns. + +### Latency Distribution Over Time +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize histogram(duration_ms, 20) by bin_auto(_time) +``` + +### Response Size Distribution +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize histogram(resp_body_size_bytes, 15) by bin_auto(_time) +``` + +### Request Rate by Hour of Day +```apl +['http-logs'] +| where _time between (ago(7d) .. 
now()) +| extend hour = hourofday(_time), day = dayofweek(_time) +| summarize count() by hour, day +``` + +--- + +## Scatter Plot + +Identify correlations between metrics. + +### Latency vs Response Size +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize avg(duration_ms), avg(resp_body_size_bytes) by route +``` + +### Request Rate vs Error Rate by Route +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| summarize + requests = count(), + error_rate = round(100.0 * countif(status >= 500) / count(), 2) + by route +| where requests >= 10 +``` + +### CPU vs Memory by Pod +```apl +['metrics'] +| where _time between (ago(1h) .. now()) +| summarize avg(cpu_percent), avg(memory_percent) by pod +``` + +--- + +## Filter Bar + +Interactive filters for dashboard-wide filtering. + +### Dynamic Country Filter Query +```apl +['http-logs'] +| where _time between (ago(1h) .. now()) +| distinct ['geo.country'] +| project key=['geo.country'], value=['geo.country'] +| sort by key asc +``` + +### Panel Using Filters +```apl +declare query_parameters (_country:string = "", _status:string = ""); +['http-logs'] +| where _time between (ago(1h) .. now()) +| where isempty(_country) or ['geo.country'] == _country +| where isempty(_status) or tostring(status) == _status +| summarize count() by bin_auto(_time) +``` + +### Dependent City Filter (depends on country) +```apl +declare query_parameters (_country:string = ""); +['http-logs'] +| where _time between (ago(1h) .. now()) +| where isnotempty(['geo.country']) and isnotempty(['geo.city']) +| where ['geo.country'] == _country +| distinct ['geo.city'] +| project key=['geo.city'], value=['geo.city'] +| sort by key asc +``` + +### Dataset Selector Filter +For multi-dataset dashboards, let users choose which dataset to view: +```apl +declare query_parameters (_dataset:string = "http-logs"); +table(_dataset) +| where _time between (ago(1h) .. 
now()) +| summarize count() by bin_auto(_time) +``` diff --git a/.agents/skills/building-dashboards/reference/design-playbook.md b/.agents/skills/building-dashboards/reference/design-playbook.md new file mode 100644 index 00000000..cb028939 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/design-playbook.md @@ -0,0 +1,182 @@ +# Dashboard Design Playbook + +## Decision-First Design + +Every dashboard exists to help someone make a decision. Before adding panels, answer: + +1. **Who is the audience?** + - Oncall engineer (needs fast triage, error focus) + - Team lead (needs weekly trends, SLO tracking) + - Executive (needs high-level health, business impact) + +2. **What decisions will they make?** + - "Should I page someone?" + - "Which service is causing this?" + - "Are we meeting our SLOs?" + - "What changed after the deploy?" + +3. **What actions follow?** + - Rollback, scale, investigate, escalate, ignore + +If a panel doesn't inform a decision → remove it. + +--- + +## The Overview → Drilldown → Evidence Pattern + +Structure dashboards in layers: + +``` +┌─────────────────────────────────────────────────────────┐ +│ OVERVIEW: Is anything broken? (Stats + TimeSeries) │ +│ Answer in <5 seconds │ +├─────────────────────────────────────────────────────────┤ +│ DRILLDOWN: Where is it broken? (Tables + Pies) │ +│ Identify the component/route/customer │ +├─────────────────────────────────────────────────────────┤ +│ EVIDENCE: What exactly happened? (LogStream) │ +│ Raw events for root cause │ +└─────────────────────────────────────────────────────────┘ +``` + +Users should be able to: +1. Glance at overview → "something's wrong with errors" +2. Scan drilldown → "it's the /checkout route" +3. 
Dive into evidence → "null pointer in payment handler" + +--- + +## Audience-Specific Defaults + +### Oncall Dashboard +- **Time window:** 15m–1h +- **Refresh:** 30s–1m +- **Focus:** Errors, latency spikes, recent changes +- **Stats:** Current error rate, p95, traffic +- **Priority:** Speed over completeness + +### Team Health Dashboard +- **Time window:** 24h–7d +- **Refresh:** 5m–15m +- **Focus:** SLO tracking, trends, regression detection +- **Stats:** SLO budget remaining, weekly error rate +- **Priority:** Context over immediacy + +### Executive Dashboard +- **Time window:** 7d–30d +- **Refresh:** 1h +- **Focus:** Business metrics, availability, cost +- **Stats:** Uptime %, request volume, top customers +- **Priority:** Clarity over detail + +--- + +## Anti-Patterns + +### Too Many Panels +**Problem:** Cognitive overload, slow rendering, no clear hierarchy. +**Fix:** Limit to 8–12 panels max. If more needed, split into multiple dashboards. + +### Pie Charts for High Cardinality +**Problem:** 50+ slices = unreadable rainbow. +**Fix:** Use tables for high cardinality. Pies only for ≤6 categories. + +### Missing Time Filters (Ad-hoc Queries Only) +**Problem:** Ad-hoc queries scan entire dataset history. +**Fix:** Always `where _time between (...)` as first filter in Query tab. +**Note:** Dashboard panel queries don't need this—they inherit time from the UI picker. + +### Averages Without Percentiles +**Problem:** Averages hide tail latency that affects real users. +**Fix:** Show p50, p95, p99 together. If only one, show p95 or p99. + +### Unbounded GROUP BY +**Problem:** `summarize by user_id` returns millions of rows. +**Fix:** Always add `| top N by ...` after high-cardinality groupings. + +### No Drilldown Path +**Problem:** Dashboard shows "errors are high" but no way to find where. +**Fix:** Always include breakdown tables that show top contributors. + +### Stale Data with Fast Refresh +**Problem:** Dashboard refreshes every 30s but queries 7 days. 
+**Fix:** Match refresh to time window. Fast refresh = short window. + +### Generic Panel Names +**Problem:** "Errors", "Latency", "Traffic" don't explain what you're looking at. +**Fix:** Question-style names: "Error rate by route", "p95 latency trend", "Requests per minute". + +--- + +## Golden Signals Coverage + +Every service dashboard should cover the four golden signals: + +| Signal | What to show | Chart type | +|--------|--------------|------------| +| **Traffic** | Requests/sec over time | TimeSeries | +| **Errors** | Error rate %, error count by type | TimeSeries + Table | +| **Latency** | p50/p95/p99 over time | TimeSeries | +| **Saturation** | CPU, memory, connections, queue depth | TimeSeries | + +If you can't show all four, prioritize: Errors > Latency > Traffic > Saturation. + +--- + +## Time Window Guidelines + +| Use case | Window | Bin size | +|----------|--------|----------| +| Active incident | 15m–1h | 10s–1m | +| Recent regression | 6h–24h | 5m–15m | +| Weekly review | 7d | 1h | +| Capacity planning | 30d | 6h–1d | + +**Rule of thumb:** Aim for 50–200 data points per series. +- 1h window ÷ 1m bins = 60 points ✓ +- 24h window ÷ 1m bins = 1440 points ✗ (too dense) +- 24h window ÷ 15m bins = 96 points ✓ + +--- + +## Refresh Rate Guidelines + +| Dashboard type | Refresh | +|----------------|---------| +| Oncall/incident | 30s–1m | +| Operational | 1m–5m | +| Daily health | 5m–15m | +| Reporting | Manual or 1h | + +Fast refresh on long time windows wastes resources. Match them. + +--- + +## Panel Ordering Principles + +1. **Most critical at top-left** (Stats row) +2. **Time series below stats** (context for the numbers) +3. **Breakdowns in middle** (drilldown path) +4. **Raw logs at bottom** (evidence, least used) + +Visual flow should match investigation flow: notice → narrow → verify. 
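The 50–200 points rule of thumb from the time window guidelines above can be checked with a few lines of arithmetic. This is a sketch only: `points_per_series` and `pick_bin` are hypothetical helpers, and the candidate bin sizes are the ones that appear in the guideline table.

```python
from datetime import timedelta

# Candidate bin sizes taken from the time window guidelines table.
CANDIDATE_BINS = [
    timedelta(seconds=10),
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(hours=6),
    timedelta(days=1),
]


def points_per_series(window: timedelta, bin_size: timedelta) -> int:
    """Number of whole bins a series has for a given window and bin size."""
    return window // bin_size


def pick_bin(window: timedelta, low: int = 50, high: int = 200):
    """Smallest candidate bin that keeps a series within [low, high] points."""
    for bin_size in CANDIDATE_BINS:
        if low <= points_per_series(window, bin_size) <= high:
            return bin_size
    return None  # no candidate fits the target range


print(points_per_series(timedelta(hours=24), timedelta(minutes=1)))  # too dense
print(pick_bin(timedelta(hours=24)))
print(pick_bin(timedelta(days=7)))
```

When no candidate fits, `bin_auto(_time)` as used elsewhere in these references is one reasonable fallback.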
+ +--- + +## Naming Conventions + +### Dashboard Names +- Include service/scope: "API Gateway - Oncall" +- Include purpose: "Payment Service - SLO Tracking" +- Avoid generic: "Dashboard 1", "Main" + +### Panel Titles +- Question format: "What is the error rate by route?" +- Include units: "Latency (ms)", "Traffic (req/s)" +- Include scope if multi-service: "[API] Error Rate" + +### Field Aliases +In APL, use `project` or aliases to create readable column names. Wrap names that contain spaces or symbols in `['...']`: +```apl +| project Route = route, Errors = error_count, ['Error Rate %'] = error_rate +``` diff --git a/.agents/skills/building-dashboards/reference/layout-recipes.md b/.agents/skills/building-dashboards/reference/layout-recipes.md new file mode 100644 index 00000000..a5b793a1 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/layout-recipes.md @@ -0,0 +1,226 @@ +# Layout Recipes + +Grid-based layout patterns for common dashboard structures. + +--- + +## Grid Basics + +- **Dashboard width:** 24 units +- **Minimum panel width:** 3 units +- **Panel positioning:** (x, y) coordinates with (w)idth and (h)eight + +### Common Panel Sizes + +| Panel type | Width (w) | Height (h) | Description | +|------------|-----------|------------|-------------| +| Statistic (compact) | 6 | 2 | Quarter width, KPI | +| Statistic (large) | 8 | 3 | Third width, featured KPI | +| TimeSeries (half) | 12 | 4 | Side-by-side charts | +| TimeSeries (full) | 24 | 4–6 | Full-width trend | +| Table (half) | 12 | 4–6 | Side-by-side tables | +| Table (full) | 24 | 5–8 | Detailed breakdown | +| Pie | 8–12 | 4 | Share visualization | +| LogStream | 24 | 6–10 | Raw events | +| Note (header) | 24 | 1–2 | Section title | +| SmartFilter | 24 | 2 | Dashboard filters | + +--- + +## Service Overview Layout + +Classic 4-section structure for oncall dashboards. 
+ +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ Row 0-1: Stats (h=2) │ +│ ┌─────────────┬─────────────┬─────────────┬─────────────┐ │ +│ │ Error Rate │ p95 Latency │ Traffic/s │ Active │ │ +│ │ x=0, w=6 │ x=6, w=6 │ x=12, w=6 │ Alerts x=18 │ │ +│ └─────────────┴─────────────┴─────────────┴─────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 2-5: TimeSeries (h=4) │ +│ ┌────────────────────────────────┬────────────────────────────────┐ │ +│ │ Traffic + Errors Over Time │ Latency Percentiles │ │ +│ │ x=0, w=12 │ x=12, w=12 │ │ +│ └────────────────────────────────┴────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 6-9: Tables (h=4) │ +│ ┌────────────────────────────────┬────────────────────────────────┐ │ +│ │ Top Failing Routes │ Top Error Messages │ │ +│ │ x=0, w=12 │ x=12, w=12 │ │ +│ └────────────────────────────────┴────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 10-15: LogStream (h=6) │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Recent Errors │ +│ │ x=0, w=24 │ +│ └────────────────────────────────────────────────────────────────────────────┘ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +**Layout JSON:** +```json +[ + {"i": "error-rate", "x": 0, "y": 0, "w": 6, "h": 2}, + {"i": "p95-latency", "x": 6, "y": 0, "w": 6, "h": 2}, + {"i": "traffic", "x": 12, "y": 0, "w": 6, "h": 2}, + {"i": "alerts", "x": 18, "y": 0, "w": 6, "h": 2}, + {"i": "traffic-errors-ts", "x": 0, "y": 2, "w": 12, "h": 4}, + {"i": "latency-ts", "x": 12, "y": 2, "w": 12, "h": 4}, + {"i": "top-routes", "x": 0, "y": 6, "w": 12, "h": 4}, + {"i": "top-errors", "x": 12, "y": 6, "w": 12, "h": 4}, + {"i": "logs", "x": 0, "y": 10, "w": 24, "h": 6} +] +``` + +--- + +## 
Multi-Service Comparison Layout + +Side-by-side comparison of multiple services. + +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ Row 0: SmartFilter │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Filters: environment, region │ +│ └────────────────────────────────────────────────────────────────────────────┘ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 2-5: TimeSeries (by service) │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Traffic by Service (stacked) │ +│ └────────────────────────────────────────────────────────────────────────────┘ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 6-9: TimeSeries (by service) │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Error Rate by Service │ +│ └────────────────────────────────────────────────────────────────────────────┘ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 10-13: Per-service columns │ +│ ┌──────────────────┬──────────────────┬──────────────────┐ │ +│ │ API Gateway │ Auth Service │ Payment Service │ │ +│ │ Stats + Table │ Stats + Table │ Stats + Table │ │ +│ └──────────────────┴──────────────────┴──────────────────┘ │ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## SLO Tracking Layout + +Focus on service level objectives and budget. 
+ +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ Row 0-2: SLO Stats (large) │ +│ ┌───────────────────┬───────────────────┬───────────────────┐ │ +│ │ Availability │ Latency SLO │ Error Budget │ │ +│ │ 99.95% │ p99 < 500ms │ 23% remaining │ │ +│ │ x=0, w=8, h=3 │ x=8, w=8, h=3 │ x=16, w=8, h=3 │ │ +│ └───────────────────┴───────────────────┴───────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 3-8: SLO Trends │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Availability Over Time (7d) with SLO threshold line │ +│ └────────────────────────────────────────────────────────────────────────────┘ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Error Budget Burn Rate │ +│ └────────────────────────────────────────────────────────────────────────────┘ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 14+: SLO Violations │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Table: SLO Violations by Route/Time │ +│ └────────────────────────────────────────────────────────────────────────────┘ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Incident Investigation Layout + +Detailed drilldown for active incidents. 
+ +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ Row 0: SmartFilter (service, route, status, trace_id) │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 2-5: Impact Overview │ +│ ┌─────────────┬─────────────┬─────────────┬─────────────┐ │ +│ │ Error Count │ Affected │ Start Time │ Duration │ │ +│ │ │ Customers │ │ │ │ +│ └─────────────┴─────────────┴─────────────┴─────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 6-11: Timeline │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Error Timeline (narrow bins, 10s-30s) │ +│ └────────────────────────────────────────────────────────────────────────────┘ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 12-17: Breakdown │ +│ ┌────────────────────────────────┬────────────────────────────────┐ │ +│ │ Errors by Route │ Errors by Error Message │ │ +│ └────────────────────────────────┴────────────────────────────────┘ │ +│ ┌────────────────────────────────┬────────────────────────────────┐ │ +│ │ Errors by Pod │ Errors by Customer │ │ +│ └────────────────────────────────┴────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 18+: Evidence (large LogStream) │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Raw Error Logs (h=10) │ +│ └────────────────────────────────────────────────────────────────────────────┘ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Kubernetes Cluster Overview + +Infrastructure-focused layout. 
+ +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ Row 0: Cluster Health Stats │ +│ ┌─────────────┬─────────────┬─────────────┬─────────────┐ │ +│ │ Nodes Ready │ Pods Running│ Restarts │ OOMKills │ │ +│ └─────────────┴─────────────┴─────────────┴─────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 2-5: Resource Usage │ +│ ┌────────────────────────────────┬────────────────────────────────┐ │ +│ │ CPU Usage by Namespace │ Memory Usage by Namespace │ │ +│ └────────────────────────────────┴────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 6-9: Pod Issues │ +│ ┌────────────────────────────────┬────────────────────────────────┐ │ +│ │ Pods with Restarts │ Pods with High CPU │ │ +│ └────────────────────────────────┴────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────────┤ +│ Row 10+: Events │ +│ ┌────────────────────────────────────────────────────────────────────────────┐ +│ │ Warning/Error Events (LogStream) │ +│ └────────────────────────────────────────────────────────────────────────────┘ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Layout Best Practices + +### Alignment +- Align related panels vertically or horizontally +- Keep consistent heights within rows +- Don't mix units in adjacent panels without clear separation + +### Visual Hierarchy +- Most important panels: top-left, larger +- Supporting context: smaller, below/right +- Evidence/logs: bottom, full-width + +### Responsive Considerations +- Minimum useful width: w=6 for stats, w=12 for charts/tables +- Full-width panels (w=24) for logs and complex tables +- Test at common screen sizes + +### Section Separation +- Use Note panels as section headers +- Or use vertical spacing (leave y gaps) +- Group related panels by 
theme/question diff --git a/.agents/skills/building-dashboards/reference/metrics-mpl.md b/.agents/skills/building-dashboards/reference/metrics-mpl.md new file mode 100644 index 00000000..8298d8df --- /dev/null +++ b/.agents/skills/building-dashboards/reference/metrics-mpl.md @@ -0,0 +1,61 @@ +# Metrics/MPL Chart Contract + +This reference documents the chart query contract for *metrics-backed* dashboard charts. + +Metrics charts require **two** fields: + +- `query.apl` — the MPL pipeline string (same field name used for APL queries). +- `query.metricsDataset` — the dataset name (e.g. `"otel-metrics"`). This field is what tells the backend to interpret `apl` as MPL. Without it, the chart will not behave correctly even if the pipeline string is well-formed. + +Do not send `query.mpl` in create payloads — the create API rejects it even though GET responses for existing metrics dashboards may include it. + +> **CRITICAL:** Run `scripts/metrics/metrics-spec ` before composing your first MPL query in a session. NEVER guess MPL syntax. + +## Canonical JSON Shape + +```json +{ + "type": "TimeSeries", + "query": { + "apl": "`otel-metrics`:`http.server.duration`\n| where `service.name` == \"api\"\n| align to $__interval using avg\n| group by `service.name` using avg", + "metricsDataset": "otel-metrics" + } +} +``` + +### Required and Optional Fields + +| Field | Required? | Description | +|-------|-----------|-------------| +| `apl` | ✅ Yes | The MPL pipeline string. Use this field even for MPL content. | +| `metricsDataset` | ✅ Yes (for metrics charts) | Dataset name (e.g. `"otel-metrics"`). Denotes the chart as MPL — without it the backend treats `apl` as APL. | +| `mpl` | ❌ No (rejected) | GET may return it for existing metrics charts, but create rejects it. Put the MPL string in `apl` instead. 
| +| `metricsMetric` | ❌ No | UI/editor metadata; not needed for hand-authored create payloads | +| `metricsFilter` | ❌ No | UI/editor metadata; not needed for hand-authored create payloads | +| `metricsTransformations` | ❌ No | UI/editor metadata; not needed for hand-authored create payloads | + +> **Why both `apl` and `metricsDataset`?** The dashboard create API uses `apl` as the query text field for both APL and MPL queries. `metricsDataset` is the discriminator that flags the chart as MPL. The dataset/metric selector is also embedded in the MPL string itself (e.g. `` `otel-metrics`:`http.server.duration` ``), but `metricsDataset` must still be set explicitly. + +## Authoring Checklist + +When generating metrics chart JSON: + +1. Confirm dataset kind is `otel:metrics:v1` via `scripts/metrics/datasets `. +2. Run `scripts/metrics/metrics-spec` to learn the full MPL syntax — **mandatory, never guess**. +3. Discover available metrics and tags with `scripts/metrics/metrics-info`. If results are empty, retry with `--start` set to 7 days ago (sparse metrics may not have data in the default 24h window). +4. Put the full MPL pipeline in `query.apl` AND set `query.metricsDataset` to the dataset name. Do not set `query.mpl` — the create API rejects it. +5. **Use `align to $__interval`, not a fixed window.** The dashboard runtime injects `$__interval` based on the time picker and panel width; a fixed `align to 1m` produces broken granularity outside its design range. Do not add `param $__interval: Duration;` to the chart string — the runtime injects it. Pre-validation via `scripts/metrics/metrics-query` requires substituting a concrete duration for that call only. +6. Validate your query with `scripts/metrics/metrics-query` before embedding in the dashboard. + +> **Note:** `find-metrics ` searches tag values, not metric names. Use `metrics-info metrics` to list metric names. 
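The checklist above can be sketched as a small pre-flight check. This is a minimal illustration, not part of the skill's scripts: `build_metrics_chart` and `check_metrics_chart` are hypothetical helper names, the chart `id` and `name` values are made up, and the checks cover only the rules stated in this reference (MPL in `query.apl`, `query.metricsDataset` set, no `query.mpl`, `align to $__interval` with no `param` declaration).

```python
import json


def build_metrics_chart(chart_id, name, dataset, mpl):
    """Build a TimeSeries chart dict with the MPL pipeline in query.apl (hypothetical helper)."""
    return {
        "id": chart_id,
        "name": name,
        "type": "TimeSeries",
        "query": {"apl": mpl, "metricsDataset": dataset},
    }


def check_metrics_chart(chart):
    """Return a list of contract violations; an empty list means these checks passed."""
    query = chart.get("query", {})
    apl = query.get("apl", "")
    problems = []
    if not apl:
        problems.append("query.apl is missing")
    if not query.get("metricsDataset"):
        problems.append("query.metricsDataset is missing")
    if "mpl" in query:
        problems.append("query.mpl must not be sent on create")
    if "param $__interval" in apl:
        problems.append("do not declare $__interval; the runtime injects it")
    if "align to" in apl and "align to $__interval" not in apl:
        problems.append("use align to $__interval, not a fixed window")
    return problems


# MPL string copied from the canonical JSON shape above.
MPL = (
    "`otel-metrics`:`http.server.duration`\n"
    '| where `service.name` == "api"\n'
    "| align to $__interval using avg\n"
    "| group by `service.name` using avg"
)

chart = build_metrics_chart("http-duration", "HTTP Duration", "otel-metrics", MPL)
print(json.dumps(chart, indent=2))
print(check_metrics_chart(chart))
```

A check like this only catches shape mistakes. It does not replace running `scripts/metrics/metrics-query` against the real dataset.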
+ +## Metrics Discovery & Query Scripts + +| Script | Usage | +|--------|-------| +| `scripts/metrics/datasets [--kind ]` | List datasets (with edge deployment info) | +| `scripts/metrics/metrics-spec ` | Fetch MPL query specification | +| `scripts/metrics/metrics-info ...` | Discover metrics, tags, and values | +| `scripts/metrics/metrics-query ` | Execute a metrics query | + +> These scripts are vendored from `query-metrics`. Keep in sync if upstream behavior changes. diff --git a/.agents/skills/building-dashboards/reference/smartfilter.md b/.agents/skills/building-dashboards/reference/smartfilter.md new file mode 100644 index 00000000..1d0a01a8 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/smartfilter.md @@ -0,0 +1,135 @@ +# SmartFilter (Filter Bar) Configuration + +SmartFilter is a **chart type** that creates dropdown/search filters. It requires TWO parts: +1. A `SmartFilter` chart in the `charts` array with filter definitions +2. `declare query_parameters` in each panel query that should respond to filters + +## SmartFilter Chart JSON Structure + +```json +{ + "id": "country-filter", + "name": "Filters", + "type": "SmartFilter", + "query": {"apl": ""}, + "filters": [ + { + "id": "country_filter", + "name": "Country", + "type": "select", + "selectType": "apl", + "active": true, + "apl": { + "apl": "['logs'] | distinct ['geo.country'] | project key=['geo.country'], value=['geo.country'] | sort by key asc", + "queryOptions": {"quickRange": "1h"} + }, + "options": [ + {"key": "All", "value": "", "default": true} + ] + } + ] +} +``` + +## Filter Types + +### Dynamic APL Dropdown (`selectType: "apl"`) + +Populates options from an APL query. 
+ +**Requirements:** +- `apl.apl`: Query returning `key` and `value` columns +- `apl.queryOptions.quickRange`: Time range for the query (e.g., `"1h"`, `"7d"`) +- `options`: Must include at least `[{"key": "All", "value": "", "default": true}]` + +### Static List Dropdown (`selectType: "list"`) + +Uses predefined options only. + +```json +{ + "id": "status_filter", + "name": "Status", + "type": "select", + "selectType": "list", + "active": true, + "options": [ + {"key": "All", "value": "", "default": true}, + {"key": "2xx", "value": "2"}, + {"key": "4xx", "value": "4"}, + {"key": "5xx", "value": "5"} + ] +} +``` + +### Search Filter (`type: "search"`) + +Free-text input instead of dropdown: + +```json +{ + "id": "trace_id", + "name": "Trace ID", + "type": "search", + "selectType": "list", + "active": true, + "options": [{"key": "All", "value": "", "default": true}] +} +``` + +## Panel Query Integration + +Panel queries must declare parameters and handle the empty (All) case: + +```apl +declare query_parameters (country_filter:string = ""); +['logs'] +| where isempty(country_filter) or ['geo.country'] == country_filter +| summarize count() by bin_auto(_time) +``` + +## Filter Query for Dynamic Dropdowns + +```apl +['logs'] +| distinct ['geo.country'] +| project key=['geo.country'], value=['geo.country'] +| sort by key asc +``` + +## Dependent/Cascading Filters + +Filters can depend on other filters by declaring their parameters in the APL query: + +```json +{ + "id": "city_filter", + "name": "City", + "type": "select", + "selectType": "apl", + "active": true, + "apl": { + "apl": "declare query_parameters (country_filter:string=\"\");\n['logs']\n| where isempty(country_filter) or ['geo.country'] == country_filter\n| distinct ['geo.city']\n| project key=['geo.city'], value=['geo.city']", + "queryOptions": {"quickRange": "1h"} + }, + "options": [{"key": "All", "value": "", "default": true}] +} +``` + +The city dropdown re-queries when `country_filter` changes, showing only cities in the selected 
country. + +## Layout + +Place SmartFilter at y=0, full width (w=12, h=1), shift other panels down: + +```json +{"i": "filters", "x": 0, "y": 0, "w": 12, "h": 1} +``` + +## Best Practices + +- Filter `id` must match the parameter name in `declare query_parameters` +- Use `isempty(filter)` check so "All" option works (empty string = no filter) +- One SmartFilter chart can contain multiple filters +- Place at top of dashboard (y=0) for visibility +- For cascading filters, order matters: parent filter should come before dependent filters diff --git a/.agents/skills/building-dashboards/reference/splunk-migration.md b/.agents/skills/building-dashboards/reference/splunk-migration.md new file mode 100644 index 00000000..beb87fa0 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/splunk-migration.md @@ -0,0 +1,243 @@ +# Splunk Dashboard Migration + +Guide for converting Splunk dashboards to Axiom dashboards. + +--- + +## Migration Workflow + +1. **Export Splunk dashboard** (XML or JSON) +2. **Inventory panels** — list each panel with its SPL query and visualization type +3. **Translate SPL → APL** using the `spl-to-apl` skill +4. **Map visualization types** (see table below) +5. **Test queries** with explicit time filters in Query tab (dashboards inherit time from UI picker) +6. **Adjust binning** for Axiom visualization +7. **Build Axiom dashboard** using templates (remove time filters from panel queries) +8. 
**Validate and deploy** with `dashboard-validate` and `dashboard-create` + +--- + +## Visualization Type Mapping + +| Splunk Visualization | Axiom Chart Type | Notes | +|---------------------|------------------|-------| +| Single Value | Statistic | Direct mapping | +| Line Chart | TimeSeries | Ensure `bin(_time, ...)` | +| Area Chart | TimeSeries | Same as line | +| Column Chart | TimeSeries | Axiom renders as bars | +| Bar Chart (horizontal) | Table | No horizontal bar; use table | +| Pie Chart | Pie | Limit to ≤6 categories | +| Table | Table | Direct mapping | +| Events List | LogStream | Add `take N` and `project-keep` | +| Choropleth Map | Table | No map support; use table | +| Scatter Plot | Table | No scatter; use table with dimensions | + +--- + +## Panel Translation Examples + +**Note:** Dashboard panel queries do NOT need time filters—the dashboard UI time picker applies to all panels automatically. The examples below show the final dashboard query format. + +### Single Value → Statistic + +**Splunk:** +```spl +index=web status>=500 +| stats count as errors +``` + +**Axiom (dashboard panel):** +```apl +['web-logs'] +| where status >= 500 +| summarize errors = count() +``` + +### Timechart → TimeSeries + +**Splunk:** +```spl +index=web +| timechart span=5m count by status +``` + +**Axiom (dashboard panel):** +```apl +['web-logs'] +| summarize count() by bin_auto(_time), status +``` + +### Stats Table → Table + +**Splunk:** +```spl +index=web status>=500 +| stats count by uri +| sort - count +| head 10 +``` + +**Axiom (dashboard panel):** +```apl +['web-logs'] +| where status >= 500 +| summarize count = count() by uri +| top 10 by count +| project URI = uri, Errors = count +``` + +### Top Command → Table + +**Splunk:** +```spl +index=web +| top limit=10 user_agent +``` + +**Axiom (dashboard panel):** +```apl +['web-logs'] +| summarize count() by user_agent +| top 10 by count_ +| project "User Agent" = user_agent, Count = count_ +``` + +### Events Search → 
LogStream + +**Splunk:** +```spl +index=web status>=500 +| table _time, uri, status, error_message +``` + +**Axiom (dashboard panel):** +```apl +['web-logs'] +| where status >= 500 +| project-keep _time, uri, status, error_message +| order by _time desc +| take 100 +``` + +### Chart with Eval → TimeSeries + +**Splunk:** +```spl +index=web +| timechart span=5m count as total, count(eval(status>=500)) as errors +| eval error_rate = round(errors/total*100, 2) +``` + +**Axiom (dashboard panel):** +```apl +['web-logs'] +| summarize + total = count(), + errors = countif(status >= 500) + by bin_auto(_time) +| extend error_rate = round(100.0 * errors / total, 2) +| project _time, error_rate +``` + +--- + +## Time Range Translation + +Splunk dashboards use time pickers. Axiom dashboards also have a time picker that automatically scopes all queries—**panel queries don't need explicit time filters**. + +For **ad-hoc testing** in the Query tab, use these time filters: + +| Splunk Time Picker | Axiom APL (for Query tab testing) | +|-------------------|-----------------------------------| +| Last 15 minutes | `where _time between (ago(15m) .. now())` | +| Last 60 minutes | `where _time between (ago(1h) .. now())` | +| Last 4 hours | `where _time between (ago(4h) .. now())` | +| Last 24 hours | `where _time between (ago(24h) .. now())` | +| Last 7 days | `where _time between (ago(7d) .. now())` | +| Today | `where _time between (startofday(now()) .. now())` | +| Yesterday | `where _time between (startofday(ago(1d)) .. startofday(now()))` | + +**Remember:** Remove time filters when placing queries in dashboard panels. + +--- + +## Binning Adjustment + +Splunk `timechart span=` maps to Axiom `bin(_time, ...)`. + +| Splunk | Axiom | +|--------|-------| +| `span=1m` | `bin(_time, 1m)` | +| `span=5m` | `bin(_time, 5m)` | +| `span=1h` | `bin(_time, 1h)` | +| `span=1d` | `bin(_time, 1d)` | + +Or use `bin_auto(_time)` for automatic sizing based on time range. 
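The span mapping above is mechanical, so it can be sketched as a helper. This is an illustration only: `span_to_bin` is a hypothetical name, and it accepts only the minute, hour, and day units that appear in the table, falling back to `bin_auto(_time)` when no span is given.

```python
import re


def span_to_bin(span=None):
    """Translate a Splunk timechart span (e.g. '5m') into an APL bin expression.

    Only the m/h/d units shown in the table are accepted; other units are an
    open question for this sketch and raise ValueError.
    """
    if span is None:
        # No fixed span: let Axiom size the bins from the time range.
        return "bin_auto(_time)"
    if not re.fullmatch(r"[1-9]\d*[mhd]", span):
        raise ValueError(f"unsupported span: {span!r}")
    return f"bin(_time, {span})"


for s in ("1m", "5m", "1h", "1d", None):
    print(s, "->", span_to_bin(s))
```

For dashboard panels, `bin_auto(_time)` is usually the safer default because it follows the time picker, which is why the helper returns it when no span is supplied.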
+ +--- + +## Field Name Mapping + +Splunk and Axiom may have different field names for the same data. + +| Concept | Splunk (common) | Axiom (common) | +|---------|-----------------|----------------| +| Timestamp | `_time` | `_time` | +| Raw event | `_raw` | `_raw` or structured fields | +| Source | `source` | `_source` or custom | +| Host | `host` | `host` or `['kubernetes.node.name']` | +| Index | `index` | N/A (use dataset) | + +**Tip:** Run `getschema` on your Axiom dataset to discover actual field names: +```apl +['your-dataset'] | where _time between (ago(1h) .. now()) | getschema +``` + +--- + +## Features Without Direct Equivalents + +| Splunk Feature | Axiom Approach | +|----------------|----------------| +| `transaction` | Use `summarize` with `make_list()` grouped by session/trace | +| `streamstats` | No direct equivalent; approximate with window functions | +| `eventstats` | Use subquery + join | +| Drilldown actions | Use SmartFilter for interactive filtering | +| Trellis layout | Create separate panels per dimension | +| Real-time search | Use short time window + fast refresh | + +--- + +## Common Migration Pitfalls + +### Unbounded Results +**Problem:** Splunk implicitly limits; Axiom may return all rows. +**Fix:** Add `| top N by ...` or `| take N` for tables/logs. + +### Case Sensitivity +**Problem:** Splunk search is case-insensitive by default. +**Fix:** Use `has` (case-insensitive) or `tolower()` for matching. + +### Field Escaping +**Problem:** Splunk uses bare field names; Axiom needs brackets for dots. +**Fix:** `field.name` → `['field.name']` + +### Different Aggregation Names +**Problem:** Function names differ between SPL and APL. +**Fix:** Consult `spl-to-apl` skill for complete mapping. 
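The field-escaping pitfall above can be handled with a small helper when generating queries from Splunk field lists. This is a sketch under assumptions: `apl_field` is a hypothetical name, it treats any name that is not a plain identifier as needing brackets (the text above only states this for dots), and it refuses names containing a single quote rather than guessing at APL's quoting rules.

```python
import re

# Plain identifiers (letters, digits, underscore) are left bare.
_BARE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def apl_field(name):
    """Return a field reference for an APL query, bracketing names such as 'geo.country'."""
    if _BARE_NAME.fullmatch(name):
        return name
    if "'" in name:
        # Quote escaping inside brackets is not covered here; fail loudly instead.
        raise ValueError(f"cannot bracket field name containing a quote: {name!r}")
    return f"['{name}']"


fields = ["status", "geo.country", "kubernetes.node.name"]
print(", ".join(apl_field(f) for f in fields))
```

Run `getschema` first, as suggested earlier, so the names passed in are the ones that actually exist in the Axiom dataset.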
+ +--- + +## Migration Checklist + +- [ ] Inventory all panels from Splunk dashboard +- [ ] Map each panel's visualization type +- [ ] Translate SPL queries using spl-to-apl +- [ ] Verify field names with getschema +- [ ] Test queries in Query tab (with time filters for testing) +- [ ] Add `top N` or `take N` where needed +- [ ] Test each query individually in Axiom +- [ ] Build dashboard JSON (remove time filters from panel queries) +- [ ] Validate with `dashboard-validate` +- [ ] Deploy with `dashboard-create` +- [ ] Compare visually to original Splunk dashboard diff --git a/.agents/skills/building-dashboards/reference/templates/api-health.json b/.agents/skills/building-dashboards/reference/templates/api-health.json new file mode 100644 index 00000000..39f345c7 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/templates/api-health.json @@ -0,0 +1,131 @@ +{ + "name": "{{service}} - API Health", + "description": "HTTP API health dashboard showing golden signals: traffic, errors, latency, and status distribution.", + "owner": "X-AXIOM-EVERYONE", + "datasets": ["{{dataset}}"], + "refreshTime": 60, + "schemaVersion": 2, + "timeWindowStart": "qr-now-1h", + "timeWindowEnd": "qr-now", + "charts": [ + { + "id": "total-requests", + "name": "Total Requests", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | summarize total = count()" + } + }, + { + "id": "error-rate", + "name": "Error Rate (%)", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | summarize total = count(), errors = countif(status >= 500) | extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0) | project error_rate" + } + }, + { + "id": "p50-latency", + "name": "p50 Latency (ms)", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | summarize p50 = percentile(duration_ms, 50)" + } + }, + { + "id": "p99-latency", + "name": "p99 Latency (ms)", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | summarize p99 = 
percentile(duration_ms, 99)" + } + }, + { + "id": "traffic-ts", + "name": "Request Rate Over Time", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | summarize requests = count() by bin_auto(_time)" + } + }, + { + "id": "error-rate-ts", + "name": "Error Rate Over Time", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | summarize total = count(), errors = countif(status >= 500) by bin_auto(_time) | extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0) | project _time, error_rate" + } + }, + { + "id": "latency-percentiles-ts", + "name": "Latency Percentiles Over Time", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)" + } + }, + { + "id": "status-distribution", + "name": "Status Code Distribution", + "type": "Pie", + "query": { + "apl": "['{{dataset}}'] | extend status_class = case(status < 300, '2xx', status < 400, '3xx', status < 500, '4xx', '5xx') | summarize count() by status_class" + } + }, + { + "id": "traffic-by-status-ts", + "name": "Traffic by Status Class", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | extend status_class = case(status < 300, '2xx', status < 400, '3xx', status < 500, '4xx', '5xx') | summarize count() by bin_auto(_time), status_class" + } + }, + { + "id": "top-routes-by-traffic", + "name": "Top Routes by Traffic", + "type": "Table", + "query": { + "apl": "['{{dataset}}'] | summarize requests = count(), errors = countif(status >= 500), p95 = percentile(duration_ms, 95) by route | top 10 by requests | project Route = route, Requests = requests, Errors = errors, 'p95 (ms)' = p95" + } + }, + { + "id": "top-routes-by-errors", + "name": "Top Routes by Errors", + "type": "Table", + "query": { + "apl": "['{{dataset}}'] | where status >= 500 | summarize errors = count() by route, status | top 10 by errors | project Route = route, Status = status, Errors = errors" + } + }, + { + "id": 
"slowest-routes", + "name": "Slowest Routes (by p99)", + "type": "Table", + "query": { + "apl": "['{{dataset}}'] | summarize requests = count(), p99 = percentile(duration_ms, 99) by route | where requests >= 10 | top 10 by p99 | project Route = route, 'p99 (ms)' = p99, Requests = requests" + } + }, + { + "id": "recent-errors", + "name": "Recent Errors", + "type": "LogStream", + "query": { + "apl": "['{{dataset}}'] | where status >= 500 | project-keep _time, trace_id, method, route, status, error_message, duration_ms | order by _time desc | take 100" + } + } + ], + "layout": [ + {"i": "total-requests", "x": 0, "y": 0, "w": 3, "h": 2}, + {"i": "error-rate", "x": 3, "y": 0, "w": 3, "h": 2}, + {"i": "p50-latency", "x": 6, "y": 0, "w": 3, "h": 2}, + {"i": "p99-latency", "x": 9, "y": 0, "w": 3, "h": 2}, + {"i": "traffic-ts", "x": 0, "y": 2, "w": 6, "h": 4}, + {"i": "error-rate-ts", "x": 6, "y": 2, "w": 6, "h": 4}, + {"i": "latency-percentiles-ts", "x": 0, "y": 6, "w": 6, "h": 4}, + {"i": "status-distribution", "x": 6, "y": 6, "w": 3, "h": 4}, + {"i": "traffic-by-status-ts", "x": 9, "y": 6, "w": 3, "h": 4}, + {"i": "top-routes-by-traffic", "x": 0, "y": 10, "w": 4, "h": 4}, + {"i": "top-routes-by-errors", "x": 4, "y": 10, "w": 4, "h": 4}, + {"i": "slowest-routes", "x": 8, "y": 10, "w": 4, "h": 4}, + {"i": "recent-errors", "x": 0, "y": 14, "w": 12, "h": 6} + ] +} diff --git a/.agents/skills/building-dashboards/reference/templates/blank.json b/.agents/skills/building-dashboards/reference/templates/blank.json new file mode 100644 index 00000000..636cc1a2 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/templates/blank.json @@ -0,0 +1,12 @@ +{ + "name": "{{name}}", + "description": "{{description}}", + "owner": "X-AXIOM-EVERYONE", + "datasets": ["{{dataset}}"], + "refreshTime": 60, + "schemaVersion": 2, + "timeWindowStart": "qr-now-1h", + "timeWindowEnd": "qr-now", + "charts": [], + "layout": [] +} diff --git 
a/.agents/skills/building-dashboards/reference/templates/org-usage-cost-control.json b/.agents/skills/building-dashboards/reference/templates/org-usage-cost-control.json new file mode 100644 index 00000000..08bfc155 --- /dev/null +++ b/.agents/skills/building-dashboards/reference/templates/org-usage-cost-control.json @@ -0,0 +1,384 @@ +{ + "charts": [ + { + "filters": [ + { + "active": true, + "apl": { + "apl": "['axiom-audit'] | where action == 'usageCalculated' | distinct ['resource.id'] | project key=['resource.id'], value=['resource.id'] | sort by key asc", + "queryOptions": { + "quickRange": "24h" + } + }, + "id": "org_filter", + "name": "Organization", + "options": [ + { + "default": true, + "key": "All Orgs", + "value": "" + } + ], + "selectType": "apl", + "type": "select" + }, + { + "active": true, + "apl": { + "apl": "['axiom-audit'] | where action == 'usageCalculated' | distinct tostring(['properties.dataset']) | project key=tostring(['properties.dataset']), value=tostring(['properties.dataset']) | sort by key asc", + "queryOptions": { + "quickRange": "24h" + } + }, + "id": "dataset_filter", + "name": "Dataset", + "options": [ + { + "default": true, + "key": "All Datasets", + "value": "" + } + ], + "selectType": "apl", + "type": "select" + }, + { + "active": true, + "apl": { + "apl": "[\"axiom-audit\"] | where action == \"runAPLQueryCost\" | extend User = case(isnotempty([\"actor.email\"]), [\"actor.email\"], isnotempty([\"actor.name\"]), strcat(\"[\", [\"actor.name\"], \"]\"), \"[unknown]\") | distinct User | project key=User, value=User | sort by key asc", + "queryOptions": { + "quickRange": "24h" + } + }, + "id": "user_filter", + "name": "User", + "options": [ + { + "default": true, + "key": "All Users", + "value": "" + } + ], + "selectType": "apl", + "type": "select" + }, + { + "active": true, + "id": "contract_gb", + "name": "Contract (GB/mo)", + "type": "search", + "options": [ + { + "default": true, + "key": "", + "value": "" + } + ] + } + ], + 
"id": "filters", + "name": "Filters", + "query": { + "apl": "" + }, + "type": "SmartFilter" + }, + { + "colorScheme": "Blue", + "customUnits": "GB", + "id": "total-ingest-gb", + "name": "Total Ingest", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n['axiom-audit']\n| where action == 'usageCalculated'\n| where isempty(org_filter) or ['resource.id'] == org_filter\n| where isempty(dataset_filter) or tostring(['properties.dataset']) == dataset_filter\n| summarize total_bytes = sum(['properties.hourly_ingest_bytes'])\n| extend ['Total'] = total_bytes / 1000000000\n| project ['Total']" + }, + "type": "Statistic" + }, + { + "colorScheme": "Orange", + "customUnits": "GB/day", + "id": "daily-burn-rate", + "name": "Daily Burn Rate", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n['axiom-audit']\n| where action == 'usageCalculated'\n| where isempty(org_filter) or ['resource.id'] == org_filter\n| where isempty(dataset_filter) or tostring(['properties.dataset']) == dataset_filter\n| summarize total_bytes = sum(['properties.hourly_ingest_bytes']), day_count = dcount(bin(_time, 1d))\n| extend days = iff(day_count == 0, 1.0, todouble(day_count))\n| extend ['GB/day'] = round(total_bytes / days / 1000000000, 0)\n| project ['GB/day']" + }, + "showChart": true, + "type": "Statistic", + "unit": "Gigabyte" + }, + { + "colorScheme": "Purple", + "customUnits": "GB", + "errorThreshold": "Above", + "errorThresholdValue": "15000000", + "id": "monthly-projection", + "name": "30-Day Projection", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n['axiom-audit']\n| where action == 'usageCalculated'\n| where isempty(org_filter) or ['resource.id'] == org_filter\n| where isempty(dataset_filter) or tostring(['properties.dataset']) == dataset_filter\n| summarize total_bytes = sum(['properties.hourly_ingest_bytes']), day_count 
= dcount(bin(_time, 1d))\n| extend days = iff(day_count == 0, 1.0, todouble(day_count))\n| extend daily_rate = total_bytes / days\n| extend ['30d GB'] = round(daily_rate * 30 / 1000000000, 0)\n| project ['30d GB']" + }, + "type": "Statistic", + "warningThreshold": "Above", + "warningThresholdValue": "5000000" + }, + { + "colorScheme": "Teal", + "customUnits": "GB·ms", + "id": "total-query-cost", + "name": "Query Cost", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize [\"GB·ms\"] = round(sum([\"properties.hourly_billable_query_gbms\"]), 0)\n| project [\"GB·ms\"]" + }, + "type": "Statistic" + }, + { + "colorScheme": "Yellow", + "customUnits": "%", + "errorThreshold": "Above", + "errorThresholdValue": "25", + "id": "wow-change", + "name": "Week-over-Week", + "overrideDashboardTimeRange": true, + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n['axiom-audit']\n| where _time between (ago(14d) .. 
now())\n| where action == 'usageCalculated'\n| where isempty(org_filter) or ['resource.id'] == org_filter\n| where isempty(dataset_filter) or tostring(['properties.dataset']) == dataset_filter\n| summarize this_week = sumif(['properties.hourly_ingest_bytes'], _time >= ago(7d)), last_week = sumif(['properties.hourly_ingest_bytes'], _time < ago(7d) and _time >= ago(14d))\n| extend ['WoW %'] = iff(last_week == 0, real(null), round(100.0 * (this_week - last_week) / last_week, 1))\n| project ['WoW %']" + }, + "type": "Statistic", + "warningThreshold": "Above", + "warningThresholdValue": "10" + }, + + { + "colorScheme": "Green", + "id": "active-datasets", + "name": "Active Datasets", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n['axiom-audit']\n| where action == 'usageCalculated'\n| where isempty(org_filter) or ['resource.id'] == org_filter\n| where isempty(dataset_filter) or tostring(['properties.dataset']) == dataset_filter\n| summarize ['Datasets'] = dcount(['properties.dataset'])" + }, + "type": "Statistic" + }, + { + "datasetId": "axiom-audit", + "id": "ingest-by-dataset-ts", + "modified": 1769193868070, + "name": "Daily Ingest by Dataset (GB)", + "numSeries": 1, + "overrideDashboardCompareAgainst": false, + "overrideDashboardTimeRange": false, + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize [\"GB\"] = round(sum([\"properties.hourly_ingest_bytes\"]) / 1000000000, 1) by bin(_time, 1d), Dataset = tostring([\"properties.dataset\"])", + "endTime": "", + "libraries": [], + "queryOptions": { + "additionalQueryOptions": { + "aggChartOpts": 
"{\"*\":{\"variant\":\"bars\"},\"{\\\"alias\\\":\\\"GB\\\",\\\"field\\\":\\\"properties.hourly_ingest_bytes\\\",\\\"op\\\":\\\"computed\\\"}\":{\"displayNull\":\"auto\",\"variant\":\"bars\"}}" + }, + "aggChartOpts": "{\"*\":{\"variant\":\"bars\"},\"{\\\"alias\\\":\\\"GB\\\",\\\"field\\\":\\\"properties.hourly_ingest_bytes\\\",\\\"op\\\":\\\"computed\\\"}\":{\"displayNull\":\"auto\",\"variant\":\"bars\"}}", + "containsTimeFilter": "false", + "endTime": "2026-01-23T18:44:21.579Z", + "quickRange": "7d", + "startTime": "2026-01-16T18:44:21.579Z", + "timeSeriesView": "charts" + }, + "resolution": "auto", + "startTime": "" + }, + "type": "TimeSeries" + }, + { + "id": "top-datasets-ingest", + "name": "Top Datasets by Ingest", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize ingest_gb = round(sum([\"properties.hourly_ingest_bytes\"]) / 1000000000, 1), query_gbms = round(sum([\"properties.hourly_billable_query_gbms\"]), 0) by Dataset = tostring([\"properties.dataset\"])\n| sort by ingest_gb desc\n| limit 15\n| project Dataset, [\"Ingest GB\"] = ingest_gb, [\"Query GB·ms\"] = query_gbms" + }, + "tableSettings": { + "columns": [ + {"name": "Dataset", "width": 200}, + {"name": "Ingest GB", "width": 100}, + {"name": "Query GB·ms", "width": 120} + ] + }, + "type": "Table" + }, + { + "id": "waste-candidates", + "name": "Lowest Query Activity", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize ingest_bytes = 
sum([\"properties.hourly_ingest_bytes\"]), query_gbms = sum([\"properties.hourly_billable_query_gbms\"]) by Dataset = tostring([\"properties.dataset\"])\n| extend ingest_gb = round(ingest_bytes / 1000000000, 1)\n| extend work_per_gb = round(query_gbms / (ingest_gb + 0.001), 0)\n| where ingest_gb > 1\n| order by work_per_gb asc\n| project Dataset, [\"Ingest GB\"] = ingest_gb, [\"Query GB·ms\"] = round(query_gbms, 0), [\"Work/GB\"] = work_per_gb\n| take 10" + }, + "tableSettings": { + "columns": [ + {"name": "Dataset", "width": 200}, + {"name": "Ingest GB", "width": 100}, + {"name": "Query GB·ms", "width": 120}, + {"name": "Work/GB", "width": 90} + ] + }, + "type": "Table" + }, + { + "id": "top-orgs", + "name": "Top Orgs by Usage", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize ingest_gb = round(sum([\"properties.hourly_ingest_bytes\"]) / 1000000000, 1), query_gbms = round(sum([\"properties.hourly_billable_query_gbms\"]), 0), datasets = dcount([\"properties.dataset\"]) by Org = [\"resource.id\"]\n| sort by ingest_gb desc\n| limit 10\n| project Org, [\"Ingest GB\"] = ingest_gb, [\"Query GB·ms\"] = query_gbms, Datasets = datasets" + }, + "tableSettings": { + "columns": [ + {"name": "Org", "width": 180}, + {"name": "Ingest GB", "width": 100}, + {"name": "Query GB·ms", "width": 120}, + {"name": "Datasets", "width": 80} + ] + }, + "type": "Table" + }, + { + "id": "note-actions", + "name": "Cost Optimization Actions", + "query": {}, + "text": "## Cost Control Playbook\n\n### Understanding Work/GB\n\n**Work/GB** = Query Cost (GB·ms) ÷ Ingest GB\n\nThis ratio measures how much query activity occurs relative to the amount of data ingested. 
Lower values indicate data that is stored but rarely queried.\n\n- **0** = Data ingested but never queried\n- **Low values** = Candidates for optimization\n- **Higher values** = Actively used data\n\nThe **Lowest Query Activity** panel ranks datasets by this ratio, with the least-queried at the top.\n\n### Optimization Actions\n\n| Signal | Action |\n|--------|--------|\n| **Work/GB = 0** | Consider dropping or stop ingesting |\n| **Low Work/GB + High Ingest** | Partition, sample, or filter at source |\n| **WoW spike** | Investigate recent deploys or new services |\n\n### Investigate Further\n\nUse **[axiom-sre](https://github.com/axiomhq/skills)** to investigate which logs are ingested but never queried, or which applications contribute the most events.\n\n```\nnpx skills add axiomhq/skills -s axiom-sre\n```", + "type": "Note" + }, + { + "id": "query-cost-by-dataset-ts", + "name": "Daily Query Cost by Dataset (GB·ms)", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize [\"GB·ms\"] = round(sum([\"properties.hourly_billable_query_gbms\"]), 0) by bin(_time, 1d), Dataset = tostring([\"properties.dataset\"])", + "queryOptions": { + "aggChartOpts": "{\"{\\\"alias\\\":\\\"GB·ms\\\",\\\"field\\\":\\\"properties.hourly_billable_query_gbms\\\",\\\"op\\\":\\\"computed\\\"}\":{\"variant\":\"bars\",\"displayNull\":\"auto\"}}" + } + }, + "type": "TimeSeries" + }, + { + "id": "wow-ingest-delta", + "name": "Top Ingest Movers (WoW)", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-audit\"]\n| where _time between (ago(14d) .. 
now())\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize this_week = sumif([\"properties.hourly_ingest_bytes\"], _time >= ago(7d)), last_week = sumif([\"properties.hourly_ingest_bytes\"], _time < ago(7d) and _time >= ago(14d)) by Dataset = tostring([\"properties.dataset\"])\n| extend delta_gb = (this_week - last_week) / 1000000000\n| extend [\"This Week GB\"] = round(this_week / 1000000000, 1), [\"Last Week GB\"] = round(last_week / 1000000000, 1)\n| extend [\"Delta GB\"] = round(delta_gb, 1)\n| extend [\"Delta %\"] = iff(last_week == 0, 100.0, round(100.0 * (this_week - last_week) / last_week, 1))\n| where delta_gb > 10\n| sort by delta_gb desc\n| limit 10\n| project Dataset, [\"This Week GB\"], [\"Last Week GB\"], [\"Delta GB\"], [\"Delta %\"]" + }, + "tableSettings": { + "columns": [ + {"name": "Dataset", "width": 180}, + {"name": "This Week GB", "width": 110}, + {"name": "Last Week GB", "width": 110}, + {"name": "Delta GB", "width": 90}, + {"name": "Delta %", "width": 80} + ] + }, + "type": "Table" + }, + { + "id": "top-users-query-cost", + "name": "Top 10 Users by Query Cost", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\", user_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"runAPLQueryCost\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| extend User = case(isnotempty([\"actor.email\"]), [\"actor.email\"], isnotempty([\"actor.name\"]), strcat(\"[\", [\"actor.name\"], \"]\"), isnotempty([\"actor.userAgent\"]), strcat(\"[\", [\"actor.userAgent\"], \"]\"), isnotempty([\"actor.id\"]), strcat(\"[id:\", substring([\"actor.id\"], 0, 8), \"]\"), isnotempty(source), strcat(\"[source:\", source, \"]\"), \"[unknown]\")\n| where isempty(user_filter) or User == user_filter\n| summarize query_cost = 
sum([\"properties.query_cost_gbms\"]), queries = count() by User\n| sort by query_cost desc\n| limit 10\n| project User, [\"Cost GB·ms\"] = round(query_cost, 0), Queries = queries" + }, + "tableSettings": { + "columns": [ + {"name": "User", "width": 250}, + {"name": "Cost GB·ms", "width": 120}, + {"name": "Queries", "width": 80} + ] + }, + "type": "Table" + }, + { + "id": "top-expensive-queries", + "name": "Top 10 Expensive Queries", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\", user_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"runAPLQueryCost\"\n| where [\"properties.query_cost_gbms\"] > 0\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| extend User = case(isnotempty([\"actor.email\"]), [\"actor.email\"], isnotempty([\"actor.name\"]), strcat(\"[\", [\"actor.name\"], \"]\"), isnotempty([\"actor.userAgent\"]), strcat(\"[\", [\"actor.userAgent\"], \"]\"), isnotempty([\"actor.id\"]), strcat(\"[id:\", substring([\"actor.id\"], 0, 8), \"]\"), isnotempty(source), strcat(\"[source:\", source, \"]\"), \"[unknown]\")\n| where isempty(user_filter) or User == user_filter\n| sort by [\"properties.query_cost_gbms\"] desc\n| limit 10\n| project User, [\"Cost GB·ms\"] = round([\"properties.query_cost_gbms\"], 0), Query = substring([\"properties.query_string\"], 0, 100)" + }, + "tableSettings": { + "columns": [ + {"name": "User", "width": 180}, + {"name": "Cost GB·ms", "width": 100}, + {"name": "Query", "width": 300} + ] + }, + "type": "Table" + }, + { + "id": "queries-per-user-ts", + "name": "Queries per User", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\", user_filter:string = \"\");\n[\"axiom-audit\"]\n| where action == \"runAPLQueryCost\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| extend User = case(isnotempty([\"actor.email\"]), [\"actor.email\"], isnotempty([\"actor.name\"]), strcat(\"[\", 
[\"actor.name\"], \"]\"), isnotempty([\"actor.userAgent\"]), strcat(\"[\", [\"actor.userAgent\"], \"]\"), isnotempty([\"actor.id\"]), strcat(\"[id:\", substring([\"actor.id\"], 0, 8), \"]\"), isnotempty(source), strcat(\"[source:\", source, \"]\"), \"[unknown]\")\n| where isempty(user_filter) or User == user_filter\n| summarize Queries = count() by bin(_time, 1d), User", + "queryOptions": { + "aggChartOpts": "{\"{\\\"alias\\\":\\\"Queries\\\",\\\"op\\\":\\\"count\\\"}\":{\"variant\":\"line\",\"scaleDistr\":\"log\"}}" + } + }, + "type": "TimeSeries" + }, + { + "id": "over-contract-pct", + "name": "% Over Contract", + "type": "Statistic", + "colorScheme": "Red", + "customUnits": "%", + "errorThreshold": "Above", + "errorThresholdValue": "50", + "warningThreshold": "Above", + "warningThresholdValue": "20", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\", contract_gb:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize total_bytes = sum([\"properties.hourly_ingest_bytes\"]), day_count = dcount(bin(_time, 1d))\n| extend days = iff(day_count == 0, 1.0, todouble(day_count))\n| extend daily_gb = total_bytes / days / 1000000000\n| extend monthly_gb = daily_gb * 30\n| extend contract = todouble(contract_gb)\n| extend [\"%\"] = iff(isempty(contract_gb) or contract <= 0, real(null), round(100.0 * (monthly_gb - contract) / contract, 0))\n| project [\"%\"]" + } + }, + { + "id": "required-cut-pct", + "name": "Required Cut", + "type": "Statistic", + "colorScheme": "Orange", + "customUnits": "%", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\", contract_gb:string = \"\");\n[\"axiom-audit\"]\n| where action == \"usageCalculated\"\n| where isempty(org_filter) or [\"resource.id\"] == 
org_filter\n| where isempty(dataset_filter) or tostring([\"properties.dataset\"]) == dataset_filter\n| summarize total_bytes = sum([\"properties.hourly_ingest_bytes\"]), day_count = dcount(bin(_time, 1d))\n| extend days = iff(day_count == 0, 1.0, todouble(day_count))\n| extend daily_gb = total_bytes / days / 1000000000\n| extend monthly_gb = daily_gb * 30\n| extend contract = todouble(contract_gb)\n| extend [\"%\"] = iff(isempty(contract_gb) or contract <= 0 or monthly_gb <= contract, real(null), round(100.0 * (monthly_gb - contract) / monthly_gb, 0))\n| project [\"%\"]" + } + }, + { + "id": "top-queried-fields", + "name": "Query Filter Patterns", + "type": "Table", + "query": { + "apl": "declare query_parameters (org_filter:string = \"\", dataset_filter:string = \"\");\n[\"axiom-history\"]\n| where isnotempty([\"query.apl\"])\n| extend parsed = parse_apl([\"query.apl\"])\n| extend dataset = tostring(parsed.body.source.name.name)\n| where isempty(dataset_filter) or dataset == dataset_filter\n| extend ops = todynamic(tostring(parsed.body.operations))\n| mv-expand ops\n| where ops.kind == \"Where\"\n| extend p = ops.predicate\n| extend is_logical = p.op in (\"and\", \"or\")\n| extend f1 = iff(not(is_logical) and p.kind == \"BinaryExpr\", tostring(p.left.name), \"\")\n| extend o1 = iff(not(is_logical) and p.kind == \"BinaryExpr\", tostring(p.op), \"\")\n| extend v1 = iff(not(is_logical) and p.kind == \"BinaryExpr\", coalesce(tostring(p.right.value), tostring(p.right.name)), \"\")\n| extend f2 = iff(tostring(p.left.op) !in (\"and\", \"or\", \"\"), tostring(p.left.left.name), \"\")\n| extend o2 = iff(tostring(p.left.op) !in (\"and\", \"or\", \"\"), tostring(p.left.op), \"\")\n| extend v2 = coalesce(tostring(p.left.right.value), tostring(p.left.right.name))\n| extend f3 = iff(tostring(p.right.op) !in (\"and\", \"or\", \"\"), tostring(p.right.left.name), \"\")\n| extend o3 = iff(tostring(p.right.op) !in (\"and\", \"or\", \"\"), tostring(p.right.op), \"\")\n| extend v3 = 
coalesce(tostring(p.right.right.value), tostring(p.right.right.name))\n| extend in_field = iff(p.kind == \"InExpr\", tostring(p.left.name), \"\")\n| extend in_op = iff(p.kind == \"InExpr\", tostring(p.op), \"\")\n| extend in_vals = iff(p.kind == \"InExpr\", strcat(tostring(p.right.list[0].value), \", \", tostring(p.right.list[1].value), \"...\"), \"\")\n| extend func_name = case(p.kind == \"CallExpr\", tostring(p.func.name), p.left.kind == \"CallExpr\", tostring(p.left.func.name), p.right.kind == \"CallExpr\", tostring(p.right.func.name), \"\")\n| extend func_params_arr = case(p.kind == \"CallExpr\", p.params, p.left.kind == \"CallExpr\", p.left.params, p.right.kind == \"CallExpr\", p.right.params, dynamic([]))\n| extend func_fields = strcat(tostring(func_params_arr[0].expr.name), iff(array_length(func_params_arr) > 1, strcat(\", \", tostring(func_params_arr[1].expr.name)), \"\"))\n| extend field = coalesce(iff(isnotempty(in_field), in_field, \"\"), iff(isnotempty(func_fields) and func_fields != \"_time\", func_fields, \"\"), iff(isnotempty(f1) and f1 != \"_time\", f1, \"\"), iff(isnotempty(f2) and f2 != \"_time\", f2, \"\"), iff(isnotempty(f3) and f3 != \"_time\", f3, \"\"))\n| extend op = coalesce(iff(isnotempty(in_op), in_op, \"\"), iff(isnotempty(func_name), func_name, \"\"), iff(isnotempty(o1), o1, \"\"), iff(isnotempty(o2), o2, \"\"), iff(isnotempty(o3), o3, \"\"))\n| extend val = coalesce(iff(isnotempty(in_vals), in_vals, \"\"), iff(isnotempty(v1), v1, \"\"), iff(isnotempty(v2), v2, \"\"), iff(isnotempty(v3), v3, \"\"))\n| where isnotempty(field) and isnotempty(op)\n| summarize Queries = count() by dataset, field, op, val\n| sort by Queries desc\n| limit 20\n| project Dataset=dataset, Field=field, Op=op, Value=substring(val, 0, 40), Queries" + }, + "tableSettings": { + "columns": [ + {"name": "Dataset", "width": 140}, + {"name": "Field", "width": 180}, + {"name": "Op", "width": 70}, + {"name": "Value", "width": 150}, + {"name": "Queries", "width": 80} + ] + 
} + } + ], + "datasets": [ + "axiom-audit", + "axiom-history" + ], + "description": "Usage monitoring dashboard for tracking ingest volume, query costs, burn rate projections, and waste identification. Filter by org to scope analysis.", + "layout": [ + { "i": "filters", "x": 0, "y": 0, "w": 12, "h": 1 }, + { "i": "total-ingest-gb", "x": 0, "y": 1, "w": 3, "h": 2 }, + { "i": "daily-burn-rate", "x": 3, "y": 1, "w": 3, "h": 2 }, + { "i": "monthly-projection", "x": 6, "y": 1, "w": 3, "h": 2 }, + { "i": "over-contract-pct", "x": 9, "y": 1, "w": 3, "h": 2 }, + { "i": "required-cut-pct", "x": 0, "y": 3, "w": 3, "h": 2 }, + { "i": "wow-change", "x": 3, "y": 3, "w": 3, "h": 2 }, + { "i": "total-query-cost", "x": 6, "y": 3, "w": 3, "h": 2 }, + { "i": "active-datasets", "x": 9, "y": 3, "w": 3, "h": 2 }, + { "i": "ingest-by-dataset-ts", "x": 0, "y": 5, "w": 6, "h": 4 }, + { "i": "query-cost-by-dataset-ts", "x": 6, "y": 5, "w": 6, "h": 4 }, + { "i": "wow-ingest-delta", "x": 0, "y": 9, "w": 6, "h": 4 }, + { "i": "waste-candidates", "x": 6, "y": 9, "w": 6, "h": 4 }, + { "i": "top-datasets-ingest", "x": 0, "y": 13, "w": 6, "h": 4 }, + { "i": "top-orgs", "x": 6, "y": 13, "w": 6, "h": 4 }, + { "i": "top-queried-fields", "x": 0, "y": 17, "w": 6, "h": 4 }, + { "i": "note-actions", "x": 6, "y": 17, "w": 6, "h": 4 }, + { "i": "top-users-query-cost", "x": 0, "y": 21, "w": 6, "h": 4 }, + { "i": "top-expensive-queries", "x": 6, "y": 21, "w": 6, "h": 4 }, + { "i": "queries-per-user-ts", "x": 0, "y": 25, "w": 12, "h": 4 } + ], + "name": "Org Usage & Cost Control", + "owner": "8bc79245-bc38-4a2b-a37e-2e5b2fd5ec70", + "refreshTime": 300, + "schemaVersion": 2, + "timeWindowEnd": "qr-now", + "timeWindowStart": "qr-now-30d", + "version": "1769205867543185447" +} diff --git a/.agents/skills/building-dashboards/reference/templates/service-overview-with-filters.json b/.agents/skills/building-dashboards/reference/templates/service-overview-with-filters.json new file mode 100644 index 
00000000..8f5405ae --- /dev/null +++ b/.agents/skills/building-dashboards/reference/templates/service-overview-with-filters.json @@ -0,0 +1,132 @@ +{ + "name": "{{service}} - Service Overview (Filtered)", + "description": "Interactive dashboard for {{service}} with SmartFilter. Allows filtering by route and status code.", + "owner": "X-AXIOM-EVERYONE", + "datasets": ["{{dataset}}"], + "refreshTime": 60, + "schemaVersion": 2, + "timeWindowStart": "qr-now-1h", + "timeWindowEnd": "qr-now", + "charts": [ + { + "id": "filters", + "name": "Filters", + "type": "SmartFilter", + "query": {"apl": ""}, + "filters": [ + { + "id": "route_filter", + "name": "Route", + "type": "select", + "selectType": "apl", + "active": true, + "apl": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | distinct route | project key=route, value=route | sort by key asc", + "queryOptions": {"quickRange": "1h"} + }, + "options": [ + {"key": "All", "value": "", "default": true} + ] + }, + { + "id": "status_filter", + "name": "Status", + "type": "select", + "selectType": "list", + "active": true, + "options": [ + {"key": "All", "value": "", "default": true}, + {"key": "2xx", "value": "2"}, + {"key": "3xx", "value": "3"}, + {"key": "4xx", "value": "4"}, + {"key": "5xx", "value": "5"} + ] + } + ] + }, + { + "id": "error-rate", + "name": "Error Rate", + "type": "Statistic", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize total = count(), errors = countif(status >= 500)\n| extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0)\n| project ['Error Rate %'] = error_rate" + } + }, + { + "id": "p95-latency", + "name": "p95 Latency", + "type": "Statistic", + "query": { + "apl": "declare query_parameters (route_filter:string = 
\"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize ['p95 (ms)'] = round(percentile(duration_ms, 95), 1)" + } + }, + { + "id": "traffic-rps", + "name": "Total Requests", + "type": "Statistic", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize ['Total Requests'] = count()" + } + }, + { + "id": "error-count", + "name": "Errors", + "type": "Statistic", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}' and status >= 500\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize Errors = count()" + } + }, + { + "id": "request-rate-ts", + "name": "Request Rate Over Time", + "type": "TimeSeries", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize ['req/min'] = count() by bin_auto(_time)" + } + }, + { + "id": "error-rate-ts", + "name": "Error Rate Over Time (%)", + "type": "TimeSeries", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize total = count(), 
errors = countif(status >= 500) by bin_auto(_time)\n| extend ['error_rate_%'] = iff(total > 0, round(100.0 * errors / total, 2), 0.0)\n| project _time, ['error_rate_%']" + } + }, + { + "id": "latency-heatmap", + "name": "Latency Distribution", + "type": "Heatmap", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize histogram(duration_ms, 15) by bin_auto(_time)" + } + }, + { + "id": "top-routes", + "name": "Top Routes by Traffic", + "type": "Table", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}'\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| summarize Requests = count(), Errors = countif(status >= 500), ['p95 (ms)'] = round(percentile(duration_ms, 95), 0) by route\n| top 10 by Requests\n| project Route = route, Requests, Errors, ['p95 (ms)']" + } + }, + { + "id": "recent-errors", + "name": "Recent Errors", + "type": "LogStream", + "query": { + "apl": "declare query_parameters (route_filter:string = \"\", status_filter:string = \"\");\n['{{dataset}}']\n| where service == '{{service}}' and status >= 500\n| where isempty(route_filter) or route == route_filter\n| where isempty(status_filter) or tostring(status) startswith status_filter\n| project-keep _time, trace_id, route, status, error_message, duration_ms\n| order by _time desc\n| take 100" + } + } + ], + "layout": [ + {"i": "filters", "x": 0, "y": 0, "w": 12, "h": 1}, + {"i": "error-rate", "x": 0, "y": 1, "w": 3, "h": 2}, + {"i": "p95-latency", "x": 3, "y": 1, "w": 3, "h": 2}, + {"i": "traffic-rps", "x": 6, "y": 1, "w": 3, "h": 2}, + {"i": "error-count", "x": 9, "y": 1, 
"w": 3, "h": 2}, + {"i": "request-rate-ts", "x": 0, "y": 3, "w": 6, "h": 3}, + {"i": "error-rate-ts", "x": 6, "y": 3, "w": 6, "h": 3}, + {"i": "latency-heatmap", "x": 0, "y": 6, "w": 12, "h": 3}, + {"i": "top-routes", "x": 0, "y": 9, "w": 6, "h": 4}, + {"i": "recent-errors", "x": 6, "y": 9, "w": 6, "h": 4} + ] +} diff --git a/.agents/skills/building-dashboards/reference/templates/service-overview.json b/.agents/skills/building-dashboards/reference/templates/service-overview.json new file mode 100644 index 00000000..2bbf598c --- /dev/null +++ b/.agents/skills/building-dashboards/reference/templates/service-overview.json @@ -0,0 +1,113 @@ +{ + "name": "{{service}} - Service Overview", + "description": "Oncall dashboard for {{service}} service. Shows traffic, errors, latency, and recent error logs.", + "owner": "X-AXIOM-EVERYONE", + "datasets": ["{{dataset}}"], + "refreshTime": 60, + "schemaVersion": 2, + "timeWindowStart": "qr-now-1h", + "timeWindowEnd": "qr-now", + "charts": [ + { + "id": "error-rate", + "name": "Error Rate", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize total = count(), errors = countif(status >= 500) | extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0) | project ['Error Rate %'] = error_rate" + } + }, + { + "id": "p95-latency", + "name": "p95 Latency", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize ['p95 (ms)'] = round(percentile(duration_ms, 95), 1)" + } + }, + { + "id": "traffic-rps", + "name": "Total Requests", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize ['Total Requests'] = count()" + } + }, + { + "id": "error-count", + "name": "Errors", + "type": "Statistic", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' and status >= 500 | summarize Errors = count()" + } + }, + { + "id": "request-rate-ts", + "name": "Request Rate 
Over Time", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize ['req/min'] = count() by bin_auto(_time)" + } + }, + { + "id": "error-rate-ts", + "name": "Error Rate Over Time (%)", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize total = count(), errors = countif(status >= 500) by bin_auto(_time) | extend ['error_rate_%'] = iff(total > 0, round(100.0 * errors / total, 2), 0.0) | project _time, ['error_rate_%']" + } + }, + { + "id": "latency-ts", + "name": "Latency Percentiles (ms)", + "type": "TimeSeries", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)" + } + }, + { + "id": "latency-heatmap", + "name": "Latency Distribution", + "type": "Heatmap", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize histogram(duration_ms, 15) by bin_auto(_time)" + } + }, + { + "id": "status-distribution", + "name": "Status Codes", + "type": "Pie", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | extend status_class = case(status < 300, '2xx', status < 400, '3xx', status < 500, '4xx', '5xx') | summarize count() by status_class" + } + }, + { + "id": "top-routes", + "name": "Top Routes by Traffic", + "type": "Table", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' | summarize Requests = count(), Errors = countif(status >= 500), ['p95 (ms)'] = round(percentile(duration_ms, 95), 0) by route | top 10 by Requests | project Route = route, Requests, Errors, ['p95 (ms)']" + } + }, + { + "id": "recent-errors", + "name": "Recent Errors", + "type": "LogStream", + "query": { + "apl": "['{{dataset}}'] | where service == '{{service}}' and status >= 500 | project-keep _time, trace_id, route, status, error_message, duration_ms | order by _time desc | take 100" + } + } + ], + "layout": [ + {"i": "error-rate", 
"x": 0, "y": 0, "w": 3, "h": 2}, + {"i": "p95-latency", "x": 3, "y": 0, "w": 3, "h": 2}, + {"i": "traffic-rps", "x": 6, "y": 0, "w": 3, "h": 2}, + {"i": "error-count", "x": 9, "y": 0, "w": 3, "h": 2}, + {"i": "request-rate-ts", "x": 0, "y": 2, "w": 6, "h": 3}, + {"i": "error-rate-ts", "x": 6, "y": 2, "w": 6, "h": 3}, + {"i": "latency-ts", "x": 0, "y": 5, "w": 6, "h": 3}, + {"i": "latency-heatmap", "x": 6, "y": 5, "w": 6, "h": 3}, + {"i": "status-distribution", "x": 0, "y": 8, "w": 4, "h": 3}, + {"i": "top-routes", "x": 4, "y": 8, "w": 8, "h": 3}, + {"i": "recent-errors", "x": 0, "y": 11, "w": 12, "h": 4} + ] +} diff --git a/.agents/skills/building-dashboards/scripts/axiom-api b/.agents/skills/building-dashboards/scripts/axiom-api new file mode 100755 index 00000000..27ad5f97 --- /dev/null +++ b/.agents/skills/building-dashboards/scripts/axiom-api @@ -0,0 +1,89 @@ +#!/usr/bin/env bash +# axiom-api: Make authenticated requests to the Axiom DASHBOARD/APP API +# +# ⚠️ This script rewrites URLs (api.* → app.*/api) for the dashboard API. +# For data/metrics API calls (/v1/query/*, /v1/datasets), use scripts/metrics/axiom-api instead. +# +# Usage: axiom-api [json-body] +# +# Reads credentials from ~/.axiom.toml (shared with axiom-sre) +# +# Examples: +# axiom-api prod GET /dashboards +# axiom-api prod GET /dashboards/abc123 +# axiom-api prod POST /dashboards '{"name":"Test",...}' +# axiom-api prod GET /user + +set -euo pipefail + +DEPLOYMENT="${1:-}" +METHOD="${2:-}" +PATH_="${3:-}" +BODY="${4:-}" + +if [[ -z "$DEPLOYMENT" || -z "$METHOD" || -z "$PATH_" ]]; then + echo "Usage: axiom-api [json-body]" >&2 + exit 1 +fi + +# Reject data/metrics paths that should use scripts/metrics/axiom-api +case "$PATH_" in + /v1/query/*|/v1/datasets*) + echo "Error: This script is for the dashboard/app API." >&2 + echo "For data/metrics endpoints ($PATH_), use scripts/metrics/axiom-api instead." >&2 + exit 2 + ;; +esac + +CONFIG_FILE="$HOME/.axiom.toml" +if [[ ! 
-f "$CONFIG_FILE" ]]; then + echo "Error: $CONFIG_FILE not found" >&2 + exit 1 +fi + +# Parse TOML for deployment config +extract_value() { + local key="$1" + awk -v deployment="$DEPLOYMENT" -v key="$key" ' + /^[[:space:]]*\[deployments\./ { in_deployment = ($0 ~ "\\[deployments\\." deployment "\\]") } + in_deployment { + gsub(/^[[:space:]]+/, "") + if ($1 == key) { + sub(/^[^=]*=[[:space:]]*/, "") + if (match($0, /^"[^"]*"/)) { + $0 = substr($0, RSTART+1, RLENGTH-2) + } else { + sub(/[[:space:]]*#.*$/, "") + } + print + exit + } + } + ' "$CONFIG_FILE" +} + +URL=$(extract_value "url") +TOKEN=$(extract_value "token") +ORG_ID=$(extract_value "org_id") + +if [[ -z "$URL" || -z "$TOKEN" || -z "$ORG_ID" ]]; then + echo "Error: Missing url, token, or org_id for deployment '$DEPLOYMENT' in $CONFIG_FILE" >&2 + exit 1 +fi + +API_URL="${URL%/}/v2" + +CURL_ARGS=( + -s + -X "$METHOD" + -H "Authorization: Bearer $TOKEN" + -H "X-Axiom-Org-Id: $ORG_ID" + -H "Content-Type: application/json" + -H "Accept: application/json" +) + +if [[ -n "$BODY" ]]; then + CURL_ARGS+=(-d "$BODY") +fi + +curl "${CURL_ARGS[@]}" "${API_URL}${PATH_}" diff --git a/.agents/skills/building-dashboards/scripts/dashboard-chart-patch b/.agents/skills/building-dashboards/scripts/dashboard-chart-patch new file mode 100755 index 00000000..df969dad --- /dev/null +++ b/.agents/skills/building-dashboards/scripts/dashboard-chart-patch @@ -0,0 +1,122 @@ +#!/usr/bin/env bash +# dashboard-chart-patch: Patch one chart in an existing dashboard +# +# Usage: +# dashboard-chart-patch <deployment> <dashboard-uid> <chart-id> <patch-file> (--version <version> | --overwrite) [--message <message>] +# +# The patch file must contain a JSON object. It is sent as the `chart` JSON +# merge patch, so null values remove existing chart fields. 
+# +# Examples: +# dashboard-chart-patch prod dash-uid error-rate ./chart.patch.json --version 12 +# dashboard-chart-patch prod dash-uid error-rate ./chart.patch.json --overwrite --message "Update error chart" + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +usage() { + echo "Usage: dashboard-chart-patch <deployment> <dashboard-uid> <chart-id> <patch-file> (--version <version> | --overwrite) [--message <message>]" >&2 +} + +DEPLOYMENT="${1:-}" +DASHBOARD_UID="${2:-}" +CHART_ID="${3:-}" +PATCH_FILE="${4:-}" + +if [[ $# -lt 4 || -z "$DEPLOYMENT" || -z "$DASHBOARD_UID" || -z "$CHART_ID" || -z "$PATCH_FILE" ]]; then + usage + exit 1 +fi + +shift 4 + +VERSION="" +OVERWRITE="false" +MESSAGE="" + +while [[ $# -gt 0 ]]; do + case "$1" in + --version) + if [[ $# -lt 2 || -z "${2:-}" ]]; then + echo "Error: --version requires a value" >&2 + usage + exit 1 + fi + VERSION="$2" + shift 2 + ;; + --overwrite) + OVERWRITE="true" + shift + ;; + --message) + if [[ $# -lt 2 ]]; then + echo "Error: --message requires a value" >&2 + usage + exit 1 + fi + MESSAGE="$2" + shift 2 + ;; + -h|--help) + usage + exit 0 + ;; + *) + echo "Error: Unknown option: $1" >&2 + usage + exit 1 + ;; + esac +done + +if [[ ! -f "$PATCH_FILE" ]]; then + echo "Error: File not found: $PATCH_FILE" >&2 + exit 1 +fi + +if [[ "$OVERWRITE" == "true" && -n "$VERSION" ]]; then + echo "Error: Use either --version or --overwrite, not both" >&2 + exit 1 +fi + +if [[ "$OVERWRITE" == "false" && -z "$VERSION" ]]; then + echo "Error: --version is required unless --overwrite is set" >&2 + echo "Fetch the current dashboard first: dashboard-get $DEPLOYMENT $DASHBOARD_UID" >&2 + exit 1 +fi + +if [[ -n "$VERSION" && ! "$VERSION" =~ ^[0-9]+$ ]]; then + echo "Error: --version must be a numeric dashboard version" >&2 + exit 1 +fi + +if ! 
jq -e 'type == "object"' "$PATCH_FILE" > /dev/null; then + echo "Error: chart patch must be a JSON object" >&2 + exit 1 +fi + +PATCH_ID=$(jq -r 'if has("id") then .id else empty end' "$PATCH_FILE") +if [[ -n "$PATCH_ID" && "$PATCH_ID" != "$CHART_ID" ]]; then + echo "Error: chart patch id must match chart id '$CHART_ID'" >&2 + exit 1 +fi + +CHART_PATCH=$(jq -c '.' "$PATCH_FILE") + +BODY=$(jq -n \ + --argjson chart "$CHART_PATCH" \ + --arg message "$MESSAGE" \ + --argjson overwrite "$OVERWRITE" \ + '{chart: $chart} + + (if $overwrite then {overwrite: true} else {} end) + + (if $message != "" then {message: $message} else {} end)') + +if [[ -n "$VERSION" ]]; then + BODY=$(echo "$BODY" | jq --argjson version "$VERSION" '. + {version: $version}') +fi + +RESPONSE=$("$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" PATCH "/dashboards/uid/$DASHBOARD_UID/charts/$CHART_ID" "$BODY") + +echo "$RESPONSE" | jq . diff --git a/.agents/skills/building-dashboards/scripts/dashboard-copy b/.agents/skills/building-dashboards/scripts/dashboard-copy new file mode 100755 index 00000000..deb7321e --- /dev/null +++ b/.agents/skills/building-dashboards/scripts/dashboard-copy @@ -0,0 +1,51 @@ +#!/usr/bin/env bash +# dashboard-copy: Clone an existing dashboard +# +# Usage: dashboard-copy <deployment> <dashboard-uid> [new-name] +# +# Examples: +# dashboard-copy prod abc123 # Creates "Original Name (copy)" +# dashboard-copy prod abc123 "My New Dashboard" # Creates with custom name + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +DEPLOYMENT="${1:-}" +ID="${2:-}" +NEW_NAME="${3:-}" + +if [[ -z "$DEPLOYMENT" || -z "$ID" ]]; then + echo "Usage: dashboard-copy <deployment> <dashboard-uid> [new-name]" >&2 + exit 1 +fi + +# Fetch original +ORIGINAL=$("$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" GET "/dashboards/uid/$ID") + +# Get original name if new name not provided +if [[ -z "$NEW_NAME" ]]; then + ORIG_NAME=$(echo "$ORIGINAL" | jq -r '.dashboard.name') + NEW_NAME="${ORIG_NAME} (copy)" +fi + +# 
Strip server fields and set new name on the 
dashboard subobject +BODY=$(echo "$ORIGINAL" | jq --arg name "$NEW_NAME" ' + .dashboard | + del(.id, .uid, .version, .createdAt, .updatedAt, .createdBy, .updatedBy) | + .name = $name +' | jq '{dashboard: .}') + +RESPONSE=$("$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" POST "/dashboards" "$BODY") + +# Print new UID and name +ID=$(echo "$RESPONSE" | jq -r '.dashboard.uid // empty') +NAME=$(echo "$RESPONSE" | jq -r '.dashboard.dashboard.name // empty') + +if [[ -n "$ID" ]]; then + echo -e "${ID}\t${NAME}" +else + echo "Error copying dashboard:" >&2 + echo "$RESPONSE" | jq . >&2 + exit 1 +fi diff --git a/.agents/skills/building-dashboards/scripts/dashboard-create b/.agents/skills/building-dashboards/scripts/dashboard-create new file mode 100755 index 00000000..00c81641 --- /dev/null +++ b/.agents/skills/building-dashboards/scripts/dashboard-create @@ -0,0 +1,55 @@ +#!/usr/bin/env bash +# dashboard-create: Create a dashboard from JSON file +# +# Usage: dashboard-create <deployment> <json-file> +# +# The JSON file should NOT contain id, version, createdAt, updatedAt fields. +# Use templates or dashboard-from-template to generate valid JSON. +# +# Examples: +# dashboard-create prod ./my-dashboard.json +# dashboard-create staging ./dashboard.json + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +DEPLOYMENT="${1:-}" +JSON_FILE="${2:-}" + +if [[ -z "$DEPLOYMENT" || -z "$JSON_FILE" ]]; then + echo "Usage: dashboard-create <deployment> <json-file>" >&2 + exit 1 +fi + +if [[ ! -f "$JSON_FILE" ]]; then + echo "Error: File not found: $JSON_FILE" >&2 + exit 1 +fi + +# Validate dashboard structure before deploying +if ! "$SCRIPT_DIR/dashboard-validate" "$JSON_FILE" --strict >&2; then + echo "Error: Dashboard validation failed. Fix the errors above before deploying." 
>&2 + exit 1 +fi + +# Read, strip server-managed fields, and normalize layout for react-grid-layout +BODY=$(jq -L "$SCRIPT_DIR" ' + include "dashboard-normalize"; + del(.id, .uid, .version, .createdAt, .updatedAt, .createdBy, .updatedBy) | + normalize_dashboard_layout +' "$JSON_FILE") + +BODY=$(echo "$BODY" | jq '{dashboard: .}') + +RESPONSE=$("$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" POST "/dashboards" "$BODY") + +# Extract and print the new dashboard UID +ID=$(echo "$RESPONSE" | jq -r '.dashboard.uid // empty') +if [[ -n "$ID" ]]; then + echo "$ID" +else + echo "Error creating dashboard:" >&2 + echo "$RESPONSE" | jq . >&2 + exit 1 +fi diff --git a/.agents/skills/building-dashboards/scripts/dashboard-delete b/.agents/skills/building-dashboards/scripts/dashboard-delete new file mode 100755 index 00000000..7ae99eaf --- /dev/null +++ b/.agents/skills/building-dashboards/scripts/dashboard-delete @@ -0,0 +1,32 @@ +#!/usr/bin/env bash +# dashboard-delete: Delete a dashboard +# +# Usage: dashboard-delete <deployment> <dashboard-uid> +# +# ⚠️ This is irreversible! Axiom cannot restore deleted dashboards. +# +# Examples: +# dashboard-delete prod abc123 + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +DEPLOYMENT="${1:-}" +ID="${2:-}" + +if [[ -z "$DEPLOYMENT" || -z "$ID" ]]; then + echo "Usage: dashboard-delete <deployment> <dashboard-uid>" >&2 + exit 1 +fi + +# Confirm +read -p "Delete dashboard $ID? This cannot be undone. [y/N] " -n 1 -r +echo +if [[ ! 
$REPLY =~ ^[Yy]$ ]]; then + echo "Cancelled" + exit 0 +fi + +"$SCRIPT_DIR/axiom-api" "$DEPLOYMENT" DELETE "/dashboards/uid/$ID" +echo "Deleted: $ID" diff --git a/.agents/skills/building-dashboards/scripts/dashboard-from-template b/.agents/skills/building-dashboards/scripts/dashboard-from-template new file mode 100755 index 00000000..40255660 --- /dev/null +++ b/.agents/skills/building-dashboards/scripts/dashboard-from-template @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +# dashboard-from-template: Instantiate a dashboard template with substitutions +# +# Usage: dashboard-from-template