Tracking: event-native log ingestion and exploration foundation #25

@STRRL

Description

Summary

Build the first version of an event-native log exploration pipeline for lapp.

The core idea is to treat each log line as an event with:

  • raw text payload
  • extracted attributes
  • inferred metadata

This issue tracks the work needed to turn plain text logs into a searchable, filterable event stream with basic pattern grouping and drilldown-friendly metadata.

Goals

  • Ingest plain text logs as structured events
  • Extract stable metadata from text when no structured envelope exists
  • Keep raw text as the source of truth
  • Separate explicit parsed attributes from inferred metadata
  • Enable basic exploration through timeline, facets, and event list views

Non-goals

  • Perfect semantic extraction
  • Full natural-language understanding of logs
  • Complex multi-entity graph modeling in v1
  • Advanced query language design

Proposed event model

{
  "ts": "2026-03-10T21:00:00Z",
  "text": "raw log line",
  "attrs": {
    "level": "error",
    "service": "payments-api",
    "env": "prod",
    "request_id": "req_123",
    "trace_id": "trace_456",
    "user_id": "user_789",
    "endpoint": "/checkout"
  },
  "inferred": {
    "pattern": "user <id> failed to login",
    "entity": "payments-api"
  }
}

Design principles

  • Raw text must always be preserved
  • attrs and inferred must stay separate
  • Favor deterministic extraction over clever guessing
  • Entity detection is a navigation aid, not ground truth
  • The first version should optimize for usefulness, not completeness

Execution plan

Phase 1: Event schema

  • Define the v1 event schema
  • Add sample event fixtures covering JSON, logfmt, key=value, and plain text logs
  • Document required vs optional fields

Phase 2: Ingestion foundation

  • Implement a raw log line -> event entry point
  • Ensure every line can be ingested even when parsing fails
  • Preserve original text unchanged in storage
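One way to guarantee the "every line can be ingested" property is to make event construction infallible and treat parsing as best-effort. A sketch (Python for illustration; the inline JSON attempt stands in for the full parser pipeline):

```python
import json


def ingest(line: str) -> dict:
    """Every raw line becomes an event; a failed parse never drops the line."""
    event = {"ts": None, "text": line, "attrs": {}, "inferred": {}}
    try:
        parsed = json.loads(line)
        if isinstance(parsed, dict):
            event["attrs"] = parsed
    except ValueError:
        pass  # plain text fallback: the raw line alone is a valid event
    return event
```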

Phase 3: Parser pipeline

  • Add JSON parser
  • Add logfmt parser
  • Add key=value parser
  • Add regex-based prefix parser for common timestamp/level formats
  • Add plain text fallback parser
  • Define parser ordering and fallback behavior
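The ordering/fallback behavior could be as simple as a most-specific-first chain where each parser returns `None` on a miss and the plain text parser always matches. A sketch with three of the planned parsers (parser names and the single-regex logfmt/key=value handling are illustrative, not a spec):

```python
import json
import re


def parse_json(line):
    try:
        obj = json.loads(line)
        return obj if isinstance(obj, dict) else None
    except ValueError:
        return None


def parse_logfmt(line):
    # Covers both logfmt and bare key=value pairs; quoted values keep spaces.
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', line)
    if not pairs:
        return None
    return {k: v.strip('"') for k, v in pairs}


def parse_plain(line):
    return {}  # fallback: no attrs, raw text is the whole event


PARSERS = [parse_json, parse_logfmt, parse_plain]


def parse(line: str) -> dict:
    """Try parsers most-specific-first; the plain text fallback always matches."""
    for parser in PARSERS:
        attrs = parser(line)
        if attrs is not None:
            return attrs
    return {}
```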

Phase 4: Stable attribute extraction

  • Extract timestamp
  • Extract severity / level
  • Extract service name candidates
  • Extract environment candidates
  • Extract request ID / trace ID / span ID / correlation ID
  • Extract endpoint / route candidates when possible
  • Extract user / tenant identifiers when explicitly present
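For identifiers that are explicitly labeled in the text, deterministic regexes are likely enough for v1. A sketch (the patterns below are placeholders; real sources would need per-format tuning):

```python
import re

# Hypothetical patterns for explicitly labeled identifiers.
ID_PATTERNS = {
    "request_id": re.compile(r"\b(?:request_id|req_id)[=:]\s*(\S+)"),
    "trace_id": re.compile(r"\btrace_id[=:]\s*(\S+)"),
    "user_id": re.compile(r"\buser_id[=:]\s*(\S+)"),
}


def extract_ids(text: str) -> dict:
    """Pull explicitly labeled identifiers out of raw text; never guess unlabeled ones."""
    found = {}
    for key, pattern in ID_PATTERNS.items():
        m = pattern.search(text)
        if m:
            found[key] = m.group(1).rstrip(",;")
    return found
```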

Phase 5: Canonical normalization

  • Create canonical field mappings (for example service, service_name, service.name -> attrs.service)
  • Normalize severity values to a fixed enum
  • Normalize environment values (production -> prod, etc.)
  • Normalize endpoint values where safe
  • Add tests for alias resolution and value normalization
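The alias resolution and value normalization could both live in one canonicalization pass. A sketch (the alias tables below are examples, not the final canonical field list, which is still an open question):

```python
# Example alias tables; the real canonical field list is TBD.
FIELD_ALIASES = {
    "service": "service", "service_name": "service", "service.name": "service",
    "severity": "level", "level": "level", "lvl": "level",
    "environment": "env", "env": "env",
}

LEVELS = {"trace", "debug", "info", "warn", "error", "fatal"}
LEVEL_ALIASES = {"warning": "warn", "err": "error", "critical": "fatal"}
ENV_ALIASES = {"production": "prod", "staging": "stage", "development": "dev"}


def canonicalize(attrs: dict) -> dict:
    """Map field aliases to canonical names and normalize enum-like values."""
    out = {}
    for key, value in attrs.items():
        canon = FIELD_ALIASES.get(key, key)
        if canon == "level":
            v = str(value).lower()
            value = LEVEL_ALIASES.get(v, v)
            if value not in LEVELS:
                continue  # drop unrecognized severities rather than guess
        elif canon == "env":
            v = str(value).lower()
            value = ENV_ALIASES.get(v, v)
        out[canon] = value
    return out
```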

Phase 6: Inference layer

  • Implement basic pattern extraction by replacing variable tokens (numbers, UUIDs, IDs, hashes)
  • Store a normalized inferred.pattern
  • Implement minimal entity inference
  • Use attrs.service as the primary entity when available
  • Fall back to heuristic text-based inference only when necessary
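The token-replacement step above can be an ordered list of regex rules, applied most-specific-first so UUIDs and hashes are not shredded by the bare-number rule. A sketch (rules are illustrative; real corpora would need more):

```python
import re

# Order matters: replace specific tokens (UUIDs, hashes) before bare numbers.
TOKEN_RULES = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b[0-9a-f]{16,64}\b"), "<hash>"),
    (re.compile(r"\b\w+_[0-9]+\b"), "<id>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def extract_pattern(text: str) -> str:
    """Replace variable tokens so repeated lines collapse into one pattern."""
    for pattern, placeholder in TOKEN_RULES:
        text = pattern.sub(placeholder, text)
    return text
```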

Phase 7: Indexing and filtering

  • Support filtering by time range
  • Support filtering by level
  • Support filtering by service
  • Support filtering by environment
  • Support filtering by request ID / trace ID
  • Support filtering by pattern
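Before committing to a storage/index model, the filter semantics can be pinned down as a simple AND-combined predicate over one event. A sketch (assumes `ts` is stored as normalized UTC ISO 8601, so lexicographic string comparison matches chronological order):

```python
def match(event: dict, *, level=None, service=None, pattern=None,
          since=None, until=None) -> bool:
    """AND-combined filters over one event; None means 'no constraint'."""
    attrs = event.get("attrs", {})
    if level and attrs.get("level") != level:
        return False
    if service and attrs.get("service") != service:
        return False
    if pattern and event.get("inferred", {}).get("pattern") != pattern:
        return False
    ts = event.get("ts")
    # Normalized UTC ISO 8601 strings sort chronologically.
    if since and (ts is None or ts < since):
        return False
    if until and (ts is None or ts > until):
        return False
    return True
```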

Phase 8: Minimal exploration UI

  • Build a timeline view showing event counts over time
  • Build a facet panel for top values (level, service, env, pattern)
  • Build an event list view showing raw text and extracted metadata
  • Add click-to-filter interactions from facets and event rows
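The timeline view only needs per-bucket counts on the backend. A sketch of fixed-width bucketing over ISO 8601 timestamps (bucket width and the string-truncation approach are illustrative choices, not a design decision):

```python
from collections import Counter


def timeline_counts(events, bucket_minutes=5):
    """Count events per fixed time bucket; events without a timestamp are skipped."""
    counts = Counter()
    for e in events:
        ts = e.get("ts")
        if not ts:
            continue
        # Truncate "2026-03-10T21:07:00Z" to its bucket start, e.g. "2026-03-10T21:05".
        head = ts[:16]                       # "2026-03-10T21:07"
        minute = int(head[14:16])
        bucket_start = (minute // bucket_minutes) * bucket_minutes
        counts[head[:14] + f"{bucket_start:02d}"] += 1
    return counts
```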

Phase 9: Quality and observability

  • Record extraction source for each parsed field where useful
  • Measure parser hit rates
  • Measure missing-field rates for timestamp, level, and service
  • Add fixture-based tests for common real-world log shapes

Suggested milestone split

Milestone 1: Ingestion + schema

  • Event schema
  • Raw ingestion path
  • Parser pipeline scaffold

Milestone 2: Basic extraction

  • Timestamp
  • Level
  • Service
  • Request / trace identifiers
  • Canonical normalization

Milestone 3: Usable exploration

  • Pattern extraction
  • Basic indexing
  • Timeline + facets + event list
  • Click-to-filter

Acceptance criteria for v1

  • A plain text log line can always be ingested as an event
  • Common structured log formats can populate attrs
  • Users can filter events by time, level, service, and pattern
  • Users can inspect raw text alongside extracted metadata
  • Pattern grouping works well enough to reduce repeated noisy lines

Open questions

  • What is the canonical field schema for lapp beyond the v1 core fields?
  • Should inferred fields carry confidence scores in v1 or wait until v2?
  • What storage/index model is best for raw text + extracted attrs + inferred fields?
  • Should request/trace correlation be part of v1 or follow immediately after?
