nrednav/entropy

Entropy

Entropy is a fault-injection tool for the Elixir/OTP runtime.

It acts as a sidecar application, stochastically selecting and suspending ("zombifying") processes to simulate Grey Failures (degradation) rather than simple termination.

Purpose

Standard supervisors recover from crashes (termination), but they do not detect or recover from processes that hang (degradation). Entropy validates system resilience by forcibly suspending processes for defined intervals, showing whether the host system correctly handles timeouts and backpressure.
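
The gap can be reproduced without Entropy. The following is a minimal sketch that uses only the standard library and `:erlang.suspend_process/1`; the Agent is a stand-in for any worker, and the 100 ms timeout is an arbitrary choice. A suspended process stays alive, so a supervisor has nothing to restart, yet every caller times out.

```elixir
# Start an unlinked Agent as a stand-in for a worker process.
{:ok, worker} = Agent.start(fn -> :ready end)

# Freeze it. The process still exists, but it is no longer scheduled.
true = :erlang.suspend_process(worker)

# Callers now hit their timeout instead of receiving a reply.
call_result =
  try do
    Agent.get(worker, fn state -> state end, 100)
  catch
    :exit, {:timeout, _call} -> :timeout
  end

# No exit signal was produced, so a supervisor would see nothing to restart.
still_alive = Process.alive?(worker)

# Undo the suspension so the worker can run again.
true = :erlang.resume_process(worker)
```

Here `call_result` is `:timeout` while `still_alive` is `true`, which is the failure mode Entropy is meant to exercise.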

Features

  • Grey Failure Simulation: Simulates degradation (freezing) in addition to simple termination to validate timeout handling.
  • Stochastic Selection: Uses weighted probabilistic selection to ensure fair coverage of the process tree over time.
  • Safety Circuit Breaker: Automatically halts injection if node CPU or Memory exceeds configured safety thresholds.
  • Immunity: Supports static and dynamic immunity to protect critical infrastructure processes.
  • Dead Man's Switch: Guarantees that all suspended processes are automatically resumed if the Entropy daemon crashes.

Installation

Add entropy to your list of dependencies in mix.exs:

def deps do
  [
    {:entropy, "~> 0.1.0"}
  ]
end

Ensure the :os_mon application is started by listing it under extra_applications in the application/0 function of mix.exs. Entropy relies on :cpu_sup and :memsup for its safety checks.

def application do
  [
    extra_applications: [:logger, :os_mon]
  ]
end

Configuration

Entropy is configured via the standard application environment.

Note: The safety thresholds define the Circuit Breaker. If system resources exceed these limits, Entropy halts injection to prevent cascading failure.

# config/config.exs

config :entropy,
  # Enable/Disable the injection scheduler.
  # Default: false (Safety first)
  is_injection_enabled: true,

  # The time between injection attempts in milliseconds.
  # Default: 5000
  injection_interval_ms: 2000,

  # The frequency at which the Circuit Breaker polls system resources.
  # Lower values shorten reaction time but add system overhead.
  # Default: 1000
  safety_check_interval_ms: 1000,

  # The maximum CPU utilization (0.0 - 100.0) allowed.
  # If the host node exceeds this, injection pauses.
  # Default: 95.0
  max_cpu_util_percent: 80.0,

  # The maximum Memory utilization (0.0 - 100.0) allowed.
  # Default: 90.0
  max_memory_util_percent: 80.0,

  # The maximum number of concurrent zombies allowed.
  # Default: 50
  max_active_zombies: 25,

  # A list of atoms (application names) strictly immune to selection.
  # :kernel, :init, :logger, and :entropy are immune by default.
  # Default: []
  immune_modules: [:my_critical_app],

  # A list of atoms (application names) allowed to be targeted.
  # If empty, all applications are valid targets.
  # Default: []
  target_applications: [:my_target_app],

  # The duration range {min, max} in ms for a process suspension.
  # Default: {1000, 10_000}
  zombie_ttl_range_ms: {1000, 10_000},

  # Fault Strategy Weights
  # A keyword list defining the relative frequency of fault types.
  # Keys: :suspend, :kill
  # Default: [suspend: 10, kill: 0] = Suspension only
  fault_strategy_weights: [suspend: 9, kill: 1],

  # Cooldown period for repetitive telemetry events in ms.
  # Default: 1000
  telemetry_debounce_ms: 1000,

  # Whether the AxiomaticLogger writes to stdout.
  # Keep this enabled in standard operation so that, if the system crashes (T=0),
  # the what and why are preserved.
  is_axiomatic_reporting_enabled: true
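
To make the meaning of `fault_strategy_weights` concrete, here is a minimal sketch of weighted selection over such a keyword list. It is not Entropy's implementation; the module name `WeightedPickSketch` and the `roll` parameter are hypothetical and exist only to illustrate how relative weights translate into frequencies.

```elixir
defmodule WeightedPickSketch do
  @moduledoc "Illustrative weighted choice over a keyword list of integer weights."

  # Draws a random roll in 1..total and delegates to pick/2.
  # Raises if all weights are zero, because :rand.uniform/1 needs a positive bound.
  def pick(weights) do
    total = weights |> Keyword.values() |> Enum.sum()
    pick(weights, :rand.uniform(total))
  end

  # Walks the list, subtracting each weight until the roll falls inside a bucket.
  # `roll` is expected to be in 1..sum(weights); zero-weight entries are never chosen.
  def pick(weights, roll) when is_integer(roll) and roll >= 1 do
    Enum.reduce_while(weights, roll, fn {strategy, weight}, remaining ->
      if remaining <= weight do
        {:halt, strategy}
      else
        {:cont, remaining - weight}
      end
    end)
  end
end
```

With `[suspend: 9, kill: 1]`, rolls 1 through 9 select `:suspend` and roll 10 selects `:kill`, so roughly one fault in ten is a kill. With the default `[suspend: 10, kill: 0]`, every roll selects `:suspend`.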

Usage

Entropy operates as a daemon. Interactions occur via the Entropy module or by observing Telemetry events.

1. Verification

After deployment, confirm the daemon is active and the environment permits injection.

# Returns true if the Entropy supervision tree is alive.
iex> Entropy.is_alive?()
true

# Returns true if the Circuit Breaker allows injection.
# (i.e., CPU < max_cpu_util_percent AND Memory < max_memory_util_percent)
iex> Entropy.is_ready?()
true

2. Runtime Control

Configuration changes (e.g., increasing aggression) can be applied without restarting the node.

  1. Modify config.exs or runtime.exs
  2. Execute reload:
iex> Entropy.reload_config()
:ok

3. Dynamic Immunity

Specific processes can be temporarily granted immunity during critical transactions.

# Protect the current process from chaos
Entropy.State.ImmunityRegistry.register(self())

# Critical work...

# Revoke protection
Entropy.State.ImmunityRegistry.unregister(self())
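
If the critical work raises, the unregister call above is never reached. One way to guard against that is a small wrapper with `try/after`. This is a sketch, not part of Entropy: `ImmunitySketch.with_immunity/2` is a hypothetical helper, and it assumes only the `register/1` and `unregister/1` calls shown above, ignoring their return values.

```elixir
defmodule ImmunitySketch do
  @moduledoc "Hypothetical helper that scopes immunity to a single function call."

  # Registers the calling process, runs `fun`, and always unregisters afterwards,
  # even if `fun` raises, throws, or exits. The registry module is injectable
  # so the helper can be exercised without Entropy running.
  def with_immunity(fun, registry \\ Entropy.State.ImmunityRegistry)
      when is_function(fun, 0) do
    registry.register(self())

    try do
      fun.()
    after
      registry.unregister(self())
    end
  end
end
```

Usage would look like `ImmunitySketch.with_immunity(fn -> commit_transaction() end)`, where `commit_transaction/0` stands for the caller's own critical section.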

Observability

Entropy emits structured events via :telemetry.

Injection Events

  • [:entropy, :injection, :start] - Injection attempt initiated.
  • [:entropy, :injection, :stop] - Injection successfully completed.
    • Metadata: %{strategy: :suspend | :kill, ...}
  • [:entropy, :injection, :failure] - Injection failed (e.g., target died before suspension).

Safety Events

  • [:entropy, :safety, :veto] - Circuit Breaker tripped. Injection paused.
  • [:entropy, :safety, :recovery] - Circuit Breaker reset. Injection resumed.

Scheduler Events

  • [:entropy, :scheduler, :skip] - Cycle skipped (e.g., due to circuit breaker or zombie limit).
  • [:entropy, :scheduler, :noop] - Cycle executed but no valid victim found.

Configuration Events

Events emitted when Entropy.reload_config/0 applies runtime changes.

  • [:entropy, :scheduler, :injection_interval_change]
    • Metadata: %{old: integer(), new: integer()}
  • [:entropy, :circuit_breaker, :threshold_change]
    • Metadata: %{old: map(), new: map()}
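
As a starting point for consuming these events, the sketch below attaches one handler to a subset of them with `:telemetry.attach_many/4` and logs each event. The module name `EntropyEventLogger` and the handler id are hypothetical, and the shape of the measurements map is not documented here, so the handler simply inspects whatever it receives.

```elixir
defmodule EntropyEventLogger do
  @moduledoc "Illustrative telemetry handler for Entropy events."
  require Logger

  # Event names taken from the lists above.
  @events [
    [:entropy, :injection, :start],
    [:entropy, :injection, :stop],
    [:entropy, :injection, :failure],
    [:entropy, :safety, :veto],
    [:entropy, :safety, :recovery],
    [:entropy, :scheduler, :skip],
    [:entropy, :scheduler, :noop]
  ]

  # Call once at application start. Requires the :telemetry application.
  def attach do
    :telemetry.attach_many("entropy-event-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry invokes handlers with (event, measurements, metadata, config).
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(format(event, measurements, metadata))
  end

  # Kept separate from logging so the output can be checked in isolation.
  def format(event, measurements, metadata) do
    "#{Enum.join(event, ".")} measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
  end
end
```

Passing the handler as a remote capture (`&__MODULE__.handle_event/4`) rather than an anonymous function follows the recommendation in the telemetry documentation.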

Architecture

Circuit Breaker

Entropy polls :cpu_sup and :memsup at a configurable interval (Default: 1000ms). If usage exceeds the configured max_cpu_util_percent or max_memory_util_percent, the system enters a Safety State. No new faults are injected until the system stabilizes.
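
The check described above can be sketched as a sensor read followed by a pure comparison. This is an illustration under stated assumptions, not Entropy's code: the module name `SafetyCheckSketch`, the `{:safe, metrics}` return shape, the `memory_util_percent` key, and the decision to treat unreadable sensors as 100% are all hypothetical. Only `:cpu_sup.util/0` and `:memsup.get_system_memory_data/0` are real :os_mon calls.

```elixir
defmodule SafetyCheckSketch do
  @moduledoc "Illustrative threshold check in the style of a circuit breaker."

  # Pure decision: unsafe as soon as either metric exceeds its limit.
  # Whether a value exactly equal to the limit counts as safe is an assumption.
  def evaluate(metrics, limits) do
    if metrics.cpu_util_percent > limits.max_cpu_util_percent or
         metrics.memory_util_percent > limits.max_memory_util_percent do
      {:unsafe, metrics}
    else
      {:safe, metrics}
    end
  end

  # Sensor read. Requires the :os_mon application to be started.
  # Unreadable sensors are reported as 100.0 so the check fails closed.
  def read_metrics do
    cpu =
      case :cpu_sup.util() do
        util when is_number(util) -> util * 1.0
        _error -> 100.0
      end

    data = :memsup.get_system_memory_data()
    total = Keyword.get(data, :total_memory, 0)
    free = Keyword.get(data, :free_memory, 0)
    memory = if total > 0, do: (total - free) / total * 100.0, else: 100.0

    %{cpu_util_percent: cpu, memory_util_percent: memory}
  end
end
```

Keeping the comparison separate from the sensor read is what makes deterministic testing possible: a test can feed `evaluate/2` any metrics it likes without touching the host.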

Zombie Registry

Suspended processes are tracked in an ETS table owned by Entropy.State.ZombieRegistry.

  • Constraint: If the registry process crashes, the BEAM VM automatically resumes all suspended processes (Dead Man's Switch).
  • Limit: The system enforces a hard limit of max_active_zombies (Default: 50) to prevent total resource starvation.
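
The Dead Man's Switch rests on documented BEAM behavior: suspend counts acquired through `:erlang.suspend_process/1` are released automatically when the suspending process terminates. The standalone sketch below demonstrates that behavior with plain processes; it does not use Entropy, and the process roles and timeouts are illustrative.

```elixir
parent = self()

# A target that answers one ping and then exits.
target =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

# A suspender that freezes the target and then waits to be crashed.
suspender =
  spawn(fn ->
    true = :erlang.suspend_process(target)
    send(parent, :suspended)

    receive do
      :crash -> exit(:simulated_crash)
    end
  end)

# Wait until the suspension is in place.
receive do
  :suspended -> :ok
end

# The ping is queued, but the frozen target cannot handle it.
send(target, {:ping, parent})

frozen =
  receive do
    :pong -> false
  after
    200 -> true
  end

# Crash the suspender. The VM releases its suspend count on the target,
# which then processes the queued ping.
send(suspender, :crash)

resumed =
  receive do
    :pong -> true
  after
    1_000 -> false
  end
```

Both `frozen` and `resumed` end up `true`: the target was unresponsive while suspended and recovered without any explicit `:erlang.resume_process/1` call.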

Census

Entropy maintains a cached snapshot of the process table to minimize overhead. The Entropy.Sanctuary.Census process refreshes this list on a fixed interval (default: 5s).

Refresh Lifecycle:

  1. Retrieves the global process list.
  2. Filters processes based on the target_applications allowlist (if configured).
  3. Converts the result to a Tuple for O(1) random access.

This architecture ensures that the Scheduler performs constant-time victim selection without blocking the VM with expensive Process.list/0 calls during every tick.
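
The refresh lifecycle and constant-time selection can be sketched as follows. This is an illustration only: `CensusSketch` is a hypothetical module, and mapping a pid to its owning application through `:application.get_application/1` is an assumption about how the allowlist filter might work, not a description of Entropy's internals.

```elixir
defmodule CensusSketch do
  @moduledoc "Illustrative process snapshot with O(1) random selection."

  # Steps 1-3 of the refresh lifecycle: list, filter, convert to a tuple.
  # An empty allowlist keeps every process.
  def snapshot(target_applications \\ []) do
    Process.list()
    |> Enum.filter(&targeted?(&1, target_applications))
    |> List.to_tuple()
  end

  # Constant-time pick: elem/2 on a tuple does not traverse the collection.
  # The chosen pid may have exited since the snapshot was taken.
  def pick({}), do: nil

  def pick(snapshot) when is_tuple(snapshot) do
    elem(snapshot, :rand.uniform(tuple_size(snapshot)) - 1)
  end

  defp targeted?(_pid, []), do: true

  defp targeted?(pid, applications) do
    case :application.get_application(pid) do
      {:ok, application} -> application in applications
      :undefined -> false
    end
  end
end
```

Because the snapshot is a point-in-time copy, any consumer still has to tolerate a victim that died between refresh and selection, which is consistent with the `[:entropy, :injection, :failure]` event described under Observability.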

Development

This section explains how to set up the project locally for development.

Requirements

  • Elixir ~> 1.16 (OTP 26+)
  • :os_mon (Required for System Sensors)

Setup

# 1. Clone the repository
## via HTTPS
git clone https://github.com/nrednav/entropy.git

## via SSH
git clone git@github.com:nrednav/entropy.git

cd entropy

# 2. Install dependencies
mix deps.get

# 3. Run the test suite
# Note: Tests use a Simulated Physics engine to avoid actual system interference.
mix test

Testing Strategy

Entropy uses a Deterministic Testing Pattern to eliminate race conditions.

  • Simulated Physics: Tests run against a Physics simulation, not the host OS.
  • Manual Polling: In the test environment, the Circuit Breaker's automatic polling loop is paused. You must explicitly trigger state updates.

Example test workflow:

# 1. Set the simulated physical state
Entropy.Simulation.Physics.set_cpu_util_percent(99.9)

# 2. Force the Circuit Breaker to read the new state
Entropy.State.CircuitBreaker.force_safety_check()

# 3. Assert the system reaction
Wait.until(fn ->
  case Entropy.State.CircuitBreaker.get_safety_report() do
    {:unsafe, metrics} -> metrics.cpu_util_percent == 99.9
    _ -> false
  end
end)

# or
assert Entropy.is_ready?() == false
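
The `Wait.until/1` helper used above is not defined in this document. For readers who want a comparable helper in their own test suite, the following is a minimal polling sketch. The module name `WaitSketch`, its default timeout and interval, and its return values are all assumptions, not the project's implementation.

```elixir
defmodule WaitSketch do
  @moduledoc "Hypothetical polling helper: retries a condition until it holds or time runs out."

  # Returns :ok as soon as `fun` returns a truthy value,
  # or {:error, :timeout} once `timeout_ms` has elapsed.
  def until(fun, timeout_ms \\ 1_000, interval_ms \\ 10) when is_function(fun, 0) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(fun, deadline, interval_ms)
  end

  defp poll(fun, deadline, interval_ms) do
    cond do
      fun.() ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      true ->
        Process.sleep(interval_ms)
        poll(fun, deadline, interval_ms)
    end
  end
end
```

Using the monotonic clock for the deadline keeps the helper immune to wall-clock adjustments during a test run.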

Versioning

This project uses Semantic Versioning. For a list of available versions, see the repository tag list.

Issues & Requests

If you encounter a bug or have a feature request, please open an issue on the GitHub repository.
