Entropy is a fault-injection tool for the Elixir/OTP runtime.
It acts as a sidecar application, stochastically selecting and suspending ("zombifying") processes to simulate Grey Failures (degradation) rather than simple termination.
Standard supervisors recover from crashes (Termination). They do not recover from hanging processes (Degradation). Entropy validates system resilience by forcibly suspending processes for defined intervals, proving whether the host system correctly handles timeouts and backpressure.
- Grey Failure Simulation: Simulates degradation (freezing) in addition to simple termination to validate timeout handling.
- Stochastic Selection: Uses weighted probabilistic selection to ensure fair coverage of the process tree over time.
- Safety Circuit Breaker: Automatically halts injection if node CPU or Memory exceeds configured safety thresholds.
- Immunity: Supports static and dynamic immunity to protect critical infrastructure processes.
- Dead Man's Switch: Guarantees that all suspended processes are automatically resumed if the Entropy daemon crashes.
Add entropy to your list of dependencies in mix.exs:
def deps do
[
{:entropy, "~> 0.1.0"}
]
end
Ensure the :os_mon application is listed under extra_applications in mix.exs, as
Entropy relies on :cpu_sup and :memsup for safety checks.
def application do
[
extra_applications: [:logger, :os_mon]
]
end
Entropy is configured via the standard application environment.
Note: The safety thresholds define the Circuit Breaker. If system resources exceed these limits, Entropy halts injection to prevent cascading failure.
# config/config.exs
config :entropy,
# Enable/Disable the injection scheduler.
# Default: false (Safety first)
is_injection_enabled: true,
# The time between injection attempts in milliseconds.
# Default: 5000
injection_interval_ms: 2000,
# The frequency at which the Circuit Breaker polls system resources.
# Lower values shorten the reaction time but add system overhead.
# Default: 1000
safety_check_interval_ms: 1000,
# The maximum CPU utilization (0.0 - 100.0) allowed.
# If the host node exceeds this, injection pauses.
# Default: 95.0
max_cpu_util_percent: 80.0,
# The maximum Memory utilization (0.0 - 100.0) allowed.
# Default: 90.0
max_memory_util_percent: 80.0,
# The maximum number of concurrent zombies allowed.
# Default: 50
max_active_zombies: 25,
# A list of atoms (application names) strictly immune to selection.
# :kernel, :init, :logger, and :entropy are immune by default.
# Default: []
immune_modules: [:my_critical_app],
# A list of atoms (application names) allowed to be targeted.
# If empty, all applications are valid targets.
# Default: []
target_applications: [:my_target_app],
# The duration range {min, max} in ms for a process suspension.
# Default: {1000, 10_000}
zombie_ttl_range_ms: {1000, 10_000},
# Fault Strategy Weights
# A keyword list defining the relative frequency of fault types.
# Keys: :suspend, :kill
# Default: [suspend: 10, kill: 0] = Suspension only
fault_strategy_weights: [suspend: 9, kill: 1],
# Cooldown period for repetitive telemetry events in ms.
# Default: 1000
telemetry_debounce_ms: 1000,
# Whether the AxiomaticLogger should output to `stdout`.
# In standard operation, if the system crashes (T=0), the what and why must be
# preserved.
is_axiomatic_reporting_enabled: true
Entropy operates as a daemon. Interactions occur via the Entropy module or by observing Telemetry events.
After deployment, confirm the daemon is active and the environment permits injection.
# Returns true if the Entropy supervision tree is alive.
iex> Entropy.is_alive?()
true
# Returns true if the Circuit Breaker allows injection.
# (i.e., CPU < max_cpu_util_percent AND Memory < max_memory_util_percent)
iex> Entropy.is_ready?()
true
Configuration changes (e.g., increasing aggression) can be applied without restarting the node.
- Modify config.exs or runtime.exs
- Execute reload:
iex> Entropy.reload_config()
:ok
Specific processes can be temporarily granted immunity during critical transactions.
# Protect the current process from chaos
Entropy.State.ImmunityRegistry.register(self())
# Critical work...
# Revoke protection
Entropy.State.ImmunityRegistry.unregister(self())
Entropy emits structured events via :telemetry.
- [:entropy, :injection, :start] - Injection attempt initiated.
- [:entropy, :injection, :stop] - Injection successfully completed.
  - Metadata: %{strategy: :suspend | :kill, ...}
- [:entropy, :injection, :failure] - Injection failed (e.g., target died before suspension).
- [:entropy, :safety, :veto] - Circuit Breaker tripped. Injection paused.
- [:entropy, :safety, :recovery] - Circuit Breaker reset. Injection resumed.
- [:entropy, :scheduler, :skip] - Cycle skipped (e.g., due to circuit breaker or zombie limit).
- [:entropy, :scheduler, :noop] - Cycle executed but no valid victim found.
Events emitted when Entropy.reload_config/0 applies runtime changes.
- [:entropy, :scheduler, :injection_interval_change]
  - Metadata: %{old: integer(), new: integer()}
- [:entropy, :circuit_breaker, :threshold_change]
  - Metadata: %{old: map(), new: map()}
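A handler for a few of the events listed above might look like the following minimal sketch. The module name `EntropyTelemetrySketch`, the handler id, and the log wording are hypothetical; only the event names and the `:strategy` metadata key come from the list above. `:telemetry.attach_many/4` is the standard way to subscribe one handler to several events, and it requires the `:telemetry` package to be available.

```elixir
defmodule EntropyTelemetrySketch do
  # Hypothetical handler module; not part of Entropy itself.
  require Logger

  @events [
    [:entropy, :injection, :stop],
    [:entropy, :injection, :failure],
    [:entropy, :safety, :veto],
    [:entropy, :safety, :recovery]
  ]

  # Call once at application start. Needs the :telemetry dependency.
  def attach do
    :telemetry.attach_many("entropy-log-sketch", @events, &__MODULE__.handle_event/4, nil)
  end

  # Standard :telemetry handler arguments: event, measurements, metadata, config.
  def handle_event(event, _measurements, metadata, _config) do
    Logger.info(describe(event, metadata))
  end

  # Turns an event into a log line (wording is illustrative only).
  def describe([:entropy, :injection, :stop], metadata),
    do: "injection completed (strategy: #{inspect(Map.get(metadata, :strategy))})"

  def describe([:entropy, :injection, :failure], _metadata), do: "injection failed"

  def describe([:entropy, :safety, :veto], _metadata),
    do: "circuit breaker tripped, injection paused"

  def describe([:entropy, :safety, :recovery], _metadata),
    do: "circuit breaker reset, injection resumed"

  def describe(event, _metadata), do: "unhandled event #{inspect(event)}"
end
```

Calling `EntropyTelemetrySketch.attach()` from the host application's `start/2` callback would register the handler for the lifetime of the node.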
Entropy polls :cpu_sup and :memsup at a configurable interval (Default:
1000ms). If usage exceeds the configured max_cpu_util_percent or
max_memory_util_percent, the system enters a Safety State. No new faults
are injected until the system stabilizes.
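The threshold comparison described above can be sketched as a pure function. This is an illustration under stated assumptions, not Entropy's implementation: the module `SafetyCheckSketch`, the `memory_util_percent` key, and the choice to treat a reading exactly equal to the threshold as safe are all hypothetical. In a running system the readings would come from :cpu_sup and :memsup; here they are passed in as plain maps.

```elixir
defmodule SafetyCheckSketch do
  # Hypothetical helper; not part of Entropy's API.
  # Both arguments are maps of percentages in the range 0.0 - 100.0.
  def evaluate(metrics, thresholds) do
    # Assumption: "exceeds" means strictly greater than the threshold.
    cpu_ok? = metrics.cpu_util_percent <= thresholds.max_cpu_util_percent
    mem_ok? = metrics.memory_util_percent <= thresholds.max_memory_util_percent

    if cpu_ok? and mem_ok? do
      :safe
    else
      # Return the offending readings so callers can report them.
      {:unsafe, metrics}
    end
  end
end
```

Returning the readings alongside `:unsafe` keeps the decision and its evidence together, which is convenient when emitting a veto event or asserting in tests.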
Suspended processes are tracked in an ETS table owned by
Entropy.State.ZombieRegistry.
- Constraint: If the registry process crashes, the BEAM VM automatically resumes all suspended processes (Dead Man's Switch).
- Limit: The system enforces a hard limit of max_active_zombies (Default: 50) to prevent total resource starvation.
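The Dead Man's Switch relies on a property of the runtime: suspensions made with `:erlang.suspend_process/1` are released when the suspending process terminates. The sketch below demonstrates that property in isolation. The module `DeadMansSwitchSketch` and its helpers are hypothetical and do not use Entropy's registry or its ETS table; a plain spawned process stands in for the registry.

```elixir
defmodule DeadMansSwitchSketch do
  # Hypothetical demo module; a plain process stands in for the registry.

  # Returns {responsive_before, responsive_while_suspended, responsive_after_crash}.
  def demo do
    target = spawn(&echo_loop/0)
    before_suspend = responsive?(target)

    parent = self()

    # Stand-in for the registry: suspends the target, reports back, then idles.
    owner =
      spawn(fn ->
        :erlang.suspend_process(target)
        send(parent, :suspended)
        Process.sleep(:infinity)
      end)

    receive do
      :suspended -> :ok
    end

    during = responsive?(target)

    # Simulate a crash of the suspending process and wait until it is down.
    ref = Process.monitor(owner)
    Process.exit(owner, :kill)

    receive do
      {:DOWN, ^ref, :process, ^owner, _reason} -> :ok
    end

    after_crash = responsive?(target, 1_000)
    Process.exit(target, :kill)
    {before_suspend, during, after_crash}
  end

  # Replies to every ping, echoing the caller's reference.
  defp echo_loop do
    receive do
      {:ping, from, ref} ->
        send(from, {:pong, ref})
        echo_loop()
    end
  end

  # Sends a ping and waits for the matching pong; stale pongs are ignored.
  defp responsive?(pid, timeout_ms \\ 200) do
    ref = make_ref()
    send(pid, {:ping, self(), ref})

    receive do
      {:pong, ^ref} -> true
    after
      timeout_ms -> false
    end
  end
end
```

`DeadMansSwitchSketch.demo()` should report the target as responsive before suspension, unresponsive while suspended, and responsive again once the suspending process has died.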
Entropy maintains a cached snapshot of the process table to minimize overhead.
The Entropy.Sanctuary.Census process refreshes this list on a fixed interval
(default: 5s).
Refresh Lifecycle:
- Retrieves the global process list.
- Filters processes based on the target_applications allowlist (if configured).
- Converts the result to a Tuple for O(1) random access.
This architecture ensures that the Scheduler performs constant-time victim
selection without blocking the VM with expensive Process.list/0 calls during
every tick.
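The refresh lifecycle above can be sketched as follows. This is a simplified illustration, not the code of Entropy.Sanctuary.Census: the module `CensusSketch` is hypothetical, the application lookup via `:application.get_application/1` is an assumption about how a pid could be mapped to an application, and the pick is uniform rather than weighted.

```elixir
defmodule CensusSketch do
  # Hypothetical module; names are illustrative, not Entropy's internals.

  # Builds a tuple snapshot of live pids.
  # An empty allowlist means every process is a valid target.
  def snapshot(target_applications \\ []) do
    Process.list()
    |> Enum.filter(&targeted?(&1, target_applications))
    |> List.to_tuple()
  end

  # Picks a random pid from the snapshot. elem/2 on a tuple is constant time,
  # whereas indexing into a list would be linear.
  def pick({}), do: :none

  def pick(snapshot) when is_tuple(snapshot) do
    index = :rand.uniform(tuple_size(snapshot)) - 1
    {:ok, elem(snapshot, index)}
  end

  defp targeted?(_pid, []), do: true

  defp targeted?(pid, allowlist) do
    # Assumption: a pid is attributed to the application it runs under.
    case :application.get_application(pid) do
      {:ok, app} -> app in allowlist
      :undefined -> false
    end
  end
end
```

A scheduler tick would then call `CensusSketch.pick/1` on the cached tuple and only rebuild the snapshot on the refresh interval.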
This section explains how to set up the project locally for development.
- Elixir ~> 1.16 (OTP 26+)
- :os_mon (Required for System Sensors)
# 1. Clone the repository
## via HTTPS
git clone https://github.com/nrednav/entropy.git
## via SSH
git clone git@github.com:nrednav/entropy.git
cd entropy
# 2. Install dependencies
mix deps.get
# 3. Run the test suite
# Note: Tests use a Simulated Physics engine to avoid actual system interference.
mix test
Entropy uses a Deterministic Testing Pattern to eliminate race conditions.
- Simulated Physics: Tests run against a Physics simulation, not the host OS.
- Manual Polling: In the test environment, the Circuit Breaker's automatic polling loop is paused. You must explicitly trigger state updates.
Example test workflow:
# 1. Set the simulated physical state
Entropy.Simulation.Physics.set_cpu_util_percent(99.9)
# 2. Force the Circuit Breaker to read the new state
Entropy.State.CircuitBreaker.force_safety_check()
# 3. Assert the system reaction
Wait.until(fn ->
case Entropy.State.CircuitBreaker.get_safety_report() do
{:unsafe, metrics} -> metrics.cpu_util_percent == 99.9
_ -> false
end
end)
# or
assert Entropy.is_ready?() == false
This project uses Semantic Versioning. For a list of available versions, see the repository tag list.
If you encounter a bug or have a feature request, please open an issue on the GitHub repository.