# Windows (v5.0.6): out-of-bounds heap write in monkey libevent backend (`cb_event`) under high event load — out_forward to an unreachable upstream

## Summary

On Windows, an `out_forward` output pointing at an upstream that stays unreachable accumulates a large number of retry timers (each a socketpair + libevent event on the engine event base). The monkey libevent backend (`lib/monkey/mk_core/mk_event_libevent.c`) collects ready events in a **fixed-size `ctx->fired` array** that is allocated once at loop creation (256 entries for the engine loop) and **never grows**, while `cb_event` appends to it **with no bounds check**. When more than `queue_size` events become ready in a single `event_base_loop()` pass, `cb_event` writes past the end of the array — an out-of-bounds heap write that corrupts adjacent allocations or, when it reaches the guard/unmapped page, faults directly.

Confirmed on **v5.0.6** (latest release) under full page heap: the faulting write lands exactly on the guard page immediately after `ctx->fired`, and the debugger names the overrun variable `fired`. The same defect is also present in v4.0.13 (identical fault, identical Windows `FAILURE_ID_HASH`), so this is a long-standing issue, not a recent regression.

## Environment

- **Fluent Bit v5.0.6** — official Windows x64 build (`fluent-bit-5.0.6-win64`, `FileVersion 5.0.6.0`); source tag `v5.0.6`. Crash reproduced in ~7–22 minutes.
- Windows Server 2019 — `10.0.17763`, x64, 4 procs.
- Output: `forward` with TLS + `Upstream`, multiple workers; continuous input (`tail` / Windows event log).
- libevent (bundled) built **without** thread locking (`evthread_use_*` is never called).
- Also reproduced on v4.0.13 (see "Also present in v4.0.13").

## Reproduction

Point an `out_forward` at a black-hole address so that every connect attempt runs into the timeout and retries pile up. The crash was first seen in production while the real destination was unavailable; the black-hole address only shortens the time to reproduce.

```ini
[OUTPUT]
    Name         forward
    Match        *
    Host         10.255.255.1   # black hole, no RST
    Port         24224
    Retry_Limit  false          # unlimited retries -> many concurrent retry timers
```

Drive continuous input so the engine keeps scheduling flushes and retries. The process crashes within minutes (about 7 to 22 in the runs listed under Environment). Capture a full dump with:
`procdump -accepteula -ma -t -e -w fluent-bit.exe C:\dumps`

## Root cause

The engine event loop is created with a fixed size:

```c
evl = mk_event_loop_create(256);          /* src/flb_engine.c  */
```

which allocates the fired array exactly once and records its capacity:

```c
/* _mk_event_loop_create() */
ctx->fired = mk_mem_alloc_z(sizeof(struct mk_event) * size);   /* size = 256 */
ctx->queue_size = size;
```

`_mk_event_add()` registers further fds into libevent (`event_new(... cb_event, event); event_add(...)`) **without ever growing `ctx->fired` or `queue_size`**. The number of registered events is therefore unbounded, but the fired array stays at 256.

`cb_event()` appends one entry per ready event, with **no bounds check**:

```c
/* cb_event(), mk_event_libevent.c */
i = ctx->fired_count;
fired = &ctx->fired[i];
fired->fd   = event->fd;       /* line 99 */
fired->mask = mask;            /* line 100 */
fired->data = event;
ctx->fired_count++;
```

`fired_count` is reset to 0 before each call into `event_base_loop()` and then counts every event that fires during that pass:

```c
/* _mk_event_wait_with_flags() */
ctx->fired_count = 0;
event_base_loop(ctx->base, flags);
```

When more than `queue_size` (256) events become ready in a single pass, `&ctx->fired[fired_count]` walks past the allocation and `cb_event` corrupts whatever follows it on the heap. (The same unchecked append exists in `_mk_event_inject()`.)

### Faulting dump (v5.0.6, without page heap)

```
fluent_bit!cb_event+0xa8                       [mk_event_libevent.c @ 100]   <-- mov [rax+8],ecx
fluent_bit!event_persist_closure+0x2f6         [libevent/event.c @ 1580]
fluent_bit!event_process_active_single_queue   [libevent/event.c @ 1639]
fluent_bit!event_process_active                [libevent/event.c @ 1738]
fluent_bit!event_base_loop+0x296               [libevent/event.c @ 1961]
fluent_bit!_mk_event_wait_with_flags+0x3a      [mk_event_libevent.c @ 456]
fluent_bit!mk_event_wait / flb_engine_start    [src/flb_engine.c @ 1141]
```

`rax = 0x0000023e9628eff8`, write to `[rax+8] = 0x0000023e9628f000` (next, unmapped page), `ecx = 1` (`MK_EVENT_READ`). `struct mk_event` is `{ int fd; int type; uint32_t mask; ... }`, so offset 8 is `mask` — the faulting instruction is exactly `fired->mask = mask`, with `fired = &ctx->fired[fired_count]` at the end of the 256-entry allocation.
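The address arithmetic for this dump can be reproduced in a few lines. The sketch assumes only the three leading fields quoted above, a 4-byte `int`, and 4 KiB pages; `struct mk_event_prefix` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading fields of struct mk_event as quoted above; the rest is omitted. */
struct mk_event_prefix {
    int      fd;     /* offset 0 */
    int      type;   /* offset 4 */
    uint32_t mask;   /* offset 8 */
};

static_assert(offsetof(struct mk_event_prefix, mask) == 8,
              "mask is expected at offset 8");

#define PAGE_SIZE_BYTES 0x1000u   /* assumption: 4 KiB pages */

/* Address written by `mov [rax+8],ecx` for a given rax. */
static uint64_t mask_write_address(uint64_t rax)
{
    return rax + offsetof(struct mk_event_prefix, mask);
}

/* Non-zero when addr is the first byte of a page. */
static int is_page_start(uint64_t addr)
{
    return (addr % PAGE_SIZE_BYTES) == 0;
}
```

For `rax = 0x0000023e9628eff8` the entry starts 8 bytes before a page boundary, so `fd` and `type` still fit on the mapped page and the `mask` write is the first to cross it.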

### Confirmed under full page heap (v5.0.6)

Re-run with full page heap enabled (`gflags /p /enable fluent-bit.exe /full`; `NTGLOBALFLAG: 2000000`, `APPLICATION_VERIFIER_LOADED: 1`):

```
fluent_bit!cb_event+0x9d   [mk_event_libevent.c @ 99]   mov dword ptr [rax],ecx
FAULTING_LOCAL_VARIABLE_NAME:  fired
FAILURE_BUCKET_ID:  INVALID_POINTER_WRITE_AVRF_c0000005_fluent-bit.exe!cb_event
```

`rax = 0x0000025a923e6000` is exactly page-aligned — the page-heap guard page placed immediately after the `ctx->fired` allocation. With page heap the fault now occurs on the **first** field write of the entry (`fired->fd = event->fd`, line 99, `ecx` = the fd value) rather than `fired->mask`, because the whole entry now starts past the array end. The debugger names the overrun target directly: `FAULTING_LOCAL_VARIABLE_NAME: fired`. The guard-page alignment places this write at index 256 of the 256-entry array — the append for the 257th simultaneously-ready event in one `event_base_loop` pass. This is a definitive heap buffer overrun of `ctx->fired`, not a use-after-free.

## Also present in v4.0.13

Under full page heap, v4.0.13 faults **identically** (`cb_event`, line 99, `FAULTING_LOCAL_VARIABLE_NAME: fired`, write on the guard page after `ctx->fired`) and Windows assigns it the **same `FAILURE_ID_HASH`** as the v5.0.6 page-heap crash — i.e. it is classified as the same defect. Without page heap the overrun surfaced in v4.0.13 as roaming corruption of adjacent structures (a libevent timer min-heap and the engine event priority queue, with stray fd-range integers and partial-pointer overwrites), consistent with `struct mk_event` entries written past `ctx->fired`. The defect is unchanged across releases.

## Secondary defect — timeout teardown (also present in v5.0.6)

Independent of the overflow, the timer teardown double-closes the read-end fd and has two owners freeing the same `ev_map`:

- `_mk_event_timeout_destroy()` closes `event->fd` (= `ev_map->pipe[0]`) without resetting it to an invalid value, then calls `_mk_event_del()`, which closes `ev_map->pipe[0]` again. On Windows the fd is reused immediately, so the second close can hit a socket now owned by another `event_base`.
- `cb_timeout()` self-frees `ev_map` on `send` failure while `_mk_event_del()` also frees it (double-free / UAF).

Worth fixing in the same pass, but not the corruptor demonstrated above.
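For illustration, the usual guard against a double close is to invalidate the slot on the first close. The sketch below is not monkey code: `close_once`, `close_fn` and `counting_close` are hypothetical, and plain `int` descriptors with a counting callback stand in for real sockets so the example runs anywhere.

```c
#include <assert.h>
#include <stddef.h>

#define FD_INVALID (-1)

/* Hypothetical close callback so the sketch runs without real sockets. */
typedef int (*close_fn)(int fd);

/*
 * Close *fd at most once: after the first call the slot holds FD_INVALID,
 * so a second teardown path that goes through the same slot does nothing.
 * Returns 1 if a close was issued, 0 if the slot was already invalid.
 */
static int close_once(int *fd, close_fn do_close)
{
    assert(fd != NULL && do_close != NULL);

    if (*fd == FD_INVALID) {
        return 0;
    }
    do_close(*fd);
    *fd = FD_INVALID;
    return 1;
}

/* Test double that counts how many closes were actually issued. */
static int close_calls = 0;

static int counting_close(int fd)
{
    (void) fd;
    close_calls++;
    return 0;
}
```

The guard only helps when both teardown paths go through the same slot. Here `event->fd` and `ev_map->pipe[0]` are two copies of one descriptor value, so invalidating one copy does not protect the other; a single owner for the close is the more robust arrangement.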

## Suggested fixes

### 1. Bound / grow the `fired` array (primary)

The number of events that can fire in one `event_base_loop()` pass equals the number of registered events, which is unbounded — so the fixed-capacity `fired` array must grow with it. Both append sites (`cb_event` and `_mk_event_inject`) need the guard, so factor it into one helper in `mk_event_libevent.c`:

```c
/* Append a fired event, growing ctx->fired on demand so it can never
 * overflow when more events fire in one loop pass than queue_size. */
static inline int mk_event_fired_push(struct mk_event_ctx *ctx,
                                      evutil_socket_t fd, int mask,
                                      struct mk_event *event)
{
    struct mk_event *tmp;
    int new_size;

    if (ctx->fired_count >= ctx->queue_size) {
        new_size = (ctx->queue_size > 0) ? (ctx->queue_size * 2) : 256;
        tmp = mk_mem_realloc(ctx->fired, sizeof(struct mk_event) * new_size);
        if (tmp == NULL) {
            return -1;            /* OOM: drop rather than overflow */
        }
        ctx->fired      = tmp;
        ctx->queue_size = new_size;
    }

    ctx->fired[ctx->fired_count].fd   = fd;
    ctx->fired[ctx->fired_count].mask = mask;
    ctx->fired[ctx->fired_count].data = event;
    ctx->fired_count++;

    return 0;
}
```

`cb_event` then becomes:

```c
static void cb_event(evutil_socket_t fd, short flags, void *data)
{
    int mask = 0;
    struct mk_event *event = data;
    struct ev_map  *map    = event->data;

    if (flags & EV_READ)  mask |= MK_EVENT_READ;
    if (flags & EV_WRITE) mask |= MK_EVENT_WRITE;

    /* On OOM the entry is dropped; cb_event has no way to report it. */
    (void) mk_event_fired_push(map->ctx, event->fd, mask, event);
}
```

and the append block in `_mk_event_inject`:

```c
    event->mask = mask;
    if (mk_event_fired_push(ctx, event->fd, mask, event) == 0) {
        loop->n_events++;
    }
    return 0;
```

The `mk_mem_realloc` happens inside `cb_event` during `event_base_loop()`, but that is safe: no pointer into `ctx->fired` is cached across `cb_event` calls (each call re-indexes `ctx->fired[ctx->fired_count]`), libevent holds no pointer into it, and the consumer reads `ctx->fired` only after the loop returns. Simply raising the static `256` in `mk_event_loop_create()` is not a fix — it only moves the threshold.
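The growth behaviour can be exercised outside Fluent Bit with a standalone model. This is a sketch under stated assumptions: `struct slot`, `struct fired_queue`, `fired_push` and `push_many` are hypothetical stand-ins, plain `calloc`/`realloc` replace the `mk_mem_*` wrappers, and the `INT_MAX` guard is an extra precaution that the helper above does not have.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Stand-in for struct mk_event: only the fields the push writes. */
struct slot {
    int fd;
    int mask;
    void *data;
};

/* Stand-in for struct mk_event_ctx. */
struct fired_queue {
    struct slot *fired;
    int queue_size;
    int fired_count;
};

/* Append one entry, doubling the array when it is full. Returns 0 or -1. */
static int fired_push(struct fired_queue *q, int fd, int mask, void *data)
{
    struct slot *tmp;
    int new_size;

    assert(q != NULL);

    if (q->fired_count >= q->queue_size) {
        if (q->queue_size > INT_MAX / 2) {
            return -1;                  /* doubling would overflow int */
        }
        new_size = (q->queue_size > 0) ? q->queue_size * 2 : 256;
        /* realloc leaves the new tail uninitialized, unlike the zeroing
         * initial allocation, so every field read later is set below. */
        tmp = realloc(q->fired, sizeof(*tmp) * (size_t) new_size);
        if (tmp == NULL) {
            return -1;                  /* the old block is still valid */
        }
        q->fired = tmp;
        q->queue_size = new_size;
    }

    q->fired[q->fired_count].fd = fd;
    q->fired[q->fired_count].mask = mask;
    q->fired[q->fired_count].data = data;
    q->fired_count++;
    return 0;
}

/* Push n entries into a fresh queue; returns the final capacity or -1. */
static int push_many(int initial_capacity, int n)
{
    struct fired_queue q = { NULL, 0, 0 };
    int i;
    int cap = -1;

    assert(initial_capacity > 0 && n >= 0);

    q.fired = calloc((size_t) initial_capacity, sizeof(*q.fired));
    if (q.fired == NULL) {
        return -1;
    }
    q.queue_size = initial_capacity;

    for (i = 0; i < n; i++) {
        if (fired_push(&q, i, 1, NULL) != 0) {
            goto done;
        }
    }
    for (i = 0; i < n; i++) {
        if (q.fired[i].fd != i) {       /* entries must survive each realloc */
            goto done;
        }
    }
    cap = q.queue_size;

done:
    free(q.fired);
    return cap;
}
```

Starting from 256 slots, 257 pushes grow the array once to 512 and 1000 pushes grow it twice to 1024, with earlier entries preserved across each reallocation.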

### 2. Single-owner timeout teardown (secondary)

```c
static inline int _mk_event_timeout_destroy(struct mk_event_ctx *ctx, void *data)
{
    if (data == NULL) {
        return 0;
    }
    /* _mk_event_del() is the single owner: it closes both pipe ends, sets them
     * to -1, and frees the event + ev_map exactly once. Do NOT pre-close
     * event->fd here -- it aliases ev_map->pipe[0] and would be closed twice
     * (the second close can hit a fd already reused by another event_base). */
    return _mk_event_del(ctx, (struct mk_event *) data);
}

static void cb_timeout(evutil_socket_t fd, short flags, void *data)
{
    uint64_t val = 1;
    struct ev_map *ev_map = data;
    /* Signal only. Lifetime is owned solely by the explicit destroy path; never
     * free here, or it races with _mk_event_del() and double-frees ev_map. */
    (void) send(ev_map->pipe[1], (char *) &val, sizeof(uint64_t), 0);
}
```

This makes the explicit destroy path the sole owner. It assumes every timeout is torn down via `mk_event_timeout_destroy()`; any timeout that relied on `cb_timeout`'s self-cleanup (on read-end close) should be reviewed before adopting this.

## Capture notes (for maintainers)

The overflow is confirmed under full page heap (see "Confirmed under full page heap" above): the fault lands on the guard page immediately after `ctx->fired`, and the debugger identifies the overrun variable as `fired`. `!heap -p -a <addr-inside-block>` shows the offending allocation originates from `_mk_event_loop_create` (`mk_mem_alloc_z(sizeof(struct mk_event) * 256)`).

## Workaround for affected users (mitigation, not a fix)

Keep the number of simultaneously-ready events on the engine loop well under the 256-entry `fired` capacity:

- keep `storage.max_chunks_up` **below 256** (the default is 128) to cap the concurrent up-chunk/task/timer population
- finite `Retry_Limit` on the forward output — failed chunks leave the retry population instead of accumulating
- `storage.total_limit_size` to bound the per-output backlog (drops oldest chunks)
- `log_level info` to cut log-pipe pressure

These keep the load under the overflow threshold but do not remove the bug; safety depends on chunk size and burst patterns. Lowering `log_level` alone did **not** prevent the crash in testing.
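Put together, the mitigations might look like the fragment below. The numeric values are illustrative assumptions, not tested thresholds, and the existing `Host`, `Port` and TLS settings of the output are left out.

```ini
[SERVICE]
    log_level              info
    # illustrative value; anything well below 256
    storage.max_chunks_up  64

[OUTPUT]
    Name                      forward
    Match                     *
    # keep the existing Host / Port / TLS settings here
    # illustrative finite retry count instead of "false"
    Retry_Limit               5
    # illustrative cap on the per-output backlog
    storage.total_limit_size  500M
```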

