Avro reader memory leak #2325

@Declow

Apache Iceberg version

version = "0.9.1"

Please describe the bug 🐞

There appears to be a memory leak in pyiceberg/avro/reader.py.
I have a long-running service that keeps crashing. I tried to reproduce the issue locally, and the same memory growth shows up there.

The following code creates an in-memory catalog and generates synthetic data for ingestion into Iceberg.

from pyiceberg.catalog.memory import InMemoryCatalog
import tracemalloc
from datetime import datetime, timezone
import polars as pl

def generate_df():
    df = pl.DataFrame(
        {
            "event_type": ["playback"] * 1000,
            "event_origin": ["origin1"] * 1000,
            "event_send_at": [datetime.now(timezone.utc)] * 1000,
            "event_saved_at": [datetime.now(timezone.utc)] * 1000,
            "data": [
                {
                    "calendarKey": "calendarKey",
                    "id": str(i),
                    "referenceId": f"ref-{i}",
                }
                for i in range(1000)
            ],
        }
    )
    return df

# Create an in-memory catalog and a table whose schema matches the generated data.
df = generate_df()
catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.leak", schema=df.to_arrow().schema, location="/tmp/iceberg/leak"
)

# Append repeatedly and print the top 10 allocation sites after every write.
tracemalloc.start()
for i in range(1000):
    df = generate_df()
    df.write_iceberg(table, mode="append")
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:10]:
        print(stat)

The memory attributed to the Avro reader increases slowly but steadily:

/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B

After some more writes, the output looks like this:

/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B
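To make the growth easier to see than by eyeballing two top-10 lists, the snapshots can be diffed and filtered down to the reader module. This is only a sketch using the standard `tracemalloc` API; the helper name `reader_growth` and the default filename pattern are my own choices, not anything from PyIceberg.

```python
import tracemalloc


def reader_growth(before, after, pattern="*/pyiceberg/avro/reader.py"):
    """Return per-line allocation growth between two snapshots.

    `reader_growth` and `pattern` are hypothetical names for this sketch.
    Only allocation sites in files matching `pattern` are considered.
    """
    filters = [tracemalloc.Filter(True, pattern)]
    diff = after.filter_traces(filters).compare_to(
        before.filter_traces(filters), "lineno"
    )
    # Keep only the lines whose retained size actually grew.
    return [d for d in diff if d.size_diff > 0]
```

In the reproduction loop, one would take a `before` snapshot after the first few appends (so one-time caches are already warm), take an `after` snapshot at the end, and print each returned entry. Lines that keep showing a positive `size_diff` as the number of writes grows are the ones retaining memory.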

Looking at the AvroFile class, it implements the __enter__ and __exit__ dunder methods. The __enter__ method assigns the reader to an attribute on the instance, but it seems the various reader objects stick around afterwards.
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192
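To illustrate what I suspect, here is a generic sketch of the pattern, not PyIceberg's actual code. `Decoder`, `HoldingFile`, and `ReleasingFile` are made-up classes. The sketch shows that an object assigned in `__enter__` stays alive after the `with` block for as long as the context manager instance itself is referenced, unless `__exit__` drops it. Whether this is what really happens in AvroFile is an assumption on my part.

```python
import gc
import weakref


class Decoder:
    """Hypothetical stand-in for a reader object; not a PyIceberg class."""


class HoldingFile:
    """Hypothetical context manager that keeps its reader after exit."""

    def __enter__(self):
        self.reader = Decoder()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Nothing is released here, so self.reader lives as long as self.
        return None


class ReleasingFile(HoldingFile):
    """Same as above, but drops the reader reference on exit."""

    def __exit__(self, exc_type, exc, tb):
        self.reader = None
        return None


def reader_survives_exit(file_cls):
    """Return True if the reader is still alive after the with block."""
    f = file_cls()
    with f:
        ref = weakref.ref(f.reader)
    gc.collect()
    # f is still referenced here, mimicking an instance that is kept around.
    return ref() is not None
```

With these definitions, `reader_survives_exit(HoldingFile)` returns `True` and `reader_survives_exit(ReleasingFile)` returns `False`. This only matters if something keeps the file instances (or the readers themselves, for example through a cache) referenced between writes, which I have not confirmed.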

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
