Feature request: cloudpickle.patch_multiprocessing() utility for ForkingPickler replacement #589

@clemlesne

Summary

Propose adding a cloudpickle.patch_multiprocessing() helper that replaces multiprocessing.reduction.ForkingPickler with a cloudpickle-based pickler, enabling Pool.map(lambda x: x**2, range(10)) to work out of the box.
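For context, the failure this proposal targets can be reproduced with the stock pickler alone. The sketch below uses only the standard library and does not touch cloudpickle; `stock_can_pickle` is a hypothetical helper name introduced for illustration.

```python
import pickle
from multiprocessing.reduction import ForkingPickler


def stock_can_pickle(obj):
    """Return True if the unpatched ForkingPickler can serialize obj."""
    try:
        ForkingPickler.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        # The exception type raised for unpicklable functions varies
        # across Python versions, so both are treated as "cannot pickle".
        return False
    return True


# A lambda is pickled by qualified name, and "<lambda>" cannot be looked up.
square = lambda x: x**2

lambda_ok = stock_can_pickle(square)
builtin_ok = stock_can_pickle(len)
```

This is why Pool.map with a lambda fails before any patching: the task is serialized through ForkingPickler on the way to the worker.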

Motivation: ecosystem fragmentation

Every project that needs cloudpickle + multiprocessing.Pool independently reinvents this patching. At least 6 projects maintain their own version:

| Project | Approach |
| --- | --- |
| loky/joblib | Full custom _LokyPickler subsystem in loky/backend/reduction.py |
| PySpark | Own CloudPickleSerializer wrapping cloudpickle.dumps/loads |
| Ray | Bundled fork as ray.cloudpickle with custom object store |
| Dask | Custom serialization protocol in distributed scheduler |
| multiprocess | Complete fork of CPython's multiprocessing with dill substituted |
| trading-strategy/exec-sandbox/pypeln/pyrocko | Ad-hoc monkey patches of varying correctness |

Most ad-hoc implementations are incomplete because of a non-obvious CPython pitfall (see below).

The _ForkingPickler double-binding pitfall

CPython has two separate name bindings for ForkingPickler:

# multiprocessing/reduction.py
class ForkingPickler(pickle.Pickler):
    ...
# multiprocessing/connection.py
from .context import reduction
_ForkingPickler = reduction.ForkingPickler   # captured at import time

class Connection:
    def send(self, obj):
        self._send_bytes(_ForkingPickler.dumps(obj))  # uses the captured reference

Patching reduction.ForkingPickler alone is insufficient: Connection.send() still uses the stale _ForkingPickler reference captured at import time. You must also patch multiprocessing.connection._ForkingPickler. Most ad-hoc implementations miss this.

Additionally, reduction.dump() is a module-level function that also needs replacing for completeness.
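The stale binding described above can be observed directly with the standard library. This is a minimal sketch: `StandInPickler` is a hypothetical placeholder class, not a cloudpickle type, and the original binding is restored at the end.

```python
import multiprocessing.connection
import multiprocessing.reduction

original = multiprocessing.reduction.ForkingPickler


class StandInPickler(original):
    """Hypothetical replacement, used only to make the rebinding visible."""


# Patch only the attribute on the reduction module, as many ad-hoc patches do.
multiprocessing.reduction.ForkingPickler = StandInPickler
try:
    reduction_sees_patch = (
        multiprocessing.reduction.ForkingPickler is StandInPickler
    )
    # connection.py bound its own name at import time, so it is unaffected.
    connection_still_original = (
        multiprocessing.connection._ForkingPickler is original
    )
finally:
    # Restore the original binding so the demonstration has no side effects.
    multiprocessing.reduction.ForkingPickler = original
```

Both flags end up True: the reduction module sees the new class while the connection module keeps pointing at the original one.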

Proposed API

import cloudpickle

cloudpickle.patch_multiprocessing()

One call, idempotent, patches all three binding sites:

  1. multiprocessing.reduction.ForkingPickler — the class
  2. multiprocessing.reduction.dump — the module-level helper
  3. multiprocessing.connection._ForkingPickler — the import-time captured reference
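For test suites, the same three sites could be swapped and restored together. The following is a sketch under assumptions, not part of the proposed API: `swap_forking_pickler` is a hypothetical name, and it accepts any Pickler subclass with the ForkingPickler constructor signature.

```python
import contextlib
import multiprocessing.connection
import multiprocessing.reduction


@contextlib.contextmanager
def swap_forking_pickler(pickler_cls):
    """Temporarily point all three binding sites at pickler_cls (hypothetical helper)."""
    reduction = multiprocessing.reduction
    connection = multiprocessing.connection
    # Remember the current bindings so they can be restored on exit.
    saved = (reduction.ForkingPickler, reduction.dump, connection._ForkingPickler)

    def dump(obj, file, protocol=None):
        pickler_cls(file, protocol).dump(obj)

    reduction.ForkingPickler = pickler_cls        # 1. the class
    reduction.dump = dump                         # 2. the module-level helper
    connection._ForkingPickler = pickler_cls      # 3. the captured reference
    try:
        yield pickler_cls
    finally:
        (
            reduction.ForkingPickler,
            reduction.dump,
            connection._ForkingPickler,
        ) = saved
```

Restoring on exit keeps a global monkey patch from leaking between tests, which a one-way patch function cannot do on its own.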

Reference implementation

Here's a minimal working implementation (tested on Python 3.14):

import copyreg
import io
import multiprocessing.connection
import multiprocessing.reduction

import cloudpickle


class CloudForkingPickler(cloudpickle.Pickler):
    """ForkingPickler replacement backed by cloudpickle."""
    # Starts empty: reducers already registered on the stock ForkingPickler are not copied.
    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=None):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = staticmethod(cloudpickle.loads)


def patch_multiprocessing():
    """Replace multiprocessing's ForkingPickler with cloudpickle-based version."""
    # 1. The class itself
    multiprocessing.reduction.ForkingPickler = CloudForkingPickler
    # 2. The module-level dump() helper
    def _dump(obj, file, protocol=None):
        CloudForkingPickler(file, protocol).dump(obj)

    multiprocessing.reduction.dump = _dump
    # 3. The import-time captured reference in connection.py
    multiprocessing.connection._ForkingPickler = CloudForkingPickler

After patch_multiprocessing():

from multiprocessing import Pool

if __name__ == "__main__":  # guard needed under the spawn and forkserver start methods
    with Pool(4) as p:
        print(p.map(lambda x: x**2, range(10)))
    # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Why cloudpickle (not CPython)

There's an open discussion on discuss.python.org about adding a pluggable pickler API to multiprocessing, but no PEP has materialized. cloudpickle is the pragmatic place for this — it already provides Pickler/dumps/loads, and adding a one-shot integration helper is a small, natural extension.

Alternatives considered

  • "Just use loky/joblib" — Valid for many users, but loky replaces the entire process management layer. Many projects only need cloudpickle serialization with stdlib multiprocessing.Pool.
  • "Just use multiprocess (dill)" — Requires replacing all multiprocessing imports. dill is heavier than cloudpickle and has different serialization semantics.
  • "Document the pattern instead" — The _ForkingPickler double-binding makes documentation insufficient; people will keep getting it wrong.

Happy to submit a PR if there's interest.
