Heartbeat timeout on long running tasks #359

@nstott

Hi all, I'm relatively new to simpleflow and am having some trouble understanding what the best practice is for long-running jobs.

My workflow consists of a few tasks, one of which involves running an external process to crunch some data, and can take anywhere between 1 and 2 hours.

When this long task is running, the worker doesn't seem to send heartbeats, so I've set the heartbeat timeout to an unreasonably large value so that the SWF task doesn't fail due to a timeout.

The problem I'm having is that my worker processes periodically crash (OOM, or other general Kubernetes malfeasance), and because of the long heartbeat timeout, the workflow doesn't retry the failed task until that timeout finally expires.

I'm looking for a way to keep sending heartbeats while the worker is occupied, or for some other way to retry quickly after a worker fails. I'm not sure what the right pattern is here.

I'm not sure if this is related to #239.
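
To make the first option concrete, here is a minimal sketch of the kind of pattern I have in mind: a background thread keeps sending heartbeats while the main thread blocks on the external process. The names `run_with_heartbeat` and `send_heartbeat` are hypothetical and not part of simpleflow. With boto3, `send_heartbeat` could presumably wrap the SWF client's `record_activity_task_heartbeat(taskToken=...)` call, but how to get the task token from inside a simpleflow activity is an open assumption here.

```python
import subprocess
import sys
import threading


def run_with_heartbeat(cmd, send_heartbeat, interval=30.0):
    """Run cmd to completion, calling send_heartbeat() every `interval` seconds.

    `send_heartbeat` is a hypothetical zero-argument callable supplied by the
    caller; this sketch does not assume any particular simpleflow API.
    """
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop sends one heartbeat per interval until the task ends.
        # If send_heartbeat raises, this thread exits and heartbeats stop.
        while not stop.wait(interval):
            send_heartbeat()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        return subprocess.run(cmd, check=True).returncode
    finally:
        stop.set()
        thread.join()


if __name__ == "__main__":
    # Stand-in for the long external process: a child that sleeps briefly.
    demo_cmd = [sys.executable, "-c", "import time; time.sleep(0.3)"]
    run_with_heartbeat(demo_cmd, lambda: print("heartbeat"), interval=0.1)
```

Because the heartbeat thread lives in the worker process, it dies with the worker on an OOM kill, so the heartbeat timeout could then be set to a small multiple of `interval` rather than to the full task duration.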
