1.0.70 suppresses later Workflow\Exception replay logs in signal-retry workflows #373

@rmcdaniel

Description

Summary

After upgrading from 1.0.69 to 1.0.70, later handled activity failures in the same workflow lifecycle stop being recorded in workflow_logs as Workflow\Exception rows.

I reproduced this in the actual durable-workflow/sample-app application with a real queue worker and a Redis queue, not with feature tests.

This looks like the same underlying problem described in discussion #372.

What I tested

I used the sample app with a real worker:

php -d opcache.enable_cli=0 artisan queue:work --sleep=1 --tries=1 --timeout=0 -v
php -d opcache.enable_cli=0 artisan app:exception-logging-repro --timeout=90

The repro workflow shape is:

class ExceptionLoggingRetryActivity extends Activity
{
    public $tries = 1;

    // Fails deterministically on the first two steps so each retry exercises
    // a new workflow index; any later step succeeds.
    public function execute(string $step): string
    {
        return match ($step) {
            'first' => throw new RuntimeException('first failure from activity'),
            'second' => throw new InvalidArgumentException('second failure from activity'),
            default => "success on {$step}",
        };
    }
}

class ExceptionLoggingRetryWorkflow extends Workflow
{
    protected int $retryRequests = 0;

    #[SignalMethod]
    public function requestRetry(): void
    {
        $this->retryRequests++;
    }

    public function execute(): Generator
    {
        $caught = [];
        $stage = 0;

        while (true) {
            try {
                $result = yield activity(
                    ExceptionLoggingRetryActivity::class,
                    match ($stage) {
                        0 => 'first',
                        1 => 'second',
                        default => 'success',
                    }
                );

                return [
                    'caught' => $caught,
                    'result' => $result,
                ];
            } catch (Throwable $throwable) {
                $caught[] = get_class($throwable).': '.$throwable->getMessage();

                // Each caught failure requires one more retry signal before
                // advancing to the next stage.
                $requiredRetries = $stage + 1;
                yield await(fn () => $this->retryRequests >= $requiredRetries);
                $stage++;
            }
        }
    }
}

The command just starts the workflow, waits 3 seconds, sends requestRetry(), waits another 3 seconds, sends requestRetry() again, then waits for completion.
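That sequence can be sketched as follows. This is a hypothetical outline of what the repro command does, assuming the laravel-workflow WorkflowStub API (`WorkflowStub::make()`, `start()`, signal calls on the stub, `running()`); it is illustrative, not the actual sample-app source:

```php
<?php

use App\Workflows\Repro\ExceptionLoggingRetryWorkflow;
use Workflow\WorkflowStub;

// Start the workflow; the queue worker picks up the jobs.
$workflow = WorkflowStub::make(ExceptionLoggingRetryWorkflow::class);
$workflow->start();

sleep(3);
$workflow->requestRetry(); // unblocks the first await()

sleep(3);
$workflow->requestRetry(); // unblocks the second await()

// Poll until the worker drives the run to completion
// (on 1.0.70 this loop never exits).
while ($workflow->running()) {
    sleep(1);
}
```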

Expected behavior

The second handled failure happens at a new workflow index, so it should produce a second Workflow\Exception row in workflow_logs, just like 1.0.69 does.

Actual behavior

On 1.0.69

The workflow completes successfully.

workflow_logs for the run:

[0] Workflow\Exception
[1] Workflow\Signal
[2] Workflow\Exception
[3] Workflow\Signal
[4] App\Workflows\Repro\ExceptionLoggingRetryActivity

workflow_exceptions for the run:

RuntimeException: first failure from activity
InvalidArgumentException: second failure from activity

On 1.0.70

The workflow gets stuck in WorkflowWaitingStatus.

workflow_logs for the run:

[0] Workflow\Exception
[1] Workflow\Signal

workflow_exceptions for the run:

RuntimeException: first failure from activity
InvalidArgumentException: second failure from activity
InvalidArgumentException: second failure from activity

The worker output shows the later Workflow\Exception jobs being dispatched, but they do not create new replay-log rows. Because index 2 never gets a Workflow\Exception row, the workflow keeps replaying the same second failing activity on later retry signals.

Why this seems to happen

src/Exception.php in 1.0.70 now does:

if ($this->storedWorkflow->hasLogByIndex($this->index)) {
    $workflow->resume();
} elseif (! $this->storedWorkflow->logs()->where('class', self::class)->exists()) {
    $workflow->next($this->index, $this->now, self::class, $this->exception);
}

That global exists() check is presumably meant to suppress stale sibling exception logs in a parallel fan-out, but because it matches any Workflow\Exception row anywhere in the run, it also suppresses legitimate later exceptions at new indexes within the same workflow lifecycle.
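For comparison, a naive narrowing would scope the suppression to the current index rather than the whole run. This is only a sketch of the direction, not a vetted patch, and the parallel fan-out case that 1.0.70 targets would presumably still need its own guard:

```php
// Sketch only: suppress the log only when an exception row already exists at
// this index. Since hasLogByIndex() already covers "a log exists at
// $this->index", the index-scoped exists() below is always false in this
// branch, so this effectively restores the 1.0.69 behavior for new indexes.
if ($this->storedWorkflow->hasLogByIndex($this->index)) {
    $workflow->resume();
} elseif (! $this->storedWorkflow->logs()
    ->where('index', $this->index)
    ->where('class', self::class)
    ->exists()
) {
    $workflow->next($this->index, $this->now, self::class, $this->exception);
}
```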

Why I think this is a real bug

workflow_exceptions keeps growing, so the later activity failures are definitely happening.

What stops working is the replay log in workflow_logs, which means the workflow cannot deterministically advance past the later failed stage.

So this is not just a visibility/logging issue. In signal-driven manual retry flows, it changes behavior and can leave the workflow stuck replaying the same failing stage.
