
Refactor existing logs and add logs#840

Open
hellolittlej wants to merge 1 commit into `master` from `refactor-logs`

Conversation

@hellolittlej (Collaborator) commented Mar 29, 2026

Context

```
Not all scheduling constraints had enough workers available to fulfill the request
ResourceClusterActor.TaskExecutorBatchAssignmentRequest(allocationRequests=[TaskExecutorAllocationRequest(workerId=kafka-cluster-monitor-21-worker-18-168,
constraints=SchedulingConstraints(machineDefinition=MachineDefinition{cpuCores=2.0, memoryMB=14336.0, networkMbps=700.0, diskMB=65536.0, numPorts=1}, sizeName=Optional.empty,
 schedulingAttributes={jdk=17plus, jenkins_job=unknown, repo_name=corp/kafka-mantis-kafka-monitor}), jobMetadata=io.mantisrx.server.core.domain.JobMetadata@3240fd30,
stageNum=1, readyAt=-1, durationType=Perpetual)], clusterID=ClusterID(resourceID=mantisrc.kaasall),
reservation=Reservation(key=MantisResourceClusterReservationProto.ReservationKey(jobId=kafka-cluster-monitor-21, stageNumber=1),
schedulingConstraints=SchedulingConstraints(machineDefinition=MachineDefinition{cpuCores=2.0, memoryMB=14336.0, networkMbps=700.0, diskMB=65536.0, numPorts=1},
sizeName=Optional.empty, schedulingAttributes={jdk=17plus, jenkins_job=unknown, repo_name=corp/kafka-mantis-kafka-monitor}),
canonicalConstraintKey=md:2.0/14336.0/65536.0/700.0/1;size=~;attr=jdk=17plus,jenkins_job=unknown,repo_name=corp/kafka-mantis-kafka-monitor,, stageTargetSize=35,
priority=MantisResourceClusterReservationProto.ReservationPriority(type=REPLACE, tier=0, timestamp=1774746017312), createdAt=1774746017312))
```

Currently this log is far too long to read. It essentially just says that we don't have enough workers to fulfill a request for a single worker, yet it repeats details such as `machineDefinition=MachineDefinition{cpuCores=2.0, memoryMB=14336.0, networkMbps=700.0, diskMB=65536.0, numPorts=1}` that don't need to be part of the message.

Besides, we have no logs explaining why we can't find a TE for the worker even though the scheduler sees 2 idle TEs. This PR adds logs showing why a TE was not selected for the worker.
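The shortened log described above could look roughly like the following. This is an illustrative sketch, not the actual Mantis code: the field names (`jobId`, `clusterId`, `canonicalConstraintKey`, `stageTargetSize`) are taken from the sample log in this PR, but `summarize` itself is a hypothetical helper.

```java
// Hypothetical sketch: emit only the fields an on-call engineer needs to
// triage "not enough workers", instead of the full request object.
public class CompactSchedulingLog {
    static String summarize(String jobId, int stageNum, String clusterId,
                            String canonicalConstraintKey, int stageTargetSize) {
        return String.format(
            "Not enough workers for jobId=%s, stage=%d, cluster=%s, constraintKey=%s, targetSize=%d",
            jobId, stageNum, clusterId, canonicalConstraintKey, stageTargetSize);
    }

    public static void main(String[] args) {
        System.out.println(summarize(
            "kafka-cluster-monitor-21", 1, "mantisrc.kaasall",
            "md:2.0/14336.0/65536.0/700.0/1", 35));
    }
}
```

The canonical constraint key already encodes the machine definition compactly, so repeating the full `MachineDefinition{...}` block adds no information.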

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

github-actions bot commented Mar 29, 2026

Test Results

777 tests ±0   765 ✅ −1   11m 17s ⏱️ +44s
162 suites ±0   11 💤 ±0
162 files ±0   1 ❌ +1

For more details on these failures, see this check.

Results for commit 54080a3. ± Comparison against base commit 047cb84.

♻️ This comment has been updated with latest results.

Comment on lines +404 to +409

```java
// TODO: turn these to debug level
log.info("findBestFitFor: TE {} excluded - not in stateMap", teHolder.getId());
return false;
}
if (currentBestFit.contains(teHolder.getId())) {
log.info("findBestFitFor: TE {} excluded - already in bestFit", teHolder.getId());
```
Collaborator:
These logs will be too chatty on the main agent pool, firing on every schedule request. Maybe convert them to metrics and use the TE id as a tag instead.
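The reviewer's suggestion could be sketched as below. This is a generic illustration, not the actual Mantis metrics API: a `ConcurrentHashMap` stands in for a real metrics registry (e.g. Spectator), and `ExclusionMetrics`/`recordExclusion` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: count exclusion reasons per task executor instead of logging each
// one on every schedule request. A real implementation would increment a
// registry counter tagged with the TE id.
public class ExclusionMetrics {
    private final Map<String, Long> counters = new ConcurrentHashMap<>();

    // Increment a counter named by reason, tagged with the TE id.
    public void recordExclusion(String teId, String reason) {
        counters.merge(reason + ";teId=" + teId, 1L, Long::sum);
    }

    public long count(String teId, String reason) {
        return counters.getOrDefault(reason + ";teId=" + teId, 0L);
    }
}
```

Metrics keep per-TE visibility without the log volume, at the cost of losing the per-request ordering that a log line would give during a live diagnosis.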

Collaborator (Author):
Turned these into warn level for now.

```java
}
return true;
})
.collect(Collectors.toList());
```
Collaborator:
By the way, there is a double collect in this function, which is a performance penalty.

Collaborator (Author):
We can revert it after we finish diagnosing.
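The "double collect" concern above can be illustrated generically (this is not the actual Mantis code; `doubleCollect`/`singleCollect` and the string filtering are made up for the example): collecting an intermediate list and re-streaming it materializes the data twice, whereas chaining the stages keeps one pipeline with a single terminal collect.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SinglePass {
    // Double collect: builds an intermediate list that is immediately re-streamed.
    static List<String> doubleCollect(List<String> ids) {
        List<String> nonEmpty = ids.stream()
            .filter(id -> !id.isEmpty())
            .collect(Collectors.toList());
        return nonEmpty.stream()
            .map(String::trim)
            .collect(Collectors.toList());
    }

    // Single pass: same result, one terminal collect and no intermediate list.
    static List<String> singleCollect(List<String> ids) {
        return ids.stream()
            .filter(id -> !id.isEmpty())
            .map(String::trim)
            .collect(Collectors.toList());
    }
}
```

Both return the same result; the single-pass form simply avoids allocating and iterating the intermediate list on every schedule request.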


```java
if (noResourcesAvailable) {
log.warn("Not all scheduling constraints had enough workers available to fulfill the request {}", request);
log.warn("Not all scheduling constraints had enough workers for jobId={}, cluster={}",
```
Collaborator:
I think you still need the workerId and schedulingConstraint info.

Collaborator (Author):
Worker id and constraint info are already logged in findTaskExecutorsFor before reaching this log line.

I can include the constraints again in this log line. The worker id can't be output here because it's in the nested array fields.

