-
Notifications
You must be signed in to change notification settings - Fork 2
Implement abstraction for resources #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughIntroduces a resource claiming framework with a plugin-backed architecture. Implements PCI device discovery across Linux and Darwin, GPU claim management, asynchronous claim/release processing with validation, and comprehensive test coverage using Ginkgo/Gomega. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Claimer as Resource Claimer
participant GPUPlugin as GPU Plugin
participant PCIReader as PCI Reader
participant DeviceState as Device State
Client->>Claimer: Claim(ctx, [nvidia.com/gpu: 2])
activate Claimer
Claimer->>Claimer: Validate plugins exist
Claimer->>GPUPlugin: Validate resource type
activate GPUPlugin
GPUPlugin->>GPUPlugin: CanClaim(quantity: 2)
GPUPlugin->>DeviceState: Count free devices
DeviceState-->>GPUPlugin: free_count = 4
GPUPlugin->>GPUPlugin: Compare: 4 >= 2 ✓
GPUPlugin-->>Claimer: Can claim
deactivate GPUPlugin
Claimer->>GPUPlugin: Claim(quantity: 2)
activate GPUPlugin
GPUPlugin->>DeviceState: Allocate & mark 2 devices as Claimed
DeviceState-->>GPUPlugin: Allocated addresses
GPUPlugin->>GPUPlugin: Create gpuClaim(addresses)
GPUPlugin-->>Claimer: ResourceClaim{addresses}
deactivate GPUPlugin
Claimer->>Claimer: Aggregate results into Claims map
Claimer-->>Client: Claims{nvidia.com/gpu: gpuClaim}, nil
deactivate Claimer
Estimated code review effort🎯 4 (Complex) | ⏱️ ~55 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@claimutils/claim/claimer_test.go`:
- Around line 41-49: The goroutine launching resourceClaimer.Start(innerCtx)
writes its result to errCh but the test never reads it, so Start failures are
ignored and the goroutine may outlive the spec; update the test to consume and
assert the value from errCh (or select between errCh and innerCtx.Done if
cancellation is expected) after cancelling/finishing the context, verifying the
returned error (or nil) matches the contract for resourceClaimer.Start to ensure
failures are detected and the goroutine exits.
In `@claimutils/claim/claimer.go`:
- Around line 106-116: Start currently flips c.running to false and returns
while the background goroutine c.run may still be mid-request, leaving callers
blocked on resultChan; modify Start and run to perform a graceful shutdown: when
ctx is cancelled, signal run to stop, then have run drain any pending requests
(the internal request queue/resultChan used in run) and send an explicit
error/closure to each waiting result before exiting, or close the result
channel(s) so waiters unblock; update claimer.Start, claimer.run and any code
paths that send to resultChan to handle a closed channel and to ensure every
pending request receives a terminal response (error) on shutdown.
- Around line 68-76: The claimer struct's log field is never initialized causing
a nil pointer panic when c.log.Error() is called (e.g., during claim rollback);
update NewResourceClaimer to accept a logr.Logger parameter and assign it to
claimer.log (or default to a no-op logger if nil), and update all
NewResourceClaimer call sites to pass a logger so claimer.log is always non-nil;
ensure the assignment occurs where the claimer value is constructed so
subsequent methods (e.g., where c.log.Error is invoked) are safe.
In `@claimutils/gpu/gpu_test.go`:
- Around line 149-167: The test’s spec name It("should claim different devices",
...) is duplicated and misleading because this case asserts a zero-quantity
claim; rename that spec to a descriptive unique name (e.g., It("should handle
zero-quantity claim", func(...) { ... })) so test output is unambiguous — update
the It(...) string in the gpu_test.go spec that wraps the plugin initialization
and Claim(resource.MustParse("0")) assertions (the anonymous func containing
plugin := gpu.NewGPUClaimPlugin(...) and
Expect(gpuClaim.PCIAddresses()).To(BeEmpty())).
In `@claimutils/gpu/gpu.go`:
- Around line 138-147: The loop over g.preClaimed incorrectly falls through and
adds pciDevice to g.devices even when it isn't discovered; inside the for loop
in gpu.go, after logging that a pciDevice is "Not discovered pre-claimed pci
address" (the if checking _, ok := g.devices[pciDevice]), add a continue so the
code does not execute the subsequent g.log.V(2).Info("Set device to claimed",
...) and g.devices[pciDevice] = ClaimStatusClaimed for undiscovered devices;
ensure references remain to g.preClaimed, g.devices and ClaimStatusClaimed.
In `@claimutils/pci/reader_test.go`:
- Line 62: The test function TestGPUClaimer_InitCorrect is misnamed because it
only validates the PCI reader; rename the test to reflect the PCI reader
behavior (e.g., TestPCIReader_InitCorrect or TestPCIReader_Init) to make test
output and -run filtering accurate; update the function declaration name and any
references to TestGPUClaimer_InitCorrect accordingly so the test suite runs
under the new test identifier.
In `@LICENSES/Apache-2.0.txt`:
- Line 69: Replace the incorrect boilerplate fragment "Unless requested by
applicable law or agreed to in writing, software" in the Apache-2.0 appendix
boilerplate with the official wording "Unless required by applicable law or
agreed to in writing, software" so the license matches the official Apache 2.0
text exactly.
- Around line 47-51: Revert the altered wording in the Apache-2.0 license text:
in Section 6 change "requested" back to "required" in the phrase "except as
requested for reasonable and customary use..." (Section 6), and in Sections 7
and 8 change both occurrences of "Unless requested by applicable law" and
"unless requested by applicable law" back to "Unless required by applicable law"
and "unless required by applicable law" respectively so the license matches the
official Apache License 2.0 wording.
🧹 Nitpick comments (8)
claimutils/pci/progfs_darwin.go (2)
23-26: Consider returning an error for unsupported OS.Returning
(nil, nil)silently succeeds, making it indistinguishable from a legitimate "no devices found" scenario. Callers cannot differentiate between an unsupported platform and an empty result.Consider returning a sentinel error to make the unsupported state explicit:
♻️ Proposed change
Add an error variable at package level (in
reader.goor here):var ErrUnsupportedOS = errors.New("PCI reading not supported on this OS")Then modify the Read method:
func (r *reader) Read() ([]Address, error) { r.log.V(1).Info("NOT SUPPORTED OS") - return nil, nil + return nil, ErrUnsupportedOS }
14-21: Alternatively, fail fast in NewReader.If PCI reading is fundamentally unsupported on Darwin, consider returning an error from
NewReaderto prevent usage altogether, rather than silently succeeding during construction.♻️ Alternative approach
+var ErrUnsupportedOS = errors.New("PCI reading not supported on this OS") + func NewReader(log logr.Logger, _ Vendor, _ Class) (*reader, error) { log.V(1).Info("NOT SUPPORTED OS") - - return &reader{ - log: log, - }, nil - + return nil, ErrUnsupportedOS }claimutils/claim/plugin.go (1)
25-25: Consider adding methods to ResourceClaim.
ResourceClaimis currently an empty interface, which provides flexibility but sacrifices type safety. Based on the GPU implementation inclaimutils/gpu/gpu.go, claims have meaningful operations (e.g.,PCIAddresses()).Consider whether a minimal interface with common claim methods would be beneficial:
type ResourceClaim interface { // String returns a human-readable identifier for the claim String() string }This is optional if you prefer the flexibility of the empty interface.
claimutils/pci/reader.go (1)
13-17: Useconstfor fixed PCI IDs.
Line 13 defines immutable values asvar, which is non-idiomatic for constants.♻️ Suggested change
-var ( - Class3DController Class = 0x030200 - - VendorNvidia Vendor = 0x10de -) +const ( + Class3DController Class = 0x030200 + VendorNvidia Vendor = 0x10de +)claimutils/pci/progfs_linux.go (1)
21-48: Return theReaderinterface from constructors for better abstraction.
Lines 21 and 36 expose an unexported concrete type; returning the interface matches the public contract and eases future swaps/mocking.♻️ Suggested change
-func NewReader(log logr.Logger, vendorFilter Vendor, classFilter Class) (*reader, error) { +func NewReader(log logr.Logger, vendorFilter Vendor, classFilter Class) (Reader, error) { fs, err := sysfs.NewDefaultFS() if err != nil { return nil, fmt.Errorf("failed to open sysfs: %w", err) } @@ - return &reader{ + return &reader{ log: log, fs: fs, vendorFilter: vendorFilter, classFilter: classFilter, }, nil } -func NewReaderWithMount(log logr.Logger, mountPoint string, vendorFilter Vendor, classFilter Class) (*reader, error) { +func NewReaderWithMount(log logr.Logger, mountPoint string, vendorFilter Vendor, classFilter Class) (Reader, error) { fs, err := sysfs.NewFS(mountPoint) if err != nil { return nil, fmt.Errorf("failed to open sysfs: %w", err) } @@ - return &reader{ + return &reader{ log: log, fs: fs, vendorFilter: vendorFilter, classFilter: classFilter, }, nil }claimutils/claim/claimer.go (1)
78-104: Consider consolidating duplicate plugin-check logic.
checkPluginsForResourcesandcheckPluginsForClaimsare nearly identical. A helper function could reduce duplication.♻️ Optional: Extract shared logic
+func (c *claimer) checkPluginsExist(resourceNames []v1alpha1.ResourceName) error { + var errs []error + for _, name := range resourceNames { + if _, ok := c.plugins[string(name)]; !ok { + errs = append(errs, fmt.Errorf("plugin for resource %s not found", name)) + } + } + return errors.Join(errs...) +} + func (c *claimer) checkPluginsForResources(resources v1alpha1.ResourceList) error { - var missingPluginErrors []error - for resourceName := range resources { - if _, ok := c.plugins[string(resourceName)]; !ok { - missingPluginErrors = append(missingPluginErrors, fmt.Errorf("plugin for resource %s not found", resourceName)) - } - } - if len(missingPluginErrors) > 0 { - return errors.Join(missingPluginErrors...) + names := make([]v1alpha1.ResourceName, 0, len(resources)) + for name := range resources { + names = append(names, name) } - - return nil + return c.checkPluginsExist(names) }claimutils/gpu/gpu.go (2)
35-40: Consider reversing the boolean semantics for clarity.
ClaimStatusFree = trueandClaimStatusClaimed = falseis counterintuitive. Typically, the "claimed" state would betrue. This could confuse maintainers.♻️ Optional: Use clearer semantics
-type ClaimStatus bool - -const ( - ClaimStatusFree ClaimStatus = true - ClaimStatusClaimed ClaimStatus = false -) +type ClaimStatus bool + +const ( + ClaimStatusFree ClaimStatus = false + ClaimStatusClaimed ClaimStatus = true +)Then update the checks (e.g., line 66, 92) accordingly.
79-101: Map iteration yields non-deterministic device selection.Iterating over
g.devices(line 87) has undefined order in Go. The same quantity request may claim different devices across runs, which could complicate debugging or cause flaky tests.If deterministic ordering matters, consider using a sorted slice of keys.
| func (c *claimer) Start(ctx context.Context) error { | ||
| if !c.running.CompareAndSwap(false, true) { | ||
| return ErrAlreadyStarted | ||
| } | ||
|
|
||
| go c.run(ctx) | ||
|
|
||
| <-ctx.Done() | ||
| c.running.Store(false) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential request loss on shutdown.
When the context is cancelled, Start sets running = false and returns, but the run goroutine may exit mid-request. Callers blocked on resultChan (lines 190-195, 233-238) could hang indefinitely since no response is sent.
Consider draining pending requests or closing channels on shutdown to unblock waiters.
🤖 Prompt for AI Agents
In `@claimutils/claim/claimer.go` around lines 106 - 116, Start currently flips
c.running to false and returns while the background goroutine c.run may still be
mid-request, leaving callers blocked on resultChan; modify Start and run to
perform a graceful shutdown: when ctx is cancelled, signal run to stop, then
have run drain any pending requests (the internal request queue/resultChan used
in run) and send an explicit error/closure to each waiting result before
exiting, or close the result channel(s) so waiters unblock; update
claimer.Start, claimer.run and any code paths that send to resultChan to handle
a closed channel and to ensure every pending request receives a terminal
response (error) on shutdown.
023d755 to
6fc7a00
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@claimutils/gpu/gpu_test.go`:
- Around line 35-37: The test comment is incorrect: the By("init plugin without
reader") text contradicts the call to gpu.NewGPUClaimPlugin which is passed
&MockReader{}; either update the comment to reflect that the plugin is
initialized with a reader (change the By text to "init plugin with reader") or,
if the test intended no reader, change the NewGPUClaimPlugin call to pass nil
instead of &MockReader{}; locate this in the test around the plugin variable
initialization and the gpu.NewGPUClaimPlugin / MockReader usage and make the
comment and call consistent.
♻️ Duplicate comments (1)
claimutils/claim/claimer.go (1)
107-117: Potential request loss on shutdown remains unaddressed.When context is cancelled,
Startsetsrunning = falseand returns, but pending requests ontoClaim/toReleasechannels may have callers blocked onresultChanthat will never receive a response, causing goroutine leaks.🐛 Suggested approach: drain channels on shutdown
func (c *claimer) Start(ctx context.Context) error { if !c.running.CompareAndSwap(false, true) { return ErrAlreadyStarted } go c.run(ctx) <-ctx.Done() c.running.Store(false) + + // Drain pending requests to unblock waiters + for { + select { + case req := <-c.toClaim: + req.resultChan <- claimRes{err: ctx.Err()} + case req := <-c.toRelease: + req.resultChan <- ctx.Err() + default: + return nil + } + } - return nil }
🧹 Nitpick comments (4)
claimutils/gpu/gpu.go (2)
35-40: Consider clearer naming for ClaimStatus constants.Using
truefor "Free" andfalsefor "Claimed" is counterintuitive. Typically, a "claimed" state would betrue. This inverted mapping may cause confusion during maintenance.♻️ Suggested refactor
-type ClaimStatus bool - -const ( - ClaimStatusFree ClaimStatus = true - ClaimStatusClaimed ClaimStatus = false -) +type ClaimStatus int + +const ( + ClaimStatusFree ClaimStatus = iota + ClaimStatusClaimed +)
87-96: Map iteration order is non-deterministic.Iterating over
g.devicesmap produces non-deterministic device selection order. While functionally correct, this can cause flaky test behavior and inconsistent allocation patterns across runs.♻️ Suggested approach
Consider maintaining a separate ordered slice of device addresses or sorting the keys before iteration to ensure deterministic allocation:
+ // Collect and sort addresses for deterministic allocation + var freeDevices []pci.Address + for device, claimed := range g.devices { + if claimed == ClaimStatusFree { + freeDevices = append(freeDevices, device) + } + } + // Sort freeDevices by some consistent ordering (e.g., by Address.String()) + gClaim := &gpuClaim{} - for device, claimed := range g.devices { + for _, device := range freeDevices { if int64(len(gClaim.devices)) == requested { break } - - if claimed == ClaimStatusFree { - g.devices[device] = ClaimStatusClaimed - gClaim.devices = append(gClaim.devices, device) - } + g.devices[device] = ClaimStatusClaimed + gClaim.devices = append(gClaim.devices, device) }claimutils/claim/claimer.go (1)
79-105: Consider extracting common plugin-check logic.
checkPluginsForResourcesandcheckPluginsForClaimshave nearly identical implementations. A generic helper could reduce duplication.♻️ Suggested refactor
func (c *claimer) checkPlugins[K ~string](names iter.Seq[K]) error { var errs []error for name := range names { if _, ok := c.plugins[string(name)]; !ok { errs = append(errs, fmt.Errorf("plugin for resource %s not found", name)) } } return errors.Join(errs...) }Or simply use
maps.Keys()with a shared implementation.claimutils/gpu/gpu_test.go (1)
133-143: Misleading variable name:ociAddressshould begpuClaimor similar.The variables
ociAddress1andociAddress2are actuallygpu.Claiminstances containing PCI addresses, not OCI addresses. This naming is confusing.♻️ Suggested fix
- ociAddress1, ok := gpuClaim1.(gpu.Claim) + claim1, ok := gpuClaim1.(gpu.Claim) Expect(ok).To(BeTrue()) - Expect(ociAddress1.PCIAddresses()).To(HaveLen(1)) + Expect(claim1.PCIAddresses()).To(HaveLen(1)) By("claim resources again") gpuClaim2, err := plugin.Claim(resource.MustParse("1")) Expect(err).ToNot(HaveOccurred()) - ociAddress2, ok := gpuClaim2.(gpu.Claim) + claim2, ok := gpuClaim2.(gpu.Claim) Expect(ok).To(BeTrue()) - Expect(ociAddress2.PCIAddresses()).To(HaveLen(1)) + Expect(claim2.PCIAddresses()).To(HaveLen(1)) By("ensure claims are not equal") - Expect(ociAddress1.PCIAddresses()[0]).NotTo(Equal(ociAddress2.PCIAddresses()[0])) + Expect(claim1.PCIAddresses()[0]).NotTo(Equal(claim2.PCIAddresses()[0]))
| By("init plugin without reader") | ||
| plugin = gpu.NewGPUClaimPlugin(log.FromContext(ctx), "test-plugin", &MockReader{}, nil) | ||
| Expect(plugin.Init()).ShouldNot(HaveOccurred()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misleading comment: step initializes plugin with a reader.
The By comment says "init plugin without reader" but the code creates a plugin with &MockReader{}.
🔧 Suggested fix
- By("init plugin without reader")
+ By("init plugin with reader")
plugin = gpu.NewGPUClaimPlugin(log.FromContext(ctx), "test-plugin", &MockReader{}, nil)🤖 Prompt for AI Agents
In `@claimutils/gpu/gpu_test.go` around lines 35 - 37, The test comment is
incorrect: the By("init plugin without reader") text contradicts the call to
gpu.NewGPUClaimPlugin which is passed &MockReader{}; either update the comment
to reflect that the plugin is initialized with a reader (change the By text to
"init plugin with reader") or, if the test intended no reader, change the
NewGPUClaimPlugin call to pass nil instead of &MockReader{}; locate this in the
test around the plugin variable initialization and the gpu.NewGPUClaimPlugin /
MockReader usage and make the comment and call consistent.
Proposed Changes
Fixes #
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.