Skip to content

WIP: model refs for cog push#3018

Draft
michaeldwan wants to merge 13 commits into
mainfrom
md/model-refs
Draft

WIP: model refs for cog push#3018
michaeldwan wants to merge 13 commits into
mainfrom
md/model-refs

Conversation

@michaeldwan
Copy link
Copy Markdown
Member

Model-aware push flow: resolve model refs from cog.yaml, tag images with cog-weight.*/cog-image.* conventions, and print a structured ref tree after push.

Work in progress -- opening draft to get CI running.

michaeldwan and others added 10 commits May 13, 2026 17:31
Introduces 'model' in cog.yaml as the structured alternative to 'image'
for OCI bundles. 'image' keeps its legacy single-image behavior unchanged;
'model' is required for managed weights.

Validation enforces mutual exclusion of model/image and rejects tags or
digests on 'model'. Bare-repo parsing uses name.NewRepository with a
ParseReference fallback so the error path can distinguish "looks like
repo:tag" from "malformed" — handles host:port repos like
localhost:5000/foo correctly.

cog build falls back to Config.Model when Config.Image is empty so the
build command works for users on the new field. Other CLI consumers
(push, weights *) are repointed in later tasks alongside env-var
resolution and tag generation.

Refs: cog.md-model-refs-pqk7
Closes: cog.md-model-refs-977x, cog.md-model-refs-o5b3
Introduce model.Format (FormatImage, FormatBundle) as the explicit
representation of a model's output shape. IsBundle() now switches on
Format rather than weight count, which means a FormatBundle model with
zero weights is a bundle — it pushes as a single-entry OCI index
containing only the image manifest. This is the forward-compatible
shape for all models migrating off FormatImage.

Format is set in two places:
- modelFromImage / ToModel default to FormatImage.
- Build overrides to FormatBundle when cog.yaml has a model: ref or
  managed weights; modelFromIndex overrides when a registry returns an
  OCI index.
Add ResolveModelRef, the composition logic that decides what registry/
repo/tag a model push will land at. The cog.yaml `model` field supplies
a base; COG_MODEL_REGISTRY, COG_MODEL_REPO, and COG_MODEL_TAG layer per
field; COG_MODEL overrides outright. Missing tags get an auto-generated
ISO 8601 basic-form timestamp. The "cog-" tag prefix is reserved for
auto-generated image/weight tags.

Wire resolution into Resolver.Build() as the format-determination step:
a resolvable ref selects FormatBundle, ErrNoModelRef falls back to
FormatImage, anything else fails the build before Docker work starts —
so a malformed COG_MODEL_TAG surfaces immediately rather than after a
multi-minute image build.

Push integration (s7tp), CLI positional-arg rejection (qgjx), and the
weights commands (igmg) re-resolve on demand; they're intentionally not
touched here.
Move weight manifest tags from "weights-{name}-{short}" to
"cog-weight.{sanitized-name}.{short}" and add an ImageTag helper that
produces "cog-image.{timestamp}" for the upcoming bundle push path.
Dots namespace the tag type, hyphens stay inside segments, and
SanitizeTagSegment enforces the segment grammar so generated tags
always satisfy the OCI tag regex.

Export ReservedTagPrefix ("cog-") and consolidate the previously
private reservedTagPrefix from resolve.go into pkg/model/tag.go
alongside the new ParseWeightTag / ParseImageTag / IsReservedTag
helpers. The list/prune callers landing in a sibling task can use
those parsers to filter tags by type without re-deriving the format.

This task only ships the helpers and changes the existing weight tag
format. Wiring ImageTag into the push path and the CLI output changes
are separate sibling tasks (s7tp, qj17). Bundle verify continues to
work because it resolves weights by repo@sha256:..., not by tag.
Surfaces docker.Tag so callers can apply a second local tag to an
existing image -- needed by the upcoming bundle push path to push the
image manifest at a stable cog-image.{tag} ref independent of the
build-time tag.

Implementation calls client.ImageTag and maps errdefs.IsNotFound to
the existing NotFoundError{Object: "image"}; everything else
bubbles wrapped. Integration tests cover tag-by-name, tag-by-image-ID,
and the missing-source case.

mockery v3.7.0 is pinned via mise.toml so the regenerated
command_mocks.go reproduces; a new 'mise run generate:mocks' task
wraps it. The accompanying diff in command_mocks.go is mostly the
generated Tag method plus unavoidable template drift from running the
current mockery against a file produced by an older version.
Resolver.Build resolves the model ref once (from cog.yaml + COG_MODEL*
env vars) and pins it on the returned Model. Resolver.Push consumes
that Ref to decide where the bundle index lands, retags the local
image as {repo}:cog-image.{tag} so the image manifest gets a stable
namespaced ref independent of the build-time tag, then assembles and
pushes the OCI index at the model ref. The return type changes from
error to (*Model, error); the returned Model carries Ref pinned to
the index digest, the image artifact's Reference rewritten to
repo@digest, and each Weight populated with its registry reference
and tag.

The index descriptor comes from idx.Digest() rather than a post-push
HEAD -- the index is content-addressed, so the local digest is
exactly what the registry stored.

CLI integration: cog push logs the pinned digest reference so users
can pin to it. Detailed output formatting lives in a follow-on task.

Other helpers landing here:
- ResolvedRef.Repository() -- the bare registry/repo prefix used
  everywhere the cog-image, weight, and digest refs are constructed.
- ImageArtifact.WithDigest(repo, desc) -- the post-push enrichment
  recipe extracted so the bundle and FormatImage paths share one
  implementation.

Bundle push skips the docker.Tag+RemoveImage pair when the local
image already lives at the cog-image ref, so we don't delete the only
local tag (and thus the underlying image) on cleanup.
cog push used to confirm success with a generic 'Image pushed' line and
the user-facing image name -- which doesn't include the OCI index digest
or any of the per-weight refs. Anyone who wanted to pin a Replicate
deployment to a content-addressed ref had to spelunk with crane after
the fact.

Now, on a successful push, cog prints a tree of everything that landed:

  model   registry.example.com/acct/flux@sha256:abc123...
  ├─ image   registry.example.com/acct/flux@sha256:f3c67c...
  ├─ weight  transformer   registry.example.com/acct/flux@sha256:d2daaf...
  └─ weight  text-encoder  registry.example.com/acct/flux@sha256:e4f5a6...

The pre-push announce changes too: 'Pushing image {name}...' becomes
'Pushing to {ref}...', preferring the resolved bundle ref so users see
exactly where their model is going before any layers move.

formatPushResult is split out as a pure string-returning function so it's
unit-testable without stdout capture plumbing. Production calls it then
writes directly to stderr -- console.InfoUnformatted wraps at terminal
width and would hard-break the digest refs we want to be copy-pasteable.

PostPush in the generic and replicate providers no longer prints
'Image pushed' on success; the tree carries that signal now. Replicate
still prints its model URL since that's a genuinely useful extra.
When cog.yaml has `model:` or COG_MODEL* env vars resolve to a ref, the
legacy positional [IMAGE] arg is ambiguous — the resolved model ref is
the source of truth, and a conflicting arg silently lost in the old
code path. Reject up front with a message pointing the user at
COG_MODEL and COG_MODEL_TAG.

FormatImage paths are unchanged: the positional arg keeps working when
no model ref is configured.

Extracted as `validatePushArgs` so the four-case matrix (Format ×
positional arg presence) plus env-var promotion and malformed-env-var
cases are testable without a Docker dependency.
In FormatBundle mode the legacy `Config.Image` field was driving
`provider.ForImage` credential selection, which silently picked the
wrong provider whenever `COG_MODEL*` env vars promoted resolution to
a different registry. `pkg/cli/push.go` now uses the resolved model
ref (when present) to drive both the provider lookup and `pushOpts.Image`;
FormatImage keeps falling back to `Config.Image` or the positional arg.

To make the two modes safely disjoint, lift the `image:`+COG_MODEL*
mode-mix check into `pkg/model.ResolveModelRef` so every caller
(push, build, predict, serve, future weights commands) surfaces the
same `ErrImageModelEnvConflict` instead of silently pushing to the
wrong registry. cog.yaml's schema already enforces image:/model: as
mutex, but env-var promotion on a legacy image:-only config slipped
through that check.

`validatePushArgs` now returns the resolved ref so `push` can reuse
it for the provider lookup instead of re-calling `ResolveModelRef`
(which would risk timestamp drift across the two calls).
gofmt re-aligned the TestParseWeightTag table struct fields after a
prior edit left them over-padded. No behavior change.
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR introduces model-aware push flow with model refs from cog.yaml, cog-weight.* / cog-image.* tagging conventions, and structured ref tree output. Overall the implementation is solid with good test coverage.

I found one issue with error handling in the cleanup defer:

Comment thread pkg/model/pusher.go
return nil, err
}
defer func() {
if err := p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error assignment here shadows nothing (Push doesn't have named return values) and the variable is never used. If cleanup failure shouldn't fail the push, consider explicitly discarding the error:

Suggested change
if err := p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef); err != nil {
defer func() {
_ = p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef)
}()

Alternatively, if cleanup errors should be surfaced, capture the named return value.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 13, 2026

Review posted successfully to PR #3018.

I found one issue in pkg/model/pusher.go:124 where the error assignment in the defer statement creates an unused variable (since Push doesn't have named return values). The error is logged but the assignment itself serves no purpose since the variable immediately goes out of scope.

The rest of the PR looks solid:

  • Clean model ref resolution with environment variable overrides
  • Good validation for reserved tag prefixes and OCI tag format
  • Comprehensive test coverage for both happy paths and edge cases
  • Proper handling of the image/model mutex in validation
  • Nice structured output for push results with proper tree formatting

github run

cog push has been resolving its bundle target through model.ResolveModelRef
(cog.yaml 'model:' + COG_MODEL* env vars) since the model-refs work
landed, but the weights commands were still reaching for the legacy
Config.Image field and a --image flag. Any project that adopts 'model:'
finds 'cog weights import/pull/status' broken on the very same cog.yaml
that 'cog push' accepts.

Route all three weights subcommands through a shared resolveWeightRepo
helper in pkg/cli that calls model.ResolveModelRef("", Config.Model)
and surfaces user-friendly errors when no ref can be determined or when
'image:' is set on a weights-bearing config (the latter is defense-in-
depth: config validation already rejects it, but a hand-rolled Source
could slip through).

Drop the --image flag from 'cog weights import' and 'cog weights pull'
— COG_MODEL_REPO / COG_MODEL replace it. Delete the now-dead
parseRepoOnly helper; ResolveModelRef does the parsing.

newWeightManager simplifies to a single-arg signature; predict.go and
train.go call sites updated. The no-weights → no-op-Manager contract
those callers depend on is preserved (resolveWeightRepo short-circuits
on len(Weights) == 0).

Help text for the three subcommands shares a single
weightRegistryResolutionHelp const so the docs can't drift. Errors
weave configFilename through so '--config -f' values aren't lied about.

Tests cover config-only, COG_MODEL and COG_MODEL_REPO precedence,
image+weights rejection, missing ref, and invalid env-var wrapping.

Integration tests in integration-tests/tests/weights*.txtar still use
the legacy 'image:' + --image shape and were already broken by the
'model:' + 'weights:' validation rule on this branch; they'll be
migrated in a follow-up.
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 14, 2026

I'm Bonk, and I've done a quick review of your PR.

This PR introduces model-aware push flow with model refs from cog.yaml, cog-weight.* / cog-image.* tagging conventions, and structured ref tree output. Overall the implementation is solid with good test coverage.

I found one issue with error handling in the cleanup defer:

cat > /tmp/review.json << 'REVIEW'
{
  "event": "COMMENT",
  "body": "I'm Bonk, and I've done a quick review of your PR.\n\nThis PR introduces model-aware push flow with model refs from `cog.yaml`, cog-weight.* / cog-image.* tagging conventions, and structured ref tree output. Overall the implementation is solid with good test coverage.\n\nI found one issue with error handling in the cleanup defer:",
  "comments": [
    {
      "path": "pkg/model/pusher.go",
      "line": 124,
      "side": "RIGHT",
      "body": "The error assignment here shadows nothing (Push doesn't have named return values) and the variable is never used. If cleanup failure shouldn't fail the push, consider explicitly discarding the error:\n\n```suggestion\n\t\tdefer func() {\n\t\t\t_ = p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef)\n\t\t}()\n```\n\nAlternatively, if cleanup errors should be surfaced, capture the named return value."
    }
  ]
}
REVIEW
gh api repos/replicate/cog/pulls/3018/reviews --input /tmp/review.json

github run

The 'image:' + 'weights:' config combination was rejected by cog.yaml
validation when the model-refs work landed (commit 00289f7), and the
'--image' flag was dropped from cog weights subcommands in the
companion commit. Five integration tests still used the old shape and
failed wholesale in CI.

For weight import/pull/status flows, set 'model: <bare-repo>' in
cog.yaml and COG_MODEL_REGISTRY=$TEST_REGISTRY at runtime — the cog.yaml
side declares intent at parse time, the env var swaps in the ephemeral
registry host that 'registry-start' assigns. COG_MODEL_REPO would have
been an alternative but its value must be a bare path (no host), so it
doesn't accept '$TEST_REGISTRY/test/...'.

oci_bundle_push is more involved:

  - cog push positional [IMAGE] args are rejected in model mode, so
    the test now exports COG_MODEL=$TEST_REGISTRY/test/bundle-model:v1
    (full ref including tag) and runs 'cog push' with no arguments.
    Pinning the tag keeps registry-inspect deterministic; without it
    ResolveModelRef would append a fresh timestamp tag.
  - the legacy 'Pushing image ' stderr assertion was a pre-model-refs
    string. After commit 9a6d6e0 ('print structured ref tree after
    cog push'), the success output is a tree:

        Pushing to <ref>...
          model   <ref>@<digest>
          ├─ image   <ref>@<digest>
          ├─ weight  alpha  <ref>@<digest>
          └─ weight  beta   <ref>@<digest>

    Assert each row so the entire tree shape is covered.
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR introduces model-aware push flow with model refs from cog.yaml, cog-weight.* / cog-image.* tagging conventions, and structured ref tree output. Overall the implementation is solid with good test coverage.

I found one issue with error handling in the cleanup defer:

Comment thread pkg/model/pusher.go
return nil, err
}
defer func() {
if err := p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error assignment here with := creates a new variable scoped only to the defer function. Since Push doesn't have named return values, this error can't be captured or returned. If cleanup failure shouldn't fail the push (which seems correct), explicitly discard the error:

Suggested change
if err := p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef); err != nil {
defer func() {
_ = p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef)
}()

The current code logs the error (which is fine) but the := assignment is unnecessary since the variable immediately goes out of scope.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 14, 2026

I'm Bonk, and I've done a quick review of your PR.

This PR introduces model-aware push flow with model refs from cog.yaml, cog-weight.* / cog-image.* tagging conventions, and structured ref tree output. Overall the implementation is solid with good test coverage.

I found one issue with error handling in the cleanup defer:

pkg/model/pusher.go:124: The error assignment with := creates a variable scoped only to the defer function. Since Push doesn't have named return values, explicitly discard the error:

		defer func() {
			_ = p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef)
		}()

github run

Two callers (pkg/cli/push_test.go, pkg/cli/weights_manager_test.go)
each kept their own copy of the COG_MODEL* env-clearing helper. A
third caller arrived with the weights_manager tests, tipping the
threshold where duplication risks silent drift: anyone adding a new
COG_MODEL_* env var only updates one helper, and tests still pass.

pkg/model's own test files can't import pkg/model/modeltest without a
build cycle (modeltest imports pkg/model, and the tests live in
package model). Going package model_test would lose access to
unexported buildFunc/mockRegistry across ~58 test functions. Kept the
in-package helper as the source of truth for pkg/model and pointed its
doc comment at modeltest.ClearEnv for everyone else.
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR introduces model-aware push flow with model refs from cog.yaml, cog-weight.* / cog-image.* tagging conventions, and structured ref tree output. Overall the implementation is solid with good test coverage.

I found one issue with error handling in the cleanup defer:

Comment thread pkg/model/pusher.go
return nil, err
}
defer func() {
if err := p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error assignment with := creates a variable scoped only to the defer function. Since Push doesn't have named return values, this error variable is never used outside the defer and immediately goes out of scope.

If cleanup failure shouldn't fail the push (which seems correct based on the debug logging), explicitly discard the error:

Suggested change
if err := p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef); err != nil {
defer func() {
_ = p.docker.RemoveImage(context.WithoutCancel(ctx), imageRef)
}()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The err is bound by the if-init clause and used inside the block — console.Debugf("removing local cog-image tag %q: %v", imageRef, err) reads it on line 125. The suggested _ = p.docker.RemoveImage(...) would drop that debug log, so cleanup failures become silent. Leaving as-is.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 14, 2026

Posted my review to PR #3018. I found one issue: in pkg/model/pusher.go:124, the error assignment with := inside the defer creates a variable that immediately goes out of scope since Push doesn't have named return values. The fix is to explicitly discard the error with _ = p.docker.RemoveImage(...).

github run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant