Skip to content

feat: implement local LLM client with yzma (purego)#60

Closed
nvandessel wants to merge 6 commits intomainfrom
feat/local-llm-yzma
Closed

feat: implement local LLM client with yzma (purego)#60
nvandessel wants to merge 6 commits intomainfrom
feat/local-llm-yzma

Conversation

@nvandessel
Copy link
Owner

Summary

  • Replace llama-go (CGo) with hybridgroup/yzma (purego) for local LLM embeddings
  • Single local.go file — no build tags, no stub file, no C++ compiler needed
  • go build always works; shared libs only needed at runtime (yzma install --lib)
  • Per-call embedding context creation (simpler for low-throughput dedup workload)
  • Package-level sync.Once for library init, instance-level sync.Once for model loading
  • Available() does cheap os.Stat checks without loading any libraries

Stacked on

Test plan

  • go build ./cmd/floop — builds cleanly (no yzma libs needed at build time)
  • go vet ./... — no issues
  • go test ./... — all 25 packages pass
  • Integration: FLOOP_TEST_LIB_PATH=/path/to/libs FLOOP_TEST_MODEL_PATH=/path/to/model.gguf go test -tags integration ./internal/llm/ -v

🤖 Generated with Claude Code

@greptile-apps
Copy link

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

Replaces llama-go (CGo) with hybridgroup/yzma (purego) for local LLM embeddings, eliminating C++ compiler requirements at build time.

Key implementation details:

  • Single local.go file with full yzma integration (no build tags or stub files)
  • Package-level sync.Once ensures llama.Load() and llama.Init() happen once per process
  • Instance-level sync.Once for lazy model loading on first Embed() call
  • Per-call embedding context creation with immediate cleanup (simpler for low-throughput dedup workload)
  • Thread-safe via sync.Mutex protecting all model operations
  • Available() performs cheap os.Stat checks without loading libraries
  • Comprehensive unit tests (error paths, config validation, interface compliance)
  • Integration tests with build tag and env-based skip conditions

The PR achieves its goal of pure Go builds while maintaining runtime flexibility for local embeddings.

Confidence Score: 4/5

  • This PR is safe to merge with low risk; the implementation is solid with comprehensive tests and proper resource management
  • Score reflects well-structured implementation with thread-safe lazy loading, proper error handling, comprehensive test coverage (both unit and integration), and adherence to Go best practices. Minor style suggestion on context padding doesn't affect correctness. The change successfully eliminates CGo dependency while maintaining all functionality.
  • No files require special attention - all changes follow project conventions and include proper tests

Important Files Changed

Filename Overview
go.mod Added github.com/hybridgroup/yzma v1.7.0 dependency with transitive deps ebitengine/purego and jupiterrider/ffi; updated golang.org/x/* packages
go.sum Updated checksums for new dependencies and updated Go standard library packages
internal/llm/local.go Implemented full yzma-based embedding client with thread-safe lazy loading, per-call context creation, package-level library init, and proper resource cleanup; one minor style suggestion on context padding
internal/llm/local_test.go Comprehensive unit tests covering config validation, Available() checks, error paths, and interface compliance; tests properly use t.TempDir() and t.Setenv()
internal/llm/local_integration_test.go Well-structured integration tests with build tag, clear skip conditions via env vars, and realistic test cases for embeddings and similarity comparisons

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant LC as LocalClient
    participant Lib as yzma Library
    participant Model as GGUF Model

    Note over App,Model: First Call (Lazy Initialization)
    
    App->>LC: Embed(ctx, text)
    LC->>LC: loadModel() [once.Do]
    LC->>Lib: loadLib(libPath) [package-level once]
    Lib-->>LC: library loaded
    LC->>Lib: llama.ModelLoadFromFile()
    Lib->>Model: load model
    Model-->>Lib: model handle
    Lib-->>LC: model, vocab, nEmbd
    
    Note over LC: Model loaded, proceed with embedding
    
    LC->>LC: mu.Lock()
    LC->>Lib: llama.Tokenize(vocab, text)
    Lib-->>LC: tokens
    LC->>Lib: llama.InitFromModel() [per-call context]
    Lib-->>LC: lctx
    LC->>Lib: llama.SetEmbeddings(lctx, true)
    LC->>Lib: llama.Decode(lctx, batch)
    LC->>Lib: llama.GetEmbeddingsSeq(lctx, 0, nEmbd)
    Lib-->>LC: rawVec
    LC->>LC: normalize(vec)
    LC->>Lib: llama.Free(lctx)
    LC->>LC: mu.Unlock()
    LC-->>App: []float32 embedding

    Note over App,Model: Subsequent Calls (Model Already Loaded)
    
    App->>LC: CompareEmbeddings(ctx, a, b)
    LC->>LC: Embed(ctx, a) [reuses loaded model]
    LC->>LC: Embed(ctx, b) [reuses loaded model]
    LC->>LC: CosineSimilarity(embA, embB)
    LC-->>App: float64 similarity
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

tokens := llama.Tokenize(c.vocab, text, true, true)

ctxParams := llama.ContextDefaultParams()
ctxParams.NCtx = uint32(len(tokens) + 16) // fit the text with padding
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

padding of +16 tokens may be insufficient for special tokens and context overhead

Most embedding models require additional space beyond the raw token count for BOS/EOS tokens, padding tokens, and internal context management. The current +16 may cause allocation issues with longer inputs or models with larger special token sets.

Suggested change
ctxParams.NCtx = uint32(len(tokens) + 16) // fit the text with padding
ctxParams.NCtx = uint32(len(tokens) + 64) // fit the text with padding for special tokens

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@nvandessel nvandessel force-pushed the feat/local-llm-foundations-v2 branch from c06d230 to 43db901 Compare February 11, 2026 16:04
@nvandessel nvandessel force-pushed the feat/local-llm-foundations-v2 branch from 43db901 to 4dbaf80 Compare February 11, 2026 16:08
@nvandessel nvandessel force-pushed the feat/local-llm-yzma branch 2 times, most recently from 7023d3d to ffef5ea Compare February 11, 2026 16:11
@nvandessel
Copy link
Owner Author

@greptileai review

@greptile-apps
Copy link

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR replaces the previous (CGo-based) local embedding backend with hybridgroup/yzma (purego) and wires a functional LocalClient implementation in internal/llm/local.go. The client a one-time process-global yzma library init, lazy-loads the GGUF embedding model on first use, and creates/frees a llama context per Embed() call. Unit tests were expanded for Available() and configuration fallbacks, and an integration test file (behind -tags integration) was added to exercise embedding/similarity with real libs and a real model.

The change fits into the existing llm package by keeping MergeBehaviors delegated to the existing FallbackClient while enabling embedding-based CompareBehaviors/CompareEmbeddings when the local runtime is present.

Confidence Score: 3/5

  • Moderately safe to merge once the local client error/cleanup issues are addressed.
  • Core wiring is straightforward and test coverage is improved, but Embed() currently ignores decode errors and Close() leaves a freed model handle in a state where the client can be used again unsafely due to sync.Once preventing reload.
  • internal/llm/local.go

Important Files Changed

Filename Overview
go.mod Adds hybridgroup/yzma dependency and updates indirect x/* versions for purego local embeddings.
go.sum Updates checksums for new yzma/purego/ffi dependencies and bumps x/* module versions.
internal/llm/local.go Implements yzma-based embedding client with global library init and per-call contexts; has a couple of correctness issues around error handling/cleanup.
internal/llm/local_integration_test.go Adds integration tests behind build tag that exercise Available/Embed/Compare paths when libs and model are provided.
internal/llm/local_test.go Expands unit tests for Available/config fallbacks and verifies error behavior when local backend isn't configured.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant LocalClient
    participant L as llama (yzma)

    Caller->>LocalClient: CompareBehaviors(ctx, a, b)
    LocalClient->>LocalClient: CompareEmbeddings(ctx, aText, bText)

    LocalClient->>LocalClient: Embed(ctx, aText)
    LocalClient->>LocalClient: loadModel() [sync.Once]
    LocalClient->>L: Load(libPath) [process once]
    LocalClient->>L: Init()
    LocalClient->>L: ModelLoadFromFile(modelPath)
    LocalClient->>L: ModelGetVocab(model)
    LocalClient->>L: ModelNEmbd(model)

    LocalClient->>L: Tokenize(vocab, text)
    LocalClient->>L: InitFromModel(model, ctxParams)
    LocalClient->>L: SetEmbeddings(lctx, true)
    LocalClient->>L: BatchGetOne(tokens)
    LocalClient->>L: Decode(lctx, batch)
    LocalClient->>L: GetEmbeddingsSeq(lctx, 0, nEmbd)
    LocalClient->>LocalClient: normalize(vec)
    LocalClient->>L: Free(lctx)

    LocalClient->>LocalClient: Embed(ctx, bText)
    Note over LocalClient: Same per-call context flow

    LocalClient->>LocalClient: CosineSimilarity(embA, embB)
    LocalClient-->>Caller: ComparisonResult
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 187 to 190
llama.SetEmbeddings(lctx, true)

batch := llama.BatchGetOne(tokens)
llama.Decode(lctx, batch)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unchecked decode errors

llama.Decode(lctx, batch) returns an error, but the current code ignores it. If decoding fails (e.g., context too small, invalid tokens, model/runtime issues), GetEmbeddingsSeq may return an error or (worse) succeed with undefined/garbage embeddings. Please capture and propagate the decode error before proceeding.

Comment on lines 241 to 248
func (c *LocalClient) Close() error {
c.mu.Lock()
defer c.mu.Unlock()

if c.loaded {
llama.ModelFree(c.model)
c.loaded = false
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close doesn’t reset handle

After llama.ModelFree(c.model), c.model remains non-zero and c.once prevents re-loading. If the client is closed and then used again, loadModel() will not run and Embed() will try to create a context from a freed model handle. Consider resetting c.model/c.vocab/c.nEmbd and/or allowing loadModel() to run again after Close() so post-close usage fails cleanly or reloads safely.

@nvandessel nvandessel force-pushed the feat/local-llm-foundations-v2 branch from 4dbaf80 to 62e320c Compare February 11, 2026 16:28
Base automatically changed from feat/local-llm-foundations-v2 to main February 12, 2026 01:41
nvandessel and others added 2 commits February 11, 2026 17:42
Replace llama-go CGo bindings with hybridgroup/yzma, which uses purego
to load llama.cpp shared libraries at runtime. No CGo, no build tags,
no C++ compiler needed — go build always works.

- Single local.go file (no stub), package-level sync.Once for lib init
- Per-call context creation in Embed() for simplicity
- Available() checks lib dir + model file via os.Stat
- Integration tests gated on FLOOP_TEST_LIB_PATH + FLOOP_TEST_MODEL_PATH

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase token context padding from +16 to +64 for special tokens
- Check llama.Decode() error return instead of ignoring it
- Reset model/vocab/nEmbd handles and sync.Once in Close() so
  post-close usage fails cleanly or reloads safely

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nvandessel
Copy link
Owner Author

@greptileai review

@greptile-apps
Copy link

greptile-apps bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR successfully replaces the CGo-based llama-go with hybridgroup/yzma (purego) for local LLM embeddings, eliminating C++ compiler dependencies while maintaining full functionality.

Key changes:

  • Replaced llama-go with yzma v1.7.0 for pure Go embedding generation
  • Implemented full LocalClient with thread-safe model loading and per-call context management
  • Added package-level sync.Once for library initialization and instance-level sync.Once for model loading
  • All three previous review issues addressed: decode error checking (line 190), proper Close() cleanup with once reset (line 253), and increased token padding from +16 to +64 (line 179)
  • Comprehensive test coverage with unit tests and integration tests (requires FLOOP_TEST_LIB_PATH and FLOOP_TEST_MODEL_PATH)
  • Available() performs cheap filesystem checks without loading libraries

Build improvement: go build now works without yzma libraries; shared libs only needed at runtime via yzma install --lib

Confidence Score: 4/5

  • Safe to merge with minor style suggestion regarding context cancellation handling
  • All previous review feedback addressed correctly, comprehensive error handling throughout, proper resource management with defer and mutex protection, extensive test coverage. One minor style suggestion about context cancellation timing (non-critical). The implementation is solid and follows Go best practices.
  • No files require special attention - all changes are well-structured and thoroughly tested

Important Files Changed

Filename Overview
internal/llm/local.go Implements full local embedding client with yzma (purego); addresses all previous review feedback (decode errors, close handling, token padding)
internal/llm/local_test.go Expanded unit tests for config validation, path resolution, error cases, and YZMA_LIB fallback
internal/llm/local_integration_test.go New integration tests for embedding, comparison, and behavior operations with real GGUF models

Sequence Diagram

sequenceDiagram
    participant Client as LocalClient
    participant Lib as yzma/llama
    participant Model as GGUF Model
    
    Note over Client: First call to Embed()
    Client->>Client: loadModel() with sync.Once
    Client->>Lib: Load(libPath) - package-level once
    Lib-->>Client: OK
    Client->>Lib: Init() - process global
    Client->>Lib: ModelLoadFromFile(path, params)
    Lib->>Model: Load GGUF
    Model-->>Lib: Model handle
    Lib-->>Client: model, vocab, nEmbd
    
    Note over Client: Per-call embedding context
    Client->>Client: Acquire mutex
    Client->>Lib: Tokenize(vocab, text)
    Lib-->>Client: tokens[]
    Client->>Lib: InitFromModel(model, ctxParams)
    Lib-->>Client: lctx (context handle)
    Client->>Lib: SetEmbeddings(lctx, true)
    Client->>Lib: Decode(lctx, batch)
    Lib->>Model: Process tokens
    Model-->>Lib: OK
    Client->>Lib: GetEmbeddingsSeq(lctx, 0, nEmbd)
    Lib-->>Client: rawVec (float32[])
    Client->>Client: Copy + L2 normalize
    Client->>Lib: Free(lctx)
    Client->>Client: Release mutex
    Client-->>Client: Return embedding vector
    
    Note over Client: On Close()
    Client->>Lib: ModelFree(model)
    Client->>Client: Reset once for reload
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +170 to +174
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context check happens before acquiring mutex. If blocked waiting for c.mu.Lock() and context cancelled during that wait, cancellation won't be detected until after lock acquired. Consider checking context again after lock or using select with ctx.Done() for long operations.

nvandessel and others added 3 commits February 11, 2026 18:00
- Fix gofmt alignment in LocalClient struct fields
- Check llama.Free and llama.ModelFree error returns (errcheck)
- Suppress gosec G115 for safe int->int32/uint32 conversions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change GPULayers field type to int32 (matches llama API)
- Exclude gosec G115 (integer overflow) — no dataflow analysis,
  false positives on bounded values like len()
- Remove unnecessary math import and bounds check wrappers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
if v := os.Getenv("FLOOP_LOCAL_GPU_LAYERS"); v != "" {
if n, err := strconv.Atoi(v); err == nil {
config.LLM.LocalGPULayers = n
config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32))

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types High

Incorrect conversion of an integer with architecture-dependent bit size from
strconv.Atoi
to a lower bit size type int32 without an upper bound check.

Copilot Autofix

AI 9 days ago

In general, to avoid incorrect conversions between integer types, you should either parse directly into the target sized type using strconv.ParseInt/ParseUint with the correct bit size, or ensure you perform explicit bounds checks against the target type’s min/max constants before casting.

For this specific code, the best fix without changing functionality is:

  • Replace strconv.Atoi with strconv.ParseInt using base 10 and bit size 32.
  • Use the returned int64 value as already guaranteed to be within the int32 range (because we pass 32 as the bit size), then cast to int32.
  • Apply the existing min limiting logic on the int32 value (or on the parsed int64 if we ensure it’s within int32) without relying on Atoi’s architecture-dependent int.

Concretely, in internal/config/config.go around lines 242–245:

  • Change n, err := strconv.Atoi(v) to parsed, err := strconv.ParseInt(v, 10, 32).
  • Replace config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32)) with a form that:
    • First converts parsed to int32.
    • Then applies min with an int32-typed math.MaxInt32 (or an equivalent constant), so we aren't mixing different integer types.

Because we must not modify code outside the shown snippets or invent unseen helpers, we’ll perform the clamp after converting to int32 and use a correctly typed int32 max value. No new imports are needed; math and strconv are already imported.


Suggested changeset 1
internal/config/config.go

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/internal/config/config.go b/internal/config/config.go
--- a/internal/config/config.go
+++ b/internal/config/config.go
@@ -240,8 +240,10 @@
 		config.LLM.LocalEmbeddingModelPath = v
 	}
 	if v := os.Getenv("FLOOP_LOCAL_GPU_LAYERS"); v != "" {
-		if n, err := strconv.Atoi(v); err == nil {
-			config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32))
+		if parsed, err := strconv.ParseInt(v, 10, 32); err == nil {
+			n32 := int32(parsed)
+			const maxInt32 int32 = math.MaxInt32
+			config.LLM.LocalGPULayers = min(n32, maxInt32)
 		}
 	}
 	if v := os.Getenv("FLOOP_LOCAL_CONTEXT_SIZE"); v != "" {
EOF
@@ -240,8 +240,10 @@
config.LLM.LocalEmbeddingModelPath = v
}
if v := os.Getenv("FLOOP_LOCAL_GPU_LAYERS"); v != "" {
if n, err := strconv.Atoi(v); err == nil {
config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32))
if parsed, err := strconv.ParseInt(v, 10, 32); err == nil {
n32 := int32(parsed)
const maxInt32 int32 = math.MaxInt32
config.LLM.LocalGPULayers = min(n32, maxInt32)
}
}
if v := os.Getenv("FLOOP_LOCAL_CONTEXT_SIZE"); v != "" {
Copilot is powered by AI and may make mistakes. Always verify output.
The IntentMatch threshold (>0.8) is too strict for small embedding
models like all-MiniLM-L6-v2. Use MergeCandidate (>0.7) which is
the actionable threshold for dedup and works across model sizes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tsaucedogithub
Copy link

local llm? seems like a bitter lesson tbh

@nvandessel
Copy link
Owner Author

this is gonna close soon in favor of new prs that will implement dedupe and also emebeddings for a vector search system to help bootstrap association and edges prior to them being learned through activation.

@nvandessel nvandessel closed this Feb 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments