feat: implement local LLM client with yzma (purego)#60
feat: implement local LLM client with yzma (purego)#60nvandessel wants to merge 6 commits intomainfrom
Conversation
Greptile OverviewGreptile SummaryReplaces llama-go (CGo) with hybridgroup/yzma (purego) for local LLM embeddings, eliminating C++ compiler requirements at build time. Key implementation details:
The PR achieves its goal of pure Go builds while maintaining runtime flexibility for local embeddings. Confidence Score: 4/5
|
| Filename | Overview |
|---|---|
| go.mod | Added github.com/hybridgroup/yzma v1.7.0 dependency with transitive deps ebitengine/purego and jupiterrider/ffi; updated golang.org/x/* packages |
| go.sum | Updated checksums for new dependencies and updated Go standard library packages |
| internal/llm/local.go | Implemented full yzma-based embedding client with thread-safe lazy loading, per-call context creation, package-level library init, and proper resource cleanup; one minor style suggestion on context padding |
| internal/llm/local_test.go | Comprehensive unit tests covering config validation, Available() checks, error paths, and interface compliance; tests properly use t.TempDir() and t.Setenv() |
| internal/llm/local_integration_test.go | Well-structured integration tests with build tag, clear skip conditions via env vars, and realistic test cases for embeddings and similarity comparisons |
Sequence Diagram
sequenceDiagram
participant App as Application
participant LC as LocalClient
participant Lib as yzma Library
participant Model as GGUF Model
Note over App,Model: First Call (Lazy Initialization)
App->>LC: Embed(ctx, text)
LC->>LC: loadModel() [once.Do]
LC->>Lib: loadLib(libPath) [package-level once]
Lib-->>LC: library loaded
LC->>Lib: llama.ModelLoadFromFile()
Lib->>Model: load model
Model-->>Lib: model handle
Lib-->>LC: model, vocab, nEmbd
Note over LC: Model loaded, proceed with embedding
LC->>LC: mu.Lock()
LC->>Lib: llama.Tokenize(vocab, text)
Lib-->>LC: tokens
LC->>Lib: llama.InitFromModel() [per-call context]
Lib-->>LC: lctx
LC->>Lib: llama.SetEmbeddings(lctx, true)
LC->>Lib: llama.Decode(lctx, batch)
LC->>Lib: llama.GetEmbeddingsSeq(lctx, 0, nEmbd)
Lib-->>LC: rawVec
LC->>LC: normalize(vec)
LC->>Lib: llama.Free(lctx)
LC->>LC: mu.Unlock()
LC-->>App: []float32 embedding
Note over App,Model: Subsequent Calls (Model Already Loaded)
App->>LC: CompareEmbeddings(ctx, a, b)
LC->>LC: Embed(ctx, a) [reuses loaded model]
LC->>LC: Embed(ctx, b) [reuses loaded model]
LC->>LC: CosineSimilarity(embA, embB)
LC-->>App: float64 similarity
internal/llm/local.go
Outdated
| tokens := llama.Tokenize(c.vocab, text, true, true) | ||
|
|
||
| ctxParams := llama.ContextDefaultParams() | ||
| ctxParams.NCtx = uint32(len(tokens) + 16) // fit the text with padding |
There was a problem hiding this comment.
padding of +16 tokens may be insufficient for special tokens and context overhead
Most embedding models require additional space beyond the raw token count for BOS/EOS tokens, padding tokens, and internal context management. The current +16 may cause allocation issues with longer inputs or models with larger special token sets.
| ctxParams.NCtx = uint32(len(tokens) + 16) // fit the text with padding | |
| ctxParams.NCtx = uint32(len(tokens) + 64) // fit the text with padding for special tokens |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
c06d230 to
43db901
Compare
188f32a to
b3af4a6
Compare
43db901 to
4dbaf80
Compare
7023d3d to
ffef5ea
Compare
|
@greptileai review |
Greptile OverviewGreptile SummaryThis PR replaces the previous (CGo-based) local embedding backend with The change fits into the existing llm package by keeping Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| go.mod | Adds hybridgroup/yzma dependency and updates indirect x/* versions for purego local embeddings. |
| go.sum | Updates checksums for new yzma/purego/ffi dependencies and bumps x/* module versions. |
| internal/llm/local.go | Implements yzma-based embedding client with global library init and per-call contexts; has a couple of correctness issues around error handling/cleanup. |
| internal/llm/local_integration_test.go | Adds integration tests behind build tag that exercise Available/Embed/Compare paths when libs and model are provided. |
| internal/llm/local_test.go | Expands unit tests for Available/config fallbacks and verifies error behavior when local backend isn't configured. |
Sequence Diagram
sequenceDiagram
participant Caller
participant LocalClient
participant L as llama (yzma)
Caller->>LocalClient: CompareBehaviors(ctx, a, b)
LocalClient->>LocalClient: CompareEmbeddings(ctx, aText, bText)
LocalClient->>LocalClient: Embed(ctx, aText)
LocalClient->>LocalClient: loadModel() [sync.Once]
LocalClient->>L: Load(libPath) [process once]
LocalClient->>L: Init()
LocalClient->>L: ModelLoadFromFile(modelPath)
LocalClient->>L: ModelGetVocab(model)
LocalClient->>L: ModelNEmbd(model)
LocalClient->>L: Tokenize(vocab, text)
LocalClient->>L: InitFromModel(model, ctxParams)
LocalClient->>L: SetEmbeddings(lctx, true)
LocalClient->>L: BatchGetOne(tokens)
LocalClient->>L: Decode(lctx, batch)
LocalClient->>L: GetEmbeddingsSeq(lctx, 0, nEmbd)
LocalClient->>LocalClient: normalize(vec)
LocalClient->>L: Free(lctx)
LocalClient->>LocalClient: Embed(ctx, bText)
Note over LocalClient: Same per-call context flow
LocalClient->>LocalClient: CosineSimilarity(embA, embB)
LocalClient-->>Caller: ComparisonResult
internal/llm/local.go
Outdated
| llama.SetEmbeddings(lctx, true) | ||
|
|
||
| batch := llama.BatchGetOne(tokens) | ||
| llama.Decode(lctx, batch) |
There was a problem hiding this comment.
Unchecked decode errors
llama.Decode(lctx, batch) returns an error, but the current code ignores it. If decoding fails (e.g., context too small, invalid tokens, model/runtime issues), GetEmbeddingsSeq may return an error or (worse) succeed with undefined/garbage embeddings. Please capture and propagate the decode error before proceeding.
| func (c *LocalClient) Close() error { | ||
| c.mu.Lock() | ||
| defer c.mu.Unlock() | ||
|
|
||
| if c.loaded { | ||
| llama.ModelFree(c.model) | ||
| c.loaded = false | ||
| } |
There was a problem hiding this comment.
Close doesn’t reset handle
After llama.ModelFree(c.model), c.model remains non-zero and c.once prevents re-loading. If the client is closed and then used again, loadModel() will not run and Embed() will try to create a context from a freed model handle. Consider resetting c.model/c.vocab/c.nEmbd and/or allowing loadModel() to run again after Close() so post-close usage fails cleanly or reloads safely.
4dbaf80 to
62e320c
Compare
ffef5ea to
c19e4dc
Compare
Replace llama-go CGo bindings with hybridgroup/yzma, which uses purego to load llama.cpp shared libraries at runtime. No CGo, no build tags, no C++ compiler needed — go build always works. - Single local.go file (no stub), package-level sync.Once for lib init - Per-call context creation in Embed() for simplicity - Available() checks lib dir + model file via os.Stat - Integration tests gated on FLOOP_TEST_LIB_PATH + FLOOP_TEST_MODEL_PATH Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase token context padding from +16 to +64 for special tokens - Check llama.Decode() error return instead of ignoring it - Reset model/vocab/nEmbd handles and sync.Once in Close() so post-close usage fails cleanly or reloads safely Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
35c084e to
b81226c
Compare
|
@greptileai review |
Greptile OverviewGreptile SummaryThis PR successfully replaces the CGo-based llama-go with hybridgroup/yzma (purego) for local LLM embeddings, eliminating C++ compiler dependencies while maintaining full functionality. Key changes:
Build improvement: Confidence Score: 4/5
|
| Filename | Overview |
|---|---|
| internal/llm/local.go | Implements full local embedding client with yzma (purego); addresses all previous review feedback (decode errors, close handling, token padding) |
| internal/llm/local_test.go | Expanded unit tests for config validation, path resolution, error cases, and YZMA_LIB fallback |
| internal/llm/local_integration_test.go | New integration tests for embedding, comparison, and behavior operations with real GGUF models |
Sequence Diagram
sequenceDiagram
participant Client as LocalClient
participant Lib as yzma/llama
participant Model as GGUF Model
Note over Client: First call to Embed()
Client->>Client: loadModel() with sync.Once
Client->>Lib: Load(libPath) - package-level once
Lib-->>Client: OK
Client->>Lib: Init() - process global
Client->>Lib: ModelLoadFromFile(path, params)
Lib->>Model: Load GGUF
Model-->>Lib: Model handle
Lib-->>Client: model, vocab, nEmbd
Note over Client: Per-call embedding context
Client->>Client: Acquire mutex
Client->>Lib: Tokenize(vocab, text)
Lib-->>Client: tokens[]
Client->>Lib: InitFromModel(model, ctxParams)
Lib-->>Client: lctx (context handle)
Client->>Lib: SetEmbeddings(lctx, true)
Client->>Lib: Decode(lctx, batch)
Lib->>Model: Process tokens
Model-->>Lib: OK
Client->>Lib: GetEmbeddingsSeq(lctx, 0, nEmbd)
Lib-->>Client: rawVec (float32[])
Client->>Client: Copy + L2 normalize
Client->>Lib: Free(lctx)
Client->>Client: Release mutex
Client-->>Client: Return embedding vector
Note over Client: On Close()
Client->>Lib: ModelFree(model)
Client->>Client: Reset once for reload
| select { | ||
| case <-ctx.Done(): | ||
| return nil, ctx.Err() | ||
| default: | ||
| } |
There was a problem hiding this comment.
Context check happens before acquiring mutex. If blocked waiting for c.mu.Lock() and context cancelled during that wait, cancellation won't be detected until after lock acquired. Consider checking context again after lock or using select with ctx.Done() for long operations.
- Fix gofmt alignment in LocalClient struct fields - Check llama.Free and llama.ModelFree error returns (errcheck) - Suppress gosec G115 for safe int->int32/uint32 conversions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change GPULayers field type to int32 (matches llama API) - Exclude gosec G115 (integer overflow) — no dataflow analysis, false positives on bounded values like len() - Remove unnecessary math import and bounds check wrappers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| if v := os.Getenv("FLOOP_LOCAL_GPU_LAYERS"); v != "" { | ||
| if n, err := strconv.Atoi(v); err == nil { | ||
| config.LLM.LocalGPULayers = n | ||
| config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32)) |
Check failure
Code scanning / CodeQL
Incorrect conversion between integer types High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 9 days ago
In general, to avoid incorrect conversions between integer types, you should either parse directly into the target sized type using strconv.ParseInt/ParseUint with the correct bit size, or ensure you perform explicit bounds checks against the target type’s min/max constants before casting.
For this specific code, the best fix without changing functionality is:
- Replace
strconv.Atoiwithstrconv.ParseIntusing base 10 and bit size 32. - Use the returned
int64value as already guaranteed to be within theint32range (because we pass 32 as the bit size), then cast toint32. - Apply the existing
minlimiting logic on theint32value (or on the parsedint64if we ensure it’s withinint32) without relying onAtoi’s architecture-dependentint.
Concretely, in internal/config/config.go around lines 242–245:
- Change
n, err := strconv.Atoi(v)toparsed, err := strconv.ParseInt(v, 10, 32). - Replace
config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32))with a form that:- First converts
parsedtoint32. - Then applies
minwith anint32-typedmath.MaxInt32(or an equivalent constant), so we aren't mixing different integer types.
- First converts
Because we must not modify code outside the shown snippets or invent unseen helpers, we’ll perform the clamp after converting to int32 and use a correctly typed int32 max value. No new imports are needed; math and strconv are already imported.
| @@ -240,8 +240,10 @@ | ||
| config.LLM.LocalEmbeddingModelPath = v | ||
| } | ||
| if v := os.Getenv("FLOOP_LOCAL_GPU_LAYERS"); v != "" { | ||
| if n, err := strconv.Atoi(v); err == nil { | ||
| config.LLM.LocalGPULayers = int32(min(n, math.MaxInt32)) | ||
| if parsed, err := strconv.ParseInt(v, 10, 32); err == nil { | ||
| n32 := int32(parsed) | ||
| const maxInt32 int32 = math.MaxInt32 | ||
| config.LLM.LocalGPULayers = min(n32, maxInt32) | ||
| } | ||
| } | ||
| if v := os.Getenv("FLOOP_LOCAL_CONTEXT_SIZE"); v != "" { |
The IntentMatch threshold (>0.8) is too strict for small embedding models like all-MiniLM-L6-v2. Use MergeCandidate (>0.7) which is the actionable threshold for dedup and works across model sizes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
local llm? seems like a bitter lesson tbh |
|
this is gonna close soon in favor of new prs that will implement dedupe and also emebeddings for a vector search system to help bootstrap association and edges prior to them being learned through activation. |
Summary
local.gofile — no build tags, no stub file, no C++ compiler neededgo buildalways works; shared libs only needed at runtime (yzma install --lib)sync.Oncefor library init, instance-levelsync.Oncefor model loadingAvailable()does cheapos.Statchecks without loading any librariesStacked on
feat/local-llm-foundations-v2) — pure Go foundationsrevert/local-llm-llamago) — revert llama-goTest plan
go build ./cmd/floop— builds cleanly (no yzma libs needed at build time)go vet ./...— no issuesgo test ./...— all 25 packages passFLOOP_TEST_LIB_PATH=/path/to/libs FLOOP_TEST_MODEL_PATH=/path/to/model.gguf go test -tags integration ./internal/llm/ -v🤖 Generated with Claude Code