Skip to content

feat: add euclidean one2many#188

Open
richyreachy wants to merge 26 commits intomainfrom
feat/euclidean_one2many
Open

feat: add euclidean one2many#188
richyreachy wants to merge 26 commits intomainfrom
feat/euclidean_one2many

Conversation

@richyreachy
Copy link
Collaborator

@richyreachy richyreachy commented Mar 1, 2026

add euclidean one2many

Greptile Summary

This PR introduces Euclidean one-to-many batch distance kernels (EuclideanDistanceBatch and SquaredEuclideanDistanceBatch) for FP32 and FP16 across AVX2, AVX512F, and AVX512FP16 ISA tiers, wires them into EuclideanMetric::batch_distance() and upgrades SquaredEuclideanMetric from BatchSize=1 to BatchSize=12. Several correctness issues from prior review rounds have been addressed (masked-tail formula, swapped variable names, UB in shift expression), though a few regressions introduced alongside the new feature — removal of HammingMetric::batch_distance(), removal of DT_BINARY32/DT_BINARY64 from SquaredEuclideanMetric, and two disabled tests — remain unresolved.

  • New issue — wrong prefetch target in fallback: compute_one_to_many_squared_euclidean_fallback calls ailego_prefetch(&prefetch_ptrs[j]), which prefetches the stack address of the pointer array element rather than the vector data. Every SIMD path uses ailego_prefetch(prefetch_ptrs[i] + dim) for the actual data address; the fallback should follow the same pattern.
  • Dead code in distance_batch_math.h: sum4() and sum_top_bottom_avx() are defined and included in two translation units but are never called; the implementations use HorizontalAdd_FP32_V256/HorizontalAdd_FP32_V512 from matrix_utility.i instead.
  • Prior unresolved regressions (Hamming batch_distance() removal, binary-type cases removed from SquaredEuclideanMetric, tests silently disabled, off-by-one in euclidean_distance_batch_impl_fp16_avx512.cc) were flagged in earlier review rounds and are still present.

Confidence Score: 2/5

  • Not safe to merge — the new fallback path has an incorrect prefetch, and previously flagged regressions (null batch_distance() for Hamming, disabled tests masking the regression) remain unaddressed.
  • Several correctness issues flagged in earlier review rounds (Hamming batch_distance removal, DT_BINARY32/64 removal, off-by-one in AVX512 FP16 partial block) are still present. A new bug introduced in this version — prefetching the wrong address in the scalar fallback — makes the fallback's prefetch hint ineffective. Two tests covering the affected code paths are silently disabled with no explanation.
  • src/ailego/math_batch/euclidean_distance_batch.h (wrong prefetch), src/core/metric/hamming_metric.cc and src/core/metric/euclidean_metric.cc (removed binary-type support), tests/core/algorithm/hnsw/hnsw_streamer_test.cc (disabled tests).

Important Files Changed

Filename Overview
src/ailego/math_batch/distance_batch_math.h New helper file providing sum4() and sum_top_bottom_avx() but neither function is actually called by any file that includes this header — dead code.
src/ailego/math_batch/euclidean_distance_batch.h New core dispatch header defining SquaredEuclideanDistanceBatch and EuclideanDistanceBatch templates; contains a bug where ailego_prefetch(&prefetch_ptrs[j]) prefetches the array's stack address instead of the pointed-to data.
src/ailego/math_batch/euclidean_distance_batch_impl_fp16_avx512.cc AVX512F FP16 kernel; 16-element partial block formula is now correct; however the 8-element partial-block guard still uses strict < (off-by-one, previously flagged).
src/core/metric/euclidean_metric.cc Adds EuclideanMetric::batch_distance() with BatchSize=12 and bumps SquaredEuclideanMetric from BatchSize=1 to 12; removes DT_BINARY32/DT_BINARY64 cases from SquaredEuclideanMetric (previously flagged regression).
src/core/metric/hamming_metric.cc Removes HammingMetric::batch_distance() override entirely; callers will now silently receive nullptr from the base class (previously flagged regression).
tests/core/algorithm/hnsw/hnsw_streamer_test.cc TestBinaryConverter and TestBasicRefiner silently disabled with #if 0 and no explanation; masks the regression caused by removing Hamming batch-distance support (previously flagged).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["BaseDistance::ComputeBatch()"] --> B{DistanceType?}
    B -- EuclideanDistanceMatrix --> C["EuclideanDistanceBatch::ComputeBatch()"]
    B -- SquaredEuclideanDistanceMatrix --> D["SquaredEuclideanDistanceBatch::ComputeBatch()"]
    B -- Other --> E["_ComputeBatch() (existing)"]

    C --> D
    C --> F["sqrt(results[i]) ∀ i"]

    D --> G{num_vecs loop\nbatches of BatchSize}
    G --> H["SquaredEuclideanDistanceBatchImpl\n::compute_one_to_many()"]
    G --> I["Tail: BatchSize=1 per vector"]

    H --> J{CPU dispatch}
    J -- AVX512FP16 --> K["avx512fp16_fp16 kernel"]
    J -- AVX512F --> L["avx512f_fp16 / fp32 kernel"]
    J -- AVX2 --> M["avx2_fp16 / fp32 kernel"]
    J -- Fallback --> N["compute_one_to_many\n_squared_euclidean_fallback()\n⚠️ wrong prefetch address"]
Loading

Comments Outside Diff (1)

  1. tests/core/algorithm/hnsw/hnsw_streamer_test.cc, line 3500 (link)

    P2 Tests disabled with #if 0

    TestBinaryConverter and TestBasicRefiner are both silenced with #if 0 in this PR. Permanently disabling tests with preprocessor guards makes failures invisible and can hide regressions. If these tests are failing due to the binary-type removals in hamming_metric.cc / euclidean_metric.cc, the root cause should be fixed rather than the tests suppressed. If the tests are being temporarily skipped for another reason, using DISABLED_ as a Google Test prefix is the conventional way to do this without removing coverage entirely.

Reviews (7): Last reviewed commit: "fix: remove unnecessary file" | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@richyreachy richyreachy requested a review from iaojnh March 1, 2026 14:47
@greptile-apps
Copy link

greptile-apps bot commented Mar 1, 2026

Greptile Summary

This PR adds Euclidean one-to-many distance computation with SIMD optimizations (AVX2, AVX512F, AVX512FP16) for float32, float16, and int8 data types. However, multiple critical bugs exist in the new implementation files that will cause incorrect distance calculations:

  • euclidean_distance_batch_impl.h: Wrong FMA operation in AVX512F masked tail (line 81) computes dot product instead of squared difference; AVX2 switch statement (lines 128-148) missing dimension offsets
  • euclidean_distance_batch_impl_fp16.h: AVX512F implementation (line 139) computes dot product instead of squared distance
  • euclidean_distance_batch_impl_int8.h: Incorrect pointer arithmetic (line 74) and completely broken switch statement (lines 93-136) with wrong macro usage and array indices

These bugs were already identified in previous review threads and remain unfixed. Additionally:

  • Removed incorrect Hamming distance cases from SquaredEuclideanMetric::batch_distance()
  • Added batch_distance() method to EuclideanMetric
  • Minor formatting improvements across several files

Confidence Score: 0/5

  • This PR is NOT safe to merge - contains multiple critical bugs that will produce incorrect results
  • Score of 0 reflects multiple critical algorithmic errors in the core distance computation logic that will cause incorrect calculations across all three implementation files (fp32, fp16, int8). These bugs were previously identified but remain unfixed, and will result in wrong distance values being returned to callers.
  • All three implementation files require immediate attention: euclidean_distance_batch_impl.h, euclidean_distance_batch_impl_fp16.h, and euclidean_distance_batch_impl_int8.h

Important Files Changed

Filename Overview
src/ailego/math_batch/euclidean_distance_batch_impl.h New file with critical bugs in AVX512F masked tail handling (wrong FMA operation) and AVX2 switch statement (missing dim offsets)
src/ailego/math_batch/euclidean_distance_batch_impl_fp16.h New file with critical bug in AVX512F implementation (dot product instead of squared distance at line 139)
src/ailego/math_batch/euclidean_distance_batch_impl_int8.h New file with critical bugs in pointer arithmetic (line 74) and switch statement (lines 91-136: wrong macro usage with pointers and incorrect array indices)
src/ailego/math_batch/euclidean_distance_batch.h New file defining batch distance computation structure, looks correct but depends on buggy implementation files
src/ailego/math_batch/distance_batch.h Added Euclidean and SquaredEuclidean distance batch handlers using constexpr type checks
src/core/metric/euclidean_metric.cc Added batch_distance method to EuclideanMetric and removed incorrect Hamming distance cases from SquaredEuclideanMetric

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BaseDistance::ComputeBatch] --> B{Distance Type?}
    B -->|Euclidean| C[EuclideanDistanceBatch::ComputeBatch]
    B -->|SquaredEuclidean| D[SquaredEuclideanDistanceBatch::ComputeBatch]
    B -->|Other| E[Default _ComputeBatch]
    
    D --> F{Data Type?}
    F -->|float| G[SquaredEuclideanDistanceBatchImpl float]
    F -->|int8_t| H[SquaredEuclideanDistanceBatchImpl int8_t]
    F -->|Float16| I[SquaredEuclideanDistanceBatchImpl Float16]
    F -->|Other| J[Fallback Implementation]
    
    G --> K{CPU Features?}
    K -->|AVX512F| L[compute_one_to_many_squared_euclidean_avx512f_fp32]
    K -->|AVX2| M[compute_one_to_many_squared_euclidean_avx2_fp32]
    K -->|None| J
    
    H --> N{CPU Features?}
    N -->|AVX2| O[compute_one_to_many_squared_euclidean_avx2_int8]
    N -->|None| J
    
    I --> P{CPU Features?}
    P -->|AVX512FP16| Q[compute_one_to_many_squared_euclidean_avx512fp16_fp16]
    P -->|AVX512F| R[compute_one_to_many_squared_euclidean_avx512f_fp16]
    P -->|AVX2| S[compute_one_to_many_squared_euclidean_avx2_fp16]
    P -->|None| J
    
    C --> T[Call SquaredEuclideanDistanceBatch]
    T --> U[Apply sqrt to results]
    
    style L fill:#f99,stroke:#333,stroke-width:2px
    style M fill:#f99,stroke:#333,stroke-width:2px
    style O fill:#f99,stroke:#333,stroke-width:2px
    style Q fill:#f99,stroke:#333,stroke-width:2px
    style R fill:#f99,stroke:#333,stroke-width:2px
    style S fill:#f99,stroke:#333,stroke-width:2px
Loading

Last reviewed commit: b766fb1

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

1 similar comment
@richyreachy
Copy link
Collaborator Author

@greptile

richyreachy and others added 2 commits March 23, 2026 16:16
…2fp16.cc

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

_mm512_castps512_ps256(accs[i]),
_mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(accs[i]), 1)));
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Off-by-one in 8-element partial-block boundary condition

The condition dim + 8 < dimensionality uses strict less-than, which means the 8-element AVX2 SIMD block is skipped when exactly 8 FP16 elements remain after the 16-element partial-block. In that case all 8 elements fall through to the scalar loop, silently degrading throughput for any dimension that satisfies dimensionality % 16 == 8 after the dim + 16 <= dimensionality pass (e.g., dim=24, 40, 56 …).

This is the same off-by-one pattern already present in inner_product_distance_batch_impl_fp16.h (AVX512F and AVX2 partial blocks). The fix is to use <=:

Suggested change
if (dim + 8 <= dimensionality) {

@richyreachy
Copy link
Collaborator Author

@greptile

Comment on lines +40 to +42
}
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Prefetch of wrong address in fallback path

ailego_prefetch(&prefetch_ptrs[j]) prefetches the address of the j-th element within the stack-allocated prefetch_ptrs array itself — not the vector data the pointer refers to. This makes the prefetch completely ineffective. All the SIMD implementations use ailego_prefetch(prefetch_ptrs[i] + dim) to prefetch the actual data.

Additionally, there is no nullptr guard here, while every SIMD path checks if (prefetch_ptrs[0]) before issuing prefetches. Passing a null pointer to ailego_prefetch is typically a no-op, but the intent is clearly wrong.

Suggested change
}
}
ailego_prefetch(prefetch_ptrs[j]);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants