Conversation

@leejet (Owner) commented Dec 12, 2025

No description provided.

if (dim != 3) {
    x = ggml_ext_torch_permute(ctx, x, perm[0], perm[1], perm[2], perm[3]);
    x = ggml_cont(ctx, x);
if (cont) {
@leejet (Owner Author) commented:

Perhaps we can remove ggml_cont here, but I haven’t fully verified it yet, so I’ll keep ggml_cont here for now.

@stduhpf (Contributor) commented Dec 12, 2025:

Wouldn't something like this work in all cases? It seems to work for unet models at least. The code is simpler and it seems (slightly) faster than using permutations.

// Split x into `num` equal chunks along dimension `dim`, using views with the
// original strides instead of a permute + cont round-trip.
__STATIC_INLINE__ std::vector<struct ggml_tensor*> ggml_ext_chunk(struct ggml_context* ctx,
                                                                  struct ggml_tensor* x,
                                                                  int num,
                                                                  int64_t dim,
                                                                  bool cont = true) {
    GGML_ASSERT(dim >= 0 && dim < 4);
    GGML_ASSERT(x->ne[dim] % num == 0);

    std::vector<struct ggml_tensor*> chunks;
    int64_t chunk_size  = x->ne[dim] / num;
    int64_t stride      = chunk_size * x->nb[dim];  // byte offset between consecutive chunks
    int64_t chunk_ne[4] = {x->ne[0], x->ne[1], x->ne[2], x->ne[3]};
    chunk_ne[dim]       = chunk_size;
    for (int i = 0; i < num; i++) {
        // each chunk is a view into x: same strides, shifted by i chunks along `dim`
        auto chunk = ggml_view_4d(
            ctx, x,
            chunk_ne[0], chunk_ne[1], chunk_ne[2], chunk_ne[3],
            x->nb[1], x->nb[2], x->nb[3], stride * i);
        if (cont) {
            chunk = ggml_cont(ctx, chunk);
        }
        chunks.push_back(chunk);
    }

    return chunks;
}
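
A call site could then look something like this (a minimal sketch; the tensor name qkv_proj and the split along dim 0 are illustrative assumptions, not code from this PR):

// hypothetical usage: split a packed projection into 3 equal chunks along dim 0
std::vector<struct ggml_tensor*> qkv = ggml_ext_chunk(ctx, qkv_proj, 3, 0);
struct ggml_tensor* q = qkv[0];
struct ggml_tensor* k = qkv[1];
struct ggml_tensor* v = qkv[2];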

@stduhpf (Contributor) commented:

Maybe it's not really faster; it seems to be within the margin of error for run-to-run variation.

@stduhpf (Contributor) commented:

Also, it doesn't cause #1080 (comment).

@leejet (Owner Author) commented:

@stduhpf This change looks like a simpler optimization. Could you open a separate PR for it? I’ll close this PR.

@wbruna (Contributor) commented Dec 12, 2025:

0835e5c broke sd1.5:

[comparison images: master-408 (teste_1765561010) vs 0835e5c (teste_1765560820)]

@stduhpf (Contributor) commented Dec 12, 2025:

@wbruna, Oh, you're right, I was only looking at the speed.

@daniandtheweb (Contributor) commented:

> 0835e5c broke sd1.5:

Same on SDXL.

@wbruna (Contributor) commented Dec 12, 2025:

Testing each version on SD1.5: when compared with 59ebdf0, #1079 seems almost as fast on Vulkan, and around 9% slower on ROCm. The ggml_ext_chunk suggested above is ~3-4% slower on both:

version                         Vulkan      ROCm
59ebdf0                         2.65s/it    2.34s/it
347710f (and current master)    3.65s/it    3.44s/it
ggml_ext_chunk above            2.75s/it    2.41s/it
#1079                           2.69s/it    2.54s/it

@leejet (Owner Author) commented Dec 13, 2025:

> 0835e5c broke sd1.5:
> [comparison images: master-408 vs 0835e5c]

It looks like the implementations of the CUDA backend and the Vulkan backend are a bit different. I was able to reproduce it with the Vulkan backend as well, but everything works fine with the CUDA backend.
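
A minimal sketch of how one might narrow this down (not code from this PR; it assumes a chunks vector like the one returned by ggml_ext_chunk above): check whether any chunk handed to a later op is still a non-contiguous view, since backends can differ in which ops accept non-contiguous inputs.

// hedged debugging sketch: report chunks that are still non-contiguous views
for (struct ggml_tensor* t : chunks) {
    if (!ggml_is_contiguous(t)) {
        fprintf(stderr, "non-contiguous chunk: %s\n", ggml_get_name(t));
    }
}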
