Conversation
How is this going to interact with the existing gemm / fused_gemm code? Are you going to create a calling convention and expect the existing generators to provide GEMM / fused-GEMM kernels that work on a single matrix chain multiplication at a time instead of a batch of matrix chains? Keep in mind that chainforge and tinytc have restrictions on the work-group size (thread-block size), which may depend on the architecture and the problem size. So when fusing those kernels into a single one, combined with generic code from yateto, you need to ensure that you use the same work-group size for all kernels.
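To illustrate the work-group-size constraint: when several generated sub-kernels are fused into one kernel, the fused launch must use a work-group size that every sub-kernel accepts. A minimal sketch (the function name and the supported-size lists are hypothetical, not taken from chainforge or tinytc):

```python
def common_work_group_size(supported_sizes_per_kernel):
    """Intersect the work-group sizes supported by all sub-kernels
    and pick the largest common one; fail if no size fits them all."""
    common = set(supported_sizes_per_kernel[0])
    for sizes in supported_sizes_per_kernel[1:]:
        common &= set(sizes)
    if not common:
        raise ValueError("no work-group size is valid for all fused kernels")
    return max(common)

# A GEMM kernel tuned for 64 or 128 threads, fused with generic code
# that runs at 32, 64, or 128 threads, must launch with 64 or 128:
print(common_work_group_size([[64, 128], [32, 64, 128]]))  # -> 128
```

In practice the constraint may be a divisibility or range condition rather than a finite set, but the fusion step has to resolve it to one concrete launch configuration either way.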
Firstly, to answer the issues (note that the branch is still WIP):
Having the kernel handle only one tensor contraction, instead of a whole batch, also feels more in line with the original "spirit" of Yateto (at least as I interpret it): as a tool alongside other code in a kernel, not a full kernel next to other kernels. That may be debatable, but there is a real difference between including the launch code (the "parallel for") and not including it. If someone needs more reasons for this PR:

(a) it is an intermediate way to harness fast small GETTs/block-level kernels inside kernels which can/will be fused later (but aren't yet, due to time constraints), and
(b) it is a way to harness fast small GETTs/block-level kernels in more complex kernels than we currently plan. For example, together with some grid syncs and an nv/roc/intel shmem implementation, one could maybe even go for one big permanent kernel per cluster, as a faint idea. I would be really surprised if we got Yateto to cover all of that on its own. (Side note to self: kernel-in-kernel/dynamic-parallelism launch code might be another small interesting thing to implement.)

Also, the current "GPU" kernels (which will probably be horribly slow) should just provide a bare baseline, so that e.g. SeisSol at least compiles without any additional codegen. That can already help when a codegen is broken. Replacing all kernels by Yateto/batch calls would of course be great; cf. the TensorForge-develop branch in SeisSol, where the Imposed-Slip-Rates DR could already be generated from a Python description and yields (mostly) correct results.

In principle, the whole thing here is partially inspired by cuBLASDx, which strives for such an interface for GEMMs (in almost LIBXSMM-esque fashion) in an in-kernel environment for NVIDIA GPUs. And it (surprisingly) hasn't been too hard to implement a similar mechanism in Yateto right now, probably minus performance.
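To make the calling convention concrete: the idea described above amounts to generating a device-side function that performs a single contraction, leaving the batch loop and the launch configuration to the surrounding kernel. A rough codegen sketch, loosely inspired by the cuBLASDx style; the function name, operand names, and emitted signature are illustrative, not Yateto's actual interface:

```python
def emit_device_contraction(name, operands):
    """Emit a __device__ function computing one tensor contraction.
    All inputs but the last operand are const; the caller owns the
    batch loop, the launch, and the work-group size."""
    inputs, output = operands[:-1], operands[-1]
    params = ", ".join(f"const float* {o}" for o in inputs)
    params += f", float* {output}"
    return (f"__device__ void {name}({params}) {{\n"
            f"  // generated contraction body: all threads of the\n"
            f"  // work group cooperate on this single contraction\n"
            f"}}\n")

print(emit_device_contraction("contract_AB", ["A", "B", "C"]))
```

The key design point is what is absent: no `__global__` wrapper and no "parallel for" over batch elements, so the generated function can be fused into arbitrary caller kernels.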
Yes, that is how it is intended, and it is also a major necessity to avoid data movement and get a good AI (arithmetic intensity, not the other thing :-D). The problem is that on GPUs this isn't as easy as on CPUs...
I see the point of the PR; I just wanted to point out some difficulties one is going to encounter, in particular w.r.t. work-group size constraints.
We introduce `igpu` (instead of `gpu` or `cpu`). All of this is still very WIP and probably pretty slow, hence a draft. (The only "performance" benefits at the moment are that multiple threads are used for some computations, and that shared memory is used for intermediate results.)