Add optional __declspec(dllimport) to amd_* libm functions, for faster speed #37

Open
leekillough wants to merge 1 commit into amd:dev from leekillough:fix_dllimport_slowdown

@leekillough

This PR adds an optional decoration that marks the public AOCL-LibM entry points as imported from libalm.dll on Windows. Without this attribute, taking the address of an amd_* function (or storing it in a function pointer used in a hot loop) captures the address of the local import thunk, so every call pays an extra indirect jmp through the IAT. For sub-3 ns functions like amd_expf, that extra hop is ~15-30% of the per-call cost. Decorating the prototypes with __declspec(dllimport) tells MSVC / clang-cl to emit the optimized movq __imp_* form, which reads the IAT slot once and stores the actual function address, giving a single indirect call per invocation.
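The thunk-versus-IAT distinction described above can be modeled in portable C. The sketch below is only a simulation: `real_impl`, `iat_slot`, `import_thunk`, `capture_undecorated`, `capture_dllimport`, and `call_n` are hypothetical names that are not part of AOCL-LibM, and the real extra hop is a `jmp` in a linker-generated stub rather than a C function.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*unary_fn)(float);

/* Stand-in for the routine that lives inside the DLL (hypothetical). */
float real_impl(float x) { return 2.0f * x; }

/* Simulated IAT slot; plays the role of __imp_amd_expf. */
unary_fn iat_slot = real_impl;

/* Counts how many calls were routed through the simulated thunk. */
unsigned long thunk_hops = 0;

/* Simulated import thunk: forwards through the slot, one extra hop. */
float import_thunk(float x) {
    ++thunk_hops;
    return iat_slot(x);
}

/* Models an undecorated prototype: the pointer is the thunk's address. */
unary_fn capture_undecorated(void) { return import_thunk; }

/* Models a dllimport prototype: the pointer is the value in the slot. */
unary_fn capture_dllimport(void) { return iat_slot; }

/* Hot loop that calls through a stored function pointer n times. */
float call_n(unary_fn f, int n) {
    assert(f != NULL);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += f(1.0f);
    }
    return acc;
}
```

Passing `capture_undecorated()` to `call_n` increments `thunk_hops` once per iteration, while passing `capture_dllimport()` leaves it untouched; both return the same sum. This models the per-call hop that the decoration is meant to remove.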

This is opt-in (the default is unchanged, with no decoration):

Define ALM_DLLIMPORT before #include <amdlibm.h> when linking dynamically against libalm.dll on Windows to enable the faster call sequence. Safe to leave undefined: behavior is identical to previous releases (no breaking change for callers that link against libalm-static.lib or that depend on the existing codegen).

On non-Windows platforms, the ALM_API decorator is always empty, regardless of ALM_DLLIMPORT.
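A minimal sketch of how such an opt-in decorator could be laid out in a header is shown below. The names ALM_API and ALM_DLLIMPORT come from this description, but the `_WIN32` guard, the `ALM_STR` helper macros, and the `alm_api_expansion` / `alm_check_decoration` functions are assumptions added for illustration; the actual header in the commit may be arranged differently.

```c
#include <assert.h>
#include <string.h>

/* Assumed layout: decorate only on Windows, and only when the caller
   opted in by defining ALM_DLLIMPORT before including the header. */
#if defined(_WIN32) && defined(ALM_DLLIMPORT)
#  define ALM_API __declspec(dllimport)
#else
#  define ALM_API /* empty: same declarations as before */
#endif

/* Example of a decorated prototype (declaration only, never called here). */
ALM_API float amd_expf(float x);

/* Hypothetical helpers that expose what ALM_API expands to in this build. */
#define ALM_STR2(x) #x
#define ALM_STR(x) ALM_STR2(x)

const char *alm_api_expansion(void) {
    return ALM_STR(ALM_API);
}

/* Sanity check: the decorator is either empty or the dllimport attribute. */
void alm_check_decoration(void) {
    const char *s = alm_api_expansion();
    assert(strcmp(s, "") == 0 || strcmp(s, "__declspec(dllimport)") == 0);
}
```

Under these assumptions, a Windows build that links against libalm.dll would add -DALM_DLLIMPORT to the compile line, and every other build would see an empty ALM_API.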


Opt-in was chosen deliberately to keep the change backward-compatible: existing customers (including all libalm-static.lib users) get bit-identical codegen unless they explicitly add -DALM_DLLIMPORT. The exports themselves stay in scripts/libalm.def, so no __declspec(dllexport) is needed in the header.

Verification on libm_microbench (1-run sanity, build_dllimport/, millions of calls per second):

        func       UCRT     AOCL_WIN   AWD old    AWD new
        expf       476.6M   464.7M     293.3M     454.5M   <- recovered
        log2f      473.1M   466.4M     361.7M     464.7M   <- recovered
        expm1      291.3M   335.4M     255.2M     340.3M   <- recovered
        log2       319.4M   454.6M     400.2M     407.0M   <- partial
        log1p      235.1M   325.2M     287.5M     313.7M   <- partial
        hypot      285.0M   227.8M     190.1M     221.7M   <- partial
        remainder  170.1M   252.5M     209.3M     229.1M   <- partial
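One way to put a number on the recovered / partial labels is to ask what fraction of the gap between the old and the target throughput the new build closes. The sketch below assumes that the AOCL_WIN column is the target and that the AWD old / AWD new columns are the before and after figures; that reading of the table, and the helper name `recovered_fraction`, are assumptions rather than something the benchmark itself reports.

```c
#include <assert.h>

/* Fraction of the (target - old) throughput gap closed by the new build.
   All arguments are in millions of calls per second.
   0.0 means no improvement; 1.0 means the new build matches the target. */
double recovered_fraction(double old_mcps, double new_mcps, double target_mcps) {
    assert(target_mcps != old_mcps); /* avoid dividing by a zero gap */
    return (new_mcps - old_mcps) / (target_mcps - old_mcps);
}
```

For the expf row, (454.5 - 293.3) / (464.7 - 293.3) is about 0.94, while for the log2 row, (407.0 - 400.2) / (454.6 - 400.2) is 0.125, which lines up with the recovered and partial labels.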

The fp32 B-mechanism cluster (sinf, cosf, log10f, cbrt, pow) was unchanged, as predicted - those gaps live in libalm.dll itself, not the dispatch path. lround did NOT recover (321M -> 325M); worth a closer look in a follow-up. The disassembly flip is confirmed at the .obj level:

leaq amd_expf (captures the thunk address) -> movq __imp_amd_expf (loads the actual function address from the IAT slot).
