Skip to content
This repository was archived by the owner on Aug 2, 2019. It is now read-only.

Latest commit

 

History

History
376 lines (276 loc) · 15.1 KB

File metadata and controls

376 lines (276 loc) · 15.1 KB

Native Interface

This chapter defines the Mu native interface.

NOTE: The term foreign function interface may have been used in many other places by experienced VM engineers to mean a heavy-weighted complex interface with another language. JikesRVM users use foreign function interface to refer to JNI and use syscall to refer to a light-weight unsafe mechanism to call arbitrary C functions with minimal overhead.

The CCALL instruction is more similar to the latter. It has minimum overhead, but provides no protection to malicious code. So it must be used with care.

To reduce confusion, we use the term unsafe native interface or just native interface instead of foreign function interface.

The native interface is a light-weight unsafe interface through which Mu IR programs communicate with native programs.

NOTE: This has no direct relationship with the Mu client interface.

  • Native programs are usually written in C, C++ or other low-level languages and usually does not run on VMs.
  • A Mu client is not necessary a native program. The client can be written in a managed language, running in a VM, running in the same Mu VM as user-level programs (i.e. a "metacircular" client), or living in a different process or even a different computer, communicating with Mu using sockets.

However, it does not rule out the possibility to implement the Mu client interface for native programs via this native interface.

The main purpose of the native interface is

  1. to interoperate with the operating system by invoking system libraries (including system calls), and
  2. to interoperate with libraries written in other programming languages.
NOTE: The purpose of the Mu client interface is to let the client control the Mu micro VM and handle events. The native interface is not about "controlling Mu".

It is not a purpose to interface with arbitrary native libraries. This interface should be minimal but just enough to handle most common system calls (e.g. open, read, write, close, ...) and common native libraries. Complex data types and functions (e.g. those with unusual size/alignment requirements or calling conventions) may require wrapper code provided by the language implementer.

The native interface is not required to be safe. The overhead of this interface should be as low as possible. It is the client's responsibility to implement things like JNI on top of this interface.

For JikesRVM users: The native interface includes raw memory access which is similar to "vmmagic" and the CCALL instruction is more like the "syscall" mechanism. They are not safe, but highly efficient and should be used with care.
NOTE: Directly making system calls from Mu and bypassing the C library (libc) is theoretically possible, but is not a mainstream way to do so. It has a lower priority in the design.

Outline

This interface has several aspects:

  1. Raw memory access: This interface provides pointer types and directly access the memory via pointers.
  2. Exposing Mu memory to the native world: This allows native programs to access Mu memory in a limited fashion.
  3. Native function call: This interface provides a mechanism to call a native function using a native calling convention.
  4. Callback from native programs: This interface will enable calling back from the native program.
  5. Inline assembly: Directly inserting machine-dependent instructions into a Mu IR function.

Raw Memory Access

This section defines mechanisms for raw memory access. Pointers give Mu programs access to the native (raw) memory, while pinning gives native programs access to the Mu memory.

Pointers

A pointer is an address in the memory space of the current process. A pointer can be a data pointer (type uptr<T>) or function pointer (type ufuncptr<sig>). The former assumes a data value is stored in a region beginning with the address. The latter assumes a piece of executable machine code is located at the address.

uptr<T>, ufuncptr<sig> and int<n>, where T is a type, sig is a function signature, can be cast to each other using the PTRCAST instruction. The address is preserved and the int<n> type has the numerical value of the address. Type checking is not performed.

Potential problem: There may be machines where data pointers have a different size from function pointers, but I have never seen one.

For C users: C spec never defined pointers as addresses. C pointers can point to either objects (region of storage) or functions. Casting between object pointers, function pointers and integers has implementation-defined behaviours.

There are segmented architectures, including x86, whose "pointers" are segments + offsets. However, apparently the trend is to move to a "flat" memory space.

Pinning

A pinning operations takes either a ref<T> value or an iref<T> value as parameter. The result is a data pointer. If it is an iref, the data pointer can be used to access the memory location referred by the iref. Pinning a NULL iref returns a NULL pointer whose address is 0. If it is a ref, it is equivalent to pin the iref of the memory location of the object itself, or 0 if the ref itself is NULL.

An unpinning operation also takes either a ref<T> value or an iref<T> value as parameter, but returns void.

In each thread, there is a conceptual "pinning multi-set" (may contain repeated elements). A pinning operation adds a ref or iref into this multi-set, and an unpinning operation removes one instance of the ref or iref from his multi-set. A memory location is pinned as long as there is at least one iref to that memory location in the pinning multi-set of any thread.

NOTE: This requires the Mu micro VM to perform somewhat complex book-keeping, but this gives Mu the opportunity for performance improvement over global Boolean pinning, where a pinned object can be unpinned instantly by an unpinning operation in any thread. The "pinning multi-thread" can be implemented as a thread-local buffer. In this case, if GC never happens, no expensive atomic memory access or inter-thread synchronisation is performed.

Calling between Mu and Native Functions

Calling Conventions

The calling conventions involving native programs are platform-dependent and implementation-dependent. It should be defined by platform-specific binary interfaces (ABI) as supplements to this Mu specification. Mu implementations should advertise what ABI it implements.

Calling conventions are identified by flags (#XXXXXX) in the IR. Mu defines the flag #DEFAULT and its numerical value 0x00 for the default calling convention of platforms. This flag is always available. Other calling conventions can be defined by implementations.

The calling convention determines the type of value that are callable by the CCALL instruction (described below), and the type of the exposed value for Mu functions (described below). The type is usually a ufuncptr<sig> for C functions, which are called via their addresses. Other examples are:

Mu Functions Calling Native Functions

The CCALL instruction calls a native function. Determined by calling conventions, the native function may be represented in different ways, and the arguments are passed in different ways. The return value of the call will be the return value of the CCALL instruction, which is a Mu SSA variable.

Native Functions Calling Mu Functions

A Mu function can be exposed as a native function pointer in three ways:

  1. Statically, an .expose top-level definition exposes a Mu function as a native value according to the desired calling convention. For the default calling convention, the result is usually a function pointer.
  2. Dynamically, the @uvm.native.expose common instructions can expose a Mu function, and the @uvm.native.unexpose common instruction deletes the exposed value.
  3. Dynamically, the expose and unexpose API function do the same thing as the above instructions.

A "cookie", which is a 64-bit integer value, can be attached to each exposed value. When a Mu function is called via one of its exposed value, the attached cookie can be retrieved by the @uvm.native.get_cookie common instruction in the callee, or 0 if called directly from Mu.

NOTE: The purpose for the cookie is to support "closures". In some high-level languages, the programmer-accessible "functions" are actually closures, i.e. codes with attached data. Implemented on Mu, multiple different closures may share the same Mu function as their codes, but has different attached data. For example, in Lua:

function make_adder(y)
    return function(x)
        return x + y
    end
end

plus_one = make_adder(1)
plus_two = make_adder(2)

print(plus_one(3), plus_two(3))     -- 4 5

plus_one and plus_two may probably share the same underlying Mu function as their common implementations, and they only differ by the different "up-value" y.

In C, any sane C programs that use call-backs should also have a void * as the "user data". For example, the pthread_create routine takes an extra void *arg parameter which will be passed to its start_routine as the argument. If the call-back is supposed to be a wrapper of a high-level language closure, the user data will be its context.

However, different C programs support user data in different ways (if at all). For example, the UNIX signal handler function takes exactly one parameter which is the signal number: typedef void (*sig_t) (int). If a closure is supposed to handle UNIX signals, it must be able to identify its context by merely the exposed function pointer.

One way to work around this problem is to generate a trampoline function which sets the cookie and jumps to the real callee. Many different trampolines can be made for a single Mu function, each of which supplies a different cookie. In this case, the cookie can identify the context for the closure.

The simplest kind of cookie is an integer, but an object reference may also be a candidate.

Since Mu programs need special contexts to execute (such as the thread-local memory allocation pool for the garbage collector, and the notion of the "current stack" for the SWAP-STACK operation), a native thread needs to attach itself to the Mu instance before calling any Mu functions. If a Mu thread calls native code from Mu, then it is already attached and can freely call back to Mu again. How to attach a thread to Mu is implementation-defined.

For JVM users: The JNI invocation API function AttachCurrentThread() and DetachCurrentThread() are the counterpart of this requirement.

Stack Sharing and Stack Introspection

The callee may share the stack with the caller.

When a Mu function "A" calls a native function which then calls back to another Mu function "B", Mu sees one single native frame between the frames for "A" and "B". When a Mu function is called from a native function without other Mu functions below, Mu consider the Mu function sitting on top of a native frame.

Stack introspection can skip native frames and introspect other Mu frames below.

NOTE: The requirement to "see through" native frames is partially required by exact garbage collection, in which case all references in the stack must be identified.

However, throwing Mu exceptions into native frames has implementation-defined behaviour. Attempting to pop native frames via the API also has implementation-defined behaviour.

NOTE: In general, it is not safe to force unwind native frames because native programs may need to clean up their own resources. Existing approaches, including JNI, models high-level (such as Java-level) exceptions as a query-able state rather than actual stack unwinding through native programs.

Native exceptions thrown into Mu frames also have implementation-defined behaviours.

NOTE: Similar to native frames, Mu programs may have even more necessary clean-up operations, such as GC barriers.

Changes in the Mu IR and the API introduced by the native interface

New types:

  • uptr < T >
  • ufuncptr < sig >

See Type System

New top-level definitions:

  • function exposing definition

See Mu IR.

New instructions:

  • PTRCAST
  • @uvm.native.pin
  • @uvm.native.unpin
  • @uvm.native.expose
  • @uvm.native.unexpose
  • @uvm.native.get_cookie

See Instruction Set and Common Instructions.

Modified instructions:

  • Memory addressing:
    • GETFIELDIREF
    • GETELEMIREF
    • SHIFTIREF
    • GETVARPARTIREF
    • LOAD
    • STORE
    • CMPXCHG
    • ATOMICRMW
  • CCALL

Memory addressing instructions take an additional PTR flag. If this flag is present, the location operand must be uptr<T> rather than iref<T>. For example:

  • %new_ptr = GETFIELDIREF PTR <@some_struct 3> %ptr_to_some_struct
  • %new_ptr = GETELEMIREF PTR <@some_array @i64> %ptr_to_some_array @const1
  • %new_ptr = SHIFTIREF PTR <@some_elem @i64> %ptr_to_some_elem @const2
  • %new_ptr = GETVARPARTIREF PTR <@some_hybrid> %ptr_to_some_hybrid
  • %old_val = LOAD PTR SEQ_CST <@T> %ptr_to_T
  • %void = STORE PTR SEQ_CST <@T> %ptr_to_T %newval
  • %result = CMPXCHG PTR ACQ_REL ACQUIRE <@T> %ptr_to_T %expected %desired
  • %old_val = ATOMICRMW ADD PTR SEQ_CST <@T> %ptr_to_T %rhs

See Instruction Set.

New API functions:

  • ptrcast
  • pin
  • unpin
  • expose
  • unexpose

Modified API functions:

The cur_func_ver function, in addition to returning the function version ID, it may also return 0 if the selected frame is a native frame. (Multiple native frames are counted as one between two Mu frames.)

The pop_frames_to function has implementation-defined behaviours when popping native frames.

When rebinding a thread to a stack with a value, and the top frame is on a call site (native or Mu), the value associated with the rebinding is the return value of the call.

Future Works

TODO: Inline assembly