Understanding the RPC and ULT model

When developing a Mochi service, whether with Margo or Thallium, it is useful to keep in mind how RPCs translate into user-level threads (ULTs) when they reach the server. The figure below summarizes what happens when a client sends an RPC to a server and the RPC handler on the server side includes some RDMA operations.

In this figure, we show only one execution stream for the client, assuming it has initialized Margo (or Thallium) without a Mercury progress thread. The case of using a Mercury progress thread on a client is similar, as the progress thread simply takes care of network activities on behalf of the caller thread.

This figure shows a client using margo_iforward, which sends an RPC to a server in a non-blocking manner. The case of margo_forward can be viewed as the same scenario but with margo_wait invoked immediately after margo_iforward. In Thallium, the equivalent code would use the async member function of a callable_remote_procedure object.
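
To make this concrete, here is a minimal client-side sketch using the Margo C API. The "sum" RPC, its sum_in_t/sum_out_t types, and the server address are illustrative assumptions, not part of the figure; the address would typically have been obtained with margo_addr_lookup.

```c
#include <margo.h>

/* Illustrative argument/response types; MERCURY_GEN_PROC also generates
 * their hg_proc serialization functions. */
MERCURY_GEN_PROC(sum_in_t,  ((int32_t)(x))((int32_t)(y)))
MERCURY_GEN_PROC(sum_out_t, ((int32_t)(result)))

void call_sum(margo_instance_id mid, hg_addr_t server_addr)
{
    /* NULL handler: this process only calls the RPC, it does not serve it. */
    hg_id_t rpc_id = MARGO_REGISTER(mid, "sum", sum_in_t, sum_out_t, NULL);

    hg_handle_t handle;
    margo_create(mid, server_addr, rpc_id, &handle);

    sum_in_t in = { .x = 40, .y = 2 };
    margo_request req;
    margo_iforward(handle, &in, &req); /* non-blocking: serializes in, sends, returns */
    /* ... other work can overlap with the RPC here ... */
    margo_wait(req);                   /* blocks until the response has arrived */

    /* ... decode the response with margo_get_output (shown further below) ... */
    margo_destroy(handle);
}
```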

Figure: User-friendly Margo RPC Model

Explanations

margo_forward and margo_iforward start by calling the serialization function (provided by the user when registering RPCs using MARGO_REGISTER) to serialize the RPC argument into an input buffer. Mercury then sends a request, including this buffer, to the server.
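
For reference, the hand-written proc function below is roughly what MERCURY_GEN_PROC(sum_in_t, ((int32_t)(x))((int32_t)(y))) expands to in the client sketch above; it is only a sketch of the mechanism. The same function encodes on the sending side and decodes on the receiving side, depending on the direction of the hg_proc_t it is given.

```c
#include <mercury_proc.h>
#include <stdint.h>

typedef struct {
    int32_t x;
    int32_t y;
} sum_in_t;

/* MARGO_REGISTER(mid, "sum", sum_in_t, ...) looks for a function with
 * exactly this name to (de)serialize the RPC argument. */
static hg_return_t hg_proc_sum_in_t(hg_proc_t proc, void* data)
{
    sum_in_t* in = (sum_in_t*)data;
    hg_return_t ret = hg_proc_int32_t(proc, &in->x); /* encode or decode x */
    if (ret != HG_SUCCESS) return ret;
    return hg_proc_int32_t(proc, &in->y);            /* encode or decode y */
}
```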

On the server, the Mercury progress loop (which may execute on a dedicated execution stream) eventually sees the request and invokes the corresponding callback (in yellow). This callback, automatically generated by DEFINE_MARGO_RPC_HANDLER in the user’s code, (1) looks up the Argobots pool in which the RPC is supposed to execute, and (2) creates a ULT in that pool. This ULT will run the user’s RPC handler.
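
A sketch of the corresponding server-side pattern, continuing the illustrative "sum" RPC, could look as follows; DEFINE_MARGO_RPC_HANDLER generates the Mercury callback that creates a ULT running sum_ult, and MARGO_REGISTER associates that callback with the RPC name.

```c
#include <margo.h>

MERCURY_GEN_PROC(sum_in_t,  ((int32_t)(x))((int32_t)(y)))
MERCURY_GEN_PROC(sum_out_t, ((int32_t)(result)))

static void sum_ult(hg_handle_t handle);
DECLARE_MARGO_RPC_HANDLER(sum_ult)

static void sum_ult(hg_handle_t handle)
{
    /* ... get the input, transfer data, respond: sketched in the next steps ... */
    margo_destroy(handle);
}
DEFINE_MARGO_RPC_HANDLER(sum_ult)

static void register_rpcs(margo_instance_id mid)
{
    /* Associates the RPC name, the (de)serialization functions of the
     * in/out types, and the Mercury callback generated for sum_ult. */
    MARGO_REGISTER(mid, "sum", sum_in_t, sum_out_t, sum_ult);
}
```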

RPC handler ULTs are posted in a pool that may be associated with different execution streams (ES) than the pool used by the Mercury progress loop. For instance, when calling margo_init(..., 1, 8), 8 ES are created along with a shared pool in which RPC handler ULTs will be posted. When one of these ES is free, it pulls a ULT from the pool and executes it.
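
As a minimal sketch of that configuration (the "na+sm" protocol is an arbitrary choice here):

```c
#include <margo.h>

int main(void)
{
    /* Third argument (1): run the Mercury progress loop in a dedicated ES.
     * Fourth argument (8): create 8 ES sharing the pool in which RPC
     * handler ULTs are posted. */
    margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 1, 8);
    if (mid == MARGO_INSTANCE_NULL) return -1;

    /* ... register RPCs, then block until margo_finalize is called ... */
    margo_wait_for_finalize(mid);
    return 0;
}
```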

Generally, the RPC handler starts by calling margo_get_input to deserialize the RPC’s argument. This invokes the user-provided serialization function to decode the content of the Mercury buffer into the user’s input data structure.
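
Continuing the illustrative sum_ult handler, the beginning of its body could look as follows (error handling is kept to a minimum):

```c
static void sum_ult(hg_handle_t handle)
{
    sum_in_t in;
    /* Runs the hg_proc function of sum_in_t in decode mode over the
     * buffer received by Mercury. */
    if (margo_get_input(handle, &in) != HG_SUCCESS) {
        margo_destroy(handle);
        return;
    }

    /* ... use in.x and in.y, transfer data, respond (next steps) ... */

    margo_free_input(handle, &in); /* frees anything allocated while decoding */
    margo_destroy(handle);
}
```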

When issuing a bulk transfer (RDMA) using margo_bulk_transfer, the ULT asks the Mercury progress loop to execute the transfer. Meanwhile, this ULT yields, so that the ES on which it runs can execute other ULTs (e.g. other RPC requests).
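
The following sketch illustrates this step. Unlike the simple sum example, it assumes the RPC input carries a bulk handle and a size exposed by the client (e.g. declared with ((hg_bulk_t)(bulk))((uint64_t)(size)) in MERCURY_GEN_PROC); the pull_client_data helper name is purely illustrative.

```c
#include <margo.h>
#include <stdlib.h>

static void pull_client_data(hg_handle_t handle, hg_bulk_t client_bulk, size_t size)
{
    margo_instance_id mid = margo_hg_handle_get_instance(handle);
    const struct hg_info* info = margo_get_info(handle); /* info->addr is the client */

    void*     buffer     = malloc(size);
    hg_size_t buf_size   = size;
    hg_bulk_t local_bulk = HG_BULK_NULL;
    margo_bulk_create(mid, 1, &buffer, &buf_size, HG_BULK_WRITE_ONLY, &local_bulk);

    /* The calling ULT yields here; it resumes once the progress loop has
     * completed the RDMA pull from the client's memory. */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, client_bulk, 0,
                        local_bulk, 0, size);

    /* ... consume buffer ... */
    margo_bulk_free(local_bulk);
    free(buffer);
}
```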

The Mercury progress loop eventually executes the RDMA operation and notifies the calling ULT. The calling ULT is marked as ready and will eventually resume.

When the RPC handler calls margo_respond to send a response to the client, it first calls the user-provided serialization function to encode the response into Mercury’s buffer, then yields, waiting for the Mercury progress loop to send the response and allowing the ES to potentially execute other RPCs in the meantime.
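
In the sum example, the end of the handler could look like the sketch below (the respond_and_cleanup helper name is purely illustrative):

```c
/* Final step of the sum_ult handler: serialize and send the response. */
static void respond_and_cleanup(hg_handle_t handle, int32_t result)
{
    sum_out_t out = { .result = result };
    /* Encodes out with the hg_proc function of sum_out_t, then yields until
     * the progress loop has sent the response back to the client. */
    margo_respond(handle, &out);
    margo_destroy(handle);
}
```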

Once the response has been sent by Mercury, the Mercury progress loop notifies the RPC handler ULT, which eventually resumes and completes.

Finally, margo_wait completes on the client. The client can then call margo_get_output on the RPC handle to deserialize the RPC’s output using the user-provided deserialization function.
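
Back in the client sketch, once margo_wait has returned, the response can be decoded from the handle (the read_response helper name is purely illustrative):

```c
#include <stdio.h>

/* Continuation of the client sketch: decode and release the response. */
static void read_response(hg_handle_t handle)
{
    sum_out_t out;
    if (margo_get_output(handle, &out) == HG_SUCCESS) {
        printf("sum = %d\n", (int)out.result);
        margo_free_output(handle, &out); /* frees anything allocated while decoding */
    }
    margo_destroy(handle);
}
```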

Note

This model remains valid regardless of whether the Mercury progress loop runs in a dedicated ES or not. Indeed, you may notice that, from the point of view of a single RPC operation, the two ES shown here on the server could be merged into one. The advantage of dedicating an ES to Mercury progress is that multiple concurrent RPC handlers may rely on it with minimal interference. Should the Mercury progress loop run in the same ES as the RPC handlers, calling margo_respond in a handler could yield to another (potentially long-running) RPC handler instead of the progress loop, thus delaying the completion of the first RPC handler.