Understanding the RPC and ULT model
When developing a Mochi service, either with Margo or Thallium, it is useful to keep in mind how RPCs translate into user-level threads (ULTs) when they reach the server. The figure below summarizes what happens when a client sends an RPC to a server and the RPC handler on the server side includes some RDMA operations.
In this figure, we show only one execution stream for the client, assuming it has initialized Margo (or Thallium) without a Mercury progress thread. The case of using a Mercury progress thread on a client is similar, as the progress thread simply takes care of network activities on behalf of the caller thread.
This figure shows a client using margo_iforward, which sends an RPC to a server in a non-blocking manner. The case of margo_forward can be viewed as the same scenario but with margo_wait invoked immediately after margo_iforward. In Thallium, the equivalent code would use the async member function of a callable_remote_procedure object.
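As a rough sketch of the client side (the RPC id my_rpc_id, input type my_rpc_in_t, server address svr_addr, and Margo instance mid are illustrative names, assumed to have been set up earlier):

    hg_handle_t   handle;
    margo_request req;
    my_rpc_in_t   in = {0};   /* fill in the RPC's arguments here */

    margo_create(mid, svr_addr, my_rpc_id, &handle); /* create a handle for this RPC */
    margo_iforward(handle, &in, &req);               /* serialize the input and send, without blocking */
    /* ... do other work while the RPC is in flight ... */
    margo_wait(req);                                 /* yield until the server's response has arrived */

The blocking variant would simply call margo_forward(handle, &in) instead of the margo_iforward/margo_wait pair.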
Explanations
margo_forward and margo_iforward start by calling the serialization function (provided by the user when registering RPCs using MARGO_REGISTER) to serialize the RPC argument into an input buffer. Mercury then sends a request, including this buffer, to the server.
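For reference, these serialization functions are typically generated with Mercury's MERCURY_GEN_PROC macro and attached to the RPC at registration time; a minimal sketch (the my_rpc name and its fields are illustrative) could look like this:

    #include <margo.h>
    #include <mercury_macros.h>

    /* Generate hg_proc_* serialization functions for the input and output types. */
    MERCURY_GEN_PROC(my_rpc_in_t, ((int32_t)(value))((hg_bulk_t)(bulk)))
    MERCURY_GEN_PROC(my_rpc_out_t, ((int32_t)(ret)))

    /* In the client's initialization code, after margo_init; the NULL handler
     * means this process only sends this RPC and never executes it. */
    hg_id_t my_rpc_id = MARGO_REGISTER(mid, "my_rpc", my_rpc_in_t, my_rpc_out_t, NULL);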
On the server, the Mercury progress loop (which may execute on a dedicated execution
stream) eventually sees the request and invokes the corresponding callback (in yellow).
This callback, automatically generated by DEFINE_MARGO_RPC_HANDLER in the user's code,
(1) looks up the Argobots pool in which the RPC is supposed to execute, and (2) creates
a ULT in that pool. This ULT will run the user's RPC handler.
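The corresponding server-side boilerplate could look like the following sketch (the my_rpc_ult name and the input/output types are illustrative):

    /* The handler the ULT will run, plus the macros that generate the Mercury
     * callback responsible for creating that ULT. */
    static void my_rpc_ult(hg_handle_t handle);
    DECLARE_MARGO_RPC_HANDLER(my_rpc_ult)

    static void my_rpc_ult(hg_handle_t handle)
    {
        /* ... margo_get_input, margo_bulk_transfer, margo_respond, ... */
        margo_destroy(handle);
    }
    DEFINE_MARGO_RPC_HANDLER(my_rpc_ult)

    /* In the server's initialization code, attach the handler to the RPC id: */
    hg_id_t my_rpc_id = MARGO_REGISTER(mid, "my_rpc", my_rpc_in_t, my_rpc_out_t, my_rpc_ult);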
RPC handler ULTs are posted to a pool that may be associated with different execution
streams (ES) than the pool used by the Mercury progress loop. For instance, when calling
margo_init(..., 1, 8), 8 ES are created along with a shared pool in which RPC handler
ULTs will be posted. When one of these ES is free, it pulls a ULT from the pool and
executes it.
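For example, a server dedicating one ES to Mercury progress and 8 ES to RPC handlers could be initialized as follows (the protocol string is illustrative):

    /* Server initialization:
     *   third argument (1): run the Mercury progress loop in its own ES,
     *   fourth argument (8): create 8 ES sharing the pool of RPC handler ULTs. */
    margo_instance_id mid = margo_init("tcp", MARGO_SERVER_MODE, 1, 8);
    if (mid == MARGO_INSTANCE_NULL) {
        /* initialization failed */
        return -1;
    }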
Generally, the RPC handler starts by deserializing the RPC's argument with margo_get_input, which invokes the user-provided serialization function to decode the content of the Mercury buffer into the user's input data structure.
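Inside the handler, this step could look like the following sketch:

    my_rpc_in_t in;
    hg_return_t hret = margo_get_input(handle, &in);  /* decode the input buffer */
    if (hret != HG_SUCCESS) { /* handle the error */ }
    /* in.value, in.bulk, ... are now usable; release them later with margo_free_input */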
When issuing a bulk transfer (RDMA) using margo_bulk_transfer, the ULT asks the Mercury
progress loop to execute the transfer. Meanwhile, this ULT yields, so that the ES on
which it runs can execute other ULTs (e.g. other RPC requests).
The Mercury progress loop eventually executes the RDMA operation and notifies the calling ULT. The calling ULT is marked as ready and will eventually resume.
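A pull of the client's data inside the handler could look like the following sketch (in.bulk is assumed to be a field of the illustrative input type carrying the client's bulk handle):

    /* Pull the data exposed by the client into a local buffer. */
    margo_instance_id     mid  = margo_hg_handle_get_instance(handle);
    const struct hg_info* info = margo_get_info(handle);   /* contains the client's address */

    hg_size_t buf_size = margo_bulk_get_size(in.bulk);     /* size of the client's exposed region */
    void*     buf      = malloc(buf_size);
    hg_bulk_t local_bulk;

    margo_bulk_create(mid, 1, &buf, &buf_size, HG_BULK_WRITE_ONLY, &local_bulk);

    /* The calling ULT yields here; the ES can run other handlers while the
     * Mercury progress loop carries out the RDMA operation. */
    margo_bulk_transfer(mid, HG_BULK_PULL, info->addr, in.bulk, 0,
                        local_bulk, 0, buf_size);

    margo_bulk_free(local_bulk);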
When the RPC handler calls margo_respond to send a response to the client, it first
calls the user-provided serialization function to encode the response into Mercury's
buffer, then yields, waiting for the Mercury progress loop to send the response and
allowing the ES to potentially execute other RPCs in the meantime. Once the response
has been sent, the Mercury progress loop notifies the RPC handler ULT, which eventually
resumes and completes.
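The end of the handler could thus look like the following sketch:

    my_rpc_out_t out;
    out.ret = 0;

    /* Encode 'out' into Mercury's buffer, then yield until the response is sent. */
    margo_respond(handle, &out);

    /* Release resources before the handler ULT completes. */
    margo_free_input(handle, &in);
    margo_destroy(handle);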
Finally, margo_wait completes on the client. The client can then call margo_get_output
on the RPC handle to deserialize the RPC's output using the user-provided
deserialization function.
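Once margo_wait has returned on the client, retrieving the output could look like this sketch:

    my_rpc_out_t out;
    margo_get_output(handle, &out);   /* decode the RPC's output */
    /* ... use out.ret ... */
    margo_free_output(handle, &out);  /* release the decoded output */
    margo_destroy(handle);            /* release the RPC handle */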
Note
This model remains valid regardless of whether the Mercury progress loop
runs in a separate ULT or not. Indeed, you may notice that, from the point of
view of a single RPC operation, the two ES shown here on the server could be
merged into one. The advantage of dedicating an ES to Mercury progress is
that multiple concurrent RPC handlers may rely on it with minimal interference.
Should the Mercury progress loop run in the same ES as the RPC handlers,
calling margo_respond in a handler could yield to another (potentially
long-running) RPC handler instead of the progress loop, thus delaying
the completion of the first RPC handler.