Backend: SWIM
The SWIM (Scalable Weakly-consistent Infection-style process group Membership) backend provides decentralized failure detection and group membership management. Unlike the centralized backend, every member participates in failure detection, eliminating the single point of failure.
When to use
Use the SWIM backend when:
You need decentralized failure detection with no single point of failure
Group membership changes over time (members join/leave/fail)
You want membership change notifications
You’re building large-scale services where centralized coordination is a bottleneck
Characteristics
Dynamic membership: Members can join and leave the group after initialization.
Decentralized failure detection: Every member probes other members in round-robin order. There is no designated coordinator. If any member fails, the remaining members will detect it independently.
Suspicion mechanism: Members are first marked as suspected before being declared dead. A suspected member can refute the suspicion by incrementing its incarnation number, reducing false positives caused by transient network issues.
Gossip-based dissemination: Membership events (joins, leaves, failures, suspicions) are piggybacked on protocol messages, ensuring efficient dissemination with O(log n) message overhead per event.
Notifications: Members are notified of membership changes via callbacks. Clients can update their view if the group changes.
Configuration
In Bedrock configuration:
{
"libraries": [
"libflock-bedrock-module.so"
],
"providers": [
{
"type": "flock",
"name": "my_flock_provider",
"provider_id": 42,
"config": {
"bootstrap": "self",
"group": {
"type": "swim",
"config": {
"protocol_period_ms": 1000,
"ping_timeout_ms": 200,
"ping_req_timeout_ms": 500,
"ping_req_members": 3,
"suspicion_timeout_ms": 5000
}
},
"file": "mygroup.flock"
}
}
]
}
Configuration options:
protocol_period_ms: Time between probe rounds in milliseconds (default: 1000). Each round, a member selects one target to probe.ping_timeout_ms: Timeout for a direct ping RPC (default: 200).ping_req_timeout_ms: Timeout for an indirect ping-req RPC (default: 500).ping_req_members: Number of members to use for indirect probing when a direct ping fails (default: 3).suspicion_timeout_ms: Time a member remains in the suspected state before being declared dead and removed (default: 5000).
In C code
/*
* (C) 2024 The University of Chicago
*
* See COPYRIGHT in top-level directory.
*/
#include <assert.h>
#include <stdio.h>
#include <margo.h>
#include <flock/flock-server.h>
#include <flock/flock-bootstrap.h>
int main(int argc, char** argv)
{
// Initialize Margo
margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 0, 0);
assert(mid);
// Initialize provider args
struct flock_provider_args args = FLOCK_PROVIDER_ARGS_INIT;
flock_group_view_t initial_view = FLOCK_GROUP_VIEW_INITIALIZER;
args.initial_view = &initial_view;
// Bootstrap using self
uint16_t provider_id = 42;
flock_group_view_init_from_self(mid, provider_id, &initial_view);
// Configure with SWIM backend
// SWIM backend: decentralized failure detection using the SWIM protocol.
// All members participate in failure detection via ping/ping-req rounds.
const char* config =
"{"
" \"group\": {"
" \"type\": \"swim\","
" \"config\": {"
" \"protocol_period_ms\": 1000,"
" \"ping_timeout_ms\": 200,"
" \"ping_req_timeout_ms\": 500,"
" \"ping_req_members\": 3,"
" \"suspicion_timeout_ms\": 5000"
" }"
" }"
"}";
// Register provider with SWIM backend
flock_provider_t provider;
int ret = flock_provider_register(mid, provider_id, config, &args, &provider);
assert(ret == FLOCK_SUCCESS);
printf("Flock provider registered with SWIM backend\n");
printf("Decentralized failure detection is active\n");
printf("Initial group size: %zu\n", initial_view.members.size);
// Wait for finalize
margo_wait_for_finalize(mid);
return 0;
}
How it works
The SWIM backend implements a protocol loop that runs on every member. Each protocol period, the following steps occur:
1. Select a probe target
The member maintains a shuffled list of all other members and probes them in round-robin order. This ensures every member is probed within a bounded number of periods.
2. Direct ping
The member sends a ping RPC to the target. If the target responds, it is considered alive. Gossip entries are piggybacked on the ping and its response to disseminate membership information.
3. Indirect probing (on direct ping failure)
If the direct ping fails (timeout), the member selects ping_req_members
random peers and asks them to ping the target on its behalf via ping-req RPCs.
If any of the peers successfully reaches the target, the target is considered
alive.
4. Suspicion
If both the direct ping and all indirect probes fail, the target is marked as suspected. A suspicion gossip entry is disseminated to the group.
5. Suspicion timeout
If a suspected member does not refute the suspicion within
suspicion_timeout_ms, it is declared dead and removed from the group.
All members are notified via the membership update callback.
6. Refutation
If a member learns that it is being suspected (via a received gossip entry), it increments its incarnation number and gossips an ALIVE message with the new incarnation. This overrides the suspicion and keeps the member in the group.
Join and leave
Joining a group:
Use the “join” bootstrap method. When a new member joins, it sends a JOIN announcement to a subset of existing members. The announcement is disseminated through the gossip protocol. The new member is added to every member’s view.
Graceful leave:
When a provider is destroyed, it sends a LEAVE announcement to a subset of members before shutting down. The LEAVE event is disseminated through gossip, and all remaining members remove the departing member from their views.
Crash (ungraceful leave):
If a member crashes without announcing its departure, the failure detection
mechanism will eventually detect it through ping timeouts and the suspicion
protocol. The flock_swim_set_crash_mode function can be used to
simulate this scenario for testing:
// Enable crash mode: provider will not send LEAVE on destroy
flock_swim_set_crash_mode(provider, true);
Gossip protocol
The SWIM backend uses a gossip buffer to disseminate membership events efficiently:
Each event (ALIVE, SUSPECT, CONFIRM, JOIN, LEAVE) is stored in a buffer
Events are piggybacked on ping, ping-req, and announce RPCs
Each event is gossiped up to 3 * log2(N) times, where N is the group size
Events with higher incarnation numbers override older events for the same member
Old events are automatically cleaned up after being gossiped enough times
This ensures that every membership event reaches all members with high probability while keeping network overhead bounded.
Comparison with centralized backend
Property |
Centralized |
SWIM |
|---|---|---|
Failure detection |
Primary only |
All members |
Single point of failure |
Yes (primary) |
No |
Network overhead per period |
O(N) from primary |
O(1) per member |
Failure detection time |
Depends on ping interval and max timeouts |
Depends on protocol period and suspicion timeout |
Join mechanism |
Forwarded to primary |
Announced via gossip |
Metadata operations |
Supported |
Not supported |
Tuning parameters
Choose parameter values based on your requirements:
Fast failure detection (more overhead):
{
"protocol_period_ms": 500,
"ping_timeout_ms": 100,
"ping_req_timeout_ms": 300,
"ping_req_members": 3,
"suspicion_timeout_ms": 2000
}
Failure detection time: ~2.5 seconds (suspicion timeout + a few protocol periods)
Conservative failure detection (less overhead):
{
"protocol_period_ms": 2000,
"ping_timeout_ms": 500,
"ping_req_timeout_ms": 1000,
"ping_req_members": 3,
"suspicion_timeout_ms": 10000
}
Failure detection time: ~12 seconds (suspicion timeout + a few protocol periods)
Large groups (>100 members):
Increase the protocol period to reduce traffic. The gossip protocol’s logarithmic dissemination ensures events still reach all members in a bounded number of rounds even with longer periods.
Limitations
The SWIM backend has the following limitations:
No metadata operations:
add_metadataandremove_metadataare not supported (returnsFLOCK_ERR_OP_UNSUPPORTED). Metadata can still be set in the initial view.Probabilistic guarantees: Failure detection and event dissemination are probabilistically guaranteed, not deterministic. In rare cases, a transient network partition may cause a healthy member to be declared dead.
No primary: There is no designated coordinator. Operations that require centralized coordination (e.g. consensus) are not available.