Backend: centralized
The centralized backend provides dynamic group management with a centralized coordinator. It supports membership changes, fault detection (of processes other than the coordinator), and notifications.
When to use
Use the centralized backend when:
Group membership changes over time (members join/leave)
You need fault tolerance and failure detection
You want membership change notifications
You’re building elastic services that scale dynamically
Characteristics
Dynamic membership: Members can join and leave the group after initialization.
Centralized coordination: One member acts as the primary (coordinator), managing the authoritative group view.
Failure detection: The primary periodically pings followers to detect failures.
Notifications: Members subscribe to membership change events. Clients can update their view if the group changes.
Higher overhead: Uses more resources than the static backend because of coordination traffic (periodic pings).
Configuration
In Bedrock configuration:
{
    "libraries": [
        "libflock-bedrock-module.so"
    ],
    "providers": [
        {
            "type": "flock",
            "name": "my_flock_provider",
            "provider_id": 42,
            "config": {
                "bootstrap": "self",
                "group": {
                    "type": "centralized",
                    "config": {
                        "ping_timeout_ms": 2000,
                        "ping_interval_ms": 1000,
                        "ping_max_num_timeouts": 3
                    }
                },
                "file": "mygroup.flock"
            }
        }
    ]
}
Configuration options:
ping_timeout_ms: Timeout, in milliseconds, for a ping RPC sent to a member.
ping_interval_ms: Time to wait, in milliseconds, between two ping RPCs to the same member. Can be a single value or a [min, max] pair, in which case each wait is drawn uniformly at random from that interval.
ping_max_num_timeouts: Number of consecutive ping timeouts before a member is considered dead and removed from the group.
primary_address: (optional) Address of the process to use as primary (coordinator). If not provided, the first member in the initial view is used.
primary_provider_id: (optional) Provider ID of the primary process.
Example with randomized ping interval:
{
    "group": {
        "type": "centralized",
        "config": {
            "ping_timeout_ms": 2000,
            "ping_interval_ms": [500, 1500],
            "ping_max_num_timeouts": 3
        }
    }
}
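To illustrate what a [min, max] interval means, here is a minimal sketch of a uniform draw. The helper next_ping_interval_ms is hypothetical and is not part of Flock; how the library actually generates its random intervals is not specified here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (NOT part of the Flock API): draw the next wait time,
 * in milliseconds, uniformly from [min_ms, max_ms]. */
double next_ping_interval_ms(double min_ms, double max_ms)
{
    assert(min_ms >= 0.0 && min_ms <= max_ms);
    /* rand() / RAND_MAX lies in [0, 1] */
    double u = (double)rand() / (double)RAND_MAX;
    return min_ms + u * (max_ms - min_ms);
}
```

With the configuration above, each call to next_ping_interval_ms(500.0, 1500.0) would return a value between 500 and 1500 ms. Passing the same value for both bounds reduces to a fixed interval.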
In C code
/*
 * (C) 2024 The University of Chicago
 *
 * See COPYRIGHT in top-level directory.
 */
#include <assert.h>
#include <stdio.h>
#include <margo.h>
#include <flock/flock-server.h>
#include <flock/flock-bootstrap.h>

int main(int argc, char** argv)
{
    // Initialize Margo
    margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 0, 0);
    assert(mid);

    // Initialize provider args
    struct flock_provider_args args = FLOCK_PROVIDER_ARGS_INIT;
    flock_group_view_t initial_view = FLOCK_GROUP_VIEW_INITIALIZER;
    args.initial_view = &initial_view;

    // Bootstrap using self
    uint16_t provider_id = 42;
    flock_group_view_init_from_self(mid, provider_id, &initial_view);

    // Configure with centralized backend
    // Centralized backend: allows dynamic membership changes
    // The primary (first member in view by default) pings followers to detect failures
    const char* config =
        "{"
        "  \"group\": {"
        "    \"type\": \"centralized\","
        "    \"config\": {"
        "      \"ping_timeout_ms\": 2000,"
        "      \"ping_interval_ms\": 1000,"
        "      \"ping_max_num_timeouts\": 3"
        "    }"
        "  }"
        "}";

    // Register provider with centralized backend
    flock_provider_t provider;
    int ret = flock_provider_register(mid, provider_id, config, &args, &provider);
    assert(ret == FLOCK_SUCCESS);

    printf("Flock provider registered with CENTRALIZED backend\n");
    printf("Group membership can change dynamically\n");
    printf("Initial group size: %zu\n", initial_view.members.size);

    // Wait for finalize
    margo_wait_for_finalize(mid);
    return 0;
}
How it works
The centralized backend operates as follows:
Primary selection: By default, the first member in the initial view becomes the
primary (coordinator). You can override this with primary_address and
primary_provider_id configuration options.
Ping mechanism: The primary periodically pings all followers to check if they are still alive.
Join protocol:
A new member contacts an existing member using the “join” bootstrap method
The request is forwarded to the primary member
Primary adds the new member to the view
All members are notified of the new member
New member receives the updated view
Failure detection:
Primary sends ping RPCs to followers at regular intervals
If a ping times out, the timeout counter for that member increments
After ping_max_num_timeouts consecutive timeouts, the member is removed
All remaining members are notified of the change
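The failure detection steps can be sketched as a per-follower counter that resets on a successful ping. This is an illustration only: follower_state_t and record_ping_result are hypothetical names, not Flock types or functions, and the actual bookkeeping inside the library may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-follower state kept by the primary (illustration only,
 * NOT a Flock type). */
typedef struct {
    unsigned consecutive_timeouts; /* timeouts since the last successful ping */
    bool     removed;              /* true once the follower left the view */
} follower_state_t;

/* Record the outcome of one ping RPC.
 * Returns true only when this outcome causes the follower to be removed,
 * which is the point where the remaining members would be notified. */
bool record_ping_result(follower_state_t* f, bool ping_ok,
                        unsigned max_num_timeouts)
{
    assert(f != NULL);
    if (f->removed) return false;      /* already out of the view */
    if (ping_ok) {
        f->consecutive_timeouts = 0;   /* only consecutive timeouts count */
        return false;
    }
    f->consecutive_timeouts += 1;
    if (f->consecutive_timeouts >= max_num_timeouts) {
        f->removed = true;
        return true;
    }
    return false;
}
```

Because the counter resets on success, a follower that misses two pings, answers one, then misses two more is not removed when ping_max_num_timeouts is 3.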
Dynamic membership
Unlike the static backend, the centralized backend supports adding and removing members:
Adding members:
Use the “join” bootstrap method to add members to a running group.
Removing members:
Members are automatically removed when they:
Fail to respond to pings (failure detection)
Call flock_provider_deregister (graceful shutdown)
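To make the add and remove semantics concrete, here is a toy in-memory stand-in for a group view. The types and functions (toy_view_t, toy_view_add, toy_view_remove) are hypothetical and are not part of the Flock API. In the real backend the authoritative view lives on the primary and changes are propagated to the other members.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_MEMBERS  16
#define TOY_MAX_ADDR_LEN 64

/* Hypothetical types for illustration (NOT Flock's flock_group_view_t). */
typedef struct {
    char     address[TOY_MAX_ADDR_LEN];
    uint16_t provider_id;
} toy_member_t;

typedef struct {
    toy_member_t members[TOY_MAX_MEMBERS];
    size_t       size;
} toy_view_t;

/* Return the index of the member, or -1 if it is not in the view. */
static int toy_view_find(const toy_view_t* v, const char* address,
                         uint16_t provider_id)
{
    for (size_t i = 0; i < v->size; i++) {
        if (v->members[i].provider_id == provider_id
            && strcmp(v->members[i].address, address) == 0)
            return (int)i;
    }
    return -1;
}

/* Append a member (join). Fails on duplicates, full view, or long address. */
bool toy_view_add(toy_view_t* v, const char* address, uint16_t provider_id)
{
    if (v->size == TOY_MAX_MEMBERS) return false;
    if (strlen(address) >= TOY_MAX_ADDR_LEN) return false;
    if (toy_view_find(v, address, provider_id) >= 0) return false;
    strcpy(v->members[v->size].address, address);
    v->members[v->size].provider_id = provider_id;
    v->size += 1;
    return true;
}

/* Remove a member (failure or graceful shutdown), keeping the order of the
 * remaining members since the first member is the default primary. */
bool toy_view_remove(toy_view_t* v, const char* address, uint16_t provider_id)
{
    int idx = toy_view_find(v, address, provider_id);
    if (idx < 0) return false;
    memmove(&v->members[idx], &v->members[idx + 1],
            (v->size - (size_t)idx - 1) * sizeof(toy_member_t));
    v->size -= 1;
    return true;
}
```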
Primary resilience
The primary is a single point of failure. If the primary fails:
Followers will no longer receive pings or view updates
The group becomes effectively frozen
Best practices:
Run the primary on a reliable node
Monitor the primary’s health
Consider restarting the group if the primary fails
Example: Elastic service
Here’s an example of building an elastic service that can scale up dynamically:
Initial primary:
$ ./server
Flock provider registered with CENTRALIZED backend
Group membership can change dynamically
Initial group size: 1
Additional workers (can be started at any time using the join method):
$ ./join_server mygroup.flock
Joined group with 2 members
As you start more workers, they automatically join the group and are discovered by all members.
Performance considerations
The centralized backend has some overhead:
Network traffic:
Pings: O(N) messages per ping interval (N = number of followers)
Joins: O(N) messages to notify all members
View queries: May require an RPC, depending on caching
Primary load:
Must send pings to all followers
Must track timeout counters for each follower
Must coordinate view updates
For large groups (>100 members), consider:
Increasing ping interval to reduce traffic
Partitioning into multiple smaller groups
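As a back-of-the-envelope aid, the O(N) ping cost can be turned into a message rate. The helper below is a hypothetical sketch, not a Flock function. It assumes one ping per follower per interval and ignores responses, retries, and view-update traffic.

```c
#include <assert.h>

/* Hypothetical helper (NOT part of Flock): approximate number of ping RPCs
 * the primary sends per second, assuming one ping per follower per
 * ping interval. */
double pings_per_second(unsigned num_followers, double ping_interval_ms)
{
    assert(ping_interval_ms > 0.0);
    return (double)num_followers * 1000.0 / ping_interval_ms;
}
```

Under these assumptions, 200 followers with a 500 ms interval cost about 400 pings per second, while raising the interval to 3000 ms brings this down to roughly 67.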
Tuning parameters
Choose ping and timeout values based on your needs:
Fast failure detection (more overhead):
{
    "ping_timeout_ms": 1000,
    "ping_interval_ms": 500,
    "ping_max_num_timeouts": 2
}
Failure detection time: ~2.5 seconds (500ms interval + 2 × 1000ms timeouts)
Slower failure detection (less overhead):
{
    "ping_timeout_ms": 5000,
    "ping_interval_ms": 3000,
    "ping_max_num_timeouts": 3
}
Failure detection time: ~18 seconds (3000ms interval + 3 × 5000ms timeouts)
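When comparing settings, a rough range for the detection time can be computed from the three parameters. The sketch below is an assumption-laden estimate, not Flock's actual timing model: it assumes detection needs ping_max_num_timeouts full timeouts, plus at least one and at most ping_max_num_timeouts waiting intervals. The type and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical result type (NOT part of Flock): estimated range, in ms. */
typedef struct {
    double min_ms; /* all timeouts plus a single waiting interval */
    double max_ms; /* all timeouts plus one waiting interval per timeout */
} detection_estimate_t;

/* Estimate how long it takes to declare a follower dead.
 * ASSUMPTION: each failed ping costs one full timeout, and between one and
 * max_num_timeouts waiting intervals elapse before removal. */
detection_estimate_t estimate_detection_time_ms(double ping_timeout_ms,
                                                double ping_interval_ms,
                                                unsigned ping_max_num_timeouts)
{
    assert(ping_timeout_ms >= 0.0 && ping_interval_ms >= 0.0);
    assert(ping_max_num_timeouts > 0);
    double timeouts = ping_max_num_timeouts * ping_timeout_ms;
    detection_estimate_t e;
    e.min_ms = timeouts + ping_interval_ms;
    e.max_ms = timeouts + ping_max_num_timeouts * ping_interval_ms;
    return e;
}
```

Under these assumptions, the fast settings (1000 ms timeout, 500 ms interval, 2 timeouts) give 2500 to 3000 ms, and the slower settings (5000 ms timeout, 3000 ms interval, 3 timeouts) give 18000 to 24000 ms.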