Backend: centralized

The centralized backend provides dynamic group management through a single coordinator (the primary). It supports membership changes, failure detection for every process except the coordinator itself, and membership change notifications.

When to use

Use the centralized backend when:

Group membership changes over time (members join/leave)
You need failed members to be detected and removed automatically
You want membership change notifications
You’re building elastic services that scale dynamically

Characteristics

Dynamic membership: Members can join and leave the group after initialization.

Centralized coordination: One member acts as the primary (coordinator), managing the authoritative group view.

Failure detection: The primary periodically pings followers to detect failures.

Notifications: Members subscribe to membership change events. Clients can update their view if the group changes.

Higher overhead: Uses more resources than the static backend because of the coordination traffic (periodic pings).

Configuration

In Bedrock configuration:

{
    "libraries": [
        "libflock-bedrock-module.so"
    ],
    "providers": [
        {
            "type": "flock",
            "name": "my_flock_provider",
            "provider_id": 42,
            "config": {
                "bootstrap": "self",
                "group": {
                    "type": "centralized",
                    "config": {
                        "ping_timeout_ms": 2000,
                        "ping_interval_ms": 1000,
                        "ping_max_num_timeouts": 3
                    }
                },
                "file": "mygroup.flock"
            }
        }
    ]
}

Configuration options:

ping_timeout_ms: Timeout, in milliseconds, applied to each ping RPC sent to a member.
ping_interval_ms: Time, in milliseconds, to wait between two ping RPCs to the same member. Can be a single value or a pair [min, max], in which case each wait is drawn uniformly at random from that interval.
ping_max_num_timeouts: Number of consecutive ping timeouts after which a member is considered dead and removed from the group.
primary_address: (optional) Address of the process to use as primary (coordinator). If not provided, the first member in the initial view is used.
primary_provider_id: (optional) Provider ID of the primary process.

Example with randomized ping interval:

{
    "group": {
        "type": "centralized",
        "config": {
            "ping_timeout_ms": 2000,
            "ping_interval_ms": [500, 1500],
            "ping_max_num_timeouts": 3
        }
    }
}
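One way to picture the [min, max] form is that a fresh delay is drawn before each ping. The following is a minimal sketch of such a draw, not Flock's code: pick_ping_interval_ms is a hypothetical helper, and it uses the C standard library's rand(), which is only approximately uniform here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Flock): returns a delay in milliseconds
 * drawn from [min_ms, max_ms], both bounds included. */
unsigned pick_ping_interval_ms(unsigned min_ms, unsigned max_ms)
{
    assert(min_ms <= max_ms);
    /* Keep the span within what rand() can cover. */
    assert(max_ms - min_ms < (unsigned)RAND_MAX);
    unsigned span = max_ms - min_ms + 1;
    /* rand() % span has a slight modulo bias, acceptable for jitter. */
    return min_ms + (unsigned)rand() % span;
}
```

With "ping_interval_ms": [500, 1500], every call to pick_ping_interval_ms(500, 1500) returns a value between 500 and 1500 inclusive. When both bounds are equal, the result is that single value.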

In C code

/*
 * (C) 2024 The University of Chicago
 *
 * See COPYRIGHT in top-level directory.
 */
#include <assert.h>
#include <stdio.h>
#include <margo.h>
#include <flock/flock-server.h>
#include <flock/flock-bootstrap.h>

int main(int argc, char** argv)
{
    // Initialize Margo
    margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 0, 0);
    assert(mid);

    // Initialize provider args
    struct flock_provider_args args = FLOCK_PROVIDER_ARGS_INIT;
    flock_group_view_t initial_view = FLOCK_GROUP_VIEW_INITIALIZER;
    args.initial_view = &initial_view;

    // Bootstrap using self
    uint16_t provider_id = 42;
    flock_group_view_init_from_self(mid, provider_id, &initial_view);

    // Configure with centralized backend
    // Centralized backend: allows dynamic membership changes
    // The primary (first member in view by default) pings followers to detect failures
    const char* config =
        "{"
        "  \"group\": {"
        "    \"type\": \"centralized\","
        "    \"config\": {"
        "      \"ping_timeout_ms\": 2000,"
        "      \"ping_interval_ms\": 1000,"
        "      \"ping_max_num_timeouts\": 3"
        "    }"
        "  }"
        "}";

    // Register provider with centralized backend
    flock_provider_t provider;
    int ret = flock_provider_register(mid, provider_id, config, &args, &provider);
    assert(ret == FLOCK_SUCCESS);

    printf("Flock provider registered with CENTRALIZED backend\n");
    printf("Group membership can change dynamically\n");
    printf("Initial group size: %zu\n", initial_view.members.size);

    // Wait for finalize
    margo_wait_for_finalize(mid);

    return 0;
}

How it works

The centralized backend operates as follows:

Primary selection: By default, the first member in the initial view becomes the primary (coordinator). You can override this with primary_address and primary_provider_id configuration options.
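The selection rule can be sketched in a few lines of C. This is an illustration only: member_t and select_primary are hypothetical names that do not use Flock's view types, and the sketch assumes that primary_address and primary_provider_id are given together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative member record; Flock's own view types are not used here. */
typedef struct {
    const char* address;
    uint16_t    provider_id;
} member_t;

/* Returns the index of the primary in `view`.
 * If primary_address is NULL, the first member is used (the default).
 * Otherwise the member matching both address and provider ID is used,
 * and -1 is returned when no member matches. */
int select_primary(const member_t* view, size_t count,
                   const char* primary_address,
                   uint16_t primary_provider_id)
{
    assert(view != NULL && count > 0);
    if (primary_address == NULL) return 0;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(view[i].address, primary_address) == 0
            && view[i].provider_id == primary_provider_id)
            return (int)i;
    }
    return -1;
}
```

Returning -1 for an unknown primary is a choice made for this sketch; how Flock itself reports a primary that is missing from the initial view is not covered here.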

Ping mechanism: The primary periodically pings all followers to check if they are still alive.

Join protocol:

A new member contacts an existing member using the “join” bootstrap method
The request is forwarded to the primary
The primary adds the new member to the view
All members are notified of the new member
The new member receives the updated view
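The effect of these steps on each member's view can be modeled with plain arrays. The sketch below is a toy model, not Flock's implementation: view_t, group_t, group_init, and handle_join are hypothetical names, members are identified by integers, and the forwarding step is left out (the request is assumed to have already reached the primary).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MEMBERS 16

/* Illustrative view: a list of member identifiers. */
typedef struct {
    int    ids[MAX_MEMBERS];
    size_t count;
} view_t;

/* The primary's authoritative view, plus the local copy held by each
 * member (local[i] belongs to the member at position i). */
typedef struct {
    view_t authoritative;
    view_t local[MAX_MEMBERS];
} group_t;

/* Starts a group whose only member, the primary, has id primary_id. */
void group_init(group_t* g, int primary_id)
{
    assert(g != NULL);
    memset(g, 0, sizeof(*g));
    g->authoritative.ids[0] = primary_id;
    g->authoritative.count  = 1;
    g->local[0]             = g->authoritative;
}

/* Handles a join request at the primary.
 * Returns the new group size, or -1 if the group is full. */
int handle_join(group_t* g, int new_id)
{
    assert(g != NULL);
    view_t* v = &g->authoritative;
    if (v->count == MAX_MEMBERS) return -1;
    /* The primary adds the new member to the authoritative view. */
    v->ids[v->count++] = new_id;
    /* Existing members are notified and the new member receives the
     * updated view: every local copy now matches the primary's. */
    for (size_t i = 0; i < v->count; i++)
        g->local[i] = *v;
    return (int)v->count;
}
```

After group_init(&g, 1) and handle_join(&g, 2), both members hold a view containing two identifiers.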

Failure detection:

Primary sends ping RPCs to followers at regular intervals
If a ping times out, the timeout counter for that member increments
After ping_max_num_timeouts consecutive timeouts, the member is removed
All remaining members are notified of the change

Dynamic membership

Unlike the static backend, the centralized backend supports adding and removing members:

Adding members:

Use the “join” bootstrap method to add members to a running group.

Removing members:

Members are automatically removed when they:

Fail to respond to pings (failure detection)
Call flock_provider_deregister (graceful shutdown)

Primary resilience

The primary is a single point of failure. If the primary fails:

Followers will no longer receive pings or view updates
The group becomes effectively frozen

Best practices:

Run the primary on a reliable node
Monitor the primary’s health
Consider restarting the group if the primary fails

Example: Elastic service

Here’s an example of building an elastic service that can scale up dynamically:

Initial primary:

$ ./server
Flock provider registered with CENTRALIZED backend
Group membership can change dynamically
Initial group size: 1

Additional workers (can be started at any time using the join method):

$ ./join_server mygroup.flock
Joined group with 2 members

As you start more workers, they automatically join the group and are discovered by all members.

Performance considerations

The overhead of the centralized backend grows with the size of the group:

Network traffic:

Pings: O(N) messages per ping interval (N = number of followers)
Joins: O(N) messages to notify all members
View queries: May require RPC depending on caching

Primary load:

Must send pings to all followers
Must track timeout counters for each follower
Must coordinate view updates

For large groups (>100 members), consider:

Increasing ping interval to reduce traffic
Partitioning into multiple smaller groups
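To get a feel for the ping traffic, the steady-state rate can be estimated from the group size and the ping interval. The helper below is a back-of-the-envelope sketch, assuming one ping per follower per interval and ignoring timeouts and view updates. pings_per_second is a hypothetical name, not a Flock function.

```c
#include <assert.h>

/* Estimated number of ping RPCs the primary sends per second,
 * assuming one ping per follower per interval. */
double pings_per_second(unsigned num_followers, unsigned ping_interval_ms)
{
    assert(ping_interval_ms > 0);
    return (double)num_followers * 1000.0 / (double)ping_interval_ms;
}
```

For 100 followers, a 1000 ms interval gives 100 pings per second, while a 5000 ms interval gives 20.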

Tuning parameters

Choose ping and timeout values that balance failure detection speed against network overhead:

Fast failure detection (more overhead):

{
    "ping_timeout_ms": 1000,
    "ping_interval_ms": 500,
    "ping_max_num_timeouts": 2
}

Failure detection time: up to ~3 seconds (2 × 500ms intervals + 2 × 1000ms timeouts)

Slower failure detection (less overhead):

{
    "ping_timeout_ms": 5000,
    "ping_interval_ms": 3000,
    "ping_max_num_timeouts": 3
}

Failure detection time: up to ~24 seconds (3 × 3000ms intervals + 3 × 5000ms timeouts)
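A detection time estimate for other settings can be computed with a small helper. This sketch assumes that each missed ping costs one full interval plus one full timeout, and that a member is removed after ping_max_num_timeouts such misses in a row. worst_case_detection_ms is a hypothetical name, not a Flock function.

```c
#include <assert.h>

/* Upper-bound estimate, in milliseconds, of the time between a member
 * failing and its removal from the group. Assumes each missed ping costs
 * one full interval plus one full timeout. */
unsigned long worst_case_detection_ms(unsigned long ping_interval_ms,
                                      unsigned long ping_timeout_ms,
                                      unsigned long ping_max_num_timeouts)
{
    assert(ping_max_num_timeouts > 0);
    return ping_max_num_timeouts * (ping_interval_ms + ping_timeout_ms);
}
```

For example, worst_case_detection_ms(500, 1000, 2) evaluates to 3000 and worst_case_detection_ms(3000, 5000, 3) evaluates to 24000. Actual detection may be faster, for instance when a member fails just before its next ping is due.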