User-Level Threads (ULTs)
In this tutorial, you will learn about User-Level Threads (ULTs), the work units in Argobots that enable lightweight parallelism and concurrency in your applications.
Note
Historical Note: Prior to Argobots 1.2, there was a distinction between ULTs (User-Level Threads) and tasklets. Tasklets were stackless work units with lower overhead but more limitations. As of Argobots 1.2, tasklets are simply a typedef for ULTs, and there is no longer any functional difference. All work units are now ULTs with their own stacks.
Key Concepts
- User-Level Threads (ULTs)
ULTs are the work units in Argobots. Each ULT has its own execution stack allocated from the heap. This allows ULTs to:
Yield execution: Temporarily pause and resume later
Be migrated: Move between execution streams
Make recursive calls: Stack accommodates function call depth
Be suspended/resumed: Full context switching support
Memory: Each ULT allocates a stack (default 16KB but can be configured differently at compile time and at run time)
- Stack Management
ULTs require sufficient stack space for their operations. The stack size can be configured when creating ULTs:
Default stack size varies by platform (typically 16KB-64KB)
Recursive algorithms may need larger stacks
Simple operations can use smaller stacks to save memory
Stack size is set via ABT_thread_attr_set_stacksize()
Basic ULT Example
Here’s a basic example demonstrating ULT creation and execution:
/*
 * ULT Example: User-Level Threads with stacks
 * ULTs can yield, be migrated, and make recursive calls
 */

#include <stdio.h>
#include <stdlib.h>
#include <abt.h>

#define NUM_TASKS 8

/* Recursive fibonacci function - requires stack */
int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

typedef struct {
    int task_id;
    int n;
} task_arg_t;

void ult_func(void *arg)
{
    task_arg_t *task = (task_arg_t *)arg;
    int result = fibonacci(task->n);

    int xstream_rank;
    ABT_xstream_self_rank(&xstream_rank);

    printf("ULT %d on ES %d: fib(%d) = %d\n",
           task->task_id, xstream_rank, task->n, result);
}

int main(int argc, char **argv)
{
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_thread threads[NUM_TASKS];
    ABT_thread_attr attr;
    task_arg_t task_args[NUM_TASKS];

    ABT_init(argc, argv);

    printf("=== ULT Example ===\n");
    printf("ULTs have their own stack and can make recursive calls\n\n");

    /* Get primary execution stream and pool */
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Create thread attributes with custom stack size */
    ABT_thread_attr_create(&attr);
    ABT_thread_attr_set_stacksize(attr, 16384); /* 16KB stack */

    /* Create ULTs */
    for (int i = 0; i < NUM_TASKS; i++) {
        task_args[i].task_id = i;
        task_args[i].n = 10 + i; /* fib(10) through fib(17) */

        ABT_thread_create(pool, ult_func, &task_args[i], attr, &threads[i]);
    }

    /* Free attribute */
    ABT_thread_attr_free(&attr);

    /* Wait for all ULTs */
    for (int i = 0; i < NUM_TASKS; i++) {
        ABT_thread_free(&threads[i]);
    }

    printf("\nAll ULTs completed\n");
    printf("Note: ULTs can yield, be suspended/resumed, and migrated\n");

    ABT_finalize();
    return 0;
}
Expected output:
=== ULT Example ===
ULTs have their own stack and can make recursive calls
ULT 0 on ES 0: fib(10) = 55
ULT 1 on ES 0: fib(11) = 89
ULT 2 on ES 0: fib(12) = 144
...
All ULTs completed
Note: ULTs can yield, be suspended/resumed, and migrated
Key Points
- Custom Stack Size
ABT_thread_attr_create(&attr);
ABT_thread_attr_set_stacksize(attr, 16384); /* 16KB stack */
ULTs need sufficient stack for their operations. For recursive algorithms, you may need larger stacks. For simple operations, smaller stacks save memory.
- Recursive Computation
The fibonacci function makes recursive calls, requiring stack space. ULTs handle this naturally with their own stacks.
- Stack Memory Cost
Creating 8 ULTs with 16KB stacks costs 128KB of memory just for stacks. With thousands of ULTs, consider the memory implications and adjust stack size accordingly.
Note
Because RPC handlers can have deep call stacks through networking libraries in Mochi,
Margo will automatically set the default stack size of ULTs to 2MB. It is therefore
recommended to use ABT_thread_attr_set_stacksize to set the stack size back
to a smaller number if you create a ULT that you know will not require more than a
few KB.
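To make the memory trade-off concrete, the arithmetic can be sketched in a few lines of plain C. The helpers `total_stack_bytes` and `print_stack_cost` are hypothetical names introduced for illustration; the 16KB and 2MB figures come from the text above, and the calculation only counts stack space, not any other per-ULT bookkeeping.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: bytes reserved for stacks by num_ults ULTs that
 * each use a stack of stack_size bytes. */
size_t total_stack_bytes(size_t num_ults, size_t stack_size)
{
    return num_ults * stack_size;
}

/* Hypothetical helper: compare the 16KB stack used in the example above
 * with the 2MB default that Margo sets. */
void print_stack_cost(size_t num_ults)
{
    const size_t small_stack = 16 * 1024;       /* 16KB */
    const size_t margo_stack = 2 * 1024 * 1024; /* 2MB */

    printf("%zu ULTs x 16KB = %zu bytes\n",
           num_ults, total_stack_bytes(num_ults, small_stack));
    printf("%zu ULTs x 2MB  = %zu bytes\n",
           num_ults, total_stack_bytes(num_ults, margo_stack));
}
```

For example, `print_stack_cost(1000)` reports 16,384,000 bytes (about 16MB) for 16KB stacks against 2,097,152,000 bytes (about 2GB) for 2MB stacks, which is why lowering the stack size matters when many small ULTs are created.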
Lightweight Work Example
For simple computations that don’t require deep recursion, you can create ULTs with the default attributes instead of building a custom attribute object:
/*
 * Simple ULT Example: Lightweight work units
 * All work units are ULTs (User-Level Threads)
 */

#include <stdio.h>
#include <stdlib.h>
#include <abt.h>

#define NUM_TASKS 8

typedef struct {
    int task_id;
    int value;
} task_arg_t;

/* Simple computation */
void simple_func(void *arg)
{
    task_arg_t *task = (task_arg_t *)arg;
    int result = task->value * task->value;

    int xstream_rank;
    ABT_xstream_self_rank(&xstream_rank);

    printf("ULT %d on ES %d: %d^2 = %d\n",
           task->task_id, xstream_rank, task->value, result);
}

int main(int argc, char **argv)
{
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_thread threads[NUM_TASKS];
    task_arg_t task_args[NUM_TASKS];

    ABT_init(argc, argv);

    printf("=== Simple ULT Example ===\n");
    printf("ULTs can be used for all types of work\n\n");

    /* Get primary execution stream and pool */
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Create ULTs with default attributes */
    for (int i = 0; i < NUM_TASKS; i++) {
        task_args[i].task_id = i;
        task_args[i].value = 10 + i;

        ABT_thread_create(pool, simple_func, &task_args[i],
                          ABT_THREAD_ATTR_NULL, &threads[i]);
    }

    /* Wait for all ULTs */
    for (int i = 0; i < NUM_TASKS; i++) {
        ABT_thread_free(&threads[i]);
    }

    printf("\nAll ULTs completed\n");

    ABT_finalize();
    return 0;
}
Expected output:
=== Simple ULT Example ===
ULTs can be used for all types of work
ULT 0 on ES 0: 10^2 = 100
ULT 1 on ES 0: 11^2 = 121
...
All ULTs completed
Key Points
- Simple Computation
For lightweight work that doesn’t require much stack, you can still use ULTs. There’s no need for a separate work unit type.
- Default Attributes
You can pass ABT_THREAD_ATTR_NULL to use default attributes, which is fine for most use cases.
- Performance Considerations
While ULTs have some overhead for stack allocation, Argobots optimizes this well by pre-allocating and reusing stacks.
Work Unit Reuse with Revive
Creating and destroying work units has overhead. For operations that repeat frequently, you can reuse work units with revive operations:
/*
 * Work Unit Reuse Example: Reviving ULTs
 * Instead of creating/destroying work units, reuse them for efficiency
 */

#include <stdio.h>
#include <stdlib.h>
#include <abt.h>

#define NUM_ITERATIONS 4
#define NUM_UNITS 4

typedef struct {
    int iteration;
    int unit_id;
} work_arg_t;

void work_func(void *arg)
{
    work_arg_t *work = (work_arg_t *)arg;
    printf("  Work unit %d, iteration %d\n", work->unit_id, work->iteration);
}

int main(int argc, char **argv)
{
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_thread threads[NUM_UNITS];
    work_arg_t thread_args[NUM_UNITS];

    ABT_init(argc, argv);

    printf("=== Work Unit Revive Example ===\n");
    printf("Reusing ULTs across multiple iterations\n\n");

    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Initial creation of ULTs */
    printf("Iteration 0 (initial creation):\n");
    for (int i = 0; i < NUM_UNITS; i++) {
        thread_args[i].unit_id = i;
        thread_args[i].iteration = 0;
        ABT_thread_create(pool, work_func, &thread_args[i],
                          ABT_THREAD_ATTR_NULL, &threads[i]);
    }

    /* Wait for first iteration */
    for (int i = 0; i < NUM_UNITS; i++) {
        ABT_thread_join(threads[i]);
    }

    /* Revive and reuse ULTs for additional iterations */
    for (int iter = 1; iter < NUM_ITERATIONS; iter++) {
        printf("\nIteration %d (reviving existing ULTs):\n", iter);

        for (int i = 0; i < NUM_UNITS; i++) {
            thread_args[i].iteration = iter;

            /* Revive the ULT instead of creating a new one */
            ABT_thread_revive(pool, work_func, &thread_args[i], &threads[i]);
        }

        /* Wait for this iteration */
        for (int i = 0; i < NUM_UNITS; i++) {
            ABT_thread_join(threads[i]);
        }
    }

    /* Finally free the ULTs */
    for (int i = 0; i < NUM_UNITS; i++) {
        ABT_thread_free(&threads[i]);
    }

    printf("\nReviving work units avoids creation/destruction overhead\n");
    printf("This is especially important for high-frequency operations\n");

    ABT_finalize();
    return 0;
}
Key Points
- Initial Creation
Work units are created normally the first time.
- Revive Instead of Create
ABT_thread_revive(pool, work_func, &thread_args[i], &threads[i]);
Instead of freeing and creating a new ULT, we revive the existing one. This:
Reuses the allocated stack
Avoids allocation/deallocation overhead
Maintains the work unit handle
- When to Use Revive
Iterative algorithms with repeated work patterns
High-frequency work unit creation
When the number of concurrent work units is bounded
Pool-based task systems with worker recycling
- Performance Impact
Reviving can be faster than creating/destroying for high-frequency operations, especially when stack allocation is expensive.
Thread Attributes
ULTs support various attributes for customization:
- Stack Size
ABT_thread_attr_create(&attr);
ABT_thread_attr_set_stacksize(attr, 32768); /* 32KB */
ABT_thread_create(pool, func, arg, attr, &thread);
ABT_thread_attr_free(&attr);
Default stack size can be queried with ABT_thread_attr_get_stacksize().
- Migratable
ABT_thread_attr_set_migratable(attr, ABT_FALSE);
Prevent ULT migration. Slightly improves performance if migration isn’t needed.
- Stack Guard
Some flags can be provided when building Argobots to support stack overflow detection. With spack, for instance, the stackguard variant (none by default) can be set to canary-32, mprotect, or mprotect-strict for that purpose.
Mochi Usage Patterns
In Mochi applications, ULTs are used for everything. Each RPC turns into a ULT,
and each blocking call (e.g. margo_forward) yields to other ULTs so that
other work can continue while progress is made on I/O and communications.
API Reference
- ULT Functions
int ABT_thread_create(ABT_pool pool, void (*thread_func)(void *), void *arg, ABT_thread_attr attr, ABT_thread *newthread)
Create a new ULT.
int ABT_thread_revive(ABT_pool pool, void (*thread_func)(void *), void *arg, ABT_thread *thread)
Revive a terminated ULT for reuse.
int ABT_thread_join(ABT_thread thread)
Wait for a ULT to terminate (doesn’t free it).
int ABT_thread_free(ABT_thread *thread)
Join and free a ULT (blocks until termination).
int ABT_thread_yield()
Yield execution to allow other ULTs to run.
- Attribute Functions
int ABT_thread_attr_create(ABT_thread_attr *newattr)
Create a ULT attribute object.
int ABT_thread_attr_free(ABT_thread_attr *attr)
Free a ULT attribute object.
int ABT_thread_attr_set_stacksize(ABT_thread_attr attr, size_t stacksize)
Set the stack size for ULTs created with this attribute.
int ABT_thread_attr_set_migratable(ABT_thread_attr attr, ABT_bool migratable)
Control whether ULTs can be migrated between execution streams.