18 Lockfree
CMU 15-418/618,
Spring 2020
Locking Problem
▪ Locks can be big and expensive
- How many atomic operations does one lock require?
- How much data requires one lock?
- How much does it force threads to serialize?
CUDA 7 atomic operations
int atomicAdd(int* address, int val);
float atomicAdd(float* address, float val);
int atomicSub(int* address, int val);
int atomicExch(int* address, int val);
float atomicExch(float* address, float val);
int atomicMin(int* address, int val);
int atomicMax(int* address, int val);
unsigned int atomicInc(unsigned int* address, unsigned int val);
unsigned int atomicDec(unsigned int* address, unsigned int val);
int atomicCAS(int* address, int compare, int val);
int atomicAnd(int* address, int val); // bitwise
int atomicOr(int* address, int val); // bitwise
int atomicXor(int* address, int val); // bitwise
Recall: Atomic Increment in GCC / x86
type __sync_fetch_and_add (type *ptr, type value)
type __sync_add_and_fetch (type *ptr, type value)
0000000000000000 <fadd>:
0: 89 f0 mov %esi,%eax # x
2: f0 0f c1 07 lock xadd %eax,(%rdi) # t = *addr; *addr += x
6: c3 retq # return t
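The builtin above can be exercised directly; the sketch below (sum_with_threads is a hypothetical test harness, not from the lecture) has K threads hammer one shared counter with __sync_fetch_and_add, which compiles to the lock xadd shown:

```cpp
// Minimal sketch: K threads each atomically increment one shared counter
// N times using the GCC builtin (lock xadd on x86).
#include <thread>
#include <vector>

int sum_with_threads(int k, int n) {
    int counter = 0;
    std::vector<std::thread> ts;
    for (int i = 0; i < k; i++) {
        ts.emplace_back([&counter, n] {
            for (int j = 0; j < n; j++)
                __sync_fetch_and_add(&counter, 1);  // atomic read-modify-write
        });
    }
    for (auto& t : ts) t.join();
    return counter;   // always k * n: no increments are lost
}
```

With a plain `counter++` instead of the builtin, the result would usually be less than k·n due to lost updates (the "Race" line in the measurements below).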
Fetch & Add Performance
▪ Task
- K threads each incrementing single global variable N times
- Measure NPI = T·10⁹ / (N·K) nanoseconds per increment, where T is total run time in seconds
[Chart: "Summing Global Variable" — nanoseconds per increment vs. number of threads (1-8) for Race, Fetch+Add, Spin lock, and Mutex implementations]
Atomic Compare-And-Swap (CAS)
// atomicCAS:
// atomic compare and swap performs this logic atomically
int atomicCAS(int* addr, int compare, int val) {
int old = *addr;
if (old == compare)
*addr = val;
return old;
}
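Operations the hardware does not provide directly can be built from CAS plus a retry loop. A sketch using C++'s std::atomic (fetch_max is a hypothetical helper added here, not a CUDA or GCC primitive):

```cpp
// Building a read-modify-write operation (atomic fetch-max) out of a
// CAS retry loop, mirroring the atomicCAS logic above.
#include <atomic>

int fetch_max(std::atomic<int>* addr, int val) {
    int old = addr->load();
    // Retry until the CAS observes an unchanged value (or val is not larger).
    while (old < val &&
           !addr->compare_exchange_weak(old, val)) {
        // On failure, compare_exchange_weak reloads 'old' with the
        // current value; the loop re-checks whether val is still larger.
    }
    return old;   // value observed before any successful update
}
```

The same retry-until-unchanged pattern is what GCC generates for the bit-level __sync builtins discussed below.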
x86 cmpxchg
▪ Compare and exchange (atomic when used with lock prefix)
lock cmpxchg src, dst
- dst is often a memory address
- Compares the accumulator (%eax/%rax) with dst: if equal, dst is set to src and ZF is set; otherwise the accumulator is loaded with dst and ZF is cleared
Other Atomic Ops in GCC / x86
type __sync_fetch_and_xor (type *ptr, type value)
type __sync_xor_and_fetch (type *ptr, type value)
▪ Uses cmpxchg
▪ Requires loop
▪ Other bit-level ops are similar
C++ 11 atomic<T>
▪ Provides atomic read, write, read-modify-write of entire objects
- Atomicity may be implemented by mutex or efficiently by processor-supported atomic instructions (if T is
a basic type)
▪ Provides memory ordering semantics for operations before and after
atomic operations
- By default: sequential consistency
- See std::memory_order for more detail
atomic<int> i;
i++; // atomically increment i
int a = i;
// do stuff
i.compare_exchange_strong(a, 10); // if i has same value as a, set i to 10
bool b = i.is_lock_free(); // true if implementation of atomicity
// is lock free
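The snippet above, made runnable (atomic_demo is a wrapper added here; the failure branch of compare_exchange_strong is also shown, since it is easy to miss):

```cpp
// Exercising C++11 atomic<int>: increment, read, CAS success and failure.
#include <atomic>
#include <cassert>

int atomic_demo() {
    std::atomic<int> i(0);
    i++;                                 // atomic increment (seq_cst by default)
    int a = i;                           // atomic read; a == 1
    i.compare_exchange_strong(a, 10);    // i == a, so i becomes 10
    // On failure, compare_exchange_strong writes i's current value into
    // its first argument and leaves i unchanged:
    int expect = 99;
    bool ok = i.compare_exchange_strong(expect, 0);
    assert(!ok && expect == 10);
    assert(i.is_lock_free());            // true for int on mainstream hardware
    return i.load();
}
```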
Atomic Operations: Livelock?
Livelock is a state where a system is
executing many operations, but no
thread is making meaningful progress.
Atomic Operations: Starvation/Unfairness
A state where the system is making overall progress, but some individual processes make no progress.
(green cars make progress, but yellow cars are stopped)
Locking more than one location
▪ Data structures are often larger than a single memory
location
- How can an entire data structure be protected?
E.g., the 15-213 Proxylab cache
Example: a sorted linked list
struct Node {
    int value;
    Node* next;
};

struct List {
    Node* head;
};

void insert(List* list, int value) { ... }
void delete(List* list, int value) { ... }

What can go wrong if multiple threads operate on the linked list simultaneously?

[Diagram: Thread 1 inserting 6 into 3 → 5 → 10 → 11 → 18, with prev = 5 and cur = 10]
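For reference, a sequential version of the elided insert() might look like the following sketch (an assumed implementation; the traversal computes the prev/cur pair shown in the diagram):

```cpp
// Sequential sorted-list insert: walk with prev/cur to find the
// insertion point, then splice in the new node.
struct Node { int value; Node* next; };
struct List { Node* head; };

void insert(List* list, int value) {
    Node* n = new Node{value, nullptr};
    Node* prev = nullptr;
    Node* cur = list->head;
    while (cur && cur->value < value) {  // find prev/cur straddling value
        prev = cur;
        cur = cur->next;
    }
    n->next = cur;
    if (prev) prev->next = n;
    else      list->head = n;            // insert at head (or empty list)
}
```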
Example: simultaneous insertion
Thread 1 attempts to insert 6
Thread 2 attempts to insert 7
Thread 1: inserting 6 into 3 → 5 → 10 → 11 → 18 (prev = 5, cur = 10)
Thread 2: inserting 7 into 3 → 5 → 10 → 11 → 18 (prev = 5, cur = 10)
Thread 1 and thread 2 both compute same prev and cur.
Result: one of the insertions gets lost!
Result (assuming thread 1 updates prev->next before thread 2): 3 → 5 → 7 → 10 → 11 → 18 — node 6 is lost
Solution 1: protect the list with a single lock
struct Node {
    int value;
    Node* next;
};

struct List {
    Node* head;
    Lock lock;    // per-list lock
};

void insert(List* list, int value) { ... }
void delete(List* list, int value) { ... }
▪ Bad:
- Operations on the data structure are serialized
- May limit parallel application performance
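Solution 1 sketched with standard C++ primitives: one std::mutex per list, held for the duration of each operation (the sorted-insert body is an assumed implementation of the elided code):

```cpp
// Coarse-grained locking: the entire insert runs under one per-list
// mutex, so concurrent operations on the same list serialize.
#include <mutex>

struct Node { int value; Node* next; };
struct List {
    Node* head = nullptr;
    std::mutex lock;                     // per-list lock
};

void insert(List* list, int value) {
    std::lock_guard<std::mutex> guard(list->lock);  // held for whole op
    Node* n = new Node{value, nullptr};
    Node* prev = nullptr;
    Node* cur = list->head;
    while (cur && cur->value < value) { prev = cur; cur = cur->next; }
    n->next = cur;
    if (prev) prev->next = n;
    else      list->head = n;
}
```

Correct, but every operation excludes every other — the serialization cost the bullet above describes.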
Solution 2: “hand-over-hand” locking
Thread 0: delete(11)
[Diagram: thread 0 traverses 3 → 5 → 10 → 11 → 18, holding locks on at most two adjacent nodes (prev and cur) at a time, releasing each lock only after acquiring the next]
Solution 2: “hand-over-hand” locking
Thread 0: delete(11)
Thread 1: delete(10)
[Diagram: after thread 0's delete(11): 3 → 5 → 10 → 18, with thread 1 traversing behind, holding locks on two adjacent nodes]
Solution 2: “hand-over-hand” locking
Thread 0: delete(11)
Thread 1: delete(10)
[Diagram: after thread 1's delete(10): 3 → 5 → 18]
Version 2a: Padded List
[Diagram: padded list −∞ → 5 → 10 → 11 → +∞]
▪ Assume
- Only insert/delete finite values
- List starts with −∞
- List ends with +∞
- Guaranteed to find insertion/deletion point within list
Solution 2a: Padded List HOH Locking
struct Node {
    int value;
    Node* next;
    Lock* lock;   // per-node lock
};

struct List {
    Node* head;
};

void insert(List* list, int value) { ... }
void delete(List* list, int value) { ... }
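A possible hand-over-hand implementation of the elided delete(), sketched with std::mutex as the per-node Lock and INT_MIN/INT_MAX standing in for the −∞/+∞ sentinels (delete_value is used as the name because delete is reserved in C++; this is an assumed implementation, not the slide's code):

```cpp
// Hand-over-hand delete on the padded list: the traversal holds at most
// two locks (prev and cur) at any moment. Sentinels guarantee prev and
// cur always exist, so no NULL checks are needed mid-traversal.
#include <climits>
#include <mutex>

struct Node {
    int value;
    Node* next;
    std::mutex lock;                  // per-node lock
};
struct List { Node* head; };

void init(List* list) {
    Node* tail = new Node{INT_MAX, nullptr};   // +inf sentinel
    list->head = new Node{INT_MIN, tail};      // -inf sentinel
}

bool delete_value(List* list, int value) {
    Node* prev = list->head;
    prev->lock.lock();
    Node* cur = prev->next;
    cur->lock.lock();
    while (cur->value < value) {      // hand over hand: lock next, drop prev
        prev->lock.unlock();
        prev = cur;
        cur = cur->next;
        cur->lock.lock();
    }
    bool found = (cur->value == value);
    if (found) {
        prev->next = cur->next;       // splice out while holding both locks
        cur->lock.unlock();
        delete cur;
    } else {
        cur->lock.unlock();
    }
    prev->lock.unlock();
    return found;
}
```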
Fine-grained (HOH) Locking
▪ Goal: enable parallelism in data structure operations
- Reduces contention for global data structure lock
- In previous linked-list example: a single monolithic lock is overly conservative
(operations on different parts of the linked list can proceed in parallel)
▪ Costs?
- Overhead of taking a lock each traversal step (extra instructions + traversal now
involves memory writes)
- Be sure to use spin locks!
- Extra storage cost (a lock per node)
Where Can HOH Locking (Possibly) Be Used?
▪ Acyclic data structures
- Must be able to order lock acquisition/release
- Singly linked list
- E.g., hash table bucket chain
- Binary search tree (very tricky)
- Skip list
▪ Not for cyclic structures
- E.g., doubly-linked list
Lock-free data structures
Blocking algorithms/data structures
▪ A blocking algorithm allows one thread to prevent other
threads from completing operations on a shared data structure
indefinitely
▪ Example:
- Thread 0 takes a lock on a node in our linked list
- Thread 0 is swapped out by the OS, or crashes, or is just really slow (takes a page fault), etc.
- Now, no other threads can complete operations on the data structure (although thread 0 is
not actively making progress modifying it)
Lock-free algorithms
▪ Non-blocking algorithms are lock-free if some thread is
guaranteed to make progress (“systemwide progress”)
- In lock-free case, it is not possible to preempt one of the threads at an
inopportune time and prevent progress by rest of system
- Note: this definition does not prevent starvation of any one thread
Single Reader/Single Writer Bounded Queue
Slots 0-7; head = index of next element to pop, tail = next free slot.

Empty:                        head == tail (slot contents are stale)
Push 3, 1, 4, 1, 5, 9, 2:     [3 1 4 1 5 9 2 _]   head = 0, tail = 7
Pop ×4 (returns 3, 1, 4, 1):  head = 4, tail = 7
Push 6, 5, 3, 5:              [5 3 5 _ 5 9 2 6]   head = 4, tail = 3 (tail wrapped; queue now full)
// returns false if queue is empty
bool pop(Queue* q, int* value) {
    if (q->head != q->tail) {
        *value = q->data[q->head];
        q->head = MOD_N(q->head + 1);
        return true;
    }
    return false;
}
▪ Only two threads (one producer, one consumer) accessing queue at the same time
▪ Threads never synchronize or wait on each other
- When queue is empty (pop fails), when it is full (push fails)
- What is special about operations on head & tail that avoids need for synchronization?
* Assume a sequentially consistent memory system
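Putting the pieces together, one possible self-contained version of the bounded queue (N = 8 and the atomic head/tail representation are choices made here; the slide's MOD_N becomes % N, and seq_cst atomics stand in for the assumed sequentially consistent memory system):

```cpp
// Single-producer/single-consumer bounded queue: a ring of N slots.
// head is written only by the consumer, tail only by the producer; each
// side merely *reads* the other's index, so no locks are needed.
#include <atomic>

const int N = 8;

struct Queue {
    int data[N];
    std::atomic<int> head{0};   // next slot to pop  (consumer-owned)
    std::atomic<int> tail{0};   // next slot to push (producer-owned)
};

bool push(Queue* q, int value) {        // returns false if queue is full
    int t = q->tail.load();
    if ((t + 1) % N == q->head.load())  // full: one slot kept unused
        return false;
    q->data[t] = value;
    q->tail.store((t + 1) % N);         // publish after the data write
    return true;
}

bool pop(Queue* q, int* value) {        // returns false if queue is empty
    int h = q->head.load();
    if (h == q->tail.load())            // empty
        return false;
    *value = q->data[h];
    q->head.store((h + 1) % N);
    return true;
}
```

One slot is deliberately wasted so that head == tail unambiguously means empty; the usable capacity is N − 1.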
Single reader, single writer unbounded queue
(Leaky version: popped nodes, shown in parentheses, are never freed)

push 3, push 10:    head → 3 → 10 ← tail
pop (returns 3):    (3), head → 10 ← tail
pop (returns 10):   (3), (10) ← head == tail
push 5:             (3), (10), head → 5 ← tail
Single reader, single writer unbounded queue * Source: Dr. Dobbs Journal
struct Queue {
    Node* head;
    Node* tail;
};

void init(Queue* q) {
    q->head = q->tail = new Node;    // dummy node; head trails live elements
}

void push(Queue* q, Node* n) {
    q->tail->next = n;
    q->tail = q->tail->next;
}

// returns false if queue is empty
bool pop(Queue* q, int* value) {
    if (q->head != q->tail) {
        *value = q->head->next->value;
        q->head = q->head->next;
        return true;
    }
    return false;
}
push 3, push 10:    head, reclaim → 3 → 10 ← tail
pop (returns 3):    reclaim → (3), head → 10 ← tail
pop (returns 10):   reclaim → (3), (10) ← head == tail
struct Queue {
    Node* head;
    Node* tail;
    Node* reclaim;
};

void init(Queue* q) {
    q->head = q->tail = q->reclaim = new Node;
}

void push(Queue* q, Node* n) {
    q->tail->next = n;
    q->tail = q->tail->next;
    while (q->reclaim != q->head) {     // producer frees consumed nodes
        Node* tmp = q->reclaim;
        q->reclaim = q->reclaim->next;
        delete tmp;
    }
}

// returns false if queue is empty
bool pop(Queue* q, int* value) {
    if (q->head != q->tail) {
        *value = q->head->next->value;
        q->head = q->head->next;
        return true;
    }
    return false;
}

▪ Tail points to last element added
Lock-free stack

Node* pop(Stack* s) {
while (1) {
Node* old_top = s->top;
if (old_top == NULL)
return NULL;
Node* new_top = old_top->next;
if (compare_and_swap(&s->top, old_top, new_top) == old_top)
return old_top; // Assume that consumer then recycles old_top
}
}
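The slide's compare_and_swap maps directly onto C++'s compare_exchange. A self-contained sketch of the full stack (push and pop) under that substitution — still ABA-prone if nodes are recycled, as discussed next:

```cpp
// Lock-free stack (Treiber stack) on an atomic top pointer. On CAS
// failure, compare_exchange_weak reloads old_top, so the loop retries
// against the freshly observed top.
#include <atomic>

struct Node { Node* next; int value; };
struct Stack { std::atomic<Node*> top{nullptr}; };

void push(Stack* s, Node* n) {
    Node* old_top = s->top.load();
    do {
        n->next = old_top;              // link to current top, then swing top
    } while (!s->top.compare_exchange_weak(old_top, n));
}

Node* pop(Stack* s) {
    Node* old_top = s->top.load();
    while (old_top != nullptr) {
        Node* new_top = old_top->next;
        if (s->top.compare_exchange_weak(old_top, new_top))
            return old_top;             // caller owns/recycles old_top
        // failure reloaded old_top; retry (including the NULL check)
    }
    return nullptr;                     // stack was empty
}
```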
Main idea: as long as no other thread has modified the stack, a thread’s modification can proceed.
Note difference from fine-grained locks example earlier: before, implementation locked a part of a
data-structure for fine-grained access. Here, threads do not hold lock on data-structure at all.
ABA problem (initial stack: A → B → C):

Thread 0: begin pop()              (local: old_top = A, new_top = B)
Thread 1: begin pop()              (local: old_top = A)
Thread 1: complete pop()           (returns A)               stack: B → C
Thread 1: begin/complete push(D)                             stack: D → B → C
Thread 1: modify node A            (e.g., set value = 42)
Thread 1: begin/complete push(A)                             stack: A → D → B → C
Thread 0: CAS succeeds             (sets top to B!)
Thread 0: complete pop()           (returns A)               stack: B → C

Stack structure is corrupted! (lost D)
Why Does ABA Problem Arise?
▪ Use node address as identifier
- Assume that if atomic CAS gets matching address, list has
not been modified
▪ But, what if node has been deleted and recycled?
- Atomic CAS can get a matching address, even though the list has been modified
Lock-free stack using counter for ABA soln
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}
▪ x86 has no general double compare-and-swap on two arbitrary addresses
- But one can simply ensure the stack's count and top fields are contiguous in
memory and use the double-width single compare-and-swap instructions below
▪ cmpxchg8b
- “compare and exchange eight bytes”
- Can be used for compare-and-swap of two 32-bit values
▪ cmpxchg16b
- “compare and exchange 16 bytes”
- Can be used for compare-and-swap of two 64-bit values
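The idea behind cmpxchg8b can be sketched portably by packing two 32-bit fields into one 64-bit atomic word (pop_with_count and the field layout here are illustrative, not from the slides):

```cpp
// Pack a 32-bit "top" index and a 32-bit pop counter into one 64-bit
// word, so one CAS atomically checks and updates both — the counter
// makes a recycled "top" value distinguishable from the original.
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> packed{0};   // high 32 bits: top index, low 32: pop_count

bool pop_with_count(uint32_t expected_top, uint32_t new_top) {
    uint64_t old_word = packed.load();
    uint32_t count = (uint32_t)old_word;
    if ((uint32_t)(old_word >> 32) != expected_top)
        return false;
    uint64_t new_word = ((uint64_t)new_top << 32) | (uint32_t)(count + 1);
    // Succeeds only if *both* top and count are unchanged since the load.
    return packed.compare_exchange_strong(old_word, new_word);
}
```

With a 64-bit top pointer the pair needs 128 bits, which is where cmpxchg16b comes in.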
Another Concern: Referencing Freed Memory
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}

Node* pop(Stack* s) {
    while (1) {
        int pop_count = s->pop_count;
        Node* top = s->top;
        if (top == NULL)
            return NULL;
        Node* new_top = top->next;   // What if top has been freed at this point
                                     // by another thread that popped it?
        if (double_compare_and_swap(&s->top, top, new_top,
                                    &s->pop_count, pop_count, pop_count + 1))
            return top;
    }
}

T1 & T2 both popping:

Case 1:
1. T1 completes pop and gets copy of top
2. T2 starts pop
   - But T2 will get a different value for top

Case 2:
1. T1 has not yet done its CAS
2. T2 starts pop
   - Both have the same copy of top
   - Both have the same value of pop_count
3. T1 does its CAS
   - Then the CAS by T2 will fail
   - So it doesn't matter that T2 had stale data

It is possible for T1 to recycle top, so that T2 reads a bogus value for new_top — but T2's CAS will then fail. So this all looks OK.
Another ABA Solution: Hazard Pointers
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
};

Node* hazard[NUM_THREADS];

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}
Node* pop(Stack* s) {
while (1) {
hazard[t] = s->top;
Node* top = hazard[t];
if (top == NULL)
return NULL;
Node* new_top = top->next;
if (compare_and_swap(&s->top, top, new_top))
return top; // Caller must clear hazard[t] when it’s done with top
}
}
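The reclamation side of hazard pointers is not shown on the slide; a minimal sketch of the check a thread might run before recycling a popped node (safe_to_free is a hypothetical helper added here):

```cpp
// Before freeing a node, scan every thread's published hazard pointer.
// If any slot still points at the node, some thread may dereference it,
// so freeing must be deferred (e.g., by keeping it on a retire list).
const int NUM_THREADS = 8;

struct Node { Node* next; int value; };
Node* hazard[NUM_THREADS];          // one published pointer per thread

bool safe_to_free(Node* n) {
    for (int t = 0; t < NUM_THREADS; t++)
        if (hazard[t] == n)         // still protected by thread t
            return false;
    return true;                    // no thread holds a hazard on n
}
```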
Lock-free linked list insertion*

// assume case of insert into empty list handled
// here (keep code on slide simple for class discussion)
void insert(List* list, Node* after, Node* n) {
    Node* prev = list->head;
    while (prev->next) {
        if (prev == after) {
            while (1) {
                Node* old_next = prev->next;
                n->next = old_next;
                if (compare_and_swap(&prev->next, old_next, n) == old_next)
                    return;
            }
        }
        prev = prev->next;
    }
}

▪ No overhead of taking locks
▪ No per-node storage overhead
* For simplicity, this slide assumes the *only* operation on the list is insert
Lock-free linked list deletion
Supporting lock-free deletion significantly complicates data-structure
Consider case where B is deleted simultaneously with successful insertion of E after B.
B now points to E, but B is not in the list!
[Diagram: A → B → C → D; delete(B) CASes A->next from B to C, while insert(E) CASes B->next from C to E. Both CASes succeed, but B has been spliced out — E hangs off the removed node and is lost.]
Lock-free vs. locks performance comparison
Lock-free algorithm run time normalized to run time of using pthread mutex locks
[Chart: run times of lock-free (lf) and fine-grained-lock (fg) implementations of queue enqueue/dequeue and linked list operations, normalized to the pthread mutex version]
In practice: why lock free data-structures?
▪ When optimizing parallel programs in this class you often assume that
only your program is using the machine
- Because you care about performance
- Typical assumption in scientific computing, graphics, data analytics, etc.
▪ In these cases, well written code with locks can be as fast (or faster) than
lock-free code
▪ But there are situations where code with locks can suffer from tricky
performance problems
- Multi-programmed situations where page faults, pre-emption, etc. can occur while thread
is in a critical section
- Creates problems like priority inversion, convoying, crashing in critical section, etc. that are
often discussed in OS classes
▪ Lock-free code also does well on large data structures with sparse updates
- The chance of two updates contending at the same location is very low
Summary
▪ Use fine-grained locking to reduce contention (maximize parallelism)
in operations on shared data structures
- But fine-granularity can increase code complexity (errors) and increase execution overhead
More reading
▪ Michael and Scott, 1996. "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms"
- Multiple reader/writer lock-free queue