18 Lockfree
CMU 15-418/618,
Spring 2020
Locking Problem
▪ Locks can be big and expensive
- How many atomic operations does one lock require?
- How much data requires one lock?
- How much does it force threads to serialize?
CUDA 7 atomic operations
int atomicAdd(int* address, int val);
float atomicAdd(float* address, float val);
int atomicSub(int* address, int val);
int atomicExch(int* address, int val);
float atomicExch(float* address, float val);
int atomicMin(int* address, int val);
int atomicMax(int* address, int val);
unsigned int atomicInc(unsigned int* address, unsigned int val);
unsigned int atomicDec(unsigned int* address, unsigned int val);
int atomicCAS(int* address, int compare, int val);
int atomicAnd(int* address, int val); // bitwise
int atomicOr(int* address, int val); // bitwise
int atomicXor(int* address, int val); // bitwise
Recall: Atomic Increment in GCC / x86
type __sync_fetch_and_add (type *ptr, type value)
type __sync_add_and_fetch (type *ptr, type value)
0000000000000000 <fadd>:
0: 89 f0 mov %esi,%eax # x
2: f0 0f c1 07 lock xadd %eax,(%rdi) # t = *addr; *addr += x
6: c3 retq # return t
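The builtin above can be exercised directly; the sketch below (sum_with_threads is a hypothetical test harness, not from the lecture) has K threads hammer one shared counter with __sync_fetch_and_add, which compiles to the lock xadd shown:

```cpp
// Minimal sketch: K threads each atomically increment one shared counter
// N times using the GCC builtin (lock xadd on x86).
#include <thread>
#include <vector>

int sum_with_threads(int k, int n) {
    int counter = 0;
    std::vector<std::thread> ts;
    for (int i = 0; i < k; i++) {
        ts.emplace_back([&counter, n] {
            for (int j = 0; j < n; j++)
                __sync_fetch_and_add(&counter, 1);  // atomic read-modify-write
        });
    }
    for (auto& t : ts) t.join();
    return counter;   // always k * n: no increments are lost
}
```

With a plain `counter++` instead of the builtin, the result would usually be less than k·n due to lost updates (the "Race" line in the measurements below).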
Fetch & Add Performance
▪ Task
- K threads each incrementing single global variable N times
- Measure NPI = T·10⁹ / (N·K) nanoseconds per increment, where T is total run time in seconds
[Chart: "Summing Global Variable" — nanoseconds per increment vs. number of threads (1-8) for Race, Fetch+Add, Spin lock, and Mutex implementations]
Atomic Compare-And-Swap (CAS)
// atomicCAS:
// atomic compare and swap performs this logic atomically
int atomicCAS(int* addr, int compare, int val) {
int old = *addr;
if (old == compare)
*addr = val;
return old;
}
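Operations the hardware does not provide directly can be built from CAS plus a retry loop. A sketch using C++'s std::atomic (fetch_max is a hypothetical helper added here, not a CUDA or GCC primitive):

```cpp
// Building a read-modify-write operation (atomic fetch-max) out of a
// CAS retry loop, mirroring the atomicCAS logic above.
#include <atomic>

int fetch_max(std::atomic<int>* addr, int val) {
    int old = addr->load();
    // Retry until the CAS observes an unchanged value (or val is not larger).
    while (old < val &&
           !addr->compare_exchange_weak(old, val)) {
        // On failure, compare_exchange_weak reloads 'old' with the
        // current value; the loop re-checks whether val is still larger.
    }
    return old;   // value observed before any successful update
}
```

The same retry-until-unchanged pattern is what GCC generates for the bit-level __sync builtins discussed below.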
x86 cmpxchg
▪ Compare and exchange (atomic when used with lock prefix)
lock cmpxchg src, dst
- dst is often a memory address
- Compares the accumulator (%eax/%rax) with dst: if equal, dst is set to src and ZF is set; otherwise the accumulator is loaded with dst and ZF is cleared
Other Atomic Ops in GCC / x86
type __sync_fetch_and_xor (type *ptr, type value)
type __sync_xor_and_fetch (type *ptr, type value)
▪ Uses cmpxchg
▪ Requires loop
▪ Other bit-level ops are similar
C++ 11 atomic<T>
▪ Provides atomic read, write, read-modify-write of entire objects
- Atomicity may be implemented by mutex or efficiently by processor-supported atomic instructions (if T is
a basic type)
▪ Provides memory ordering semantics for operations before and after
atomic operations
- By default: sequential consistency
- See std::memory_order for more detail
atomic<int> i;
i++; // atomically increment i
int a = i;
// do stuff
i.compare_exchange_strong(a, 10); // if i has same value as a, set i to 10
bool b = i.is_lock_free(); // true if implementation of atomicity
// is lock free
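The snippet above, made runnable (atomic_demo is a wrapper added here; the failure branch of compare_exchange_strong is also shown, since it is easy to miss):

```cpp
// Exercising C++11 atomic<int>: increment, read, CAS success and failure.
#include <atomic>
#include <cassert>

int atomic_demo() {
    std::atomic<int> i(0);
    i++;                                 // atomic increment (seq_cst by default)
    int a = i;                           // atomic read; a == 1
    i.compare_exchange_strong(a, 10);    // i == a, so i becomes 10
    // On failure, compare_exchange_strong writes i's current value into
    // its first argument and leaves i unchanged:
    int expect = 99;
    bool ok = i.compare_exchange_strong(expect, 0);
    assert(!ok && expect == 10);
    assert(i.is_lock_free());            // true for int on mainstream hardware
    return i.load();
}
```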
Atomic Operations: Livelock?
Livelock is a state where a system is
executing many operations, but no
thread is making meaningful progress.
Atomic Operations: Starvation/Unfairness
A state where the system is making overall progress, but some individual processes make no progress.
(green cars make progress, but yellow cars are stopped)
Locking more than one location
▪ Data structures are often larger than a single memory
location
- How can an entire data structure be protected?
E.g., the 15-213 Proxylab cache
Example: a sorted linked list
struct Node {
    int value;
    Node* next;
};

struct List {
    Node* head;
};

void insert(List* list, int value) { ... }
void delete(List* list, int value) { ... }

What can go wrong if multiple threads operate on the linked list simultaneously?

[Diagram: Thread 1 inserting 6 into 3 → 5 → 10 → 11 → 18, with prev = 5 and cur = 10]
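For reference, a sequential version of the elided insert() might look like the following sketch (an assumed implementation; the traversal computes the prev/cur pair shown in the diagram):

```cpp
// Sequential sorted-list insert: walk with prev/cur to find the
// insertion point, then splice in the new node.
struct Node { int value; Node* next; };
struct List { Node* head; };

void insert(List* list, int value) {
    Node* n = new Node{value, nullptr};
    Node* prev = nullptr;
    Node* cur = list->head;
    while (cur && cur->value < value) {  // find prev/cur straddling value
        prev = cur;
        cur = cur->next;
    }
    n->next = cur;
    if (prev) prev->next = n;
    else      list->head = n;            // insert at head (or empty list)
}
```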
Example: simultaneous insertion
Thread 1 attempts to insert 6
Thread 2 attempts to insert 7
Thread 1: inserting 6 into 3 → 5 → 10 → 11 → 18 (prev = 5, cur = 10)
Thread 2: inserting 7 into 3 → 5 → 10 → 11 → 18 (prev = 5, cur = 10)
Thread 1 and thread 2 both compute same prev and cur.
Result: one of the insertions gets lost!
Result (assuming thread 1 updates prev->next before thread 2): 3 → 5 → 7 → 10 → 11 → 18 — node 6 is lost
Solution 1: protect the list with a single lock
struct Node {
    int value;
    Node* next;
};

struct List {
    Node* head;
    Lock lock;    // per-list lock
};

void insert(List* list, int value) { ... }
void delete(List* list, int value) { ... }
▪ Bad:
- Operations on the data structure are serialized
- May limit parallel application performance
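Solution 1 sketched with standard C++ primitives: one std::mutex per list, held for the duration of each operation (the sorted-insert body is an assumed implementation of the elided code):

```cpp
// Coarse-grained locking: the entire insert runs under one per-list
// mutex, so concurrent operations on the same list serialize.
#include <mutex>

struct Node { int value; Node* next; };
struct List {
    Node* head = nullptr;
    std::mutex lock;                     // per-list lock
};

void insert(List* list, int value) {
    std::lock_guard<std::mutex> guard(list->lock);  // held for whole op
    Node* n = new Node{value, nullptr};
    Node* prev = nullptr;
    Node* cur = list->head;
    while (cur && cur->value < value) { prev = cur; cur = cur->next; }
    n->next = cur;
    if (prev) prev->next = n;
    else      list->head = n;
}
```

Correct, but every operation excludes every other — the serialization cost the bullet above describes.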
Solution 2: “hand-over-hand” locking
Thread 0: delete(11)
[Diagram: thread 0 traverses 3 → 5 → 10 → 11 → 18, holding locks on at most two adjacent nodes (prev and cur) at a time, releasing each lock only after acquiring the next]
Solution 2: “hand-over-hand” locking
Thread 0: delete(11)
Thread 1: delete(10)
[Diagram: after thread 0's delete(11): 3 → 5 → 10 → 18, with thread 1 traversing behind, holding locks on two adjacent nodes]
Solution 2: “hand-over-hand” locking
Thread 0: delete(11)
Thread 1: delete(10)
[Diagram: after thread 1's delete(10): 3 → 5 → 18]
Version 2a: Padded List
[Diagram: padded list −∞ → 5 → 10 → 11 → +∞]
▪ Assume
- Only insert/delete finite values
- List starts with −∞
- List ends with +∞
- Guaranteed to find insertion/deletion point within list
Solution 2a: Padded List HOH Locking
struct Node {
    int value;
    Node* next;
    Lock* lock;   // per-node lock
};

struct List {
    Node* head;
};

void insert(List* list, int value) { ... }
void delete(List* list, int value) { ... }
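A possible hand-over-hand implementation of the elided delete(), sketched with std::mutex as the per-node Lock and INT_MIN/INT_MAX standing in for the −∞/+∞ sentinels (delete_value is used as the name because delete is reserved in C++; this is an assumed implementation, not the slide's code):

```cpp
// Hand-over-hand delete on the padded list: the traversal holds at most
// two locks (prev and cur) at any moment. Sentinels guarantee prev and
// cur always exist, so no NULL checks are needed mid-traversal.
#include <climits>
#include <mutex>

struct Node {
    int value;
    Node* next;
    std::mutex lock;                  // per-node lock
};
struct List { Node* head; };

void init(List* list) {
    Node* tail = new Node{INT_MAX, nullptr};   // +inf sentinel
    list->head = new Node{INT_MIN, tail};      // -inf sentinel
}

bool delete_value(List* list, int value) {
    Node* prev = list->head;
    prev->lock.lock();
    Node* cur = prev->next;
    cur->lock.lock();
    while (cur->value < value) {      // hand over hand: lock next, drop prev
        prev->lock.unlock();
        prev = cur;
        cur = cur->next;
        cur->lock.lock();
    }
    bool found = (cur->value == value);
    if (found) {
        prev->next = cur->next;       // splice out while holding both locks
        cur->lock.unlock();
        delete cur;
    } else {
        cur->lock.unlock();
    }
    prev->lock.unlock();
    return found;
}
```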
Fine-grained (HOH) Locking
▪ Goal: enable parallelism in data structure operations
- Reduces contention for global data structure lock
- In previous linked-list example: a single monolithic lock is overly conservative
(operations on different parts of the linked list can proceed in parallel)
▪ Costs?
- Overhead of taking a lock each traversal step (extra instructions + traversal now
involves memory writes)
- Be sure to use spin locks!
- Extra storage cost (a lock per node)
Where Can HOH Locking (Possibly) Be Used?
▪ Acyclic data structures
- Must be able to order lock acquisition/release
- Singly linked list
- E.g., hash table bucket chain
- Binary search tree (very tricky)
- Skip list
▪ Not for cyclic structures
- E.g., doubly-linked list
Lock-free data structures
Blocking algorithms/data structures
▪ A blocking algorithm allows one thread to prevent other
threads from completing operations on a shared data structure
indefinitely
▪ Example:
- Thread 0 takes a lock on a node in our linked list
- Thread 0 is swapped out by the OS, or crashes, or is just really slow (takes a page fault), etc.
- Now, no other threads can complete operations on the data structure (although thread 0 is
not actively making progress modifying it)
Lock-free algorithms
▪ Non-blocking algorithms are lock-free if some thread is
guaranteed to make progress (“systemwide progress”)
- In lock-free case, it is not possible to preempt one of the threads at an
inopportune time and prevent progress by rest of system
- Note: this definition does not prevent starvation of any one thread
Single Reader/Single Writer Bounded Queue
Slots 0-7; head = index of next element to pop, tail = next free slot.

Empty:                        head == tail (slot contents are stale)
Push 3, 1, 4, 1, 5, 9, 2:     [3 1 4 1 5 9 2 _]   head = 0, tail = 7
Pop ×4 (returns 3, 1, 4, 1):  head = 4, tail = 7
Push 6, 5, 3, 5:              [5 3 5 _ 5 9 2 6]   head = 4, tail = 3 (tail wrapped; queue now full)
// returns false if queue is empty
bool pop(Queue* q, int* value) {
    if (q->head != q->tail) {
        *value = q->data[q->head];
        q->head = MOD_N(q->head + 1);
        return true;
    }
    return false;
}
▪ Only two threads (one producer, one consumer) accessing queue at the same time
▪ Threads never synchronize or wait on each other
- When queue is empty (pop fails), when it is full (push fails)
- What is special about operations on head & tail that avoids need for synchronization?
* Assume a sequentially consistent memory system
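Putting the pieces together, one possible self-contained version of the bounded queue (N = 8 and the atomic head/tail representation are choices made here; the slide's MOD_N becomes % N, and seq_cst atomics stand in for the assumed sequentially consistent memory system):

```cpp
// Single-producer/single-consumer bounded queue: a ring of N slots.
// head is written only by the consumer, tail only by the producer; each
// side merely *reads* the other's index, so no locks are needed.
#include <atomic>

const int N = 8;

struct Queue {
    int data[N];
    std::atomic<int> head{0};   // next slot to pop  (consumer-owned)
    std::atomic<int> tail{0};   // next slot to push (producer-owned)
};

bool push(Queue* q, int value) {        // returns false if queue is full
    int t = q->tail.load();
    if ((t + 1) % N == q->head.load())  // full: one slot kept unused
        return false;
    q->data[t] = value;
    q->tail.store((t + 1) % N);         // publish after the data write
    return true;
}

bool pop(Queue* q, int* value) {        // returns false if queue is empty
    int h = q->head.load();
    if (h == q->tail.load())            // empty
        return false;
    *value = q->data[h];
    q->head.store((h + 1) % N);
    return true;
}
```

One slot is deliberately wasted so that head == tail unambiguously means empty; the usable capacity is N − 1.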
Single reader, single writer unbounded queue
(Leaky version: popped nodes, shown in parentheses, are never freed)

push 3, push 10:    head → 3 → 10 ← tail
pop (returns 3):    (3), head → 10 ← tail
pop (returns 10):   (3), (10) ← head == tail
push 5:             (3), (10), head → 5 ← tail
Single reader, single writer unbounded queue * Source: Dr. Dobbs Journal
struct Queue {
    Node* head;
    Node* tail;
};

void init(Queue* q) {
    q->head = q->tail = new Node;    // dummy node; head trails live elements
}

void push(Queue* q, Node* n) {
    q->tail->next = n;
    q->tail = q->tail->next;
}

// returns false if queue is empty
bool pop(Queue* q, int* value) {
    if (q->head != q->tail) {
        *value = q->head->next->value;
        q->head = q->head->next;
        return true;
    }
    return false;
}
push 3, push 10:    head, reclaim → 3 → 10 ← tail
pop (returns 3):    reclaim → (3), head → 10 ← tail
pop (returns 10):   reclaim → (3), (10) ← head == tail
struct Queue {
    Node* head;
    Node* tail;
    Node* reclaim;
};

void init(Queue* q) {
    q->head = q->tail = q->reclaim = new Node;
}

void push(Queue* q, Node* n) {
    q->tail->next = n;
    q->tail = q->tail->next;
    while (q->reclaim != q->head) {     // producer frees consumed nodes
        Node* tmp = q->reclaim;
        q->reclaim = q->reclaim->next;
        delete tmp;
    }
}

// returns false if queue is empty
bool pop(Queue* q, int* value) {
    if (q->head != q->tail) {
        *value = q->head->next->value;
        q->head = q->head->next;
        return true;
    }
    return false;
}

▪ Tail points to last element added
Lock-free stack

Node* pop(Stack* s) {
while (1) {
Node* old_top = s->top;
if (old_top == NULL)
return NULL;
Node* new_top = old_top->next;
if (compare_and_swap(&s->top, old_top, new_top) == old_top)
return old_top; // Assume that consumer then recycles old_top
}
}
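The slide's compare_and_swap maps directly onto C++'s compare_exchange. A self-contained sketch of the full stack (push and pop) under that substitution — still ABA-prone if nodes are recycled, as discussed next:

```cpp
// Lock-free stack (Treiber stack) on an atomic top pointer. On CAS
// failure, compare_exchange_weak reloads old_top, so the loop retries
// against the freshly observed top.
#include <atomic>

struct Node { Node* next; int value; };
struct Stack { std::atomic<Node*> top{nullptr}; };

void push(Stack* s, Node* n) {
    Node* old_top = s->top.load();
    do {
        n->next = old_top;              // link to current top, then swing top
    } while (!s->top.compare_exchange_weak(old_top, n));
}

Node* pop(Stack* s) {
    Node* old_top = s->top.load();
    while (old_top != nullptr) {
        Node* new_top = old_top->next;
        if (s->top.compare_exchange_weak(old_top, new_top))
            return old_top;             // caller owns/recycles old_top
        // failure reloaded old_top; retry (including the NULL check)
    }
    return nullptr;                     // stack was empty
}
```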
Main idea: as long as no other thread has modified the stack, a thread’s modification can proceed.
Note difference from fine-grained locks example earlier: before, implementation locked a part of a
data-structure for fine-grained access. Here, threads do not hold lock on data-structure at all.
ABA problem (initial stack: A → B → C):

Thread 0: begin pop()              (local: old_top = A, new_top = B)
Thread 1: begin pop()              (local: old_top = A)
Thread 1: complete pop()           (returns A)               stack: B → C
Thread 1: begin/complete push(D)                             stack: D → B → C
Thread 1: modify node A            (e.g., set value = 42)
Thread 1: begin/complete push(A)                             stack: A → D → B → C
Thread 0: CAS succeeds             (sets top to B!)
Thread 0: complete pop()           (returns A)               stack: B → C

Stack structure is corrupted! (lost D)
Why Does ABA Problem Arise?
▪ Use node address as identifier
- Assume that if atomic CAS gets matching address, list has
not been modified
▪ But, what if node has been deleted and recycled?
- Atomic CAS can get a matching address, even though the list has been modified
Lock-free stack using counter for ABA soln
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}
▪ x86 has no general double compare-and-swap on two arbitrary addresses
- But one can simply ensure the stack's count and top fields are contiguous in
memory and use the double-width single compare-and-swap instructions below
▪ cmpxchg8b
- “compare and exchange eight bytes”
- Can be used for compare-and-swap of two 32-bit values
▪ cmpxchg16b
- “compare and exchange 16 bytes”
- Can be used for compare-and-swap of two 64-bit values
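The idea behind cmpxchg8b can be sketched portably by packing two 32-bit fields into one 64-bit atomic word (pop_with_count and the field layout here are illustrative, not from the slides):

```cpp
// Pack a 32-bit "top" index and a 32-bit pop counter into one 64-bit
// word, so one CAS atomically checks and updates both — the counter
// makes a recycled "top" value distinguishable from the original.
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> packed{0};   // high 32 bits: top index, low 32: pop_count

bool pop_with_count(uint32_t expected_top, uint32_t new_top) {
    uint64_t old_word = packed.load();
    uint32_t count = (uint32_t)old_word;
    if ((uint32_t)(old_word >> 32) != expected_top)
        return false;
    uint64_t new_word = ((uint64_t)new_top << 32) | (uint32_t)(count + 1);
    // Succeeds only if *both* top and count are unchanged since the load.
    return packed.compare_exchange_strong(old_word, new_word);
}
```

With a 64-bit top pointer the pair needs 128 bits, which is where cmpxchg16b comes in.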
Another Concern: Referencing Freed Memory
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}

Node* pop(Stack* s) {
    while (1) {
        int pop_count = s->pop_count;
        Node* top = s->top;
        if (top == NULL)
            return NULL;
        Node* new_top = top->next;   // What if top has been freed at this point
                                     // by another thread that popped it?
        if (double_compare_and_swap(&s->top, top, new_top,
                                    &s->pop_count, pop_count, pop_count + 1))
            return top;
    }
}

T1 & T2 both popping:

Case 1:
1. T1 completes pop and gets copy of top
2. T2 starts pop
   - But T2 will get a different value for top

Case 2:
1. T1 has not yet done its CAS
2. T2 starts pop
   - Both have the same copy of top
   - Both have the same value of pop_count
3. T1 does its CAS
   - Then the CAS by T2 will fail
   - So it doesn't matter that T2 had stale data

It is possible for T1 to recycle top, so that T2 reads a bogus value for new_top — but T2's CAS will then fail. So this all looks OK.
Another ABA Solution: Hazard Pointers
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
};

Node* hazard[NUM_THREADS];

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}
Node* pop(Stack* s) {
while (1) {
hazard[t] = s->top;
Node* top = hazard[t];
if (top == NULL)
return NULL;
Node* new_top = top->next;
if (compare_and_swap(&s->top, top, new_top))
return top; // Caller must clear hazard[t] when it’s done with top
}
}
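The reclamation side of hazard pointers is not shown on the slide; a minimal sketch of the check a thread might run before recycling a popped node (safe_to_free is a hypothetical helper added here):

```cpp
// Before freeing a node, scan every thread's published hazard pointer.
// If any slot still points at the node, some thread may dereference it,
// so freeing must be deferred (e.g., by keeping it on a retire list).
const int NUM_THREADS = 8;

struct Node { Node* next; int value; };
Node* hazard[NUM_THREADS];          // one published pointer per thread

bool safe_to_free(Node* n) {
    for (int t = 0; t < NUM_THREADS; t++)
        if (hazard[t] == n)         // still protected by thread t
            return false;
    return true;                    // no thread holds a hazard on n
}
```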
Lock-free linked list insertion*

// assume case of insert into empty list handled
// here (keep code on slide simple for class discussion)
void insert(List* list, Node* after, Node* n) {
    Node* prev = list->head;
    while (prev->next) {
        if (prev == after) {
            while (1) {
                Node* old_next = prev->next;
                n->next = old_next;
                if (compare_and_swap(&prev->next, old_next, n) == old_next)
                    return;
            }
        }
        prev = prev->next;
    }
}

▪ No overhead of taking locks
▪ No per-node storage overhead
* For simplicity, this slide assumes the *only* operation on the list is insert
Lock-free linked list deletion
Supporting lock-free deletion significantly complicates data-structure
Consider case where B is deleted simultaneously with successful insertion of E after B.
B now points to E, but B is not in the list!
[Diagram: A → B → C → D; delete(B) CASes A->next from B to C, while insert(E) CASes B->next from C to E. Both CASes succeed, but B has been spliced out — E hangs off the removed node and is lost.]
Lock-free vs. locks performance comparison
Lock-free algorithm run time normalized to run time of using pthread mutex locks
[Chart: run times of lock-free (lf) and fine-grained-lock (fg) implementations of queue enqueue/dequeue and linked list operations, normalized to the pthread mutex version]
In practice: why lock free data-structures?
▪ When optimizing parallel programs in this class you often assume that
only your program is using the machine
- Because you care about performance
- Typical assumption in scientific computing, graphics, data analytics, etc.
▪ In these cases, well written code with locks can be as fast (or faster) than
lock-free code
▪ But there are situations where code with locks can suffer from tricky
performance problems
- Multi-programmed situations where page faults, pre-emption, etc. can occur while thread
is in a critical section
- Creates problems like priority inversion, convoying, crashing in critical section, etc. that are
often discussed in OS classes
▪ Lock-free code also does well on large data structures with sparse updates
- The chance of two updates contending at the same location is very low
Summary
▪ Use fine-grained locking to reduce contention (maximize parallelism)
in operations on shared data structures
- But fine-granularity can increase code complexity (errors) and increase execution overhead
More reading
▪ Michael and Scott, 1996. "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms"
- Multiple reader/writer lock-free queue