Recently, I ran into some folks from the VMware Research group and we began talking about some of the work they have been doing on NUMA aware data structures. Their team published a paper in ASPLOS 2017 detailing a tool that can make any data structure NUMA aware, with nice cache and memory locality properties, while preserving linearizability. The paper caught my interest partly because I had previously worked on a prototype of NUMA aware (cohort) locks in the vmkernel, but also because I wanted to understand how a general NUMA aware framework could be built.
Node Replication (NR), as the authors point out, outperforms lock-free algorithms by 2.4X and 3.1X on a priority queue and a dictionary respectively, and lock-based algorithms by 8X and 30X on the same structures. The principal idea is that the framework takes any sequential (UMA) data structure and converts it into a NUMA aware one, without the programmer needing to understand concurrency, cache coherence, or optimizations for reducing bus traffic. NR replicates the data structure across the nodes, with each node holding its own copy. The replicas are kept in sync through a shared log: threads local to a node access it via flat combining, while appends across nodes are lock free.
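To make the moving parts concrete, here is a rough C++ sketch, with invented names, of the pieces NR keeps around: a single shared operation log for the whole machine, plus one replica of the sequential structure and a replay cursor into the log per NUMA node. This is only a sketch of the shape of the system, not the paper's implementation; batching, log wrap-around, and reclamation of consumed entries are all elided.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Op { int code; long arg; };                        // an encoded update operation

struct SharedLog {
    std::vector<Op> entries = std::vector<Op>(1 << 20);   // fixed-size buffer (circular in practice)
    std::atomic<size_t> tail{0};                          // next free index in the log

    // Claim slots for a batch of updates. Only one combiner per node appends,
    // so contention here grows with the node count, not the core count.
    size_t append(const std::vector<Op>& batch) {
        size_t start = tail.fetch_add(batch.size());
        for (size_t i = 0; i < batch.size(); ++i)
            entries[start + i] = batch[i];                 // wrap-around handling elided
        return start + batch.size();
    }
};

template <typename SequentialDS>
struct Replica {
    SequentialDS ds;          // an unmodified sequential data structure, one copy per node
    size_t local_tail = 0;    // how far into the shared log this node has replayed

    // Bring the local copy up to date by replaying any log entries it has not yet seen.
    template <typename Apply>
    void replay(const SharedLog& log, size_t up_to, Apply apply) {
        while (local_tail < up_to)
            apply(ds, log.entries[local_tail++]);
    }
};
```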
Flat combining is a technique in which, instead of every thread that wants to make an update contending on a synchronization primitive, pending operations are batched up and a single leader (combiner) thread applies them all to the structure. This greatly reduces the overhead of going through a synchronization primitive and the resulting cache coherency bus traffic. The combiner of each node writes the local updates to the shared log, replays the log on the local replica if needed, and executes all of the outstanding local updates. Only the combiner thread of each node needs to perform a CAS on the shared log, so cross node synchronization is limited and scales with the number of NUMA nodes in the system rather than the number of cores. Read operations can run in parallel, without any cross node communication, as long as the local node's replica is fresh. Each node keeps a pointer into the shared log that indicates how many updates it has already processed.
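To illustrate the pattern itself, here is a standalone sketch of flat combining applied to a trivial counter rather than to NR's shared log; the slot layout and names are my own, not the paper's. Threads publish their requests in per-thread slots, and whichever thread wins the combiner lock applies every pending request in one sequential pass.

```cpp
#include <atomic>
#include <mutex>

constexpr int kMaxThreads = 64;   // assumed upper bound on threads per node

struct Slot {
    std::atomic<bool> pending{false};
    long arg = 0;       // request: amount to add
    long result = 0;    // reply: counter value right after this request was applied
};

struct FlatCombiningCounter {
    Slot slots[kMaxThreads];
    std::mutex combiner_lock;   // held only by the current combiner, briefly
    long value = 0;             // the "sequential data structure" being protected

    // tid must be a unique index in [0, kMaxThreads) for the calling thread.
    long add(int tid, long amount) {
        Slot& mine = slots[tid];
        mine.arg = amount;
        mine.pending.store(true, std::memory_order_release);

        // Spin until our request has been applied, opportunistically trying to
        // become the combiner so that the pending batch keeps making progress.
        while (mine.pending.load(std::memory_order_acquire)) {
            if (combiner_lock.try_lock()) {
                for (Slot& s : slots) {
                    if (s.pending.load(std::memory_order_acquire)) {
                        value += s.arg;                               // apply sequentially
                        s.result = value;
                        s.pending.store(false, std::memory_order_release);
                    }
                }
                combiner_lock.unlock();
            }
        }
        return mine.result;
    }
};
```

In NR, the combiner's sequential pass is replaced by appending the batch to the shared log and replaying the log on the local replica, as described above.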
The data structure needs to be expressed in terms of three methods that NR uses to maintain its consistency over its lifetime (a sketch of wrapping a simple map this way follows the list):
- Create(): Creates the data structure and returns a pointer to the instance.
- Execute(obj, op, args): Executes the operation op with arguments args on the instance obj.
- IsReadOnly(obj, op): Returns whether the operation op is read-only. NR uses this to implement read-only optimizations, such as serving reads from the local replica instead of appending them to the shared log.
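As a hedged sketch of what this might look like in practice, here is a plain sequential std::map expressed through these three methods. The operation encoding (OpKind and MapOp) is my own illustration, not the paper's API; a real harness would register these functions with NR rather than call them directly.

```cpp
#include <map>
#include <optional>
#include <string>

enum class OpKind { Put, Get, Remove };

struct MapOp {
    OpKind kind;
    std::string key;
    std::string value;    // used only by Put
};

using SeqMap = std::map<std::string, std::string>;

// Create(): build the sequential structure; one instance is created per replica.
SeqMap* Create() { return new SeqMap(); }

// Execute(obj, op, args): run one operation against a replica. The same log of
// operations is replayed on every replica, so this must be deterministic.
std::optional<std::string> Execute(SeqMap* obj, const MapOp& op) {
    switch (op.kind) {
        case OpKind::Put:    (*obj)[op.key] = op.value; return std::nullopt;
        case OpKind::Remove: obj->erase(op.key);        return std::nullopt;
        case OpKind::Get: {
            auto it = obj->find(op.key);
            return it == obj->end() ? std::nullopt
                                    : std::optional<std::string>(it->second);
        }
    }
    return std::nullopt;
}

// IsReadOnly(obj, op): lets the framework serve Gets from the local replica
// (once it has caught up with the log) without appending to the shared log.
bool IsReadOnly(const SeqMap*, const MapOp& op) {
    return op.kind == OpKind::Get;
}
```

NR would then call Execute while replaying log entries on each replica, and consult IsReadOnly to decide whether an operation can be served locally without going through the log.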
Concurrency is hard to design for, and even harder to verify, so the NR framework aims to abstract these concerns away from the design of the data structure. Apart from the memory overhead of the replicas, one current drawback is that if a thread blocks in the middle of an operation it can stall the whole system, since other nodes might need to replay log entries that the blocked node is still producing. In conclusion, NR is a good technique for highly contended data structures when there is no alternative that avoids the contention in the first place.