Institute of Theoretical Computer Science

This web page accompanies our submissions Checking and Enforcing robustness against TSO and Lazy TSO Reachability. Here you can

download Trencher's extension for checking TSO reachability that accompanies our FASE submission,
download the full version of the ESOP paper including missing proofs and definitions (updated 28.04.2014),
download Trencher, our prototype tool for checking TSO robustness,
see the examples on which we tested Trencher,
download rdtsc-mfence, our tool for benchmarking memory fence instructions,
see the charts with the results of benchmarking by rdtsc-mfence.

Trencher

Trencher is a tool for checking robustness against the x86-TSO memory model.

Trencher is written in C++ and uses Spin as a back-end model checker. It translates a reachability query to a program in Promela. Spin takes the Promela program and produces C source code of a model checker. This source code is compiled by a C compiler. Finally, running the model checker answers the reachability query. So, each query involves the execution of three external programs. Trencher runs the queries concurrently whenever possible, in particular robustness queries for different attacks are parallelized.

Download

Trencher's sources, documentation, and examples are available on GitHub. The sources, documentation, and examples that accompany our FASE sumission are available on the following GitHub branch of Trencher.

Examples

We executed Trencher (namely, the fence insertion algorithm for enforcing robustness) on a set of examples, using a machine with Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz (4 cores) running GNU/Linux. The results are summarized in the table below, the description of these examples and the discussion can be found below.

Name	T	L	I	RQ	NR1	NR2	R	F	Spin	CC	Ver	Tre	Real
Peterson	2	14	18	23	2	12	9	2	0.7	7.1	0.5	0.1	2.9
Peterson (Corr)	2	16	20	12	12	0	0	0	0.0	0.0	0.0	0.0	0.0
Dekker	2	24	30	95	12	28	55	4	2.8	28.9	1.8	0.2	14.2
Dekker (Corr)	2	30	36	30	30	0	0	0	0.0	0.0	0.0	0.0	0.0
Lamport	3	33	36	36	9	15	12	6	1.0	13.4	6.0	0.1	5.9
Lamport (Corr)	3	39	42	27	27	0	0	0	0.0	0.0	0.0	0.0	0.0
CLHLock	7	62	58	54	48	6	0	0	0.4	4.6	0.2	0.0	1.6
MCSLock	4	52	50	30	26	4	0	0	0.2	2.7	0.4	0.0	0.9
Lock-Free Stack	4	46	50	14	14	0	0	0	0.0	0.0	0.0	0.0	0.0
Cilk's THE WSQ	5	86	79	152	141	8	3	3	0.8	9.3	18.0	0.1	7.4
NBW: W\|R	2	15	13	3	3	0	0	0	0.0	0.0	0.0	0.0	0.0
NBW: W\|WR	2	21	19	15	9	5	1	1	0.2	2.3	0.2	0.0	0.8
NBW: W\|WR (Corr)	2	22	20	15	15	0	0	0	0.0	0.0	0.0	0.0	0.0
NBW: W\|AR\|RA	3	25	22	9	7	2	0	0	0.1	0.6	0.1	0.0	0.4
NBW: W\|LR\|RL	4	45	45	30	26	4	0	0	0.2	2.5	0.2	0.0	0.7
Parker	2	9	8	2	0	1	1	1	0.1	0.4	0.0	0.0	0.3
Parker (Corr)	2	10	9	2	2	0	0	0	0.0	0.0	0.0	0.0	0.0
Triangular Race	2	5	3	1	0	1	0	0	0.0	0.2	0.0	0.0	0.2

T, L, and I are the numbers of threads, labels, and instructions, respectively.
RQ is the number of reachability queries raised by Trencher.
NR1 is the number of verification queries answered negatively without running the model checker.
NR2 and R are the numbers of queries that are answered negatively/positively by the external model checker.
F is the number of fences inserted.
Spin, CC, and Ver give the total work times in seconds of Spin, Clang (which we used as a C compiler), and the verifier generated by Spin.
Tre is the CPU time used by Trencher itself.
Real is the total (wall-clock) time in seconds of processing an example.

Description of the examples

The first group of examples are mutual exclusion algorithms: Dekker's Algorithm, Peterson's Algorithm, Lamport's Fast Mutex Algorithm, CLH and MCS queue locks. The first three algorithms are executed in an infinite loop. The examples Dekker, Peterson, and Lamport are presented in two versions: original (not fenced) and corrected (properly fenced to ensure robustness). CLH and MCS locks are robust from the start.

The second group of examples are concurrent data structures: a lock-free stack from a concurrent malloc implementation (robust) and a work-stealing queue from the implementation of Cilk 5 programming language (non-robust).

The choice of the third group of examples was inspired by S. Owens' paper Reasoning about the Implementation of Concurrency Abstractions on x86-TSO published in Proc. of ECOOP 2010:

We analyse several instances of the Non-Blocking Write protocol by H. Kopetz and J. Reisinger. Interestingly, the Writer | Lock;Reader | Reader;Lock instance considered by Owens in Section 8 does not require fences for trace-based robustness.
Our tool can discover the known bug in the C++ Parker class from Sun's JVM implementation that is due to TSO relaxations. Section 9 of Owens' paper contains an excellent description of the problem.
Triangular Race is a minimal example of a program which is trace-based robust but not triangular data race-free.

Observations

The analysis of robust algorithms is particularly fast. They typically only have a small number of attacks that have to be checked by a model checker. The robust versions of Dekker and Peterson do not have such attacks at all. In the CLH and MCS locks, their number is less than 20%.
The compiler takes up to 90% of the CPU time on certain examples (DekNR, CLHLock, NBW2, NBW4). Hence, switching to a different model checking back-end that does not require compilation of verifiers may improve the running times significantly.
Parallelization of reachability queries brings substantial performance benefits. For some examples (LamNR, CLHLock), the real working time constitutes 1/3 to 1/4 of the total CPU time (on the 4-core machine).
The trace-based analysis can establish robustness of the spinlock + non-blocking write example (NBW: W | LR | RL), as opposed to the earlier analyses via triangular data races which would have to place a fence (cf. Owens' paper mentioned above).

rdtsc-mfence: Benchmarking memory fences

rdtsc-mfence is a tool for benchmarking the CPU instructions that can act as memory fences on AMD64 architecture. The tool measures the average number of cycles required for executing such an instruction, combined with several write instructions. You can download the archive with sources: rdtsc-mfence.tar.gz.

Charts

We ran the tool on several x86-64 CPUs running in 64-bit mode. We used the following patterns of assembler code:

movl; ... movl; mfence;
movl; ... movl; lock addl $0,(%rsp);
movl; ... movl;

where movl is a 32-bit MOV instruction to a global address. All MOVs in a single pattern access different adjacent addresses. The pattern was executed 10⁸ times, in one or two concurrent threads; number of CPU cycles used was measured using RDTSC instruction and the mean was taken.

Results are presented in the charts below.

Mobile AMD Sempron Processor 3400+ (1 Thread) Intel Xeon E5420 CPU @ 2.50GHz (1 Thread) Intel Core i5 M650 CPU @ 2.67GHz (1 Thread) Intel Core 2 Duo P8700 CPU @ 2.53GHz (1 Thread) Intel Xeon X5650 CPU @ 2.67GHz (1 Thread)

Observations

Based on the diagrams, we can do several empirical observations.

Executing a memory fence instruction takes some significant time even when the TSO write buffer is empty.
Performance of a single thread cannot be improved by executing more fences, even with smaller buffer sizes.

Summing up, to optimize performance of an application we must minimize the number of times memory fence instructions are executed.

Institute of

Theoretical Computer Science

Carl-Friedrich-Gauß-Fakultät

Technische Universität Braunschweig

Trencher

Download

Examples

Description of the examples

Observations

rdtsc-mfence: Benchmarking memory fences

Charts

Observations