The Database Doctor
Musing about Databases

Synchronisation in .NET – Part 3: Spinlocks and Interlocks/Atomics

In the previous instalments (Part 1 and Part 2) of this series, we drew some conclusions about both .NET itself and CPU architectures.

In this blog entry, I will explore other synchronisation primitives to make the implementation un-racy again. We will be looking at spinlocks and Interlocked operations.

Spinlock Background

Spinlocks are an alternative to blocking synchronisation. The premise of a spinlock is that it is sometimes cheaper to go into a busy-waiting loop and poll a resource than to block when the resource is locked. Context switches are quite expensive, so it is often wise not to yield to the scheduler if you expect that the held resource will become available soon.
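
To make the busy-waiting idea concrete, here is a minimal sketch of the simplest possible spinlock – a naive test-and-set lock. The type and its names are mine, for illustration only; better variants follow later in this post.

using System.Threading;

internal struct NaiveSpinLock
{
  private int _held; // 0 = free, 1 = held

  public void Enter()
  {
    // Poll in a tight loop instead of blocking: atomically try to
    // flip 0 -> 1 until we win the race for the lock.
    while (Interlocked.Exchange(ref _held, 1) == 1)
    {
      // Busy wait - we expect the holder to release soon,
      // so we deliberately avoid a context switch.
    }
  }

  public void Exit()
  {
    // A volatile write gives release semantics: the writes made while
    // holding the lock become visible before the lock appears free.
    Volatile.Write(ref _held, 0);
  }
}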

It is difficult to determine when a spinlock is better than a blocking lock. The MSDN documentation is wisely vague:

A spin lock can be useful to avoid blocking; however, if you expect a significant amount of blocking, you should probably not use spin locks due to excessive spinning. Spinning can be beneficial when locks are fine-grained and large in number (for example, a lock per node in a linked list) and also when lock hold times are always extremely short.

While our data structure has a significant amount of contention, it also holds its lock very briefly. We can hypothesise that using a C# SpinLock instead of a lock()/Monitor might yield more throughput at high concurrency.

.NET Spinlock (Queued Spinlock) Implementation

Spinlocks can be implemented in several ways, and there is a significant amount of controversy about when to use which. Notably, the Linux kernel (at the time of writing) uses what is known as a "Ticket Spinlock", which works really well under low concurrency. It is noteworthy that Google is now lobbying to have this changed to something more like the Windows implementation (introduced in XP/2003) instead. There is significant evidence that the ticket lock is not as scalable as one could wish for: Non-scalable locks are dangerous (PDF).
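
For reference, here is a rough sketch of the ticket idea (my own simplified C# rendition – the kernel version is written in assembly and differs in detail): each waiter takes a ticket and spins until the "now serving" counter reaches it, which makes the lock strictly FIFO and therefore fair.

using System.Threading;

internal struct TicketLock
{
  private int _nextTicket; // next ticket to hand out
  private int _nowServing; // ticket currently allowed to hold the lock

  public void Enter()
  {
    // Take a ticket: one atomic increment per acquisition.
    int myTicket = Interlocked.Increment(ref _nextTicket) - 1;

    // Every waiter spins on the SAME counter - this shared cache
    // line is exactly what makes ticket locks collapse at high core counts.
    while (Volatile.Read(ref _nowServing) != myTicket) { }
  }

  public void Exit()
  {
    // We hold the lock, so the plain read of _nowServing is safe here.
    Volatile.Write(ref _nowServing, _nowServing + 1);
  }
}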

Let us see if .NET does better.

The spin lock implementation included with the .NET Framework is a queued spin lock, similar to the one used in the Windows kernel. This kind of spinlock, when implemented correctly, guarantees fairness and scales well under high concurrency - even on systems with high core counts.
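
To illustrate what "queued" means, here is a sketch of the classic MCS queue lock (my own illustration – the actual .NET SpinLock and the Windows queued spinlock differ in their details). Each thread passes in its own queue node, and the crucial property is that each waiter spins on its own node rather than on one shared flag:

using System.Threading;

internal sealed class McsLock
{
  public sealed class Node
  {
    internal volatile Node Next;
    internal volatile bool Wait;
  }

  private Node _tail; // the last waiter in the queue (null when free)

  public void Enter(Node me)
  {
    me.Next = null;
    me.Wait = true;

    // Atomically append ourselves to the end of the queue.
    Node pred = Interlocked.Exchange(ref _tail, me);
    if (pred != null)
    {
      pred.Next = me;        // link in behind our predecessor
      while (me.Wait) { }    // spin on our OWN flag - no shared cache line
    }
  }

  public void Exit(Node me)
  {
    if (me.Next == null)
    {
      // No known successor: try to mark the lock free again.
      if (Interlocked.CompareExchange(ref _tail, null, me) == me)
        return;

      // A successor is mid-enqueue; wait until it has linked itself in.
      while (me.Next == null) { }
    }
    me.Next.Wait = false;    // hand the lock directly to the next waiter
  }
}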

But how does it scale under our high contention test?

Implementation: .NET SpinLock

Here is an implementation of OperationData using SpinLock:



[StructLayout(LayoutKind.Sequential)]
unsafe private struct OperationData
{
  private fixed byte Pad0[64];  // keep the lock off the previous struct's cache lines
  public SpinLock Spin;
  private fixed byte Pad1[64];  // keep the lock and the data on separate cache lines
  public long TimeMs;
  public long Count;
  public long MaxTimeMs;
  public long MinTimeMs;
  private fixed byte Pad2[64];  // guard against false sharing with the next struct
}

We use our knowledge from the last blog about padding to make sure cache lines don’t overlap. Ideally, the SpinLock itself should take up a full cache line (so two spinlocks don’t experience false sharing when spinning). However, when I take sizeof on the SpinLock struct, it looks like it takes up only 4 bytes.
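
You can check this yourself with a quick snippet (a tiny sketch of mine; it must be compiled with /unsafe, and since SpinLock wraps a single int internally, 4 bytes is the expected output):

using System;
using System.Threading;

internal static class SpinLockSizeCheck
{
  private static unsafe void Main()
  {
    // sizeof on a struct requires an unsafe context.
    Console.WriteLine(sizeof(SpinLock));  // expected: 4
  }
}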

Here is the new implementation of Add():

public void Add(int operation, int timeMs)
{
  fixed (OperationData* s = &_opsData[operation])
  {
    bool lockTaken = false;
    try
    {
      s->Spin.Enter(ref lockTaken);
      s->TimeMs += timeMs;
      s->Count++;
      s->MaxTimeMs = Math.Max(timeMs, s->MaxTimeMs);
      s->MinTimeMs = Math.Min(timeMs, s->MinTimeMs);
    }
    finally
    {
      // Only release if Enter actually acquired the lock.
      if (lockTaken)
        s->Spin.Exit();
    }
  }
}

Benchmark: .NET SpinLock

And here are the results of the standard .NET SpinLock:

[Graph: throughput of the .NET SpinLock compared with lock() as concurrency increases]

This might be quite a surprising result if you consider the .NET description of the spinlock. Even though the lock is held very briefly, using a spinlock is still slower than the heavyweight lock(). However, as the lock becomes increasingly contended, the spinlock achieves the same scalability as lock().

Implementation: TTAS Spinlock

As mentioned earlier, there are several ways to implement spinlocks. One very simple variant is called the "Test and Test-And-Set" (TTAS) spinlock. The advantage of this spinlock is that it is very cheap to acquire and release – which may address the issue we saw with the .NET SpinLock. However, it scales very poorly under high concurrency – perhaps even worse than the .NET implementation?

Let us see how this lock does as concurrency increases. To compare with the .NET implementation, I built my own TTAS spinlock.

Here is the OperationData struct:


[StructLayout(LayoutKind.Sequential)]
unsafe private struct OperationData
{
  private fixed byte Pad0[64];  // keep the lock word off the previous struct's cache lines
  public long IsHeld;           // the lock word: 0 = free, 1 = held
  private fixed byte Pad1[64];  // keep the lock word and the data on separate cache lines
  public long TimeMs;
  public long Count;
  public long MaxTimeMs;
  public long MinTimeMs;
  private fixed byte Pad2[64];  // guard against false sharing with the next struct
}

And here is the Add() implementation with the TTAS spin lock:


public void Add(int operation, int timeMs)
{
  fixed (OperationData* s = &_opsData[operation])
  {
    /* Begin: Simple TTAS */
    SpinWait sw = new SpinWait();
    // The "set" part: try to grab the lock by flipping 0 -> 1.
    while (Interlocked.CompareExchange(ref s->IsHeld, 1, 0) == 1)
    {
      // The "test" part: spin on a plain volatile read until the lock
      // looks free - this keeps the cache line in shared state while we wait.
      while (Volatile.Read(ref s->IsHeld) == 1) sw.SpinOnce();
    }
    /* End: Simple TTAS */
    try
    {
      s->TimeMs += timeMs;
      s->Count++;
      s->MaxTimeMs = Math.Max(timeMs, s->MaxTimeMs);
      s->MinTimeMs = Math.Min(timeMs, s->MinTimeMs);
    }
    finally
    {
      // Release with release semantics so the writes above become
      // visible before the lock appears free.
      Volatile.Write(ref s->IsHeld, 0);
    }
  }
}

Benchmark: TTAS Spin Lock

Let’s see how my custom TTAS spinlock does against .NET’s own implementation:

[Graph: my TTAS spinlock compared with the .NET SpinLock and lock()]

As can be seen, the TTAS implementation is actually faster than lock() at low concurrency. However, this graph is getting a little crowded, so allow me to zoom in on just the .NET SpinLock and the TTAS spinlock:

[Graph: zoomed view of the .NET SpinLock vs the TTAS spinlock]

Now that is interesting, isn’t it? While the TTAS is faster at low contention, it completely collapses under high contention. It is clear that the .NET SpinLock has been designed to be scalable under highly contended scenarios.

Interlock/Atomics Background

Advanced synchronisation structures like lock(), SpinLock, events and condition variables would be very difficult (if not impossible) to implement without explicit support for atomic operations in the CPU itself.

The x86/x64 instruction set contains instructions that allow the compiler to guarantee that some operations happen atomically – in other words: they either happen in full or not at all. For example:

x86 instruction      | Behaviour
---------------------|------------------------------------------------------------------
MOV                  | Atomically read a word (but it can be reordered unless MFENCE’d)
MOV + MFENCE or XCHG | Atomically write a word-sized value
LOCK INCL/ADDL       | Atomically add one, or a value, to a word
CMPXCHG              | Compare a word against another value; if they match, swap the word with another

Programming with atomic operations is extremely difficult and error-prone. However, for those brave enough, .NET exposes atomics through the System.Threading.Interlocked static class. For those of you coding C++, you can refer to std::atomic.
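
To connect the table above to C#, here is a small sketch of the corresponding Interlocked calls (the class is mine, and the instruction mapping in the comments is my annotation of what the JIT typically emits on x64):

using System.Threading;

internal static class AtomicExamples
{
  private static long _value;

  public static void Demo()
  {
    Interlocked.Increment(ref _value);        // LOCK ADD / LOCK XADD
    Interlocked.Add(ref _value, 42);          // LOCK ADD / LOCK XADD
    Interlocked.Exchange(ref _value, 0);      // XCHG (implicitly locked)

    // CompareExchange returns the value it saw; the swap only
    // happens if that value equals the comparand (here: 0).
    long seen = Interlocked.CompareExchange(ref _value, 10, 0);  // LOCK CMPXCHG
  }
}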

Implementation: Interlock

Using atomics, it is possible to implement the class without any synchronisation primitive at all. Here is what that looks like:

First, we make sure the struct is padded to avoid false sharing:

[StructLayout(LayoutKind.Sequential)]
unsafe private struct OperationData
{
  private fixed byte Pad0[64];  // keep the data off the previous struct's cache lines
  public long TimeMs;
  public long Count;
  public long MaxTimeMs;
  public long MinTimeMs;
  private fixed byte Pad1[64];  // guard against false sharing with the next struct
}

Second, we use atomics (aka: Interlocked in C#) to make sure we change the struct correctly:

public void Add(int operation, int timeMs)
{
  fixed (OperationData* s = &_opsData[operation])
  {
    Interlocked.Add(ref s->TimeMs, timeMs);
    Interlocked.Increment(ref s->Count);

    // Lower the minimum with a CAS loop: retry until we either install
    // our value, or someone else has installed one that is lower still.
    long oldMinVal = Interlocked.Read(ref s->MinTimeMs);
    while (oldMinVal > timeMs)
    {
      long seen = Interlocked.CompareExchange(ref s->MinTimeMs, timeMs, oldMinVal);
      if (seen == oldMinVal) break; // our value went in
      oldMinVal = seen;             // lost the race; retry with what we saw
    }

    // The same dance for the maximum.
    long oldMaxVal = Interlocked.Read(ref s->MaxTimeMs);
    while (oldMaxVal < timeMs)
    {
      long seen = Interlocked.CompareExchange(ref s->MaxTimeMs, timeMs, oldMaxVal);
      if (seen == oldMaxVal) break;
      oldMaxVal = seen;
    }
  }
}

Note that Interlocked.CompareExchange (CPU instruction: CMPXCHG) often requires a loop to guarantee correctness. You may be able to infer that this could lead to starvation if some cores loop faster than others (and you would be right).

What we have created here is a true lock free implementation (see: Non-blocking algorithm). Strictly speaking, it is lock free rather than wait free: the CAS retry loops mean an unlucky thread can, in principle, retry indefinitely.

Benchmark: Atomics/Interlock Implementation

Here is the lock free algorithm in action:

[Graph: the Interlocked implementation compared with the lock-based implementations]

… Faster than the .NET SpinLock at low scale and just as scalable as we approach 64 cores.

Not quite as fast as the TTAS at low scale though. Why? The atomic instructions require a lock on a cache line (so other cores don’t update it). This is not free: it costs on the order of tens of CPU cycles, even when the cache line is uncontested. If you think about the TTAS implementation, only a single cache line lock is needed to acquire the lock – the Interlocked.CompareExchange in the spin lock. In the case of the purely atomic implementation, we need at least four locks: one to update each of the four fields in the struct, and potentially more than one for the min/max, depending on whether we are racing against others.

Summary

In this third part of the series, we have seen how even lightweight locks don’t scale well beyond a single core when we compete for cache lines. There really is no such thing as magic when it comes to synchronisation.

We have also seen how synchronisation methods have to carefully balance the cost of acquiring the lock against the scalability of the lock. The .NET SpinLock and TTAS implementations illustrate this balancing act.

It seems we are lost here. It is clear that a single CPU core can do 90K operations every millisecond if we don’t have any contention – yet even a moderate amount of contention breaks the scalability of every implementation we have seen so far. If we were to build an instrumentation framework on the data structures we have seen so far and plug it into a high speed system, we would be paying a significant overhead just to measure the system.

These numbers are pathetic, aren't they? We have to get smarter…

Which is why you will be reading: