Bounce Metronome Wiki- Bounce helps you keep time

High performance critical sections using InterlockedExchange

From Bounce Metronome

Jump to: navigation, search

Contents

Intro

You will notice that with Tune Smithy and Bounce Metronome you can change any parameters in the user interface while the notes are played and the tune or rhythmn responds instantly with no glitches in the timing of the notes.

The tune needs to be played in its own high priority thread, otherwise it will get interrupted every time the user moves the mouse or interacts with the user interface. So you have to have multi-threading to do something like this - but multi-threading then has the problem that you have to make sure only one thread accesses the data at a time.

Why do you need critical sections?

Suppose the user changes the tuning, simple example. It gets updated in the user interface - but what about the tune playing thread? The code has to set the number of notes in the scale and then, as separate steps, set the pitch of each note individually. (You could change the order in which you do things, it doesn't matter, the main thing is you can only do one assignment at a time).

So then, remember the tune playing thread can interrupt the gui thread at any point. So, it might access the data when the number of notes has been changed to the new value but before the new scale values have been updated. It will then play notes in some garbage scale using whatever left over data there is in the locations where the scale will be saved.

You can get more subtle errors too, since even a single line of code consists of several assembly language instructions, so the thread could also get interrupted mid assignment with e.g. maybe only half of a 32 bit integer updated or whatever, depending how the assembly code works.

The usual way to avoid these errors is to make sure only one thread can access the data at a time.

That is what a critical section is, a way to make sure only one thread at a time can access data.

(NB you could optimise it further if you wanted to, by making a distinction between code that writes to the data and code that reads from it - when the data is being written no other thread should access it but there are no problems with multiple thread reading it at the same time).

So - why not use the standard methods in Windows for this?

In my programs I find that the code requires frequent use of critical sections, but most times the tune playing thread just goes straight through the critical section and gets immediate access.

In this sort of situation, EnterCriticalSection or WaitForSingleObject or mutexes have way too much overhead. But you do still need the critical section as you could get unexpected results or crashes in the rare situations where the critical section is needed. We need a way to restrict access to the data to a single thread at a time, but without the time penalties of the usual windows routines for doing this.

So you have to go through all your code and identify all the places where the code accesses the tuning data - either reads it or writes it. Then you have to wrap all those sections of code in critical sections. The idea is to guarantee that at any time, only one thread is able to access the tuning data.

How much of a difference does it make?

It may make hardly any difference at all on a modern computer. So it's worth testing first to see if it is an issue and what amount of time saving is involved before using this code.

Here is a timing test I did on a modern computer (a Lenovo e520 laptop):

Time for 1000000 loops for:
EnterCriticalSection, LeaveCriticalSection 0.0246134
InterlockedExchange, InterlockedExchange 0.0231181
That's after subtracting time for empty loop of 0.00260681

That's for case of no thread contention.

What is a critical section

A critical section is any section of your code that accesses data that might be needed by another thread. So for instance in the case of a music application, the tuning is an example of data that is needed by the user interface and also by the tune playing thread.

Simple implementation - and why it doesn't work

A simple way to do that is to have a variable, call it say access_tuning_data . Then whenever you have some code that needs to access the data, before you execute that code, set access_tuning_data=1 to request access to the data.

But before you do that, first check to see if the variable access_tuning_data is zero. If it is non zero then some other part of your program already has requested access to the tuning data and got it, so your code needs to wait for it to be set to zero before proceeding.

That is the basic idea of a critical section. But this simple approach requires two lines of code:

if(access_tuning_data==0)
 {
  access_tuning_data=1;
  // Critical code
  // ...
 }
 access_tuning_data=0;

So - our thread could be suspended between the line:

if(access_tuning_data==0)
and the line
access_tuning_data=1;

which lets another thread access the data (since we still have access_tuning_data==0)

Also more subtly, an assignment may set the value in the CPU cache which is only later updated to the memory. On a CPU with out of order execution of statements, there are many steps involved in a simple assignment of a value to memory Out of order execution.

Somehow you need to replace the pair of statements by a single atomic operation. Then secondly, you need to figure out how to wait if the another part of the program has access, and a "busy wait" at that point will take a lot of CPU time.

InterlockedExchange

Instead of the heavy weight Windows routines, I use InterlockedExchange. This does the test for the current value at a memory location, and a change of the value as a single atomic operation.

InterlockedExchange is a high level way to do an atomic Test and Set in Windows.

An atomic assignment is one that can be guaranteed not to be interrupted before the assignment is completed. It is atomic at the memory access level - guaranteed to commit to RAM - so it will work even with multi-core machines.

The idea is simple really. Whenever a thread wants to execute some code that accesses the shared data, it attempts to set a predetermined location in memory to the value 1 in an atomic operation in order to request permission to access the data. If it finds the value was already 1 then another thread is accessing it so it waits. if the value was 0 and the instruction successfully assigned it to 1 then it has received permission, and it is okay to enter.

To replace the busy wait I just do sleeps for 1 ms, then 2 ms and keep increasing the amount of the sleep. This simple method works just fine in the situation where access to the data is usually immediate and contention is rare - normally no sleeps occur and the code then is similar to a simple assignment, no performance penalty at all.

The code

So, you could do it something like this, where I have added a simple sleep if the thread fails to enter the critical section of code

void SleepExact(DWORD dwMs);
 
void FastEnterCriticalSection(__int32 *pspinlock)
{
 int nloops=0;
 UINT usleep=1;
 for(nloops=0;;nloops++)
 {
  if(InterlockedExchange((__int32 *)pspinlock,1)==0)
   //atomic assignment of the memory address to 1 to access the critical section
   // previous value was 0 so no other thread is accessing it so okay to continue
   return;
  // previous value wasn't 0, so another thread still has access, so keep waiting
  // for the other thread to release it
 
  // try longer and longer sleeps each time around. Details here unimportant
  // as typically the tune playing thread never gets here
  // but just in case it does, use high precision sleep to make sure
  // it sleeps for as little time as possible first time around
  SleepExact((DWORD) min(10,nloops));
 }
}
 
void FastLeaveCriticalSection(__int32 *pspinlock)
{
 // Atomic assignment back to 0 to release the section
 if(InterlockedExchange((__int32 *)pspinlock,0)!=1)
  DebugBreak(); // shouldn't happen
}
 
void SleepExact(DWORD dwMs)
{
 UINT uTimerRes=1;// want high precision sleeps max of 1ms error.
 timeBeginPeriod(uTimerRes); // see e.g. http://www.geisswerks.com/ryan/FAQS/timing.html
 Sleep(dwMs);
 timeEndPeriod(uTimerRes);
}

You would call it typically like this:

int permission_to_access_tuning_data;
...
FastEnterCriticalSection(&permission_to_access_tuning_data);
 
// access of data which is shared with other threads
 
FastLeaveCriticalSection(&permission_to_access_tuning_data);

here permission_to_access_tuning_data is a global variable shared between all the threads that require access to the data. If you use this code in C++ you may be wary about using globals like this. But make sure it has a long descriptive name that can't be used anywhere else in your code and you should be okay.

You could make it a static and wrap it in a routine if you want to make sure no other code can possibly access it, perhaps something like this:

void AccessTuningData(int enter)
{
 static int permission_to_access_tuning_data;
 if(enter)
  FastEnterCriticalSection(&permission_to_access_tuning_data);
 else
  FastLeaveCriticalSection(&permission_to_access_tuning_data);
}
void WaitForAccessForTuningData(void)
{
 AccessTuningData(1/*enter*/);
}
 
void ReleaseAccessToTuningData(void)
{
 AccessTuningData(0/*enter*/);
}
 
void BusyThread(void *pvoid)
{
 int i=0, nbusy=100000;
 for(;;)
 {
  WaitForAccessForTuningData();
  // This is where your program can access the tuning data to write to it or read from it
  for(i=0;i<nbusy;i++)
  {
   // in this demo, we don't actually need to do anything, so might as well just sleep
   // to use minimum amount of cpu
   Sleep(1);
  }
  ReleaseAccessToTuningData();
 }
}
 
void ThreadContentionTest(void)
{
// start several threads - all busy with critical sections
// so they have to wait for each other
// 
 UINT uthread1=_beginthread(BusyThread,0,NULL);
 UINT uthread2=_beginthread(BusyThread,0,NULL);
 UINT uthread3=_beginthread(BusyThread,0,NULL);
 UINT uthread4=_beginthread(BusyThread,0,NULL);
}

I've tried playing a fast fractal tune with many notes in it, and adjusting parameters in Tune Smithy, and normally the threads never need to wait at all.

In a situation like this, you can enter / leave critical sections as often as you like with no more performance penalty than a simple assignment.

Test App

Let's put all this together into a console app you can compile and run to see the fast critical sections in action:

#include <windows.h>
#include <math.h>
#include <stdio.h>
#include <process.h> // for _beginthread
 
// you also need to link with  winmm.lib for timeBeginPeriod and timeEndPeriod
 
void SleepExact(DWORD dwMs);
 
void FastEnterCriticalSection(__int32 *pspinlock)
{
 int nloops=0;
 UINT usleep=1;
 for(nloops=0;;nloops++)
 {
  if(InterlockedExchange((__int32 *)pspinlock,1)==0)
   //atomic assignment of the memory address to 1 to access the critical section
   // previous value was 0 so no other thread is accessing it so okay to continue
   return;
  // previous value wasn't 0, so another thread still has access, so keep waiting
  // for the other thread to release it
 
  // try longer and longer sleeps each time around. Details here unimportant
  // as typically the tune playing thread never gets here
  // but just in case it does, use high precision sleep to make sure
  // it sleeps for as little time as possible first time around
  SleepExact((DWORD) min(10,nloops));
 }
}
 
void FastLeaveCriticalSection(__int32 *pspinlock)
{
 // Atomic assignment back to 0 to release the section
 if(InterlockedExchange((__int32 *)pspinlock,0)!=1)
  DebugBreak(); // shouldn't happen
}
 
void SleepExact(DWORD dwMs)
{
 UINT uTimerRes=1;// want high precision sleeps max of 1ms error.
 timeBeginPeriod(uTimerRes); // see e.g. http://www.geisswerks.com/ryan/FAQS/timing.html
 Sleep(dwMs);
 timeEndPeriod(uTimerRes);
}
 
void AccessTuningData(int enter)
{
 static int permission_to_access_tuning_data;
 if(enter)
  FastEnterCriticalSection(&permission_to_access_tuning_data);
 else
  FastLeaveCriticalSection(&permission_to_access_tuning_data);
}
void WaitForAccessForTuningData(void)
{
 AccessTuningData(1/*enter*/);
}
 
void ReleaseAccessToTuningData(void)
{
 AccessTuningData(0/*enter*/);
}
 
void BusyThread(void *pvoid)
{
 int i=0, nbusy=100;
 for(;;)
 {
  WaitForAccessForTuningData();
  // This is where your program can access the tuning data to write to it or read from it
  for(i=0;i<nbusy;i++)
  {
   // in this demo, we don't actually need to do anything, so might as well just sleep
   // to use minimum amount of cpu
   Sleep(1);
  } for(i=0;i<nbusy;i++)
  { 
   // Don't actually need to do anything, so might as well just sleep, to avoid using excessive cpu
   Sleep(1);
  }
  ReleaseAccessToTuningData();
  Sleep(20);// to give another thread a chance to access the data since they are probably all doing long sleeps by now
 }
}
 
void ThreadContentionTest(void)
{
// start several threads - all busy with critical sections
// so they have to wait for each other
// 
 UINT uthread1=_beginthread(BusyThread,0,NULL);
 UINT uthread2=_beginthread(BusyThread,0,NULL);
 UINT uthread3=_beginthread(BusyThread,0,NULL);
 UINT uthread4=_beginthread(BusyThread,0,NULL);
}
 
main(int argc,char *argv[])
{ 
 ThreadContentionTest();
 printf("\npress any char to exit");
 getchar();
 return 0;
}

Compile and run that program under the debugger. Then break at random with your debugger and you will see that only one of the threads is within one of the critical sections at any one time which proves that the critical sections are working succesfully. You can tweak this console app and more tests and diagnostics to get a feel for how it all works.

Stress Testing

This has two identical threads, both of which continually try to access code protected by the same critical section.

When it is in the critical section, the thread spends most of its time sleeping to reduce CPU. When outside, it goes into a busy loop attempting to gain access again (with no sleeps at all).

This creates the toughest possible challenge for a critical section.

It is best to run this code on a multi-core machine as it uses about 100% of CPU on a single core.

Try this first with the line:

// #define cl_simple_assignment

commented out. You should get no error messages. Then try uncommenting it:

#define cl_simple_assignment

It will then probably fail almost immediately with an error message and debug break. To get an idea of how often the errors occur you may want to just show the error messages in your compiler's output window, but not break in the debugger. To do that, set

skip_debug_break=TRUE;

So - here is the code, which should compile okay on any compiler able to compile Windows console apps.

#include <windows.h>
#include <math.h>
#include <stdio.h>
#include <process.h> // for _beginthread
 
#define cl_stress_test // uses 100% cpu of one core most likely
 
// #define cl_simple_assignment // uncomment this line to try a simple assignment in place of InterlockedExchange
 
BOOL skip_debug_break=FALSE;
BOOL show_all_thread_id_changes=FALSE;
 
// you also need to link with  winmm.lib for timeBeginPeriod and timeEndPeriod
 
void SleepExact(DWORD dwMs);
void CallDebugBreak(void)
{
 // Wrapping the debug break like this lets you see the c-code for the line where the break happens
 // if debugging in MSVC 6
 if(skip_debug_break)
  return;
 DebugBreak();
}
 
void FastEnterCriticalSection(__int32 *pspinlock)
{
 int nloops=0;
 UINT usleep=1;
 for(nloops=0;;nloops++)
 {
#ifdef cl_simple_assignment
  if(*pspinlock==0)
  {*pspinlock=1;return;}
#else
  if(InterlockedExchange((__int32 *)pspinlock,1)==0)
   //atomic assignment of the memory address to 1 to access the critical section
   // previous value was 0 so no other thread is accessing it so okay to continue
   return;
#endif
  // previous value wasn't 0, so another thread still has access, so keep waiting
  // for the other thread to release it
 
  // try longer and longer sleeps each time around. Details here unimportant
  // as typically the tune playing thread never gets here
#ifndef cl_stress_test
  SleepExact((DWORD) min(10,nloops));
#endif
  if(nloops>0)
   nloops=nloops;
 }
}
 
void FastLeaveCriticalSection(__int32 *pspinlock)
{
#ifdef cl_simple_assignment
 if(*pspinlock==0)
 {
  OutputDebugString("Another thread has accessed this critical section and also exited from it as well.\n"
   "*pspinlock==0 while still in the critical section\n");
  CallDebugBreak(); // another thread has already entered and left the critical section
 }
 *pspinlock=0;
#else
 // Atomic assignment back to 0 to release the section
 if(InterlockedExchange((__int32 *)pspinlock,0)!=1)
  CallDebugBreak(); // shouldn't happen
#endif
}
 
void SleepExact(DWORD dwMs)
{
 UINT uTimerRes=1;// want high precision sleeps max of 1ms error.
 timeBeginPeriod(uTimerRes); // see e.g. http://www.geisswerks.com/ryan/FAQS/timing.html
 Sleep(dwMs);
 timeEndPeriod(uTimerRes);
}
 
void AccessTuningData(int enter)
{
 static int permission_to_access_tuning_data;
 if(enter)
  FastEnterCriticalSection(&permission_to_access_tuning_data);
 else
  FastLeaveCriticalSection(&permission_to_access_tuning_data);
}
void WaitForAccessForTuningData(void)
{
 AccessTuningData(1/*enter*/);
}
 
void ReleaseAccessToTuningData(void)
{
 AccessTuningData(0/*enter*/);
}
 
UINT uthread_busy;
 
void BusyThread(void *pvoid)
{
 int i=0, nbusy=10;
 UINT uthread_current=GetCurrentThreadId();
 UINT uthread_busy_was=uthread_current;
 for(;;)
 {
  BOOL shown_debug_output=FALSE;
  WaitForAccessForTuningData();
  uthread_busy=uthread_current;
  if(show_all_thread_id_changes)
  if(uthread_current!=uthread_busy_was)
  {
   char sz[100];
   sprintf(sz,"Access of critical section by thread id: %lu\n"
    ,uthread_current
    );
   OutputDebugString(sz);
  }
  uthread_busy_was=uthread_busy;
  for(i=0;i<nbusy;i++)
  { 
   if(!shown_debug_output)
   if(uthread_busy!=GetCurrentThreadId())
   {
    char sz[100];
    sprintf(sz,"Another thread has accessed the critical section of code at the same time as this one\n"
     "uthread_busy %lu!=uthread_current %lu\n",uthread_busy,uthread_current);
    OutputDebugString(sz);
    CallDebugBreak();
    shown_debug_output=1;
   }
  // Don't actually need to do anything, so might as well just sleep, to avoid using excessive cpu
   Sleep(1);
  }
  uthread_busy=0;
  ReleaseAccessToTuningData();
#ifndef cl_stress_test
  Sleep(20);// to give another thread a chance to access the data since they are probably all doing long sleeps by now
#endif
 }
}
 
void ThreadContentionTest(void)
{
// start several threads - all busy with critical sections
// so they have to wait for each other
// 
 UINT uthread1=_beginthread(BusyThread,0,NULL);
 UINT uthread2=_beginthread(BusyThread,0,NULL);
#ifndef cl_stress_test
 UINT uthread3=_beginthread(BusyThread,0,NULL);
 UINT uthread4=_beginthread(BusyThread,0,NULL);
#endif
}
 
main(int argc,char *argv[])
{ 
 ThreadContentionTest();
 printf("\npress RETURN KEY to exit");
 getchar();
 return 0;
}

Optimisation

You want the critical sections of the code to be as short as possible especially if there is anything CPU intensive going on.

Look for the place in the code after you read the data from the user interface, and look for the lines of code where you assign that data to internal variables. That is the only bit of code you need to put in a critical section (normally). Plus of course all the sections in the code where you need to access that data.

The idea is to update the shared data as quickly as possible, and then get out of the critical section. The tune can play notes while you are in the middle of drawing a single frame for the graphics, for instance, or while you are in the middle of parsing data from the user interface, or while you are in the middle of interpreting mouse gestures or whatever.

What not to put in a critical section

Ideally, don't put any graphics code in the critical section if you can help it. Or put as little of it as you can in the critical section.

Don't bracket the code that parses the user input either, at least not if it takes significant time to handle it. Especially, if you are waiting for user input, at that point your code shouldn't be inside a critical section.

This is what you do ideally. In practise on a modern computer graphics drawing is fast enough so you can get by with a bit of graphics drawing in a critical section if it doesn't take too long.

Basically anything which takes a significant amount of time in milliseconds should go outside your critical sections, so try timing the code to see how long it takes and if it takes several milliseconds then it should go outside your critical sections.

You can always find a way to do that if you think hard about it. One way is to copy the data you need to a new location accessed only by the graphics drawing code. So you only need to use the critical section around the short section of code that copies the original data to the new location.

Then from then on your graphics code can access the copy rather than the original, so doesn't need to be in a critical section.

See Also

There doesn't seem to be that much material available online about use of InterlockedExchange like this which is why I wrote this.

But these links may be useful:

Speed Up Synchronization Locks: How and Why?

For more background see this C-tutorial: Writing a simple lock mutex in c assembly

Double checked locking

In the above links, slide 20 shows another idea - where I just have simple sleeps, you can fall back to one of the other thread synchronisation methods instead. So basically you have the atomic assignment normally with all the speed you get that way if there is no contention - but whenever the thread fails to get access to the data that way, then fall back to critical sections or mutexes or whatever.

If you do that however, you will need to make sure you insert appropriate memory barriers to prevent optimisations from re-ordering the memory access operations. For details, see Double checked locking

Pure software solutions - portable to any platform

There are also pure software solutions. You have to use different code for each thread, but that could work fine if e.g. you have just two threads, the gui thread and the tune playing thread.

See Peterson's algorithm and for earlier less satisfactory version, Dekker's algorithm.

Again, as with double checked locking, you will need to use memory barriers to prevent out of order statement execution.

Views
Personal tools
Check out the Astonishing Bounce Metronome Pro

Beginners find the bounce wonderfully easy to work with.

Professional musicians are amazed by its advanced rhythm capabilities

Buy Now

box shot for Bounce Metronome

Or, get your free download
Free 30-day trial
Free taster bounce metronome, yours to keep

Or, find out more:
bouncemetronome.com

(By Robert Walker)