Threading Fun

Posted: May 24th, 2010 | 7 Comments »

So we had an interesting threading bug at work today which I thought I'd write up here as I hadn't seen this specific problem before (note I didn't write this code, I just helped debug it).  The set up was a basic single producer single consumer arrangement something like this:

#include <Windows.h>
#include <cassert>

volatile LONG gAvailable = 0;

// thread 1
	while (1)

// thread 2
	while (1)
		// pull available work with a limit of 5 items per iteration
		LONG work = min(gAvailable, 5);

		// this should never fire.. right?
		assert(work <= 5);

		// update available work
		InterlockedExchangeAdd(&gAvailable, -work);

int main(int argc, char* argv[])
	HANDLE h[2];

	h[0] = CreateThread(0, 0, Consumer, NULL, 0, 0);
	h[1] = CreateThread(0, 0, Producer, NULL, 0, 0);

	WaitForMultipleObjects(2, h, TRUE, INFINITE);

	return 0;

So where's the problem? What would make the assert fire?

We triple-checked the logic and couldn't see anything wrong (it was more complicated than the example above so there were a number of possible culprits) and unlike the example above there were no asserts, just a hung thread at some later stage of execution.

Unfortunately the bug reproduced only once every other week so we knew we had to fix it while I had it in a debugger. We checked all the relevant in-memory data and couldn't see any that had obviously been overwritten ("memory stomp" is usually the first thing called out when these kinds of bugs show up).

It took us a while but eventually we checked the disassembly for the call to min(). Much to our surprise it was performing two loads of gAvailable instead of the one we had expected!

This happened to be on X360 but the same problem occurs on Win32, here's the disassembly for the code above (VS2010 Debug):

// calculate available work with a limit of 5 items per iteration
LONG work = min(gAvailable, 5);

// (1) read gAvailable, compare against 5
002D1457  cmp         dword ptr [gAvailable (2D7140h)],5
002D145E  jge         Consumer+3Dh (2D146Dh)

// (2) read gAvailable again, store on stack
002D1460  mov         eax,dword ptr [gAvailable (2D7140h)]
002D1465  mov         dword ptr [ebp-0D0h],eax
002D146B  jmp         Consumer+47h (2D1477h)
002D146D  mov         dword ptr [ebp-0D0h],5

// (3) store gAvailable from (2) in 'work'
002D1477  mov         ecx,dword ptr [ebp-0D0h]
002D147D  mov         dword ptr [work],ecx

The question is what happens between (1) and (2)? Well the answer is that any other thread can add to gAvailable, meaning that the stored value at (3) is now > 5.

In this case the simple solution was to read gAvailable outside of the call to min():

// pull available work with a limit of 5 items per iteration
LONG available = gAvailable;
LONG work = min(available, 5);

Maybe this is obvious to some people but it sure caused me and some smart people a headache for a few hours :)

Note that you may not see the problem in some build configurations depending on whether or not the compiler generates code to perform the second read of the variable after the comparison. As far as I know there are no guarantees about what it may or may not do in this case, FWIW we had the problem in a release build with optimisations enabled.

Big props to Tom and Ruslan at Lucas for helping track this one down.


7 Comments on “Threading Fun”

  1. 1 gyakoo said at 11:24 pm on May 26th, 2010:

    Maybe it could be because, the std::min implementation (in vs2008) is:

    template inline
    const _Ty& (__CLRCALL_OR_CDECL min)(const _Ty& _Left, const _Ty& _Right, _Pr _Pred)
    { // return smaller of _Left and _Right using _Pred
    return (_DEBUG_LT_PRED(_Pred, _Right, _Left) ? _Right : _Left);

    Being _Left a const reference&, and it is used two times, (1) to call to predicate and (2) then return from function, so two memory loads are performed. In this case, a value parameter pass should be great, but you're actually copying before, outside the method :)

    We need understand better what stl is doing, I think so.

    Good post!

  2. 2 mmack said at 8:10 am on May 27th, 2010:

    Hi Gyakoo, you're right in a sense that the bug appeared because std::min takes a const reference. However this isn't a problem with the stl, any similar implementation would have had the same issue.

    When you pass the address of a variable to a function the compiler is free to generate as many or as few loads as it needs to.

    I think the reason we didn't spot it sooner was simply that we thought min was such a base piece of functionality we didn't consider it being non-atomic.

    If the function had instead been called DoSomeWorkWithThisPtr(int* pData) then we probably would have been more suspicious.

    Also on some compilers you would have to actually cast away the volatile to pass it to std::min (which was being done in our case), that should have been another red flag.

  3. 3 Arseny Kapoulkine said at 2:50 am on May 30th, 2010:

    Please note that this problem has absolutely nothing to do with std::min. Your include and using namespace std; merely confuse the reader.

    Your code uses min() macro as defined in windows.h, which evaluates it's arguments twice; since the argument in this case is a volatile variable, compiler has to generate two loads even in release build.

    std::min can not be called with volatile argument, since volatile long can not be used as a const long& argument. You can easily verify this by defining NOMINMAX.

    If you make your argument non-volatile, you'll get a single load with optimizations enabled; also using a min function which is declared as T min(T, T) will generate single load and will work with volatile. Still, explicitly loading into a local variable is the proper way to write this.

    Say "Hi" to Ruslan for me :)

  4. 4 mmack said at 6:56 am on May 30th, 2010:

    Thanks Arseny, I've updated the post to hopefully be more clear.

    I totally agree the problem has nothing to do with std::min. It just happens our real life case on X360 was using it and casting away volatile to allow it to compile.

    In that case it still generated two loads in optimized builds but as I understand it that's perfectly reasonable behaviour, just a little unexpected.

  5. 5 Daniel said at 9:47 pm on June 2nd, 2010:

    I think there's still a race condition in that code even with pulling the read out to a local variable. The read and write to gAvailable are not atomic. A first thread could read from gAvailable, then the second thread could read and write to it, then the first thread would write to it, and gAvailable would now be negative (or wrap around if unsigned). It may only happen very rarely if threads don't spend much time between the read and the write.

    It's kindof ironic that an interlocked function is used for the subtract when the code right above it is not atomic.

  6. 6 mmack said at 10:30 pm on June 2nd, 2010:

    Hmm maybe I'm being dense but I can't see how gAvailable can become negative.

    Assuming the single read of gAvailable into the local variable on the consumer is atomic (which I believe it is), then it can only ever see a value less than or equal to what the producer sees.

    In the case where the producer performs an increment of gAvailable after the consumer's read (but before the consumers write) the value after subtraction will be > 0.

    In the case that the producer did not perform any increments in the space between the consumer's read and write the minimum value is zero.

    I might have missed something obvious though..

  7. 7 Daniel said at 10:48 pm on June 4th, 2010:

    I was assuming there were multiple consumer threads, which is not in your code. Oops =)

    Your blog is awesome btw, especially that latest post about calculating inscattering analytically. There aren't many places that show the math steps. I just don't understand how you accomplish so much in your free time.

    And congrats on your job at LucasArts.

Leave a Reply