Two threads, one cache line

Posted: January 9th, 2009 | 3 Comments »

An interesting thread about multithreaded programming going around the GDA mailing list at the moment reminded me of a little test app I wrote a while back to measure the cost of two threads writing to memory on the same cache line.

The program creates two threads that each increment their own variable a large number of times, and measures how long this takes for different distances between the two write addresses. Something like this:

#include <windows.h>
#include <iostream>

using namespace std;

// one aligned array; each thread writes to a different element
__declspec (align (512)) volatile int B[128];


// QueryPerformanceCounter wrapper, returns the current time in seconds
double GetSeconds()
{
	LARGE_INTEGER freq, count;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&count);
	return double(count.QuadPart) / double(freq.QuadPart);
}


DWORD WINAPI ThreadProc(LPVOID param)
{
	// read/write the address a whole lot
	for (int i=0; i < 10000000; ++i)
	{
		(*(volatile int*)param)++;
	}

	return 0;
}


int main()
{
	volatile int* d1 = &B[0];
	volatile int* d2 = &B[127];

	while (d1 != d2)
	{
		HANDLE threads[2];

		double start = GetSeconds();

		threads[0] = CreateThread(NULL, 0, ThreadProc, (void*)d1, 0, NULL);
		threads[1] = CreateThread(NULL, 0, ThreadProc, (void*)d2, 0, NULL);

		WaitForMultipleObjects(2, threads, TRUE, INFINITE);

		double end = GetSeconds();

		CloseHandle(threads[0]);
		CloseHandle(threads[1]);

		--d2;

		cout << (d2-d1) * sizeof(int) << "bytes apart: " << (end-start)*1000.0f << "ms" << endl;
	}
	
	int i;
	cin >> i;
	return 0;
}

On my old P4, which has a 64-byte cache line, these are the results:

128bytes apart: 17.4153ms
124bytes apart: 17.878ms
120bytes apart: 17.4028ms
116bytes apart: 17.3625ms
112bytes apart: 17.959ms
108bytes apart: 18.0241ms
104bytes apart: 17.2938ms
100bytes apart: 17.6643ms
96bytes apart: 17.5377ms
92bytes apart: 19.3156ms
88bytes apart: 17.2013ms
84bytes apart: 17.9361ms
80bytes apart: 17.1321ms
76bytes apart: 17.5997ms
72bytes apart: 17.4634ms
68bytes apart: 17.6562ms
64bytes apart: 17.4704ms
60bytes apart: 17.9947ms
56bytes apart: 149.759ms ***
52bytes apart: 151.64ms
48bytes apart: 150.132ms
44bytes apart: 125.318ms
40bytes apart: 160.33ms
36bytes apart: 147.889ms
32bytes apart: 152.42ms
28bytes apart: 157.003ms
24bytes apart: 149.552ms
20bytes apart: 142.372ms
16bytes apart: 136.908ms
12bytes apart: 145.691ms
8bytes apart: 146.768ms
4bytes apart: 128.408ms
0bytes apart: 125.655ms

You can see that when it gets to 56 bytes there is a large penalty (9x slower!): the two writes now land on the same cache line, which brings the cache-coherency protocol into play. Each thread's write invalidates the line in the other core's cache, forcing it to be constantly re-fetched.

Actually, it turns out this is called "false sharing" and it's quite well known; the common solution is to pad your shared data so that items written by different threads are at least one cache line apart.

Refs:

http://software.intel.com/en-us/articles/reduce-false-sharing-in-net/
http://www.ddj.com/embedded/196902836



3 Comments on “Two threads, one cache line”

  1. Arseny Kapoulkine said at 2:31 pm on March 8th, 2010:

    1. When the difference between the two pointers is exactly equal to the cache line size, the variables (assuming equal sizes and no cache line striding) are located in adjacent cache lines, not in a single one; so you'd expect the penalty to start at a difference of 60 bytes.

    2. Since you decrement d2 before printing, you output the time for the current iteration but the distance for the next one, so you get 56 instead of 60.

  2. mmack said at 7:32 pm on March 8th, 2010:

    Hey Arseny, thanks a lot for pointing out the bug in my code. That explains it!

    Looking forward to some more blog posts from you soon :)

  3. Duckie said at 1:11 pm on January 30th, 2014:

    Hello!

    It seems to me that you should not use the volatile keyword, since it can cause the data not to be pushed into the cache, right? Can you explain why you used "volatile"?
