Two threads, one cache line

An interesting thread about multithreaded programming going around the GDA mailing list at the moment reminded me of a little test app I wrote a while back to measure the cost of two threads accessing memory on the same cache line.

The program creates two threads that each increment a counter a large number of times, and measures how long they take to complete as the distance between the two write addresses varies. Something like this:

#include <windows.h>
#include <iostream>

using namespace std;

// align the array so it starts on a cache-line boundary
__declspec(align(512)) volatile int B[128];

// simple QueryPerformanceCounter wrapper
double GetSeconds()
{
    LARGE_INTEGER t, freq;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&freq);
    return double(t.QuadPart) / double(freq.QuadPart);
}


DWORD WINAPI ThreadProc(LPVOID param)
{
    // read/write the address a whole lot
    for (int i=0; i < 10000000; ++i)
    {
        (*(volatile int*)param)++;
    }

    return 0;
}


int main()
{
    volatile int* d1 = &B[0];
    volatile int* d2 = &B[127];

    while (d1 != d2)
    {
        HANDLE threads[2];

        // QPC wrapper
        double start = GetSeconds();

        threads[0] = CreateThread(NULL, 0, ThreadProc, (void*)d1, 0, NULL);
        threads[1] = CreateThread(NULL, 0, ThreadProc, (void*)d2, 0, NULL);

        WaitForMultipleObjects(2, threads, TRUE, INFINITE);

        double end = GetSeconds();

        --d2;

        cout << (d2-d1) * sizeof(int) << "bytes apart: " << (end-start)*1000.0f << "ms" << endl;
    }
    
    // wait for input so the console window stays open
    int i;
    cin >> i;
    return 0;
}

On my old P4, which has a 64-byte cache line, these are the results:


128bytes apart: 17.4153ms
124bytes apart: 17.878ms
120bytes apart: 17.4028ms
116bytes apart: 17.3625ms
112bytes apart: 17.959ms
108bytes apart: 18.0241ms
104bytes apart: 17.2938ms
100bytes apart: 17.6643ms
96bytes apart: 17.5377ms
92bytes apart: 19.3156ms
88bytes apart: 17.2013ms
84bytes apart: 17.9361ms
80bytes apart: 17.1321ms
76bytes apart: 17.5997ms
72bytes apart: 17.4634ms
68bytes apart: 17.6562ms
64bytes apart: 17.4704ms
60bytes apart: 17.9947ms
56bytes apart: 149.759ms ***
52bytes apart: 151.64ms
48bytes apart: 150.132ms
44bytes apart: 125.318ms
40bytes apart: 160.33ms
36bytes apart: 147.889ms
32bytes apart: 152.42ms
28bytes apart: 157.003ms
24bytes apart: 149.552ms
20bytes apart: 142.372ms
16bytes apart: 136.908ms
12bytes apart: 145.691ms
8bytes apart: 146.768ms
4bytes apart: 128.408ms
0bytes apart: 125.655ms

You can see that once the addresses are 56 bytes apart there is a large penalty (roughly 9x slower!) as the cache-coherency protocol comes into play: each thread's write invalidates the copy of the line in the other core's cache, forcing it to be re-fetched over and over.

Actually, it turns out this is called "false sharing", and it's quite well known; the common solution is to pad your shared data so that values written by different threads sit at least one cache line apart.

Refs:

http://software.intel.com/en-us/articles/reduce-false-sharing-in-net/
http://www.ddj.com/embedded/196902836