Two threads, one cache line

An interesting thread about multithreaded programming going around the GDA mailing list at the moment reminded me of a little test app I wrote a while back to measure the cost of two threads accessing memory on the same cache line.

The program creates two threads that each increment a counter a large number of times, and measures how long they take to complete as the distance between the two write addresses varies. Something like this:

#include <windows.h>
#include <iostream>

using namespace std;

// align the array so it starts on a cache-line boundary
__declspec(align(512)) volatile int B[128];

// simple QueryPerformanceCounter wrapper
double GetSeconds()
{
    LARGE_INTEGER t, freq;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&freq);
    return double(t.QuadPart) / double(freq.QuadPart);
}


DWORD WINAPI ThreadProc(LPVOID param)
{
    // read/write the address a whole lot
    for (int i=0; i < 10000000; ++i)
    {
        (*(volatile int*)param)++;
    }

    return 0;
}


int main()
{
    volatile int* d1 = &B[0];
    volatile int* d2 = &B[127];

    while (d1 != d2)
    {
        HANDLE threads[2];

        // QPC wrapper
        double start = GetSeconds();

        threads[0] = CreateThread(NULL, 0, ThreadProc, (void*)d1, 0, NULL);
        threads[1] = CreateThread(NULL, 0, ThreadProc, (void*)d2, 0, NULL);

        WaitForMultipleObjects(2, threads, TRUE, INFINITE);

        double end = GetSeconds();

        --d2;

        cout << (d2-d1) * sizeof(int) << "bytes apart: " << (end-start)*1000.0f << "ms" << endl;
    }
    
    // wait for input so the console window stays open
    int i;
    cin >> i;
    return 0;
}

On my old P4, which has a 64-byte cache line, these are the results:


128bytes apart: 17.4153ms
124bytes apart: 17.878ms
120bytes apart: 17.4028ms
116bytes apart: 17.3625ms
112bytes apart: 17.959ms
108bytes apart: 18.0241ms
104bytes apart: 17.2938ms
100bytes apart: 17.6643ms
96bytes apart: 17.5377ms
92bytes apart: 19.3156ms
88bytes apart: 17.2013ms
84bytes apart: 17.9361ms
80bytes apart: 17.1321ms
76bytes apart: 17.5997ms
72bytes apart: 17.4634ms
68bytes apart: 17.6562ms
64bytes apart: 17.4704ms
60bytes apart: 17.9947ms
56bytes apart: 149.759ms ***
52bytes apart: 151.64ms
48bytes apart: 150.132ms
44bytes apart: 125.318ms
40bytes apart: 160.33ms
36bytes apart: 147.889ms
32bytes apart: 152.42ms
28bytes apart: 157.003ms
24bytes apart: 149.552ms
20bytes apart: 142.372ms
16bytes apart: 136.908ms
12bytes apart: 145.691ms
8bytes apart: 146.768ms
4bytes apart: 128.408ms
0bytes apart: 125.655ms

You can see that once the addresses are 56 bytes apart there is a large penalty (roughly 9x slower!) as the cache-coherency protocol comes into play: each thread's write invalidates the copy of the line in the other core's cache, forcing it to be re-fetched over and over.

Actually, it turns out this is called "false sharing", and it's quite well known; the common solution is to pad your shared data so that values written by different threads sit at least one cache line apart.

Refs:

http://software.intel.com/en-us/articles/reduce-false-sharing-in-net/
http://www.ddj.com/embedded/196902836