(Almost) realtime GI

Posted: January 21st, 2009 | 1 Comment »

After my initial implementation of surfel-based illumination I've extended it to do hierarchical clustering of surfels, based on an idea similar to the one presented in GPU Gems 2.

A few differences:

I'm using k-means clustering to build the approximation hierarchy bottom up. A couple of iterations of Lloyd's algorithm provides pretty good results; really you could get away with one iteration.

To seed the clustering I'm simply selecting every n'th surfel from the source input. At first I thought I should be choosing a random spatial distribution, but it turns out clustering based on regular intervals in the mesh works well. This is because you end up with more detail in highly tessellated places, which is what you want (assuming your input has been cleaned).

For example, a two-triangle wall will be clustered into one larger surfel, whereas a highly tessellated sphere will be broken into more clusters.

The error metric I used to do the clustering is this:

Error = (1.0f + (1.0f - Dot(surfel.normal, cluster.normal))) * Length(surfel.pos - cluster.pos)

So it's a combination of how aligned the surfel and cluster are and how far apart they are. You can experiment with weighting each of those terms individually, but this simple combination seems to give good results.
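Here's a rough sketch of the assignment step of Lloyd's algorithm using that metric. The Vec3 helpers and Surfel fields are illustrative, not my actual code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

float Dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
Vec3 Sub(const Vec3& a, const Vec3& b) { return Vec3{a.x-b.x, a.y-b.y, a.z-b.z}; }
float Length(const Vec3& v) { return sqrtf(Dot(v, v)); }

struct Surfel
{
	Vec3 pos;
	Vec3 normal;
	float area;
};

// combined normal-alignment and distance metric from the post
float ClusterError(const Surfel& surfel, const Surfel& cluster)
{
	return (1.0f + (1.0f - Dot(surfel.normal, cluster.normal)))
	       * Length(Sub(surfel.pos, cluster.pos));
}

// assignment step of Lloyd's algorithm: each surfel joins the
// cluster that minimises the error metric
size_t AssignToCluster(const Surfel& s, const std::vector<Surfel>& clusters)
{
	size_t best = 0;
	float bestError = ClusterError(s, clusters[0]);

	for (size_t i=1; i < clusters.size(); ++i)
	{
		float e = ClusterError(s, clusters[i]);
		if (e < bestError)
		{
			bestError = e;
			best = i;
		}
	}
	return best;
}
```

The update step then rebuilds each cluster's representative surfel from its members and you repeat.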

When summing up the surfels to form your representative cluster surfel you want to:

a) sum the area
b) average the position, normal, emission and albedo weighted by area

Weighting by area is quite important, and necessary for the emissive value, or you'll basically be adding energy into the simulation.
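The merge step can be sketched like this; the Surfel fields here are illustrative stand-ins:

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return Vec3{a.x+b.x, a.y+b.y, a.z+b.z}; }
Vec3 operator*(Vec3 a, float s) { return Vec3{a.x*s, a.y*s, a.z*s}; }

struct Surfel
{
	Vec3 pos;
	Vec3 normal;
	Vec3 emission;
	Vec3 albedo;
	float area;
};

// sum the areas, average everything else weighted by area so that
// no energy is added to the simulation
Surfel MergeSurfels(const std::vector<Surfel>& surfels)
{
	Surfel c = {};
	for (const Surfel& s : surfels)
	{
		c.area += s.area;
		c.pos = c.pos + s.pos*s.area;
		c.normal = c.normal + s.normal*s.area;
		c.emission = c.emission + s.emission*s.area;
		c.albedo = c.albedo + s.albedo*s.area;
	}

	float invArea = 1.0f / c.area;
	c.pos = c.pos*invArea;
	c.emission = c.emission*invArea;
	c.albedo = c.albedo*invArea;

	// renormalise the averaged normal
	float len = sqrtf(c.normal.x*c.normal.x + c.normal.y*c.normal.y + c.normal.z*c.normal.z);
	c.normal = c.normal*(1.0f/len);

	return c;
}
```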

Then you get to the traversal. Bunnell recommended skipping a sub-tree of surfels if the distance to the query point is greater than 4x the radius of the cluster surfel. That gives results practically identical to the brute force algorithm, and I think you can be more aggressive without losing much quality at all.
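That traversal looks something like the sketch below. The Node layout and the contribution function are illustrative (the contribution here is a heavily simplified form factor), and the `evals` counter is only there to show the cutoff working:

```cpp
#include <cmath>
#include <vector>

const float kPi = 3.14159265f;

struct Vec3 { float x, y, z; };

float Length(Vec3 a, Vec3 b)
{
	float dx=a.x-b.x, dy=a.y-b.y, dz=a.z-b.z;
	return sqrtf(dx*dx + dy*dy + dz*dz);
}

struct Surfel { Vec3 pos; float area; };

struct Node
{
	Surfel surfel;              // representative cluster surfel
	std::vector<Node> children; // empty for leaves
};

// toy contribution; a real implementation would use the full
// disc-to-point form factor
float Contribution(const Surfel& s, Vec3 query)
{
	float d = Length(s.pos, query);
	return s.area / (kPi*d*d + s.area);
}

// use the cluster's representative surfel when the query point is
// further than 4x the cluster radius, otherwise recurse
float Gather(const Node& node, Vec3 query, int* evals)
{
	float radius = sqrtf(node.surfel.area / kPi);
	float d = Length(node.surfel.pos, query);

	if (node.children.empty() || d > 4.0f*radius)
	{
		++(*evals);
		return Contribution(node.surfel, query);
	}

	float sum = 0.0f;
	for (const Node& child : node.children)
		sum += Gather(child, query, evals);
	return sum;
}
```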

I get between a 4-6x speed up using the hierarchy over brute force. Not quite realtime yet but I haven't optimised the tree structure at all, I'm also not running it on the GPU :)

It seems like every post I make here has to reference Christer Ericson somehow but I really recommend his book for ideas about optimising bounding volumes. Loads of good stuff in there that I have yet to implement.


- Real-time Soft Shadows in Dynamic Scenes using Spherical Harmonic Exponentiation
- Lightcuts: a scalable approach to illumination


Indirect illumination

Posted: January 11th, 2009 | No Comments »

It's been a while since I checked in on the state of the art in global illumination but there is some seriously cool research happening at the moment.

I liked the basic idea DreamWorks used on Shrek 2 (An Approximate Global Illumination System for Computer Generated Films), which stores direct illumination in light maps and then runs a final gather pass on that to calculate one bounce of indirect lighting. It might be possible to adapt this to real-time if you could pre-compute and store the sample coordinates for each point.

However the current state of the art seems to be the point-based approach first presented by Michael Bunnell in his GPU Gems 2 article Dynamic Ambient Occlusion and Indirect Lighting. He approximates the mesh as a set of oriented discs and computes the radiance transfer dynamically on the GPU.

Turns out Pixar took this idea and now use it on all their movies: Pirates of the Caribbean, Wall-E, etc. The technique is described in Point-Based Approximate Color Bleeding. Bunnell's original algorithm was O(N^2) in the number of surfels, but he used a clustered hierarchy to get that down to O(N log N); Pixar use an octree which stores a more accurate spherical harmonic approximation at each node.

What's really interesting is how far Bunnell has pushed this idea. If you read through Fantasy Lab's recent patent (August 2008), there are some really nice ideas in there that I haven't seen published anywhere.

Here's a summary of what's new since the GPU Gems 2 article:

- Fixed the slightly cumbersome multiple shadow passes by summing 'negative illumination' from back-facing surfels
- Takes advantage of temporal coherence by simply iterating the illumination map each frame
- Added some directional information by subdividing the hemisphere into quadrants
- Threw in some nice subdivision surface stuff at the end

Anyway, I knocked up a prototype of the indirect illumination technique and it seems to work quite well. The patent leaves out loads of information (and spends two pages describing what a GPU is), but it's not too difficult to work out the details (note the form factor calculation is particularly simplified).

Here are the results from a *very* low resolution mesh. In reality you would prime your surfels with direct illumination calculated in the traditional way with shadow mapping / shaders and then let the sim bounce it around, but in this case I've done the direct lighting using his method as well.



Disclaimer: some artifacts are visible here due to the way illumination is baked back to the mesh, and there aren't really enough surfels to capture all the fine detail, but it's quite promising.

This seems like a perfect job for CUDA or Larrabee as the whole algorithm can be run in parallel. You can do it purely through DirectX or OpenGL, but it's kind of nasty.


Branch free Clamp()

Posted: January 9th, 2009 | 3 Comments »

One of my workmates had some code with a lot of floating point clamps in it the other day, so I wrote this little branch-free version using the PS3's floating point select intrinsic:

float Clamp(float x, float lower, float upper)
{
	float t = __fsels(x-lower, x, lower);
	return __fsels(t-upper, upper, t);
}

__fsels basically does this:

float __fsels(float x, float a, float b)
{
	return (x >= 0.0f) ? a : b;
}
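For comparison, on x86 you can get the same branch-free behaviour with the SSE scalar min/max intrinsics, which compile to maxss/minss with no compares or branches. Just a sketch, not the PS3 version:

```cpp
#include <xmmintrin.h>

// branch-free clamp via SSE scalar min/max
float ClampSSE(float x, float lower, float upper)
{
	__m128 v = _mm_set_ss(x);
	v = _mm_max_ss(v, _mm_set_ss(lower));
	v = _mm_min_ss(v, _mm_set_ss(upper));
	return _mm_cvtss_f32(v);
}
```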

I measured it to be 8% faster than a standard implementation; not a whole lot, but quite fun to write. The SPUs have quite general selection functionality which is even more useful; some stuff about it here:


(Not sure about this free WordPress code formatting, I may have to move it to my own host soon)


Two threads, one cache line

Posted: January 9th, 2009 | 3 Comments »

An interesting thread about multithreaded programming going around the GDA mailing list at the moment reminded me of a little test app I wrote a while back to measure the cost of two threads accessing memory on the same cache line.

The program basically creates two threads which increment a variable a large number of times measuring the time it takes to complete with different distances between the write addresses. Something like this:

#include <windows.h>
#include <iostream>

using namespace std;

__declspec (align (512)) volatile int B[128];

// QPC wrapper
double GetSeconds()
{
	LARGE_INTEGER freq, counter;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&counter);
	return double(counter.QuadPart) / double(freq.QuadPart);
}

DWORD WINAPI ThreadProc(void* param)
{
	// read/write the address a whole lot
	for (int i=0; i < 10000000; ++i)
		(*(volatile int*)param)++;

	return 0;
}

int main()
{
	volatile int* d1 = &B[0];
	volatile int* d2 = &B[127];

	while (d2 >= d1)
	{
		HANDLE threads[2];

		double start = GetSeconds();

		threads[0] = CreateThread(NULL, 0, ThreadProc, (void*)d1, 0, NULL);
		threads[1] = CreateThread(NULL, 0, ThreadProc, (void*)d2, 0, NULL);

		WaitForMultipleObjects(2, threads, TRUE, INFINITE);

		double end = GetSeconds();

		CloseHandle(threads[0]);
		CloseHandle(threads[1]);

		cout << (d2-d1) * sizeof(int) << "bytes apart: " << (end-start)*1000.0f << "ms" << endl;

		// move the second write address closer each iteration
		--d2;
	}

	int i;
	cin >> i;
	return 0;
}

On my old P4 with a 64 byte cache line these are the results:

128bytes apart: 17.4153ms
124bytes apart: 17.878ms
120bytes apart: 17.4028ms
116bytes apart: 17.3625ms
112bytes apart: 17.959ms
108bytes apart: 18.0241ms
104bytes apart: 17.2938ms
100bytes apart: 17.6643ms
96bytes apart: 17.5377ms
92bytes apart: 19.3156ms
88bytes apart: 17.2013ms
84bytes apart: 17.9361ms
80bytes apart: 17.1321ms
76bytes apart: 17.5997ms
72bytes apart: 17.4634ms
68bytes apart: 17.6562ms
64bytes apart: 17.4704ms
60bytes apart: 17.9947ms
56bytes apart: 149.759ms ***
52bytes apart: 151.64ms
48bytes apart: 150.132ms
44bytes apart: 125.318ms
40bytes apart: 160.33ms
36bytes apart: 147.889ms
32bytes apart: 152.42ms
28bytes apart: 157.003ms
24bytes apart: 149.552ms
20bytes apart: 142.372ms
16bytes apart: 136.908ms
12bytes apart: 145.691ms
8bytes apart: 146.768ms
4bytes apart: 128.408ms
0bytes apart: 125.655ms

You can see that when it gets to 56 bytes there is a large penalty (9x slower!) as the writes land on the same cache line, bringing the cache-coherency protocol into play and forcing the processor to reload the line from main memory.

Actually it turns out this is called "false sharing", and it's quite well known; the common solution is to pad your shared data so that each item sits at least one cache line apart.
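The padding fix can be sketched like this, assuming a 64 byte line (alignas is the portable spelling; with the compiler above you'd write __declspec(align(64)) instead):

```cpp
#include <cstddef>

const size_t kCacheLineSize = 64;

// pad each thread's counter out to its own cache line so the two
// threads never invalidate each other's line
struct alignas(kCacheLineSize) PaddedCounter
{
	volatile int value;
	char padding[kCacheLineSize - sizeof(int)];
};

PaddedCounter counters[2]; // one per thread, guaranteed on separate lines
```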