<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Miles Macklin&#039;s blog</title>
	<atom:link href="http://blog.mmacklin.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mmacklin.com</link>
	<description>Computer graphics and simulation</description>
	<lastBuildDate>Sun, 12 May 2013 03:08:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5</generator>
		<item>
		<title>Position Based Fluids</title>
		<link>http://blog.mmacklin.com/2013/04/24/position-based-fluids/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=position-based-fluids</link>
		<comments>http://blog.mmacklin.com/2013/04/24/position-based-fluids/#comments</comments>
		<pubDate>Wed, 24 Apr 2013 06:33:12 +0000</pubDate>
		<dc:creator>mmack</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[Fluid Simulation]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Physics]]></category>
		<category><![CDATA[Position Based Fluids]]></category>
		<category><![CDATA[SPH]]></category>

		<guid isPermaLink="false">http://blog.mmacklin.com/?p=1516</guid>
		<description><![CDATA[Position Based Fluids (PBF) is the title of our paper that has been accepted for presentation at SIGGRAPH 2013. I've set up a project page where you can download the paper and all the related content here: http://blog.mmacklin.com/publications I have continued working on the technique since the submission, mainly improving the rendering, and adding features [...]]]></description>
				<content:encoded><![CDATA[<p>Position Based Fluids (PBF) is the title of our paper that has been accepted for presentation at SIGGRAPH 2013. I've set up a project page where you can download the paper and all the related content here:</p>
<p><a href="http://blog.mmacklin.com/publications">http://blog.mmacklin.com/publications</a></p>
<p>I have continued working on the technique since the submission, mainly improving the rendering and adding features like spray and foam (based on the excellent paper from the University of Freiburg: <a href="http://cg.informatik.uni-freiburg.de/publications/2012_CGI_sprayFoamBubbles.pdf">Unified Spray, Foam and Bubbles for Particle-Based Fluids</a>). You can see the results in action below, but I recommend checking out the project page and downloading the videos; they look great at full resolution and 60 Hz.</p>
<p><iframe width="590" height="332" src="http://www.youtube.com/embed/F5KuP6qEuew?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><iframe width="590" height="332" src="http://www.youtube.com/embed/mgYztcjOvRQ?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2013/04/24/position-based-fluids/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>2D FEM</title>
		<link>http://blog.mmacklin.com/2012/06/27/2d-fem/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=2d-fem</link>
		<comments>http://blog.mmacklin.com/2012/06/27/2d-fem/#comments</comments>
		<pubDate>Wed, 27 Jun 2012 10:40:40 +0000</pubDate>
		<dc:creator>mmack</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[FEM]]></category>
		<category><![CDATA[Meshing]]></category>
		<category><![CDATA[Physics]]></category>

		<guid isPermaLink="false">http://blog.mmacklin.com/?p=1500</guid>
		<description><![CDATA[This post is about generating meshes for finite element simulations. I'll be covering other aspects of FEM based simulation in a later post, until then I recommend checking out Matthias Müller's very good introduction in the SIGGRAPH 2008 Real Time Physics course [1]. After spending the last few weeks reading, implementing and debugging meshing algorithms [...]]]></description>
				<content:encoded><![CDATA[<p>This post is about generating meshes for finite element simulations. I'll be covering other aspects of FEM based simulation in a later post; until then I recommend checking out Matthias Müller's very good introduction in the SIGGRAPH 2008 Real Time Physics course <a href="#ref1">[1]</a>.</p>
<p>After spending the last few weeks reading, implementing and debugging meshing algorithms, I have a new-found respect for people in this field. It is amazing how many ways meshes can "go wrong"; even the experts have it tough:</p>
<blockquote><p>“I hate meshes. I cannot believe how hard this is. Geometry is hard.”<br />
— David Baraff, Senior Research Scientist, Pixar Animation Studios</p></blockquote>
<p>Meshing algorithms are hard, but unless you are satisfied simulating cantilever beams and simple geometric shapes you will eventually need to deal with them.</p>
<p>My goal was to find an algorithm that would take an image as input, and produce as output a <i>good quality</i> triangle mesh that conformed to the boundary of any non-zero regions in the image. </p>
<p>My first attempt was to perform a coarse-grained edge detect and generate a <a href="http://en.wikipedia.org/wiki/Delaunay_triangulation">Delaunay triangulation</a> of the resulting point set. Here are the input image and the result of a low-res edge detect:</p>
<div class="aligncenter" style="width: 440px;">
<img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/armadillo.jpg" alt="" title="armadillo" width="220"/><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_figure1.png" width="220" alt="" title="Coarse edge detect" />
</div>
<p>This point set can be converted to a mesh by any Delaunay triangulation method; the <a href="http://en.wikipedia.org/wiki/Bowyer%E2%80%93Watson_algorithm">Bowyer-Watson algorithm</a> is probably the simplest. It works by inserting one point at a time, removing any triangles whose circumcircle contains the new point and re-tessellating the resulting cavity by connecting its boundary edges to the new point. A nice feature is that the algorithm has a direct analogue for tetrahedral meshes: triangles become tetrahedra, edges become faces and circumcircles become circumspheres.</p>
<p>Here's an illustration of how Bowyer-Watson inserts the point in red into the mesh:</p>
<p><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_del_1.png" alt="" title="Delaunay_1" width="180" class="alignnone" /><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_del_1a.png" alt="" title="Delaunay_1a" width="180" class="alignnone" /><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_del_2.png" alt="" title="Delaunay_2" width="180" class="alignnone" /><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_del_3.png" alt="" title="Delaunay_3" width="180" class="alignnone" /></p>
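<p>To make the insertion step concrete, here is a minimal Python sketch of Bowyer-Watson. It is an illustration rather than the code behind the figures in this post: the names <code>circumcircle</code> and <code>bowyer_watson</code>, the size of the enclosing "super-triangle" and the plain floating point in-circle test are all assumptions. It expects distinct points in general position; a robust implementation would need exact predicates and a faster search for the affected triangles.</p>
<pre><code class="language-python">def circumcircle(a, b, c):
    """Return (centre_x, centre_y, radius_squared) of triangle abc."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def bowyer_watson(points):
    """Triangulate 2D points; returns triples of indices into points."""
    pts = list(points)
    n = len(pts)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    span = max(max(xs) - min(xs), max(ys) - min(ys), 1.0)
    mx, my = 0.5 * (min(xs) + max(xs)), 0.5 * (min(ys) + max(ys))
    # Assumed super-triangle: big enough to enclose every input point.
    pts += [(mx - 20 * span, my - span), (mx + 20 * span, my - span),
            (mx, my + 20 * span)]
    triangles = [(n, n + 1, n + 2)]

    for i in range(n):
        px, py = pts[i]
        # 1. Find the triangles whose circumcircle contains the new point.
        bad = []
        for tri in triangles:
            ux, uy, r2 = circumcircle(pts[tri[0]], pts[tri[1]], pts[tri[2]])
            if r2 > (px - ux) ** 2 + (py - uy) ** 2:
                bad.append(tri)
        # 2. Edges used by exactly one bad triangle bound the cavity.
        edge_count = {}
        for a, b, c in bad:
            for edge in ((a, b), (b, c), (c, a)):
                key = (min(edge), max(edge))
                edge_count[key] = edge_count.get(key, 0) + 1
        # 3. Remove the bad triangles and re-tessellate around the point.
        triangles = [tri for tri in triangles if tri not in bad]
        for (a, b), count in edge_count.items():
            if count == 1:
                triangles.append((a, b, i))

    # Drop everything still attached to the super-triangle.
    return [tri for tri in triangles if n > max(tri)]
</code></pre>
<p>Calling <code>bowyer_watson</code> with a list of <code>(x, y)</code> tuples returns index triples. The triangle winding is not made consistent here, so orient the triangles afterwards if your simulation cares about it.</p>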
<p>And here is the Delaunay triangulation of the Armadillo point set:</p>
<p><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_figure2.png" alt="" title="Delaunay Triangulation" class="aligncenter" /></p>
<p>As you can see, a Delaunay triangulation fills the convex hull of the input points, but we want a mesh that conforms to the shape boundary. One way to fix this is to sample the image at each triangle's centroid; if the sample lies outside the shape, simply throw away the triangle. This produces:</p>
<p><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_figure3.png" alt="" title="Trimmed Delaunay" class="aligncenter" /></p>
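<p>The centroid test is only a few lines of code. The sketch below assumes the image is a list of rows indexed as <code>image[y][x]</code>, that vertex positions are in pixel coordinates, and that any non-zero pixel counts as inside the shape; the function name <code>trim_triangles</code> is made up for this example.</p>
<pre><code class="language-python">import math


def trim_triangles(points, triangles, image):
    """Keep the triangles whose centroid falls on a non-zero pixel.

    points: list of (x, y) in pixel coordinates, triangles: index triples,
    image: list of rows, indexed as image[y][x] (an assumed layout).
    """
    height, width = len(image), len(image[0])
    kept = []
    for a, b, c in triangles:
        cx = (points[a][0] + points[b][0] + points[c][0]) / 3.0
        cy = (points[a][1] + points[b][1] + points[c][1]) / 3.0
        px, py = math.floor(cx), math.floor(cy)
        # Centroids that land outside the image count as outside the shape.
        inside = px in range(width) and py in range(height)
        if inside and image[py][px] != 0:
            kept.append((a, b, c))
    return kept
</code></pre>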
<p>Much better! Now we have a reasonably good approximation of the input shape. Unfortunately, FEM simulations don't work well with long thin "sliver" triangles. This is due to interpolation error and because a small movement in one of the triangle's vertices leads to large forces, which leads to inaccuracy and small time steps <a href="#ref2">[2]</a>.</p>
<p>Before we look at ways to improve triangle quality it's worth talking about how to measure it. One measure that works well in 2D is the ratio of the triangle's circumradius to its shortest edge. A smaller ratio indicates a higher quality triangle, which seems intuitively reasonable, since long skinny triangles have a large circumradius but one very short edge:</p>
<div class="aligncenter" style="width: 500px;">
<img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_good_quality.png" alt="" title="Good quality triangle" width="250" /><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/fem_poor_quality.png" alt="" title="fem_poor_quality" width="250" /></div>
<p>The triangle on the left, which is equilateral, has a ratio of ~0.58 (exactly 1/√3) and is the best possible triangle by this measure. The triangle on the right has a ratio of ~8.7. Note that the circumcenter of a sliver triangle tends to fall outside the triangle itself.</p>
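<p>As a small worked example, the ratio can be computed from the identity R = abc / (4 · area). The function name <code>quality_ratio</code> and the two test triangles are just for illustration; an equilateral triangle comes out at 1/√3 and the thin triangle at a much larger value.</p>
<pre><code class="language-python">import math


def quality_ratio(a, b, c):
    """Circumradius divided by shortest edge length; smaller is better."""
    ab = math.hypot(b[0] - a[0], b[1] - a[1])
    bc = math.hypot(c[0] - b[0], c[1] - b[1])
    ca = math.hypot(a[0] - c[0], a[1] - c[1])
    area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                     - (b[1] - a[1]) * (c[0] - a[0]))
    # R = abc / (4 * area); a degenerate (zero area) triangle raises
    # ZeroDivisionError, which this sketch does not try to handle.
    circumradius = ab * bc * ca / (4.0 * area)
    return circumradius / min(ab, bc, ca)


equilateral = quality_ratio((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0))
sliver = quality_ratio((0.0, 0.0), (1.0, 0.0), (0.5, 0.02))
print(equilateral, sliver)
</code></pre>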
<h3>Delaunay refinement</h3>
<p>Methods such as <a href="http://en.wikipedia.org/wiki/Chew's_second_algorithm">Chew's algorithm</a> and <a href="http://en.wikipedia.org/wiki/Ruppert%27s_algorithm">Ruppert's algorithm</a> are probably the best known refinement algorithms. They attempt to improve mesh quality while maintaining the <a href="http://en.wikipedia.org/wiki/Delaunay_triangulation#Properties">Delaunay property</a> (no vertex lies inside any triangle's circumcircle). This is typically done by inserting the circumcenter of low-quality triangles and subdividing edges.</p>
<p>Jonathan Shewchuk's <a href="http://www.cs.berkeley.edu/~jrs/papers/2dj.pdf">"ultimate guide"</a> has everything you need to know and there is <a href="http://www.cs.cmu.edu/~quake/triangle.html">Triangle</a>, an open source tool to generate high quality triangulations.</p>
<p>Unfortunately these algorithms require an accurate polygonal boundary as input, and the output is sensitive to the input segment lengths. They are also famously difficult to implement robustly and efficiently. I spent most of my time implementing Ruppert's algorithm, only to find that the methods described next produced better results with much simpler code.</p>
<h3>Variational Methods</h3>
<p>Variational (energy based) algorithms improve the mesh through a series of optimization steps that attempt to minimize a global energy function. I adapted the approach in Variational Tetrahedral Meshing <a href="#ref3">[3]</a> to 2D and found it produced great results. This is the method I settled on, so I'll go into some detail.</p>
<p>The algorithm proceeds as follows:</p>
<ol style="font-family: courier; font-size: 12px;">
<li>Generate a set of uniformly distributed points interior to the shape P</li>
<li>Generate a set of points on the boundary of the shape B</li>
<li>Generate a Delaunay triangulation of P</li>
<li>Optimize boundary points by moving them to the average of their neighbours in B</li>
<li>Optimize interior points by moving them to the centroid of their Voronoi cell (area weighted average of connected triangle circumcenters)</li>
<li>Unless stopping criteria met, go to 3.</li>
<li>Remove boundary sliver triangles</li>
</ol>
<p>The core idea is that of repeated triangulation (3) and relaxation (4,5). It's a somewhat similar process to Lloyd's clustering, coincidentally the same algorithm I had used to generate surfel hierarchies for global illumination sims in the past.</p>
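<p>The two relaxation steps (4) and (5) can be sketched as below. This is a reading of the steps as listed, not the code used for the results in this post: the function names are invented, the boundary set B is assumed to be stored as an ordered closed loop, and nothing here snaps a relaxed boundary point back onto the shape outline.</p>
<pre><code class="language-python">def circumcenter(a, b, c):
    """Circumcenter of the 2D triangle abc."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy


def relax_interior_point(index, points, triangles):
    """Step 5: area weighted average of the circumcenters of the
    triangles that share the vertex with the given index."""
    total_area = 0.0
    sum_x = sum_y = 0.0
    for tri in triangles:
        if index not in tri:
            continue
        a, b, c = (points[i] for i in tri)
        area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                         - (b[1] - a[1]) * (c[0] - a[0]))
        ux, uy = circumcenter(a, b, c)
        total_area += area
        sum_x += area * ux
        sum_y += area * uy
    return (sum_x / total_area, sum_y / total_area)


def relax_boundary_point(index, boundary):
    """Step 4: average of the two neighbours in an ordered, closed loop."""
    prev_pt = boundary[index - 1]
    next_pt = boundary[(index + 1) % len(boundary)]
    return (0.5 * (prev_pt[0] + next_pt[0]), 0.5 * (prev_pt[1] + next_pt[1]))
</code></pre>
<p>One iteration would compute new positions for every point from the current triangulation, write them back, and then re-triangulate as in step 3.</p>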
<p>Here's an animation of 7 iterations on the Armadillo, note the number of points stays the same throughout (another nice property):</p>
<p><img src="http://blog.mmacklin.com/wp-content/uploads/2012/06/figure_variational.gif" alt="" title="figure_variational" width="360" height="370" class="aligncenter size-full wp-image-1507" /></p>
<p>It's interesting to see how much the quality improves after the very first step. Although Alliez et al. <a href="#ref3">[3]</a> don't provide any guarantees on the resulting mesh quality I found the algorithm works very well on a variety of input images with a fixed number of iterations.</p>
<p>This is the algorithm I ended up using but I'll quickly cover a couple of alternatives for completeness.</p>
<h3>Structured Methods</h3>
<p>These algorithms typically start by tiling interior space using a BCC (body centered cubic) lattice which is simply two interleaved grids. They then generate a Delaunay triangulation and throw away elements lying completely outside the region of interest.</p>
<p>As usual, handling boundaries is where the real challenge lies. Molino et al. <a href="#ref4">[4]</a> use a force based simulation to push grid points towards the boundary. Isosurface Stuffing <a href="#ref5">[5]</a> refines the boundary by directly moving vertices to the zero-contour of a signed distance field, or inserts new vertices if moving the existing lattice nodes would generate a poor quality triangle.</p>
<p>Lattice based methods are typically very fast and don't suffer from the numerical robustness issues of algorithms that rely on triangulation. However, if you plan on fracturing the mesh along element boundaries then the regular structure of the lattice is exposed and looks quite unconvincing.</p>
<h3>Simplification Methods</h3>
<p>Another approach is to start with a very fine-grained mesh and progressively simplify it in the style of Progressive Meshes <a href="#ref6">[6]</a>. Barbara Cutler's <a href="http://people.csail.mit.edu/bmcutler/PROJECTS/PHD/index.html">thesis</a> and associated <a href="http://people.csail.mit.edu/bmcutler/PROJECTS/SGP04/index.html">paper</a> discuss the details and very helpfully provide <a href="http://people.csail.mit.edu/bmcutler/PROJECTS/SGP04/meshes/index.html">the resulting tetrahedral meshes</a>, but the implementation appears to be considerably more complex than variational methods and relies on quite a few heuristics to get good results.</p>
<h2>Simulation</h2>
<p>Now that the mesh is ready, it's time for the fun part (apologies if you really love meshing). This simple simulation uses co-rotational linear FEM with a semi-implicit time-stepping scheme:</p>
<p align="middle"><iframe src="http://player.vimeo.com/video/44652965" width="580" height="435" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p>(Armadillo and Bunny images courtesy of the <a href="http://graphics.stanford.edu/data/3Dscanrep/">Stanford Scanning Repository</a>)</p>
<p>Pre-built binaries for OSX/Win32 here: <a href="http://mmacklin.com/fem.zip">http://mmacklin.com/fem.zip</a></p>
<p>Source code is available on Github: <a href="https://github.com/mmacklin/sandbox/tree/master/projects/fem">https://github.com/mmacklin/sandbox/tree/master/projects/fem</a>.</p>
<h3>Refs:</h3>
<p><a name="ref1"></a>[1] Matthias Müller, Jos Stam, Doug James, and Nils Thürey. Real time physics: class notes. In ACM SIGGRAPH 2008 classes <a href="http://www.matthiasmueller.info/realtimephysics/index.html">http://www.matthiasmueller.info/realtimephysics/index.html</a></p>
<p><a name="ref2"></a>[2] Jonathan Richard Shewchuk. 2002. What Is a Good Linear Finite Element? Interpolation, Conditioning, Anisotropy, and Quality Measures, unpublished preprint. <a href="http://www.cs.berkeley.edu/~jrs/papers/elemj.pdf">http://www.cs.berkeley.edu/~jrs/papers/elemj.pdf</a></p>
<p><a name="ref3"></a>[3] Pierre Alliez, David Cohen-Steiner, Mariette Yvinec, and Mathieu Desbrun. 2005. Variational tetrahedral meshing. <a href="ftp://ftp-sop.inria.fr/prisme/alliez/vtm.pdf">ftp://ftp&#8209;sop.inria.fr/prisme/alliez/vtm.pdf</a></p>
<p><a name="ref4"></a>[4] Molino, Bridson, et al. 2003. A Crystalline, Red Green Strategy for Meshing Highly Deformable Objects with Tetrahedra. <a href="http://www.math.ucla.edu/~jteran/papers/MBTF03.pdf">http://www.math.ucla.edu/~jteran/papers/MBTF03.pdf</a></p>
<p><a name="ref5"></a>[5] François Labelle and Jonathan Richard Shewchuk. 2007. Isosurface stuffing: fast tetrahedral meshes with good dihedral angles. In ACM SIGGRAPH 2007 papers <a href="http://www.cs.berkeley.edu/~jrs/papers/stuffing.pdf">http://www.cs.berkeley.edu/~jrs/papers/stuffing.pdf</a></p>
<p><a name="ref6"></a>[6] Hugues Hoppe. 1996. Progressive meshes. <a href="http://research.microsoft.com/en-us/um/people/hoppe/pm.pdf">http://research.microsoft.com/en-us/um/people/hoppe/pm.pdf</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2012/06/27/2d-fem/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Implicit Springs</title>
		<link>http://blog.mmacklin.com/2012/05/04/implicitsprings/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=implicitsprings</link>
		<comments>http://blog.mmacklin.com/2012/05/04/implicitsprings/#comments</comments>
		<pubDate>Fri, 04 May 2012 11:43:39 +0000</pubDate>
		<dc:creator>mmack</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Physics]]></category>

		<guid isPermaLink="false">http://www.mmacklin.dreamhosters.com/codeblog/?p=1485</guid>
		<description><![CDATA[This is a quick post to document some work I did while writing a mass spring simulation using an implicit integrator. Implicit, or backward Euler integration is well described in David Baraff's Physically Based Modelling SIGGRAPH course and this post assumes some familiarity with it. Springs are a workhorse in physical simulation, once you have [...]]]></description>
				<content:encoded><![CDATA[<p>This is a quick post to document some work I did while writing a mass spring simulation using an implicit integrator. Implicit, or backward Euler, integration is well described in David Baraff's <a href="http://www.pixar.com/companyinfo/research/pbm2001/">Physically Based Modelling SIGGRAPH course</a> and this post assumes some familiarity with it.</p>
<p>Springs are a workhorse in physical simulation; once you have unconditionally stable springs you can use them to model just about anything, from rigid bodies to <a href="http://www.rhythm.com/~tae/wichita.pdf">drool and snot</a>. For example, Industrial Light &#038; Magic used a tetrahedral mesh with edge and altitude springs to model the damage to ships in Avatar (see <a target="_blank" href="http://physbam.stanford.edu/~mlentine/images/deformingrigids.pdf"> Avatar: Bending Rigid Bodies</a>).</p>
<p>If you sit down and try to implement an implicit integrator, one of the first things you need is the Jacobian of the particle forces with respect to the particle positions and velocities. The rest of this post shows how to derive these Jacobians for a basic Hookean spring in a form ready to be plugged into a linear system solver (I use a hand-rolled conjugate gradient solver; see Jonathan Shewchuk's <a target="_blank" href="http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf">painless introduction</a> for details, it is all of about 20 lines of code to implement).<p style='text-align:center;'><span class='MathJax_Preview'>\[\renewcommand{\v}[1]{\mathbf{#1}} \newcommand{\uv}[1]{\mathbf{\hat{#1}}} \newcommand\ddx[1]{\frac{\partial#1}{\partial \v{x} }} \newcommand\dd[2]{\frac{\partial#1}{\partial #2}}\]</span><script type='math/tex;  mode=display'>\renewcommand{\v}[1]{\mathbf{#1}} \newcommand{\uv}[1]{\mathbf{\hat{#1}}} \newcommand\ddx[1]{\frac{\partial#1}{\partial \v{x} }} \newcommand\dd[2]{\frac{\partial#1}{\partial #2}}</script></p></p>
<h3>Vector Calculus Basics</h3>
<p>In order to calculate the force Jacobians we first need to know how to calculate the derivatives of some basic geometric quantities with respect to a vector.</p>
<p>In general the derivative of a scalar valued function with respect to a vector is defined as the following row vector of partial derivatives:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[ \ddx{f} = \begin{bmatrix} \dd{f}{x_i} & \dd{f}{x_j} & \dd{f}{x_k} \end{bmatrix}\]</span><script type='math/tex;  mode=display'> \ddx{f} = \begin{bmatrix} \dd{f}{x_i} & \dd{f}{x_j} & \dd{f}{x_k} \end{bmatrix}</script></p></p>
<p>And for a vector valued function with respect to a vector:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\ddx{\v{f}} = \begin{bmatrix} \dd{f_i}{x_i} & \dd{f_i}{x_j} & \dd{f_i}{x_k} \\ \dd{f_j}{x_i} & \dd{f_j}{x_j} & \dd{f_j}{x_k} \\ \dd{f_k}{x_i} & \dd{f_k}{x_j} & \dd{f_k}{x_k} \end{bmatrix}\]</span><script type='math/tex;  mode=display'>\ddx{\v{f}} = \begin{bmatrix} \dd{f_i}{x_i} & \dd{f_i}{x_j} & \dd{f_i}{x_k} \\ \dd{f_j}{x_i} & \dd{f_j}{x_j} & \dd{f_j}{x_k} \\ \dd{f_k}{x_i} & \dd{f_k}{x_j} & \dd{f_k}{x_k} \end{bmatrix}</script></p></p>
<p>Applying the first definition to the dot product of two vectors we can calculate the derivative with respect to one of the vectors:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\ddx{\v{x}^T \cdot \v{y}} = \v{y}^T \]</span><script type='math/tex;  mode=display'>\ddx{\v{x}^T \cdot \v{y}} = \v{y}^T </script></p></p>
<p>Note that I'll explicitly keep track of whether vectors are row or column vectors as it will help keep things straight later on.</p>
<p>The derivative of a vector magnitude with respect to the vector gives the normalized vector transposed:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\ddx{|\v{x}|} = \left(\frac{\v{x}}{|\v{x}|}\right)^T = \uv{x}^T \]</span><script type='math/tex;  mode=display'>\ddx{|\v{x}|} = \left(\frac{\v{x}}{|\v{x}|}\right)^T = \uv{x}^T </script></p></p>
<p>The derivative of a normalized vector <span class='MathJax_Preview'>\(\v{\hat{x}} = \frac{\v{x}}{|\v{x}|} \)</span><script type='math/tex'>\v{\hat{x}} = \frac{\v{x}}{|\v{x}|} </script> can be obtained using the quotient rule:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\ddx{\uv{x}} = \frac{\v{I}|\v{x}| - \v{x}\cdot\uv{x}^T}{|\v{x}|^2}\]</span><script type='math/tex;  mode=display'>\ddx{\uv{x}} = \frac{\v{I}|\v{x}| - \v{x}\cdot\uv{x}^T}{|\v{x}|^2}</script></p></p>
<p>Where <span class='MathJax_Preview'>\(\v{I}\)</span><script type='math/tex'>\v{I}</script> is the <span class='MathJax_Preview'>\(n\)</span><script type='math/tex'>n</script> x <span class='MathJax_Preview'>\(n\)</span><script type='math/tex'>n</script> identity matrix and <span class='MathJax_Preview'>\(n\)</span><script type='math/tex'>n</script> is the dimension of <span class='MathJax_Preview'>\(\v{x}\)</span><script type='math/tex'>\v{x}</script>. The product of a column vector and a row vector <span class='MathJax_Preview'>\(\uv{x}\cdot\uv{x}^T\)</span><script type='math/tex'>\uv{x}\cdot\uv{x}^T</script> is the outer product, which is an <span class='MathJax_Preview'>\(n\)</span><script type='math/tex'>n</script> x <span class='MathJax_Preview'>\(n\)</span><script type='math/tex'>n</script> matrix that can be constructed using the standard definition of matrix multiplication.</p>
<p>Dividing through by <span class='MathJax_Preview'>\(|\v{x}|\)</span><script type='math/tex'>|\v{x}|</script> we have:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\ddx{\uv{x}} = \frac{\v{I} - \uv{x}\cdot\uv{x}^T}{|\v{x}|}\]</span><script type='math/tex;  mode=display'>\ddx{\uv{x}} = \frac{\v{I} - \uv{x}\cdot\uv{x}^T}{|\v{x}|}</script></p></p>
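<p>A quick way to gain confidence in a derivative like this is to compare it against central finite differences. The sketch below does that for the normalized vector; the helper names and the sample vector are arbitrary choices for illustration.</p>
<pre><code class="language-python">import math


def normalize(x):
    length = math.sqrt(sum(c * c for c in x))
    return [c / length for c in x]


def normalized_jacobian(x):
    """Analytic derivative of x / |x|: (I - x_hat x_hat^T) / |x|."""
    length = math.sqrt(sum(c * c for c in x))
    x_hat = [c / length for c in x]
    n = len(x)
    return [[((1.0 if r == c else 0.0) - x_hat[r] * x_hat[c]) / length
             for c in range(n)] for r in range(n)]


def finite_difference_jacobian(f, x, h=1e-6):
    """Central differences: column c holds d f / d x_c."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for c in range(n):
        plus = list(x)
        minus = list(x)
        plus[c] += h
        minus[c] -= h
        f_plus, f_minus = f(plus), f(minus)
        for r in range(n):
            jac[r][c] = (f_plus[r] - f_minus[r]) / (2.0 * h)
    return jac


x = [0.3, -1.2, 2.0]
analytic = normalized_jacobian(x)
numeric = finite_difference_jacobian(normalize, x)
max_error = max(abs(analytic[r][c] - numeric[r][c])
                for r in range(3) for c in range(3))
print(max_error)
</code></pre>
<p>The printed error should be tiny compared to the matrix entries. The same check is a handy debugging tool for the force Jacobians derived next.</p>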
<p><br/></p>
<h3>Jacobian of Stretch Force</h3>
<p>Now we are ready to compute the Jacobian of the spring forces. Recall the equation for the elastic force on a particle <span class='MathJax_Preview'>\(i\)</span><script type='math/tex'>i</script> due to an undamped Hookean spring:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\v{F_s} = -k_s(|\v{x}_{ij}| - r)\uv{x}_{ij}\]</span><script type='math/tex;  mode=display'>\v{F_s} = -k_s(|\v{x}_{ij}| - r)\uv{x}_{ij}</script></p></p>
<p>Where <span class='MathJax_Preview'>\(\v{x}_{ij} = \v{x}_i - \v{x}_j\)</span><script type='math/tex'>\v{x}_{ij} = \v{x}_i - \v{x}_j</script> is the vector between the two connected particle positions, <span class='MathJax_Preview'>\(r\)</span><script type='math/tex'>r</script> is the rest length and <span class='MathJax_Preview'>\(k_s\)</span><script type='math/tex'>k_s</script> is the stiffness coefficient.</p>
<p>The Jacobian of this force with respect to particle <span class='MathJax_Preview'>\(i\)</span><script type='math/tex'>i</script>'s position is obtained by using the product rule for the two <span class='MathJax_Preview'>\(\v{x}_i\)</span><script type='math/tex'>\v{x}_i</script> dependent terms in <span class='MathJax_Preview'>\(\v{F_s}\)</span><script type='math/tex'>\v{F_s}</script>:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\dd{\v{F_s}}{\v{x}_i} = -k_s\left[(|\v{x}_{ij}| - r)\dd{\uv{x}_{ij}}{\v{x}_i} + \uv{x}_{ij}\dd{(|\v{x}_{ij}| - r)}{\v{x}_i}\right]\]</span><script type='math/tex;  mode=display'>\dd{\v{F_s}}{\v{x}_i} = -k_s\left[(|\v{x}_{ij}| - r)\dd{\uv{x}_{ij}}{\v{x}_i} + \uv{x}_{ij}\dd{(|\v{x}_{ij}| - r)}{\v{x}_i}\right]</script></p></p>
<p>Using the previously derived formulas for the derivative of a vector magnitude and normalized vector we have:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\dd{\v{F_s}}{\v{x}_i} = -k_s\left[(|\v{x}_{ij}| - r)\left(\frac{\v{I} - \uv{x}_{ij}\cdot \uv{x}_{ij}^T}{|\v{x}_{ij}|}\right) + \uv{x}_{ij}\cdot\uv{x}_{ij}^T\right]\]</span><script type='math/tex;  mode=display'>\dd{\v{F_s}}{\v{x}_i} = -k_s\left[(|\v{x}_{ij}| - r)\left(\frac{\v{I} - \uv{x}_{ij}\cdot \uv{x}_{ij}^T}{|\v{x}_{ij}|}\right) + \uv{x}_{ij}\cdot\uv{x}_{ij}^T\right]</script></p></p>
<p>Moving the division by <span class='MathJax_Preview'>\(|\v{x}_{ij}|\)</span><script type='math/tex'>|\v{x}_{ij}|</script> into the first factor:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\dd{\v{F_s}}{\v{x}_i} = -k_s\left[\left(1 - \frac{r}{|\v{x}_{ij}|}\right)\left(\v{I} - \uv{x}_{ij}\cdot \uv{x}_{ij}^T\right) + \uv{x}_{ij}\cdot \uv{x}_{ij}^T\right]\]</span><script type='math/tex;  mode=display'>\dd{\v{F_s}}{\v{x}_i} = -k_s\left[\left(1 - \frac{r}{|\v{x}_{ij}|}\right)\left(\v{I} - \uv{x}_{ij}\cdot \uv{x}_{ij}^T\right) + \uv{x}_{ij}\cdot \uv{x}_{ij}^T\right]</script></p></p>
<p>Due to the symmetry in the definition of <span class='MathJax_Preview'>\(\v{x}_{ij}\)</span><script type='math/tex'>\v{x}_{ij}</script> we have the following force derivative with respect to the opposite particle:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\dd{\v{F_s}}{\v{x}_j}  = -\dd{\v{F_s}}{\v{x}_i}\]</span><script type='math/tex;  mode=display'>\dd{\v{F_s}}{\v{x}_j}  = -\dd{\v{F_s}}{\v{x}_i}</script></p></p>
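<p>In code, the stretch force and its Jacobian block might look like the following. It is a plain Python sketch using nested lists, with invented function names, and it assumes the two particles never coincide (otherwise the spring direction is undefined).</p>
<pre><code class="language-python">import math


def spring_force(xi, xj, ks, rest):
    """F_s = -ks (|x_ij| - r) x_hat_ij, the force on particle i."""
    d = [a - b for a, b in zip(xi, xj)]
    length = math.sqrt(sum(c * c for c in d))
    return [-ks * (length - rest) * c / length for c in d]


def spring_jacobian(xi, xj, ks, rest):
    """dF_s/dx_i = -ks [(1 - r/|x_ij|)(I - x_hat x_hat^T) + x_hat x_hat^T]."""
    d = [a - b for a, b in zip(xi, xj)]
    length = math.sqrt(sum(c * c for c in d))
    x_hat = [c / length for c in d]
    scale = 1.0 - rest / length
    n = len(d)
    jac = []
    for r in range(n):
        row = []
        for c in range(n):
            identity = 1.0 if r == c else 0.0
            outer = x_hat[r] * x_hat[c]
            row.append(-ks * (scale * (identity - outer) + outer))
        jac.append(row)
    return jac


# This spring sits exactly at its rest length (|x_ij| = 3), so the first
# term vanishes and the block reduces to -ks * x_hat x_hat^T.
J = spring_jacobian([0.0, 0.0, 0.0], [1.0, 2.0, 2.0], ks=100.0, rest=3.0)
</code></pre>
<p>Since the derivative with respect to the opposite particle is the negated block, each spring only needs this one 3 x 3 matrix to fill in all four of its blocks in the system matrix.</p>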
<p><br/></p>
<h3>Jacobian of Damping Force</h3>
<p>The equation for the damping force on a particle <span class='MathJax_Preview'>\(i\)</span><script type='math/tex'>i</script> due to a spring is:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\v{F_d} = -k_d\cdot\uv{x}_{ij}(\v{v}_{ij}\cdot \uv{x}_{ij})\]</span><script type='math/tex;  mode=display'>\v{F_d} = -k_d\cdot\uv{x}_{ij}(\v{v}_{ij}\cdot \uv{x}_{ij})</script></p></p>
<p>Where <span class='MathJax_Preview'>\(\v{v}_{ij} = \v{v}_i-\v{v}_j\)</span><script type='math/tex'>\v{v}_{ij} = \v{v}_i-\v{v}_j</script> is the relative velocity of the two particles. This is the preferred formulation because it damps only relative velocity along the spring axis.</p>
<p>Taking the derivative with respect to <span class='MathJax_Preview'>\(\v{v}_i\)</span><script type='math/tex'>\v{v}_i</script>:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\dd{\v{F_d}}{\v{v}_i} = -k_d\cdot\uv{x}_{ij}\cdot\uv{x}_{ij}^T\]</span><script type='math/tex;  mode=display'>\dd{\v{F_d}}{\v{v}_i} = -k_d\cdot\uv{x}_{ij}\cdot\uv{x}_{ij}^T</script></p></p>
<p>As with stretching, the derivative with respect to the opposite particle's velocity is simply negated:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\dd{\v{F_d}}{\v{v}_j} = -\dd{\v{F_d}}{\v{v}_i} \]</span><script type='math/tex;  mode=display'>\dd{\v{F_d}}{\v{v}_j} = -\dd{\v{F_d}}{\v{v}_i} </script></p></p>
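<p>The damping terms translate to code in the same way. Again this is a sketch with made-up function names, written for clarity rather than speed.</p>
<pre><code class="language-python">import math


def spring_direction(xi, xj):
    """Unit vector x_hat_ij pointing from particle j to particle i."""
    d = [a - b for a, b in zip(xi, xj)]
    length = math.sqrt(sum(c * c for c in d))
    return [c / length for c in d]


def damping_force(xi, xj, vi, vj, kd):
    """F_d = -kd x_hat_ij (v_ij . x_hat_ij), the force on particle i."""
    x_hat = spring_direction(xi, xj)
    v_rel = [a - b for a, b in zip(vi, vj)]
    along = sum(v * n for v, n in zip(v_rel, x_hat))
    return [-kd * along * n for n in x_hat]


def damping_jacobian(xi, xj, kd):
    """dF_d/dv_i = -kd x_hat x_hat^T; it depends on positions only."""
    x_hat = spring_direction(xi, xj)
    n = len(x_hat)
    return [[-kd * x_hat[r] * x_hat[c] for c in range(n)] for r in range(n)]
</code></pre>
<p>Because the damping force is linear in the relative velocity, multiplying this Jacobian by the relative velocity reproduces the force exactly, which makes for an easy sanity check.</p>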
<p>Note that implicit integration introduces its own artificial damping, so you might find it's not necessary to add as much additional damping as you would with an explicit integration scheme.</p>
<p>I'll be going into more detail about implicit methods and FEM in subsequent posts, stay tuned!</p>
<h3>Refs</h3>
<ul>
<li><a href="http://www.pixar.com/companyinfo/research/pbm2001">[Baraff Witkin] - Physically Based Modelling, SIGGRAPH course</a></li>
<li><a href="http://run.usc.edu/cs599-s10/cloth/baraff-witkin98.pdf">[Baraff Witkin] - Large Steps in Cloth Simulation</a></li>
<li><a href="http://njoubert.com/teaching/cs184_sp09/section/simulation.pdf">[N. Joubert] - An Introduction to Simulation</a></li>
<li><a href="http://davidpritchard.org/freecloth/docs/report.pdf">[D. Pritchard] - Implementing Baraff and Witkin's Cloth Simulation</a></li>
<li><a href="http://graphics.snu.ac.kr/~kjchoi/publication/cloth.pdf">[Choi] - Stable but Responsive Cloth</a></li>
<li>Numerical Recipes, 3rd edition 2007 - ch17.5</li>
</ul>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2012/05/04/implicitsprings/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>New Look</title>
		<link>http://blog.mmacklin.com/2012/05/03/new-look/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=new-look</link>
		<comments>http://blog.mmacklin.com/2012/05/03/new-look/#comments</comments>
		<pubDate>Thu, 03 May 2012 00:29:09 +0000</pubDate>
		<dc:creator>mmack</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mmacklin.com/?p=1491</guid>
		<description><![CDATA[Hi all, welcome to my new site. I've moved to my own hosting and have updated a few things - a new theme and a switch to MathJax for equation rendering. Apologies to RSS readers who will now see only a bunch of Latex code, but it is currently by far the easiest way to [...]]]></description>
				<content:encoded><![CDATA[<p>Hi all, welcome to my new site. I've moved to my own hosting and have updated a few things - a new theme and a switch to MathJax for equation rendering. Apologies to RSS readers who will now see only a bunch of LaTeX code, but it is currently by far the easiest way to put decent looking equations in a web page.</p>
<p>It's been a little over a year since I started working at NVIDIA and not coincidentally, since my last blog post. I'm really enjoying working more on the simulation side of things, it makes a nice change from pure rendering and the PhysX team is full of über-talented people who I'm learning a lot from.</p>
<p>I've got some simulation related posts (from a graphics programmer's perspective) planned over the next few months, I hope you enjoy them!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2012/05/03/new-look/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Blackbody Rendering</title>
		<link>http://blog.mmacklin.com/2010/12/29/blackbody-rendering/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=blackbody-rendering</link>
		<comments>http://blog.mmacklin.com/2010/12/29/blackbody-rendering/#comments</comments>
		<pubDate>Wed, 29 Dec 2010 08:21:08 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Fluid Simulation]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Volume Rendering]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=1265</guid>
		<description><![CDATA[In between bouts of festive over-eating I added support for blackbody emission to my fluid simulator and thought I'd describe what was involved. Briefly, a blackbody is an idealised substance that gives off light when heated. Planck's formula describes the intensity of light per-wavelength with units W·sr-1·m-2·m-1 for a given temperature in Kelvins. Radiance has [...]]]></description>
				<content:encoded><![CDATA[<p>In between bouts of festive over-eating I added support for blackbody emission to my fluid simulator and thought I'd describe what was involved.</p>
<p>Briefly, a <a href="http://en.wikipedia.org/wiki/Black_body">blackbody</a> is an idealised substance that gives off light when heated. <a href="http://en.wikipedia.org/wiki/Planck's_law">Planck's formula</a> gives the spectral radiance, that is the radiance per unit wavelength, with units <strong>W·sr<sup>-1</sup>·m<sup>-2</sup>·m<sup>-1</sup></strong> for a given temperature in kelvin.</p>
<p>Radiance has units <strong>W·sr<sup>-1</sup>·m<sup>-2</sup></strong> so we need a way to convert the wavelength dependent power distribution given by Planck's formula to a radiance value in RGB that we can use in our shader / ray-tracer.</p>
<p>The typical way to do this is as follows:</p>
<ol>
<li>Integrate Planck's formula against the CIE XYZ colour matching functions (available as part of <a href="https://github.com/mmp/pbrt-v2/blob/master/src/core/spectrum.cpp">PBRT</a> in 1nm increments)</li>
<li>Convert from <a href="http://en.wikipedia.org/wiki/CIE_1931_color_space">XYZ</a> to linear <a href="http://en.wikipedia.org/wiki/SRGB">sRGB</a> (do not perform gamma correction yet)</li>
<li>Render as normal</li>
<li>Perform tone-mapping / gamma correction</li>
</ol>
<p>We are throwing away spectral information by projecting into XYZ but a quick dimensional analysis shows that now we at least have the correct units (because the integration is with respect to <em>dλ</em> measured in meters the extra <strong>m<sup>-1</sup></strong> is removed).</p>
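<p>As a rough sketch of steps 1 and 2, the C++ below evaluates Planck's formula, sums it against colour matching tables, and applies the standard XYZ to linear sRGB (D65) matrix. The names <code>Colour</code>, <code>PlanckRadiance</code>, <code>BlackbodyToXYZ</code> and <code>XYZToLinearSRGB</code> are made up for illustration, and the colour matching tables are assumed to be supplied by the caller (for example the 1nm tables from PBRT); this is not the code from my simulator.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper type: holds either XYZ or RGB triples.
struct Colour { double x, y, z; };

// Planck's law: spectral radiance in W·sr^-1·m^-2·m^-1 for a wavelength
// in metres and a temperature in kelvin.
double PlanckRadiance(double lambda, double temperature)
{
    assert(lambda > 0.0 && temperature > 0.0);
    const double h = 6.62607015e-34; // Planck constant, J·s
    const double c = 2.99792458e8;   // speed of light, m/s
    const double k = 1.380649e-23;   // Boltzmann constant, J/K
    const double scale = 2.0 * h * c * c / std::pow(lambda, 5.0);
    return scale / std::expm1(h * c / (lambda * k * temperature));
}

// Step 1: Riemann sum of Planck's formula against caller-supplied colour
// matching tables sampled every `step` metres starting at `firstLambda`.
// Multiplying by `step` (in metres) is what removes the extra m^-1.
Colour BlackbodyToXYZ(double temperature,
                      const double* cmfX, const double* cmfY, const double* cmfZ,
                      std::size_t count, double firstLambda, double step)
{
    Colour xyz = { 0.0, 0.0, 0.0 };
    for (std::size_t i = 0; i < count; ++i)
    {
        const double lambda = firstLambda + step * static_cast<double>(i);
        const double power = PlanckRadiance(lambda, temperature) * step;
        xyz.x += power * cmfX[i];
        xyz.y += power * cmfY[i];
        xyz.z += power * cmfZ[i];
    }
    return xyz;
}

// Step 2: XYZ to linear sRGB (D65 white point), no gamma applied.
Colour XYZToLinearSRGB(const Colour& c)
{
    Colour rgb;
    rgb.x =  3.2406 * c.x - 1.5372 * c.y - 0.4986 * c.z;
    rgb.y = -0.9689 * c.x + 1.8758 * c.y + 0.0415 * c.z;
    rgb.z =  0.0557 * c.x - 0.2040 * c.y + 1.0570 * c.z;
    return rgb;
}
```

<p>Calling <code>XYZToLinearSRGB(BlackbodyToXYZ(T, ...))</code> for each temperature of interest gives values that can be baked into a lookup table for the shader.</p>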
<p>I was going to write more about the colour conversion process, but I didn't want to add to the confusion out there by accidentally misusing terminology. Instead, here are a couple of papers describing the conversion from Spectrum-&gt;RGB and RGB-&gt;Spectrum. Questions about these come up all the time on various forums, and I think these two papers do a good job of providing background and clarifying the process:</p>
<ul>
<li><a href="http://www.anyhere.com/gward/papers/egwr02/index.html">Picture Perfect RGB Rendering Using Spectral Prefiltering and Sharp Color Primaries</a></li>
<li><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.40.9608">An RGB to Spectrum Conversion for Reflectances</a></li>
</ul>
<p>And some more general colour space links:</p>
<ul>
<li><a href="http://graphics.stanford.edu/courses/cs148-10-summer/docs/2010--kerr--cie_xyz.pdf">The CIE XYZ and xyY Color Spaces by Douglas Kerr</a> (particularly good)</li>
<li><a href="http://renderwonk.com/publications/s2010-color-course/">SIGGRAPH 2010: Color Enhancement and Rendering in Film and Game Production</a></li>
<li><a href="ftp://rtfm.mit.edu/pub/usenet/news.answers/graphics/colorspace-faq">Color Space FAQ</a></li>
</ul>
<p>Here is a small sample of linear sRGB radiance values for different Blackbody temperatures:<br />
<code><br />
1000K: 1.81e-02, 1.56e-04, 1.56e-04<br />
2000K: 1.71e+03, 4.39e+02, 4.39e+02<br />
4000K: 5.23e+05, 3.42e+05, 3.42e+05<br />
8000K: 9.22e+06, 9.65e+06, 9.65e+06<br />
</code></p>
<p>It's clear from the range of values that we need some sort of exposure control and tone-mapping. I simply picked a temperature in the upper end of my range (around 3000K) and scaled intensities around it before applying Reinhard tone mapping and gamma correction. You can also perform more advanced mapping by taking into account the human visual system adaptation as described in <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.3511&amp;rep=rep1&amp;type=pdf">Physically Based Modeling and Animation of Fire</a>.</p>
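<p>A minimal sketch of that exposure, tone mapping and gamma chain for a single channel might look like the following. The function names are illustrative, the exposure is assumed to be the reciprocal of the radiance at the chosen reference temperature, and a plain power-law gamma (for example 2.2) stands in for whatever display curve you actually use.</p>

```cpp
#include <cassert>
#include <cmath>

// Basic Reinhard operator: maps [0, inf) into [0, 1).
double Reinhard(double x)
{
    return x / (1.0 + x);
}

// Scale by exposure, tone map, then gamma encode one channel.
// `exposure` is assumed to be 1 / (radiance at the reference temperature),
// so the reference radiance lands at Reinhard(1) = 0.5 before gamma.
double ToneMap(double radiance, double exposure, double gamma)
{
    assert(radiance >= 0.0 && exposure > 0.0 && gamma > 0.0);
    return std::pow(Reinhard(radiance * exposure), 1.0 / gamma);
}
```

<p>Applying <code>ToneMap</code> to each of the R, G and B values keeps even the 8000K entries in displayable range, at the cost of some desaturation in the brightest regions.</p>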
<p>Again, the hardest part was setting up the simulation parameters to get the look you want. Here's one I spent at least 4 days tweaking:</p>
<p align="middle"><iframe src="http://player.vimeo.com/video/18232573" width="500" height="375" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p>Simulation time is ~30s a frame (10 substeps) on a 128^3 grid tracking temperature, fuel, smoke and velocity. Most of that time is spent in the tri-cubic interpolation during advection; I've been meaning to try MacCormack advection to see if it's a net win.</p>
<p>There are some pretty obvious artifacts due to the tri-linear interpolation on the GPU, which would be helped by a higher resolution grid or by manually performing tri-cubic interpolation in the shader.</p>
<p>Inspired by Kevin Beason's work in progress videos I put together a collection of my own failed tests which I think are quite amusing:</p>
<p align="middle"><iframe src="http://player.vimeo.com/video/18232467" width="500" height="375" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/12/29/blackbody-rendering/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Adventures in Fluid Simulation</title>
		<link>http://blog.mmacklin.com/2010/11/01/adventures-in-fluid-simulation/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=adventures-in-fluid-simulation</link>
		<comments>http://blog.mmacklin.com/2010/11/01/adventures-in-fluid-simulation/#comments</comments>
		<pubDate>Tue, 02 Nov 2010 01:34:27 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Fluid Simulation]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Interactive Frame Rates]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=1152</guid>
		<description><![CDATA[I have to admit to being simultaneously fascinated and slightly intimidated by the fluid simulation crowd. I've been watching the videos on Ron Fedkiw's page for years and am still in awe of his results, which sometimes seem little short of magic. Recently I resolved to write my first fluid simulator and purchased a copy [...]]]></description>
				<content:encoded><![CDATA[<p>I have to admit to being simultaneously fascinated and slightly intimidated by the fluid simulation crowd. I've been watching the videos on <a href="http://physbam.stanford.edu/~fedkiw/">Ron Fedkiw's page</a> for years and am still in awe of his results, which sometimes seem little short of magic.</p>
<p>Recently I resolved to write my first fluid simulator and purchased a copy of <a href="http://www.cs.ubc.ca/~rbridson/fluidbook/">Fluid Simulation for Computer Graphics</a> by Robert Bridson.</p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2011/08/fluidbook.jpg" alt="" title="fluidbook" width="125" height="185" class="alignright size-full wp-image-1461" /> Like a lot of developers my first exposure to the subject was <a href="http://www.dgp.toronto.edu/people/stam/reality/Research/pdf/ns.pdf">Jos Stam's stable fluids paper</a> and his more accessible <a href="http://www.dgp.toronto.edu/people/stam/reality/Research/pdf/GDC03.pdf">Fluid Dynamics for Games</a> presentation. While the ideas are undeniably great, I never came away feeling like I truly understood the concepts or the mathematics behind them.</p>
<p>I'm happy to report that Bridson's book has helped change that. It includes a review of vector calculus in the appendix that is given in a wonderfully straightforward and concise manner. Bridson takes almost nothing for granted and gives lots of real-world examples, which help with some of the less intuitive concepts.</p>
<p>I'm planning a bigger post on the subject but I thought I'd write a quick update with my progress so far.</p>
<p>I started out with a 2D simulation similar to Stam's demos. Having a 2D implementation that you're confident in is really useful when you want to quickly try out different techniques and to sanity check results when things go wrong in 3D (and they will).</p>
<p>Before you write the 3D sim though, you need a way of visualising the data. I spent quite a while on this and implemented a single-scattering model using brute force ray-marching on the GPU.</p>
<p>I did some tests with a procedural pyroclastic cloud model which you can see below. This runs at around 25ms on my MacBook Pro (NVIDIA 320M) but you can dial the sample counts up and down to suit:</p>
<p align="middle"><iframe src="http://player.vimeo.com/video/16159247" width="500" height="375" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p>Here's a simplified GLSL snippet of the volume rendering shader. It's not at all optimised apart from some branches to skip over empty space and an assumption that absorption varies linearly with density:</p>
<pre class="prettyprint linenums">
uniform sampler3D g_densityTex;
uniform vec3 g_lightPos;
uniform vec3 g_lightIntensity;
uniform vec3 g_eyePos;
uniform float g_absorption;

void main()
{
    // diagonal of the cube
    const float maxDist = sqrt(3.0);

    const int numSamples = 128;
    const float scale = maxDist/float(numSamples);

    const int numLightSamples = 32;
    const float lscale = maxDist / float(numLightSamples);

    // assume all coordinates are in texture space
    vec3 pos = gl_TexCoord[0].xyz;
    vec3 eyeDir = normalize(pos-g_eyePos)*scale;

    // transmittance
    float T = 1.0;
    // in-scattered radiance
    vec3 Lo = vec3(0.0);

    for (int i=0; i &lt; numSamples; ++i)
    {
        // sample density
        float density = texture3D(g_densityTex, pos).x;

        // skip empty space
        if (density &gt; 0.0)
        {
            // attenuate ray-throughput
            T *= 1.0-density*scale*g_absorption;
            if (T &lt;= 0.01)
                break;

            // point light dir in texture space
            vec3 lightDir = normalize(g_lightPos-pos)*lscale;

            // sample light
            float Tl = 1.0; // transmittance along light ray
            vec3 lpos = pos + lightDir;

            for (int s=0; s &lt; numLightSamples; ++s)
            {
                float ld = texture3D(g_densityTex, lpos).x;
                Tl *= 1.0-g_absorption*lscale*ld;

                if (Tl &lt;= 0.01)
                    break;

                lpos += lightDir;
            }

            vec3 Li = g_lightIntensity*Tl;

            Lo += Li*T*density*scale;
        }

        pos += eyeDir;
    }

    gl_FragColor.xyz = Lo;
    gl_FragColor.w = 1.0-T;
}
</pre>
<p>I'm pretty sure there's a whole post on the ways this could be optimised but I'll save that for next time. Also, this example shader doesn't have any wavelength-dependent variation. Making your absorption coefficient different for each channel looks much more interesting, and having a different coefficient for your primary and shadow rays also helps; you can see this effect in the videos.</p>
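<p>To illustrate the per-channel idea, here is a small CPU-side sketch of one ray-march step with a separate absorption coefficient for each channel. The <code>Vec3</code> type and <code>Attenuate</code> function are hypothetical, and the sketch uses the exponential (Beer-Lambert) form, of which the shader's <code>1.0-density*scale*g_absorption</code> factor can be read as the first-order approximation for small steps.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical RGB triple used for both transmittance and absorption.
struct Vec3 { float r, g, b; };

// One ray-march step of per-channel attenuation over a segment of length
// `stepSize` through medium of the given density. A larger coefficient in
// one channel removes more of that colour from the ray.
Vec3 Attenuate(Vec3 transmittance, float density, float stepSize, Vec3 absorption)
{
    assert(density >= 0.0f && stepSize >= 0.0f);
    transmittance.r *= std::exp(-density * stepSize * absorption.r);
    transmittance.g *= std::exp(-density * stepSize * absorption.g);
    transmittance.b *= std::exp(-density * stepSize * absorption.b);
    return transmittance;
}
```

<p>Using one <code>absorption</code> value for the eye ray and another for the light ray gives the separate primary and shadow coefficients mentioned above.</p>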
<p>To create the cloud like volume texture in OpenGL I use a displaced distance field like this (see the SIGGRAPH course for more details):</p>
<pre class="prettyprint linenums">
// create a volume texture with n^3 texels and base radius r
GLuint CreatePyroclasticVolume(int n, float r)
{
    GLuint texid;
    glGenTextures(1, &amp;texid);

    GLenum target = GL_TEXTURE_3D;
    GLenum filter = GL_LINEAR;
    GLenum address = GL_CLAMP_TO_BORDER;

    glBindTexture(target, texid);

    glTexParameteri(target, GL_TEXTURE_MAG_FILTER, filter);
    glTexParameteri(target, GL_TEXTURE_MIN_FILTER, filter);

    glTexParameteri(target, GL_TEXTURE_WRAP_S, address);
    glTexParameteri(target, GL_TEXTURE_WRAP_T, address);
    glTexParameteri(target, GL_TEXTURE_WRAP_R, address);

    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);

    // one 8-bit density value per texel
    unsigned char *data = new unsigned char[n*n*n];
    unsigned char *ptr = data;

    float frequency = 3.0f / n;
    float center = n / 2.0f + 0.5f;

    for(int x=0; x &lt; n; x++)
    {
        for (int y=0; y &lt; n; ++y)
        {
            for (int z=0; z &lt; n; ++z)
            {
                float dx = center-x;
                float dy = center-y;
                float dz = center-z;

                float off = fabsf(Perlin3D(x*frequency,
                               y*frequency,
                               z*frequency,
                               5,
                               0.5f));

                float d = sqrtf(dx*dx+dy*dy+dz*dz)/(n);

                *ptr++ = ((d-off) &lt; r)?255:0;
            }
        }
    }

    // upload
    glTexImage3D(target,
                 0,
                 GL_LUMINANCE,
                 n,
                 n,
                 n,
                 0,
                 GL_LUMINANCE,
                 GL_UNSIGNED_BYTE,
                 data);

    delete[] data;

    return texid;
}
</pre>
<p>An excellent introduction to volume rendering is the SIGGRAPH 2010 course, <a href="http://magnuswrenninge.com/volumetricmethods">Volumetric Methods in Visual Effects</a>, and Kyle Hayward's <a href="http://graphicsrunner.blogspot.com/2009/01/volume-rendering-101.html">Volume Rendering 101</a> covers some GPU specifics.</p>
<p>Once I had the visualisation in place, porting the fluid simulation to 3D was actually not too difficult. I spent most of my time tweaking the initial conditions to get the smoke to behave in a way that looks interesting; you can see one of my more successful simulations below:</p>
<p align="middle"><iframe src="http://player.vimeo.com/video/16357651" width="400" height="600" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p>Currently the simulation runs entirely on the CPU using a 128^3 grid with monotonic tri-cubic interpolation and vorticity confinement as described in <a href="http://graphics.ucsd.edu/~henrik/papers/smoke/smoke.pdf">Visual Simulation of Smoke</a> by Fedkiw, Stam and Jensen. I'm fairly happy with the result but perhaps I have the vorticity confinement cranked a little high.</p>
<p>Nothing is optimised so it's running at about 1.2s a frame on my 2.66GHz Core 2 MacBook.</p>
<p>Future work is to port the simulation to OpenCL and implement some more advanced features.  Specifically I'm interested in <a href="http://physbam.stanford.edu/~fedkiw/papers/stanford2005-01.pdf">A Vortex Particle Method for Smoke, Water and Explosions</a> which <a href="http://www.kevinbeason.com/">Kevin Beason</a> describes on his <a href="http://www.kevinbeason.com/scs/fluid/">fluid page</a> (with some great videos).</p>
<p>On a personal note, I resigned from LucasArts a couple of weeks ago and am looking forward to some time off back in New Zealand with my family and friends.  Just in time for the Kiwi summer!</p>
<h2>Links</h2>
<p><a href="http://http.developer.nvidia.com/GPUGems/gpugems_ch38.html">GPU Gems - Fluid Simulation on the GPU</a><br />
<a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch30.html">GPU Gems 3 - Real-Time Rendering and Simulation of 3D Fluids</a><br />
<a href="http://www.colinbraley.com/Pubs/FluidSimColinBraley.pdf">Fluid Simulation For Computer Graphics: A Tutorial in Grid Based and Particle Based Methods</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/11/01/adventures-in-fluid-simulation/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Tracing</title>
		<link>http://blog.mmacklin.com/2010/10/03/tracing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=tracing</link>
		<comments>http://blog.mmacklin.com/2010/10/03/tracing/#comments</comments>
		<pubDate>Sun, 03 Oct 2010 23:37:47 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Offline]]></category>
		<category><![CDATA[Ray tracing]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=1033</guid>
		<description><![CDATA[Gregory Pakosz reminded me to write a follow up on my path tracing efforts since my last post on the subject. It's good timing because the friendly work-place competition between Tom and me has been in full swing. The great thing about ray tracing is that there are many opportunities for optimisation at all levels [...]]]></description>
				<content:encoded><![CDATA[<p>Gregory Pakosz reminded me to write a follow up on my path tracing efforts since my <a href="http://mmack.wordpress.com/2009/12/02/path-tracing/">last post</a> on the subject. </p>
<p>It's good timing because the friendly work-place competition between <a href="http://imdoingitwrong.wordpress.com/">Tom</a> and me has been in full swing. The great thing about ray tracing is that there are many opportunities for optimisation at all levels of computation.  This keeps you "hooked" by constantly offering decent speed increases for relatively little effort.</p>
<p>My competitor had an existing BIH (<a href="http://en.wikipedia.org/wiki/Bounding_interval_hierarchy">bounding interval hierarchy</a>) implementation that was doing a pretty good job, so I had some catching up to do.  Previously I had a positive experience using a BVH (<a href="http://en.wikipedia.org/wiki/Bounding_volume_hierarchy">AABB tree</a>) in a games context so decided to go that route.</p>
<p>Our benchmark scene was <a href="http://www.crytek.com/cryengine/cryengine3/downloads">Crytek's Sponza</a> with the camera positioned in the center of the model looking down the z-axis.  This might not be the most representative case but was good enough for comparing primary ray speeds.</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/sponza_bench.png"><img class="aligncenter size-full wp-image-1070" title="sponza_bench" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/sponza_bench.png" alt="" width="480" height="267" /></a></p>
<p>Here's a rough timeline of the performance progress (all timings were taken from my 2.6GHz i7 running 8 worker threads):</p>
<table style="text-align:left;" border="1">
<tbody>
<tr>
<th>Optimisation</th>
<th>Rays/second</th>
</tr>
<tr>
<td>Baseline (median split)</td>
<td>91246</td>
</tr>
<tr>
<td>Tweak compiler settings (/fp:fast /arch:SSE2 /Ot)</td>
<td>137486</td>
</tr>
<tr>
<td>Non-recursive traversal</td>
<td>145847</td>
</tr>
<tr>
<td>Traverse closest branch first</td>
<td>146822</td>
</tr>
<tr>
<td>Surface area heuristic</td>
<td>1275890</td>
</tr>
<tr>
<td>Surface area heuristic (exhaustive)</td>
<td>1937500</td>
</tr>
<tr>
<td>Optimized ray-AABB</td>
<td>2142320</td>
</tr>
<tr>
<td>VS2008 to VS2010</td>
<td>2477460</td>
</tr>
</tbody>
</table>
<p>You can see the massive difference tree quality has on performance. What I found surprising, though, was the effect of switching to VS2010: roughly 15% faster is impressive for a single compiler revision.</p>
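<p>For readers who haven't met the surface area heuristic, the usual textbook form of the split cost can be sketched as below. This is an assumption-laden illustration rather than my builder: the <code>Aabb</code> layout, the function names and the two cost constants are all hypothetical.</p>

```cpp
#include <cassert>

// Hypothetical axis-aligned bounding box.
struct Aabb { float min[3]; float max[3]; };

float SurfaceArea(const Aabb& box)
{
    const float dx = box.max[0] - box.min[0];
    const float dy = box.max[1] - box.min[1];
    const float dz = box.max[2] - box.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Expected cost of splitting `parent` into two children. The chance that a
// ray hitting the parent also hits a child is estimated by the ratio of
// their surface areas; each hit child costs one intersection per primitive.
float SahSplitCost(const Aabb& parent,
                   const Aabb& left, int leftCount,
                   const Aabb& right, int rightCount,
                   float traversalCost, float intersectCost)
{
    const float parentArea = SurfaceArea(parent);
    assert(parentArea > 0.0f);
    const float pLeft = SurfaceArea(left) / parentArea;
    const float pRight = SurfaceArea(right) / parentArea;
    return traversalCost + intersectCost * (pLeft * leftCount + pRight * rightCount);
}
```

<p>A builder evaluates this cost for a set of candidate planes and keeps the cheapest; an exhaustive search simply tries every primitive boundary on every axis instead of a handful of candidates. If no split beats <code>intersectCost * primitiveCount</code>, the node is left as a leaf.</p>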
<p>I played around with a quantized BVH which reduced node size from 32 bytes to 11 but I couldn't get the decrease in cache traffic to outweigh the cost in decoding the nodes.  If anyone has had success with this I'd be interested in the details.</p>
<p>Algorithmically it is a uni-directional path tracer with multiple importance sampling. Of course importance sampling doesn't make individual samples faster but allows you to take fewer total samples than you would have to otherwise.</p>
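<p>Multiple importance sampling comes down to weighting each sample by how likely the other strategy was to produce it. The two common weighting functions are sketched below with illustrative names; the post doesn't say which one the renderer uses, so treat the choice as an assumption.</p>

```cpp
#include <cassert>

// Weight for a sample drawn from strategy f when strategy g could also
// have produced it. nf and ng are the sample counts for each strategy,
// fPdf and gPdf the two probability densities for this sample.
float BalanceHeuristic(int nf, float fPdf, int ng, float gPdf)
{
    const float f = nf * fPdf;
    const float g = ng * gPdf;
    assert(f + g > 0.0f);
    return f / (f + g);
}

// Power heuristic with exponent 2, which pushes the weights further
// towards whichever strategy has the higher density.
float PowerHeuristic(int nf, float fPdf, int ng, float gPdf)
{
    const float f = nf * fPdf;
    const float g = ng * gPdf;
    assert(f * f + g * g > 0.0f);
    return (f * f) / (f * f + g * g);
}
```

<p>In a path tracer the light-sampling estimate would be scaled by <code>PowerHeuristic(1, lightPdf, 1, bsdfPdf)</code> and the BSDF-sampling estimate by the same call with the arguments swapped, so the two weights for any given direction sum to one.</p>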
<p>So, time for some pictures:</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/sponza_plus.png"><img class="aligncenter size-full wp-image-1042" title="Sponza" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/sponza_plus.png" alt="" width="480" height="283" /></a></p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/classroom_neon.png"><img class="aligncenter size-full wp-image-1036" title="Classroom (from LuxRender distribution)" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/classroom_neon.png" alt="" width="480" height="282" /></a></p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/matte_lucy_big.png"><img class="aligncenter size-full wp-image-1040" title="Lucy" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/matte_lucy_big.png" alt="" width="480" height="282" /></a></p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/gold_statuette_exp.png"><img class="aligncenter size-full wp-image-1039" title="Thai Statuette" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/gold_statuette_exp.png" alt="" width="480" height="783" /></a></p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/gold_dragon.png"><img class="aligncenter size-full wp-image-1038" title="Dragon" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/gold_dragon.png" alt="" width="480" height="268" /></a></p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/bunny_fresnel.png"><img class="aligncenter size-full wp-image-1035" title="Bunny" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/10/bunny_fresnel.png" alt="" width="480" height="267" /></a></p>
<p>Despite being the lowest poly models, Sponza (200k triangles) and the classroom (250k triangles) were by far the most difficult for the renderer; they both took 10+ hours and still have visible noise.  In contrast the gold statuette (10 million triangles) took only 20 mins to converge!</p>
<p>This is mainly because the architectural models have a mixture of very large and very small polygons which creates deep trees with large nodes near the root.  I think a kd-tree which splits or duplicates primitives might be more effective in this case.</p>
<p>A fun way to break your spatial hierarchy is simply to add a ground plane. Until I performed an exhaustive split search, adding a large two-triangle ground plane could slow down tracing by as much as 50%.</p>
<p>Of course these numbers are peanuts compared to what people are getting with GPU or SIMD packet tracers; <a href="http://www.tml.tkk.fi/~timo/">Timo Aila</a> reports speeds of 142 million rays/second on similar scenes using a GPU tracer in <a href="http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf">this paper</a>.</p>
<p>Writing a path tracer has been a great education for me and I would encourage anyone interested in getting a better grasp on computer graphics to get a copy of PBRT and have a go at it.  It's easy to get started and seeing the finished product is hugely rewarding.</p>
<h3 style="text-align:left;">Links:</h3>
<p>John Carmack <a href="http://twitter.com/id_aa_carmack">tweeting</a> about his experience optimising the offline global illumination calculations in <a href="http://www.rockpapershotgun.com/2009/08/11/carmack-talks-rage-other-stuff/">RAGE</a>.</p>
<p>I was surprised to learn at SIGGRAPH that Arnold (as used by Sony Pictures Imageworks) is at its core a uni-directional path tracer. Marcos Fajardo described some details in the <a href="http://www.graphics.cornell.edu/~jaroslav/gicourse2010/">Global Illumination Across Industries</a> talk.</p>
<p>Mental Images' <a href="http://www.youtube.com/watch?v=nhXBx8l0iso">iRay</a> (their GPU based cloud renderer) looks impressive and apparently uses a single BSSRDF on all their surfaces which I guess helps simplify their GPU implementation.</p>
<p><a href="http://ompf.org/forum/">Ompf.org</a></p>
<h3 style="text-align:left;">Model credits:</h3>
<p>Sponza - <a href="http://www.crytek.com/cryengine/cryengine3/downloads">Crytek</a><br />
Classroom - <a href="http://src.luxrender.net/luxrays/">LuxRender</a><br />
Thai Statuette, Dragon, Bunny, Lucy - <a href="http://graphics.stanford.edu/data/3Dscanrep/">Stanford scanning repository</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/10/03/tracing/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Faster Fog</title>
		<link>http://blog.mmacklin.com/2010/06/10/faster-fog/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=faster-fog</link>
		<comments>http://blog.mmacklin.com/2010/06/10/faster-fog/#comments</comments>
		<pubDate>Fri, 11 Jun 2010 03:24:11 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Fog Volumes]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=856</guid>
		<description><![CDATA[Cedrick at Lucas suggested some nice optimisations for the in-scattering equation I posted last time. I had left off at: \[L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\left(\frac{d+b}{v}\right) - \tan^{-1}\left(\frac{b}{v}\right) )\] But we can remove one of the two inverse trigonometric functions by using the following identity: \[\tan^{-1}x - \tan^{-1}y = \tan^{-1}\frac{x-y}{1+xy}\] Which simplifies the expression for \(L_{s}\) to: [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://ccollomb.free.fr/blog/">Cedrick</a> at Lucas suggested some nice optimisations for the in-scattering equation I posted <a href="http://mmack.wordpress.com/2010/05/29/in-scattering-demo/">last time</a>.</p>
<p>I had left off at:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\left(\frac{d+b}{v}\right) - \tan^{-1}\left(\frac{b}{v}\right) )\]</span><script type='math/tex;  mode=display'>L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\left(\frac{d+b}{v}\right) - \tan^{-1}\left(\frac{b}{v}\right) )</script></p></p>
<p>But we can remove one of the two inverse trigonometric functions by using the following identity:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\tan^{-1}x - \tan^{-1}y = \tan^{-1}\frac{x-y}{1+xy}\]</span><script type='math/tex;  mode=display'>\tan^{-1}x - \tan^{-1}y = \tan^{-1}\frac{x-y}{1+xy}</script></p></p>
<p>Which simplifies the expression for <span class='MathJax_Preview'>\(L_{s}\)</span><script type='math/tex'>L_{s}</script> to:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\frac{x-y}{1+xy} )\]</span><script type='math/tex;  mode=display'>L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\frac{x-y}{1+xy} )</script></p></p>
<p>With <span class='MathJax_Preview'>\(x\)</span><script type='math/tex'>x</script> and <span class='MathJax_Preview'>\(y\)</span><script type='math/tex'>y</script> being replaced by:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\begin{array}{lcl} x = \frac{d+b}{v} \\ y = \frac{b}{v}\end{array}\]</span><script type='math/tex;  mode=display'>\begin{array}{lcl} x = \frac{d+b}{v} \\ y = \frac{b}{v}\end{array}</script></p></p>
<p>So the updated GLSL snippet looks like:</p>
<pre class="prettyprint linenums">
float InScatter(vec3 start, vec3 dir, vec3 lightPos, float d)
{
   vec3 q = start - lightPos;

   // calculate coefficients
   float b = dot(dir, q);
   float c = dot(q, q);
   float s = 1.0f / sqrt(c - b*b);

   // after a little algebraic re-arrangement
   // (note: the identity only holds while 1.0+(x+y)*y stays positive)
   float x = d*s;
   float y = b*s;
   float l = s * atan( (x) / (1.0+(x+y)*y));

   return l;
}
</pre>
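<p>The arctangent identity is only valid when <em>xy</em> &gt; -1, so it is worth checking the single-atan form against the original numerically. The C++ sketch below mirrors the GLSL using the scalar coefficients <code>b</code>, <code>c</code> and <code>d</code> directly; the function names are illustrative, and the two-argument variant is a suggestion of mine rather than something that was profiled for this post.</p>

```cpp
#include <cassert>
#include <cmath>

// Reference form: two atan() calls, with v = sqrt(c - b*b).
double InScatterTwoAtan(double b, double c, double d)
{
    assert(c > b * b); // v*v must be positive
    const double v = std::sqrt(c - b * b);
    return (std::atan((d + b) / v) - std::atan(b / v)) / v;
}

// Single atan() via the identity, matching the GLSL snippet.
double InScatterOneAtan(double b, double c, double d)
{
    assert(c > b * b);
    const double s = 1.0 / std::sqrt(c - b * b);
    const double x = d * s;
    const double y = b * s;
    return s * std::atan(x / (1.0 + (x + y) * y));
}

// Two-argument arctangent keeps the quadrant, so it also agrees with the
// reference when 1 + (x+y)*y goes negative.
double InScatterAtan2(double b, double c, double d)
{
    assert(c > b * b);
    const double s = 1.0 / std::sqrt(c - b * b);
    const double x = d * s;
    const double y = b * s;
    return s * std::atan2(x, 1.0 + (x + y) * y);
}
```

<p>With <code>b = 1, c = 5, d = 3</code> all three agree. With <code>b = -2, c = 5, d = 4</code> (a ray that starts before the light and passes close to it) the single-atan form is off by π/v while the two-argument form still matches, so GLSL's two-argument <code>atan(y, x)</code> may be worth trying if you see artifacts near the light.</p>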
<p>Of course it's always good to verify your 'optimisations'. Ideally I would take GPU timings, but the next best thing is to run the shader through NVShaderPerf and check the cycle counts:</p>
<p>Original (2x atan()):<br />
<code><br />
Fragment Performance Setup: Driver 174.74, GPU G80, Flags 0x1000<br />
Results 76 cycles, 10 r regs, 2,488,320,064 pixels/s<br />
</code></p>
<p>Updated (1x atan()):<br />
<code><br />
Fragment Performance Setup: Driver 174.74, GPU G80, Flags 0x1000<br />
Results 55 cycles, 8 r regs, 3,251,200,103 pixels/s<br />
</code></p>
<p>A tasty 28% reduction in cycle count!</p>
<p>Another idea is to use an approximation of atan(). Robin Green has some great articles about <a href="http://www.research.scea.com/gdc2003/fast-math-functions.html">faster math functions</a> where he discusses how you can range reduce to 0-1 and approximate using <a href="http://mathworld.wolfram.com/MinimaxPolynomial.html">minimax polynomials</a>.</p>
<p>My first attempt was much simpler. Looking at its graph, we can see that atan() is almost linear near 0 and asymptotically approaches pi/2.</p>
<p style="text-align:center;"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan.png"><img class="aligncenter size-full wp-image-878" title="atan" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan.png" alt="" width="496" height="421" /></a></p>
<p>Perhaps the simplest approximation we could try would be something like:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\tan^{-1}(x) \approx \min\left(x, \frac{\pi}{2}\right)\]</span><script type='math/tex;  mode=display'>\tan^{-1}(x) \approx \min\left(x, \frac{\pi}{2}\right)</script></p></p>
<p>Which looks like:</p>
<p style="text-align:center;"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan_approx.png"><img class="aligncenter size-full wp-image-879" title="atan_approx" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan_approx.png" alt="" width="496" height="419" /></a></p>
<pre class="prettyprint linenums">
float atanLinear(float x)
{
   return clamp(x, -0.5*kPi, 0.5*kPi);
}

// Fragment Performance Setup: Driver 174.74, GPU G80, Flags 0x1000
// Results 34 cycles, 8 r regs, 4,991,999,816 pixels/s
</pre>
<p>Pretty ugly, but even though the maximum error here is huge (~0.43 relative), visually the difference is <a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/linear.png">surprisingly small</a>.</p>
<p>Still, I thought I'd try for something more accurate. I used a 3rd degree minimax polynomial to approximate the range 0-1 which gave something practically identical to atan() for my purposes (~0.0052 max relative error):</p>
<p style="text-align:center;"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan_minimax.png"><img class="aligncenter size-full wp-image-906" title="atan_minimax" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan_minimax.png" alt="" width="524" height="511" /></a></p>
<pre class="prettyprint linenums">
float MiniMax3(float x)
{
   return ((-0.130234*x - 0.0954105)*x + 1.00712)*x - 0.00001203333;
}

float atanMiniMax3(float x)
{
   // range reduction
   if (x &lt; 1.0)
      return MiniMax3(x);
   else
      return kPi*0.5 - MiniMax3(1.0/x);
}

// Fragment Performance Setup: Driver 174.74, GPU G80, Flags 0x1000
// Results 40 cycles, 8 r regs, 4,239,359,951 pixels/s
</pre>
<p><em>Disclaimer: This isn't designed as a general replacement for atan(). For a start it doesn't handle values of x &lt; 0, and it hasn't had anywhere near the love put into other approximations you can find online (optimising for floating point representations for example).</em></p>
<p>As a bonus I found that putting the polynomial evaluation into <a href="http://en.wikipedia.org/wiki/Horner_scheme">Horner form</a> shaved 4 cycles from the shader.</p>
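<p>For reference, here is the same cubic written both term by term and in Horner form as plain C++, plus a hypothetical extension to negative inputs using the odd symmetry of atan(). The coefficients are the ones from the shader above; the function names and the sign handling are my own additions and haven't been profiled.</p>

```cpp
#include <cassert>
#include <cmath>

// The cubic written term by term: six multiplies.
double MiniMax3Expanded(double x)
{
    return -0.130234 * x * x * x - 0.0954105 * x * x + 1.00712 * x - 0.00001203333;
}

// The same cubic in Horner form: three multiplies and three adds.
double MiniMax3Horner(double x)
{
    return ((-0.130234 * x - 0.0954105) * x + 1.00712) * x - 0.00001203333;
}

// Range reduction as in the shader, extended to x < 0 by using
// atan(-x) = -atan(x). Inputs very close to zero are not handled
// carefully because of the small constant term.
double AtanMiniMax3(double x)
{
    const double kHalfPi = 1.5707963267948966;
    const double a = std::fabs(x);
    const double r = (a < 1.0) ? MiniMax3Horner(a)
                               : kHalfPi - MiniMax3Horner(1.0 / a);
    return (x < 0.0) ? -r : r;
}
```

<p>Both polynomial forms give the same value up to rounding, so any saving comes purely from the reduced operation count.</p>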
<p>Cedrick also had an idea to use something a little different:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\tan^{-1}(x) \approx \frac{\pi}{2}\left(\frac{kx}{1+kx}\right)\]</span><script type='math/tex;  mode=display'>\tan^{-1}(x) \approx \frac{\pi}{2}\left(\frac{kx}{1+kx}\right)</script></p></p>
<p>This might look familiar to some as the basic Reinhard <a href="http://filmicgames.com/archives/category/tonemapping">tone mapping</a> curve!  We eyeballed values for k until we had one that looked close (you can tell I'm being very rigorous here), in the end k=1 was close enough and is one cycle faster <img src='http://blog.mmacklin.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p style="text-align:center;"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan_rational1.png"><img class="aligncenter size-full wp-image-1003" title="atan_rational" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/atan_rational1.png" alt="" width="523" height="511" /></a></p>
<pre class="prettyprint linenums">
float atanRational(float x)
{
   return kPi*0.5*x / (1.0+x);
}

// Fragment Performance Setup: Driver 174.74, GPU G80, Flags 0x1000
// Results 34 cycles, 8 r regs, 4,869,120,025 pixels/s
</pre>
<p>To get it down to 34 cycles we had to expand out the expression for x and perform some more grouping of terms, which shaved another cycle and a register off it.  I was surprised to see the rational approximation come so close in performance to the linear one; I guess the scheduler is doing a good job of hiding some work there.</p>
<p>In the end all three approximations gave pretty good visual results:</p>
<p>Original (cycle count 76):</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/original.png"><img title="original" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/original.png?w=150" alt="" width="150" height="86" /></a></p>
<p>MiniMax3, Error 8x (cycle count 40):</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/minimax3.png"> <img title="minimax3" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/minimax3.png?w=150" alt="" width="150" height="86" /></a><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/minimax3_diff.png"><img title="minimax3_diff" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/minimax3_diff.png?w=150" alt="" width="150" height="86" /></a></p>
<p>Rational, Error 8x (cycle count 34):</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/rational.png"><img  title="rational" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/rational.png?w=150" alt="" width="150" height="86" /></a><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/rational_diff2.png"><img class="alignnone" title="rational_diff" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/rational_diff2.png?w=150" alt="" width="150" height="86" /></a></p>
<p>Linear, Error 8x (cycle count 34):</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/linear.png"><img title="linear" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/linear.png?w=150" alt="" width="150" height="86" /></a><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/linear_diff.png"><img title="linear_diff" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/06/linear_diff.png?w=150" alt="" width="150" height="86" /></a></p>
<p>Links:</p>
<p><a href="http://realtimecollisiondetection.net/blog/?p=9">http://realtimecollisiondetection.net/blog/?p=9</a></p>
<p><a href="http://www.research.scea.com/gdc2003/fast-math-functions.html">http://www.research.scea.com/gdc2003/fast-math-functions.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/06/10/faster-fog/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>In-Scattering Demo</title>
		<link>http://blog.mmacklin.com/2010/05/29/in-scattering-demo/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=in-scattering-demo</link>
		<comments>http://blog.mmacklin.com/2010/05/29/in-scattering-demo/#comments</comments>
		<pubDate>Sat, 29 May 2010 23:32:07 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Demo]]></category>
		<category><![CDATA[Fog Volumes]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=684</guid>
		<description><![CDATA[This demo shows an analytic solution to the differential in-scattering equation for light in participating media. It's a similar but simplified version of equations found in [1], [2] and as I recently discovered [3]. However I thought showing the derivation might be interesting for some out there, plus it was a good excuse for me [...]]]></description>
				<content:encoded><![CDATA[<p>This demo shows an analytic solution to the differential in-scattering equation for light in participating media. It's a similar but simplified version of equations found in <a href="http://www.eecs.berkeley.edu/~ravir/papers/singlescat/scattering.pdf">[1]</a>, <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.3787&amp;rep=rep1&amp;type=pdf">[2]</a> and, as I recently discovered, <a href="http://research.microsoft.com/en-us/um/people/johnsny/papers/fogshop-pg.pdf">[3]</a>. However, I thought showing the derivation might be interesting for some out there, plus it was a good excuse for me to brush up on my <span class='MathJax_Preview'>\(\LaTeX\)</span><script type='math/tex'>\LaTeX</script>.</p>
<p>You might notice I also updated the site's theme. Unfortunately you need a white background to make wordpress.com LaTeX rendering play nice with RSS feeds (other than that it's very convenient).</p>
<p><a href="http://mmacklin.dreamhosters.com/FogVolumes.zip">Download the demo here</a></p>
<p>The demo uses GLSL and shows point and spot lights in a basic scene with some tweakable parameters:</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/05/fogvolumes1.png"><img class="aligncenter size-full wp-image-712" title="FogVolumes1" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/05/fogvolumes1.png" alt="" width="493" height="565" /></a></p>
<h2>Background</h2>
<p>Given a view ray defined as:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[\mathbf{x}(t) = \mathbf{p} + t\mathbf{d}\]</span><script type='math/tex;  mode=display'>\mathbf{x}(t) = \mathbf{p} + t\mathbf{d}</script></p></p>
<p>We would like to know the total amount of light scattered towards the viewer (in-scattered) due to a point light source. For the purposes of this post I will only consider single scattering within isotropic media.</p>
<p>The differential equation that describes the change in radiance due to light scattered into the view direction inside a differential volume is given in <a href="http://www.pbrt.org/">PBRT</a> (p578). If we assume equal scattering in all directions, we can write it as:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[dL_{s}(t) = \sigma_{s}L_{i}(t)\,dt\]</span><script type='math/tex;  mode=display'>dL_{s}(t) = \sigma_{s}L_{i}(t)\,dt</script></p></p>
<p>Where <span class='MathJax_Preview'>\(\sigma_{s}\)</span><script type='math/tex'>\sigma_{s}</script> is the scattering probability, which I will assume includes the normalization term for an isotropic phase function of <span class='MathJax_Preview'>\(\frac{1}{4\pi}\)</span><script type='math/tex'>\frac{1}{4\pi}</script>. For a point light source at distance d with intensity I we can calculate the radiant intensity at a receiving point as:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{i} = \dfrac{I}{d^2}\]</span><script type='math/tex;  mode=display'>L_{i} = \dfrac{I}{d^2}</script></p></p>
<p>Plugging in the equation for a point along the view ray we have:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{i}(t) = \dfrac{I}{|\mathbf{x}(t)-\mathbf{s}|^2}\]</span><script type='math/tex;  mode=display'>L_{i}(t) = \dfrac{I}{|\mathbf{x}(t)-\mathbf{s}|^2}</script></p></p>
<p>Where s is the light source position. The solution to the differential in-scattering equation above is then given by:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \int_{0}^{d} \sigma_{s}L_{i}(t) \, dt\]</span><script type='math/tex;  mode=display'>L_{s} = \int_{0}^{d} \sigma_{s}L_{i}(t) \, dt</script></p></p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \int_{0}^{d} \frac{\sigma_{s}I}{|\mathbf{x}(t)-\mathbf{s}|^2}\,dt\]</span><script type='math/tex;  mode=display'>L_{s} = \int_{0}^{d} \frac{\sigma_{s}I}{|\mathbf{x}(t)-\mathbf{s}|^2}\,dt</script></p></p>
<p>To find this integral in closed form we need to expand the distance calculation in the denominator into something we can deal with more easily:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \sigma_{s}I\int_0^d{\dfrac{dt}{(\mathbf{p} + t\mathbf{d} - \mathbf{s})\cdot(\mathbf{p} + t\mathbf{d} - \mathbf{s})}}\]</span><script type='math/tex;  mode=display'>L_{s} = \sigma_{s}I\int_0^d{\dfrac{dt}{(\mathbf{p} + t\mathbf{d} - \mathbf{s})\cdot(\mathbf{p} + t\mathbf{d} - \mathbf{s})}}</script></p></p>
<p>Expanding the dot product and gathering terms, we have:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \sigma_{s}I\int_{0}^{d}\frac{dt}{(\mathbf{d}\cdot\mathbf{d})t^2 + 2(\mathbf{m}\cdot\mathbf{d})t + \mathbf{m}\cdot\mathbf{m} }\]</span><script type='math/tex;  mode=display'>L_{s} = \sigma_{s}I\int_{0}^{d}\frac{dt}{(\mathbf{d}\cdot\mathbf{d})t^2 + 2(\mathbf{m}\cdot\mathbf{d})t + \mathbf{m}\cdot\mathbf{m} }</script></p></p>
<p>Where <span class='MathJax_Preview'>\(\mathbf{m} = (\mathbf{p}-\mathbf{s})\)</span><script type='math/tex'>\mathbf{m} = (\mathbf{p}-\mathbf{s})</script>.</p>
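<p>Now we have something a bit more familiar. Because the direction vector is unit length we can drop the coefficient of the quadratic term, and writing <span class='MathJax_Preview'>\(b = \mathbf{m}\cdot\mathbf{d}\)</span><script type='math/tex'>b = \mathbf{m}\cdot\mathbf{d}</script> and <span class='MathJax_Preview'>\(c = \mathbf{m}\cdot\mathbf{m}\)</span><script type='math/tex'>c = \mathbf{m}\cdot\mathbf{m}</script> we have:</p>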
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \sigma_{s}I\int_{0}^{d}\frac{dt}{t^2 + 2bt + c}\]</span><script type='math/tex;  mode=display'>L_{s} = \sigma_{s}I\int_{0}^{d}\frac{dt}{t^2 + 2bt + c}</script></p></p>
<p>At this point you could look up the integral in standard tables but I'll continue to simplify it for completeness.  Completing the square we obtain:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \sigma_{s}I\int_{0}^{d}\frac{dt}{ (t^2 + 2bt + b^2) + (c-b^2)}\]</span><script type='math/tex;  mode=display'>L_{s} = \sigma_{s}I\int_{0}^{d}\frac{dt}{ (t^2 + 2bt + b^2) + (c-b^2)}</script></p></p>
<p>Making the substitution <span class='MathJax_Preview'>\(u = (t + b)\)</span><script type='math/tex'>u = (t + b)</script>, <span class='MathJax_Preview'>\(v = (c-b^2)^{1/2}\)</span><script type='math/tex'>v = (c-b^2)^{1/2}</script> and updating our limits of integration, we have:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \sigma_{s}I\int_{b}^{b+d}\frac{du}{ u^2 + v^2}\]</span><script type='math/tex;  mode=display'>L_{s} = \sigma_{s}I\int_{b}^{b+d}\frac{du}{ u^2 + v^2}</script></p></p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \sigma_{s}I \left[ \frac{1}{v}\tan^{-1}\frac{u}{v} \right]_b^{b+d}\]</span><script type='math/tex;  mode=display'>L_{s} = \sigma_{s}I \left[ \frac{1}{v}\tan^{-1}\frac{u}{v} \right]_b^{b+d}</script></p></p>
<p>Finally giving:</p>
<p><p style='text-align:center;'><span class='MathJax_Preview'>\[L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\frac{d+b}{v} - \tan^{-1}\frac{b}{v} )\]</span><script type='math/tex;  mode=display'>L_{s} = \frac{\sigma_{s}I}{v}( \tan^{-1}\frac{d+b}{v} - \tan^{-1}\frac{b}{v} )</script></p></p>
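<p>This is what we will evaluate in the pixel shader. Here's the GLSL snippet for the integral evaluation (a direct translation of the equation above, without the constant factor <span class='MathJax_Preview'>\(\sigma_{s}I\)</span><script type='math/tex'>\sigma_{s}I</script>):</p>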
<pre class="prettyprint linenums">
float InScatter(vec3 start, vec3 dir, vec3 lightPos, float d)
{
   // light to ray origin
   vec3 q = start - lightPos;

   // coefficients
   float b = dot(dir, q);
   float c = dot(q, q);

   // evaluate integral
   float s = 1.0f / sqrt(c - b*b);
   float l = s * (atan( (d + b) * s) - atan( b*s ));

   return l;
}
</pre>
<p>Where d is the distance traveled, computed by finding the entry / exit points of the ray with the volume.</p>
<p>To make the effect more interesting it is possible to incorporate a particle system. I apply the same scattering shader to each particle and treat it as a thin slab to obtain an approximate depth, then simply multiply by a noise texture at the end.</p>
<p style="text-align: center;"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/05/fogvolumes2.png"><img class="aligncenter" title="FogVolumes2" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/05/fogvolumes2.png" alt="" width="490" height="550" /></a></p>
<p style="text-align: center;"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/05/fogvolumes4.png"><img class="aligncenter" title="FogVolumes4" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/05/fogvolumes4.png" alt="" width="492" height="562" /></a></p>
<h2 style="text-align: left;">Optimisations</h2>
<ul>
<li>As written above, the code only supports lights with infinite extent, which implies drawing the entire frame for each light.  It would be possible to limit it to a volume but you'd want to add a falloff to the effect to avoid a sharp transition at the boundary.</li>
</ul>
<ul>
<li>Performing the full evaluation per-pixel for the particles is probably unnecessary. Doing it at a lower frequency, per-vertex or even per-particle, would probably look acceptable.</li>
</ul>
<h2>Notes</h2>
<ul>
<li>Generally objects appear to have wider specular highlights and more ambient lighting in the presence of participating media.  <a href="http://www.eecs.berkeley.edu/%7Eravir/papers/singlescat/scattering.pdf">[1]</a> discusses this in detail but you can fudge it by lowering the specular power in your materials as the scattering coefficient increases.</li>
</ul>
<ul>
<li>According to <a href="http://en.wikipedia.org/wiki/Rayleigh_scattering">Rayleigh scattering</a>, blue light at the short-wavelength end of the visible spectrum is scattered considerably more than red light.  It's simple to account for this wavelength dependence by making the scattering coefficient a constant vector weighted towards the blue component.  I found this helps add to the realism of the effect.</li>
</ul>
<ul>
<li>I'm curious to know how the torch light was done in Alan Wake as it seems to be high quality (not just billboards) with multiple light shafts. Maybe someone out there knows?</li>
</ul>
<h2 style="text-align: left;">References</h2>
<p style="text-align: left;"><a href="http://www.eecs.berkeley.edu/~ravir/papers/singlescat/scattering.pdf">[1] Sun, B., Ramamoorthi, R., Narasimhan, S. G., and Nayar, S. K. 2005. A practical analytic single scattering model for real time rendering. </a></p>
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.3787&amp;rep=rep1&amp;type=pdf">[2] Wenzel, C. 2006. Real-time atmospheric effects in games. </a></p>
<p><a href="http://research.microsoft.com/en-us/um/people/johnsny/papers/fogshop-pg.pdf">[3] Zhou, K., Hou, Q., Gong, M., Snyder, J., Guo, B., and Shum, H. 2007. Fogshop: Real-Time Design and Rendering of Inhomogeneous, Single-Scattering Media. </a></p>
<h2>Related</h2>
<p><a href="http://www.vis.uni-stuttgart.de/eng/research/pub/pub2010/espmss10.pdf">[4] Engelhardt, T. and Dachsbacher, C. 2010. Epipolar sampling for shadows and crepuscular rays in participating media with single scattering.</a></p>
<p><a href="http://www.cse.chalmers.se/~billeter/pub/volumetric/index.html">[5] Volumetric Shadows using Polygonal Light Volumes</a> (upcoming HPG2010)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/05/29/in-scattering-demo/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Threading Fun</title>
		<link>http://blog.mmacklin.com/2010/05/24/threading-fun/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=threading-fun</link>
		<comments>http://blog.mmacklin.com/2010/05/24/threading-fun/#comments</comments>
		<pubDate>Tue, 25 May 2010 06:01:41 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=641</guid>
		<description><![CDATA[So we had an interesting threading bug at work today which I thought I'd write up here as I hadn't seen this specific problem before (note I didn't write this code, I just helped debug it).  The set up was a basic single producer single consumer arrangement something like this: #include &#60;Windows.h&#62; #include &#60;cassert&#62; volatile [...]]]></description>
				<content:encoded><![CDATA[<p>So we had an interesting threading bug at work today which I thought I'd write up here as I hadn't seen this specific problem before (note I didn't write this code, I just helped debug it).  The setup was a basic single-producer, single-consumer arrangement, something like this:</p>
<pre class="prettyprint linenums">
#include &lt;Windows.h&gt;
#include &lt;cassert&gt;

volatile LONG gAvailable = 0;

// thread 1
DWORD WINAPI Producer(LPVOID)
{
	while (1)
	{
		InterlockedIncrement(&amp;gAvailable);
	}
}

// thread 2
DWORD WINAPI Consumer(LPVOID)
{
	while (1)
	{
		// pull available work with a limit of 5 items per iteration
		LONG work = min(gAvailable, 5);

		// this should never fire.. right?
		assert(work &lt;= 5);

		// update available work
		InterlockedExchangeAdd(&amp;gAvailable, -work);
	}
}

int main(int argc, char* argv[])
{
	HANDLE h[2];

	h[0] = CreateThread(0, 0, Consumer, NULL, 0, 0);
	h[1] = CreateThread(0, 0, Producer, NULL, 0, 0);

	WaitForMultipleObjects(2, h, TRUE, INFINITE);

	return 0;
}
</pre>
<p>So where's the problem?  What would make the assert fire?</p>
<p>We triple-checked the logic and couldn't see anything wrong (it was more complicated than the example above so there were a number of possible culprits) and unlike the example above there were no asserts, just a hung thread at some later stage of execution.</p>
<p>Unfortunately the bug reproduced only once every other week so we knew we had to fix it while I had it in a debugger.  We checked all the relevant in-memory data and couldn't see any that had obviously been overwritten ("memory stomp" is usually the first thing called out when these kinds of bugs show up).</p>
<p>It took us a while but eventually we checked the disassembly for the call to min().  Much to our surprise it was performing two loads of gAvailable instead of the one we had expected!</p>
<p>This happened to be on X360 but the same problem occurs on Win32. Here's the disassembly for the code above (VS2010 Debug):</p>
<pre class="prettyprint linenums">
// calculate available work with a limit of 5 items per iteration
LONG work = min(gAvailable, 5);

// (1) read gAvailable, compare against 5
002D1457  cmp         dword ptr [gAvailable (2D7140h)],5
002D145E  jge         Consumer+3Dh (2D146Dh)

// (2) read gAvailable again, store on stack
002D1460  mov         eax,dword ptr [gAvailable (2D7140h)]
002D1465  mov         dword ptr [ebp-0D0h],eax
002D146B  jmp         Consumer+47h (2D1477h)
002D146D  mov         dword ptr [ebp-0D0h],5

// (3) store gAvailable from (2) in 'work'
002D1477  mov         ecx,dword ptr [ebp-0D0h]
002D147D  mov         dword ptr [work],ecx
</pre>
<p>The question is what happens between (1) and (2)?  The answer is that any other thread can add to gAvailable in that window, so the value stored at (3) can be &gt; 5 even though the comparison at (1) saw a value below 5.</p>
<p>In this case the simple solution was to read gAvailable outside of the call to min():</p>
<pre class="prettyprint linenums">
// pull available work with a limit of 5 items per iteration
LONG available = gAvailable;
LONG work = min(available, 5);
</pre>
<p>Maybe this is obvious to some people but it sure caused me and some smart people a headache for a few hours <img src='http://blog.mmacklin.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Note that you may not see the problem in some build configurations, depending on whether or not the compiler generates code to perform the second read of the variable after the comparison.  As far as I know there are no guarantees about what it may or may not do in this case. FWIW we had the problem in a release build with optimisations enabled.</p>
<p>Big props to Tom and <a href="http://twitter.com/aruslan">Ruslan</a> at Lucas for helping track this one down.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/05/24/threading-fun/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>GOW III: Shadows</title>
		<link>http://blog.mmacklin.com/2010/03/11/gow-iii-shadows/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=gow-iii-shadows</link>
		<comments>http://blog.mmacklin.com/2010/03/11/gow-iii-shadows/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 06:09:31 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[GDC2010]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=624</guid>
		<description><![CDATA[I checked out this session at GDC today - I'll try and sum up the main takeaways (at least for me): Artist controlled cascaded shadow maps, each cascade is accumulated into a 'white buffer' (new term coined?) in deferred style passes using standard PCF filtering Shadow accumulation pass re-projects world space position from an FP32 [...]]]></description>
				<content:encoded><![CDATA[<p>I checked out this session at GDC today - I'll try and sum up the main takeaways (at least for me):</p>
<ul>
<li>Artist controlled cascaded shadow maps, each cascade is accumulated into a 'white buffer' (new term coined?) in deferred style passes using standard PCF filtering</li>
<li>Shadow accumulation pass re-projects world space position from an FP32 depth buffer (separate from the main depth buffer).  The motivation for the separate depth buffer is performance so I assume they store linear depth which means they can reconstruct the world position using just a single multiply-add (saving a reciprocal).</li>
<li>They have the ability to tile individual cascades to achieve arbitrary levels of sampling within a fixed size memory (render cascade tile, apply into white buffer, repeat)</li>
<li>Often up to 9 mega-texel resolution used for in game scenes</li>
<li>White buffer is blended to using MIN blend mode to avoid double darkening (old school)</li>
<li>Invisible 'caster only' geometry to make baked shadows match on dynamic objects</li>
<li>Stencil bits used to mask off baked geometry, fore-ground, back-ground characters</li>
</ul>
<p>The most interesting part (in my opinion) was the optimisation work. Ben creates a light-direction-aligned 8x8x4 grid that he renders extruded bounding spheres into (on the SPUs).  Each cell records whether or not it is in shadow and the rough bounds of that shadow.  To take advantage of this information the accumulation pass (where the expensive filtering is done) breaks the screen up into tiles, checks each tile against the volume and adjusts its depth and 2D bounds accordingly, potentially rejecting entire tiles.</p>
<p>Looking forward to the rest of the talks, this is my first year at GDC and it's pretty great <img src='http://blog.mmacklin.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/03/11/gow-iii-shadows/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Stochastic Pruning (2)</title>
		<link>http://blog.mmacklin.com/2010/02/07/stochastic-pruning-2/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=stochastic-pruning-2</link>
		<comments>http://blog.mmacklin.com/2010/02/07/stochastic-pruning-2/#comments</comments>
		<pubDate>Sun, 07 Feb 2010 07:39:41 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=594</guid>
		<description><![CDATA[A quick update for anyone who was having problems running my stochastic pruning demo on NVIDIA cards, I've updated the demo with a fix (I had forgotten to disable a vertex array). While I was at it I added some grass: The grass uses stochastic pruning but still generates a lot of geometry, it's just [...]]]></description>
				<content:encoded><![CDATA[<p>A quick update for anyone who was having problems running my stochastic pruning demo on NVIDIA cards: I've updated <a href="http://mmacklin.dreamhosters.com/Plant.zip">the demo</a> with a fix (I had forgotten to disable a vertex array).</p>
<p>While I was at it I added some grass: </p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/02/tree_large.png"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/02/tree_large.png" alt="" title="tree_large" width="510" height="302" class="aligncenter size-full wp-image-595" /></a></p>
<p>The grass uses stochastic pruning but still generates a lot of geometry. It's just one grass tile flipped around and rendered multiple times.  I wanted to see if it would be practical for games to render grass using pure geometry but really you'd need to be much more aggressive with the LOD (Update: apparently the same technique was used in Flower, see comments).</p>
<p>Kevin Boulanger has done some impressive real time <a href="http://www.kevinboulanger.net/grass.html">grass rendering</a> using 3 levels of detail with transitions.  Cool stuff and quite practical by the looks of it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/02/07/stochastic-pruning-2/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Stochastic Pruning for Real-Time LOD</title>
		<link>http://blog.mmacklin.com/2010/01/12/stochastic-pruning-for-real-time-lod/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=stochastic-pruning-for-real-time-lod</link>
		<comments>http://blog.mmacklin.com/2010/01/12/stochastic-pruning-for-real-time-lod/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 07:28:31 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Demo]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Trees]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=480</guid>
		<description><![CDATA[Rendering plants efficiently has always been a challenge in computer graphics, a relatively new technique to address this is Pixar's stochastic pruning algorithm. Originally developed for rendering the desert scenes in Cars, Weta also claim to have used the same technique on Avatar. Although designed with offline rendering in mind it maps very naturally to [...]]]></description>
				<content:encoded><![CDATA[<p>Rendering plants efficiently has always been a challenge in computer graphics. A relatively new technique to address this is <a href="http://graphics.pixar.com/library/StochasticPruning/paper.pdf">Pixar's stochastic pruning algorithm</a>.  Originally developed for rendering the desert scenes in Cars, Weta also <a href="http://www.cgw.com/Publications/CGW/2009/Volume-32-Issue-12-Dec-2009-/CG-In-Another-World.aspx">claim</a> to have used the same technique on Avatar.</p>
<p>Although designed with offline rendering in mind it maps very naturally to the GPU and real-time rendering.  The basic algorithm is this:</p>
<ol>
<li>Build your mesh of N elements (in the case of a tree the elements would be leaves, usually represented by quads)</li>
<li>Sort the elements in random order (a robust way of doing this is to use the <a href="http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher-Yates shuffle</a>)</li>
<li>Calculate the proportion U of elements to render based on distance to the object.</li>
<li>Draw N*U unpruned elements with area scaled by 1/U</li>
</ol>
<p>Putting this onto the GPU is straightforward: pre-shuffle your index buffer (element-wise), then when you come to draw, calculate the unpruned element count using something like:</p>
<p>[sourcecode language="cpp"]<br />
// calculate scaled distance to viewer<br />
float z = max(1.0f, Length(viewerPos-objectPos)/pruneStartDistance);<br />
// distance at which half the leaves will be pruned<br />
float h = 2.0f;<br />
// proportion of elements unpruned<br />
float u = powf(z, -Log(h, 2));<br />
// actual element count<br />
int m = ceil(numElements * u);<br />
// scale factor<br />
float s = 1.0f / u;<br />
[/sourcecode]</p>
<p>Then just submit a modified draw call for m quads:</p>
<p>[sourcecode language="cpp"]<br />
glDrawElements(GL_QUADS, m*4, GL_UNSIGNED_SHORT, 0);<br />
[/sourcecode]</p>
<p>The scale factor computed above preserves the total surface area of all elements, which ensures consistent pixel coverage at any distance.  The scaling by area can be performed efficiently in the vertex shader, meaning no CPU involvement is necessary (aside from setting up the parameters of course).  In a basic implementation you would see elements pop in and out as you change distance but this can be helped by having a transition window that scales elements down before they become pruned (discussed in the original paper).</p>
<div id="attachment_535" class="wp-caption aligncenter" style="width: 500px"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/01/tree_unpruned.png"><img class="size-full wp-image-535" title="tree_unpruned" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/01/tree_unpruned.png" alt="" width="490" height="476" /></a><p class="wp-caption-text">Tree unpruned</p></div>
<div id="attachment_534" class="wp-caption aligncenter" style="width: 500px"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/01/tree_pruned.png"><img class="size-full wp-image-534" title="tree_pruned" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/01/tree_pruned.png" alt="" width="490" height="476" /></a><p class="wp-caption-text">Tree pruned to 10% of original</p></div>
<p>Billboards still have their place but it seems like this kind of technique could have applications for many effects, grass and particle systems being obvious ones.</p>
<p>I've updated my previous tree demo with an implementation of stochastic pruning and a few other changes:</p>
<ul>
<li>Fixed some bugs with ATI driver compatibility</li>
<li>Preetham based sky-dome</li>
<li>Optimised shadow map generation</li>
<li>Some new example plants</li>
<li>Tweaked leaf and branch shaders</li>
</ul>
<p>You can download the demo <a href="http://mmacklin.dreamhosters.com/Plant.zip">here</a></p>
<p>I use the <a href="http://algorithmicbotany.org/papers/selforg.sig2009.html">Self-organising tree models for image synthesis</a> algorithm (from SIGGRAPH09) to generate the trees which I have posted about <a href="http://mmack.wordpress.com/2009/09/28/trees-the-green-kind/">previously</a>.</p>
<p>While I was researching I also came across <a href="http://www.cg.tuwien.ac.at/research/publications/2009/Habel_09_PGT/">Physically Guided Animation of Trees</a> from Eurographics 2009. They have some great videos of real-time animated trees.</p>
<p>I've also posted <a href="http://www.mendeley.com/collections/729981/Algorithmic-Botany/">my collection of plant modelling papers</a> onto Mendeley (great tool for organising pdfs!).</p>
<div id="attachment_527" class="wp-caption aligncenter" style="width: 500px"><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/01/tree_lowsym1.png"><img class="size-full wp-image-527" title="tree_lowsym" src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2010/01/tree_lowsym1.png" alt="" width="490" height="429" /></a><p class="wp-caption-text">Tree pruned to 70% of original</p></div>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2010/01/12/stochastic-pruning-for-real-time-lod/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Sky</title>
		<link>http://blog.mmacklin.com/2009/12/31/sky/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=sky</link>
		<comments>http://blog.mmacklin.com/2009/12/31/sky/#comments</comments>
		<pubDate>Thu, 31 Dec 2009 22:46:48 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=447</guid>
		<description><![CDATA[I had been meaning to implement Preetham's analytic sky model ever since I first came across it years ago. Well I finally got around to it and was pleased to find it's one of those few papers that gives you pretty much everything you need to put together an implementation (although with over 50 unique [...]]]></description>
				<content:encoded><![CDATA[<p>I had been meaning to implement <a href="http://www.cs.utah.edu/~shirley/papers/sunsky/sunsky.pdf">Preetham's analytic sky model</a> ever since I first came across it years ago.  Well I finally got around to it and was pleased to find it's one of those few papers that gives you pretty much everything you need to put together an implementation (although with over 50 unique constants you need to be careful with your typing).</p>
<p>I integrated it into my path tracer which made for some nice images:</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/sky_t2.png"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/sky_t2.png" alt="" title="sky_t2" width="510" height="352" class="aligncenter size-full wp-image-459" /></a></p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/sky_t2_am.png"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/sky_t2_am.png" alt="" title="sky_t2_am" width="510" height="352" class="aligncenter size-full wp-image-456" /></a></p>
<p>Also a small <a href="http://www.youtube.com/watch?v=Ptrq16x20rk">video</a>.</p>
<p>It looks like the technique has been surpassed now by <a href="http://hal.archives-ouvertes.fr/docs/00/28/87/58/PDF/article.pdf">Precomputed Atmospheric Scattering</a> but it's still useful for generating environment maps / SH lights.</p>
<p>I also fixed a load of bugs in my path tracer.  I was surprised to find that on my new i7 quad-core (8 logical threads) renders with 8 worker threads were only about twice as fast as with a single worker.  Given the embarrassingly parallel nature of path tracing you would expect at least a factor of 4 decrease in render time.</p>
<p>It turns out the problem was contention in the OS allocator.  Because I allocate BRDF objects per-intersection there was a lot of overhead there (more than I had expected).  I added a per-thread memory arena, where each worker thread has a pool of memory to allocate from linearly during a trace; allocations are never freed individually and the pool is just reset per-path.</p>
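<p>A minimal sketch of the arena idea, assuming C++11.  The names <code>MemoryArena</code> and <code>ThreadArena</code> and the 1 MB pool size are hypothetical choices for illustration, not the renderer's actual code:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator. One instance per worker thread, so no locking.
class MemoryArena {
public:
    explicit MemoryArena(std::size_t capacityBytes)
        : buffer_(capacityBytes), offset_(0) {}

    // Returns `size` bytes aligned to `align` (a power of two),
    // or nullptr when the pool is exhausted.
    void* Allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t aligned =
            (base + offset_ + align - 1) & ~(std::uintptr_t(align) - 1);
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > buffer_.size() || size > buffer_.size() - start)
            return nullptr;
        offset_ = start + size;
        return buffer_.data() + start;
    }

    // Called once per path: everything allocated during the trace is
    // discarded in one go, nothing is freed individually.
    void Reset() { offset_ = 0; }

    std::size_t Used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};

// Each worker thread lazily gets its own arena (1 MB is an arbitrary size).
inline MemoryArena& ThreadArena() {
    thread_local MemoryArena arena(1 << 20);
    return arena;
}
```

<p>A worker would placement-new its BRDF objects into <code>ThreadArena().Allocate(...)</code> and call <code>Reset()</code> when the path terminates.  Because destructors never run, this only suits objects that own no other resources.</p>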
<p>This had the following effect on render times:</p>
<p><code>1 thread:  128709ms-&gt;35553ms  (3.6x faster)<br />
8 threads: 54071ms-&gt;8235ms   (6.5x faster!)</code></p>
<p>You might also notice that the total speed-up is not linear with the number of workers: even with the arena, 8 threads are only about 4.3x faster than one (35553ms vs 8235ms).  It tails off as the 4 'real' execution units are used up, so hyper-threading doesn't seem to be too effective here.  I suspect this is because such simple scenes don't provide enough opportunity for swapping the thread states.</p>
<p>The HT numbers seem to roughly agree with what people are <a href="http://ompf.org/forum/viewtopic.php?f=1&amp;t=1076&amp;p=10626&amp;hilit=hyperthreading#p10626">reporting</a> on the Ompf forums (~20% improvement).</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/12/31/sky/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Path Tracing</title>
		<link>http://blog.mmacklin.com/2009/12/02/path-tracing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=path-tracing</link>
		<comments>http://blog.mmacklin.com/2009/12/02/path-tracing/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 06:59:48 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=409</guid>
		<description><![CDATA[A few of us at work have been having a friendly path-tracing competition (greets to Tom &#38; Dom). It's been a lot of fun and comparing images in the office each Monday morning is a great motivation to get features in and renders out. I thought I'd write a post about it to record my [...]]]></description>
				<content:encoded><![CDATA[<p>A few of us at work have been having a friendly path-tracing competition (greets to Tom &amp; Dom).  It's been a lot of fun and comparing images in the office each Monday morning is a great motivation to get features in and renders out.  I thought I'd write a post about it to record my progress and gather links to some reference material.</p>
<p>Here's a list of features I've implemented so far and some pics below:</p>
<ul>
<li>Monte-Carlo path tracing with explicit area light sampling at each step</li>
<li>Stratified image sampling</li>
<li>Importance sampled Lambert and Blinn BRDFs</li>
<li>Sphere, Plane, Disc, Metaball and Distance Field primitives (no triangles yet)</li>
<li>Multi-threaded tile renderer</li>
<li>Cross-compiles for PS3 on Linux (runs on SPUs)</li>
<li>Quite general shade-trees with Perlin noise etc</li>
</ul>
<p>
<br />
<a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/shinysphere1.jpg"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/shinysphere1.jpg" alt="" title="shinysphere" width="510" height="352" class="aligncenter size-full wp-image-424" /></a><br />
<a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/shinyblob.jpg"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/shinyblob.jpg" alt="" title="shinyblob" width="510" height="352" class="aligncenter size-full wp-image-415" /></a><br />
<a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/diffuse_spheres.jpg"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/diffuse_spheres.jpg" alt="" title="diffuse_spheres" width="510" height="352" class="aligncenter size-full wp-image-429" /></a>
</p>
<p>Sphere-tracing the distance fields produced some cool effects (the blobby sphere above).  I first heard about the technique from <a href="http://www.iquilezles.org/www/">Inigo Quilez</a> who used it to generate an amazing image in his <a href="http://www.iquilezles.org/www/articles/raymarchingdf/raymarchingdf.htm">slisesix</a> demo.  He has some good descriptions on his page, but for the details I would check out these papers:</p>
<ul>
<li><a href="http://graphics.cs.uiuc.edu/~jch/papers/zeno.pdf">Sphere tracing: a geometric method for the antialiased ray tracing of implicit surfaces</a></li>
<li><a href="http://graphics.cs.uiuc.edu/~jch/papers/dl.pdf">A Lipschitz Method for Accelerated Volume Rendering</a></li>
</ul>
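<p>For a feel of how sphere tracing works, here is a minimal sketch.  The unit-sphere distance function, the step limit and the epsilon are assumptions picked for illustration, and <code>SphereTrace</code> is a hypothetical name:</p>

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float Length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Example distance field: a unit sphere centred at the origin.
inline float SceneDistance(Vec3 p) { return Length(p) - 1.0f; }

// Marches along the ray, stepping each time by the distance to the nearest
// surface. That distance is always a safe step because nothing can be closer.
// Returns the hit distance along the ray, or -1 on a miss.
// `dir` must be normalised.
inline float SphereTrace(Vec3 origin, Vec3 dir, float maxDist = 100.0f,
                         int maxSteps = 128, float epsilon = 1e-4f) {
    assert(std::fabs(Length(dir) - 1.0f) < 1e-3f);
    float t = 0.0f;
    for (int i = 0; i < maxSteps && t < maxDist; ++i) {
        const float d = SceneDistance(origin + dir * t);
        if (d < epsilon) return t;  // close enough to count as a hit
        t += d;
    }
    return -1.0f;
}
```

<p>Swapping <code>SceneDistance</code> for a blobby or noise-perturbed field works the same way, as long as the function never overestimates the true distance.</p>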
<p>
<br />
And for global illumination and path-tracing in general:</p>
<ul>
<li><a href="http://graphics.pixar.com/library/HQRenderingCourse/paper.pdf">High Quality Rendering using Ray Tracing and Photon Mapping (SIGGRAPH 2007)</a></li>
<li><a href="http://www.pbrt.org/">Physically Based Rendering</a></li>
<li><a href="http://graphics.stanford.edu/papers/veach_thesis/">Robust Monte Carlo Methods for Light Transport Simulation</a></li>
<li><a href="http://www.cs.ucl.ac.uk/teaching/4074/Jesper/Matt_Pharr_reading.htm">Notes from Matt Pharr on implementing your first path tracer</a></li>
<li><a href="http://www.kevinbeason.com/scs/pane/">Kevin Beason's renderer Pane</a></li>
</ul>
<p>
<br />
Also, this is what happens when you push Perlin too far:</p>
<p><a href="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/devilmusic.jpg"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/12/devilmusic.jpg" alt="" title="devilmusic" width="256" height="256" class="aligncenter size-full wp-image-416" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/12/02/path-tracing/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Trees (the green kind)</title>
		<link>http://blog.mmacklin.com/2009/09/28/trees-the-green-kind/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=trees-the-green-kind</link>
		<comments>http://blog.mmacklin.com/2009/09/28/trees-the-green-kind/#comments</comments>
		<pubDate>Mon, 28 Sep 2009 17:13:26 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=320</guid>
		<description><![CDATA[I've always had an interest in computer generated plants so I was pleased to read Self-organising tree models for image synthesis from Siggraph this year. The paper basically pulls together a bunch of techniques that have been around for a while and uses them to generate some really good looking tree models. Seeing as I've [...]]]></description>
				<content:encoded><![CDATA[<p>I've always had an interest in computer generated plants so I was pleased to read <a href="http://algorithmicbotany.org/papers/selforg.sig2009.html">Self-organising tree models for image synthesis</a> from Siggraph this year.</p>
<p>The paper basically pulls together a bunch of techniques that have been around for a while and uses them to generate some really good looking tree models.</p>
<p>Seeing as I've had a bit of time on my hands between finishing Batman and starting at LucasArts, I decided to put together an implementation in OpenGL (being a games programmer I want realtime feedback).</p>
<p>Some screenshots below and a Win32 executable available - <a href="http://mmacklin.dreamhosters.com/Plant.zip">Plant.zip</a></p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/09/tree_1.jpg" alt="tree_1" title="tree_1" width="480" height="480" class="aligncenter size-full wp-image-323" /><br />
<img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/09/tree_2_bare.jpg" alt="tree_2_bare" title="tree_2_bare" width="480" height="480" class="aligncenter size-full wp-image-325" /><br />
<img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/09/tree_2_leaves.jpg" alt="tree_2_leaves" title="tree_2_leaves" width="480" height="480" class="aligncenter size-full wp-image-326" /></p>
<p>Some Notes:</p>
<p>I implemented both the space colonisation and shadow propagation methods.  The space colonisation is nice in that you can draw where the plant should grow by placing space samples with the mouse.  This allows some pretty funky topiary, but I found it difficult to grow convincing real-world plants with this method.  The demo only uses the shadow propagation method.</p>
<p>Creating the branch geometry from generalised cylinders requires generating a continuous coordinate frame along a curve without any twists or knots.  I used a parallel transport frame for this, which worked out really nicely.  These two papers describe the technique and the problem:</p>
<p><a href="http://www.cs.indiana.edu/pub/techreports/TR425.pdf">Parallel Transport Approach to Curve Framing</a><br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.8658">Quaternion Gauss Maps and Optimal Framings of Curves and Surfaces (1998)</a></p>
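<p>The parallel transport idea can be sketched roughly as follows, assuming the curve is given as a polyline with distinct consecutive points.  The finite-difference tangents and the choice of starting normal are assumptions, and <code>ParallelTransportFrames</code> is a hypothetical name rather than the demo's code:</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 Cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline float Length(Vec3 a) { return std::sqrt(Dot(a, a)); }
inline Vec3 Normalize(Vec3 a) { return a * (1.0f / Length(a)); }

struct Frame { Vec3 tangent, normal, binormal; };

// Rodrigues' formula: rotate v about the unit vector `axis` by `angle`.
inline Vec3 Rotate(Vec3 v, Vec3 axis, float angle) {
    const float c = std::cos(angle), s = std::sin(angle);
    return v * c + Cross(axis, v) * s + axis * (Dot(axis, v) * (1.0f - c));
}

inline std::vector<Frame> ParallelTransportFrames(const std::vector<Vec3>& points) {
    assert(points.size() >= 2);
    const std::size_t n = points.size();
    std::vector<Frame> frames(n);

    // Tangents from finite differences (one-sided at the two ends).
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3 a = points[i > 0 ? i - 1 : i];
        const Vec3 b = points[i + 1 < n ? i + 1 : i];
        frames[i].tangent = Normalize(b - a);
    }

    // Any unit vector perpendicular to the first tangent can seed the frame.
    const Vec3 t0 = frames[0].tangent;
    const Vec3 helper = std::fabs(t0.x) < 0.9f ? Vec3{1.0f, 0.0f, 0.0f}
                                               : Vec3{0.0f, 1.0f, 0.0f};
    frames[0].normal = Normalize(Cross(t0, helper));
    frames[0].binormal = Cross(t0, frames[0].normal);

    // Carry the normal along the curve using the smallest rotation that
    // takes the previous tangent onto the current one. This avoids twist.
    for (std::size_t i = 1; i < n; ++i) {
        const Vec3 prevT = frames[i - 1].tangent;
        const Vec3 curT = frames[i].tangent;
        const Vec3 axis = Cross(prevT, curT);
        const float sinAngle = Length(axis);
        Vec3 normal = frames[i - 1].normal;
        if (sinAngle > 1e-6f) {
            const float angle = std::atan2(sinAngle, Dot(prevT, curT));
            normal = Rotate(normal, axis * (1.0f / sinAngle), angle);
        }
        frames[i].normal = normal;
        frames[i].binormal = Cross(curT, normal);
    }
    return frames;
}
```

<p>The normal and binormal at each point then give the two axes of the ring of vertices that is extruded around the branch.</p>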
<p>Getting the lighting and leaf materials to look vaguely realistic took quite a lot of tweaking and I'm not totally happy with it.  Until I implemented self-shadowing on the trunk and leaves it looked very weird.  Also you need to account for the transmission you get through the leaves when looking toward the light:</p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/09/tree_inside.jpg" alt="tree_inside" title="tree_inside" width="480" height="470" class="aligncenter size-full wp-image-327" /></p>
<p>There is a nice <a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch04.html">article</a> in GPU Gems 3 on how SpeedTree do this.</p>
<p>The leaves are normal mapped with a simple Phong specular.  I messed about with various modified diffuse models like half-Lambert but eventually just went with standard Lambert.  It would be interesting to use a more sophisticated ambient term.</p>
<p>There is still a lot of scope for performance optimisation.  The leaves are alpha-tested right now so it's doing loads of redundant fragment shader work (something like <a href="http://www.humus.name/index.php?page=Cool&amp;ID=8">Emil Persson's particle trimmer</a> would be useful here).</p>
<p>If you want to take a look at the source code, drop me an <a href="http://mmack.wordpress.com/about/">email</a>.</p>
<p>Known issues:</p>
<p>On my NVIDIA card when the vert count is &gt; 10^6 it runs like a dog; I need to break it up into smaller vertex buffers.</p>
<p>Some ATI mobile drivers don't like the variable number of shadow mapping samples.  If that's your card then I recommend hacking the shaders to disable it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/09/28/trees-the-green-kind/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Atomic float+</title>
		<link>http://blog.mmacklin.com/2009/08/19/atomic-float/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=atomic-float</link>
		<comments>http://blog.mmacklin.com/2009/08/19/atomic-float/#comments</comments>
		<pubDate>Wed, 19 Aug 2009 12:50:56 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=305</guid>
		<description><![CDATA[No hardware I know of has atomic floating point operations but here's a handy little code snippet from Matt Pharr over on the PBRT mailing list which emulates the same functionality using an atomic compare and swap: inline float AtomicAdd(volatile float *val, float delta) { union bits { float f; int32_t i; }; bits oldVal, [...]]]></description>
				<content:encoded><![CDATA[<p>No hardware I know of has atomic floating point operations but here's a handy little code snippet from Matt Pharr over on the <a href="http://groups.google.com/group/pbrt">PBRT mailing list</a> which emulates the same functionality using an atomic compare and swap:</p>
<pre class="prettyprint linenums">
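inline float AtomicAdd(volatile float *val, float delta) {
     // View the same 32 bits as either a float or an integer
     union bits { float f; int32_t i; };
     bits oldVal, newVal;
     do {
         // Read the current value and compute the sum we would like to store
         oldVal.f = *val;
         newVal.f = oldVal.f + delta;
         // Store only if *val still holds oldVal, otherwise another thread got in first so retry
     } while (AtomicCompareAndSwap(*((AtomicInt32 *)val),
         newVal.i, oldVal.i) != oldVal.i);
     return newVal.f;
}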
</pre>
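<p>For comparison, the same compare-and-swap loop can be sketched with the C++11 standard library, assuming a compiler that provides <code>std::atomic&lt;float&gt;</code>.  <code>AtomicAddFloat</code> is a hypothetical name and this is not part of PBRT:</p>

```cpp
#include <atomic>

// Adds `delta` to `val` atomically and returns the new value.
inline float AtomicAddFloat(std::atomic<float>& val, float delta) {
    float oldVal = val.load(std::memory_order_relaxed);
    float newVal;
    do {
        newVal = oldVal + delta;
        // compare_exchange_weak compares the stored bits against oldVal.
        // On failure it reloads oldVal with the current value, so the
        // loop simply recomputes the sum and tries again.
    } while (!val.compare_exchange_weak(oldVal, newVal));
    return newVal;
}
```

<p>The comparison is bitwise, just like the union trick above, so it behaves the same way as the integer compare-and-swap version.</p>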
<p>In unrelated news, I've taken a job at LucasArts which I'll be starting soon.  I'm sad to say goodbye to Rocksteady; they're a great company to work for and I'll miss the team there.</p>
<p>Looking forward to San Francisco though, 12 hours closer to my home town (Auckland, New Zealand) and maybe now I can finally get along to Siggraph or GDC.  If anyone has some advice on where to live there please let me know!</p>
<p>Also a few weeks in between jobs so hopefully time to write some code and finish off all the tourist activities we never got around to in London.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/08/19/atomic-float/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Fin!</title>
		<link>http://blog.mmacklin.com/2009/08/14/fin/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=fin</link>
		<comments>http://blog.mmacklin.com/2009/08/14/fin/#comments</comments>
		<pubDate>Fri, 14 Aug 2009 12:49:11 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=297</guid>
		<description><![CDATA[Batman: Arkham Asylum is finished and the demo is up on PSN and Xbox Live. I was pretty much responsible for the PS3 version on the engineering side so anything wrong with it is ultimately my fault. I think most PS3 engineers working on a cross platform title will tell you that there is always [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.batmanarkhamasylum.com">Batman: Arkham Asylum</a> is finished and the demo is up on PSN and Xbox Live.  I was pretty much responsible for the PS3 version on the engineering side so anything wrong with it is ultimately my fault.  I think most PS3 engineers working on a cross platform title will tell you that there is always some apprehension of the 'side by side comparisons' which are so popular these days.  This one popped up pretty quickly after the demo was released:</p>
<p><a href="http://www.eurogamer.net/articles/digitalfoundry-batman-demo-showdown-blog-entry">http://www.eurogamer.net/articles/digitalfoundry-batman-demo-showdown-blog-entry</a></p>
<p>The article is quite accurate (unlike some of the comments) and it was generally very positive which is great to see as we put a lot of effort into getting parity between the two console versions.</p>
<p>The game has been getting a good <a href="http://img229.imageshack.us/img229/8589/batmanaagireview.jpg">reception</a> which is especially nice given that Batman games have a long tradition of being terrible.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/08/14/fin/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Tim Sweeney&#039;s HPG talk</title>
		<link>http://blog.mmacklin.com/2009/08/14/tim-sweeneys-hpg-talk/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=tim-sweeneys-hpg-talk</link>
		<comments>http://blog.mmacklin.com/2009/08/14/tim-sweeneys-hpg-talk/#comments</comments>
		<pubDate>Fri, 14 Aug 2009 12:14:37 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=293</guid>
		<description><![CDATA[This link was going round our office, a discussion over at Lambda the Ultimate regarding Tim Sweeney's HPG talk. http://lambda-the-ultimate.org/node/3560 Tim chimes in a bit further down in the comments.]]></description>
				<content:encoded><![CDATA[<p>This link was going round our office, a discussion over at Lambda the Ultimate regarding Tim Sweeney's HPG talk.</p>
<p><a href="http://lambda-the-ultimate.org/node/3560">http://lambda-the-ultimate.org/node/3560</a></p>
<p>Tim chimes in a bit further down in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/08/14/tim-sweeneys-hpg-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Handy hints for Bovine occlusion</title>
		<link>http://blog.mmacklin.com/2009/07/24/handy-hints-for-bovine-occlusion/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=handy-hints-for-bovine-occlusion</link>
		<comments>http://blog.mmacklin.com/2009/07/24/handy-hints-for-bovine-occlusion/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 10:42:55 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=289</guid>
		<description><![CDATA[Code517E recently reminded me of a site I've used before when looking up form factors for various geometric configurations. One I had missed the first time though is the differential element on ceiling, floor or wall to cow. http://www.me.utexas.edu/~howell/sectionb/B-68.html Very handy if you're writing a farmyard simulator I'm sure.]]></description>
				<content:encoded><![CDATA[<p>Code517E <a href="http://c0de517e.blogspot.com/2009/07/analytic-diffuse-shading.html">recently </a> reminded me of a site I've used before when looking up form factors for various geometric configurations.</p>
<p>One I had missed the first time though is the differential element on ceiling, floor or wall to cow.</p>
<p><a href="http://www.me.utexas.edu/~howell/sectionb/B-68.html">http://www.me.utexas.edu/~howell/sectionb/B-68.html</a></p>
<p>Very handy if you're writing a farmyard simulator I'm sure.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/07/24/handy-hints-for-bovine-occlusion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Particle lighting</title>
		<link>http://blog.mmacklin.com/2009/06/15/particle-lighting/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=particle-lighting</link>
		<comments>http://blog.mmacklin.com/2009/06/15/particle-lighting/#comments</comments>
		<pubDate>Mon, 15 Jun 2009 22:39:50 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=278</guid>
		<description><![CDATA[I put together an implementation of the particle shadowing technique NVIDIA showed off a while ago. My original intention was to do a survey of particle lighting techniques, in the end I just tried out two different methods that I thought sounded promising. The first was the one ATI used in the Ruby White Out [...]]]></description>
				<content:encoded><![CDATA[<p>I put together an implementation of the particle shadowing technique NVIDIA <a href="http://www.youtube.com/watch?v=xh2q_p6hQEo">showed off</a> a while ago.  My original intention was to do a survey of particle lighting techniques, in the end I just tried out two different methods that I thought sounded promising.</p>
<p>The first was the one ATI used in the Ruby White Out demo, the best take away from it is that they write out the min distance, max distance and density in one pass.  You can do this by setting your RGB blend mode to GL_MIN, your alpha blend mode to GL_ADD and writing out r=z, g=1-z, b=0, a=density for each particle (you can reconstruct the max depth from min(1-z), think of it as the minimum distance from an end point).  Here's a screen:</p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/06/smoke2.jpg" alt="smoke2" title="smoke2" width="510" height="309" class="aligncenter size-full wp-image-281" /></p>
<p>The technique needs a bit of fudging to look OK: blur the depths and add some smoothing functions.  It only works for mostly convex objects, so it is good for amorphous blobs (clouds maybe).  Performance-wise it is probably the best candidate for current-gen consoles.<br />
<a href="http://ati.amd.com/developer/gdc/2007/ArtAndTechnologyOfWhiteout(Siggraph07).pdf">http://ati.amd.com/developer/gdc/2007/ArtAndTechnologyOfWhiteout(Siggraph07).pdf</a></p>
<p>IMO the NVIDIA technique is much nicer visually.  It gives you fairly accurate self-shadowing which looks great, but it is considerably more expensive.  I won't go into the implementation details too much as the paper does a pretty good job of describing it.<br />
<a href="http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/smokeParticles/doc/smokeParticles.pdf">http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/smokeParticles/doc/smokeParticles.pdf</a></p>
<p>The NVIDIA demo uses 32k particles and 32 slices but you can get pretty decent results with far fewer.  Here's a pic of my implementation running on my trusty 7600 with 1000 particles and 10 slices through the volume:</p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/06/smoke1.jpg" alt="smoke1" title="smoke1" width="510" height="337" class="aligncenter size-full wp-image-279" /></p>
<p>Unfortunately you need a lot of fairly transparent particles, otherwise there are noticeable artifacts as particles change order and end up in different slices.  You can improve this by using a non-linear distribution of slices so that you use more slices up front (which works nicely because the extinction for light in participating media is exponential).</p>
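<p>One possible non-linear distribution, sketched under the assumption of a homogeneous medium with a constant extinction coefficient <code>sigma</code>, is to place the slice boundaries so that each slice absorbs an equal share of the light.  This particular mapping and the name <code>SliceBoundaries</code> are illustrative assumptions, not taken from the NVIDIA paper:</p>

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns sliceCount + 1 boundaries in [0, depth], measured from the side of
// the volume facing the light. Transmittance falls off as exp(-sigma * d),
// so giving every slice the same drop in transmittance makes the slices
// thin near the light and thicker towards the back.
inline std::vector<float> SliceBoundaries(int sliceCount, float depth, float sigma) {
    assert(sliceCount > 0 && depth > 0.0f && sigma > 0.0f);
    std::vector<float> bounds(sliceCount + 1);
    const float totalAbsorbed = 1.0f - std::exp(-sigma * depth);
    for (int i = 0; i <= sliceCount; ++i) {
        const float fraction = float(i) / float(sliceCount);
        bounds[i] = -std::log(1.0f - fraction * totalAbsorbed) / sigma;
    }
    bounds[sliceCount] = depth;  // guard against rounding at the far end
    return bounds;
}
```

<p>With 10 slices, a depth of 1 and sigma = 3, the first slice comes out roughly a tenth as thick as the last one.</p>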
<p>Looking forward to tackling some surface shaders next.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/06/15/particle-lighting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Code charity</title>
		<link>http://blog.mmacklin.com/2009/04/01/code-charity/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=code-charity</link>
		<comments>http://blog.mmacklin.com/2009/04/01/code-charity/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 23:11:17 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=240</guid>
		<description><![CDATA[A friend just sent me this: http://playpower.org/ It's a non-profit organisation with the goal of developing educational games for developing countries that run on 8bit NES hardware. The old Nintendo chips are now patent-free and clones are very common: They're trying to recruit programmers with a social conscience, I'm not old-school enough to know 8bit [...]]]></description>
				<content:encoded><![CDATA[<p>A friend just sent me this:</p>
<p><a href="http://playpower.org/">http://playpower.org/</a></p>
<p>It's a non-profit organisation with the goal of developing educational games for developing countries that run on 8-bit NES hardware.  The old Nintendo chips are now patent-free and clones are very common:</p>
<p><a href="http://picasaweb.google.co.in/dereklomas/TVComputer"><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/03/nes.jpg" alt="nes" title="nes" width="509" height="382" class="aligncenter size-full wp-image-241" /></a></p>
<p>They're trying to recruit programmers with a social conscience.  I'm not old-school enough to know 8-bit assembly but then I wouldn't mind learning... who needs GPUs anyway!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/04/01/code-charity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Red balls</title>
		<link>http://blog.mmacklin.com/2009/04/01/red-balls/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=red-balls</link>
		<comments>http://blog.mmacklin.com/2009/04/01/red-balls/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 23:00:13 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Graphics]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=245</guid>
		<description><![CDATA[A small update on my global illumination renderer, I've ported the radiance transfer to the GPU. It was fairly straight forward as my CPU tree structure was already set up for easy GPU traversal, basically just a matter of converting array offsets into texture coordinates and packing into an indices texture. The hardest part is [...]]]></description>
				<content:encoded><![CDATA[<p>A small update on my global illumination renderer: I've ported the radiance transfer to the GPU. It was fairly straightforward as my CPU tree structure was already set up for easy GPU traversal, so it was basically just a matter of converting array offsets into texture coordinates and packing them into an indices texture.</p>
<p>The hardest part is of course wrangling OpenGL to do what you want and give you a proper error message.  This site is easily the best starting point I found for GPGPU stuff:<br />
<a href="http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html">http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html</a></p>
<p>So here's an image.  There are 7850 surfels and it runs in about 20ms on my old-school NVIDIA 7600, so it's still at least an order of magnitude or two slower than you would need for typical game scenes.  But besides that it's fun to pull area lights around in real time.</p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/04/balls2.jpg" alt="balls2" title="balls2" width="506" height="508" class="aligncenter size-full wp-image-249" /></p>
<p>Not as much colour bleeding as you might expect, there is some but it is subtle.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/04/01/red-balls/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tree traversals</title>
		<link>http://blog.mmacklin.com/2009/02/22/tree-traversals/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=tree-traversals</link>
		<comments>http://blog.mmacklin.com/2009/02/22/tree-traversals/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 21:37:27 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Misc]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=235</guid>
		<description><![CDATA[I changed my surfel renderer over to use a pre-order traversal layout for the nodes, this generally gives better cache utilisation and I did see a small speed up from using it. The layout is quite nice because to traverse your tree you just linearly iterate over your array and whenever you find a subtree [...]]]></description>
				<content:encoded><![CDATA[<p>I changed my surfel renderer over to use a pre-order traversal layout for the nodes.  This generally gives better cache utilisation and I did see a small speed-up from using it.  The layout is quite nice because to traverse your tree you just linearly iterate over your array, and whenever you find a subtree you want to skip you just increment your node pointer by the size of that subtree (which is precomputed, see Real-Time Collision Detection 6.6.2).</p>
<p>The best optimisation though comes from compacting the size of the surfel data, which again improves the cache performance.  As some parts of the traversal don't need all of the surfel data it seems to make sense to split things out, for instance to store the hierarchy information and the area separately from the colour/irradiance information.</p>
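<p>The traversal described above can be sketched like this.  The <code>Node</code> layout, the <code>value</code> payload and the callback-style interface are hypothetical, chosen only to keep the example short:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    float value;        // placeholder payload (a surfel cluster in practice)
    int   subtreeSize;  // nodes in this subtree, including this node
};

// Walks a tree stored in pre-order. When `skip(node)` returns true the whole
// subtree is jumped over by advancing the index past it, otherwise the node
// is visited and the walk moves on to its first child (the next element).
template <typename Skip, typename Visit>
void Traverse(const std::vector<Node>& nodes, Skip skip, Visit visit) {
    std::size_t i = 0;
    while (i < nodes.size()) {
        const Node& n = nodes[i];
        assert(n.subtreeSize >= 1);
        if (skip(n)) {
            i += static_cast<std::size_t>(n.subtreeSize);
        } else {
            visit(n);
            i += 1;
        }
    }
}
```

<p>No stack or child pointers are needed: descending is just stepping to the next element, and skipping is one addition.</p>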
<p>In fact it seems like when generalised, this idea leads you to the <a href="http://software.intel.com/en-us/articles/how-to-manipulate-data-structure-to-optimize-memory-use-on-32-bit-intel-architecture/">structure of arrays</a> (SOA) layout, which essentially provides the finest grained breakdown where you only pull into the cache what you use and for all the nodes that you skip over there is no added cost.</p>
<p>I haven't done any timings to see how much of a win this would actually be, mainly because dealing with SOA data is so damn cumbersome.</p>
<p>It definitely seems like something you should do after you've done all your hierarchy building and node shuffling which is just so much more intuitive with structures.  Then you can just 'bake' it down to SOA format and throw it at the GPU/SIMD.</p>
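<p>As a rough illustration of that 'bake' step, here is a sketch that converts an array of structures into separate arrays.  The <code>Surfel</code> fields and the names <code>SurfelArrays</code> and <code>Bake</code> are assumptions for the example, not my actual data layout:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical array-of-structures node, convenient while building and
// shuffling the hierarchy.
struct Surfel {
    float area;
    int   subtreeSize;    // pre-order skip count
    float colour[3];
    float irradiance[3];
};

// Structure-of-arrays version. A traversal that only tests area and
// subtreeSize never pulls colour or irradiance into the cache.
struct SurfelArrays {
    std::vector<float> area;
    std::vector<int>   subtreeSize;
    std::vector<float> colourR, colourG, colourB;
    std::vector<float> irradianceR, irradianceG, irradianceB;
};

// One-off conversion, run after the hierarchy has been finalised.
inline SurfelArrays Bake(const std::vector<Surfel>& surfels) {
    SurfelArrays out;
    const std::size_t n = surfels.size();
    out.area.reserve(n);
    out.subtreeSize.reserve(n);
    out.colourR.reserve(n); out.colourG.reserve(n); out.colourB.reserve(n);
    out.irradianceR.reserve(n); out.irradianceG.reserve(n); out.irradianceB.reserve(n);
    for (const Surfel& s : surfels) {
        out.area.push_back(s.area);
        out.subtreeSize.push_back(s.subtreeSize);
        out.colourR.push_back(s.colour[0]);
        out.colourG.push_back(s.colour[1]);
        out.colourB.push_back(s.colour[2]);
        out.irradianceR.push_back(s.irradiance[0]);
        out.irradianceG.push_back(s.irradiance[1]);
        out.irradianceB.push_back(s.irradiance[2]);
    }
    return out;
}
```

<p>Element order is preserved, so a pre-order index into the original array is still valid for every one of the baked arrays.</p>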
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/02/22/tree-traversals/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PBRT</title>
		<link>http://blog.mmacklin.com/2009/02/22/pbrt/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=pbrt</link>
		<comments>http://blog.mmacklin.com/2009/02/22/pbrt/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 20:25:55 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Misc]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=232</guid>
		<description><![CDATA[I just bought a copy of Physically Based Rendering, I've been meaning to get one for ages as it's often recommended and I thought it might be useful given my recent interest in global illumination. I'm also hoping to get a more formal background in rendering rather than the hacktastic world of real time. The [...]]]></description>
				<content:encoded><![CDATA[<p>I just bought a copy of <a href="http://www.pbrt.org/">Physically Based Rendering</a>, I've been meaning to get one for ages as it's often recommended and I thought it might be useful given my recent interest in global illumination. I'm also hoping to get a more formal background in rendering rather than the hacktastic world of real time.</p>
<p>The subjects covered are broad and it's very readable.  It's my first exposure to <a href="http://en.wikipedia.org/wiki/Literate_programming">literate programming</a>, where essentially the book describes and contains the full implementation of a program.  In fact the source code for their renderer is generated (tangled) from the definition of the book before compilation.</p>
<p>The only problem is how much it weighs: I like to read on the tube but 1000 page hardbacks aren't exactly light reading.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/02/22/pbrt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>(Almost) realtime GI</title>
		<link>http://blog.mmacklin.com/2009/01/21/almost-realtime-gi/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=almost-realtime-gi</link>
		<comments>http://blog.mmacklin.com/2009/01/21/almost-realtime-gi/#comments</comments>
		<pubDate>Wed, 21 Jan 2009 23:40:01 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=226</guid>
		<description><![CDATA[After my initial implementation of surfel based illumination I've extended it to do hierarchical clustering of surfels based on a similar idea to the one presented in GPU Gems 2. A few differences: I'm using a k-means clustering to build the approximation hierarchy bottom up. A couple of iterations of Lloyd's algorithm provides pretty good [...]]]></description>
				<content:encoded><![CDATA[<p>After my initial implementation of surfel based illumination I've extended it to do hierarchical clustering of surfels based on a similar idea to the one presented in GPU Gems 2.</p>
<p>A few differences:</p>
<p>I'm using a k-means clustering to build the approximation hierarchy bottom up.  A couple of iterations of Lloyd's algorithm provides pretty good results.  Really you could get away with one iteration.</p>
<p>To seed the clustering I'm simply selecting every n'th surfel from the source input.  At first I thought I should be choosing a random spatial distribution but it turns out clustering based on regular intervals in the mesh works well.  This is because you will end up with more detail in highly tessellated places, which is what you want (assuming your input has been cleaned). </p>
<p>For example, a two tri wall will be clustered into one larger surfel whereas a highly tessellated sphere will be broken into more clusters.</p>
<p>The error metric I used to do the clustering is this:</p>
<p><code>Error = (1.0f + (1.0f-Dot(surfel.normal, cluster.normal))) * Length(surfel.pos-cluster.pos)<br />
</code><br />
So it's a combination of how aligned the surfel and cluster are and how far away they are from each other.  You can experiment with weighting each of those metrics individually but just summing seems to give good results. </p>
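<p>Here is a minimal sketch of how that metric might slot into the assignment step of a Lloyd iteration.  It reads the expression as <code>(1 + (1 - dot)) * distance</code>, and the <code>Vec3</code>, <code>Disc</code> and <code>NearestCluster</code> names are hypothetical helpers rather than the real code:</p>

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

inline Vec3  operator-(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float Dot(const Vec3& a, const Vec3& b)       { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline float Length(const Vec3& v)                   { return std::sqrt(Dot(v, v)); }

// Just the two fields the metric needs; normals are assumed to be unit length.
struct Disc { Vec3 pos; Vec3 normal; };

// Distance scaled by misalignment: the factor is 1 for parallel normals
// and grows to 3 for opposing ones.
float ClusterError(const Disc& surfel, const Disc& cluster)
{
	return (1.0f + (1.0f - Dot(surfel.normal, cluster.normal))) * Length(surfel.pos - cluster.pos);
}

// Assignment step: index of the cluster with the lowest error.
// Assumes 'clusters' is not empty.
std::size_t NearestCluster(const Disc& surfel, const std::vector<Disc>& clusters)
{
	std::size_t best = 0;
	float bestError = ClusterError(surfel, clusters[0]);
	for (std::size_t i = 1; i < clusters.size(); ++i)
	{
		const float e = ClusterError(surfel, clusters[i]);
		if (e < bestError)
		{
			bestError = e;
			best = i;
		}
	}
	return best;
}
```

<p>With this reading a cluster one unit away but facing the opposite direction scores 3, so it loses to an aligned cluster two units away, which scores 2.</p>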
<p>When summing up the surfels to form your representative cluster surfel you want to:</p>
<p>a) sum the area<br />
b) average the position, normal, emission and albedo weighted by area</p>
<p>Weighting by area is quite important and necessary for the emissive value or you'll basically be adding energy into the simulation.</p>
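<p>Steps (a) and (b) can be sketched roughly as follows.  The <code>Surfel</code> layout and the <code>MergeSurfels</code> name are assumptions, and renormalising the averaged normal is an extra step that isn't described above:</p>

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(const Vec3& a, const Vec3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
inline Vec3 operator*(const Vec3& v, float s)       { return { v.x * s, v.y * s, v.z * s }; }

// Hypothetical surfel layout.
struct Surfel
{
	Vec3  pos, normal, emission, albedo;
	float area;
};

// Builds the representative surfel for a cluster.
// Assumes 'members' is not empty and every area is positive.
Surfel MergeSurfels(const std::vector<Surfel>& members)
{
	Surfel c = {};   // everything starts at zero
	for (const Surfel& s : members)
	{
		c.area     += s.area;                            // (a) sum the area
		c.pos      = c.pos      + s.pos      * s.area;   // (b) area weighted sums
		c.normal   = c.normal   + s.normal   * s.area;
		c.emission = c.emission + s.emission * s.area;
		c.albedo   = c.albedo   + s.albedo   * s.area;
	}

	// Divide by the total area to turn the weighted sums into averages.
	const float invArea = 1.0f / c.area;
	c.pos      = c.pos      * invArea;
	c.emission = c.emission * invArea;
	c.albedo   = c.albedo   * invArea;

	// Assumed extra step: bring the averaged normal back to unit length.
	const float len = std::sqrt(c.normal.x * c.normal.x + c.normal.y * c.normal.y + c.normal.z * c.normal.z);
	if (len > 0.0f)
		c.normal = c.normal * (1.0f / len);

	return c;
}
```

<p>Because emission is averaged rather than summed, emission times area for the cluster equals the total for its members, so the merge doesn't add energy.</p>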
<p>Then you get to the traversal.  Bunnell recommended skipping a sub-tree of surfels if the distance to the query point is &gt; 4*radius of the cluster surfel.  That gives results practically identical to the brute force algorithm and I think you can be more aggressive without losing much quality at all.</p>
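<p>A sketch of that skip test, under a few assumptions: the <code>Node</code> layout, the <code>Traverse</code> and <code>DiscRadius</code> names and the <code>skipFactor</code> parameter are all hypothetical, and the cluster radius is taken to be that of a disc with the cluster's summed area:</p>

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

inline float Distance(const Vec3& a, const Vec3& b)
{
	const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
	return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical hierarchy node: a cluster surfel plus links to its children.
struct Node
{
	Vec3  pos;
	float area;
	std::vector<int> children;   // indices into the node array, empty for a leaf
};

// Radius of a disc with the given area (assumed definition of cluster radius).
inline float DiscRadius(float area) { return std::sqrt(area / 3.14159265f); }

// Calls visit(node) once for every node that should contribute at 'query'.
// A cluster is used as a single surfel when it is a leaf or when it lies
// further than skipFactor * radius away, otherwise we descend into it.
template <typename Visit>
void Traverse(const std::vector<Node>& nodes, int index, const Vec3& query, float skipFactor, Visit visit)
{
	const Node& n = nodes[index];
	const bool farAway = Distance(n.pos, query) > skipFactor * DiscRadius(n.area);

	if (n.children.empty() || farAway)
	{
		visit(n);
		return;
	}

	for (int child : n.children)
		Traverse(nodes, child, query, skipFactor, visit);
}
```

<p>Passing 4.0f as <code>skipFactor</code> matches the recommendation above, and lowering it is the knob for being more aggressive.  The <code>visit</code> callback is where the actual form factor calculation would go.</p>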
<p>I get between a 4-6x speed up using the hierarchy over brute force.  Not quite realtime yet but I haven't optimised the tree structure at all, I'm also not running it on the GPU <img src='http://blog.mmacklin.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>It seems like every post I make here has to reference Christer Ericson somehow but I really recommend his <a href="http://realtimecollisiondetection.net/books/rtcd/">book</a> for ideas about optimising bounding volumes.  Loads of good stuff in there that I have yet to implement.</p>
<p>Links:</p>
<p><a href="http://www1.cs.columbia.edu/~ravir/6160/papers/SHExp.pdf">Real-time Soft Shadows in Dynamic Scenes using Spherical Harmonic Exponentiation</a><br />
<a href="http://www.cs.cornell.edu/~kb/projects/lightcuts/">Lightcuts: a scalable approach to illumination</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/01/21/almost-realtime-gi/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Indirect illumination</title>
		<link>http://blog.mmacklin.com/2009/01/11/indirect-illumination/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=indirect-illumination</link>
		<comments>http://blog.mmacklin.com/2009/01/11/indirect-illumination/#comments</comments>
		<pubDate>Sun, 11 Jan 2009 23:44:57 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=205</guid>
		<description><![CDATA[It's been a while since I checked in on the state of the art in global illumination but there is some seriously cool research happening at the moment. I liked the basic idea Dreamworks used on Shrek2 (An Approximate Global Illumination System for Computer Generated Films) which stores direct illumination in light maps and then [...]]]></description>
				<content:encoded><![CDATA[<p>It's been a while since I checked in on the state of the art in global illumination but there is some seriously cool research happening at the moment. </p>
<p>I liked the basic idea Dreamworks used on Shrek2 (<a href="http://www.tabellion.org/et/paper/siggraph_2004_gi_for_films.pdf">An Approximate Global Illumination System for Computer Generated Films</a>) which stores direct illumination in light maps and then runs a final gather pass on that to calculate one bounce of indirect.  It might be possible to adapt this to real-time if you could pre-compute and store the sample coordinates for each point.</p>
<p>However the current state of the art seems to be the point based approach that was first presented by Michael Bunnell with his GPU Gems 2 article <a href="http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch14.pdf">Dynamic Ambient Occlusion and Indirect Lighting</a>.  He approximates the mesh as a set of oriented discs and computes the radiance transfer dynamically on the GPU.</p>
<p>Turns out Pixar took this idea and now use it on all their movies, Pirates of the Caribbean, Wall-E, etc.  The technique is described here in <a href="http://graphics.pixar.com/library/PointBasedColorBleeding/paper.pdf">Point Based Approximate Color Bleeding</a>.  Bunnell's original algorithm was O(N^2) in the number of surfels but he used a clustered hierarchy to get that down to O(N log N).  Pixar use an octree which stores a more accurate spherical harmonic approximation at each node.</p>
<p>What's really interesting is how far Bunnell has pushed this idea, if you read through Fantasy Lab's <a href="http://www.freepatentsonline.com/7408550.html">recent patent</a> (August 2008), there are some really nice ideas in there that I haven't seen published anywhere.</p>
<p>Here's a summary of what's new since the GPU Gems 2 article:</p>
<p>- Fixed the slightly cumbersome multiple shadow passes by summing 'negative illumination' from back-facing surfels<br />
- Takes advantage of temporal coherence by simply iterating the illumination map each frame<br />
- Added some directional information by subdividing the hemisphere into quadrants<br />
- Threw in some nice subdivision surfaces stuff at the end</p>
<p>Anyway, I knocked up a prototype of the indirect illumination technique and it seems to work quite well.  The patent leaves out loads of information (and spends two pages describing what a GPU is), but it's not too difficult to work out the details (note the form factor calculation is particularly simplified).</p>
<p>Here are the results from a *very* low resolution mesh, in reality you would prime your surfels with direct illumination calculated in the traditional way with shadow mapping / shaders then let the sim bounce it round but in this case I've done the direct lighting using his method as well.</p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/01/gi_shot1.jpg" alt="gi_shot1" title="gi_shot1" width="502" height="498" class="aligncenter size-full wp-image-206" /></p>
<p><img src="http://www.mmacklin.dreamhosters.com/codeblog/wp-content/uploads/2009/01/gi_shot2.jpg" alt="gi_shot2" title="gi_shot2" width="501" height="501" class="aligncenter size-full wp-image-207" /></p>
<p>Disclaimer: some artifacts are visible here due to the way illumination is baked back to the mesh and there aren't really enough surfels to capture all the fine detail but it's quite promising.</p>
<p>This seems like a perfect job for CUDA or Larrabee as the whole algorithm can be run in parallel.  You can do it purely through DirectX or OpenGL but it's kind of nasty.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/01/11/indirect-illumination/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Branch free Clamp()</title>
		<link>http://blog.mmacklin.com/2009/01/09/branch-free-clamp/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=branch-free-clamp</link>
		<comments>http://blog.mmacklin.com/2009/01/09/branch-free-clamp/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 23:03:37 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=193</guid>
		<description><![CDATA[One of my work mates had some code with a lot of floating point clamps in it the other day so I wrote this little branch free version using the PS3's floating point select intrinsic: float Clamp(float x, float lower, float upper) { float t = __fsels(x-lower, x, lower); return __fsels(t-upper, upper, t); } __fsels [...]]]></description>
				<content:encoded><![CDATA[<p>One of my work mates had some code with a lot of floating point clamps in it the other day so I wrote this little branch free version using the PS3's floating point select intrinsic:</p>
<pre class="prettyprint linenums">
float Clamp(float x, float lower, float upper)
{
	float t = __fsels(x-lower, x, lower);
	return __fsels(t-upper, upper, t);
}
</pre>
<p>__fsels basically does this:</p>
<pre class="prettyprint linenums">
float __fsels(float x, float a, float b)
{
	return (x >= 0.0f) ? a : b;
}
</pre>
<p>I measured it to be 8% faster than a standard implementation, not a whole lot but quite fun to write.  The SPUs have quite general selection functionality which is more useful, some stuff about it here:</p>
<p><a href="http://realtimecollisiondetection.net/blog/?p=90">http://realtimecollisiondetection.net/blog/?p=90</a></p>
<p>(Not sure about this free WordPress code formatting, I may have to move it to my own host soon)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/01/09/branch-free-clamp/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Two threads, one cache line</title>
		<link>http://blog.mmacklin.com/2009/01/09/two-threads-one-cache-line/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=two-threads-one-cache-line</link>
		<comments>http://blog.mmacklin.com/2009/01/09/two-threads-one-cache-line/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 22:20:30 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=180</guid>
		<description><![CDATA[An interesting thread going around the GDA mailing list at the moment about multithreaded programming reminded me of a little test app I wrote a while back to measure the cost of two threads accessing memory on the same cache line. The program basically creates two threads which increment a variable a large number of [...]]]></description>
				<content:encoded><![CDATA[<p>An interesting thread going around the <a href="https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list">GDA mailing list</a> at the moment about multithreaded programming reminded me of a little test app I wrote a while back to measure the cost of two threads accessing memory on the same cache line.</p>
<p>The program basically creates two threads which increment a variable a large number of times measuring the time it takes to complete with different distances between the write addresses.  Something like this:</p>
<pre class="prettyprint linenums">
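#include &lt;windows.h&gt;
#include &lt;iostream&gt;
using namespace std;

// QueryPerformanceCounter wrapper returning seconds, defined elsewhere
double GetSeconds();

// aligned so that B[0] starts on a cache line boundary
__declspec (align (512)) volatile int B[128];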


DWORD WINAPI ThreadProc(LPVOID param)
{
	// read/write the address a whole lot
	for (int i=0; i &lt; 10000000; ++i)
	{
		(*(volatile int*)param)++;
	}

	return 0;
}


int main()
{
	volatile int* d1 = &amp;B[0];
	volatile int* d2 = &amp;B[127];

	while (d1 != d2)
	{
		HANDLE threads[2];

		// QPC wrapper
		double start = GetSeconds();

		threads[0] = CreateThread(NULL, 0, ThreadProc, (void*)d1, 0, NULL);
		threads[1] = CreateThread(NULL, 0, ThreadProc, (void*)d2, 0, NULL);

		WaitForMultipleObjects(2, threads, TRUE, INFINITE);

		double end = GetSeconds();

		--d2;

		cout &lt;&lt; (d2-d1) * sizeof(int) &lt;&lt; &quot;bytes apart: &quot; &lt;&lt; (end-start)*1000.0f &lt;&lt; &quot;ms&quot; &lt;&lt; endl;
	}
	
	int i;
	cin &gt;&gt; i;
	return 0;
}
</pre>
<p>On my old P4 with a 64 byte cache line these are the results:<br />
<code><br />
128bytes apart: 17.4153ms<br />
124bytes apart: 17.878ms<br />
120bytes apart: 17.4028ms<br />
116bytes apart: 17.3625ms<br />
112bytes apart: 17.959ms<br />
108bytes apart: 18.0241ms<br />
104bytes apart: 17.2938ms<br />
100bytes apart: 17.6643ms<br />
96bytes apart: 17.5377ms<br />
92bytes apart: 19.3156ms<br />
88bytes apart: 17.2013ms<br />
84bytes apart: 17.9361ms<br />
80bytes apart: 17.1321ms<br />
76bytes apart: 17.5997ms<br />
72bytes apart: 17.4634ms<br />
68bytes apart: 17.6562ms<br />
64bytes apart: 17.4704ms<br />
60bytes apart: 17.9947ms<br />
56bytes apart: 149.759ms ***<br />
52bytes apart: 151.64ms<br />
48bytes apart: 150.132ms<br />
44bytes apart: 125.318ms<br />
40bytes apart: 160.33ms<br />
36bytes apart: 147.889ms<br />
32bytes apart: 152.42ms<br />
28bytes apart: 157.003ms<br />
24bytes apart: 149.552ms<br />
20bytes apart: 142.372ms<br />
16bytes apart: 136.908ms<br />
12bytes apart: 145.691ms<br />
8bytes apart: 146.768ms<br />
4bytes apart: 128.408ms<br />
0bytes apart: 125.655ms<br />
</code><br />
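You can see that from the line labelled 56 bytes there is a large penalty (roughly 9x slower!) as it brings the cache-coherency protocol into play which forces the processor to reload from main memory.  Note that the distance is printed after <code>d2</code> has been decremented, so each label is 4 bytes less than the distance that was actually timed: the slowdown really starts at 60 bytes apart, which is exactly when both addresses land on the same 64 byte cache line.</p>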
<p>Actually it turns out this is called "false sharing" and it's quite well known, the common solution is to pad your shared data to be at least one cache line apart.</p>
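<p>A minimal sketch of the padding fix, using standard C++11 rather than the Win32 calls above.  The 64 byte line size is an assumption that matches the P4 in this test, and the <code>PaddedCounter</code>, <code>Hammer</code> and <code>RunBoth</code> names are made up for the example:</p>

```cpp
#include <cstddef>
#include <thread>

// Assumed cache line size; check what your target actually uses.
constexpr std::size_t kCacheLine = 64;

// alignas also rounds sizeof(PaddedCounter) up to a full line, so
// neighbouring array elements can never share one.
struct alignas(kCacheLine) PaddedCounter
{
	volatile int value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

PaddedCounter gCounters[2];

// read/write the counter a whole lot; each thread only touches its own line
void Hammer(PaddedCounter* c, int iterations)
{
	for (int i = 0; i < iterations; ++i)
		c->value = c->value + 1;
}

// Runs two threads, one per counter, and waits for both to finish.
void RunBoth(int iterations)
{
	std::thread a(Hammer, &gCounters[0], iterations);
	std::thread b(Hammer, &gCounters[1], iterations);
	a.join();
	b.join();
}
```

<p>The cost is memory: each 4 byte counter now occupies a whole line, which is fine for a handful of per-thread values but not something to do to every element of a big array.</p>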
<p>Refs:</p>
<p><a href="http://software.intel.com/en-us/articles/reduce-false-sharing-in-net/">http://software.intel.com/en-us/articles/reduce-false-sharing-in-net/</a><br />
<a href="http://www.ddj.com/embedded/196902836">http://www.ddj.com/embedded/196902836</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2009/01/09/two-threads-one-cache-line/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>More Metaballs</title>
		<link>http://blog.mmacklin.com/2008/12/28/more-metaballs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=more-metaballs</link>
		<comments>http://blog.mmacklin.com/2008/12/28/more-metaballs/#comments</comments>
		<pubDate>Sun, 28 Dec 2008 16:46:09 +0000</pubDate>
		<dc:creator>Miles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://mmack.wordpress.com/?p=171</guid>
		<description><![CDATA[So after running my Metaballs demo on my girlfriend's laptop it appears to be GPU limited due to fillrate. This is mainly due to the overkill number of particles in my test setup and the fact they hang round for so long, but the technique is fillrate heavy so it might be a problem. It'd [...]]]></description>
				<content:encoded><![CDATA[<p>So after running my Metaballs demo on my girlfriend's laptop it appears to be GPU limited due to fillrate.  This is mainly due to the overkill number of particles in my test setup and the fact they hang round for so long, but the technique is fillrate heavy so it might be a problem.</p>
<p>It'd be nice to do a multithreaded CPU implementation to see how that compares, but the advantage of the current method is that it keeps the CPU free to do other things.</p>
<p>You could probably get more performance in some cases by uploading all the metaballs as shader parameters (or as a texture) and evaluating them directly in the pixel shader.  Also I just realised I could probably render the 'density' texture at a lower resolution for a cheap speedup.</p>
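<p>To make those two ideas concrete, here is a CPU-side sketch of what evaluating the balls directly per pixel would compute, plus the lower resolution density pass.  The r^2/d^2 falloff, the threshold test and all of the names are assumptions for illustration, not what the demo does:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2D metaball in screen space.
struct Metaball { float x, y, radius; };

// Field value at a pixel: the sum of one falloff term per ball.
float Density(const std::vector<Metaball>& balls, float px, float py)
{
	float sum = 0.0f;
	for (const Metaball& b : balls)
	{
		const float dx = px - b.x;
		const float dy = py - b.y;
		const float d2 = dx * dx + dy * dy + 1e-6f;   // avoid dividing by zero at the centre
		sum += (b.radius * b.radius) / d2;
	}
	return sum;
}

// A pixel is inside the surface when the summed field reaches the threshold.
inline bool Inside(const std::vector<Metaball>& balls, float px, float py, float threshold)
{
	return Density(balls, px, py) >= threshold;
}

// Fills a density grid at 1/downscale of the target resolution, sampling
// at the centre of each coarse texel.
std::vector<float> RenderDensity(const std::vector<Metaball>& balls, int width, int height, int downscale)
{
	const int w = width / downscale;
	const int h = height / downscale;
	std::vector<float> grid(static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
	for (int y = 0; y < h; ++y)
		for (int x = 0; x < w; ++x)
			grid[static_cast<std::size_t>(y) * w + x] = Density(balls, (x + 0.5f) * downscale, (y + 0.5f) * downscale);
	return grid;
}
```

<p>The per pixel cost here grows with the number of balls rather than with overdraw, which is why it would only win in some cases, and a <code>downscale</code> of 2 cuts the number of density samples to a quarter.</p>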
]]></content:encoded>
			<wfw:commentRss>http://blog.mmacklin.com/2008/12/28/more-metaballs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
