FLEX is the name of the new GPU physics solver I have been working on at NVIDIA. It was announced at the Montreal editor's day a few weeks ago, and today we have released some more information in the form of a video trailer and a Q&A with the PhysX fan site.

The solver builds on my Position Based Fluids work, but adds many new features such as granular materials, clothing, pressure constraints, lift + drag model, rigid bodies with plastic deformation, and more. Check out the video below and see the article for more details on what FLEX can do.

During the presentation I showed videos of some more recent results including two-way coupling of fluids with clothing and rigid bodies. They're embedded below:

Overall it has been a great SIGGRAPH, I met tons of new people who provided lots of inspiration for new research ideas. Thanks!

Working on a distributed team means that often the best way to share new results is via video captures of simulations. Previously I would do this by dumping uncompressed frames from OpenGL to disk, and then compressing with FFmpeg. I prefer this over tools like Fraps because it gives more control over compression quality, and has no watermarking or time limits.

The problem with this method is simply that saving uncompressed frames generates a large amount of data that quickly fills up the write cache and slows the whole system down during capture; it also makes FFmpeg disk-bound on reads during encoding.

Thankfully there is a better alternative: by using a direct pipe between the app and FFmpeg you can avoid this disk IO entirely. I couldn't find a concise example of this on the web, so here's how to do it in a Win32 GLUT app.

At startup:

#include <stdio.h>
// start ffmpeg telling it to expect raw rgba 720p-60hz frames
// -i - tells it to read frames from stdin
const char* cmd = "ffmpeg -r 60 -f rawvideo -pix_fmt rgba -s 1280x720 -i - "
"-threads 0 -preset fast -y -crf 21 -vf vflip output.mp4";
// open pipe to ffmpeg's stdin in binary write mode
FILE* ffmpeg = _popen(cmd, "wb");
int* buffer = new int[width*height];

After rendering each frame, grab back the framebuffer and send it straight to the encoder:
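A minimal per-frame capture might look like the following sketch (using the width, height, buffer, and ffmpeg variables from the startup snippet; the vertical flip is handled by the -vf vflip option passed to FFmpeg above):

// read back the framebuffer and push the raw frame straight down the pipe
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, buffer);
fwrite(buffer, sizeof(int)*width*height, 1, ffmpeg);

At shutdown, close the pipe so FFmpeg can flush and finalise the file:

_pclose(ffmpeg);
delete[] buffer;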

Position Based Fluids (PBF) is the title of our paper that has been accepted for presentation at SIGGRAPH 2013. I've set up a project page where you can download the paper and all the related content here:

I have continued working on the technique since the submission, mainly improving the rendering, and adding features like spray and foam (based on the excellent paper from the University of Freiburg: Unified Spray, Foam and Bubbles for Particle-Based Fluids). You can see the results in action below, but I recommend checking out the project page and downloading the videos, they look great at full resolution and 60hz.

This post is about generating meshes for finite element simulations. I'll be covering other aspects of FEM based simulation in a later post, until then I recommend checking out Matthias Müller's very good introduction in the SIGGRAPH 2008 Real Time Physics course [1].

After spending the last few weeks reading, implementing and debugging meshing algorithms I have a new-found respect for people in this field. It is amazing how many ways meshes can "go wrong", even the experts have it tough:

“I hate meshes. I cannot believe how hard this is. Geometry is hard.”
— David Baraff, Senior Research Scientist, Pixar Animation Studios

Meshing algorithms are hard, but unless you are satisfied simulating cantilever beams and simple geometric shapes you will eventually need to deal with them.

My goal was to find an algorithm that would take an image as input, and produce as output a good quality triangle mesh that conformed to the boundary of any non-zero regions in the image.

My first attempt was to perform a coarse grained edge detect and generate a Delaunay triangulation of the resulting point set. The input image and the result of a low-res edge detect:

This point set can be converted to a mesh by any Delaunay triangulation method, the Bowyer-Watson algorithm is probably the simplest. It works by inserting one point at a time, removing any triangles whose circumcircle is encroached by the new point and re-tessellating the surrounding edges. A nice feature is that the algorithm has a direct analogue for tetrahedral meshes, triangles become tetrahedra, edges become faces and circumcircles become circumspheres.
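To make the re-tessellation step concrete, here is a rough sketch of a single Bowyer-Watson insertion in C++; the Point/Triangle types and the InCircumcircle() test are placeholders rather than code from any particular library:

#include <vector>
#include <utility>

struct Point { float x, y; };
struct Triangle { int a, b, c; };   // indices into the point array

// assumed helper: returns true if p lies inside the circumcircle of t
bool InCircumcircle(const Triangle& t, const std::vector<Point>& points, const Point& p);

void InsertPoint(std::vector<Point>& points, std::vector<Triangle>& tris, const Point& p)
{
    int newIndex = int(points.size());
    points.push_back(p);

    // partition triangles into those whose circumcircle contains the new point and the rest
    std::vector<Triangle> bad, keep;
    for (const Triangle& t : tris)
        (InCircumcircle(t, points, p) ? bad : keep).push_back(t);

    // gather all edges of the removed triangles
    std::vector<std::pair<int, int> > edges;
    for (const Triangle& t : bad)
    {
        edges.push_back(std::make_pair(t.a, t.b));
        edges.push_back(std::make_pair(t.b, t.c));
        edges.push_back(std::make_pair(t.c, t.a));
    }

    // the cavity boundary is the set of edges belonging to exactly one removed triangle
    std::vector<std::pair<int, int> > boundary;
    for (size_t i=0; i < edges.size(); ++i)
    {
        bool shared = false;
        for (size_t j=0; j < edges.size(); ++j)
        {
            if (i == j)
                continue;

            if ((edges[i].first == edges[j].first && edges[i].second == edges[j].second) ||
                (edges[i].first == edges[j].second && edges[i].second == edges[j].first))
                shared = true;
        }

        if (!shared)
            boundary.push_back(edges[i]);
    }

    // re-tessellate the cavity by connecting each boundary edge to the new point
    tris = keep;
    for (size_t i=0; i < boundary.size(); ++i)
        tris.push_back({ boundary[i].first, boundary[i].second, newIndex });
}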

Here's an illustration of how Bowyer/Watson proceeds to insert the point in red into the mesh:

And here is the Delaunay triangulation of the Armadillo point set:

As you can see, Delaunay triangulation algorithms generate the convex hull of the input points. But we want a mesh that conforms to the shape boundary - one way to fix this is to sample the image at each triangle's centroid; if the sample lies outside the shape then simply throw away the triangle. This produces:
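In code the filtering step is just a centroid test per triangle; here's a sketch, with a minimal Image stand-in for whatever representation the input uses:

#include <vector>

struct Point { float x, y; };
struct Triangle { int a, b, c; };

// minimal image stand-in, nearest-neighbour lookup in pixel coordinates
struct Image
{
    int width, height;
    std::vector<unsigned char> pixels;

    unsigned char Sample(float x, float y) const { return pixels[int(y)*width + int(x)]; }
};

void RemoveExteriorTriangles(const std::vector<Point>& points, std::vector<Triangle>& tris, const Image& image)
{
    std::vector<Triangle> interior;

    for (const Triangle& t : tris)
    {
        // centroid is the average of the three vertices
        float cx = (points[t.a].x + points[t.b].x + points[t.c].x) / 3.0f;
        float cy = (points[t.a].y + points[t.b].y + points[t.c].y) / 3.0f;

        // keep the triangle only if its centroid lands on a non-zero pixel
        if (image.Sample(cx, cy) != 0)
            interior.push_back(t);
    }

    tris = interior;
}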

Much better! Now we have a reasonably good approximation of the input shape. Unfortunately, FEM simulations don't work well with long thin "sliver" triangles. This is due to interpolation error and because a small movement in one of the triangle's vertices leads to large forces, which leads to inaccuracy and small time steps [2].

Before we look at ways to improve triangle quality it's worth talking about how to measure it. One measure that works well in 2D is the ratio of the triangle's circumradius to its shortest edge. A smaller ratio indicates a higher quality triangle, which intuitively seems reasonable: long skinny triangles have a large circumradius but one very short edge:

The triangle on the left, which is equilateral, has a ratio of \(1/\sqrt{3} \approx 0.58\) and is the best possible triangle by this measure. The triangle on the right has a ratio of ~8.7; note that the circumcenters of sliver triangles tend to fall outside the triangle itself.
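For reference, a small function computing this measure might look like the following sketch (using the standard circumradius formula R = abc / 4A):

#include <cmath>

struct Point { float x, y; };

// circumradius to shortest-edge ratio, lower is better (1/sqrt(3) for an equilateral triangle)
float QualityRatio(const Point& a, const Point& b, const Point& c)
{
    // edge lengths
    float la = sqrtf((b.x-c.x)*(b.x-c.x) + (b.y-c.y)*(b.y-c.y));
    float lb = sqrtf((a.x-c.x)*(a.x-c.x) + (a.y-c.y)*(a.y-c.y));
    float lc = sqrtf((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y));

    // twice the triangle area from the cross product
    float area2 = fabsf((b.x-a.x)*(c.y-a.y) - (b.y-a.y)*(c.x-a.x));

    // circumradius R = (la*lb*lc) / (4*area)
    float circumradius = (la*lb*lc) / (2.0f*area2);

    float shortestEdge = fminf(la, fminf(lb, lc));

    return circumradius / shortestEdge;
}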

Delaunay refinement

Methods such as Chew's algorithm and Ruppert's algorithm are probably the most well known refinement algorithms. They attempt to improve mesh quality while maintaining the Delaunay property (no vertex encroaching a triangle's circumcircle). This is typically done by inserting the circumcenter of low-quality triangles and subdividing edges.

Jonathan Shewchuk's "ultimate guide" has everything you need to know, and there is Triangle, an open source tool to generate high quality triangulations.

Unfortunately these algorithms require an accurate polygonal boundary as input, as the output is sensitive to the input segment lengths. They are also famously difficult to implement robustly and efficiently; I spent most of my time implementing Ruppert's algorithm only to find that the methods described next produced better results with much simpler code.

Variational Methods

Variational (energy based) algorithms improve the mesh through a series of optimization steps that attempt to minimize a global energy function. I adapted the approach in Variational Tetrahedral Meshing [3] to 2D and found it produced great results; this is the method I settled on, so I'll go into some detail.

The algorithm proceeds as follows:

1. Generate a set of uniformly distributed points interior to the shape, P
2. Generate a set of points on the boundary of the shape, B
3. Generate a Delaunay triangulation of P
4. Optimize boundary points by moving them to the average of their neighbours in B
5. Optimize interior points by moving them to the centroid of their Voronoi cell (area weighted average of connected triangle circumcenters)
6. Unless the stopping criteria are met, go to 3
7. Remove boundary sliver triangles

The core idea is that of repeated triangulation (step 3) and relaxation (steps 4, 5); it's a process somewhat similar to Lloyd's clustering, coincidentally the same algorithm I had used to generate surfel hierarchies for global illumination sims in the past.
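Here's a sketch of what the interior relaxation (step 5) might look like; the Circumcenter() and TriangleArea() helpers are assumed:

#include <vector>

struct Point { float x, y; };
struct Triangle { int a, b, c; };

// assumed helpers
Point Circumcenter(const Point& a, const Point& b, const Point& c);
float TriangleArea(const Point& a, const Point& b, const Point& c);

void RelaxInteriorPoints(std::vector<Point>& points, const std::vector<Triangle>& tris, const std::vector<bool>& onBoundary)
{
    std::vector<Point> sum(points.size(), Point{0.0f, 0.0f});
    std::vector<float> weight(points.size(), 0.0f);

    // accumulate the area-weighted circumcenter of each triangle onto its vertices
    for (const Triangle& t : tris)
    {
        Point cc = Circumcenter(points[t.a], points[t.b], points[t.c]);
        float w = TriangleArea(points[t.a], points[t.b], points[t.c]);

        int verts[3] = { t.a, t.b, t.c };
        for (int i=0; i < 3; ++i)
        {
            sum[verts[i]].x += cc.x*w;
            sum[verts[i]].y += cc.y*w;
            weight[verts[i]] += w;
        }
    }

    // move each interior vertex to the centroid of its Voronoi cell
    for (size_t i=0; i < points.size(); ++i)
    {
        if (!onBoundary[i] && weight[i] > 0.0f)
        {
            points[i].x = sum[i].x/weight[i];
            points[i].y = sum[i].y/weight[i];
        }
    }
}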

Here's an animation of 7 iterations on the Armadillo, note the number of points stays the same throughout (another nice property):

It's interesting to see how much the quality improves after the very first step. Although Alliez et al. [3] don't provide any guarantees on the resulting mesh quality I found the algorithm works very well on a variety of input images with a fixed number of iterations.

This is the algorithm I ended up using but I'll quickly cover a couple of alternatives for completeness.

Structured Methods

These algorithms typically start by tiling interior space using a BCC (body centered cubic) lattice which is simply two interleaved grids. They then generate a Delaunay triangulation and throw away elements lying completely outside the region of interest.
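As an illustration, a minimal sketch of generating such a lattice, with InsideShape() standing in for whatever inside/outside test is available:

#include <vector>

struct Point3 { float x, y, z; };

// assumed helper: inside/outside test for the region being meshed
bool InsideShape(const Point3& p);

std::vector<Point3> GenerateBCCLattice(int n, float spacing)
{
    std::vector<Point3> points;

    for (int x=0; x < n; ++x)
        for (int y=0; y < n; ++y)
            for (int z=0; z < n; ++z)
            {
                // primary grid node
                Point3 p = { x*spacing, y*spacing, z*spacing };
                if (InsideShape(p))
                    points.push_back(p);

                // interleaved node at the cell centre
                Point3 q = { (x + 0.5f)*spacing, (y + 0.5f)*spacing, (z + 0.5f)*spacing };
                if (InsideShape(q))
                    points.push_back(q);
            }

    return points;
}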

As usual, handling boundaries is where the real challenge lies. Molino et al. [4] use a force based simulation to push grid points towards the boundary. Isosurface Stuffing [5] refines the boundary by directly moving vertices to the zero-contour of a signed distance field, or inserts new vertices if moving the existing lattice nodes would generate a poor quality triangle.

Lattice based methods are typically very fast and don't suffer from the numerical robustness issues of algorithms that rely on triangulation. However if you plan on fracturing the mesh along element boundaries then this regular nature is exposed and looks quite unconvincing.

Simplification Methods

Another approach is to start with a very fine-grained mesh and progressively simplify it in the style of Progressive Meshes [6]. Barbara Cutler's thesis and associated paper discuss the details and very helpfully provide the resulting tetrahedral meshes, but the implementation appears to be considerably more complex than variational methods and relies on quite a few heuristics to get good results.

Simulation

Now that the mesh is ready, it's time for the fun part (apologies if you really love meshing). This simple simulation uses co-rotational linear FEM with a semi-implicit time-stepping scheme:

[2] Jonathan Richard Shewchuk. 2002. What Is a Good Linear Finite Element? Interpolation, Conditioning, Anisotropy, and Quality Measures, unpublished preprint. http://www.cs.berkeley.edu/~jrs/papers/elemj.pdf

[5] François Labelle and Jonathan Richard Shewchuk. 2007. Isosurface stuffing: fast tetrahedral meshes with good dihedral angles. In ACM SIGGRAPH 2007 papers http://www.cs.berkeley.edu/~jrs/papers/stuffing.pdf

This is a quick post to document some work I did while writing a mass spring simulation using an implicit integrator. Implicit, or backward Euler integration is well described in David Baraff's Physically Based Modelling SIGGRAPH course and this post assumes some familiarity with it.

Springs are a workhorse in physical simulation, once you have unconditionally stable springs you can use them to model just about anything, from rigid bodies to drool and snot. For example, Industrial Light & Magic used a tetrahedral mesh with edge and altitude springs to model the damage to ships in Avatar (see Avatar: Bending Rigid Bodies).

If you sit down and try to implement an implicit integrator, one of the first things you need is the Jacobian of the particle forces with respect to the particle positions and velocities. The rest of this post shows how to derive these Jacobians for a basic Hookean spring in a form ready to be plugged into a linear system solver (I use a hand-rolled conjugate gradient solver; see Jonathan Shewchuk's painless introduction for details, it's all of about 20 lines of code to implement).
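For reference, a bare-bones (unpreconditioned) CG solve looks something like the sketch below; MultiplyA() stands in for the system's matrix-vector product:

#include <vector>

typedef std::vector<float> Vec;

// assumed: result = A*x for the (symmetric positive definite) system matrix
void MultiplyA(const Vec& x, Vec& result);

float Dot(const Vec& a, const Vec& b)
{
    float s = 0.0f;
    for (size_t i=0; i < a.size(); ++i)
        s += a[i]*b[i];
    return s;
}

// solves A*x = b, starting from the initial guess in x
void SolveCG(Vec& x, const Vec& b, int maxIterations, float tolerance)
{
    Vec r(b.size()), d(b.size()), q(b.size());

    // r = b - A*x, d = r
    MultiplyA(x, q);
    for (size_t i=0; i < b.size(); ++i)
        d[i] = r[i] = b[i] - q[i];

    float deltaNew = Dot(r, r);
    const float delta0 = deltaNew;

    for (int iter=0; iter < maxIterations && deltaNew > tolerance*tolerance*delta0; ++iter)
    {
        // alpha = r.r / (d.A.d)
        MultiplyA(d, q);
        float alpha = deltaNew / Dot(d, q);

        // x += alpha*d, r -= alpha*q
        for (size_t i=0; i < x.size(); ++i)
        {
            x[i] += alpha*d[i];
            r[i] -= alpha*q[i];
        }

        // update the search direction
        float deltaOld = deltaNew;
        deltaNew = Dot(r, r);
        float beta = deltaNew / deltaOld;

        for (size_t i=0; i < d.size(); ++i)
            d[i] = r[i] + beta*d[i];
    }
}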

In order to calculate the force Jacobians we first need to know how to calculate the derivatives of some basic geometric quantities with respect to a vector.

In general the derivative of a scalar valued function with respect to a vector is defined as the following row vector of partial derivatives:

\[\frac{\partial f}{\partial \v{x}} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_n}\right]\]

The two derivatives we will need are those of a vector's length and of its normalized direction:

\[\frac{\partial |\v{x}|}{\partial \v{x}} = \uv{x}^T \qquad\qquad \frac{\partial \uv{x}}{\partial \v{x}} = \frac{1}{|\v{x}|}\left(\v{I} - \uv{x}\cdot\uv{x}^T\right)\]

Where \(\v{I}\) is the \(n \times n\) identity matrix and \(n\) is the dimension of \(\v{x}\). The product of a column vector and a row vector \(\uv{x}\cdot\uv{x}^T\) is the outer product, an \(n \times n\) matrix that can be constructed using the standard matrix multiplication definition.

Now we are ready to compute the Jacobian of the spring forces. Recall the equation for the elastic force on a particle \(i\) due to an undamped Hookean spring:

\[\v{F_s} = -k_s(|\v{x}_{ij}| - r)\uv{x}_{ij}\]

Where \(\v{x}_{ij} = \v{x}_i - \v{x}_j\) is the vector between the two connected particle positions, \(r\) is the rest length and \(k_s\) is the stiffness coefficient.

The Jacobian of this force with respect to particle \(i\)'s position is obtained by using the product rule for the two \(\v{x}_i\) dependent terms in \(\v{F_s}\):

\[\frac{\partial \v{F_s}}{\partial \v{x}_i} = -k_s\left[\frac{|\v{x}_{ij}| - r}{|\v{x}_{ij}|}\left(\v{I} - \uv{x}_{ij}\cdot\uv{x}_{ij}^T\right) + \uv{x}_{ij}\cdot\uv{x}_{ij}^T\right]\]

To damp relative motion along the spring axis we can add a damping force of the form:

\[\v{F_d} = -k_d(\uv{x}_{ij} \cdot \v{v}_{ij})\uv{x}_{ij}\]

Where \(\v{v}_{ij} = \v{v}_i-\v{v}_j\) is the relative velocity of the two particles. This is the preferred formulation because it damps only the relative velocity along the spring axis.

Taking the derivative with respect to \(\v{v}_i\):

\[\frac{\partial \v{F_d}}{\partial \v{v}_i} = -k_d\,\uv{x}_{ij}\cdot\uv{x}_{ij}^T\]

Note that implicit integration introduces its own artificial damping, so you might find it's not necessary to add as much additional damping as you would with an explicit integration scheme.
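To make the derivation concrete, here's a rough sketch of building the two 3x3 Jacobian blocks for a single spring, in a form ready to be scattered into a system matrix such as \(\v{A} = \v{M} - \Delta t\frac{\partial \v{F}}{\partial \v{v}} - \Delta t^2\frac{\partial \v{F}}{\partial \v{x}}\); the Vec3/Mat33 types are minimal placeholders:

#include <cmath>

struct Vec3 { float x, y, z; };
struct Mat33 { float m[3][3]; };

// outer product a*b^T
Mat33 OuterProduct(const Vec3& a, const Vec3& b)
{
    const float av[3] = { a.x, a.y, a.z };
    const float bv[3] = { b.x, b.y, b.z };

    Mat33 r;
    for (int i=0; i < 3; ++i)
        for (int j=0; j < 3; ++j)
            r.m[i][j] = av[i]*bv[j];
    return r;
}

// dFs/dxi and dFd/dvi for a spring connecting particles i and j
void SpringJacobians(const Vec3& xi, const Vec3& xj, float restLength, float ks, float kd, Mat33& dfdx, Mat33& dfdv)
{
    Vec3 xij = { xi.x-xj.x, xi.y-xj.y, xi.z-xj.z };

    float l = sqrtf(xij.x*xij.x + xij.y*xij.y + xij.z*xij.z);
    Vec3 n = { xij.x/l, xij.y/l, xij.z/l };   // unit spring axis

    Mat33 nnT = OuterProduct(n, n);

    // (|xij| - r) / |xij|
    float c = 1.0f - restLength/l;

    for (int i=0; i < 3; ++i)
    {
        for (int j=0; j < 3; ++j)
        {
            float identity = (i == j) ? 1.0f : 0.0f;

            // dFs/dxi = -ks*[ c*(I - n*n^T) + n*n^T ]
            dfdx.m[i][j] = -ks*(c*(identity - nnT.m[i][j]) + nnT.m[i][j]);

            // dFd/dvi = -kd*n*n^T
            dfdv.m[i][j] = -kd*nnT.m[i][j];
        }
    }
}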

I'll be going into more detail about implicit methods and FEM in subsequent posts, stay tuned!

Hi all, welcome to my new site. I've moved to my own hosting and have updated a few things - a new theme and a switch to MathJax for equation rendering. Apologies to RSS readers who will now see only a bunch of LaTeX code, but it is currently by far the easiest way to put decent looking equations in a web page.

It's been a little over a year since I started working at NVIDIA and not coincidentally, since my last blog post. I'm really enjoying working more on the simulation side of things, it makes a nice change from pure rendering and the PhysX team is full of über-talented people who I'm learning a lot from.

I've got some simulation related posts (from a graphics programmer's perspective) planned over the next few months, I hope you enjoy them!

In between bouts of festive over-eating I added support for blackbody emission to my fluid simulator and thought I'd describe what was involved.

Briefly, a blackbody is an idealised substance that gives off light when heated. Planck's formula describes the emitted intensity per wavelength, with units W·sr^{-1}·m^{-2}·m^{-1}, for a given temperature in Kelvin.

Radiance has units W·sr^{-1}·m^{-2} so we need a way to convert the wavelength dependent power distribution given by Planck's formula to a radiance value in RGB that we can use in our shader / ray-tracer.

The typical way to do this is as follows:

1. Integrate Planck's formula against the CIE XYZ colour matching functions (available as part of PBRT in 1nm increments)
2. Convert from XYZ to linear sRGB (do not perform gamma correction yet)
3. Render as normal
4. Perform tone-mapping / gamma correction

We are throwing away spectral information by projecting into XYZ but a quick dimensional analysis shows that now we at least have the correct units (because the integration is with respect to dλ measured in meters the extra m^{-1} is removed).
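As a rough sketch of steps 1 and 2, something like the following works; the cieX/cieY/cieZ arrays are assumed to hold the CIE 1931 colour matching functions sampled every 5nm from 380nm to 780nm:

#include <cmath>

// assumed: CIE 1931 colour matching tables, 81 samples at 5nm spacing from 380-780nm
extern const double cieX[81];
extern const double cieY[81];
extern const double cieZ[81];

// Planck's law: spectral radiance (W·sr^-1·m^-2·m^-1) at wavelength lambda (metres)
double Planck(double lambda, double temperature)
{
    const double h = 6.62607e-34;   // Planck constant
    const double c = 2.99792e8;     // speed of light
    const double k = 1.38065e-23;   // Boltzmann constant

    return (2.0*h*c*c) / (pow(lambda, 5.0)*(exp(h*c/(lambda*k*temperature)) - 1.0));
}

// integrate Planck's formula against the matching functions, then convert XYZ to linear sRGB
void BlackbodyToLinearSRGB(double temperature, double rgb[3])
{
    double X = 0.0, Y = 0.0, Z = 0.0;
    const double dLambda = 5.0e-9;   // 5nm step, in metres

    for (int i=0; i < 81; ++i)
    {
        double lambda = (380.0 + i*5.0)*1.0e-9;
        double radiance = Planck(lambda, temperature);

        X += radiance*cieX[i]*dLambda;
        Y += radiance*cieY[i]*dLambda;
        Z += radiance*cieZ[i]*dLambda;
    }

    // XYZ to linear sRGB (D65 white point), no gamma correction yet
    rgb[0] =  3.2406*X - 1.5372*Y - 0.4986*Z;
    rgb[1] = -0.9689*X + 1.8758*Y + 0.0415*Z;
    rgb[2] =  0.0557*X - 0.2040*Y + 1.0570*Z;
}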

I was going to write more about the colour conversion process, but I didn't want to add to the confusion out there by accidentally misusing terminology. Instead here are a couple of papers describing the conversion from Spectrum->RGB and RGB->Spectrum, questions about these come up all the time on various forums and I think these two papers do a good job of providing background and clarifying the process:

Here is a small sample of linear sRGB radiance values for different Blackbody temperatures:
1000K: 1.81e-02, 1.56e-04, 1.56e-04
2000K: 1.71e+03, 4.39e+02, 4.39e+02
4000K: 5.23e+05, 3.42e+05, 3.42e+05
8000K: 9.22e+06, 9.65e+06, 9.65e+06

It's clear from the range of values that we need some sort of exposure control and tone-mapping. I simply picked a temperature in the upper end of my range (around 3000K) and scaled intensities around it before applying Reinhard tone mapping and gamma correction. You can also perform more advanced mapping by taking into account the human visual system adaptation as described in Physically Based Modeling and Animation of Fire.
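The mapping itself can be as simple as the following sketch: an exposure scale, a global Reinhard operator, then a fixed 2.2 gamma:

#include <cmath>

// exposure control followed by Reinhard tone mapping and gamma correction
float ToneMap(float radiance, float exposure)
{
    float scaled = radiance*exposure;          // scale around the chosen reference intensity
    float mapped = scaled / (1.0f + scaled);   // Reinhard operator
    return powf(mapped, 1.0f/2.2f);            // gamma correction
}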

Again the hardest part was setting up the simulation parameters to get the look you want, here's one I spent at least 4 days tweaking:

Simulation time is ~30s a frame (10 substeps) on a 128^3 grid tracking temperature, fuel, smoke and velocity. Most of that time is spent in the tri-cubic interpolation during advection, I've been meaning to try MacCormack advection to see if it's a net win.

There are some pretty obvious artifacts due to the tri-linear interpolation on the GPU, that would be helped by a higher resolution grid or manually performing tri-cubic in the shader.

Inspired by Kevin Beason's work in progress videos I put together a collection of my own failed tests which I think are quite amusing:

I have to admit to being simultaneously fascinated and slightly intimidated by the fluid simulation crowd. I've been watching the videos on Ron Fedkiw's page for years and am still in awe of his results, which sometimes seem little short of magic.

Like a lot of developers my first exposure to the subject was Jos Stam's stable fluids paper and his more accessible Fluid Dynamics for Games presentation; while the ideas are undeniably great, I never came away feeling like I truly understood the concepts or the mathematics behind it.

I'm happy to report that Bridson's book has helped change that. It includes a review of vector calculus in the appendix that is given in a wonderfully straight-forward and concise manner, Bridson takes almost nothing for granted and gives lots of real-world examples which helps for some of the less intuitive concepts.

I'm planning a bigger post on the subject but I thought I'd write a quick update with my progress so far.

I started out with a 2D simulation similar to Stam's demos, having a 2D implementation that you're confident in is really useful when you want to quickly try out different techniques and to sanity check results when things go wrong in 3D (and they will).

Before you write the 3D sim though, you need a way of visualising the data. I spent quite a while on this and implemented a single-scattering model using brute force ray-marching on the GPU.

I did some tests with a procedural pyroclastic cloud model which you can see below, this runs at around 25ms on my MacBook Pro (NVIDIA 320M) but you can dial the sample counts up and down to suit:

Here's a simplified GLSL snippet of the volume rendering shader, it's not at all optimised apart from some branches to skip over empty space and an assumption that absorption varies linearly with density:

uniform sampler3D g_densityTex;
uniform vec3 g_lightPos;
uniform vec3 g_lightIntensity;
uniform vec3 g_eyePos;
uniform float g_absorption;

void main()
{
    // diagonal of the cube
    const float maxDist = sqrt(3.0);

    const int numSamples = 128;
    const float scale = maxDist/float(numSamples);

    const int numLightSamples = 32;
    const float lscale = maxDist / float(numLightSamples);

    // assume all coordinates are in texture space
    vec3 pos = gl_TexCoord[0].xyz;
    vec3 eyeDir = normalize(pos-g_eyePos)*scale;

    // transmittance
    float T = 1.0;
    // in-scattered radiance
    vec3 Lo = vec3(0.0);

    for (int i=0; i < numSamples; ++i)
    {
        // sample density
        float density = texture3D(g_densityTex, pos).x;

        // skip empty space
        if (density > 0.0)
        {
            // attenuate ray-throughput
            T *= 1.0-density*scale*g_absorption;
            if (T <= 0.01)
                break;

            // point light dir in texture space
            vec3 lightDir = normalize(g_lightPos-pos)*lscale;

            // sample light
            float Tl = 1.0; // transmittance along light ray
            vec3 lpos = pos + lightDir;

            for (int s=0; s < numLightSamples; ++s)
            {
                float ld = texture3D(g_densityTex, lpos).x;
                Tl *= 1.0-g_absorption*lscale*ld;

                if (Tl <= 0.01)
                    break;

                lpos += lightDir;
            }

            vec3 Li = g_lightIntensity*Tl;
            Lo += Li*T*density*scale;
        }

        pos += eyeDir;
    }

    gl_FragColor.xyz = Lo;
    gl_FragColor.w = 1.0-T;
}

I'm pretty sure there's a whole post on the ways this could be optimised but I'll save that for next time. Also this example shader doesn't have any wavelength dependent variation. Making your absorption coefficient different for each channel looks much more interesting and having a different coefficient for your primary and shadow rays also helps, you can see this effect in the videos.

To create the cloud-like volume texture in OpenGL I use a displaced distance field like this (see the SIGGRAPH course for more details):

// create a volume texture with n^3 texels and base radius r
GLuint CreatePyroclasticVolume(int n, float r)
{
    GLuint texid;
    glGenTextures(1, &texid);

    GLenum target = GL_TEXTURE_3D;
    GLenum filter = GL_LINEAR;
    GLenum address = GL_CLAMP_TO_BORDER;

    glBindTexture(target, texid);

    glTexParameteri(target, GL_TEXTURE_MAG_FILTER, filter);
    glTexParameteri(target, GL_TEXTURE_MIN_FILTER, filter);

    glTexParameteri(target, GL_TEXTURE_WRAP_S, address);
    glTexParameteri(target, GL_TEXTURE_WRAP_T, address);
    glTexParameteri(target, GL_TEXTURE_WRAP_R, address);

    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);

    unsigned char* data = new unsigned char[n*n*n];
    unsigned char* ptr = data;

    float frequency = 3.0f / n;
    float center = n / 2.0f + 0.5f;

    for (int x=0; x < n; ++x)
    {
        for (int y=0; y < n; ++y)
        {
            for (int z=0; z < n; ++z)
            {
                float dx = center-x;
                float dy = center-y;
                float dz = center-z;

                // displace the sphere's distance field by fractal noise
                // (Perlin3D is a user-provided noise function)
                float off = fabsf(Perlin3D(x*frequency,
                                           y*frequency,
                                           z*frequency,
                                           5,
                                           0.5f));

                float d = sqrtf(dx*dx+dy*dy+dz*dz)/(n);

                *ptr++ = ((d-off) < r) ? 255 : 0;
            }
        }
    }

    // upload to the GPU
    glTexImage3D(target,
                 0,
                 GL_LUMINANCE,
                 n,
                 n,
                 n,
                 0,
                 GL_LUMINANCE,
                 GL_UNSIGNED_BYTE,
                 data);

    delete[] data;

    return texid;
}

Once I had the visualisation in place, porting the fluid simulation to 3D was actually not too difficult. I spent most of my time tweaking the initial conditions to get the smoke to behave in a way that looks interesting, you can see one of my more successful simulations below:

Currently the simulation runs entirely on the CPU using a 128^3 grid with monotonic tri-cubic interpolation and vorticity confinement as described in Visual Simulation of Smoke by Fedkiw. I'm fairly happy with the result but perhaps I have the vorticity confinement cranked a little high.

Nothing is optimised, so it's running at about 1.2s a frame on my 2.66GHz Core 2 MacBook.

On a personal note, I resigned from LucasArts a couple of weeks ago and am looking forward to some time off back in New Zealand with my family and friends. Just in time for the Kiwi summer!

Gregory Pakosz reminded me to write a follow up on my path tracing efforts since my last post on the subject.

It's good timing because the friendly work-place competition between Tom and me has been in full swing. The great thing about ray tracing is that there are many opportunities for optimisation at all levels of computation. This keeps you "hooked" by constantly offering decent speed increases for relatively little effort.

Tom had an existing BIH (bounding interval hierarchy) implementation that was doing a pretty good job, so I had some catching up to do. Previously I had a positive experience using a BVH (AABB tree) in a games context, so I decided to go that route.

Our benchmark scene was Crytek's Sponza with the camera positioned in the center of the model looking down the z-axis. This might not be the most representative case but was good enough for comparing primary ray speeds.

Here's a rough timeline of the performance progress (all timings were taken from my 2.6GHz i7 running 8 worker threads):

Optimisation                                      Rays/second
Baseline (median split)                           91246
Tweak compiler settings (/fp:fast /sse2 /Ot)      137486
Non-recursive traversal                           145847
Traverse closest branch first                     146822
Surface area heuristic                            1.27589e+006
Surface area heuristic (exhaustive)               1.9375e+006
Optimized ray-AABB                                2.14232e+006
VS2008 to VS2010                                  2.47746e+006

You can see the massive difference tree quality has on performance. What I found surprising though was the effect of switching to VS2010; 15% faster is impressive for a single compiler revision.

I played around with a quantized BVH which reduced node size from 32 bytes to 11 but I couldn't get the decrease in cache traffic to outweigh the cost in decoding the nodes. If anyone has had success with this I'd be interested in the details.

Algorithmically it is a uni-directional path tracer with multiple importance sampling. Of course importance sampling doesn't make individual samples faster, but it allows you to take fewer samples in total than you would otherwise need.
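For reference, a common choice of weighting function when combining two sampling strategies (e.g. BSDF and light sampling) is the power heuristic from Veach's thesis, also described in PBRT; a sketch:

// power heuristic with beta=2: nf/fPdf are the sample count and pdf of the
// strategy that generated the sample, ng/gPdf those of the other strategy
float PowerHeuristic(int nf, float fPdf, int ng, float gPdf)
{
    float f = nf*fPdf;
    float g = ng*gPdf;
    return (f*f) / (f*f + g*g);
}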

So, time for some pictures:

Despite being the lowest poly models, Sponza (200k triangles) and the classroom (250k triangles) were by far the most difficult for the renderer; they both took 10+ hours and still have visible noise. In contrast the gold statuette (10 million triangles) took only 20 mins to converge!

This is mainly because the architectural models have a mixture of very large and very small polygons which creates deep trees with large nodes near the root. I think a kd-tree which splits or duplicates primitives might be more effective in this case.

A fun way to break your spatial hierarchy is simply to add a ground plane. Until I performed an exhaustive split search, adding a large two-triangle ground plane could slow down tracing by as much as 50%.

Of course these numbers are peanuts compared to what people are getting with GPU or SIMD packet tracers, Timo Aila reports speeds of 142 million rays/second on similar scenes using a GPU tracer in this paper.

Writing a path tracer has been a great education for me and I would encourage anyone interested in getting a better grasp on computer graphics to get a copy of PBRT and have a go at it. It's easy to get started and seeing the finished product is hugely rewarding.

Links:

John Carmack tweeting about his experience optimising the offline global illumination calculations in RAGE.

I was surprised to learn at SIGGRAPH that Arnold (as used by Sony Pictures Imageworks) is at its core a uni-directional path tracer. Marcos Fajardo described some details in the Global Illumination Across Industries talk.

Mental Images iRay (their GPU based cloud renderer) looks impressive and apparently uses a single BSSRDF on all their surfaces which I guess helps simplify their GPU implementation.
