I changed my surfel renderer over to use a pre-order traversal layout for the nodes, this generally gives better cache utilisation and I did see a small speed up from using it. The layout is quite nice because to traverse your tree you just linearly iterate over your array and whenever you find a subtree you want to skip you just increment your node pointer by the size of that subtree (which is precomputed, see Real Time Collision Detection 6.6.2).
The best optimisation though comes from compacting the size of the surfel data, which again improves the cache performance. As some parts of the traversal don't need all of the surfel data it seems to make sense to split things out, for instance to store the hierarchy information and the area seperately from the colour/irradiance information.
In fact it seems like when generalised, this idea leads you to the structure of arrays (SOA) layout, which essentially provides the finest grained breakdown where you only pull into the cache what you use and for all the nodes that you skip over there is no added cost.
I haven't done any timings to see how much of a win this would actually be, mainly because dealing with SOA data is so damn cumbersome.
It definitely seems like something you should do after you've done all your hierarchy building and node shuffling which is just so much more intuitive with structures. Then you can just 'bake' it down to SOA format and throw it at the GPU/SIMD.