CPU Profiling a Small Path Tracer on M1 Max
This note records a small CPU profiling pass on the bunny scene. The goal was not to do a full renderer rewrite, but to use Instruments to identify real bottlenecks and make small, measurable improvements.
The profiling setup was intentionally simple.
I used single-threaded runs first, with useBVH=false, so the profile focused on the bunny mesh’s internal BVH, i.e. the BLAS. The scene-level BVH was intentionally disabled because this scene only has a few top-level objects, so the TLAS overhead can hide the mesh traversal cost.
Baseline
Initial timing:
```
Rendering time: 15.2775 seconds
```
The first surprising hotspot was not BVH traversal. It was the sampler. Owen-scrambled Halton sampling dominated the single-threaded profile through gl::permutationElement() and gl::owenScrambledRadicalInverse().
Disabling Owen scrambling gave a much cleaner CPU baseline:
```
Rendering time: 11.3048 seconds
```
This is not necessarily the best final image-quality setting. RandomStrategy::None can introduce visible banding/noise patterns. But for CPU profiling, removing the expensive sampler made the geometry/intersection bottlenecks much easier to see.
Ray and AABB Optimizations
The next hotspot was AABB::intersect(), with a lot of time attributed to gl::vec3::operator[]. The original AABB slab test repeatedly called ray.getOrigin() and ray.getDirection() inside a loop. Earlier, those getters returned by value, so the hot path was doing unnecessary vector copies.
The first cleanup was to return the ray origin and direction by const reference and to cache a per-ray inverse direction directly on the Ray.
This matters because one ray can test many BVH nodes, but 1.0f / direction is invariant for that ray. Moving that work into Ray::setDirection() avoids recomputing it for every AABB.
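A minimal sketch of the Ray-side change, assuming gl::vec3 supports operator[] and a three-component constructor; the member names here are illustrative, not the renderer's exact ones:

```cpp
// Sketch only: caches 1/direction once per ray and exposes const-reference getters.
class Ray {
public:
    void setOrigin(const gl::vec3 &o) { m_origin = o; }

    void setDirection(const gl::vec3 &d) {
        m_direction = d;
        // Invariant per ray: computed here once instead of inside every AABB test.
        m_inv_direction = gl::vec3(1.0f / d[0], 1.0f / d[1], 1.0f / d[2]);
    }

    // Const references avoid copying a vec3 on every call in the hot path.
    const gl::vec3 &getOrigin() const { return m_origin; }
    const gl::vec3 &getDirection() const { return m_direction; }
    const gl::vec3 &getInvDirection() const { return m_inv_direction; }

private:
    gl::vec3 m_origin, m_direction, m_inv_direction;
};
```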
After this ray-side optimization:
```
Rendering time: 10.7534 seconds
```
Then I unrolled the AABB slab test from a small for (i = 0; i < 3; ++i) loop into explicit x/y/z code. This reduced the sampled cost of operator[] substantially, roughly from about 1.0s to about 0.5s in the tested trace. Total render time only moved slightly, but the profile became cleaner and the indexed vector access hotspot was reduced.
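For reference, a hedged sketch of an unrolled slab test using the cached inverse direction; pmin/pmax and the exact signature are illustrative, not copied from the renderer:

```cpp
// Sketch of an unrolled slab test (needs <algorithm> for std::min/std::max).
bool AABB::intersect(const Ray &ray, float t_min, float t_max) const {
    const gl::vec3 &o   = ray.getOrigin();
    const gl::vec3 &inv = ray.getInvDirection();

    // X slab
    float t0 = (pmin[0] - o[0]) * inv[0];
    float t1 = (pmax[0] - o[0]) * inv[0];
    t_min = std::max(t_min, std::min(t0, t1));
    t_max = std::min(t_max, std::max(t0, t1));

    // Y slab
    t0 = (pmin[1] - o[1]) * inv[1];
    t1 = (pmax[1] - o[1]) * inv[1];
    t_min = std::max(t_min, std::min(t0, t1));
    t_max = std::min(t_max, std::max(t0, t1));

    // Z slab
    t0 = (pmin[2] - o[2]) * inv[2];
    t1 = (pmax[2] - o[2]) * inv[2];
    t_min = std::max(t_min, std::min(t0, t1));
    t_max = std::min(t_max, std::max(t0, t1));

    return t_min <= t_max;
}
```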
TLAS vs BLAS
One useful observation: enabling the scene-level BVH was not always faster for this scene.
With a single bunny plus a few walls/lights, the top-level object count is small, and the bunny mesh already has its own internal BVH. With useBVH=false, rays simply loop over the few top-level objects and descend into the bunny's BLAS; with useBVH=true, a TLAS is layered on top of that.
For this scene, the scene-level BVH can reduce primitive hit counts while still being slower, because its AABB traversal and pointer chasing overhead outweigh the savings at such a small top-level object count.
Near-First BVH Traversal
The mesh BVH originally traversed children in a fixed order: always the left subtree first, then the right subtree.
This is correct, but not always efficient. If the ray hits something close in the right subtree, traversing the left subtree first may do unnecessary work.
The traversal was changed to test both child AABBs first, record each child's entry distance (left_t and right_t), and visit the nearer child first.
This is a heuristic: entering a box earlier does not guarantee the closest triangle is inside that box, but it often finds a closer hit earlier and makes pruning more effective.
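A sketch of that ordering in the recursive traversal. intersectEnter() (an AABB test that also reports the entry distance), intersectNode(), and closest_hit are hypothetical names used only for illustration:

```cpp
// Sketch: visit the child box the ray enters first. If a close hit is found
// there, the farther child's AABB test (done against the tightened ray
// interval inside intersectNode) is more likely to reject the whole subtree.
float left_t = 0.0f, right_t = 0.0f;
bool hit_left  = node->left->bbox.intersectEnter(ray, left_t);
bool hit_right = node->right->bbox.intersectEnter(ray, right_t);

if (left_t <= right_t) {
    if (hit_left)  intersectNode(node->left.get(),  ray, closest_hit);
    if (hit_right) intersectNode(node->right.get(), ray, closest_hit);
} else {
    if (hit_right) intersectNode(node->right.get(), ray, closest_hit);
    if (hit_left)  intersectNode(node->left.get(),  ray, closest_hit);
}
```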
Flattening the Mesh BVH
The original mesh BVH is a pointer tree:
```cpp
std::unique_ptr<MeshBVHNode> left;
std::unique_ptr<MeshBVHNode> right;
```
This causes recursive calls and pointer chasing. To improve locality, the recursive BVH was flattened into an array of LinearMeshBVHNode entries.
This is not a heap layout. The children are not addressed by 2*i+1 and 2*i+2, because the BVH is not a perfect binary tree. Instead, each linear node explicitly stores the array index of its children.
Leaf triangles are stored separately in a flat array, linear_bvh_tri_indices. A leaf stores first_tri, which points to a contiguous segment of that array; the values in the segment are triangle indices into the actual mesh data.
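Putting those pieces together, a hedged sketch of what such a flattened node can look like; apart from first_tri and linear_bvh_tri_indices, the field names and index widths are illustrative:

```cpp
// Sketch of one flattened BVH node (32-bit indices assumed; needs <cstdint>).
struct LinearMeshBVHNode {
    AABB bbox;              // node bounds
    int32_t left  = -1;     // index of the left child in the flat node array, -1 in a leaf
    int32_t right = -1;     // index of the right child in the flat node array, -1 in a leaf
    int32_t first_tri = 0;  // leaf: start of this leaf's segment in linear_bvh_tri_indices
    int32_t tri_count = 0;  // leaf: number of triangle indices in that segment
};
```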
The runtime traversal then became iterative: push the root node index onto an explicit stack, then pop and process nodes in a loop.
At first, using std::vector as the traversal stack introduced allocation overhead. After switching to a fixed local stack:
```cpp
std::array<StackEntry, 128> stack;
```
the flat traversal became slightly faster. The improvement was modest, but it reduced recursive overhead and made the BVH representation more suitable for future SIMD or SoA experiments.
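A sketch of the iterative traversal under the same assumptions about the node fields. For brevity it pushes raw node indices instead of the StackEntry type, hitTriangleFlat()'s exact signature is illustrative, and ray, t_min, and closest_hit (the running nearest-hit record) come from the surrounding function:

```cpp
// Iterative flat-BVH traversal sketch: no recursion, no per-ray heap allocation.
// Needs <array> and <cstdint>.
std::array<uint32_t, 128> stack;
int top = 0;
stack[top++] = 0;  // push root node index

while (top > 0) {
    const LinearMeshBVHNode &node = linear_bvh_nodes[stack[--top]];

    // Skip the whole subtree if the ray misses this node's box.
    if (!node.bbox.intersect(ray, t_min, closest_hit.t))
        continue;

    if (node.left < 0) {
        // Leaf: test its contiguous run of triangle indices.
        for (int i = 0; i < node.tri_count; ++i)
            hitTriangleFlat(ray, linear_bvh_tri_indices[node.first_tri + i], closest_hit);
    } else {
        // Interior node: push both children (near-first ordering omitted for brevity).
        stack[top++] = static_cast<uint32_t>(node.right);
        stack[top++] = static_cast<uint32_t>(node.left);
    }
}
```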
Where the Time Went
After these changes, the main hotspots were still in the mesh intersection path, led by TriangleMesh::intersectFlat.
This makes sense. The current flat BVH improves node traversal locality, but triangle data is still indirect:
```
leaf -> linear_bvh_tri_indices -> mesh.indices -> mesh.positions
```
So the next meaningful step is probably triangle data precomputation rather than more traversal micro-optimizations.
Next Steps
Potential future work:
- Precompute triangle data, starting from p0 and the rest of each triangle's vertex-derived values, into its own flat array (see the sketch at the end of this section). This would reduce repeated work inside hitTriangleFlat().
- Reorder triangles by BVH leaf order, so that leaf triangles become contiguous in memory. This improves cache locality for leaf intersection.
- Consider SIMD after the data layout is ready.
Single AABB SIMD, i.e. one ray vs one box over x/y/z, is possible but probably not very profitable. A better SIMD target is one ray vs multiple boxes, such as a 4-wide BVH node layout (QBVH). That would require grouping child AABBs in a more SIMD-friendly structure.
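For the triangle-precompute item above, here is a hedged sketch of what a flat per-triangle record could look like; the edge-vector layout is one common choice, not something the current code does:

```cpp
// Illustrative only: one self-contained record per triangle, built once and stored
// in BVH leaf order, so hitTriangleFlat() no longer chases
// leaf -> linear_bvh_tri_indices -> mesh.indices -> mesh.positions per test.
// Needs <vector> and <cstdint>.
struct FlatTriangle {
    gl::vec3 p0;             // first vertex
    gl::vec3 e1;             // p1 - p0
    gl::vec3 e2;             // p2 - p0
    uint32_t mesh_tri_index; // back-reference into the original mesh data (normals, UVs)
};
std::vector<FlatTriangle> flat_triangles; // contiguous, matching the reordered leaf segments
```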
Summary
The main lesson from this pass: profiling first mattered. The initial hotspot was the sampler, not BVH. After removing that profiling noise, the actual geometry work showed up clearly. Small changes like ray inverse-direction caching, near-first traversal, flat BVH layout, fixed traversal stack, and AABB unrolling gave incremental wins and made the code better prepared for future layout/SIMD work.