CPU Profiling a Small Path Tracer on M1 Max

May 10, 2026

This note records a small CPU profiling pass on the bunny scene. The goal was not a full renderer rewrite, but to use Instruments to identify real bottlenecks and make small, measurable improvements.

The profiling setup was intentionally simple:

#define LIGHT_SAMPLE_X 1
#define LIGHT_SAMPLE_Y 1
#define MAX_RAY_DEPTH 10
#define SPP_X 3
#define SPP_Y 3
#define WIDTH 500
#define HEIGHT 500
#define useBVH false // disable scene-level BVH / TLAS
#define USE_MAXDEPTH_NEE
#define OVERRIDE_LOCAL_RENDER_VAL TRUE
#define NUM_THREADS 1

I used single-threaded runs first, with useBVH=false, so the profile focused on the bunny mesh’s internal BVH, i.e. the BLAS. The scene-level BVH was intentionally disabled because this scene only has a few top-level objects, so the TLAS overhead can hide the mesh traversal cost.

Baseline

Initial timing:

Rendering time: 15.2775 seconds
Total hit count: 717420294

The first surprising hotspot was not BVH traversal. It was the sampler. Owen-scrambled Halton sampling dominated the single-threaded profile through gl::permutationElement() and gl::owenScrambledRadicalInverse().
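The cost is visible in the structure of the radical inverse itself. A minimal, unscrambled sketch (not the renderer's code; the real gl::owenScrambledRadicalInverse() additionally permutes every digit, which is where gl::permutationElement() shows up):

```cpp
#include <cstdint>

// Plain radical inverse in base b: reverse the digits of `index`
// around the radix point. This per-digit divide/modulo loop runs for
// every sample dimension; Owen scrambling layers a permutation lookup
// onto each digit, multiplying the per-sample cost.
inline float radicalInverse(uint32_t base, uint64_t index) {
    const float invBase = 1.0f / static_cast<float>(base);
    uint64_t reversed = 0;
    float invBaseN = 1.0f;
    while (index > 0) {
        uint64_t next  = index / base;
        uint64_t digit = index - next * base;
        reversed = reversed * base + digit;
        invBaseN *= invBase;
        index = next;
    }
    return static_cast<float>(reversed) * invBaseN;
}
```

Even this stripped-down version does a division per digit per sample, which is why the sampler can dominate a single-threaded profile.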

Disabling Owen scrambling gave a much cleaner CPU baseline:

Rendering time: 11.3048 seconds
Total hit count: 193636618

This is not necessarily the best final image-quality setting. RandomStrategy::None can introduce visible banding/noise patterns. But for CPU profiling, removing the expensive sampler made the geometry/intersection bottlenecks much easier to see.

Ray and AABB Optimizations

The next hotspot was AABB::intersect(), with a lot of time attributed to gl::vec3::operator[]. The original AABB slab test repeatedly called ray.getOrigin() and ray.getDirection() inside a loop. Earlier, those getters returned by value, so the hot path was doing unnecessary vector copies.

The first cleanup was to return ray origin/direction by const reference and cache per-ray inverse direction:

Ray stores:
origin
direction
inv_direction

This matters because one ray can test many BVH nodes, but 1.0f / direction is invariant for that ray. Moving that work into Ray::setDirection() avoids recomputing it for every AABB.
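A sketch of the ray-side change (the names approximate the post's Ray class; the exact fields and vec3 type in the real code may differ):

```cpp
// Cache 1/direction once per ray so every AABB slab test can use
// multiplies instead of divides. Getters return const references to
// avoid copying vec3s in the hot traversal loop.
struct vec3 { float x, y, z; };

class Ray {
public:
    void setOrigin(const vec3& o) { origin_ = o; }
    void setDirection(const vec3& d) {
        direction_ = d;
        // Division by zero yields +/-infinity, which the slab test
        // handles naturally via min/max.
        inv_direction_ = { 1.0f / d.x, 1.0f / d.y, 1.0f / d.z };
    }
    const vec3& getOrigin() const { return origin_; }
    const vec3& getDirection() const { return direction_; }
    const vec3& getInvDirection() const { return inv_direction_; }

private:
    vec3 origin_{}, direction_{}, inv_direction_{};
};
```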

After this ray-side optimization:

Rendering time: 10.7534 seconds
Total hit count: 193642185

Then I unrolled the AABB slab test from a small for (i = 0; i < 3; ++i) loop into explicit x/y/z code. This substantially reduced the sampled cost of operator[], from roughly 1.0 s to roughly 0.5 s in the tested trace. Total render time only moved slightly, but the profile became cleaner and the indexed vector access hotspot shrank.
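The unrolled slab test looks roughly like this, written against the cached inverse direction (field names are illustrative, not the renderer's exact API):

```cpp
#include <algorithm>
#include <utility>

struct vec3 { float x, y, z; };

// One ray vs one AABB with the axis loop unrolled, so no operator[]
// indexing is needed. inv_dir is the per-ray cached 1/direction.
inline bool slabTest(const vec3& origin, const vec3& inv_dir,
                     const vec3& box_min, const vec3& box_max,
                     float t_min, float t_max) {
    float tx0 = (box_min.x - origin.x) * inv_dir.x;
    float tx1 = (box_max.x - origin.x) * inv_dir.x;
    if (tx0 > tx1) std::swap(tx0, tx1);
    t_min = std::max(t_min, tx0);
    t_max = std::min(t_max, tx1);

    float ty0 = (box_min.y - origin.y) * inv_dir.y;
    float ty1 = (box_max.y - origin.y) * inv_dir.y;
    if (ty0 > ty1) std::swap(ty0, ty1);
    t_min = std::max(t_min, ty0);
    t_max = std::min(t_max, ty1);

    float tz0 = (box_min.z - origin.z) * inv_dir.z;
    float tz1 = (box_max.z - origin.z) * inv_dir.z;
    if (tz0 > tz1) std::swap(tz0, tz1);
    t_min = std::max(t_min, tz0);
    t_max = std::min(t_max, tz1);

    return t_min <= t_max;
}
```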

TLAS vs BLAS

One useful observation: enabling the scene-level BVH was not always faster for this scene.

With a single bunny plus a few walls/lights, the top-level object count is small. The bunny mesh already has its own internal BVH. So:

useBVH=false:
ObjectList linearly tests a few scene objects
TriangleMesh still uses internal MeshBVH

useBVH=true:
ray first traverses a scene-level BVH
then TriangleMesh still uses internal MeshBVH

For this scene, the scene-level BVH can reduce primitive hit counts while still being slower, because its AABB traversal and pointer chasing overhead outweigh the savings at such a small top-level object count.

Near-First BVH Traversal

The mesh BVH originally traversed children in a fixed order:

intersect left subtree
then intersect right subtree

This is correct, but not always efficient. If the ray hits something close in the right subtree, traversing the left subtree first may do unnecessary work.

The traversal was changed to test both child AABBs first, collect their t_enter, and visit the nearer child first:

left_t  = ray enters left child box
right_t = ray enters right child box

visit the smaller t_enter first
if it finds a closer triangle, skip far child when far_t > closest

This is a heuristic: entering a box earlier does not guarantee the closest triangle is inside that box, but it often finds a closer hit earlier and makes pruning more effective.
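The ordering and pruning logic can be shown with a deliberately tiny model: below, each "box" is just an entry/exit interval along the ray rather than a full 3-D AABB, so the sketch stays self-contained. This is not the renderer's traversal code.

```cpp
#include <algorithm>
#include <limits>

// Toy 1-D model: a Box stores the ray's entry/exit parameters.
struct Box { float lo, hi; };
struct Node {
    Box box{};
    const Node* left = nullptr;
    const Node* right = nullptr;
    float tri_t = std::numeric_limits<float>::infinity();  // leaf hit distance
    bool isLeaf() const { return left == nullptr; }
};

// Entry distance, or +inf when the ray misses the box entirely.
inline float enter(const Box& b) {
    if (b.hi < 0.0f) return std::numeric_limits<float>::infinity();
    return std::max(b.lo, 0.0f);
}

// Near-first traversal: test both child boxes, visit the nearer entry
// first, and skip the far child when its entry lies beyond the closest
// hit found so far.
inline void traverse(const Node* n, float& closest) {
    if (n->isLeaf()) { closest = std::min(closest, n->tri_t); return; }
    float near_t = enter(n->left->box);
    float far_t  = enter(n->right->box);
    const Node* near_c = n->left;
    const Node* far_c  = n->right;
    if (far_t < near_t) { std::swap(near_c, far_c); std::swap(near_t, far_t); }
    if (near_t < closest) traverse(near_c, closest);
    if (far_t < closest) traverse(far_c, closest);  // pruned when far_t >= closest
}
```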

Flattening the Mesh BVH

The original mesh BVH is a pointer tree:

std::unique_ptr<MeshBVHNode> left;
std::unique_ptr<MeshBVHNode> right;

This causes recursive calls and pointer chasing. To improve locality, the recursive BVH was flattened into an array:

struct LinearMeshBVHNode {
    AABB box;
    int left;
    int right;
    int first_tri;
    int tri_count;
};

This is not a heap layout. The children are not addressed by 2*i+1 and 2*i+2, because the BVH is not a perfect binary tree. Instead, each linear node explicitly stores the array index of its children.

Leaf triangles are stored separately:

linear_bvh_tri_indices

A leaf stores:

first_tri
tri_count

which points to a contiguous segment of linear_bvh_tri_indices. Those values are triangle indices into the actual mesh data.
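The flattening itself can be sketched as a depth-first pass over the pointer tree (field names here are assumptions based on the structs described in this post, not the renderer's actual code):

```cpp
#include <memory>
#include <vector>

struct AABB { /* bounds omitted for the sketch */ };

// The original pointer-tree node.
struct MeshBVHNode {
    AABB box;
    std::unique_ptr<MeshBVHNode> left, right;
    std::vector<int> tri_indices;  // non-empty only for leaves
};

// The flat node: children are explicit array indices, -1 marks a leaf.
struct LinearMeshBVHNode {
    AABB box;
    int left = -1, right = -1;
    int first_tri = 0, tri_count = 0;
};

// Depth-first flatten: each interior node records the array indices of
// its children, and each leaf appends its triangles to one shared,
// contiguous index array. Returns this node's index in `nodes`.
int flatten(const MeshBVHNode& n,
            std::vector<LinearMeshBVHNode>& nodes,
            std::vector<int>& linear_bvh_tri_indices) {
    int my_index = static_cast<int>(nodes.size());
    nodes.push_back({n.box});
    if (!n.left) {  // leaf
        nodes[my_index].first_tri = static_cast<int>(linear_bvh_tri_indices.size());
        nodes[my_index].tri_count = static_cast<int>(n.tri_indices.size());
        linear_bvh_tri_indices.insert(linear_bvh_tri_indices.end(),
                                      n.tri_indices.begin(), n.tri_indices.end());
    } else {
        nodes[my_index].left  = flatten(*n.left, nodes, linear_bvh_tri_indices);
        nodes[my_index].right = flatten(*n.right, nodes, linear_bvh_tri_indices);
    }
    return my_index;
}
```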

The runtime traversal then became iterative:

push root node index
while stack not empty:
pop node index
test AABB
if leaf: test triangles
else: push children

At first, using std::vector as the traversal stack introduced allocation overhead. After switching to a fixed local stack:

std::array<StackEntry, 128> stack;

the flat traversal became slightly faster. The improvement was modest, but it reduced recursive overhead and made the BVH representation more suitable for future SIMD or SoA experiments.
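Concretely, the iterative loop with the fixed stack looks roughly like this. Box and triangle tests are abstracted into callbacks so the sketch stays self-contained; the real TriangleMesh::intersectFlat also threads through the ray and hit record.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct LinearMeshBVHNode {
    int left = -1, right = -1;   // array indices; -1 marks a leaf
    int first_tri = 0, tri_count = 0;
};

// Iterative flat-BVH traversal with a fixed local stack, so no heap
// allocation happens per ray. 128 entries is far deeper than a mesh
// BVH should ever get; a real build would clamp or assert on depth.
template <typename BoxHit, typename LeafHit>
void traverseFlat(const std::vector<LinearMeshBVHNode>& nodes,
                  BoxHit&& boxHit, LeafHit&& leafHit) {
    std::array<int, 128> stack;
    std::size_t top = 0;
    stack[top++] = 0;  // push root node index
    while (top > 0) {
        const int idx = stack[--top];          // pop node index
        const LinearMeshBVHNode& n = nodes[idx];
        if (!boxHit(idx)) continue;            // test AABB
        if (n.left < 0) {
            leafHit(n.first_tri, n.tri_count); // leaf: test triangles
        } else {
            stack[top++] = n.left;             // interior: push children
            stack[top++] = n.right;
        }
    }
}
```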

Where the Time Went

After these changes, the main hotspots were still:

TriangleMesh::intersectFlat
AABB::intersect
gl::vec3::operator[]
TriangleMesh::hitTriangleFlat

This makes sense. The current flat BVH improves node traversal locality, but triangle data is still indirect:

leaf -> linear_bvh_tri_indices -> mesh.indices -> mesh.positions

So the next meaningful step is probably triangle data precomputation rather than more traversal micro-optimizations.

Next Steps

Potential future work:

  1. Precompute triangle data:
p0
e1 = p1 - p0
e2 = p2 - p0
geometric normal

This would reduce repeated work inside hitTriangleFlat().
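One plausible shape for that precomputation is a standard Möller–Trumbore test rewritten to consume cached edges. This is an assumption about the eventual layout, not code from the renderer:

```cpp
#include <cmath>

struct vec3 { float x, y, z; };
inline vec3 sub(const vec3& a, const vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline vec3 cross(const vec3& a, const vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline float dot(const vec3& a, const vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Precomputed per-triangle data: base vertex plus the two edges the
// intersection test needs, built once at load time instead of being
// re-derived from indexed vertex fetches inside the hot loop.
struct PackedTriangle {
    vec3 p0, e1, e2;  // e1 = p1 - p0, e2 = p2 - p0
};

// Möller–Trumbore against the cached edges. On a hit, t is the ray
// parameter and (u, v) are barycentric coordinates.
bool hitPackedTriangle(const PackedTriangle& tri, const vec3& orig,
                       const vec3& dir, float& t, float& u, float& v) {
    const vec3 pvec = cross(dir, tri.e2);
    const float det = dot(tri.e1, pvec);
    if (std::fabs(det) < 1e-8f) return false;  // ray parallel to triangle
    const float inv_det = 1.0f / det;
    const vec3 tvec = sub(orig, tri.p0);
    u = dot(tvec, pvec) * inv_det;
    if (u < 0.0f || u > 1.0f) return false;
    const vec3 qvec = cross(tvec, tri.e1);
    v = dot(dir, qvec) * inv_det;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(tri.e2, qvec) * inv_det;
    return t > 0.0f;
}
```

The geometric normal can likewise be stored as cross(e1, e2), normalized once at build time.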

  2. Reorder triangles by BVH leaf order:
leaf triangles become contiguous in memory

This improves cache locality for leaf intersection.

  3. Consider SIMD after the data layout is ready.

Single AABB SIMD, i.e. one ray vs one box over x/y/z, is possible but probably not very profitable. A better SIMD target is one ray vs multiple boxes, such as a 4-wide BVH node layout (QBVH). That would require grouping child AABBs in a more SIMD-friendly structure.
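As a rough illustration of the data layout such a node needs: the four child AABBs are stored SoA, so one ray can test all four boxes in a single pass. The version below is a scalar reference, not NEON intrinsics, but each line of the loop body maps naturally onto a 4-lane SIMD operation.

```cpp
#include <algorithm>

// Four child AABBs stored structure-of-arrays: all min-x together,
// all min-y together, and so on.
struct QNodeBounds {
    float min_x[4], min_y[4], min_z[4];
    float max_x[4], max_y[4], max_z[4];
};

// 4-wide slab test: bit i of the returned mask is set when the ray
// overlaps child i within [t_min, t_max]. inv_dir is the per-ray
// cached 1/direction.
inline int slabTest4(const QNodeBounds& b,
                     const float origin[3], const float inv_dir[3],
                     float t_min, float t_max) {
    int mask = 0;
    for (int i = 0; i < 4; ++i) {
        const float tx0 = (b.min_x[i] - origin[0]) * inv_dir[0];
        const float tx1 = (b.max_x[i] - origin[0]) * inv_dir[0];
        const float ty0 = (b.min_y[i] - origin[1]) * inv_dir[1];
        const float ty1 = (b.max_y[i] - origin[1]) * inv_dir[1];
        const float tz0 = (b.min_z[i] - origin[2]) * inv_dir[2];
        const float tz1 = (b.max_z[i] - origin[2]) * inv_dir[2];
        const float t0 = std::max({t_min, std::min(tx0, tx1),
                                   std::min(ty0, ty1), std::min(tz0, tz1)});
        const float t1 = std::min({t_max, std::max(tx0, tx1),
                                   std::max(ty0, ty1), std::max(tz0, tz1)});
        if (t0 <= t1) mask |= 1 << i;
    }
    return mask;
}
```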

Summary

The main lesson from this pass: profiling first mattered. The initial hotspot was the sampler, not BVH. After removing that profiling noise, the actual geometry work showed up clearly. Small changes like ray inverse-direction caching, near-first traversal, flat BVH layout, fixed traversal stack, and AABB unrolling gave incremental wins and made the code better prepared for future layout/SIMD work.