Our LiDAR detector spent 40% of its time in voxelization, not convs
Researchers profiling a LiDAR object detector found that the voxelization and scatter-to-pillars steps, not the 3D convolutional backbone, consumed approximately 40% of the per-frame latency. By moving voxelization to the GPU and fusing the scatter operation into a single kernel, they reduced the processing time from 31 ms to 19 ms. Most of the gain came from overlapping CPU and GPU work rather than from making individual kernels faster. They found a similar bottleneck in their auto-labeling loop and addressed it by putting a failover gateway in front of the VLM API calls.
AI IMPACT Preprocessing steps such as voxelization can account for a large share of inference latency, so optimizing them can significantly speed up AI models, especially in real-time applications.
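For readers unfamiliar with the two steps named above, the following is a minimal CPU-only sketch in plain Python of what voxelizing points into pillars and scattering them onto a dense grid involves. It is not the researchers' fused GPU kernel; the function names, the `(x, y, z, intensity)` point layout, and the choice of mean height as the per-pillar feature are all assumptions made for illustration.

```python
from collections import defaultdict


def voxelize_to_pillars(points, x_range, y_range, pillar_size):
    """Group (x, y, z, intensity) points into pillars on the x-y plane.

    Hypothetical helper: returns a dict mapping (ix, iy) grid indices
    to the list of points that fall inside that pillar.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    nx = int(round((x_max - x_min) / pillar_size))
    ny = int(round((y_max - y_min) / pillar_size))
    pillars = defaultdict(list)
    for x, y, z, intensity in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # drop points outside the grid
        # Clamp to guard against floating-point rounding at the upper edge.
        ix = min(int((x - x_min) // pillar_size), nx - 1)
        iy = min(int((y - y_min) // pillar_size), ny - 1)
        pillars[(ix, iy)].append((x, y, z, intensity))
    return pillars, nx, ny


def scatter_to_canvas(pillars, nx, ny):
    """Scatter one feature per pillar into a dense ny-by-nx grid.

    The feature here is the mean z of the pillar's points (an assumption);
    empty cells stay at 0.0.
    """
    canvas = [[0.0] * nx for _ in range(ny)]
    for (ix, iy), pts in pillars.items():
        canvas[iy][ix] = sum(p[2] for p in pts) / len(pts)
    return canvas
```

Both loops are per-point or per-pillar and independent of each other, which is why this kind of work maps naturally onto a GPU; on the CPU it runs serially and can stall the rest of the pipeline while the GPU waits for input.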