by Jesse Ranjit (ed. Andrew Draganov)
About three months ago, I started my internship at Expedition Technology Inc. My short time here was loaded with valuable industry experience not taught in a university environment. I had the opportunity to work in the agile software development cycle, to play an active role in code reviews, and to truly contribute to an existing code base. My project relied on cutting-edge Deep Learning research, which meant there was something new and exciting to learn every day.
Specifically, I worked on leveraging anchorless object detection for point-cloud based 3D data. This consisted of understanding the existing 3D object detection neural network, modifying it to remove the anchor proposals, and then thoroughly testing it to understand the results.
Original Object Detection Framework
Object detection frameworks traditionally operate through a feature extractor and a region proposal network (RPN). The feature extractor first characterizes patterns in the input before passing the consolidated information to the RPN, which outputs detection class probabilities. These frameworks have usually been developed for 2D applications and require a few subtle modifications before they can work on our 3D data. First, point clouds are not sampled along a neat grid the way 2D images are; the points are instead scattered irregularly across the scene. This requires a pre-processing step that organizes the data into tidy volumetric pixels (voxels), which can then be uniformly processed with standard deep learning techniques. This is accomplished with a per-voxel feature extractor that compiles all the points in a voxel into a single feature vector. The second necessary modification is the transition from 2D to 3D convolution and pooling operations. With inspiration from VoxelNet (Zhou, Y., 2018), the team I joined had already developed a 3D learned feature (3DLF) model that brought all these components together into an end-to-end deep learning system.
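The voxelization idea can be sketched in a few lines. This is a minimal illustration only: the function name and voxel size are my own, and the centroid pooling below is a hand-rolled stand-in for the learned per-voxel feature extractor that VoxelNet (and our 3DLF model) actually uses.

```python
import numpy as np

def voxelize(points, voxel_size=1.0):
    """Group scattered (x, y, z) points into voxel cells, one feature per cell.

    A real per-voxel feature extractor learns the pooling; the centroid
    computed here is just an illustrative stand-in feature.
    """
    # Integer cell index for every point.
    cells = np.floor(points / voxel_size).astype(int).tolist()
    voxels = {}
    for cell, point in zip(map(tuple, cells), points):
        voxels.setdefault(cell, []).append(point)
    # One fixed-length feature vector per occupied voxel.
    return {cell: np.mean(pts, axis=0) for cell, pts in voxels.items()}

# Irregularly scattered points collapse onto a regular grid of occupied cells.
points = np.array([[0.2, 0.1, 0.3], [0.4, 0.9, 0.2], [2.5, 2.1, 0.7]])
features = voxelize(points, voxel_size=1.0)
print(sorted(features))  # → [(0, 0, 0), (2, 2, 0)]
```

Once every occupied voxel carries a fixed-length feature vector, the data sits on a regular 3D grid and standard convolution and pooling operations apply.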
The 3DLF model falls under the umbrella of one-shot object detection algorithms, which output both class probabilities and object localizations in a single forward pass through the data. These one-shot models accomplish this by scattering predefined anchors across the entire scene, which then serve as priors for potential object positions and sizes. I was asked to help remove these anchors, as they are annoying to deal with for multiple reasons: first, anchors require a slow non-maximum suppression (NMS) post-processing step to handle inevitable overlap between predicted objects; second, anchors introduce additional hyperparameters that must be tuned to the expected data distributions; third, anchors impose a software dependency between the input and output tensors that is tedious to maintain.
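To make the first pain point concrete, here is a minimal greedy NMS sketch over axis-aligned 2D boxes (the 3D case works the same way with volumetric IoU). All names and numbers here are illustrative, not taken from our codebase; note the nested IoU loop that makes this step slow when thousands of anchors fire.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box, drop its overlaps."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two anchors fired on the same object; NMS keeps only the higher-scoring one.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]
```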
For these reasons, the team I worked with suggested extending the CenterNet approach (Zhou, X., 2019) to our 3DLF model in hopes that it would remove the dependency on pesky anchor proposals in a clean, concise way.
CenterNet does away with anchors by instead casting object detection as a keypoint identification problem, representing each object by its center and extent. For smooth calculations, the authors represent these predictions as Gaussian distributions over the output map. Each of these Gaussians can be considered a single anchor for object detection, eliminating the need for NMS during post-processing. By removing the anchors and post-processing, CenterNet speeds up the network while maintaining competitive performance on metrics such as mAP.
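The Gaussian-heatmap idea can be sketched as follows. The grid size, center, and sigma below are arbitrary values for illustration (CenterNet derives the Gaussian's spread from the object's size); the point is that object centers are read off as local maxima of the map, which is what makes NMS unnecessary.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """Render a CenterNet-style target: a Gaussian peak at an object center."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Local maxima of the predicted heatmap are read off directly as object
# centers, so no overlap-suppression pass over candidate boxes is needed.
hm = gaussian_heatmap((8, 8), center=(3, 4), sigma=1.0)
peak = np.unravel_index(hm.argmax(), hm.shape)
print(int(peak[0]), int(peak[1]))  # → 3 4
```

A separate output head regresses the object's extent at each peak, completing the box without any anchor priors.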
The first attempt at running the merged 3DLF/CenterNet model consisted of training on trivial data to ensure it could learn anything at all. To do this, we trained it on a tiny subset of the data and evaluated on the same subset. This immediately uncovered an issue: our legacy softmax needed to be replaced by a sigmoid activation at the end of the network. The reason had to do with the removal of the background class. Unlike other one-shot object detection methods, CenterNet does not predict a catch-all background class. This meant that performing softmax over the class predictions always produced a value above the detection threshold, causing a mess of predicted bounding boxes scattered across the scene. After discovering this and validating that the model could learn on trivial data, we let it loose on the full dataset. The resulting model obtained an mAP boost from 0.57 to 0.59, as well as a 2 steps-per-second speedup during training: a modest but robust improvement to an already mature system.
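The softmax failure mode is easy to demonstrate numerically. The logits and threshold below are made-up illustrative values, not from our model: with no background class, softmax is forced to distribute a full unit of probability over the foreground classes, so even a location with uniformly low logits clears the threshold, while independent sigmoids can all stay low.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Uniformly low logits: the network sees no object at this location.
logits = np.array([-4.0, -4.0, -4.0])  # one logit per foreground class
threshold = 0.3

# Softmax must still sum to 1 over the classes, so some class always clears
# the threshold -> spurious boxes scattered across the scene.
print(softmax(logits).max() > threshold)  # → True (each class gets 1/3)

# Independent sigmoids can all stay low, correctly predicting "no object".
print(sigmoid(logits).max() > threshold)  # → False (~0.018 each)
```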
As an intern at EXP, I received valuable deep learning experience. Following best practices for building a custom neural network, I learned a lot about common errors and setbacks in the process. I worked through every step of developing a neural network and, as a result, had an insightful, applicable experience under the mentorship of passionate coworkers. I really enjoyed the non-technical aspects of my time at Expedition as well. There were many informative events and seminars to attend every week, where people gave presentations on new developments in software and machine learning. The regular lunchtime and breakfast chats served as a great opportunity to network with colleagues and get help on technical issues. My mentors ensured that I was working on a project relevant to the organization's existing work. As a result, I was able to find purpose in what I was doing, which is a testament to how Expedition welcomes its interns and provides a positive, inspiring environment.
Zhou, X., Wang, D., & Krähenbühl, P. (2019). Objects as Points. arXiv preprint arXiv:1904.07850.
Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4490-4499).