{"id":1180,"date":"2022-01-27T11:12:48","date_gmt":"2022-01-27T15:12:48","guid":{"rendered":"https:\/\/ece.ncsu.edu\/?p=245069"},"modified":"2022-01-27T11:12:48","modified_gmt":"2022-01-27T15:12:48","slug":"technique-improves-ai-ability-to-understand-3d-space-using-2d-images","status":"publish","type":"post","link":"https:\/\/my.ece.ncsu.edu\/communications\/2022\/technique-improves-ai-ability-to-understand-3d-space-using-2d-images\/","title":{"rendered":"Technique Improves AI Ability to Understand 3D Space Using 2D Images"},"content":{"rendered":"<p><img decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/ece.ncsu.edu\/wp-content\/uploads\/2022\/01\/tianfu-wu-monocon-header-1024x576.jpg\" class=\"attachment-large size-large wp-post-image\" alt=\"photo shows cars on a street, with each of them surrounded by lines indicating a bounding box\" loading=\"lazy\" srcset=\"https:\/\/ece.ncsu.edu\/wp-content\/uploads\/2022\/01\/tianfu-wu-monocon-header-980x551.jpg 980w, https:\/\/ece.ncsu.edu\/wp-content\/uploads\/2022\/01\/tianfu-wu-monocon-header-480x270.jpg 480w\" sizes=\"auto, (min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw\" \/><\/p>\n<p>Researchers have developed a new technique, called MonoCon, that improves the ability of artificial intelligence (AI) programs to identify three-dimensional (3D) objects, and how those objects relate to each other in space, using two-dimensional (2D) images. 
For example, the work would help the AI used in autonomous vehicles navigate in relation to other vehicles using the 2D images it receives from an onboard camera.<\/p>\n<p>\u201cWe live in a 3D world, but when you take a picture, it records that world in a 2D image,\u201d says Tianfu Wu, corresponding author of a paper on the work and an assistant professor of electrical and computer engineering at North Carolina State University.<\/p>\n<p>\u201cAI programs receive visual input from cameras. So if we want AI to interact with the world, we need to ensure that it is able to interpret what 2D images can tell it about 3D space. In this research, we are focused on one part of that challenge: how we can get AI to accurately recognize 3D objects \u2013 such as people or cars \u2013 in 2D images, and place those objects in space.\u201d<\/p>\n<p>While the work may be important for autonomous vehicles, it also has applications for manufacturing and robotics.<\/p>\n<p>In the context of autonomous vehicles, most existing systems rely on lidar \u2013 which uses lasers to measure distance \u2013 to navigate 3D space. However, lidar technology is expensive. And because lidar is expensive, autonomous systems don\u2019t include much redundancy. For example, it would be too expensive to put dozens of lidar sensors on a mass-produced driverless car.<\/p>\n<p>\u201cBut if an autonomous vehicle could use visual inputs to navigate through space, you could build in redundancy,\u201d Wu says. \u201cBecause cameras are significantly less expensive than lidar, it would be economically feasible to include additional cameras \u2013 building redundancy into the system and making it both safer and more robust.<\/p>\n<p>\u201cThat\u2019s one practical application. 
However, we\u2019re also excited about the fundamental advance of this work: that it is possible to get 3D data from 2D objects.\u201d<\/p>\n<p>Specifically, MonoCon is capable of identifying 3D objects in 2D images and placing them in a \u201cbounding box,\u201d which effectively tells the AI the outermost edges of the relevant object.<\/p>\n<p>MonoCon builds on a substantial amount of existing work aimed at helping AI programs extract 3D data from 2D images. Many of these efforts train the AI by \u201cshowing\u201d it 2D images and placing 3D bounding boxes around objects in the image. These boxes are cuboids, which have eight points \u2013 think of the corners on a shoebox. During training, the AI is given 3D coordinates for each of the box\u2019s eight corners, so that the AI \u201cunderstands\u201d the height, width and length of the \u201cbounding box,\u201d as well as the distance between each of those corners and the camera. The training technique uses these coordinates to teach the AI how to estimate the dimensions of each bounding box and instructs the AI to predict the distance between the camera and the object. After each prediction, the trainers \u201ccorrect\u201d the AI, giving it the correct answers. Over time, this allows the AI to get better and better at identifying objects, placing them in a bounding box, and estimating the dimensions of the objects.<\/p>\n<p>\u201cWhat sets our work apart is how we train the AI, which builds on previous training techniques,\u201d Wu says. \u201cLike the previous efforts, we place objects in 3D bounding boxes while training the AI. However, in addition to asking the AI to predict the camera-to-object distance and the dimensions of the bounding boxes, we also ask the AI to predict the locations of each of the box\u2019s eight points and its distance from the center of the bounding box in two dimensions. 
We call this \u2018auxiliary context,\u2019 and we found that it helps the AI more accurately identify and predict 3D objects based on 2D images.<\/p>\n<p>\u201cThe proposed method is motivated by a well-known theorem in measure theory, the Cram\u00e9r\u2013Wold theorem. It is also potentially applicable to other structured-output prediction tasks in computer vision.\u201d<\/p>\n<p>The researchers tested MonoCon using a widely used benchmark dataset called KITTI.<\/p>\n<p>\u201cAt the time we submitted this paper, MonoCon performed better than any of the dozens of other AI programs aimed at extracting 3D data on automobiles from 2D images,\u201d Wu says. MonoCon performed well at identifying pedestrians and cyclists, but was not the best AI program at those identification tasks.<\/p>\n<p>\u201cMoving forward, we are scaling this up and working with larger datasets to evaluate and fine-tune MonoCon for use in autonomous driving,\u201d Wu says. \u201cWe also want to explore applications in manufacturing, to see if we can improve the performance of tasks such as the use of robotic arms.\u201d<\/p>\n<p>The paper, \u201c<a href=\"https:\/\/arxiv.org\/pdf\/2112.04628.pdf\" rel=\"noreferrer noopener\">Learning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection<\/a>,\u201d will be presented at the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, being held virtually from Feb. 22 to March 1. First author of the paper is Xianpeng Liu, a Ph.D. student at NC State. The paper was co-authored by Nan Xue of Wuhan University.<\/p>\n<p>The work was done with support from the National Science Foundation, under grants 1909644, 1822477, 2024688 and 2013451; the Army Research Office, under grant W911NF1810295; and the U.S. 
Department of Health and Human Services, Administration for Community Living, under grant 90IFDV0017-01-00.<\/p>\n<p class=\"has-text-align-center\">-shipman-<\/p>\n<p><strong>Note to Editors:<\/strong> The study abstract follows.<\/p>\n<p><strong>\u201cLearning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection\u201d<\/strong><\/p>\n<p><em>Authors<\/em>: Xianpeng Liu and Tianfu Wu, North Carolina State University; and Nan Xue, Wuhan University<\/p>\n<p><em>Presented<\/em>: Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, Feb. 22-March 1.<\/p>\n<p><strong>Abstract:<\/strong> Monocular 3D object detection aims to localize 3D bounding boxes in an input single 2D image. It is a highly challenging problem and remains open, especially when no extra information (e.g., depth, lidar and\/or multi-frames) can be leveraged in training and\/or inference. This paper proposes a simple yet effective formulation for monocular 3D object detection without exploiting any extra information. It presents the MonoCon method which learns Monocular Contexts, as auxiliary tasks in training, to help monocular 3D object detection. The key idea is that with the annotated 3D bounding boxes of objects in an image, there is a rich set of well-posed projected 2D supervision signals available in training, such as the projected corner keypoints and their associated offset vectors with respect to the center of 2D bounding box, which should be exploited as auxiliary tasks in training. The proposed MonoCon is motivated by the Cram\u00e9r\u2013Wold theorem in measure theory at a high level. 
In implementation, it utilizes a very simple end-to-end design to justify the effectiveness of learning auxiliary monocular contexts, which consists of three components: a Deep Neural Network (DNN) based feature backbone, a number of regression head branches for learning the essential parameters used in the 3D bounding box prediction, and a number of regression head branches for learning auxiliary contexts. After training, the auxiliary context regression branches are discarded for better inference efficiency. In experiments, the proposed MonoCon is tested in the KITTI benchmark (car, pedestrian and cyclist). It outperforms all prior arts in the leaderboard on the car category and obtains comparable performance on pedestrian and cyclist in terms of accuracy. Thanks to the simple design, the proposed MonoCon method obtains the fastest inference speed with 38.7 fps in comparisons.<\/p>\n<p><em>This post was <a href=\"https:\/\/news.ncsu.edu\/2022\/01\/monocon-ai-3d\/\">originally published<\/a> in NC State News.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p><img decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/ece.ncsu.edu\/wp-content\/uploads\/2022\/01\/tianfu-wu-monocon-header-1024x576.jpg\" class=\"attachment-large size-large wp-post-image\" alt=\"photo shows cars on a street, with each of them surrounded by lines indicating a bounding box\" loading=\"lazy\" srcset=\"https:\/\/ece.ncsu.edu\/wp-content\/uploads\/2022\/01\/tianfu-wu-monocon-header-980x551.jpg 980w, https:\/\/ece.ncsu.edu\/wp-content\/uploads\/2022\/01\/tianfu-wu-monocon-header-480x270.jpg 480w\" sizes=\"auto, (min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw\">The work would help autonomous vehicles navigate in relation to other 
vehicles.<\/p>\n","protected":false},"author":9,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"ncst_dynamicHeaderBlockName":"","ncst_dynamicHeaderData":"","ncst_content_audit_freq":"","ncst_content_audit_date":"","ncst_content_audit_display":false,"ncst_backToTopFlag":"","footnotes":""},"categories":[180],"tags":[],"class_list":["post-1180","post","type-post","status-publish","format-standard","hentry","category-research"],"displayCategory":null,"acf":{"ncst_posts_meta_modified_date":null},"_links":{"self":[{"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/posts\/1180","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/comments?post=1180"}],"version-history":[{"count":2,"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/posts\/1180\/revisions"}],"predecessor-version":[{"id":2481,"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/posts\/1180\/revisions\/2481"}],"wp:attachment":[{"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/media?parent=1180"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/categories?post=1180"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/my.ece.ncsu.edu\/communications\/wp-json\/wp\/v2\/tags?post=1180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}