We propose SiamGauss, a Siamese region proposal network with a Gaussian head for single-target visual object tracking on aerial benchmarks. Visual tracking in aerial videos faces unique challenges: the large field of view results in small objects, similar-looking objects (confusers) appear in close proximity, and occlusions and fast apparent motion arise from simultaneous object and camera motion. In Siamese tracking, a cross-correlation operation is performed in the embedding space to obtain a similarity map of the target within a search frame, which is then used to localize the target. During training, the proposed Gaussian head suppresses the activations that confusers in the search frame produce in the similarity map while boosting the confidence on the target. This activation suppression improves the confuser awareness of our tracker, and the boosted target activation helps maintain tracking consistency under fast motion. The Gaussian head is applied only during training and introduces no additional computational overhead during inference, so SiamGauss retains fast runtime performance. We evaluate our method on multiple aerial benchmarks and show that SiamGauss performs favorably against state-of-the-art trackers while running at 96 frames per second.
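As a rough illustration of the idea (not the authors' implementation), the sketch below builds a Gaussian-shaped training label centered on the target position within the similarity map; a regression loss against such a label boosts the target response while driving responses elsewhere, including on confusers, toward zero. The function name, map size, and the width parameter sigma are all illustrative assumptions.

```python
import numpy as np

def gaussian_label(map_size, target_yx, sigma=2.0):
    """Gaussian training label for a Siamese similarity map.

    Peaks at the target position and decays smoothly, so a loss
    against this label boosts the target activation and suppresses
    activations elsewhere (including on confusers).
    """
    ys, xs = np.mgrid[0:map_size[0], 0:map_size[1]]
    ty, tx = target_yx
    d2 = (ys - ty) ** 2 + (xs - tx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example: a 17x17 similarity map with the target at row 8, column 5.
label = gaussian_label((17, 17), (8, 5))
# A loss such as mean((response - label)**2) would be applied during
# training only; inference is unchanged, so runtime speed is unaffected.
```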
We present a Fully Convolutional Adaptive Tracker (FCAT) based on a Siamese architecture that operates in real time and is well suited for tracking from aerial platforms. Real-time performance is achieved by using a fully convolutional network to generate a densely sampled response map in a single pass. The network is fine-tuned on the tracked target with an adaptation approach similar to the procedure used to train Discriminative Correlation Filters (DCFs). A key difference is that FCAT fine-tunes the template feature directly using Stochastic Gradient Descent, whereas a DCF regresses a correlation filter. On surveillance-style videos, FCAT performs competitively with state-of-the-art visual trackers while maintaining real-time tracking speeds of over 30 frames per second.
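A minimal PyTorch sketch of the distinguishing step, under assumed tensor shapes (the actual FCAT network is not reproduced here): the template feature itself is the parameter being optimized, and SGD adapts it so that dense cross-correlation with the search feature yields the desired response.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a 256-channel template and search feature map.
template = torch.randn(1, 256, 6, 6, requires_grad=True)   # tracked-target feature
search_feat = torch.randn(1, 256, 22, 22)                  # search-region feature
desired = torch.zeros(1, 1, 17, 17)                        # desired response map
desired[0, 0, 8, 8] = 1.0                                  # peak at the target location

opt = torch.optim.SGD([template], lr=1e-2)
for _ in range(20):
    response = F.conv2d(search_feat, template)  # dense response map in one pass
    loss = F.mse_loss(response, desired)
    opt.zero_grad()
    loss.backward()                             # gradient w.r.t. the template only
    opt.step()
```

The contrast with a DCF is visible here: no correlation filter is regressed; the template embedding is updated directly.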
In this paper, we benchmark five state-of-the-art trackers on aerial platform videos: the Multi-Domain Convolutional Neural Network (MDNet) tracker, winner of the VOT2015 tracking challenge; the Fully Convolutional Network Tracker (FCNT); the Spatially Regularized Discriminative Correlation Filter (SRDCF) tracker; the Continuous Convolution Operator Tracker (CCOT), winner of the VOT2016 challenge; and the Tree-structured Convolutional Neural Network (TCNN) tracker. We assess performance in terms of both tracking accuracy and processing speed on two sets of videos: a subset of the OTB dataset in which the cameras are located at a high vantage point, and a new dataset of aerial videos captured from a moving platform. Our results indicate that these trackers performed as expected on the OTB subset; however, their performance degraded significantly on the aerial videos due to small target size, camera motion, and target occlusions. The CCOT tracker yielded the best overall accuracy, while the SRDCF tracker was the fastest.
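For concreteness, a common way to score tracking accuracy in OTB-style evaluations is bounding-box overlap (intersection-over-union) against ground truth; the sketch below is a generic version of that metric, not the paper's exact evaluation code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose overlap with ground truth exceeds the threshold."""
    scores = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(s >= threshold for s in scores) / len(scores)
```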
As Unmanned Aerial Systems grow in numbers, pedestrian detection from aerial platforms is becoming a topic of increasing importance. By providing greater contextual information and a reduced potential for occlusion, the aerial vantage point is highly advantageous for many surveillance applications, such as target detection, tracking, and action recognition. However, due to the greater distance between the camera and the scene, targets of interest in aerial imagery are generally smaller and show less detail. Deep Convolutional Neural Networks (CNNs) have demonstrated excellent object classification performance, and in this paper we adapt them to the problem of pedestrian detection from aerial platforms. We train a CNN with five layers: three convolution-pooling layers and two fully connected layers. We also address the computational inefficiency of the sliding-window method for object detection, in which a very large number of candidate patches are generated from each frame while only a small fraction contain pedestrians. We use the Edge Boxes object proposal generation method to screen candidate patches based on an "objectness" criterion, so that only regions likely to contain objects are processed. This significantly reduces the number of image patches processed by the neural network and makes our classification method very efficient. The resulting two-stage system is a good candidate for real-time implementation onboard modern aerial vehicles. Testing on three datasets confirmed that our system offers high accuracy for detecting pedestrians in aerial imagery.
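The two-stage flow can be sketched as follows: an objectness-based generator such as Edge Boxes supplies a small set of candidate regions, and only those regions are cropped, resized, and classified by the CNN. The `model.predict` interface and the threshold below are hypothetical stand-ins, not the paper's actual code.

```python
import cv2
import numpy as np

def classify_proposals(model, frame, proposals, input_size=(32, 64), thresh=0.5):
    """Second stage of a two-stage detector: classify only the candidate
    regions produced by an objectness-based proposal generator (e.g.,
    Edge Boxes), instead of every sliding-window patch.

    `model.predict(batch) -> scores` is a hypothetical classifier interface.
    """
    patches = []
    for (x, y, w, h) in proposals:
        patch = frame[y:y + h, x:x + w]
        patches.append(cv2.resize(patch, input_size))  # CNN input size as (w, h)
    scores = model.predict(np.stack(patches).astype(np.float32) / 255.0)
    return [box for box, s in zip(proposals, scores) if s > thresh]
```

Because the proposal stage typically keeps only hundreds of regions per frame instead of the tens of thousands a sliding window produces, the CNN cost drops by orders of magnitude.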
Unmanned Aerial Vehicles are becoming an increasingly attractive platform for many applications as their cost decreases and their capabilities increase. Creating detailed maps from aerial data requires fast and accurate video mosaicking methods. Traditional mosaicking techniques rely on inter-frame homography estimates that are cascaded through the video sequence, and computationally expensive keypoint matching algorithms are often used to determine keypoint correspondences between frames. This paper presents a video mosaicking method that uses an object tracking approach to match keypoints between frames, improving both efficiency and robustness. The proposed tracking method matches local binary descriptors between frames and leverages the spatial locality of the keypoints to simplify the matching process. Our method is robust to cascaded errors because it determines the homography between each frame and the ground plane rather than the prior frame; the frame-to-ground homography is calculated from the relationship between each point's image coordinates and its estimated location on the ground plane. Robustness to moving objects is built into the homography estimation step by detecting anomalies in keypoint motion and eliminating the influence of outliers. The resulting mosaics are highly accurate and can be computed in real time.
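A minimal sketch of the key step, assuming the tracked keypoints already carry estimated ground-plane coordinates (the paper's outlier detection is approximated here by OpenCV's RANSAC, which similarly discards points, such as those on moving objects, that are inconsistent with the dominant plane):

```python
import cv2
import numpy as np

def frame_to_ground_homography(image_pts, ground_pts):
    """Estimate the frame-to-ground homography directly from point pairs.

    image_pts, ground_pts: Nx2 arrays of corresponding image and
    ground-plane coordinates for tracked keypoints. RANSAC rejects
    inconsistent correspondences (e.g., keypoints on moving objects).
    """
    H, inlier_mask = cv2.findHomography(
        np.asarray(image_pts, np.float32),
        np.asarray(ground_pts, np.float32),
        method=cv2.RANSAC,
        ransacReprojThreshold=3.0,
    )
    return H, inlier_mask

# Each frame then warps into the mosaic independently of prior frames,
# so homography errors do not cascade:
# mosaic_patch = cv2.warpPerspective(frame, H, mosaic_size)
```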
With the growing ubiquity of mobile devices, advanced applications rely on computer vision techniques to provide novel experiences for users. Currently, few tracking approaches take into consideration the resource constraints of mobile devices. Designing efficient tracking algorithms and optimizing them for mobile devices can yield better, more efficient tracking for applications such as augmented reality. In this paper, we use binary descriptors, including Fast Retina Keypoint (FREAK), Oriented FAST and Rotated BRIEF (ORB), Binary Robust Independent Elementary Features (BRIEF), and Binary Robust Invariant Scalable Keypoints (BRISK), to obtain real-time tracking performance on mobile devices. We target both Google's Android and Apple's iOS operating systems. The Android implementation uses Android's Native Development Kit (NDK), which provides the performance benefits of native code as well as access to legacy libraries; the iOS implementation was created using both native Objective-C and C++. We also introduce simplified versions of the BRIEF and BRISK descriptors that improve processing speed without compromising tracking accuracy.
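To illustrate why binary descriptors suit resource-constrained devices, the sketch below (in Python with OpenCV for readability, though the paper's implementations are native Android/iOS code) matches ORB descriptors with Hamming distance, which reduces to XOR and popcount operations on mobile CPUs; BRISK or FREAK descriptors plug into the same matcher. The distance cutoff is an illustrative assumption.

```python
import cv2

orb = cv2.ORB_create(nfeatures=500)

def match_frames(prev_gray, curr_gray, max_distance=40):
    """Match binary (ORB) descriptors between consecutive grayscale frames."""
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    # Hamming distance is the natural metric for binary descriptors and
    # is very cheap to compute on mobile hardware.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    # Keep only tight matches; a real tracker could further prune
    # candidates by spatial locality between frames.
    return [m for m in matches if m.distance < max_distance]
```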