Crowd counting remains a popular and challenging research topic in computer vision because of the variation in human head scales and the interference of background noise. Some existing methods use multi-level feature fusion to handle scale variation, but involving shallow features in the fusion process can make background noise interference more serious. In this paper, we propose a Multilevel Information Sharing Network based on Residual Attention (RA-MISNet) to solve this problem. RA-MISNet consists of a feature extraction component, an information sharing module, and a residual attention density map estimator. Beyond addressing the multi-scale problem, the proposed method adopts a residual attention mechanism to refine the crowd distribution information in the shared features at all levels, which reduces the interference of complex textured backgrounds on density map regression. Furthermore, because label noise is severe in high-density crowd areas, we design a Regional Multi-level Segmentation Loss (RMS Loss) that divides a single crowd image into multi-level density regions with different label noise rates and applies supervision constraints of corresponding granularity to each density-level region. Extensive experiments on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF) demonstrate the effectiveness and superiority of the proposed method.
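The abstract does not give the formula of its residual attention mechanism, so the following is only a minimal sketch of the commonly used residual form, in which a feature map F is refined as F * (1 + sigmoid(A)) for attention logits A. The function names, the use of a sigmoid mask, and the toy values are all assumptions made for illustration and are not taken from RA-MISNet.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def residual_attention(features, attention_logits):
    """Refine a 2-D feature map as F * (1 + sigmoid(A)).

    Assumed form, not the paper's: the "+ 1" keeps the original feature
    (the residual path), so a near-zero mask leaves a location almost
    unchanged instead of erasing it, while a high mask amplifies it.
    """
    return [
        [f * (1.0 + sigmoid(a)) for f, a in zip(f_row, a_row)]
        for f_row, a_row in zip(features, attention_logits)
    ]


# Toy 2x2 feature map and hypothetical attention logits.
features = [[1.0, 2.0], [3.0, 4.0]]
logits = [[-10.0, 0.0], [0.0, 10.0]]
refined = residual_attention(features, logits)
```

In this toy case the location with a strongly negative logit keeps roughly its original value, and the location with a strongly positive logit is nearly doubled, which is the behaviour that lets an attention branch emphasise crowd regions without discarding the shared features.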
Crowd counting has become an active topic in computer vision because of challenges such as large scale variation among people, mutual occlusion, and perspective distortion. To address the large scale variation present in images, we propose a novel multi-scale network called MSNet, which aims to preserve continuous scale variation and count pedestrians accurately. Although most state-of-the-art multi-scale and multi-column networks aim to integrate scale information for heads of different sizes, much work remains to capture continuous scale variation. Specifically, MSNet uses the first ten layers of the Visual Geometry Group network (VGG) as the backbone to extract coarse image features, and employs a multi-scale block containing several kernels with different receptive fields to preserve scale information and better handle scale variation. Inspired by the observation that replacing a single large receptive field kernel with multiple small ones yields better performance, we use two dilated convolutions with a receptive field of 5 in place of the large kernel. MSNet incurs only a moderate increase in computation. We evaluate our method on three benchmark datasets, ShanghaiTech (Part A: MAE=59.6, RMSE=96.1; Part B: MAE=7.5, RMSE=12.1), UCF_CC_50 (MAE=207.9, RMSE=273.8), and UCF-QNRF (MAE=93, RMSE=158), to demonstrate its performance.
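The receptive field arithmetic behind the dilated convolutions can be sketched as follows. The abstract does not state the kernel size or dilation rate, so this assumes, purely for illustration, that each layer is a 3x3 convolution with dilation 2 and stride 1; the helper names are hypothetical. Under that assumption a single layer covers 5 pixels per side, and two stacked layers cover 9.

```python
def effective_kernel(kernel_size, dilation):
    """Span in pixels of one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in sequence.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer
    widens the field by (effective kernel size - 1).
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


# Assumed configuration (not given in the abstract): 3x3 kernels, dilation 2.
single = effective_kernel(3, 2)
stacked = stacked_receptive_field([(3, 2), (3, 2)])

# Weights per input/output channel pair: two 3x3 kernels vs. one dense 9x9.
weights_two_dilated = 2 * 3 * 3
weights_one_large = 9 * 9
```

With these assumed settings the stacked pair reaches the same 9x9 field as a single dense 9x9 kernel while using 18 weights per channel pair instead of 81, which is consistent with the abstract's claim of only a moderate increase in computation.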