1. INTRODUCTION

In recent years, with the rise of artificial intelligence, deep learning research in academia and industry has developed rapidly, and related computer vision technology has also made great progress. Starting from the Perceptron1 proposed by Rosenblatt as the first-generation computer vision system, Fukushima et al., inspired by the connection pattern of neurons in the cat's visual cortex, proposed the Neocognitron2. In subsequent studies, researchers combined the backpropagation algorithm3 with the Neocognitron and finally arrived at the concept of the Convolutional Neural Network (CNN)4-5. CNNs have been proven to be effective models for solving various visual tasks6-8. A CNN implements a feature extractor through an interleaved stack of convolutional layers and a series of nonlinear and sub-sampling layers, which makes it a powerful network for describing images. Compared with other neural networks, the convolutional neural network, inspired by the visual circuitry of the brain, greatly reduces the number of connections and weights between neurons in each layer, which reduces the training difficulty and computing time of the model and also reduces the probability of overfitting. The most direct way to improve the performance of a deep neural network is to increase the size of the network, both its depth and the number of units in each layer. Especially when large amounts of labeled training data are available, this is a simple and safe way to train higher-quality models. However, this simple solution has two main disadvantages: first, it increases the probability of overfitting, which can become a serious bottleneck as the amount of data grows; second, it increases the demand for computing resources9. Therefore, this paper adopts the VGG16 model10 proposed by Simonyan et al.
as the experimental model; 16 represents the number of network layers, comprising convolutional layers, fully connected layers, and a Softmax layer. The loss function reflects the difference between the predicted data and the actual data; it is a way to measure the performance of the model and a key to improving the accuracy of image recognition and classification. While researchers continue to propose new models, loss functions are also constantly evolving, for example Softmax Loss11, Contrastive Loss12, Triplet Loss13, SphereFace14, and InsightFace15. This article reviews the common loss functions at this stage, analyzes their advantages and disadvantages, combines the chosen loss function with the model, and evaluates the accuracy of the model through tests on large data sets. Section 2 of this article introduces three neural network models, Section 3 compares and analyzes the advantages and disadvantages of three loss functions, and Section 4 presents the experimental analysis and discussion.

2. CONVOLUTIONAL NEURAL NETWORKS

The VGG-16 model10 is a deep convolutional neural network jointly developed by the University of Oxford's Visual Geometry Group and Google's DeepMind department, as shown in Figure 1. After many studies, the depth of the VGG network was determined to be 16-19 layers; this article uses the 16-layer structure. Compared with previous high-performing network structures, the VGG network achieves a significant drop in test error rate, and it won first place in the localization task and second place in the classification task of the ILSVRC-2014 competition. Therefore, the VGG network is widely used for image recognition and classification tasks. The VGG-16 model uses 3×3 convolution kernels and 2×2 pooling kernels, which is a major reason for its excellent performance.
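The effect of the 3×3 convolutions and 2×2 pooling kernels on feature-map size can be traced with a short sketch. This assumes the standard VGG-16 configuration (3×3 convolutions with stride 1 and padding 1, 2×2 max-pooling with stride 2, and a 224×224 input), which is not stated explicitly above:

```python
# Trace feature-map sizes through VGG-16's five convolution/pooling stages.
# Assumes the standard configuration: 3x3 convs, stride 1, padding 1
# (spatial size unchanged) and 2x2 max-pooling, stride 2 (size halved).

def conv3x3(size: int) -> int:
    """3x3 convolution, stride 1, padding 1: (size - 3 + 2*1) // 1 + 1."""
    return (size - 3 + 2 * 1) // 1 + 1  # equals size

def pool2x2(size: int) -> int:
    """2x2 max-pooling, stride 2: halves the spatial size."""
    return size // 2

# VGG-16: (convs per stage, then one pool) over 5 stages = 13 conv layers.
stages = [2, 2, 3, 3, 3]
size = 224  # standard VGG input resolution
for convs in stages:
    for _ in range(convs):
        size = conv3x3(size)
    size = pool2x2(size)

print(size)         # final spatial size of the last feature map
print(sum(stages))  # number of convolutional layers before the 3 FC layers
```

Running the sketch shows the 224×224 input shrinking to a 7×7 map after five pooling stages, and confirms the 13 convolutional layers mentioned below.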
It can be seen from the figure that the first 13 layers of the VGG-16 model are a stack of convolutional layers, followed by three fully connected layers and a final Softmax layer.

3. LOSS FUNCTION

As mentioned above, the loss function reflects how well the model fits the data: the better the fit, the smaller the value of the loss function, and vice versa. The loss function also plays a large role in deep feature learning, so its selection is very important for image recognition and classification.

3.1 Large margin cosine loss

Liu et al.16 introduced an angular margin. When Softmax classifies, there is an obvious decision boundary, and points near the decision boundary reduce the generalization ability and robustness of the model. Therefore, Wang et al.17 proposed the Large Margin Cosine Loss (LMCL):

$$L_{\mathrm{LMCL}} = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i} e^{s\cos\theta_j}} \qquad (1)$$

where N represents the number of training samples, x_i represents the i-th feature vector belonging to class y_i, W_j represents the weight vector of class j, θ_j is the angle between W_j and x_i, m is the cosine margin, and s = ||x|| is the scale of the normalized features. The proposed LMCL loss function greatly improves the performance of the model, eliminates the need to tune a tricky hyperparameter in the implementation, and converges more easily without joint Softmax supervision.

3.2 Large margin angular loss

The cosine margin gives a one-to-one mapping from cosine space to angle space, but the margin in cosine space differs from the margin in angle space. In fact, the geometric interpretation of the angular margin is clearer than that of the cosine margin: the boundary in angle space corresponds to the arc distance on the hypersphere manifold, as shown in Figure 2. Taking the binary classification problem as an example, Figure 2 shows the intuitive correspondence between the angular margin and the arc on the hypersphere15.
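As a concrete illustration of the cosine-margin formulation of Section 3.1, the following sketch computes the LMCL loss for a toy batch in pure Python. The s and m values and the helper name are illustrative, not the paper's settings:

```python
import math

def lmcl_loss(cosines, labels, s=30.0, m=0.35):
    """Large Margin Cosine Loss for a batch (illustrative sketch).

    cosines: list of per-sample lists, cos(theta_j) for each class j
             (i.e. normalized-weight . normalized-feature dot products).
    labels:  ground-truth class index y_i per sample.
    s, m:    scale and cosine margin (illustrative values).
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        # Subtract the margin m only from the target-class cosine.
        logits = [s * (c - m) if j == y else s * c
                  for j, c in enumerate(cos_row)]
        log_sum = math.log(sum(math.exp(z) for z in logits))
        total += -(logits[y] - log_sum)  # cross-entropy on margined logits
    return total / len(labels)

# Toy batch: 2 samples, 3 classes; each row holds cos(theta_j).
cosines = [[0.9, 0.1, -0.3],
           [0.2, 0.8, 0.0]]
labels = [0, 1]
print(lmcl_loss(cosines, labels))
```

Setting m = 0 recovers the normalized-softmax loss, so the margined loss is always at least as large: the margin makes the training objective stricter for the target class.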
Therefore, on the basis of LMCL15, a Large Margin Angular Loss (LMAL) is proposed:

$$L_{\mathrm{LMAL}} = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i} e^{s\cos\theta_j}} \qquad (2)$$

The meaning of the variables in the formula is the same as in equation (1), except that m here is an additive angular margin. One of the advantages of LMAL is its clear geometric interpretation, as shown in Figure 3.

4. EXPERIMENTAL ANALYSIS AND DISCUSSION

As mentioned above, the VGG-16 model and the LMAL loss function are selected for training on the ImageNet data set. From the target logit curves of the three loss functions, shown in Figure 4, we can see that when θ ∈ [30°, 90°] the target logit curve of LMAL is lower than that of LMCL. Therefore, according to equation (4), LMAL imposes a stricter margin than LMCL in this interval. Experimental analysis shows that model performance cannot be significantly improved when θ < 30°. According to Deng et al.15, training does not converge when θ ∈ [60°, 90°], while model performance can be effectively improved when θ ∈ [30°, 60°]. The model is tested on Pascal Visual Object Classes 2007 (VOC-2007), Pascal Visual Object Classes 2012 (VOC-2012), and Caltech-256. VOC-2007 includes 20 categories with a total of 9963 images; VOC-2012 includes 20 object classes and 10 action classes with a total of 17125 images. Ten samples are selected from the three data sets respectively, as shown in Table 1.

Table 1. VOC-2012 top-1 test samples.
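The comparison of target logit curves above can be checked numerically. The sketch below evaluates the unscaled target logits cos(θ) − m for the cosine margin and cos(θ + m) for the angular margin; the margin values (0.35 cosine, 0.5 radian) are commonly used settings from the cited literature, not necessarily the ones used in this paper:

```python
import math

def lmcl_target_logit(theta_deg, m=0.35):
    """LMCL target logit (before scaling by s): cos(theta) - m."""
    return math.cos(math.radians(theta_deg)) - m

def lmal_target_logit(theta_deg, m=0.5):
    """Additive-angular-margin target logit: cos(theta + m), m in radians."""
    return math.cos(math.radians(theta_deg) + m)

for theta in (45, 60, 75, 90):
    lmcl, lmal = lmcl_target_logit(theta), lmal_target_logit(theta)
    # A lower target logit forces training to push cos(theta) higher,
    # i.e. the margin is stricter at that angle.
    print(f"theta={theta:3d}  LMCL={lmcl:+.3f}  LMAL={lmal:+.3f}")
```

With these margin values, the angular-margin logit is the lower of the two throughout most of the interval of interest, consistent with the stricter-margin claim above.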
Through tests on the three data sets, the accuracy of the VGG-16 model based on the angular margin loss is shown in Table 2.

Table 2. Independent data sets' top-1 accuracy (%).
It can be seen from Table 2 that, across the three data sets, LMAL significantly improves test accuracy over Softmax. The results also show that the image recognition rate is greatly affected by differences in image background, illumination, and shooting angle, which is a deficiency of this paper and one of the directions for future improvement.

5. CONCLUSION

This paper analyzes three common loss functions and models in image recognition and proposes applying the LMAL loss function to the VGG-16 image recognition and classification model, which addresses the potential problem that the Softmax loss function is insufficiently able to reduce the intra-class gap. Through the integration of the improved loss function with the model, good results are achieved in experiments on the independent data sets VOC-2007, VOC-2012, and Caltech-256, and the performance of the model is improved to a certain extent. A future direction of this work is to improve recognition and classification when the image background is complex.

REFERENCES

Rosenblatt, F.,
"The Perceptron: A Perceiving and Recognizing Automaton," Report 85-460-1, Cornell Aeronautical Laboratory (1957).
Fukushima, K., "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, 36(4), 193–202 (1980). https://doi.org/10.1007/BF00344251
Rumelhart, D. E., Hinton, G. E. and Williams, R. J., "Learning representations by back-propagating errors," Nature, 323(6088), 533–536 (1986). https://doi.org/10.1038/323533a0
LeCun, Y., Boser, B. and Denker, J. S., "Backpropagation applied to handwritten zip code recognition," Neural Computation, 1(4), 541–551 (1989). https://doi.org/10.1162/neco.1989.1.4.541
Lang, K. J., Waibel, A. H. and Hinton, G. E., "A time delay neural network architecture for speech recognition," Neural Networks, 3(1), 23–43 (1990). https://doi.org/10.1016/0893-6080(90)90044-L
Krizhevsky, A., Sutskever, I. and Hinton, G. E., "ImageNet classification with deep convolutional neural networks," NIPS (2012).
Ren, S., He, K., Girshick, R. and Sun, J., "Faster R-CNN: Towards real-time object detection with region proposal networks," NIPS (2015).
Toshev, A. and Szegedy, C., "DeepPose: Human pose estimation via deep neural networks," Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 1653–1660 (2014).
Szegedy, C., Liu, W., Jia, Y. Q., et al., "Going deeper with convolutions," (2014).
Simonyan, K. and Zisserman, A., "Very deep convolutional networks for large-scale image recognition," (2015).
Cao, Q., Shen, L., Xie, W., Parkhi, O. M. and Zisserman, A., "VGGFace2: A dataset for recognising faces across pose and age," (2017).
Sun, Y., Chen, Y., Wang, X. and Tang, X., "Deep learning face representation by joint identification-verification," Proc. of 27th Inter. Conf. on Advances in Neural Information Processing Systems, 1988–1996 (2014).
Schroff, F., Kalenichenko, D. and Philbin, J., "FaceNet: A unified embedding for face recognition and clustering," CVPR, 815–823 (2015).
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B. and Song, L., "SphereFace: Deep hypersphere embedding for face recognition," CVPR, 212–220 (2017).
Deng, J. K., Guo, J. and Zafeiriou, S., "Additive angular margin loss for deep face recognition," (2018).
Liu, W., Wen, Y., Yu, Z. and Yang, M., "Large-margin softmax loss for convolutional neural networks," ICML, 507–516 (2016).
Wang, F., Cheng, J., Liu, W. and Liu, H., "Additive margin softmax for face verification," (2018).