Open Access Paper
24 May 2022
Research on image recognition method based on LMAL and VGG-16
Ying Cao, Runlong Gu, Chenghua Huang
Proceedings Volume 12260, International Conference on Computer Application and Information Security (ICCAIS 2021); 1226019 (2022) https://doi.org/10.1117/12.2637770
Event: International Conference on Computer Application and Information Security (ICCAIS 2021), 2021, Wuhan, China
Abstract
In deep convolutional neural networks, the traditional Softmax loss function lacks the ability to distinguish similar classes. To solve this problem, the idea of increasing the inter-class spacing and reducing the intra-class spacing is widely recognized. The Large Margin Angular Loss (LMAL) function is introduced, which reduces intra-class spacing by L2-normalizing the features and weight vectors of the Softmax loss function; LMAL also distinguishes deep features well. Combining the LMAL loss function with the VGG-16 model, results on three independent datasets show that the image recognition accuracy of the improved model is significantly improved.

1. INTRODUCTION

In recent years, with the rise of artificial intelligence, deep learning research in academia and industry has developed rapidly, and related computer vision technology has also made great progress. Starting from the Perceptron1 proposed by Rosenblatt as the first-generation computer vision system, Fukushima et al., inspired by the connection patterns of neurons in the cat brain, proposed the Neocognitron2. In subsequent studies, researchers combined the backpropagation algorithm3 with the Neocognitron, which finally led to the concept of the Convolutional Neural Network (CNN)4-5. CNNs have been proven to be effective models for solving various visual tasks6-8. A CNN implements a feature extractor through interleaved stacks of convolutional layers and a series of nonlinear and sub-sampling layers, which makes it a powerful network for describing images. Compared with other neural networks, the convolutional neural network, inspired by the visual circuit structure of the brain, greatly reduces the number of connections and weights between neurons in each layer, which reduces the training difficulty and computing time of the model and also reduces the probability of overfitting.

The most direct way to improve the performance of a deep neural network is to increase the size of the network, both its depth and the number of units in each layer. Especially when large amounts of labeled training data are available, this is a simple and safe way to train higher-quality models. However, this simple solution has two main disadvantages: it increases the probability of overfitting, which can become a major bottleneck as data grows, and it increases the demand for computing resources9. Therefore, this paper adopts the VGG-1610 model proposed by Simonyan et al. as the experimental model, where 16 denotes the number of network layers, including the convolutional layers, the fully connected layers, and the Softmax layer.

The loss function reflects the difference between the predicted data and the actual data and is a way to measure the performance of a model; it is also key to improving the accuracy of image recognition and classification. While researchers continue to propose new models, loss functions also keep evolving, such as Softmax Loss11, Contrastive Loss12, Triplet Loss13, SphereFace14, InsightFace15, etc. This article presents the common loss functions at this stage and analyzes their advantages and disadvantages; finally, a loss function is combined with the model, and the accuracy of the resulting model is evaluated on a large number of datasets.

Section 2 of this article introduces the convolutional neural network model, Section 3 compares and analyzes the advantages and disadvantages of the loss functions, and Section 4 presents the experimental analysis and discussion.

2. CONVOLUTIONAL NEURAL NETWORKS

The VGG-16 model10 is a deep convolutional neural network developed jointly by the Visual Geometry Group at the University of Oxford and Google DeepMind, as shown in Figure 1. After extensive experiments, the depth of the VGG network was set at 16-19 layers; this article uses the 16-layer configuration. Compared with previous high-performing network structures, the VGG network significantly reduced the test error rate, winning first place in the localization task and second place in the classification task of the ILSVRC-2014 competition. The VGG network is therefore widely used for image recognition and classification tasks.

Figure 1. VGG-16 architecture.

The VGG-16 model uses 3×3 convolution kernels and 2×2 pooling kernels, which is a major reason for its excellent performance. As the figure shows, the first 13 layers of the VGG-16 model are stacked convolutional layers, followed by three fully connected layers and, finally, the Softmax layer.
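To make this layer arrangement concrete, the following minimal sketch (PyTorch/torchvision, not the authors' code) instantiates the 16-layer configuration and verifies the 13 convolutional + 3 fully connected structure described above:

```python
# Minimal sketch, assuming PyTorch and torchvision are available; this is
# illustrative only and not the training setup used in the paper.
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(num_classes=1000)  # standard VGG-16: 3x3 convs, 2x2 max pooling

# Verify the structure described in the text: 13 conv layers + 3 FC layers.
num_convs = sum(isinstance(m, nn.Conv2d) for m in model.modules())
num_fcs = sum(isinstance(m, nn.Linear) for m in model.modules())
print(num_convs, num_fcs)  # -> 13 3
```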

3. LOSS FUNCTION

As mentioned above, the loss function reflects how well the model fits the data: the better the fit, the smaller the value of the loss function, and vice versa. The loss function also plays a major role in deep feature learning, so its selection is very important for image recognition and classification.

3.1. Large margin cosine loss

Liu et al.16 introduced the angular margin. When Softmax classifies, there is a sharp decision boundary, and points near this boundary reduce the generalization ability and robustness of the model. Wang et al.17 therefore proposed the Large Margin Cosine Loss (LMCL) function:

$$L_{LMCL} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{s(\cos\theta_{y_i,i}-m)}}{e^{s(\cos\theta_{y_i,i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_{j,i}}} \tag{1}$$

$$W_j = \frac{W_j^{*}}{\|W_j^{*}\|},\qquad x_i = \frac{x_i^{*}}{\|x_i^{*}\|} \tag{2}$$

$$\cos\theta_{j,i} = W_j^{T}x_i \tag{3}$$

where N is the number of training samples, x_i is the feature vector of the i-th sample belonging to class y_i, W_j is the weight vector of class j, θ_{j,i} is the angle between W_j and x_i, m is the cosine margin, and s is the scaling factor to which the feature norm ‖x‖ is fixed after L2 normalization.

The proposed LMCL loss function greatly improves model performance, removes the need to tune a tricky hyperparameter in the implementation, and converges more easily without joint Softmax supervision.
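For illustration, a PyTorch sketch of an LMCL-style classification head is given below; the margin m and scale s are common defaults from the literature17, not values tuned in this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeMarginCosineLoss(nn.Module):
    """LMCL sketch: L2-normalize features and class weights so logits are
    cosines (equations (2)-(3)), subtract the margin m from the target-class
    cosine (equation (1)), scale by s, then apply standard cross-entropy."""
    def __init__(self, in_features, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x, labels):
        # cos(theta_{j,i}) = W_j^T x_i after L2 normalization
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (cosine - self.m * one_hot)  # margin on target class only
        return F.cross_entropy(logits, labels)
```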

3.2. Large margin angular loss

The cosine margin has a one-to-one mapping from cosine space to angle space, but the margin in cosine space differs from that in angle space. In fact, the geometric interpretation of the angular margin is clearer than that of the cosine margin: the boundary in angle space corresponds to the arc distance on the hypersphere manifold, as shown in Figure 2. Taking the binary classification problem as an example, Figure 2 shows the intuitive correspondence between the angular margin and the arc on the hypersphere15.

Figure 2. Margin interpretation of LMAL.

Therefore, on the basis of LMCL, the Large Margin Angular Loss (LMAL) function is proposed15:

$$L_{LMAL} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{s\cos(\theta_{y_i,i}+m)}}{e^{s\cos(\theta_{y_i,i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_{j,i}}} \tag{4}$$

$$\|W_j\| = 1,\qquad \|x_i\| = 1,\qquad \cos\theta_{j,i} = W_j^{T}x_i \tag{5}$$

The variables in the formula have the same meanings as in equation (1). One advantage of LMAL is its clear geometric interpretation, as shown in Figure 3.

Figure 3. Geometric interpretation of LMAL.
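A corresponding LMAL head differs from the LMCL sketch above only in where the margin enters: it is added to the angle θ rather than subtracted from the cosine, per equation (4). A sketch follows, again with common default s and m from the literature15 rather than values reported in this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeMarginAngularLoss(nn.Module):
    """LMAL sketch: the target-class logit becomes cos(theta + m), equation (4)."""
    def __init__(self, in_features, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x, labels):
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))
        # recover theta; clamp avoids NaN from acos at the interval edges
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)  # margin in angle space
        return F.cross_entropy(logits, labels)
```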

4. EXPERIMENTAL ANALYSIS AND DISCUSSION

As mentioned above, the VGG-16 model and the LMAL loss function are selected for training on the ImageNet dataset. From the target logit curves of the three loss functions, shown in Figure 4, we can see that when θ ∈ [30°, 90°] the target logit curve of LMAL is lower than that of LMCL; thus, according to equation (4), LMAL imposes a stricter margin than LMCL in this interval. Experimental analysis shows that model performance cannot be significantly improved when θ < 30°. According to Deng et al.15, training does not converge when θ ∈ [60°, 90°], while model performance can be effectively improved when θ ∈ [30°, 60°].

Figure 4. Target logit curves.
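The target logit comparison in Figure 4 can be reproduced qualitatively with a few lines of NumPy/Matplotlib; the margins below are illustrative choices, not the paper's exact settings:

```python
import numpy as np
import matplotlib.pyplot as plt

theta_deg = np.linspace(0, 90, 181)
theta = np.radians(theta_deg)
m_cos, m_ang = 0.35, 0.5  # illustrative cosine and angular margins

plt.plot(theta_deg, np.cos(theta), label="Softmax: cos(θ)")
plt.plot(theta_deg, np.cos(theta) - m_cos, label="LMCL: cos(θ) - m")
plt.plot(theta_deg, np.cos(theta + m_ang), label="LMAL: cos(θ + m)")
plt.xlabel("θ (degrees)")
plt.ylabel("target logit")
plt.legend()
plt.show()
```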

The model is tested with PASCAL Visual Object Classes 2007 (VOC-2007) and PASCAL Visual Object Classes 2012 (VOC-2012). VOC-2007 includes 20 categories with a total of 9,963 images; VOC-2012 includes 20 object classes and 10 action classes with a total of 17,125 images. Ten samples selected from the test data are shown in Table 1.

Table 1. VOC-2012 top-1 test samples (image thumbnails not reproduced).

Test result      Accuracy (%)
Mountain bike    0.61760551
Ocean liner      0.99372792
Bullet train     0.99832124
French bulldog   0.99862778
Motor scooter    0.43986964
Persian cat      0.65801388
Racing car       0.87700474
Airliner         0.45872828
Yawl             0.85090107
Ski              0.92091626

Across the tests on the three datasets, the top-1 accuracy of the VGG-16 model with the angular margin loss is shown in Table 2.

Table 2. Independent datasets' top-1 accuracy (%).

Model / Test dataset    VOC-2007    VOC-2012    Caltech-256
VGG-16 with LMAL        79.15       72.48       82.27
VGG-16 with Softmax     75.27       70.59       77.07

As Table 2 shows, LMAL significantly improves test accuracy over Softmax on all three datasets. The results also show that recognition accuracy is strongly affected by differences in image background, illumination, and shooting angle, which is a limitation of this paper and a direction for future improvement.
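For completeness, top-1 accuracy of the kind reported in Table 2 can be computed with a generic evaluation loop such as the sketch below, assuming a standard PyTorch DataLoader; this is not the authors' evaluation code:

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Percentage of samples whose highest-scoring class matches the label."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)  # top-1 prediction per sample
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total
```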

5. CONCLUSION

This paper analyzes common loss functions and network models used in image recognition and proposes applying the LMAL loss function to the VGG-16 image recognition and classification model, which addresses the Softmax loss function's insufficient ability to reduce the intra-class gap. By integrating the improved loss function with the model, good results are achieved in experiments on the independent datasets VOC-2007, VOC-2012, and Caltech-256, and the performance of the model is improved to a certain extent. A future direction for this work is recognition and classification of images with complex backgrounds.

REFERENCES

[1] Rosenblatt, F., "The Perceptron: A Perceiving and Recognizing Automaton," Report 85-460-1, Cornell Aeronautical Laboratory (1957).
[2] Fukushima, K., "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, 36(4), 193-202 (1980). https://doi.org/10.1007/BF00344251
[3] Rumelhart, D. E., Hinton, G. E. and Williams, R. J., "Learning representations by back-propagating errors," Nature, 323(6088), 533-536 (1986). https://doi.org/10.1038/323533a0
[4] LeCun, Y., Boser, B., Denker, J. S., et al., "Backpropagation applied to handwritten zip code recognition," Neural Computation, 1(4), 541-551 (1989). https://doi.org/10.1162/neco.1989.1.4.541
[5] Lang, K. J., Waibel, A. H. and Hinton, G. E., "A time-delay neural network architecture for isolated word recognition," Neural Networks, 3(1), 23-43 (1990). https://doi.org/10.1016/0893-6080(90)90044-L
[6] Krizhevsky, A., Sutskever, I. and Hinton, G. E., "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems (NIPS) (2012).
[7] Ren, S., He, K., Girshick, R. and Sun, J., "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems (NIPS) (2015).
[8] Toshev, A. and Szegedy, C., "DeepPose: Human pose estimation via deep neural networks," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 1653-1660 (2014).
[9] Szegedy, C., Liu, W., Jia, Y., et al., "Going deeper with convolutions," arXiv:1409.4842 (2014).
[10] Simonyan, K. and Zisserman, A., "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556 (2015).
[11] Cao, Q., Shen, L., Xie, W., Parkhi, O. M. and Zisserman, A., "VGGFace2: A dataset for recognising faces across pose and age," arXiv:1710.08092 (2017).
[12] Sun, Y., Chen, Y., Wang, X. and Tang, X., "Deep learning face representation by joint identification-verification," Proc. 27th Inter. Conf. on Advances in Neural Information Processing Systems, 1988-1996 (2014).
[13] Schroff, F., Kalenichenko, D. and Philbin, J., "FaceNet: A unified embedding for face recognition and clustering," Proc. CVPR, 815-823 (2015).
[14] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B. and Song, L., "SphereFace: Deep hypersphere embedding for face recognition," Proc. CVPR, 212-220 (2017).
[15] Deng, J., Guo, J. and Zafeiriou, S., "ArcFace: Additive angular margin loss for deep face recognition," arXiv:1801.07698 (2018).
[16] Liu, W., Wen, Y., Yu, Z. and Yang, M., "Large-margin softmax loss for convolutional neural networks," Proc. ICML, 507-516 (2016).
[17] Wang, F., Cheng, J., Liu, W. and Liu, H., "Additive margin softmax for face verification," IEEE Signal Processing Letters, 25(7), 926-930 (2018).
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Ying Cao, Runlong Gu, and Chenghua Huang "Research on image recognition method based on LMAL and VGG-16", Proc. SPIE 12260, International Conference on Computer Application and Information Security (ICCAIS 2021), 1226019 (24 May 2022); https://doi.org/10.1117/12.2637770
KEYWORDS: Data modeling, Performance modeling, Convolutional neural networks, Visual process modeling, Visualization, Image classification, Neural networks