Not all temporal shift modules are profitable

Youshan Zhang, Yong Li, Shaozhe Guo, and Qiming Liang
Abstract

With the increasing coverage of video surveillance systems in modern society, the demand for artificial intelligence algorithms to replace humans in recognizing violent behavior has grown stronger. By shifting some channels along the temporal dimension, the temporal shift module (TSM) achieves the performance of a three-dimensional convolutional neural network (CNN) at the complexity of a two-dimensional CNN, extracting temporal and spatial information at the same time. Our intuition is that too many temporal shift modules may fuse too much action information into each frame, which weakens the capability of the CNN to extract spatiotemporal information. To verify this conjecture, we adjusted the network structure based on TSM, proposed the partial TSM, selected the optimal model through experiments, and verified the performance of the algorithm on multiple datasets as well as on our expanded dataset. The proposed optimal model not only reduces hardware memory usage but also achieves higher accuracy on multiple datasets in 77.3% of the running time. It also achieves state-of-the-art performance of 91% on the RWF-2000 dataset.

1. Introduction

As urban digitization increases year by year, more and more video monitoring equipment is deployed in public places such as shopping malls, airports, schools, and railway stations, which has driven the rapid development of video monitoring systems. In just over 20 years, the monitoring equipment in many major cities has come to cover most large venues and trunk roads, and video monitoring technology has continuously developed and improved. At present, however, most intelligent video monitoring systems only provide video data acquisition and storage. Analyzing and understanding the monitored content still relies on manual monitoring, which is time-consuming and costly. People inevitably become exhausted after viewing for a long time, so it is difficult for them to process large amounts of video data effectively.

Deep learning plays a great role in the field of computer vision. Many studies1–6 are devoted to extracting and fusing spatiotemporal information quickly and effectively. Its application to behavior recognition has become an important way to replace manual video processing. Depending on the backbone framework chosen, behavior recognition mainly uses recurrent neural networks, two-stream convolutional neural networks, three-dimensional (3D) convolutional neural networks, and transformers.

The main idea of two-stream convolutional neural networks is to extract spatial and temporal information separately through two convolutional neural networks and then fuse them with an appropriate fusion method for action recognition. Simonyan and Zisserman7 first proposed a two-stream convolutional neural network for behavior recognition. Feichtenhofer et al.8 first discussed different fusion strategies for spatial and temporal information in two-stream convolutional neural networks. Inspired by ResNet,9 Feichtenhofer et al.10 redesigned a lightweight two-stream convolutional neural network, applied shortcuts between the spatial branch and the temporal branch, and used different frame rates to better focus on dynamic information.

The main idea of 3D convolutional neural networks is to extract temporal and spatial information from multiple adjacent frames at the same time with 3D convolution kernels. Because of dimensional constraints, two-dimensional (2D) convolution struggles to extract temporal information from a single picture; Ji et al.11 first applied 3D convolution to human behavior recognition. Tran et al.12 proposed an approach that used deep 3D convolutional networks for spatio-temporal feature learning and showed that a 3×3×3 convolution kernel achieves the best accuracy. Qiu et al.13 combined 2D spatial and one-dimensional temporal convolutions to replace the 2D residual module in the residual neural network (ResNet) and achieved better results.

As is well known, algorithms and ideas from the field of natural language processing are widely transferred to the field of computer vision. To complete computer vision tasks, the most important thing is how to extract spatial information and fuse it with the temporal information between frames.

Recurrent neural networks can extract the spatial information of frames and transmit temporal information through hidden layers. Hochreiter and Schmidhuber14 proposed a new network structure, long short-term memory (LSTM), which enables recurrent neural networks to retain feature information over longer sequences and alleviates, to some extent, the shrinking receptive field caused by vanishing and exploding gradients. Donahue et al.15 proposed a new architecture, long-term recurrent convolutional networks (LRCNs), which connects a convolutional neural network directly with an LSTM.

Since the vision transformer16 broke the barrier to applying transformers in the field of computer vision, a series of transformer-based studies has shown their potential in the field. The latest results17–22 show that the performance of transformers on computer vision tasks has matched or exceeded that of convolutional neural networks (CNNs).

Because of its structural characteristics, the recurrent neural network has not achieved good results in the field of computer vision. Two-stream convolutional neural networks extract spatio-temporal information through two-way convolution; since the two streams are fused for recognition, redundant information is extracted. The 3D CNN expands the dimension of the convolution kernel, so its parameters and computation increase dramatically compared with a 2D CNN. Transformer-based frameworks use pure attention instead of convolution kernels, so their feature extraction is not as efficient as that of CNN frameworks, which means they must pay an expensive time cost on massive data. To recognize violent behavior in surveillance video in real time, we need a more lightweight and efficient way to extract spatio-temporal information. Lin et al.23 proposed the temporal shift module (TSM): by shifting and splicing adjacent frames along the temporal dimension and using 2D convolution to extract spatio-temporal information at the same time, the performance of 3D convolution is achieved while its problems with parameters and computation are avoided.

It is worth mentioning that Ding et al.24 proposed RepVGG, a simple architecture built from a stack of 3×3 convolutions and rectified linear units. It runs much faster than ResNet-50 and ResNet-101 while achieving higher accuracy. This architecture has no branches, which means every layer takes the output of its only preceding layer as input and feeds its output into its only following layer. This has something in common with the idea of this paper.

Our intuition is that a few TSMs can fuse the information of adjacent frames into a single picture, but too many TSMs may fuse too much action information into each frame, which weakens the ability of the CNN to extract spatio-temporal information. To verify this conjecture, we adjust the network structure based on TSM and propose the partial TSM (P-TSM); experiments are carried out to explore whether to use the temporal shift module in different stages of the network. The main contributions of this paper are as follows:

  • 1. A P-TSM is proposed, which reduces the network complexity, protects the feature extraction capability of the backbone convolutional network, and improves the accuracy of the algorithm.

  • 2. The two-cascade TSM is introduced into the partial TSM. Experiments show that combining single-cascade and two-cascade TSMs cannot further improve accuracy compared with the full two-cascade TSM.

  • 3. The performance of the algorithm is verified on existing open-source datasets and on the expanded violent behavior recognition dataset that we established.

2. Background

2.1. Temporal Shift Module

To better complete the task of behavior recognition, it is very important to effectively extract temporal and spatial information from the continuous frames of a video. Traditional 2D convolution uses 2D kernels, which can only extract the spatial information of a single frame and cannot, like 3D convolution, extract temporal and spatial information from multiple frames at the same time. The TSM proposed by Lin et al.23 solves this problem in a novel way: by moving some channels along the temporal dimension, information from adjacent frames is mixed. TSMs are inserted before the 2D convolutions, so the 2D kernels can extract spatial and temporal information at the same time. In this way, the performance of a 3D CNN is achieved at 2D CNN complexity.

As shown in Fig. 1(a), the original input tensor is stacked from several adjacent frames, and different colors represent frames at different times. As shown in Fig. 1(b), by moving some channels along the temporal dimension within the same input batch, a 2D CNN can extract spatial and temporal information at the same time.

Fig. 1. (a) The original tensor without shift and (b) the tensor after shift.

For clarity, assume the input is an infinite one-dimensional sequence $X$ and the $1\times 3$ convolution kernel is $W_1=(a,b,c)$; then the convolution $Y=\mathrm{Conv}(W_1,X)$ can be written as

Eq. (1)

$$Y_i = a X_{i-1} + b X_i + c X_{i+1}.$$

If we shift the input $X$ by $-1$, $0$, $+1$, the shift operation can be written as

Eq. (2)

$$X_i^{-1} = X_{i-1}, \qquad X_i^{0} = X_i, \qquad X_i^{+1} = X_{i+1}.$$

The convolution that follows then completes the multiply-accumulate operation:

Eq. (3)

$$Y_i = a X_i^{-1} + b X_i^{0} + c X_i^{+1},$$

Eq. (4)

$$Y = a X^{-1} + b X^{0} + c X^{+1}.$$
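The shift in Eq. (2) is cheap to express in code. Below is a minimal PyTorch sketch of the channel-wise temporal shift, assuming the feature map is laid out as (batch × frames, channels, height, width), the layout used by 2D-CNN video backbones, and that 1/8 of the channels are shifted in each direction; the function name and the shifted fraction are illustrative choices, not the authors' exact implementation.

```python
import torch


def temporal_shift(x: torch.Tensor, n_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels one step backward and forward in time.

    x is assumed to have shape (batch * n_segments, C, H, W); fold_div = 8
    moves 1/8 of the channels toward the past, 1/8 toward the future, and
    leaves the remaining channels untouched.
    """
    nt, c, h, w = x.size()
    n_batch = nt // n_segments
    x = x.view(n_batch, n_segments, c, h, w)

    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # frame t receives channels from t + 1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # frame t receives channels from t - 1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # unshifted channels

    return out.view(nt, c, h, w)


# Example: two clips of eight frames with 64-channel 56 x 56 feature maps.
features = torch.randn(2 * 8, 64, 56, 56)
print(temporal_shift(features, n_segments=8).shape)  # torch.Size([16, 64, 56, 56])
```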

2.2. Two-cascade TSM Residual Module

Building on the work above, Liang et al.25 observed that the TSM network acquires only limited long-term information during behavior recognition, and that the network structure is so simple that over-fitting is prone to occur during feature learning. To solve this problem, a two-cascade TSM was proposed, as shown in Fig. 2, which expands the receptive field along the temporal dimension and enhances the capacity to extract long-term information.

Fig. 2. Two-cascade temporal shift module.

On the result $Y$ of the first TSM, a second TSM is applied, with the $1\times 3$ kernel of the second stage set to $W_2=(w_1,w_2,w_3)$. The two cascaded stages achieve the effect of a $1\times 5$ convolution, and the output $Z$ is

Eq. (5)

$$Z = w_1 Y^{-1} + w_2 Y^{0} + w_3 Y^{+1},$$

Eq. (6)

$$Z = w_a X^{-2} + w_b X^{-1} + w_c X^{0} + w_d X^{+1} + w_e X^{+2},$$

where

Eq. (7)

$$w_a = w_1 a,$$

Eq. (8)

$$w_b = w_1 b + w_2 a,$$

Eq. (9)

$$w_c = w_1 c + w_2 b + w_3 a,$$

Eq. (10)

$$w_d = w_2 c + w_3 b,$$

Eq. (11)

$$w_e = w_3 c.$$
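Equations (5)–(11) say that two cascaded 1×3 convolutions are equivalent to a single 1×5 convolution with the composed weights above. This can be checked numerically; the sketch below uses NumPy with arbitrary example kernels (all numerical values here are illustrative):

```python
import numpy as np

a, b, c = 0.2, 0.5, 0.3        # first-stage kernel W1 = (a, b, c)
w1, w2, w3 = 0.4, 0.1, 0.5     # second-stage kernel W2 = (w1, w2, w3)

x = np.random.randn(64)        # a 1D signal standing in for one feature row

# Two cascaded 1x3 convolutions.
z_cascade = np.convolve(np.convolve(x, [a, b, c]), [w1, w2, w3])

# One 1x5 convolution with the composed weights of Eqs. (7)-(11).
wa = w1 * a
wb = w1 * b + w2 * a
wc = w1 * c + w2 * b + w3 * a
wd = w2 * c + w3 * b
we = w3 * c
z_direct = np.convolve(x, [wa, wb, wc, wd, we])

print(np.allclose(z_cascade, z_direct))  # True
```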

3. Partial Temporal Shift Module

3.1. Intuition

We first explain the intuition behind the P-TSM. The way a CNN extracts spatio-temporal information is that the deeper the network, the more specific the shapes it can recognize: when the network is shallow, it mostly extracts points and lines; at deeper levels, it can recognize specific objects or actions. Shifting too many channels to adjacent frames changes the input tensor of the convolution layer, which degrades the overall spatial feature learning capability of the model. In the TSM model, ResNet-50 is used as the backbone, and comparative experiments led to the residual TSM, in which the TSM is placed inside the residual branch of a residual block. Similarly, the two-cascade TSM also uses ResNet-50 as the backbone and inserts TSMs in the same way. To better explain the algorithm, we represent ResNet-50 with a simple flowchart, as shown in Fig. 3, in which TSM represents the temporal shift module, the dotted TSM represents the two-cascade TSM, Conv represents the convolution operations in each residual block, Avg represents the average pooling layer, and FC represents the fully connected layer.

Fig. 3. A simple two-cascade TSM flowchart.
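Following the residual TSM of Refs. 23 and 25, the shift sits on the residual branch of each bottleneck, in front of its first convolution, so the shortcut path still carries the unshifted activations. A hedged sketch of such a wrapper, reusing the temporal_shift function from the Sec. 2.1 sketch (the class name and the n_cascade argument are ours), is:

```python
import torch.nn as nn


class TemporalShift(nn.Module):
    """Shift channels in time, then apply the wrapped convolution.

    Wrapping conv1 of a bottleneck keeps the shift on the residual branch;
    n_cascade = 1 gives the single TSM and n_cascade = 2 the two-cascade TSM.
    """

    def __init__(self, conv: nn.Module, n_segments: int = 8, n_cascade: int = 1):
        super().__init__()
        self.conv = conv
        self.n_segments = n_segments
        self.n_cascade = n_cascade

    def forward(self, x):
        for _ in range(self.n_cascade):
            x = temporal_shift(x, self.n_segments)  # shift defined in the Sec. 2.1 sketch
        return self.conv(x)
```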

3.2. Too Many Shifts Blur the Input Tensor

This is like a video played at a multiplied speed: when playing at $(2n+1)\times$ speed, more frames pass through each moment, and the characteristic information of the current time is blurred. In the two-cascade TSM, the input of the second residual block can be written as

Eq. (12)

$$L = R_1(Z, X),$$

where $R_1$ denotes the convolution and shortcut connection of the first residual block in ResNet-50. Applying the two shift modules of the second block then gives

Eq. (13)

$$S_1 = T_1(L),$$

Eq. (14)

$$S_2 = T_2(S_1),$$

Eq. (15)

$$S_2 = T_2(T_1(L)),$$

Eq. (16)

$$S_2 = w_a L^{-2} + w_b L^{-1} + w_c L^{0} + w_d L^{+1} + w_e L^{+2},$$

where $T_1$ and $T_2$ denote the two temporal shift modules in the second residual block. At this point, the input tensor of the second convolution layer already spans nine frames. As the network becomes deeper, the input tensor of each convolution layer contains ever more frames: after $n$ TSMs, the input tensor contains information from $2n+1$ frames, which is equivalent to expanding the temporal receptive field. However, this leads to two serious problems: (1) it weakens the CNN's capability to extract spatial information at the current time, and (2) the input tensor at the current time contains too much information from frames at other times, which means that the temporal information extracted by the subsequent convolution layers is useless for behavior recognition.
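This bookkeeping is easy to make explicit. In the simplified per-stage view of Fig. 3 (the real ResNet-50 has several bottlenecks per stage, so the actual span grows even faster), each single shift widens the temporal span by one frame on each side and each two-cascade shift by two; a short sketch of this counting, with the flag encoding being our own:

```python
def temporal_span(stage_flags):
    """Frames spanned after the shifts of the flagged stages have been applied.

    Each entry is 0 (no shift), 1 (single TSM), or 2 (two-cascade TSM);
    n shifts in total give a span of 2n + 1 frames (Sec. 3.2).
    """
    n_shifts = sum(stage_flags)
    return 2 * n_shifts + 1


for flags in [(1, 1, 1, 1), (2, 2, 2, 2), (1, 1, 1, 0)]:
    print(flags, "->", temporal_span(flags), "frames")
# (1, 1, 1, 1) -> 9 frames    (original TSM, simplified view)
# (2, 2, 2, 2) -> 17 frames   (two-cascade TSM)
# (1, 1, 1, 0) -> 7 frames    (the block combination proposed below)
```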

3.3. Fewer Shifts Perform Better

Although the two-cascade TSM does expand the temporal receptive field compared with the TSM, we find that it is not worth using: it gains only 1% accuracy over the original TSM at a higher computational cost. If we apply the temporal shift module only to some blocks of ResNet-50, two significant advantages follow: (1) less data movement brings higher efficiency (lower GPU memory occupancy and shorter running time); although the shift operation itself adds no computation, it involves data movement, which increases memory usage and inference latency on hardware. (2) The spatial modeling capability of the 2D CNN backbone is protected; by removing some temporal shift modules, some blocks take the output of their only preceding layer as input without any branches, retaining the information of all channels in the current frame. We observed that when some temporal shift modules were appropriately removed, accuracy increased to a certain extent compared with the 2D CNN baseline (TSM).

3.4. Module Design

The temporal shift operation endows the 2D CNN with the ability to learn temporal features but harms the spatial modeling capability of the 2D CNN backbone. To balance the spatial and temporal feature learning abilities of the algorithm, we first measured, under the same experimental conditions, the accuracy of the single-cascade TSM, the two-cascade TSM, and the model without any temporal shift module, and then examined the impact of the temporal shift module on performance from the following two aspects.

3.4.1. Frequency and positions of single temporal shift module insertion

To study how the frequency and positions of single TSM insertion influence performance, we used different combination strategies and measured the accuracy. We evaluated the model with a ResNet-50 backbone and eight-frame input using zero to four temporal shift modules at all possible insertion positions. The experiments show that inserting temporal shift modules in three blocks outperforms the baseline regardless of the insertion positions.

As shown in Fig. 4, block [1,1,1,0] means that, with ResNet-50 as the backbone, a single temporal shift module is inserted into the first, second, and third residual blocks, while the fourth residual block remains unchanged. As shown in Table 2, the block combination [1,1,1,0] achieves the best performance among all combinations of frequency and position of single temporal shift module insertion. Therefore, in the following experiments, we use this combination to further improve the performance of the algorithm.

Fig. 4. Partial TSM with block [1,1,1,0].
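A block combination such as [1,1,1,0] then amounts to a per-stage flag list: 0 leaves a stage untouched, 1 wraps each of its bottlenecks with a single shift, and 2 with a two-cascade shift. The following sketch wires this onto a torchvision ResNet-50 using the TemporalShift wrapper above; the function name and flag encoding are ours, not the authors' released code.

```python
import torchvision


def make_partial_tsm(resnet, stage_flags=(1, 1, 1, 0), n_segments=8):
    """Insert shifts only into the ResNet stages whose flag is non-zero.

    stage_flags follows the block notation of the paper, one entry per stage
    (layer1..layer4): 0 = no shift, 1 = single TSM, 2 = two-cascade TSM.
    """
    stages = [resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]
    for stage, flag in zip(stages, stage_flags):
        if flag == 0:
            continue  # this stage keeps its plain 2D convolutions
        for bottleneck in stage:
            bottleneck.conv1 = TemporalShift(
                bottleneck.conv1, n_segments=n_segments, n_cascade=flag
            )
    return resnet


# Block [1,1,1,0]: shifts in the first three stages, layer4 left unchanged.
backbone = make_partial_tsm(torchvision.models.resnet50(), stage_flags=(1, 1, 1, 0))
```

Feeding such a backbone a tensor of shape (batch × 8, 3, 224, 224) would reproduce the eight-frame input setting used in our experiments.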

3.4.2. Combination of two-cascade and single temporal shift modules

Since the two-cascade TSM can expand the temporal receptive field, we combined two-cascade and single TSMs in all possible configurations. The experiments show that combining two-cascade and single TSMs cannot further improve the performance of the model, which also confirms our conjecture.

As shown in Fig. 5, block [2,1,1,0] means that, with ResNet-50 as the backbone, a two-cascade temporal shift module is inserted into the first residual block, single temporal shift modules are inserted into the second and third residual blocks, and the fourth residual block remains unchanged.

Fig. 5. Partial TSM with block [2,1,1,0].

4. Experiment Design

We first show that the P-TSM can further improve the performance of a 2D CNN on violent behavior recognition. We then explore the potential of combining single-cascade and two-cascade TSMs to test our conjecture. Finally, we demonstrate the performance of our method on multiple datasets and show that it has a lower computational cost than other optimal methods.

4.1. Training and Testing Setups

We carried out experiments on the video violence recognition task. To be comparable with the two-cascade TSM, this paper follows the hardware setting and deep-learning framework used in Ref. 25. Throughout the experiments, the deep-learning framework is PyTorch 1.5, the operating system is Ubuntu 16.04, and the CPU is an Intel i9-10920X. CUDA 10.2 is used for GPU acceleration, and two NVIDIA RTX 2080 Super GPUs with 8 GB of video memory each are used for parallel computing.

During training, we used the model pre-trained on Kinetics provided by Lin et al.23 to reduce the computational cost of network training. When training on the RWF-2000 dataset, we found that the training loss decreased from the beginning of training until it stabilized, and the validation loss also decreased continuously; however, at around 30 epochs the validation loss rose sharply, indicating that over-fitting occurred during the experiments. Therefore, this paper also uses the learning rate adjustment method proposed by Liang et al.:25 the initial learning rate is set to 0.001, and the learning rate is reduced to 90% of its value every two epochs, which not only speeds up the adjustment of the learning rate but also accelerates the model learning process.
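This schedule maps directly onto a standard PyTorch step scheduler; a minimal sketch, where the SGD optimizer and the backbone stand-in are our assumptions rather than details given in the paper:

```python
import torch
import torchvision

# Stand-in for the P-TSM backbone; any nn.Module with parameters works here.
model = torchvision.models.resnet50()

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Multiply the learning rate by 0.9 every two epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

for epoch in range(100):
    # ... one epoch of training and validation would run here ...
    optimizer.step()   # placeholder; real training calls this once per batch
    scheduler.step()   # epochs 0-1: lr = 1e-3, epochs 2-3: 9e-4, epochs 4-5: 8.1e-4, ...
```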

4.2. Model

To compare with the original TSM and the two-cascade TSM, we also use ResNet-50 as the backbone. The method we use to insert TSMs is the same as in Refs. 23 and 25.

4.3. Datasets

To fully test the performance of the algorithm and verify our conjecture, experiments were carried out on the four datasets mentioned in Ref. 25. A summary of the datasets used is given in Table 1.

Table 1. Datasets used for violence recognition.

Dataset | Year of release | Clips included | Frames per second | Frame resolution
Crowd violence | 2012 | 246 | 25 | 320 × 240
Hockey | 2011 | 1000 | 25 | 360 × 288
RWF-2000 | 2021 | 2000 | 30 | 300 × 240, 320 × 240, 480 × 360, 920 × 720, 1280 × 720
Expanded dataset | 2021 | 5000 | 25, 30 | 300 × 240, 320 × 240, 360 × 288, 480 × 360, 920 × 720, 1280 × 720

The crowd violence dataset mainly contains crowd scenes, but because of the long shooting distance and low resolution, most scenes are chaotic and blurry. The hockey dataset contains 1000 violent and non-violent videos collected from ice hockey games; the training set includes 800 video clips, the validation set 100 clips, and the test set 100 clips. The recently published RWF-200026 dataset contains 2000 surveillance video clips collected from YouTube; the training set includes 1600 clips, the validation set 200 clips, and the test set 200 clips. Each clip lasts 5 s and contains 150 frames. The dataset mainly includes violent behavior between two persons, among multiple persons, and in crowds, and the scenes are so rich and complicated that recognition is difficult. All clips were captured by security cameras and have not been altered with multimedia technology, so they match real scenes and have high research value. Figure 6 shows the basic situation of the datasets.

Fig. 6. Basic situation of the datasets: (a) crowd violence dataset; (b) hockey dataset; (c) RWF-2000 dataset; and (d) expanded dataset.

Compared with other widely used datasets, the above three datasets are still too small, and serious over-fitting still occurs during experiments, which is not conducive to applying violence recognition in real life. Therefore, our team expanded the dataset on the basis of the open-source violence recognition dataset UCF-Crime: we collected violent scenes from the hockey dataset, the movies dataset, the violent-flow dataset, the HMDB51 dataset, and other videos, and used the UCF101 and HMDB51 datasets as the main source of non-violent scenarios in the expanded dataset. Adobe Premiere Pro was used for editing. Because the duration of violence is always short, and to better learn the characteristics of violent behavior, the video duration is uniformly edited to 1 or 5 s. This enriches the scenes of the RWF-2000 dataset, greatly increases the number of samples, solves the over-fitting problem, and increases the universality of the dataset.

5. Results

Because the datasets are so small, the model begins to over-fit between 10 and 30 epochs, and this phenomenon persists throughout the rest of training; we therefore set the number of training epochs to 100. We explain our experimental results from two aspects: the single-cascade P-TSM and the combination of single-cascade and two-cascade P-TSM.

5.1. Single-cascade P-TSM

As shown in Table 2, we tried all insertion strategies for temporal shift modules in the single-cascade P-TSM, and the accuracy of each block combination is given. The number 1 indicates that a single temporal shift module is inserted into the corresponding residual block, and 0 indicates that the corresponding residual block is left unchanged.

Table 2. Different combination strategies of P-TSM compared with baseline.

Block combination | Accuracy | vs. ResNet-50 | vs. TSM | vs. two-cascade TSM
[0,0,0,0] | 84 | 0 | −4 | −5
[1,0,0,0] | 83.5 | −0.5 | −4.5 | −5.5
[0,1,0,0] | 84 | 0 | −4 | −5
[0,0,1,0] | 87.75 | +3.75 | −0.25 | −1.25
[0,0,0,1] | 86.75 | +2.75 | −1.25 | −2.25
[1,1,0,0] | 86.25 | +2.25 | −1.75 | −2.75
[1,0,1,0] | 87.75 | +3.75 | −0.25 | −1.25
[1,0,0,1] | 87 | +3 | −1 | −2
[0,1,1,0] | 87.75 | +3.75 | −0.25 | −1.25
[0,1,0,1] | 87.25 | +3.25 | −0.75 | −1.75
[0,0,1,1] | 86.25 | +2.25 | −1.75 | −2.75
[1,1,1,0] | 91 | +7 | +3 | +2
[1,1,0,1] | 88.75 | +4.75 | +0.75 | −0.25
[1,0,1,1] | 89 | +5 | +1 | 0
[0,1,1,1] | 88.75 | +4.75 | +0.75 | −0.25
[1,1,1,1] | 88 | +4 | 0 | −1
Note: Bold value emphasizes optimal performance.

We compare the results with related work. ResNet-50,9 which corresponds to block [0,0,0,0], is used as the basic framework. TSM,23 which corresponds to block [1,1,1,1], inserts a TSM in every block of the basic framework. The two-cascade TSM,25 which corresponds to block [2,2,2,2], inserts TSMs twice in every block of the basic framework. Most of the experiments using temporal shift modules perform better than those without, which shows that even inserting one temporal shift module can expand the temporal receptive field and enhance the capacity to extract long-term information. All P-TSM variants inserting three temporal shift modules achieve better results than the single-cascade TSM, and the best combination, block [1,1,1,0], is 3% higher than the original TSM and 2% higher than the two-cascade TSM. This shows that using too many temporal shift modules in different blocks does weaken the backbone's capability to extract spatial information. Appropriately removing temporal shift modules not only reduces the complexity of the model but also improves its performance.

P-TSM with only one or two temporal shift modules achieves at best 87.75%, which is even lower than the original TSM. This shows that inserting a moderate number of temporal shift modules best improves the model's capability to extract spatio-temporal information and thus its overall performance.

5.2. Combination of Single-cascade and Two-cascade P-TSM

As shown in Table 3, we tried all insertion strategies for temporal shift modules in the combined single-cascade and two-cascade P-TSM. The number 2 indicates that a two-cascade temporal shift module is inserted into the corresponding residual block.

Table 3. Different combination strategies of single-cascade and two-cascade TSM.

Block combination | Accuracy | vs. two-cascade TSM
[2,1,1,0] | 89.25 | +0.25
[1,2,1,0] | 88 | −1
[1,1,2,0] | 89.25 | +0.25
[2,2,1,0] | 88.25 | −0.75
[2,1,2,0] | 89 | 0
[1,2,2,0] | 88.25 | −0.75
[2,2,2,0] | 88.75 | −0.25

We can see that the combination of single-cascade and two-cascade P-TSM does not achieve excellent results: each combination performs close to the two-cascade TSM. Whether we trained directly from the model pre-trained on Kinetics or from the best single-cascade P-TSM we had already trained (block [1,1,1,0]), the best accuracy we could reach was 89.25%. We infer two reasons for this phenomenon:

  • 1. Too many temporal shift modules break the balance between the spatial and temporal feature learning abilities of the optimal model we have already trained, and

  • 2. The combination of two-cascade and single TSMs does not further improve the model's capability to extract spatio-temporal information.

5.3. Comparison of Optimal Methods

To further verify the performance of the P-TSM, we conducted experiments on different datasets and compared against other algorithms. Table 4 gives the results of the different algorithms on four violence recognition datasets.

Table 4. Comparison of optimal accuracy.

Algorithm | Crowd violence | Hockey | RWF-2000 | Expanded dataset
3D-CNN11 | 94.3 | 94.4 | 82.75 | 91.7
LRCN15 | 94.57 | 97.1 | 77 | 92.3
I3D27 | 88.89 | 97.5 | 85.75 | 93.3
AR-Net28 | 95.918 | 97.2 | 87.3 | 92.8
TSM23 | 95.95 | 97.5 | 88 | 94.6
TEA29 | 96.939 | 97.7 | 88.5 | 93.8
Two-cascade TSM25 | 96.939 | 98.05 | 89 | 94.8
P-TSM (ours) | 96.939 | 98.5 | 91 | 95.8
Note: Bold value emphasizes optimal performance.

As can be seen in Fig. 7, the algorithm proposed in this paper further improves accuracy compared with the original TSM and the two-cascade TSM. On the crowd violence dataset, the P-TSM is 1% higher than the original TSM and equal to the two-cascade TSM. On the hockey dataset, the P-TSM is 1% higher than the original TSM and 0.5% higher than the two-cascade TSM. On the RWF-2000 dataset, the P-TSM is 3% higher than the original TSM and 2% higher than the two-cascade TSM. On the expanded dataset, the P-TSM is 1.2% higher than the original TSM and 1% higher than the two-cascade TSM.

Fig. 7. Improvement of P-TSM and two-cascade TSM compared with original TSM.

To better show the improvement of our method, we compare its computational cost with that of existing methods in Table 5. Torchsummary is used to calculate each model's parameters and estimated total size. For a fair comparison, the input to the summary function is fixed: the batch size is 8, the picture size is 224 × 224, and the number of input channels is 3. Since torchsummary cannot handle the LSTM that is part of LRCN, torchinfo is used for the relevant calculations of LRCN. TensorBoard is used to record the relevant data and provides the time used after training and testing for 100 epochs.
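For reference, the measurement described above can be reproduced with a call along the following lines; the plain ResNet-50 stand-in and the CPU device are our simplifications, and the exact invocation used for the paper may differ:

```python
import torchvision
from torchsummary import summary

# Stand-in backbone; the P-TSM model built on ResNet-50 would be used in practice.
model = torchvision.models.resnet50()

# Fixed input as stated in the text: batch size 8, 3 channels, 224 x 224 pictures.
summary(model, input_size=(3, 224, 224), batch_size=8, device="cpu")
# The printed report includes "Total params" and "Estimated Total Size (MB)",
# the two quantities reproduced in Table 5.
```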

Table 5. Comparison of computational cost.

Algorithm | Params (MB) | Estimated total size (MB) | Time cost
3D-CNN11 | 297.56 | 2647.70 | 4 h 8 min 23 s
LRCN15 | 237.83 | 1212.93 | 3 h 5 min 7 s
I3D27 | 46.88 | 1000.20 | 2 h 3 min 48 s
TSN30 | 89.69 | 390.05 | 46 min 41 s
TSM23 | 89.69 | 397.71 | 1 h 53 min 26 s
TEA29 | 91.95 | 479.78 | 3 h 10 min 48 s
Two-cascade TSM25 | 89.69 | 397.71 | 2 h 8 min 20 s
P-TSM (ours) | 89.69 | 396.57 | 1 h 27 min 38 s

TSN is the basis of TSM, TEA, the two-cascade TSM, and our method. We can see that the algorithms based on TSN all have fewer parameters and smaller model sizes. Since TSM only increases the memory usage of the hardware rather than the model size and parameters, we compare the algorithms by time cost. After training and testing for 100 epochs on the RWF-2000 dataset, P-TSM takes 1 h 27 min 38 s compared with 1 h 53 min 26 s for the original TSM, improving the running speed by about 23%. With far fewer TSMs inserted, our proposed method also greatly reduces the training time compared with the two-cascade TSM.

6. Conclusion

To better recognize violent behavior in surveillance video, this paper improves on the original TSM and the two-cascade TSM. Our conjecture is that not all temporal shift modules improve the performance of the algorithm. This paper proposes a P-TSM, and relevant experiments show that using an appropriate number of temporal shift modules better balances the spatial and temporal learning capabilities of the algorithm. The proposed optimal model not only reduces the memory usage of hardware but also achieves higher accuracy on multiple datasets at a higher running speed. We also achieve state-of-the-art performance of 91% on the RWF-2000 dataset.

Acknowledgments

This work was supported by the National Educational Science 13th Five-year Plan Project (Grant No. JYKYB2019012), the Basic Research Fund for the Engineering University of PAP (Grant No. WJY201907) and the Basic Research Fund of the Engineering University of PAP (Grant No. WJY202120).

References

1. S. Kaur, P. Kumar, and P. Kumaraguru, "Deepfakes: temporal sequential analysis to detect face-swapped video clips using convolutional long short-term memory," J. Electron. Imaging 29(3), 033013 (2020). https://doi.org/10.1117/1.JEI.29.3.033013
2. T. Han et al., "Feature and spatial relationship coding capsule network," J. Electron. Imaging 29(2), 023004 (2020). https://doi.org/10.1117/1.JEI.29.2.023004
3. X. Zhang et al., "Multimodal polarization image simulated crater detection," J. Electron. Imaging 29(2), 023027 (2020). https://doi.org/10.1117/1.JEI.29.2.023027
4. J. Yan et al., "No-reference remote sensing image quality assessment based on gradient-weighted natural scene statistics in spatial domain," J. Electron. Imaging 28(1), 013033 (2019). https://doi.org/10.1117/1.JEI.28.1.013033
5. T. Dai et al., "Research on recognition of painted faces," J. Electron. Imaging 31(1), 013005 (2022). https://doi.org/10.1117/1.JEI.31.1.013005
6. S. Chen, W. Ma, and L. Zhang, "Dual-bottleneck feature pyramid network for multiscale object detection," J. Electron. Imaging 31(1), 013009 (2022). https://doi.org/10.1117/1.JEI.31.1.013009
7. K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Adv. in Neural Inf. Process. Syst. (2014).
8. C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 1933–1941 (2016). https://doi.org/10.1109/CVPR.2016.213
9. K. He et al., "Deep residual learning for image recognition," in IEEE Conf. Comput. Vis. and Pattern Recognit., 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
10. C. Feichtenhofer et al., "SlowFast networks for video recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 6202–6211 (2019). https://doi.org/10.1109/ICCV.2019.00630
11. S. Ji et al., "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013). https://doi.org/10.1109/TPAMI.2012.59
12. D. Tran et al., "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis., 4489–4497 (2015). https://doi.org/10.1109/ICCV.2015.510
13. Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in Proc. IEEE Int. Conf. Comput. Vis., 5533–5541 (2017). https://doi.org/10.1109/ICCV.2017.590
14. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
15. J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 2625–2634 (2015). https://doi.org/10.1109/CVPR.2015.7298878
16. A. Dosovitskiy et al., "An image is worth 16x16 words: transformers for image recognition at scale," (2020).
17. H. Chefer, S. Gur, and L. Wolf, "Transformer interpretability beyond attention visualization," in Proc. IEEE/CVF Conf. Comput. Vis. and Pattern Recognit., 782–791 (2021).
18. Z. Dai et al., "UP-DETR: unsupervised pre-training for object detection with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. and Pattern Recognit., 1601–1610 (2021).
19. H. Chefer, S. Gur, and L. Wolf, "Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers," (2021).
20. W. Wang et al., "Pyramid vision transformer: a versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 568–578 (2021). https://doi.org/10.1109/ICCV48922.2021.00061
21. B. Heo et al., "Rethinking spatial dimensions of vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 11936–11945 (2021).
22. S. He et al., "TransReID: transformer-based object re-identification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 15013–15022 (2021). https://doi.org/10.1109/ICCV48922.2021.01474
23. J. Lin, C. Gan, and S. Han, "TSM: temporal shift module for efficient video understanding," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 7083–7093 (2019).
24. X. Ding et al., "RepVGG: making VGG-style ConvNets great again," in Proc. IEEE/CVF Conf. Comput. Vis. and Pattern Recognit., 13733–13742 (2021).
25. Q. Liang et al., "Violence behavior recognition of two-cascade temporal shift module with attention mechanism," J. Electron. Imaging 30(4), 043009 (2021). https://doi.org/10.1117/1.JEI.30.4.043009
26. M. Cheng, K. Cai, and M. Li, "RWF-2000: an open large scale video database for violence detection," in 2020 25th Int. Conf. Pattern Recognit., 4183–4190 (2021). https://doi.org/10.1109/ICPR48806.2021.9412502
27. J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 6299–6308 (2017). https://doi.org/10.1109/CVPR.2017.502
28. Y. Meng et al., "AR-Net: adaptive frame resolution for efficient action recognition," Lect. Notes Comput. Sci. 12352, 86–104 (2020). https://doi.org/10.1007/978-3-030-58571-6_6
29. Y. Li et al., "TEA: temporal excitation and aggregation for action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. and Pattern Recognit., 909–918 (2020). https://doi.org/10.1109/CVPR42600.2020.00099
30. L. Wang et al., "Temporal segment networks for action recognition in videos," IEEE Trans. Pattern Anal. Mach. Intell. 41(11), 2740–2755 (2019). https://doi.org/10.1109/TPAMI.2018.2868668

Biography

Youshan Zhang is an MS degree candidate at the Engineering University of PAP (EUPAP). His research interests include behavior recognition.

Yong Li is an associate professor at the EUPAP. His research interests include pattern recognition.

Shaozhe Guo is an MS degree candidate at the Engineering University of PAP (EUPAP). His research interests include object detection.

Qiming Liang is an MS degree candidate at the Engineering University of PAP (EUPAP). His research interests include behavior recognition.

CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 International License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Youshan Zhang, Yong Li, Shaozhe Guo, and Qiming Liang "Not all temporal shift modules are profitable," Journal of Electronic Imaging 31(4), 043030 (5 August 2022). https://doi.org/10.1117/1.JEI.31.4.043030
Received: 23 February 2022; Accepted: 21 July 2022; Published: 5 August 2022