With the explosive growth in the number of online videos, video retrieval is becoming increasingly difficult. Video-text retrieval based on multi-modal visual and language understanding is one of the mainstream frameworks for solving this problem, and MMT (Multi-modal Transformer) is a representative model. On the language side, BERT (Bidirectional Encoder Representations from Transformers) is used to encode text into semantic embeddings, and the pre-trained BERT is fine-tuned during training. However, there is a mismatch at this stage: BERT is pre-trained on NSP (Next Sentence Prediction) and MLM (Masked Language Modeling), tasks that are only weakly correlated with video retrieval. On the visual side, a Transformer is used to aggregate the multi-modal expert features of videos, but we find that the output of this visual Transformer is not fully utilized. In this paper, a Sentence-BERT model is introduced to replace the BERT model in MMT and improve the quality of the sentence embeddings. In addition, a max-pooling layer is applied after the Transformer to make fuller use of the model's output. Experimental results show that the proposed model outperforms MMT.
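The max-pooling aggregation described above can be sketched as follows; this is a minimal illustration assuming the visual Transformer emits one embedding per expert token (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def max_pool_aggregate(transformer_outputs):
    """Aggregate a sequence of Transformer output embeddings into a
    single video-level embedding by element-wise max-pooling.

    transformer_outputs: array of shape (seq_len, dim), one row per
    expert token produced by the visual Transformer (illustrative).
    """
    return np.max(transformer_outputs, axis=0)

# Toy example: three output tokens of dimension 4.
outputs = np.array([[0.1, 0.9, -0.2, 0.4],
                    [0.5, 0.3,  0.7, 0.0],
                    [0.2, 0.8,  0.1, 0.6]])
video_embedding = max_pool_aggregate(outputs)  # -> [0.5, 0.9, 0.7, 0.6]
```

Compared with using only a single token's output, pooling over all positions lets every token contribute to the final embedding.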
Virtual reality 360-degree video is now widely used in many fields. However, the huge amount of data caused by its high resolution increases coding complexity, and existing video coding frameworks such as HEVC are not optimized for the characteristics of virtual reality video. To accelerate the intra-mode decision process in video coding, a new algorithm is proposed in this paper. Each virtual reality 360-degree video frame is divided into two regions: the pole region and the equatorial region. In the pole region, the number of candidate modes is reduced in advance according to the statistical distribution of the modes; in the equatorial region, the intra mode is determined by a CART decision tree using the texture contrast of the coding unit. Experimental results show that, compared with the original HM16.20, the proposed algorithm saves 28.3% of coding time while increasing the BD-rate by only 0.612%.
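The first step of the algorithm, splitting each frame into pole and equatorial regions, can be sketched as below. This is a hedged illustration: the `pole_fraction` threshold and the row-based classification are assumptions for demonstration, not the paper's exact boundary.

```python
def classify_region(ctu_row, frame_height_ctus, pole_fraction=0.25):
    """Classify a CTU row of an equirectangular 360-degree frame as
    'pole' or 'equator'. pole_fraction is an illustrative parameter:
    here the top and bottom quarters of the frame count as poles."""
    boundary = int(frame_height_ctus * pole_fraction)
    if ctu_row < boundary or ctu_row >= frame_height_ctus - boundary:
        return 'pole'
    return 'equator'

# A frame 16 CTU rows tall: rows 0-3 and 12-15 are pole regions.
print(classify_region(0, 16))   # pole
print(classify_region(8, 16))   # equator
```

Mode pruning is then applied only to CTUs classified as pole, where equirectangular stretching makes the mode distribution highly concentrated.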
Burgeoning virtual reality technology allows people to experience video content authentically. To provide an immersive experience, virtual reality video requires higher resolution (4K and above) and more data than traditional video, which increases coding complexity dramatically. In this paper, a fast intra-mode decision algorithm based on the sum of region-directional dispersion is proposed. Based on the varying degree of horizontal stretching at different latitudes, the relationship between the optimal intra prediction mode of a coding unit and its latitude is obtained. A new metric, the sum of region-directional dispersion, is defined to determine the mode selection interval; the prediction mode with the minimum sum of dispersion is used as the center of the interval, and the optimal intra prediction mode is obtained by comparing it with the adjacent modes in the interval. Compared with the original reference software HM16.20, the proposed algorithm reduces coding time by 28.5% while the BD-rate increases by only 1.0%.
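The interval-selection step can be sketched as follows. This is only one plausible reading of the abstract: the dispersion values are assumed to be precomputed per angular mode, and `interval_radius` is an illustrative parameter (the paper defines the actual metric and interval width).

```python
def select_mode_interval(dispersion_per_mode, interval_radius=2):
    """Pick the center mode as the one with the minimum sum of
    region-directional dispersion, then return the interval of
    adjacent angular modes to be tested by rate-distortion cost.
    HEVC angular modes range from 2 to 34."""
    center = min(dispersion_per_mode, key=dispersion_per_mode.get)
    lo = max(2, center - interval_radius)
    hi = min(34, center + interval_radius)
    return center, list(range(lo, hi + 1))

# Toy dispersion values for three candidate modes.
center, candidates = select_mode_interval({10: 5.0, 11: 2.0, 12: 4.0})
print(center, candidates)  # 11 [9, 10, 11, 12, 13]
```

Only the modes in the returned interval undergo full rate-distortion comparison, which is where the time saving comes from.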
Although the new-generation High Efficiency Video Coding standard (HEVC) remarkably improves compression performance, it increases the computational complexity of coding for Screen Content Coding (SCC). Reducing coding complexity is therefore an important task in HEVC optimization. In this work, the quad-tree recursive partitioning process is studied, the statistical relationship between the coding costs of a parent block and its child blocks is identified, and an improved fast early-termination coding unit partition scheme based on the coding-unit cost is proposed. Experimental results show that the proposed fast algorithm saves 31.37% of encoding time at the cost of reducing coding performance by about 1.02%.
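A cost-based early-termination check of this kind can be sketched as below. The threshold and the exact comparison are assumptions for illustration; the paper derives its criterion from the observed parent/child cost statistics.

```python
def should_terminate_split(parent_cost, child_costs, threshold=1.0):
    """Early-terminate quad-tree CU splitting: once the accumulated
    rate-distortion cost of the already-coded child blocks reaches a
    fraction (threshold) of the parent block's cost, further splitting
    cannot win, so the remaining children are skipped (illustrative
    criterion, not the paper's exact statistical model)."""
    return sum(child_costs) >= threshold * parent_cost

# After coding 3 of 4 children, their cost already exceeds the parent's:
print(should_terminate_split(100.0, [30.0, 40.0, 50.0]))  # True
print(should_terminate_split(100.0, [20.0, 20.0]))        # False
```

Skipping the remaining recursive evaluations whenever the check fires is what yields the reported encoding-time saving.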
Pedestrian detection is an important application in computer vision. Due to uneven illumination, severe occlusion, low-quality images, abnormal posture, and other factors, pedestrian detection suffers from low accuracy in complex scenes. In this paper, a pedestrian detection algorithm based on deep convolutional neural networks is studied. Since shorter connections between input and output layers help build deeper and more efficient CNNs, a densely connected convolution structure is introduced to optimize the Deconvolutional Single Shot Detector (DSSD), improving feature utilization and reducing the number of network parameters. Meanwhile, detection performance for small pedestrians is improved by augmenting the context information. Initial experimental results show that the proposed algorithm reaches 87.84% detection accuracy at 12.3 fps on a low-resolution (64x128) pedestrian dataset, outperforming the reference algorithms.
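The dense connectivity pattern referred to above can be sketched in a few lines. This is a minimal NumPy illustration of the DenseNet-style wiring, not the paper's actual network: each "layer" here is a toy transform, and real convolutions are replaced by a channel average.

```python
import numpy as np

def dense_block(x, layer_fns):
    """Densely connected block: each layer receives the channel-wise
    concatenation of ALL preceding feature maps, so features are
    reused instead of recomputed (layer_fns are illustrative
    per-layer transforms standing in for conv layers)."""
    features = [x]
    for fn in layer_fns:
        out = fn(np.concatenate(features, axis=0))  # concat on channels
        features.append(out)
    return np.concatenate(features, axis=0)

# Toy example: feature maps are (channels, H, W) arrays and each layer
# emits one new channel by averaging its input channels.
layer = lambda t: t.mean(axis=0, keepdims=True)
x = np.ones((2, 4, 4))
y = dense_block(x, [layer, layer, layer])
print(y.shape)  # (5, 4, 4): 2 input channels + 3 grown channels
```

Because later layers see all earlier feature maps directly, each layer can add only a few new channels, which is how dense connectivity cuts parameter count while improving feature reuse.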