An event camera adopts a bio-inspired sensing mechanism that records luminance changes over time. The recorded information, called events, is detected asynchronously at each pixel with microsecond-order timing. Events are quite useful for framerate upsampling of a video, because the information between the low-framerate video frames (key-frames) can be supplemented from the events. We propose an unsupervised method for framerate upsampling from events; our method does not require ground-truth high-framerate videos for pre-training but can be trained solely on the key-frames and events taken from the target scene. We also report promising experimental results with a fast-moving scene captured by a DAVIS346 event camera.
KEYWORDS: Education and training, Quantization, Network architectures, Data compression, Video coding, Neural networks, Visualization, Image compression, 3D acquisition, Video
We propose a data compression method for a light field using a compact and computationally efficient neural representation. We first train a neural network with learnable parameters to reproduce the target light field. We then compress the set of learned parameters as an alternative representation of the light field. Our method is significantly different in concept from the traditional approaches where a light field is encoded as a set of images or a video (as a pseudo-temporal sequence) using off-the-shelf image/video codecs. We experimentally show that our method achieves a promising rate-distortion performance.
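As a rough illustration of this concept (a minimal sketch; the network size, input normalization, and training schedule are assumptions, not the paper's exact design), a light field can be treated as a function from a 4D ray coordinate to a color and overfitted by a small MLP, whose learned parameters then serve as the representation to be compressed:

```python
# Minimal sketch (assumed architecture): fit an MLP L(u, v, x, y) -> RGB to a light field;
# the learned parameters act as the compact representation to be quantized/entropy-coded.
import torch
import torch.nn as nn

class LightFieldMLP(nn.Module):
    def __init__(self, hidden=128, layers=4):
        super().__init__()
        mods, dim = [], 4                          # input: (u, v, x, y) ray coordinates
        for _ in range(layers):
            mods += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        mods += [nn.Linear(dim, 3), nn.Sigmoid()]  # output: RGB in [0, 1]
        self.net = nn.Sequential(*mods)

    def forward(self, rays):                       # rays: (N, 4), normalized to [-1, 1]
        return self.net(rays)

def fit(model, rays, colors, steps=10000, lr=1e-3):
    """Overfit the MLP to the target light field samples (rays -> colors)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(rays), colors)
        loss.backward()
        opt.step()
    return model  # afterwards, model.state_dict() is what gets compressed
```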
We propose a method of synthesizing a dense focal stack and an all-in-focus image from a sparse focal stack that consists of only a few differently-focused images. Our algorithm is implemented using a convolutional neural network (CNN), which is trained to produce a denser focal stack by interpolating the input focal stack, and an all-in-focus image by fusing the dense focal stack. Experimental results show that our method achieves better image quality than the baseline methods.
The efficiency of lossless image coding depends on the pixel predictors, with which unknown pixels are predicted from already-processed pixels. Recent advances in deep learning have brought new tools that can be used for pixel prediction, such as deep convolutional neural networks (CNNs). In this paper, we focus on the processing order of the pixels and propose a new pixel predictor constructed using CNNs. Instead of the conventional scanline order, we design a new processing order in which the pixels are processed in a progressive, parallelizable manner and the reference pixels are located in all directions with respect to a target pixel. Our pixel predictor is implemented using a CNN architecture that was originally developed for image inpainting, the task of filling in missing pixels from known pixels in an image. We compare the performance of our method against the conventional scanline-based CNN in terms of potential coding efficiency and computational cost.
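One concrete way to realize such a progressive, parallelizable order (a hypothetical sketch; the exact ordering used in the paper may differ) is to visit pixels coarse-to-fine on successively halved grids, so that every later pixel has already-processed references on all sides:

```python
# Hypothetical sketch of a progressive (coarse-to-fine) processing order.
# Pixels on coarser grids are processed first, so later pixels have references in all directions.
import numpy as np

def progressive_order(height, width, levels=4):
    """Return an (H, W) map of processing stages: lower stage = processed earlier."""
    stage = np.full((height, width), levels, dtype=np.int32)
    for lv in range(levels):
        step = 2 ** (levels - lv)          # grid spacing for this level
        stage[::step, ::step] = np.minimum(stage[::step, ::step], lv)
    return stage

order = progressive_order(8, 8, levels=3)
print(order)  # pixels sharing a stage index can be predicted in parallel from earlier stages
```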
Light field rendering promises to overcome the limitations of stereoscopic representation by allowing a more seamless transition between multiple points of view, thus giving a more faithful representation of 3D scenes. Fuelled by recent innovations in the acquisition and compression of light field contents, there is a clear need for light field displays on which the data can be natively visualised. Assessing the visual quality of light field contents on a native light field display is of extreme importance for the future development of both new rendering methods and new compression solutions. However, the limited availability of light field displays restricts the possibility of using them to carry out subjective tests. Moreover, hardware limitations in prototype models may considerably lessen the perceptual quality of experience when consuming light field contents. In this paper, we compare three different compression approaches for multi-layer displays, through both objective quality metrics and subjective quality assessment. Furthermore, we analyze the results obtained through subjective tests conducted using a prototype multi-layer display and a recently proposed framework for quality assessment of light field contents rendered through a tensor display simulator on 2D screens. Using statistical tools, we assess the correlation between the two settings and draw useful conclusions for the future design of compression solutions and subjective tests for light field contents with multi-layer rendering.
In the field of 3D image processing, much research has been conducted on topics such as multiview image coding and data compression, view interpolation, coded-aperture-based light field acquisition, and light field display signal calculation. The challenge common to these technologies is that they usually require heavy computation due to the large amount of data. In this paper, we report the results of experiments in which we replace these computations with deep neural networks (DNNs) and convolutional neural networks (CNNs). In some cases, DNNs and CNNs show better performance than conventional methods in both quality and calculation speed.
A multi-focused plenoptic camera is a powerful device that can capture a light field (LF), which is interpreted as a set of dense multi-view images. The camera has the potential to obtain LFs with high spatial/view resolutions and a deep depth of field. To extract multi-view images, we need a sophisticated rendering process due to the complicated optical system of such cameras. However, there is little research on this topic, and the only rendering software available, to the best of our knowledge, does not work well for some camera configurations. We therefore propose an improved rendering method and release our rendering software. Our software can extract multi-view images from a multi-focused plenoptic camera with higher quality than the previous software and works for various camera configurations.
Active stereo is one of the means for estimating depth, using a camera and a projector. The projected patterns are classified into two groups: time codes and space codes. In recent years, space-time hybrid codes have been investigated to fully utilize the advantages of both. With such a method, one needs to separate the two codes from the captured images, in which the information of both codes coexists and is mixed. In this paper, we propose to use the color space to construct and extract the information for the time/space codes.
KEYWORDS: Binary data, Image compression, Video, 3D video compression, Video compression, Visualization, Image quality, Video coding, Cameras, 3D displays
We previously proposed an efficient coding scheme for a dense light field, i.e., a set of multi-view images taken with very small viewpoint intervals. The key idea behind our proposal is that a light field is represented using only a set of weighted binary images. This coding scheme is completely different from modern video codecs, but it has some advantages. For example, the decoding process is extremely simple, which leads to a faster and less power-hungry decoder. Moreover, we found that our scheme can achieve rate-distortion performance comparable to that of modern video codecs for datasets with small disparities. However, it is difficult to express regions with larger disparities using only a limited number of common binary images. Therefore, in this paper, we extend our light field coding scheme with disparity compensation and show its effectiveness.
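To illustrate how lightweight the decoder is (a minimal sketch in our own notation; the array shapes are assumptions), each view is reconstructed as a weighted sum of the shared binary images:

```python
# Minimal sketch of the decoder: a view is a weighted sum of shared binary images.
# B: (K, H, W) binary images with values in {0, 1}; w: (K,) per-view weights.
import numpy as np

def decode_view(B, w):
    """Reconstruct one view as sum_k w[k] * B[k]; no transforms or entropy decoding needed."""
    return np.tensordot(w, B, axes=1)        # result shape: (H, W)

# Example with random data:
K, H, W = 8, 4, 4
B = np.random.randint(0, 2, size=(K, H, W)).astype(np.float32)
w = np.random.rand(K).astype(np.float32)
view = decode_view(B, w)
```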
We can generate arbitrarily focused images with arbitrary aperture shapes from a light field, which is usually captured with a specific device such as a light field camera. In contrast, we propose a method that, for the same purpose, needs only a sparse focal stack, i.e., a few differently focused images, which can be easily captured with a conventional camera. Specifically, our method first reconstructs the underlying dense light field from the sparse focal stack and then generates arbitrarily focused images from it.
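Once a dense light field is available, refocused images with a chosen aperture shape can be generated by the standard shift-and-add operation (a generic sketch, not necessarily the exact procedure used here; the integer shifts and grayscale views are simplifications):

```python
# Generic shift-and-add refocusing sketch: average sub-aperture views after shifting them
# in proportion to their (u, v) offset and the chosen focus disparity; the aperture mask
# weights the views to emulate an arbitrary aperture shape.
import numpy as np

def refocus(lf, focus_disparity, aperture_mask):
    """lf: (U, V, H, W) sub-aperture images; aperture_mask: (U, V) weights (aperture shape)."""
    U, V, H, W = lf.shape
    cu, cv = (U - 1) / 2.0, (V - 1) / 2.0
    acc = np.zeros((H, W), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            du = int(round(focus_disparity * (u - cu)))
            dv = int(round(focus_disparity * (v - cv)))
            shifted = np.roll(lf[u, v], shift=(du, dv), axis=(0, 1))
            acc += aperture_mask[u, v] * shifted
    return acc / max(aperture_mask.sum(), 1e-8)
```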
We propose a structure for a layered light-field display composed of high-resolution binary layers and a low-resolution, multibit backlight. This structure aims to increase the upper bound of the spatial frequency while also reducing the total number of bits for the display. The increased layer resolution raises the upper bound of the spatial frequency, meaning that the display can reproduce an object with a large amount of pop-out more clearly than a conventional light-field display can. In contrast, limiting the layers’ transmittance to binary (on/off) levels reduces the total number of bits for the display, thus maintaining the high efficiency of light-field representation. The low-resolution backlight, whose pixels can take multibit values, compensates for the number of intensity levels, which would otherwise be quite limited with only the binary layers. Through analytical and experimental results, we show that a display based on the proposed structure can reproduce a light field with high quality and high efficiency as a result of combining the high-resolution binary layers and the low-resolution backlight.
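For intuition, the image formation behind this structure can be sketched as follows (a simplified, grayscale model with assumed per-direction integer layer shifts; this is not the exact optimization or calibration used in the paper): a ray for a given viewing direction carries the upsampled backlight value attenuated by the binary transmittances of the layers it passes through.

```python
# Simplified sketch of the image formation: for one viewing direction, each emitted ray is
# the (upsampled) multibit backlight intensity attenuated by the binary (0/1) transmittances.
import numpy as np

def render_direction(backlight_lowres, layers, shifts, upscale):
    """backlight_lowres: (h, w) multibit values; layers: list of (H, W) binary {0,1} arrays;
    shifts: assumed per-layer integer pixel shift for this direction; upscale = H // h."""
    backlight = np.kron(backlight_lowres, np.ones((upscale, upscale)))  # nearest upsampling
    out = backlight.astype(np.float64)
    for layer, s in zip(layers, shifts):
        out *= np.roll(layer, shift=s, axis=1)   # binary attenuation along the shifted ray
    return out
```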
KEYWORDS: 3D displays, Photography, Cameras, Transmittance, Data conversion, Visualization, 3D visualizations, Computer graphics, Computer simulations, 3D image processing
Layered light field displays, which consist of a backlight and several light-attenuating layers, have been attracting attention because of their potential to simultaneously support many viewing directions and high resolution for each direction. The transmittances of the layers’ pixels can be controlled individually and are determined inversely from the expected observation for each viewing direction. The expected observations are typically represented as a set of multi-view images. We have developed a simulator of the layered light field display using computer graphics technology and evaluated the quality of displayed images (output quality) using real multi-view images as input. An important finding from this evaluation is that aliasing artifacts are occasionally observed from directions for which no input image is available. To prevent aliasing artifacts, it is necessary to limit the disparities between neighboring input images to within ±1 pixel according to plenoptic sampling theory, which requires very small viewpoint intervals. However, it is not always possible to capture multi-view images dense enough to satisfy this aliasing-free condition. To tackle this problem, we propose to use image-based rendering techniques for synthesizing sufficiently dense virtual multi-view images from the actually photographed images. We demonstrate that with our method, high-quality visualization without aliasing artifacts is possible even when the photographed multi-view images are sparse.
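As a small illustration of the aliasing-free condition stated above (a minimal sketch; how the disparities are estimated is outside the scope of this snippet), one can check whether the captured views are dense enough or virtual view synthesis is needed:

```python
# Simple check of the aliasing-free condition (±1 pixel disparity between neighboring views).
import numpy as np

def needs_view_synthesis(disparity_maps, limit=1.0):
    """disparity_maps: list of (H, W) disparity maps between neighboring captured views."""
    max_disp = max(float(np.abs(d).max()) for d in disparity_maps)
    return max_disp > limit   # True -> synthesize denser virtual views before display
```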
In this paper, we present a free viewpoint video generation system with a billboard representation for soccer games. Free viewpoint video generation is a technology that enables users to watch 3-D objects from their desired viewpoints. A practical implementation of free viewpoint video for sports events is in high demand, but a commercially acceptable system has not yet been developed. The main obstacles are insufficient user-end quality of the synthesized images and highly complex procedures that sometimes require manual operations. In this work, we aim to develop a commercially acceptable free viewpoint video system with a billboard representation. The envisioned scenario is that soccer games played during the day can be broadcast in 3-D in the evening of the same day. Our work is still ongoing, but we have already developed several techniques to support this goal. First, we captured an actual soccer game at an official stadium using 20 full-HD professional cameras. Second, we have implemented several tools for free viewpoint video generation as follows. To facilitate free viewpoint video generation, all cameras must be calibrated; we calibrated them using checkerboard images and feature points on the field (cross points of the soccer field lines). We extract each player region from the captured images manually. The background region is estimated automatically by observing the chrominance changes of each pixel in the temporal domain. Additionally, we have developed a user interface for visualizing free viewpoint video using a graphics library (OpenGL), which is suitable not only for commercial TV sets but also for devices such as smartphones. However, a practical system has not yet been completed, and our study is still ongoing.
Streaming of multi-view and free-viewpoint video is potentially attractive, but due to bandwidth limitations, transmitting all multi-view videos in high resolution may not be feasible. Our goal is to propose a new streaming data format that can be adapted to the limited bandwidth and is capable of free-viewpoint video streaming using multi-view video plus depth (MVD). Given a requested free viewpoint, we use the two closest views and the corresponding depth maps to perform free-viewpoint video synthesis. We propose a data format that consists of all views and corresponding depth maps at a lowered resolution, plus the two views closest to the requested viewpoint at high resolution. When the requested viewpoint changes, the two closest viewpoints also change, but one or both of the views are transmitted only in low resolution during the delay time; therefore, resolution compensation is required. In this paper, we investigate several cases where one or both of the views are transmitted only in low resolution, and we propose an adequate view synthesis method for multi-resolution multi-view video plus depth. Experimental results show that our framework achieves view synthesis quality close to that of high-resolution multi-view video plus depth.
In recent years, ray space (or light field, in other literature) photography has gained great popularity in the areas of computer vision and image processing, and efficient acquisition of a ray space is of great significance in practical applications. In order to handle the huge amount of data in the acquisition process, in this paper we propose a method of compressively sampling and reconstructing a ray space. In our method, a weighting matrix that reflects the amplitude structure of the non-zero coefficients in the 2D-DCT domain is designed and generated using statistics from an available data set. The weighting matrix is integrated into ℓ1-norm optimization to reconstruct the ray space, and we call this method statistically weighted ℓ1-norm optimization. Experimental results show that the proposed method achieves better reconstruction at both low (0.1 of the original sampling rate) and high (0.5 of the original sampling rate) subsampling rates. In addition, the reconstruction time is reduced by 25% compared to plain ℓ1-norm optimization.
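The reconstruction step can be sketched as follows (an illustrative ISTA-style solver on a single 2D slice; the sampling mask, the 2D-DCT basis, and the weight matrix W are placeholders, not necessarily the exact solver used in the paper). The learned weights simply rescale the soft-thresholding applied to each DCT coefficient:

```python
# Illustrative ISTA-style solver for weighted l1 reconstruction:
#   minimize 0.5 * || mask * (idct2(c) - y) ||^2 + lam * || W * c ||_1
# where c are 2D-DCT coefficients and W is the statistically learned weight matrix.
import numpy as np
from scipy.fftpack import dct, idct

def dct2(x):  return dct(dct(x, axis=0, norm='ortho'), axis=1, norm='ortho')
def idct2(c): return idct(idct(c, axis=0, norm='ortho'), axis=1, norm='ortho')

def weighted_ista(y, mask, W, lam=0.05, step=1.0, iters=200):
    """y: zero-filled measurements (H, W); mask: binary sampling mask; W: per-coefficient weights."""
    c = np.zeros_like(y)
    for _ in range(iters):
        residual = mask * (idct2(c) - y)           # gradient of the data-fidelity term
        c = c - step * dct2(residual)
        thresh = lam * step * W                    # statistically weighted soft-thresholding
        c = np.sign(c) * np.maximum(np.abs(c) - thresh, 0.0)
    return idct2(c)
```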
The goal of our research is to develop a real-time free-viewpoint image synthesis system for dynamic scenes using multi-view video cameras. To this end, depth estimation that is efficient and suitable for dynamic scenes is indispensable. A promising solution is view-dependent depth estimation, where per-pixel depth maps are estimated directly for the target views to be synthesized. Such view-dependent methods were successfully adopted in previous works, but their depth estimation quality was limited, especially for textureless objects, resulting in low-quality virtual views. This limitation comes from the fact that their depth estimation depended only on passive approaches such as traditional stereo triangulation. To tackle this problem, we considered using active methods in addition to passive stereo triangulation. Inspired by the success of recent commercial depth cameras, we developed a customized active illumination using a DLP projector. The projector casts spatially incoherent patterns onto the scene and makes textureless regions identifiable from the cameras, so that stereo triangulation among the multi-view cameras can be greatly improved. Moreover, by making the illumination time-varying, we can further stabilize depth estimation by using spatiotemporal matching across the multi-view cameras based on the concept of the spacetime stereo method, and also remove the artificial patterns from the synthesized virtual views by averaging successive time frames. Our system, consisting of 16 video cameras synchronized with the DLP projector, runs in real time (about 10 fps) thanks to our sophisticated GPGPU implementation.
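For intuition about the spatiotemporal matching mentioned above (a hypothetical sketch only, not the actual GPGPU implementation; the window size and stack layout are assumptions), a spacetime-stereo-style cost can aggregate differences over a spatial window extended across several consecutive frames:

```python
# Hypothetical sketch of a spatiotemporal matching cost in the spirit of spacetime stereo:
# sum of absolute differences over a small spatial window extended across T consecutive frames.
# Assumes rectified views and that the window lies fully inside both image stacks.
import numpy as np

def spacetime_sad(left_stack, right_stack, y, x, disparity, win=2):
    """left_stack, right_stack: (T, H, W) grayscale frame stacks; returns the matching cost."""
    lw = left_stack[:, y - win:y + win + 1, x - win:x + win + 1].astype(np.float32)
    rw = right_stack[:, y - win:y + win + 1,
                     x - disparity - win:x - disparity + win + 1].astype(np.float32)
    return float(np.abs(lw - rw).sum())
```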
This paper proposes a method for constructing a reasonably scaled end-to-end free-viewpoint video system that captures multiple view and depth data, reconstructs three-dimensional polygon models of objects, and displays them in virtual 3D CG spaces. The system consists of a desktop PC and four Kinect sensors. First, view plus depth data at four viewpoints are captured by the Kinect sensors simultaneously. Then, the captured data are integrated into point cloud data using the camera parameters. The obtained point cloud data are sampled into volume data consisting of voxels. Since the volume data generated from the point cloud data are sparse, they are densified using a global optimization algorithm. The final step is to reconstruct surfaces on the dense volume data by the discrete marching cubes method. Since the accuracy of the depth maps affects the quality of the 3D polygon model, a simple inpainting method for improving the depth maps is also presented.
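The point-cloud-to-volume sampling step can be illustrated roughly as follows (a simplified sketch; the voxel size, grid origin, and dimensions are assumptions, and the subsequent global optimization and discrete marching cubes stages are omitted):

```python
# Simplified sketch of sampling a merged point cloud into a binary voxel volume.
# The densification (global optimization) and discrete marching cubes stages are omitted.
import numpy as np

def voxelize(points, voxel_size, origin, dims):
    """points: (N, 3) world coordinates; origin: (3,) grid origin; dims: (X, Y, Z) voxel counts."""
    vol = np.zeros(dims, dtype=bool)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)
    idx = idx[inside]
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = True   # sparse occupancy; to be densified later
    return vol
```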
This paper presents a method of virtual view synthesis using view plus depth data from multiple viewpoints. Intuitively, virtual view generation from those data can be easily achieved by simple 3D warping. However, the 3D points reconstructed from those data are isolated, i.e., not connected with each other. Consequently, the images generated by existing methods have many holes, which are very annoying, due to occlusions and the limited sampling density. To tackle this problem, we propose a two-step algorithm as follows. In the first step, the view plus depth data from each viewpoint are 3D-warped to the virtual viewpoint. In this process, we determine which neighboring pixels should be connected or kept isolated. For this determination, we use depth differences among neighboring pixels and SLIC-based superpixel segmentation that considers both color and depth information. Pixel pairs that have small depth differences or reside in the same superpixels are connected, and the polygons enclosed by the connected pixels are inpainted, which greatly reduces the holes. This warping process is performed individually for each viewpoint from which view plus depth data are provided, resulting in several images at the virtual viewpoint warped from different viewpoints. In the second step, we merge those warped images to obtain the final result. Thanks to the data provided from different viewpoints, the final result has less noise and fewer holes compared to the result from a single viewpoint. Experimental results using publicly available view plus depth data are reported to validate our method.
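The connectivity decision in the first step can be illustrated with the depth-difference criterion alone (a simplified sketch; the threshold is an assumed value and the SLIC superpixel test is omitted):

```python
# Simplified sketch of the connectivity decision: neighboring pixels are connected when their
# depth difference is small (the superpixel-based test described above is omitted here).
import numpy as np

def connectivity_maps(depth, threshold=0.02):
    """depth: (H, W) depth map.
    Returns (horiz, vert): horiz[y, x] = True connects (y, x)-(y, x+1);
    vert[y, x] = True connects (y, x)-(y+1, x)."""
    horiz = np.abs(np.diff(depth, axis=1)) < threshold
    vert = np.abs(np.diff(depth, axis=0)) < threshold
    return horiz, vert
```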
Light field cameras are attracting much attention as tools for acquiring 3D information of a scene through a single camera. The main drawback of typical lenslet-based light field cameras is their limited resolution. This limitation comes from the structure in which a microlens array is inserted between the sensor and the main lens. The microlens array projects the 4D light field onto a single 2D image sensor at the sacrifice of resolution; the angular and spatial resolutions trade off under the fixed resolution of the image sensor. This fundamental trade-off remains after the raw light field image is converted to a set of sub-aperture images. The purpose of our study is to estimate a higher-resolution image from the low-resolution sub-aperture images using a super-resolution reconstruction framework. In this reconstruction, the sub-aperture images should be registered as accurately as possible, and this registration is equivalent to depth estimation. Therefore, we propose a method in which super-resolution and depth refinement are performed alternately. Most of our method is implemented with image processing operations. We present several experimental results using a Lytro camera, where we increased the resolution of a sub-aperture image by a factor of three horizontally and vertically. Our method produces clearer images than the original sub-aperture images and than the case without depth refinement.
This paper focuses on a road-to-vehicle visible light communication (VLC) system that uses an LED traffic light as the transmitter and a camera as the receiver. The traffic light is composed of about a hundred LEDs arranged on a two-dimensional plane. In this system, data is sent as two-dimensional brightness patterns by controlling each LED of the traffic light individually, and the patterns are received as images by the camera. A problem arises in that neighboring LEDs are merged in the received image, either because only a few pixels cover the transmitter when the receiver is distant, or because of blurring caused by camera defocus. As a result, the bit error rate (BER) increases due to errors in recognizing the intensities of the LEDs. To solve this problem, we propose a method that estimates the intensities of the LEDs by solving the inverse problem of the channel characteristics from the transmitter to the receiver. The proposed method is evaluated with BER characteristics obtained by computer simulation and experiments. The results show that the proposed method estimates the intensities more accurately than conventional methods, especially when the received image is strongly blurred and the number of pixels is small.
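This inverse-problem formulation can be sketched roughly as follows (illustrative only; the channel matrix H, which would be built from the estimated blur and LED layout, and the regularization are assumptions, not the exact estimator used in the paper):

```python
# Illustrative sketch: if y (observed pixel values) ~= H @ x (LED intensities), with H modelling
# the blur and projection of each LED onto the image, x can be recovered by regularized least squares.
import numpy as np

def estimate_led_intensities(H, y, reg=1e-3):
    """H: (num_pixels, num_leds) assumed channel matrix; y: (num_pixels,) observed intensities."""
    A = H.T @ H + reg * np.eye(H.shape[1])     # Tikhonov regularization for numerical stability
    x = np.linalg.solve(A, H.T @ y)
    return np.clip(x, 0.0, 1.0)                # LED intensities assumed normalized to [0, 1]

# Bits could then be recovered by thresholding, e.g. bits = (x > 0.5).astype(int)
```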
In this paper, we discuss free-viewpoint synthesis with the view-plus-depth format for multiview applications such as 3DTV and Free Viewpoint Television (FTV) [1]. When generating a virtual image, 3D warping is applied using the view and depth of a reference camera. This process has the problem that holes appear in the virtual image. In the conventional method, the holes are dealt with collectively by a median filter. However, holes appear for different reasons, so it is improper to treat them all at once without distinction, as the conventional method does. We analyze the factors and identify two: boundaries between foreground and background, and reduction of resolution. In this paper, we propose a new hole-filling method that considers these factors. In the first step, we classify neighboring pixels into boundary or same-object areas according to the gradient of the depth value. For the boundary case, we hold the pixels and refer to two other real cameras. For the same-object case, we set up sub-pixels between neighboring pixels and warp them if the depth changes gradually or if the virtual viewpoint of the warped image is closer to the object than the original view position, because such pixels are likely to cause holes due to the reduction of resolution. We implement these methods in simulation. As a result, we prevent boundaries in the virtual image from becoming ambiguous and confirm the effectiveness of the proposed method.
KEYWORDS: Cameras, Imaging systems, Video, 3D acquisition, Analog electronics, Control systems, 3D displays, Video coding, 3D image processing, Signal generators
3D TV requires multiple view images, and it is very important to adjust the parameters used for capturing and displaying multiview images, which include the size of the view images, the focal length, and the camera/viewpoint interval. However, these parameters usually vary from system to system, which causes a problem of interconnectivity between capturing and display devices. The Ray-Space method provides one solution to such problems in 3D TV data capturing, transmission, storage, and display. In this paper, we first review the Ray-Space method and describe its relationship with 3D TV. Then, we introduce three types of Ray-Space acquisition systems: a 100-camera system, a space/time-division system, and a portable multi-camera system. We also describe the test data sets provided for the MPEG (Moving Picture Experts Group) Multiview Video Coding and 3D Video activities.
In this paper, we discuss a multiview video and depth coding system for multiview video applications such as 3DTV and Free Viewpoint Television (FTV) [1]. We target an appropriate multiview-plus-depth compression method and investigate the effect on free-view synthesis quality of changing the allocation of transmission rates between the multiview and depth sequences. In the simulations, we employ MVC in parallel to compress the multiview video and depth sequences at different bitrates, and compare the virtual view sequences generated from the decoded data with the original video sequences taken at the same viewpoints. Our experimental results show that the bitrate of the multi-depth stream has less effect on the view synthesis quality than that of the multi-view stream.
Visible Light Communication (VLC) is a wireless communication method using LEDs. LEDs can respond at high speed, and VLC exploits this characteristic. In VLC research, there are mainly two types of receivers: photodiode receivers and high-speed cameras. A photodiode receiver can communicate at high speed and achieves a high transmission rate because of its fast response. A high-speed camera can detect and track the transmitter easily because it is not necessary to move the camera. In this paper, we use a hybrid sensor designed for VLC that has the advantages of both a photodiode and a high-speed camera, that is, a high transmission rate and easy detection of the transmitter. The light-receiving section of the hybrid sensor consists of communication pixels and video pixels, which realizes these advantages. In previous research, this hybrid sensor could communicate in a static environment. In a dynamic environment, however, high-speed tracking of the transmitter is essential for communication. We therefore realize high-speed tracking of the transmitter by using the information from the communication pixels. Experimental results show the possibility of communication in a dynamic environment.