Reversible information hiding technology can embed secret or sensitive information in the redundant information of a carrier image and completely restore the original image at the receiving end. Currently, the difference-histogram algorithm appears to be the most attractive approach to reversible information hiding. However, this technique cannot adequately balance embedding capacity and security. To further improve both, this paper proposes a large-capacity reversible information hiding algorithm based on multiple difference histograms and Gray code. First, the original image is divided into blocks of equal size. Then the blocked image is scrambled with Gray code to improve the system's security. Thereafter, a difference histogram is established for each block, and the zero bin to the right of the peak is selected as the embedding position. Finally, the secret information is embedded. Experimental results show that the proposed algorithm significantly improves the embedding capacity of the carrier image while ensuring the security of both the carrier image and the secret information.
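As a rough illustration of the underlying idea, here is a minimal single-histogram shifting sketch in Python (not the paper's multi-histogram, Gray-code variant; the function name is ours and overflow handling at the value boundaries is omitted):

```python
import numpy as np

def embed_diff_histogram(block, bits, peak):
    """Minimal difference-histogram shifting on one image block.

    Differences equal to `peak` each carry one payload bit;
    differences to the right of the peak are shifted by 1 to open
    the zero bin, which keeps the process invertible.
    (Boundary/overflow handling is omitted in this sketch.)
    """
    flat = block.astype(np.int32).ravel()
    diff = np.diff(flat)                  # adjacent pixel differences
    out = diff.copy()
    it = iter(bits)
    for i, d in enumerate(diff):
        if d == peak:
            out[i] = d + next(it, 0)      # embed a 0/1 at the peak bin
        elif d > peak:
            out[i] = d + 1                # shift to make room
    rec = np.concatenate(([flat[0]], flat[0] + np.cumsum(out)))
    return rec.reshape(block.shape)
```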
As the trustworthiness of multimedia data is increasingly challenged by editing tools, image forgery localization aims to identify the regions in images that have been modified. Although existing techniques provide reasonably good results, they must be retrained as new editing techniques emerge and depend heavily on ground-truth tampering localization maps. In this paper, we propose an attention-based fusion network that combines the RGB image and the noise residual, yielding excellent results. The noise residual is commonly regarded as a camera-model fingerprint, so forgeries can be detected as deviations from its expected regular pattern. The model consists of three parts: feature extraction, attentional feature fusion, and feature output. The feature extraction module extracts RGB image features and noise residuals separately, and the attentional feature fusion module combines them to suppress high-frequency components and to supplement and enhance model-related artifacts. Finally, the last module generates a one-channel image as the camera-model fingerprint. To avoid dependence on tampering localization maps, the model is trained on pairs of image patches coming from the same or different camera sensors by means of a Siamese network. Experimental results on several datasets show that the proposed technique successfully identifies modified regions, improves the quality of camera-model fingerprints, and achieves significantly better performance than existing techniques.
Accurate crowd counting in congested scenes remains challenging due to the trade-off between efficiency and generalization. To address this issue, we propose a mobile-friendly solution for network deployment in scenarios demanding high response speed. To bring the potential of global crowd representations to lightweight counting models, this work proposes a novel mobile vision transformer architecture for crowd counting (CCMTNet), which aims to improve model efficiency and universality in real-time crowd counting tasks on resource-constrained computing devices. The framework, a linear CNN structure interleaved with self-attention blocks, endows the model with both local feature extraction and global high-dimensional crowd information processing at low computational cost. In addition, several experimental networks of different scales based on the proposed architecture are comprehensively evaluated to balance accuracy loss against reduced computing cost. Extensive experiments on three mainstream crowd counting datasets demonstrate the effectiveness of the proposed network. In particular, CCMTNet reconciles counting accuracy and efficiency in comparison with traditional lightweight CNN networks.
Makeup transfer aims to extract a specific makeup style from one face and transfer it to another, with wide applications in portrait beautification and cosmetics marketing. Existing methods can transfer entire facial makeup, but transfer quality suffers when the two images are mismatched. In this paper, we propose a facial makeup transfer network based on the Laplacian pyramid, which better preserves the facial structure of the source image and achieves high-quality transfer results. The model consists of three parts: makeup feature extraction, facial structure feature extraction, and makeup fusion. The makeup extraction part extracts the facial makeup from the reference image. The facial structure feature extraction part extracts the facial structure from the source image; to avoid the loss of facial detail when extracting these features, we use a Laplacian pyramid-based method. The makeup fusion part then fuses the facial makeup with the facial structure features. Extensive experiments on the MT dataset show that this method transfers makeup successfully without changing the original facial structure and achieves strong performance across various makeup styles.
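For reference, a minimal Laplacian pyramid decomposition and reconstruction (the standard construction using OpenCV, not the paper's network; the level count is an arbitrary choice):

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Build a Laplacian pyramid: high-frequency bands preserve
    fine facial detail, the coarse residual carries structure."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)          # band-pass detail at this level
        cur = down
    pyr.append(cur)                   # low-frequency residual
    return pyr

def reconstruct(pyr):
    """Invert the decomposition by upsampling and adding bands back."""
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(band.shape[1], band.shape[0])) + band
    return cur
```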
In recent years, great progress has been made in crowd counting. Although recently proposed crowd counting networks achieve satisfactory results on a variety of problems, differences in crowd density and scale within the same scene still degrade overall counting performance. To deal with this problem, we propose a Multi-Scale Attention Grading Crowd Counting Network (MSAGNet), which attends to different crowd densities in the scene through an attention mechanism and fuses multi-scale information to reduce scale differences. Specifically, the grading attention feature obtaining module attends to regions of different density and adaptively assigns corresponding weights to them: dense regions receive larger weights, letting the model focus on those parts and making training on those regions more accurate and effective. In addition, the multi-scale density feature fusion module fuses the feature maps containing density information to generate the final feature maps, which contain attention information at different scales and are mapped to the estimated density maps. This method can focus on regions of different density within the same scene while simultaneously fusing multi-scale information and attention weights, effectively addressing the difficulty of counting dense regions. Extensive experiments on existing crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF-QNRF) show that our method effectively improves counting performance.
Crowd counting has been a popular research topic in computer vision due to the variation of human head scales and the interference of background noise. Some existing methods use multi-level feature fusion to handle scale variation, but background noise interference can become more serious because shallow features are involved in the fusion process. In this paper, we propose a Multilevel Information Sharing Network based on Residual Attention (RA-MISNet) to solve this problem. The RA-MISNet consists of a feature extraction component, an information sharing module, and a residual attention density map estimator. Beyond addressing the multi-scale problem, the residual attention mechanism refines the crowd distribution information in the shared features at all levels, reducing the interference of complex textured backgrounds on density map regression. Furthermore, to cope with the severe label noise in high-density crowd areas, we design a Regional Multi-level Segmentation Loss (RMS Loss) that divides a single crowd image into multi-level density regions with different label noise rates and applies supervision constraints of corresponding granularity to each density level. Extensive experiments on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF) demonstrate the effectiveness and superiority of the proposed methods.
Pneumonia, an infectious disease that affects the lungs, is a serious medical concern, so correctly classifying pneumonia images is very important. The limitations of traditional machine learning algorithms and significant improvements in computing performance have made deep learning widely used, and convolutional neural networks remain the mainstream method for classifying pneumonia. This paper presents a modified capsule network that detects and classifies pneumonia from chest X-ray images. The model consists of two parts: an encoder and a decoder. The encoder contains a convolutional layer, a primary capsule layer, and a digit capsule layer; the two capsule layers convert scalars into vectors and then cluster vectors of the same category by dynamic routing. The decoder contains deconvolutional layers: the vector produced by the encoder is up-sampled to reconstruct the image, and the reconstruction is compared with the original image to make the features extracted by the encoder more representative. Training and testing are performed on the dataset "Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification," which contains 5,856 images; we split the images into training and testing sets at a ratio of 8:2. The accuracy on this dataset is 98.6%. The model has a simpler structure and fewer parameters than other popular models, so it can be more easily deployed under various practical conditions.
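As a sketch of the dynamic routing step mentioned above, here is the standard routing-by-agreement procedure in NumPy (shapes and the iteration count are illustrative assumptions, not taken from this paper):

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """CapsNet squashing: keep direction, map length into [0, 1)."""
    n2 = np.sum(v * v, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement between primary and digit capsules.

    u_hat: (n_primary, n_digit, dim) prediction vectors.
    """
    b = np.zeros(u_hat.shape[:2])                  # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling
        s = (c[..., None] * u_hat).sum(axis=0)     # weighted sum per digit
        v = squash(s)                              # digit capsule outputs
        b += (u_hat * v[None]).sum(-1)             # agreement update
    return v
```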
Nowadays, owing to challenges such as large-scale variation of crowds, mutual occlusion, and perspective distortion, crowd counting has gradually become a hot topic in computer vision. To address the large scale variation present in images, we propose in this paper a novel multi-scale network called MSNet, which aims to handle continuous scale variation and count pedestrians accurately. While most state-of-the-art multi-scale and multi-column networks integrate scale information of heads of different sizes, much research remains to be done to handle continuous variation. In MSNet, specifically, the first ten layers of the Visual Geometry Group network (VGG) serve as the backbone to extract coarse image features, and a multi-scale block containing several receptive kernels maintains scale information for better performance against scale variation. Inspired by the observation that replacing a single large receptive field kernel with multiple small ones yields better performance, we use two dilated convolutions with a receptive field of 5 in place of the large kernel. MSNet incurs only a moderate increase in computation, and we evaluate it on three benchmark datasets, ShanghaiTech (Part A: MAE=59.6, RMSE=96.1; Part B: MAE=7.5, RMSE=12.1), UCF-CC-50 (MAE=207.9, RMSE=273.8), and UCF-QNRF (MAE=93, RMSE=158), demonstrating that it outperforms existing methods.
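The receptive-field arithmetic behind this substitution follows the standard dilated-convolution formula (a worked check of ours, not taken from the paper):

$$k_{\text{eff}} = k + (k - 1)(d - 1), \qquad k = 3,\ d = 2 \;\Rightarrow\; k_{\text{eff}} = 3 + (3 - 1)(2 - 1) = 5.$$

Stacking two such layers extends the receptive field to $5 + (5 - 1) = 9$, so the pair covers the same area as a single 9×9 kernel while using only $2 \times 3^2 = 18$ weights per channel pair instead of $81$.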
This paper compares four top crowd counting models and evaluates their highlights based on their performance. In DSNet, a dilated convolution block network was proposed, in which the dilated layers are densely connected to each other to preserve information from continuously varying scales. Three blocks are cascaded and linked by dense residual connections to widen the range of scales covered by the network, and a novel multi-scale density level consistency loss was introduced to improve performance. In SFANet, two main components were suggested: a VGG backbone CNN as the front-end feature extractor and a dual-path multi-scale fusion network as the back end to produce the density map, in which one path highlights the crowded regions present in the images and the other fuses multi-scale features to generate the final high-quality density maps. In MANet (Multi-scale Attention Network), a new soft attention mechanism that learns a set of masks was presented, and a level-aware loss was introduced to regularize and direct the learning of the different branches to specialize on specific scales. In Bayesian Loss, a novel loss function was used to construct a density contribution model from the point annotations. We also analyze the results of the four convolutional neural networks, extract patterns of network structure, and identify promising directions for researchers in this fast-growing area.
Crowd counting is an important part of crowd analysis and is of great significance to crowd control and management. Convolutional neural network (CNN) based crowd counting methods are widely used to address the insufficient counting accuracy caused by heavy occlusion, background clutter, head scale variation, and perspective changes in crowd scenes. The multi-column convolutional neural network (MCNN) is a CNN-based crowd counting method that adapts to head scale variation by combining three single-column networks whose convolution kernels have different sizes (large, medium, and small). However, because MCNN is relatively shallow, its receptive field is limited, which hurts its adaptability to large scale variations. In addition, due to insufficient training data, it requires a cumbersome pre-training strategy in which each single-column network is pre-trained individually before being combined. In this paper, a crowd counting method based on a multi-column dilated convolutional neural network is proposed. Dilated convolution enlarges the receptive field of the network, making it better adapted to head scale variations. Image patches are obtained by randomly cropping the original training images during each training iteration, which further expands the training data and removes the need for tedious pre-training. Experimental results on the public ShanghaiTech dataset show that the counting accuracy of the proposed method is better than that of MCNN, demonstrating that the method is more robust to head scale variations in crowd scenes.
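A minimal sketch of the random-cropping augmentation described above (generic Python/NumPy; the patch size and the paired density-map crop are our assumptions):

```python
import numpy as np

def random_patch(image, density_map, size=224):
    """Randomly crop a training patch (with its density map) on each
    iteration, expanding the effective training set on the fly."""
    h, w = image.shape[:2]
    y = np.random.randint(0, h - size + 1)
    x = np.random.randint(0, w - size + 1)
    return (image[y:y + size, x:x + size],
            density_map[y:y + size, x:x + size])
```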
Speech recognition has always been one of the research focuses in human-computer communication and interaction. The main purpose of automatic speech recognition (ASR) is to convert speech waveform signals into text. The acoustic model is the main component of an ASR system; it connects the observed features of the speech signal with the speech modeling units. In recent years, deep learning has become the mainstream technology in speech recognition. In this paper, a convolutional neural network architecture composed of VGG and the Connectionist Temporal Classification (CTC) loss function is proposed for the acoustic model. Traditional acoustic model training is based on frame-level labels with the cross-entropy criterion, which requires a tedious label alignment procedure. The CTC loss is adopted to automatically learn the alignments between speech frames and label sequences, making training end-to-end. The architecture exploits the temporal and spectral structures of speech signals simultaneously. Batch normalization (BN) is used to normalize each layer's input and reduce internal covariate shift, and dropout is used during training to prevent overfitting and improve the network's generalization ability. The speech signal is transformed into a spectral image through a series of processing steps to serve as the network input. The input features have 200 dimensions, and the acoustic model outputs 415 Chinese pronunciation units without tone. Experimental results show that the proposed model achieves character error rates (CER) of 17.97% and 23.86% on the public Mandarin speech corpora AISHELL-1 and ST-CMDS-20170001_1, respectively.
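A minimal sketch of end-to-end training with CTC, using PyTorch's built-in nn.CTCLoss rather than the paper's implementation; all sizes are illustrative, and the 416 classes reflect our reading of 415 pronunciation labels plus one CTC blank:

```python
import torch
import torch.nn as nn

# 415 pronunciation labels plus one CTC blank (an assumption of this sketch).
T, N, C = 100, 8, 416                  # frames, batch size, classes
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # label ids, 0 = blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                        # gradients flow without frame alignment
```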
Since accurate early detection of malignant lung nodules can greatly improve patient survival, detecting early-stage lung cancer in chest computed tomography (CT) scans has been a major problem for the last couple of decades, making automated lung cancer detection techniques important. However, accurately detecting lung cancer at an early stage remains a significant challenge because of the substantial structural similarity between benign and malignant lung nodules; the major task is to reduce false positive and false negative results. Recent advances in convolutional neural network (CNN) models have improved image detection and classification for many tasks. In this study, we present a deep learning-based framework for automated lung cancer detection. The proposed framework works in multiple stages on 3D lung CT scans to detect nodules and determine their malignancy. Considering the 3D nature of lung CT data and the compactness of the mixed link network (MixNet), a deep 3D Faster R-CNN and a U-Net encoder-decoder with MixNet were designed to detect lung nodules and learn their features, respectively. For nodule classification, a gradient boosting machine (GBM) with 3D MixNet was proposed. The system was tested against manually drawn radiologist contours on 1,200 images from LIDC-IDRI containing 3,250 nodules, with equal numbers of benign and malignant nodules, using statistical measures. The proposed system achieved a sensitivity of 94%, a specificity of 90%, and an area under the receiver operating characteristic curve of 0.99, outperforming existing methods.
An image encryption method combining a chaotic map and the Arnold transform in the gyrator transform domain is proposed. First, the original secret image is XOR-ed with a random binary sequence generated by a logistic map. Then the gyrator transform is performed. Finally, the amplitude and phase of the gyrator transform are permuted by the Arnold transform. The decryption procedure is the inverse of encryption. The secret keys of the proposed method include the control parameter and initial value of the logistic map, the rotation angle of the gyrator transform, and the transform number of the Arnold transform. Therefore, the key space is large while the key data volume is small. Numerical simulations demonstrate the effectiveness of the proposed method, and the security analysis covers the histogram of the encrypted image, sensitivity to the secret keys, decryption under ciphertext loss, and resistance to the chosen-plaintext attack.
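Two of the building blocks above are easy to sketch: a minimal logistic-map keystream XOR and one round of the Arnold permutation in NumPy (a byte-valued keystream and a square grayscale image are our simplifying assumptions; the gyrator transform itself is omitted):

```python
import numpy as np

def logistic_keystream(mu, x0, n):
    """Keystream from the logistic map x_{k+1} = mu * x_k * (1 - x_k);
    the control parameter mu and initial value x0 are the secret keys."""
    x, ks = x0, np.empty(n, dtype=np.uint8)
    for i in range(n):
        x = mu * x * (1.0 - x)
        ks[i] = int(x * 256) % 256
    return ks

def arnold_round(img):
    """One round of the Arnold map on an N x N array:
    (x, y) -> (x + y, x + 2y) mod N."""
    n = img.shape[0]
    x, y = np.indices((n, n))
    out = np.empty_like(img)
    out[(x + y) % n, (x + 2 * y) % n] = img
    return out

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cipher = img ^ logistic_keystream(3.99, 0.3141, img.size).reshape(img.shape)
scrambled = arnold_round(cipher)
```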
For gyrator transform-based image encryption, besides the random operations, the rotation angles used in the gyrator transforms are also taken as secret keys, which makes such cryptosystems more secure. To analyze the security of such cryptosystems, one may start by analyzing the security of a single gyrator transform. In this paper, the security of gyrator transform-based image encryption under chosen-plaintext attack is discussed theoretically. Using impulse functions as the chosen plaintexts, we conclude: (1) for a single gyrator transform, the rotation angle can be obtained easily and efficiently with a chosen plaintext; (2) for image encryption with a single random phase encoding and a single gyrator transform, it is hard to find the rotation angle directly with a chosen-plaintext attack; however, if the value of one element of the random phase mask is known, the rotation angle can be obtained very easily with a chosen-plaintext attack, and the random phase mask can also be recovered. Furthermore, by exhaustively searching the value of one element of the random phase mask, the rotation angle as well as the random phase mask may be recovered. The derived relationship between the rotation angle and the random phase mask for encryption with a single random phase encoding and a single gyrator transform may be useful for further study of the security of iterative random operations in the gyrator transform domains.
An image hiding method based on the cascaded iterative Fourier transform (CIFT) algorithm and a public-key encryption algorithm is proposed. First, the original secret image is encrypted into two phase-only masks M1 and M2 via the CIFT algorithm. Then the public-key encryption algorithm RSA is used to encrypt M2 into M2'. Finally, a host image is enlarged by extending each pixel into 2×2 pixels, and each element of M1 and M2' is multiplied by a superimposition coefficient and added to or subtracted from two different elements within the corresponding 2×2 block of the enlarged host image. To recover the secret image from the stego-image, the two masks are extracted from the stego-image without the original host image. The public-key algorithm facilitates key distribution, and compared with image hiding based on optical interference, the proposed method can achieve higher robustness by exploiting the characteristics of the CIFT algorithm. Computer simulations show that the method has good robustness against image processing.
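A hedged sketch of the 2×2 pixel-expansion embedding and blind extraction steps (real-valued stand-ins for the phase-only masks, and one specific assignment of the add/subtract positions, which the abstract leaves open):

```python
import numpy as np

def embed(host, m1, m2p, alpha=0.1):
    """Expand each host pixel into a 2x2 block, then superimpose the
    two masks with weight alpha (added in one position, subtracted in
    another); m1 and m2p must have the same shape as host."""
    big = np.kron(host.astype(np.float64), np.ones((2, 2)))
    big[0::2, 0::2] += alpha * m1
    big[0::2, 1::2] -= alpha * m1
    big[1::2, 0::2] += alpha * m2p
    big[1::2, 1::2] -= alpha * m2p
    return big

def extract(stego, alpha=0.1):
    """Recover the masks from pairwise differences; the host pixel is
    identical within each 2x2 block and cancels out, so extraction is
    blind (no original host image needed)."""
    m1 = (stego[0::2, 0::2] - stego[0::2, 1::2]) / (2 * alpha)
    m2p = (stego[1::2, 0::2] - stego[1::2, 1::2]) / (2 * alpha)
    return m1, m2p
```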
In this paper, a double-random phase-encoding based image hiding method is employed to encrypt and hide text. The ASCII codes of the secret text are represented in binary and then transformed into a two-dimensional array in the form of an image. Each element of the transformed array has a value between 0 and 255: the highest 2 bits or the highest 4 bits store the binary bits of the text, while the lower bits are filled with padding bits. The double-random phase-encoding method then encodes the transformed array, and the encoded array is hidden in an expanded cover image to achieve text information hiding. Experimental results show that the secret text can be recovered with accuracies of 100% and 99.89% when the text bits are stored in the highest 2 bits and the highest 4 bits of the transformed array, respectively. By employing optical information processing, the proposed method improves the security of text transmission while maintaining a high hiding capacity.
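A minimal sketch of the bit-packing step (generic Python; zero-filling the lower bits is our assumption, since the abstract does not specify the padding content):

```python
import numpy as np

def pack_text(text, bits_per_elem=2):
    """Pack ASCII text bits into the highest bits of 8-bit array
    elements; the lower bits are zero-filled here."""
    bits = []
    for byte in text.encode("ascii"):
        bits.extend((byte >> i) & 1 for i in range(7, -1, -1))
    while len(bits) % bits_per_elem:              # pad the bitstream
        bits.append(0)
    elems = []
    for i in range(0, len(bits), bits_per_elem):
        chunk = 0
        for b in bits[i:i + bits_per_elem]:
            chunk = (chunk << 1) | b
        elems.append(chunk << (8 - bits_per_elem))  # shift into top bits
    return np.array(elems, dtype=np.uint8)

packed = pack_text("secret")   # 6 chars * 8 bits / 2 bits = 24 elements
```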
Binary phase-only filter (BPOF) based watermarking for image authentication was proposed earlier [5] and shows good performance. In this paper, three image self-authentication algorithms based on the BPOF, with watermark embedding methods different from those of Reference 5, are proposed. The BPOF of an image is used as the watermark and embedded into the image's Fourier spectrum, either by adding to the magnitude or by quantizing the magnitude of the Fourier spectrum. For authentication, depending on the embedding method used, either the correlation between the BPOF of the test image and its Fourier magnitude spectrum is computed, or the correlation between the phase information of the test image and the candidate watermark extracted from it is calculated. These embedding methods expand the attainable watermark embedding strength, which is very important for the security and robustness of digital watermarking. The performance of the three algorithms is evaluated via computer simulation.
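For concreteness, one common way to compute a BPOF and a normalized correlation score (a sketch under the convention of binarizing on the sign of the imaginary part of the spectrum; other conventions exist, and the paper's exact choice is not specified here):

```python
import numpy as np

def bpof(img):
    """Binary phase-only filter: binarize the Fourier phase,
    here +1 where Im(F) >= 0 and -1 elsewhere (one common convention)."""
    F = np.fft.fft2(img.astype(np.float64))
    return np.where(np.imag(F) >= 0, 1.0, -1.0)

def ncorr(a, b):
    """Normalized correlation usable as an authentication score."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```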