With the new generation of satellite technologies, the archives of remote sensing (RS) images are growing rapidly. To make the intrinsic information of each RS image easily accessible, visual question answering (VQA) has been introduced in RS. VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information. It has been shown that the fusion of the input modalities (i.e., image and text) is crucial for the performance of VQA systems. Most current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning. However, to discover the underlying relation between the image and question modalities, the model is required to learn a joint representation instead of simply combining (e.g., concatenating, adding, or multiplying) the modality-specific representations. To overcome this issue, we propose a multi-modal transformer-based architecture. It consists of three main modules: i) the feature extraction module for extracting modality-specific features; ii) the fusion module, which leverages a user-defined number of multi-modal transformer layers of the VisualBERT model (VB); and iii) the classification module to obtain the answer. In contrast to recently proposed transformer-based models in RS VQA, the presented architecture (called VBFusion) is not limited to specific questions, e.g., questions concerning pre-defined objects. Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are made up of the RGB bands of Sentinel-2 images) demonstrate the effectiveness of VBFusion for VQA tasks in RS. To analyze the importance of using other spectral bands for the description of the complex content of RS images in the framework of VQA, we extend the RSVQAxBEN dataset to include all the Sentinel-2 spectral bands with 10m and 20m spatial resolution. Experimental results show the importance of utilizing these bands to characterize the land-use land-cover classes present in the images. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/multimodal-fusion-transformer-for-vqa-in-rs.
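As a minimal illustration of the difference between combining modality-specific representations and learning a joint one, the sketch below (pure NumPy with random stand-in features; all dimensions and names are hypothetical, and the single unparameterized self-attention layer is only a toy stand-in for the VisualBERT layers) contrasts pooled-feature concatenation with attention over the concatenated token sequences, where image and question tokens can attend to each other:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # shared embedding dimension (hypothetical)
img_tokens = rng.normal(size=(49, d))    # e.g. a 7x7 CNN feature map, flattened
txt_tokens = rng.normal(size=(12, d))    # question word embeddings

# Modality-specific fusion: pool each modality separately, then combine.
pooled = np.concatenate([img_tokens.mean(0), txt_tokens.mean(0)])  # shape (2d,)

# Joint representation: concatenate the token sequences and let self-attention
# mix information *across* modalities (one VisualBERT-style layer, no weights).
tokens = np.concatenate([img_tokens, txt_tokens], axis=0)          # (61, d)
scores = tokens @ tokens.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
fused = attn @ tokens      # every output token attends to both modalities
```

In the pooled case, image and text features never interact before the classifier; in the joint case, every fused token already carries cross-modal context.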
Subsurface tile drainage pipes provide agronomic, economic and environmental benefits. By lowering the water table of wet soils, they improve the aeration of plant roots and ultimately increase the productivity of farmland. However, they also provide an entryway for agrochemicals into subsurface water bodies and increase nutrient loss in soils. For maintenance and infrastructural development, accurate maps of tile drainage pipe locations and drained agricultural land are needed. However, these maps are often outdated or do not exist. Different remote sensing (RS) image processing techniques have been applied over the years with varying degrees of success to overcome these restrictions. Recent developments in deep learning (DL) improve upon the conventional techniques with machine-learning-based segmentation models. In this study, we introduce two DL-based models in the framework of tile drainage pipe detection: i) an improved U-Net architecture; and ii) a Visual Transformer-based encoder-decoder. Experimental results confirm the effectiveness of both models in terms of detection accuracy when compared to a basic U-Net architecture. Our code and models are publicly available at https://git.tu-berlin.de/rsim/drainage-pipes-detection.
This paper presents a novel approach based on the direct use of deep neural networks to approximate wavelet sub-bands for remote sensing (RS) image scene classification in the JPEG 2000 compressed domain. The proposed approach consists of two main steps. The first step aims to approximate the finer level wavelet sub-bands. To this end, we introduce a novel deep neural network approach that utilizes the coarser level binary decoded wavelet sub-bands to approximate the finer level wavelet sub-bands (the image itself) through a series of deconvolutional layers. The second step aims to describe the high-level semantic content of the approximated wavelet sub-bands and to perform scene classification based on the learnt descriptors. This is achieved by: i) a series of convolutional layers for the extraction of descriptors which model the approximated sub-bands; and ii) fully connected layers for the RS image scene classification. Then, we introduce a loss function that allows learning the parameters of both steps in an end-to-end trainable and unified neural network. The proposed approach requires only the coarser level wavelet sub-bands as input and thus minimizes the amount of decompression applied to the compressed RS images. Experimental results show the effectiveness of the proposed approach in terms of classification accuracy and reduced computational time when compared to the conventional use of Convolutional Neural Networks within the JPEG 2000 compressed domain.
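The upsampling at the core of the first step can be sketched as a single stride-2 transposed convolution in NumPy. This is a toy stand-in: the kernel here is random, whereas in the proposed network the deconvolution kernels are learned end-to-end and several such layers are stacked:

```python
import numpy as np

rng = np.random.default_rng(8)
coarse = rng.random((4, 4))        # decoded coarser-level wavelet sub-band
kernel = rng.normal(size=(3, 3))   # one deconvolution kernel (random stand-in)

# Stride-2 transposed convolution: each coarse coefficient "stamps" the kernel
# onto a 2x-upsampled grid; overlapping stamps are summed.
out = np.zeros((4 * 2 + 1, 4 * 2 + 1))
for i in range(4):
    for j in range(4):
        out[2 * i:2 * i + 3, 2 * j:2 * j + 3] += coarse[i, j] * kernel
finer_approx = out[:8, :8]         # crop to the finer sub-band size
```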
This paper presents a novel multi-label active learning (MLAL) technique in the framework of multi-label remote sensing (RS) image scene classification problems. The proposed MLAL technique is developed in the framework of the multi-label SVM classifier (ML-SVM). Unlike the standard AL methods, the proposed MLAL technique redefines active learning by evaluating the informativeness of each image based on its multiple land-cover classes. Accordingly, the proposed MLAL technique is based on the joint evaluation of two criteria for the selection of the most informative images: i) multi-label uncertainty and ii) multi-label diversity. The multi-label uncertainty criterion is associated with the confidence of the multi-label classification algorithm in correctly assigning multi-labels to each image, whereas the multi-label diversity criterion aims at selecting a set of un-annotated images that are as diverse as possible in order to reduce the redundancy among them. In order to evaluate the multi-label uncertainty of each image, we propose a novel multi-label margin sampling strategy that: 1) considers the functional distances of each image to all ML-SVM hyperplanes; and then 2) estimates how many times each image falls inside the margins of the ML-SVMs. If this count is small, the classifiers are confident in correctly classifying the considered image, and vice versa. In order to evaluate the multi-label diversity of each image, we propose a novel clustering-based strategy that clusters all the images inside the margins of the ML-SVMs and avoids selecting the uncertain images from the same clusters. The joint use of the two criteria allows one to enrich the training set of images with multi-labels. Experimental results obtained on a benchmark archive of 2100 images with their multi-labels show the effectiveness of the proposed MLAL method compared to the standard AL methods that neglect the evaluation of uncertainty and diversity on multi-labels.
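The multi-label margin sampling idea can be sketched as follows, using hypothetical decision values for five unlabeled images and four primitive-class SVMs (an image falls inside the margin of classifier c when |f_c(x)| < 1):

```python
import numpy as np

# Hypothetical SVM decision values f_c(x): 5 unlabeled images x 4 classes.
decision = np.array([[ 1.8, -2.1,  0.4, -1.5],
                     [ 0.2, -0.3,  0.9, -0.1],
                     [ 2.5, -1.9, -2.2,  1.7],
                     [-0.8,  1.1, -0.6,  0.5],
                     [ 1.4, -1.2,  2.0, -2.4]])

# Count, per image, how many ML-SVM margins it falls inside.
inside_margin = np.abs(decision) < 1.0
uncertainty = inside_margin.sum(axis=1)

# A higher count means more hyperplanes are unsure about this image.
most_uncertain = np.argsort(-uncertainty)
```

Here the second image falls inside all four margins and would be the first candidate for annotation.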
This paper presents a novel content-based image search and retrieval (CBIR) system that achieves coarse-to-fine remote sensing (RS) image description and retrieval in the JPEG 2000 compressed domain. The proposed system initially: i) decodes the code-streams associated with the coarsest (i.e., the lowest) wavelet resolution, and ii) discards the images most irrelevant to the query image, selected based on the similarities estimated between the coarse-resolution features of the query image and those of the archive images. Then, the code-streams associated with the subsequent resolution of the remaining images in the archive are decoded and the most irrelevant images are again discarded, now considering the features associated with both resolutions. This is achieved by estimating the similarities between the query image and the remaining images, giving higher weights to the features associated with the finer resolution and lower weights to those related to the coarser resolution. To this end, the pyramid match kernel similarity measure is exploited. These processes are iterated until the code-streams associated with the highest wavelet resolution are decoded only for a very small set of images. In this way, the proposed system exploits a multiresolution and hierarchical feature space and accomplishes adaptive RS CBIR with significantly reduced retrieval time. Experimental results obtained on an archive of aerial images confirm the effectiveness of the proposed system in terms of retrieval accuracy and time when compared to standard CBIR systems.
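The coarse-to-fine discarding loop can be sketched as follows (random stand-in features; the `2**level` weights are only a simple placeholder for the pyramid match kernel weighting used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, n_levels, d = 100, 3, 8
# Hypothetical per-resolution features (level 0 = coarsest wavelet level).
feats = rng.normal(size=(n_levels, n_images, d))
query = rng.normal(size=(n_levels, d))
keep = np.arange(n_images)           # all archive images are candidates at first

for level in range(n_levels):
    # Weighted distance over all levels decoded so far: finer levels get
    # exponentially higher weights.
    w = 2.0 ** np.arange(level + 1)
    dists = sum(w[l] * np.linalg.norm(feats[l, keep] - query[l], axis=1)
                for l in range(level + 1))
    keep = keep[np.argsort(dists)[:len(keep) // 2]]   # discard the worst half

# Only this small surviving set would be decoded at the finest resolution.
```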
This paper presents a novel class sensitive hashing technique in the framework of large-scale content-based remote sensing (RS) image retrieval. The proposed technique aims at representing each image with multi-hash codes, each of which corresponds to a primitive (i.e., land cover class) present in the image. To this end, the proposed method consists of a three-step algorithm. The first step is devoted to characterizing each image by primitive class descriptors. These descriptors are obtained through a supervised approach, which initially extracts the image regions and their descriptors that are then associated with the primitives present in the images. This step requires a set of annotated training regions to define the primitive classes. A correspondence between the regions of an image and the primitive classes is built based on the probability of each primitive class being present at each region. All the regions belonging to a specific primitive class with a probability higher than a given threshold are highly representative of that class. Thus, the average value of the descriptors of these regions is used to characterize that primitive. In the second step, the descriptors of the primitive classes are transformed into multi-hash codes to represent each image. This is achieved by adapting the kernel-based supervised locality sensitive hashing method to multi-code hashing problems. The first two steps of the proposed technique, unlike the standard hashing methods, allow one to represent each image by a set of primitive class sensitive descriptors and their hash codes. Then, in the last step, the images in the archive that are most similar to a query image are retrieved based on a multi-hash-code-matching scheme. Experimental results obtained on an archive of aerial images confirm the effectiveness of the proposed technique in terms of retrieval accuracy when compared to the standard hashing methods.
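The multi-hash-code-matching scheme can be sketched as follows. Note the toy random-projection hashing below only stands in for the kernel-based supervised locality sensitive hashing of the paper, and the descriptors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
bits = 16

def hash_codes(descriptors, planes):
    # Toy random-projection hashing (sign of projections onto random planes);
    # the paper instead uses kernel-based supervised LSH adapted to multi-code.
    return (descriptors @ planes > 0).astype(np.uint8)

planes = rng.normal(size=(8, bits))
# Each image: one 8-D descriptor per primitive class it contains (hypothetical).
query_codes = hash_codes(rng.normal(size=(2, 8)), planes)   # 2 primitives
image_codes = hash_codes(rng.normal(size=(3, 8)), planes)   # 3 primitives

# Multi-hash-code matching: each query primitive code is matched to its closest
# code in the archive image; the sum of these minima is the image's distance.
hamming = (query_codes[:, None, :] != image_codes[None, :, :]).sum(-1)
distance = hamming.min(axis=1).sum()
```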
This paper investigates the effectiveness of deep learning (DL) for domain adaptation (DA) problems in the classification of remote sensing images to generate land-cover maps. To this end, we introduce two different DL architectures: 1) the single-stage domain adaptation (SS-DA) architecture; and 2) the hierarchical domain adaptation (H-DA) architecture. Both architectures assume that a reliable training set is available only for one of the images (i.e., the source domain) from a previous analysis, whereas no training set is available for the other image to be classified (i.e., the target domain). To classify the target domain image, the proposed architectures aim to learn a shared feature representation that is invariant across the source and target domains in a completely unsupervised fashion. To this end, both architectures are defined based on stacked denoising auto-encoders (SDAEs) due to their high capability to define high-level feature representations. The SS-DA architecture leads to a common feature space by: 1) initially unifying the samples of the source and target domains; and 2) then feeding them simultaneously into the SDAE. To further increase the robustness of the shared representations, the H-DA architecture employs: 1) two SDAEs for independently learning the high-level representations of the source and target domains; and 2) a consensus SDAE to learn the domain-invariant high-level features. After obtaining the domain-invariant features through the proposed architectures, the classifier is trained with the domain-invariant labeled samples of the source domain, and then the domain-invariant samples of the target domain are classified to generate the related classification map. Experimental results obtained for the classification of very high resolution images confirm the effectiveness of the proposed DL architectures.
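The SS-DA idea can be sketched with a single denoising autoencoder layer instead of a full SDAE (NumPy only; the data, dimensions, learning rate and the tied-weight single-layer design are all simplifying assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(9)
source = rng.random((100, 10))            # labeled source-domain samples
target = rng.random((100, 10)) + 0.3      # shifted, unlabeled target domain

# Unify source and target samples and train one denoising autoencoder on the
# pooled set, so the hidden code becomes a representation shared by both domains.
X = np.vstack([source, target])
W = rng.normal(scale=0.1, size=(10, 6))   # tied encoder/decoder weights
b, c = np.zeros(6), np.zeros(10)
lr = 0.1
for _ in range(200):
    noisy = X + rng.normal(scale=0.1, size=X.shape)   # corrupt the input
    H = np.tanh(noisy @ W + b)                        # shared representation
    R = H @ W.T + c                                   # reconstruction
    G = (R - X) / len(X)                              # gradient of MSE w.r.t. R
    dH = (G @ W) * (1 - H ** 2)                       # backprop through tanh
    W -= lr * (noisy.T @ dH + G.T @ H)                # encoder + decoder grads
    b -= lr * dH.sum(0)
    c -= lr * G.sum(0)

shared = np.tanh(X @ W + b)   # domain-invariant features for both domains
```

A classifier would then be trained on the `shared` rows of the source domain and applied to the `shared` rows of the target domain.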
This paper presents a novel compressed histogram attribute profile (CHAP) for the classification of very high resolution remote sensing images. The CHAP characterizes the marginal local distribution of attribute filter responses to model the texture information of each sample with a small number of image features. This is achieved with a three-step algorithm. The first step is devoted to providing a complete characterization of the spatial properties of objects in a scene. To this end, the attribute profile (AP) is initially built by the sequential application of attribute filters to the considered image. Then, to capture the complete spatial characteristics of the structures in the scene, a local histogram is calculated for each sample of each image in the AP. The local histograms of the same pixel location can contain redundant information since: i) adjacent histogram bins can provide similar information; and ii) attributes obtained with similar attribute filter threshold values lead to redundant features. In the second step, to expose these redundancies, the local histograms of the same pixel locations in the AP are organized into a 2D matrix representation, where columns are associated with the local histograms and rows represent a specific bin across all histograms of the considered sequence of attribute-filtered images in the profile. This representation characterizes the texture information of each sample through a 2D texture descriptor. In the final step, a novel compression approach based on a uniform 2D quantization strategy is applied to remove the redundancy of the 2D texture descriptors. Finally, the CHAP is classified by a Support Vector Machine classifier with histogram intersection kernel, which is very effective for high dimensional histogram-based feature representations. Experimental results confirm the effectiveness of the proposed CHAP in terms of computational complexity, storage requirements and classification accuracy when compared to other AP-based methods.
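The 2D descriptor and its compression can be sketched as follows (a minimal stand-in: block averaging over adjacent bins and adjacent filter thresholds plays the role of the uniform 2D quantization; the data and block sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n_bins, n_filters = 32, 10
# 2D texture descriptor of one pixel: columns = local histograms along the AP,
# rows = a fixed histogram bin across the sequence of attribute-filtered images.
descriptor = rng.random((n_bins, n_filters))

# Uniform 2D quantization sketch: average over blocks of adjacent bins (rows)
# and adjacent filter thresholds (columns), merging the redundant entries.
rb, cb = 4, 2                                    # block sizes (hypothetical)
compressed = descriptor.reshape(n_bins // rb, rb,
                                n_filters // cb, cb).mean(axis=(1, 3))
# 32 x 10 = 320 features are reduced to 8 x 5 = 40.
```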
This paper presents a novel semisupervised learning (SSL) technique defined in the context of ε-insensitive support vector regression (SVR) to estimate biophysical parameters from remotely sensed images. The proposed SSL method aims to mitigate the problems of small-sized biased training sets without collecting any additional samples with reference measures. This is achieved on the basis of two consecutive steps. The first step is devoted to injecting additional prior information into the learning phase of the SVR in order to adapt the importance of each training sample according to the distribution of the unlabeled samples. To this end, a weight is initially associated with each training sample based on a novel strategy that assigns higher weights to the samples located in high density regions of the feature space and reduced weights to those that fall into low density regions. Then, in order to exploit different weights for training samples in the learning phase of the SVR, we introduce a weighted SVR (WSVR) algorithm. The second step is devoted to jointly exploiting labeled and informative unlabeled samples to further improve the definition of the WSVR learning function. To this end, the most informative unlabeled samples, whose target values are expected to be estimated accurately, are initially selected according to a novel strategy that relies on the distribution of the unlabeled samples in the feature space and on the WSVR function estimated in the first step. Then, we introduce a restructured WSVR algorithm that jointly uses labeled and unlabeled samples in its learning phase and tunes their importance by different values of the regularization parameters. Experimental results obtained for the estimation of single-tree stem volume show the effectiveness of the proposed SSL method.
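The density-based weighting of the first step can be sketched as follows (random stand-in data; the inverse-mean-kNN-distance weighting below is one simple way to realize "higher weights in high density regions", not necessarily the exact strategy of the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
labeled = rng.normal(size=(20, 3))      # training samples with reference measures
unlabeled = rng.normal(size=(500, 3))   # abundant unlabeled image samples

# Density of each training sample, estimated from its k nearest unlabeled
# neighbours: a small mean distance means a high-density region.
k = 25
d = np.linalg.norm(labeled[:, None, :] - unlabeled[None, :, :], axis=2)
knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)

# Higher weight for samples in dense regions; these weights would then enter
# the weighted SVR learning phase (e.g. as per-sample regularization terms).
weights = 1.0 / knn_mean
weights /= weights.mean()
```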
This paper presents a novel active learning (AL) technique in the context of ε-insensitive support vector regression (SVR) to estimate biophysical parameters from remotely sensed images. The proposed AL method aims at selecting the most informative and representative unlabeled samples, i.e., those with maximum uncertainty, diversity and density assessed according to the SVR estimation rule. This is achieved on the basis of two consecutive steps that rely on kernel k-means clustering. In the first step, the most uncertain unlabeled samples are selected by removing the most certain ones from a pool of unlabeled samples. In SVR problems, the most uncertain samples are located outside or on the boundary of the ε-tube of the SVR, as their target values have the lowest confidence of being correctly estimated. In order to select these samples, kernel k-means clustering is applied to all unlabeled samples together with the training samples that are not support vectors (non-SVs), i.e., those inside the ε-tube. Then, clusters containing non-SVs are rejected, whereas the unlabeled samples contained in the remaining clusters are selected as the most uncertain samples. In the second step, the samples located in high density regions of the kernel space and as diverse as possible from each other are chosen among the uncertain samples. The density and diversity of the unlabeled samples are evaluated on the basis of their cluster information. To this end, the density of each cluster is initially measured as the ratio of the number of samples in the cluster to the distance between its two furthest samples. Then, the highest density clusters are chosen, and the medoid samples closest to the centers of the selected clusters are taken as the most informative ones. Diversity is ensured by selecting only one sample from each selected cluster. Experiments applied to the estimation of single-tree parameters, i.e., tree stem volume and tree stem diameter, show the effectiveness of the proposed technique.
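The density/diversity selection of the second step can be sketched as follows. Plain k-means on synthetic 2-D samples stands in for the kernel k-means of the paper, and the cluster count and number of selected samples are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))        # the uncertain unlabeled samples

# Plain k-means as a simple stand-in for kernel k-means.
k = 8
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == c].mean(0) if np.any(labels == c)
                        else centers[c] for c in range(k)])

def density(c):
    pts = X[labels == c]
    if len(pts) < 2:
        return 0.0
    # cluster size divided by the distance between its two furthest samples
    diam = max(np.linalg.norm(p - q) for p in pts for q in pts)
    return len(pts) / diam

# Keep the densest clusters and take the medoid-like sample of each one:
# density and diversity at once, one representative per dense cluster.
best = sorted(range(k), key=density, reverse=True)[:4]
picks = [X[labels == c][np.argmin(np.linalg.norm(X[labels == c] - centers[c],
                                                 axis=1))] for c in best]
```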
This paper addresses the problem of updating land-cover maps by classifying multitemporal remote sensing images (i.e., images acquired on the same area at different times) in the context of change-detection-driven active transfer learning. The proposed method is based on the assumption that training samples are available for one of the multitemporal images (i.e., the source domain), whereas they are not for the others (i.e., the target domain). In order to effectively classify the target domain (i.e., update the maps obtained for the source domain according to the new information brought by another acquisition), we present a novel approach to automatically define a training set for the target domain by taking advantage of its temporal correlation with the source domain. The proposed method is based on four steps. In the first step, unsupervised change detection is applied to the multitemporal images (i.e., target and source domains). In the second step, the labels of training samples detected as unchanged are propagated from the source to the target domain, thus becoming its initial training set. In the third step, changed areas are statistically compared with the land-cover classes in the target domain training set. This information is used to drive the expansion of the initial training set by active learning (AL): in the first expansion iterations, priority is given to samples detected as changed; in the following iterations, the most informative samples are selected from a pool including both changed and unchanged unlabeled samples (i.e., the priority is removed). At convergence of the AL process, the target image is classified (fourth step). To this end, we use a Support Vector Machine classifier. Experimental results show that transferring the class labels from the source domain to the target domain provides a reliable initial training set, and that the priority rule for AL leads to faster convergence to the desired accuracy with respect to standard AL.
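The label propagation and priority pools (steps one to three) can be sketched as follows, with a random stand-in for the unsupervised change-detection output:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
source_labels = rng.integers(0, 4, size=n)   # classified source-domain samples
# Hypothetical unsupervised change-detection result (True = sample changed).
changed = rng.random(n) < 0.2

# Step 2: labels of unchanged samples are propagated from the source to the
# target domain, forming the target domain's initial training set.
unchanged_idx = np.flatnonzero(~changed)
initial_training = source_labels[unchanged_idx]

# Step 3 (priority rule): early AL iterations query only from the changed
# pool; later iterations draw from the full unlabeled pool.
priority_pool = np.flatnonzero(changed)
```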
In this paper, a multiple description image coding scheme is proposed to facilitate the transmission of images over media with possible packet loss. The proposed method is based on finding the optimal reconstruction filter coefficients that will be used to reconstruct lost descriptions. To this end, the original image is first downsampled and each subimage is coded using standard JPEG. The decoded images are then mapped back to the original image size using the optimal filters. The multiple descriptions consist of the coded downsampled images and the corresponding optimal reconstruction filter coefficients. It is shown that the proposed method provides better results than standard interpolation filters (i.e., bicubic and bilinear).
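The scheme can be sketched as follows. The 2x2 polyphase split and a single least-squares predictor are simplifying assumptions (the paper codes each subimage with JPEG and derives optimal reconstruction filters, not this pointwise linear predictor):

```python
import numpy as np

rng = np.random.default_rng(7)
img = rng.random((8, 8))   # stand-in image

# Four descriptions: the 2x2 polyphase subimages of the original image.
subs = [img[i::2, j::2] for i in (0, 1) for j in (0, 1)]

# Encoder side: fit least-squares coefficients that predict one subimage from
# the other three; the coefficients travel with the stream as side information
# (a linear stand-in for the optimal reconstruction filter coefficients).
A = np.stack([subs[m].ravel() for m in (1, 2, 3)], axis=1)
coef, *_ = np.linalg.lstsq(A, subs[0].ravel(), rcond=None)

# Decoder side: description 0 was lost; rebuild it from the received ones.
rebuilt = (A @ coef).reshape(4, 4)
```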