Recent state-of-the-art work on speaker recognition and verification uses a simple factor analysis to derive a low-dimensional "total variability space" which simultaneously captures speaker and channel variability. This approach simplified earlier work using joint factor analysis to separately model speaker and channel differences. Here we adapt this "i-vector" method to image classification by replacing speakers with image categories, voice cuts with images, and cepstral features with SURF local descriptors; the role of channel variability is played by differences in image backgrounds or lighting conditions. A universal Gaussian mixture model (UGMM) is trained without supervision on SURF descriptors extracted from a varied and extensive image corpus. An individual image is modeled by additively perturbing the supervector of stacked means of this UGMM by the product of a low-rank total variability matrix (TVM) and a normally distributed hidden random vector X. The TVM is learned by an EM algorithm that maximizes the sum of log-likelihoods of descriptors extracted from training images, where each likelihood is computed with respect to the GMM obtained by perturbing the UGMM means via the TVM as above, leaving the UGMM covariances unchanged. Finally, the low-dimensional i-vector representation of an image is the expected value of the posterior distribution of X conditioned on the image's descriptors, computed via straightforward matrix manipulations involving the TVM and image-specific Baum-Welch statistics. We compare classification rates found with (i) i-vectors, (ii) PCA, (iii) Discriminant Attribute Projection (the last two trained on Gaussian MAP-adapted supervector image representations), and (iv) replacing the TVM with the matrix of dominant PCA eigenvectors before i-vector extraction.
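The i-vector described above has a closed-form expression: the posterior mean of X given the image's zeroth- and first-order Baum-Welch statistics. A minimal numpy sketch, with all sizes and identifiers illustrative (the abstract specifies none of them):

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F_centered):
    """Posterior mean of the hidden vector X given Baum-Welch statistics.

    T          : (C*D, R) low-rank total variability matrix (TVM)
    Sigma_inv  : (C*D,) inverse of the stacked diagonal UGMM covariances
    N          : (C,) zeroth-order statistics (soft counts per mixture)
    F_centered : (C*D,) first-order statistics centered on the UGMM means
    C mixtures, D descriptor dims, R i-vector dims (toy sizes below).
    """
    C = N.shape[0]
    D = F_centered.shape[0] // C
    # Expand per-mixture counts to match the stacked supervector layout.
    N_expanded = np.repeat(N, D)                       # (C*D,)
    TtS = T.T * Sigma_inv                              # (R, C*D)
    precision = np.eye(T.shape[1]) + (TtS * N_expanded) @ T
    return np.linalg.solve(precision, TtS @ F_centered)

# Toy example: 4 mixtures, 3-dim descriptors, 2-dim i-vector.
rng = np.random.default_rng(0)
T = rng.standard_normal((12, 2))
ivec = extract_ivector(T, np.ones(12), np.array([5., 3., 2., 1.]),
                       rng.standard_normal(12))
print(ivec.shape)  # (2,)
```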
This research addresses the document vs. non-document image classification problem. The ability to select images
containing text from an OCR processing stream that also includes images of scenes, people, faces, etc., will
eliminate unnecessary computation and free up valuable computer resources for other tasks. This is particularly
true for high volume OCR systems. Fisher vectors represent images as gradients of a global generative Gaussian
Mixture Model (GMM) of low level image descriptors, and exhibit state-of-the-art performance for object categorization.
Gaussian supervectors represent images by soft clustering low level image descriptors according to
posterior GMM mixture probabilities, optionally using MAP adaptation, and have demonstrated state-of-the-art
performance for scene categorization. We compare results obtained by applying linear SVMs to Fisher vector
and Gaussian supervector representations to categorize images as having only text, no text, or a mixture of
text and non-text. We also report the performance of GMM-based soft versions of vectors of locally aggregated
descriptors (VLAD) and Bag of Visual Words (BOV).
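The Gaussian supervector representation can be sketched in a few lines of numpy: soft-assign descriptors to mixtures by posterior probability, then MAP-adapt the means. Function name, toy sizes, and the relevance factor are illustrative, not taken from the text:

```python
import numpy as np

def gaussian_supervector(X, means, covs, weights, relevance=16.0):
    """MAP-adapted mean supervector of descriptors X under a diagonal GMM.

    X: (N, D) descriptors; means: (C, D); covs: (C, D) diagonal; weights: (C,).
    Returns the stacked adapted means, shape (C*D,).
    """
    # Log posterior responsibilities under each diagonal Gaussian mixture.
    log_p = -0.5 * (((X[:, None, :] - means) ** 2 / covs).sum(-1)
                    + np.log(covs).sum(-1)
                    + means.shape[1] * np.log(2 * np.pi))
    log_p += np.log(weights)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)          # (N, C)
    n = gamma.sum(axis=0)                              # soft counts per mixture
    first = gamma.T @ X                                # (C, D) first-order stats
    alpha = (n / (n + relevance))[:, None]             # MAP adaptation weight
    adapted = alpha * (first / np.maximum(n, 1e-10)[:, None]) \
        + (1 - alpha) * means
    return adapted.ravel()

rng = np.random.default_rng(1)
sv = gaussian_supervector(rng.standard_normal((50, 4)),
                          rng.standard_normal((8, 4)),
                          np.ones((8, 4)), np.full(8, 1 / 8))
print(sv.shape)  # (32,)
```

A linear SVM is then trained on these stacked vectors, one per image.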
Generic optical character recognition (OCR) engines often perform very poorly in transcribing scanned low resolution
(LR) text documents. To improve OCR performance, we apply the Neighbor Embedding (NE) single-image
super-resolution (SISR) technique to LR scanned text documents to obtain high resolution (HR) versions, which we
subsequently process with OCR. For comparison, we repeat this procedure using bicubic interpolation (BI). We demonstrate
that mean-square errors (MSE) in NE HR estimates do not increase substantially when NE is trained in one
Latin font style and tested in another, provided both styles belong to the same font category (serif or sans serif). This
is very important in practice, since for each font size, the number of training sets required for each category may be
reduced from dozens to just one. We also incorporate randomized k-d trees into our NE implementation to perform
approximate nearest neighbor search, and obtain a 1000x speed up of our original NE implementation, with negligible
MSE degradation. This acceleration also made it practical to combine all of our size-specific NE Latin models
into a single Universal Latin Model (ULM). The ULM eliminates the need to determine the unknown font category
and size of an input LR text document and match it to an appropriate model, a very challenging task, since the dpi
(dots per inch) of the input LR image is generally unknown. Our experiments show that OCR character error rates
(CER) were over 90% when we applied the Tesseract OCR engine to LR text documents (scanned at 75 dpi and 100
dpi) in the 6-10 pt range. By contrast, using k-d trees and the ULM, CER after NE preprocessing averaged less than
7% at 3x (100 dpi LR scanning) and 4x (75 dpi LR scanning) magnification, over an order of magnitude improvement.
Moreover, CER after NE preprocessing was more than 6 times lower on average than after BI preprocessing.
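The core neighbor-embedding step can be sketched as follows, assuming patch vectors as inputs; brute-force nearest-neighbor search is used here for self-containedness, whereas the paper accelerates this step with randomized k-d trees for approximate search. All names and sizes are illustrative:

```python
import numpy as np

def ne_upscale(lr_patches, train_lr, train_hr, k=5):
    """Neighbor-embedding super-resolution, as a minimal sketch.

    Each LR patch is written as an affine combination of its k nearest LR
    training patches (weights solve a regularized local Gram system and sum
    to one); the same weights then combine the paired HR training patches.
    """
    hr = np.empty((len(lr_patches), train_hr.shape[1]))
    for i, p in enumerate(lr_patches):
        d2 = ((train_lr - p) ** 2).sum(axis=1)
        idx = np.argpartition(d2, k)[:k]          # k nearest (brute force)
        Z = train_lr[idx] - p                     # neighbors centered on p
        G = Z @ Z.T                               # local Gram matrix
        G += np.eye(k) * (1e-8 * np.trace(G) + 1e-12)   # regularize
        w = np.linalg.solve(G, np.ones(k))
        w /= w.sum()                              # affine (sum-to-one) weights
        hr[i] = w @ train_hr[idx]
    return hr

rng = np.random.default_rng(3)
train_lr = rng.standard_normal((200, 9))          # e.g. 3x3 LR patches
train_hr = rng.standard_normal((200, 81))         # paired 9x9 HR patches
hr = ne_upscale(rng.standard_normal((4, 9)), train_lr, train_hr)
print(hr.shape)  # (4, 81)
```

Reconstructed HR patches are then tiled (with overlap averaging) into the HR image before OCR.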
Automatic Identification Systems (AIS) are commonly used in navigation for collision avoidance, and AIS
signals (GMSK modulation) contain a vessel's identity, position, course and speed - information which is
also vital in safeguarding U.S. ports. AIS systems employ Self Organizing Time Division Multiple Access
(SOTDMA) regions in which users broadcast in dedicated time slots to prevent AIS collisions. However,
AIS signals broadcast from outside a SOTDMA region may collide with those originating inside, and
demodulation in co-channel interference is desirable. In this article we compare two methods for performing
such demodulation. The first method involves Laurent's Amplitude Modulated Pulse (AMP) decomposition
of constant amplitude binary phase modulated signals. Kaleh has demonstrated that this method is
highly accurate for demodulating a single GMSK signal in additive Gaussian white noise (AWGN). Here
we evaluate the performance of this Laurent-Kaleh method for demodulating a target AIS signal through a
collision with an interfering AIS signal. We also introduce a second, far simpler demodulation method
which employs a set of filters matched to tribit states and phases of GMSK signals. We compute the bit
error rate (BER) for these two methods in demodulating a target AIS signal through a collision with
another AIS signal, both as a function of the signal-to-interference ratio (SIR), and as a function the carrier
frequency difference (CFD) between the two signals. Our experiments show that there is no outstanding
advantage for either of these methods over a wide range of SIR and CFD values. However, the matched filter
approach is conceptually much simpler, easier to motivate and implement, while the Laurent-Kaleh
method involves a highly complex and non-intuitive signal decomposition.
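The matched-filter idea can be illustrated with a small sketch that builds one reference waveform per tribit pattern. The Gaussian frequency pulse here is a truncated-Gaussian approximation of the true Gaussian-filtered rectangle, and the sample rate is illustrative; BT = 0.4 is the AIS value:

```python
import numpy as np
from itertools import product

def gmsk_tribit_bank(sps=8, bt=0.4):
    """Bank of matched-filter references, one per tribit state of GMSK.

    Sketch only: the frequency pulse is a truncated Gaussian spanning three
    bit periods, normalized so each bit contributes pi/2 of phase.
    """
    n = 3 * sps
    t = (np.arange(n) - (n - 1) / 2) / sps
    sigma = np.sqrt(np.log(2)) / (2 * np.pi * bt)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    g *= (np.pi / 2) / g.sum()                 # pi/2 phase per bit
    refs = {}
    for bits in product([-1, 1], repeat=3):
        # Superpose the three overlapping frequency pulses, one per bit.
        f = np.zeros(5 * sps)
        for i, b in enumerate(bits):
            f[i * sps:i * sps + n] += b * g
        phase = np.cumsum(f)
        refs[bits] = np.exp(1j * phase)[sps:4 * sps]   # central 3-bit window
    return refs

bank = gmsk_tribit_bank()
print(len(bank))  # 8
```

Demodulation then slides a 3-bit window over the received signal, correlates it against all eight references (at a set of trial phases), and takes the middle bit of the best-matching pattern.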
KEYWORDS: Electromagnetism, Signal to noise ratio, Error analysis, Receivers, Transmitters, Super resolution, Statistical analysis, Signal processing, Doppler effect, Motion models
For wide-band transmission, geolocation modeling using the wide-band cross-ambiguity function (WBCAF) is preferable
to conventional CAF modeling, which assumes that the transmitted signal is essentially a sinusoid. We compare
the accuracy of two super-resolution techniques for joint estimation of the time-scale (TS) and TDOA
parameters in the WBCAF geolocation model. Assuming a complex-valued signal representation, both techniques
exploit the fact that the maximum value of the magnitude of the WBCAF is attained when the WBCAF is real-valued.
The first technique enhances a known joint estimation method based on sinc interpolation and 2-D Newton root-finding
by (1) extending the original algorithm to handle complex-valued signals, and (2) reformulating the original algorithm
to estimate the difference in radial velocities of the receivers (DV) rather than time scale, which avoids machine
precision problems encountered with the original method. The second technique makes a rough estimate of TDOA on
the sampling lattice by peak-picking the real part of the cross-correlation function of the received signals. Then, by
interpolating the phase of the WBCAF, it obtains a root of the phase in the vicinity of this correlation peak, which
provides a highly accurate TDOA estimate. TDOA estimates found in this way are differentiated in time to obtain DV
estimates. We evaluate both super-resolution techniques applied to simulated received electromagnetic signals which
are linear combinations of complex sinusoids having randomly generated amplitudes, phases, TS, and TDOA. Over a
wide SNR range, TDOA estimates found with the enhanced sinc/Newton technique are at least an order of magnitude
more accurate than those found with conventional CAF, and the phase interpolated TDOA estimates are 3-4 times
more accurate than those found with the enhanced sinc/Newton technique. In the 0-10 dB SNR range, TS estimates
found with the enhanced sinc/Newton technique are a little more accurate than those found with phase interpolation.
Moreover, the TS estimate errors observed with both super-resolution techniques are too small for a CAF-type grid
search to realize in comparable time.
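The second (phase-interpolation) technique can be sketched as follows. This is a simplification: circular FFT correlation replaces the WBCAF, and a two-point linear interpolation of the phase stands in for the interpolation described in the text; all names are ours:

```python
import numpy as np

def tdoa_phase_interp(x, y, fs):
    """Sub-sample TDOA of x relative to y (a simplified sketch).

    Coarse lattice estimate: peak of the real part of the complex circular
    cross-correlation. Refinement: the correlation is real-valued (its phase
    has a root) at the true delay, so the phase is linearly interpolated
    between the peak sample and its neighbor to locate that root.
    """
    n = len(x)
    r = np.fft.ifft(np.conj(np.fft.fft(y)) * np.fft.fft(x))
    k = int(np.argmax(r.real))
    p0 = np.angle(r[k])
    p1 = np.angle(r[(k + 1) % n])
    frac = p0 / (p0 - p1) if p0 != p1 else 0.0
    lag = k if k <= n // 2 else k - n          # unwrap the circular lag
    return (lag + frac) / fs

rng = np.random.default_rng(4)
s = rng.standard_normal(256)
tau = tdoa_phase_interp(np.roll(s, 5), s, fs=1.0)   # close to the true lag 5
```

Differencing such TDOA estimates over time then yields the DV estimates used in the paper.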
The conventional cross-ambiguity function (CAF) process assumes that the transmitted signal is a sinusoid
having slowly varying complex modulation, and models a received signal as a delayed version of the transmitted
signal, doppler shifted by the dominant frequency. For wide-band transmitted signals, it is more accurate to
model a received signal as a time-scaled version of the transmitted signal, combined with a time delay, and
wide-band cross-ambiguity models are well-known. We provide derivations of time-dependent wide-band cross-ambiguity
functions appropriate for estimating radar target range and velocity, and time-difference of arrival
(TDOA) and differential receiver velocity (DV) for geolocation. We demonstrate through simulations that for
wide-band transmission, these scale CAF (SCAF) models are significantly more accurate than CAF for estimating
target range and velocity, TDOA and DV. In these applications, it is critical that the SCAF surface be evaluated
in real-time, and we provide a method for fast computation of the scale correlation in SCAF, using only the
discrete Fourier transform (DFT). SCAF estimates of delay and scale are computed on a discrete lattice, which
may not provide sufficient resolution. To address this issue we further demonstrate simple methods, based on
the DFT and phase differentiation of the time-dependent SCAF surface, by which super resolution of delay and
scale may be achieved.
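A brute-force evaluation of the delay-scale surface makes the SCAF model concrete. This sketch resamples by linear interpolation for simplicity (the paper's fast method computes the scale correlation with DFTs alone), and all sizes are illustrative:

```python
import numpy as np

def scaf_surface(x, y, scales):
    """Delay-scale correlation surface (a brute-force sketch).

    For each candidate scale a, y is resampled to y(a t) and correlated
    against x over all delays with one FFT cross-correlation.
    """
    n = len(x)
    t = np.arange(n)
    X = np.fft.fft(x)
    surf = np.empty((len(scales), n))
    for i, a in enumerate(scales):
        ys = np.interp(a * t, t, y, left=0.0, right=0.0)  # time-scaled copy
        r = np.fft.ifft(np.conj(np.fft.fft(ys)) * X)      # all delays at once
        surf[i] = np.abs(r)
    return surf          # rows: scales, cols: circular delay lags

rng = np.random.default_rng(5)
s = rng.standard_normal(256)
surf = scaf_surface(np.roll(s, 5), s, np.array([0.95, 1.0, 1.05]))
i, j = np.unravel_index(surf.argmax(), surf.shape)
print(i, j)  # 1 5  (unit scale, delay of 5 samples)
```

The lattice peak found this way is then refined by the DFT- and phase-based super-resolution methods described above.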
KEYWORDS: Signal processing, Signal detection, Picosecond phenomena, Signal to noise ratio, Time-frequency analysis, Interference (communication), Fourier transforms, Signal analyzers, Environmental sensing, Data modeling
We address the problem of efficient resolution, detection and estimation of weak tones in a potentially massive
amount of data. Our goal is to produce a relatively small reduced data set characterizing the signals in the
environment in time and frequency. The process must be computationally efficient, provide high gain, resolve
closely spaced signals, and compress the signal information into a form that may be easily displayed and further
processed. We base our process on the cross spectral representation we have previously applied to other problems,
and we compare it against other representations and estimation methods such as the Wigner distribution and
Welch's method. The spectral estimation method we propose is a variation of Welch's method and the cross-power
spectral (CPS) estimator, which was first applied to signal estimation and detection in the mid-1980s. The CPS
algorithm and the method we present here are based on the principles first described by Kodera et al., now
frequently called the reassignment principle.
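The reassignment principle behind the CPS estimator can be sketched briefly: multiply the STFT by the conjugate STFT of the signal delayed one sample, and read the local frequency off the phase of the product. Window length, hop, and function name are illustrative:

```python
import numpy as np

def reassigned_frequencies(x, fs, win_len=128, hop=32):
    """Dominant-tone frequency estimates via the cross-power spectrum.

    The phase of stft(t+1) * conj(stft(t)) at a bin is the phase advance per
    sample, i.e. the local (reassigned) frequency at that bin.
    """
    w = np.hanning(win_len)
    freqs = []
    for start in range(0, len(x) - win_len - 1, hop):
        f0 = np.fft.rfft(w * x[start:start + win_len])
        f1 = np.fft.rfft(w * x[start + 1:start + 1 + win_len])
        cps = f1 * np.conj(f0)                 # cross-power spectrum
        k = np.argmax(np.abs(cps))             # dominant bin
        freqs.append(np.angle(cps[k]) * fs / (2 * np.pi))
    return np.array(freqs)

fs = 1000.0
t = np.arange(2048) / fs
f_est = reassigned_frequencies(np.cos(2 * np.pi * 50.0 * t), fs)
```

Note that the estimate is not quantized to the FFT bin spacing: the phase supplies frequency resolution well below one bin.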
We present a method for computing the theoretically exact estimate of the instantaneous frequency (IF) of a signal from local values of its short time Fourier transform under the assumption that the complex logarithm of the signal is a polynomial in time. We apply the method to the problem of estimating and separating non-stationary components of a multi-component signal. Signal estimation and separation is based on a linear TF model in which the value of the signal at each time is distributed in frequency. This is a significant departure from the conventional nonlinear model in which signal energy is distributed in time and frequency. We further demonstrate by a simple example that IF estimated by the higher order method is significantly better than previously used first order methods.
KEYWORDS: Signal processing, Fourier transforms, Fermium, Frequency modulation, Signal detection, Nonlinear optics, Composites, Digital filtering, Linear filtering, Defense and security
We describe a new linear time-frequency paradigm in which the
instantaneous value of each signal component is mapped to the
curve functionally representing its instantaneous frequency.
The transform by which this surface is generated is linear,
uniquely defined by the signal decomposition and satisfies linear
marginal-like distribution properties. We further demonstrate
that such a surface may be estimated from the short time Fourier
transform by a concentration process based on the phase of the STFT
differentiated with respect to time. Interference may be identified
on the concentrated STFT surface, and the signal with the interference
removed may be estimated by applying the linear time marginal to the
concentrated STFT surface from which the interference components have
been removed.
In previous works, Umesh et al. demonstrated that phonetically similar vowels spoken by different individuals are related by a simple translation in a universal warped spectral representation. They experimentally derived this warping function and called it the “speech-scale”. We present further experimental evidence, based on a large data set, validating the speech-scale. We also estimate speaker-specific scale factors based on the speech-scale, and we present a vowel classification experiment, which demonstrates a significant performance improvement through a normalization based on the speech-scale. The results we present are based on formant estimates of vowels in a Western Michigan vowel database.
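The normalization idea can be shown with a toy example. A log warp is used as a stand-in (the actual speech-scale is experimentally derived, not logarithmic), and the formant values are illustrative:

```python
import numpy as np

def normalize_formants(formants_hz, speaker_offset):
    """Translate formants by a speaker offset in the warped (log) domain,
    then map back to Hz. Sketch only: log replaces the true speech-scale."""
    return np.exp(np.log(formants_hz) - speaker_offset)

# Under a log warp, a uniform vocal-tract scaling by s becomes the
# translation log(s), so normalization recovers the reference formants.
s = 1.2
f_ref = np.array([500.0, 1500.0, 2500.0])   # illustrative vowel formants
f_norm = normalize_formants(s * f_ref, np.log(s))
print(np.allclose(f_norm, f_ref))  # True
```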
KEYWORDS: Databases, Biometrics, Statistical analysis, Detection and tracking algorithms, Algorithm development, Fourier transforms, Ear, Data processing, Signal processing, Standards development
We address the problem of classification of speakers based on measurements of features obtained from their speech. The process is an adaptation of biometric methods used to identify people. The process for speech differs since speech is not stationary. We therefore propose the classification of speakers by the statistical distributions of parameters which may be accurately estimated by modern signal processing techniques. The intent is to develop a speaker clustering algorithm which is independent of transmission channel and insensitive to language variations, and which may be re-trained, with minimal data, to include a new speaker. We demonstrate effectiveness on the problem of identification of the speaker's gender, and present evidence that the methods may be extended to the general problem of speaker identification.
KEYWORDS: Detection and tracking algorithms, Sensors, Algorithm development, Signal detection, Interference (communication), Signal to noise ratio, Signal processing, Signal analyzers, Data analysis, Dimension reduction
Computationally efficient algorithms which perform speech activity detection have significant potential economic and labor-saving benefit, by automating an extremely tedious manual process. In many applications, it is desirable to extract intervals of speech that are interleaved with segments of other signal types. In the past, algorithms which successfully discriminate between speech and one specific other signal type have been developed. Frequently, these algorithms fail when the specific non-speech signal is replaced by a different one. Typically, several signal-specific discriminators are blindly combined, with predictable negative results. Moreover, when a large number of discriminators are involved, dimension reduction is achieved using Principal Components, which optimally compresses signal variance into the fewest number of dimensions. Unfortunately, these new coordinates are not necessarily optimal for discrimination. In this paper we apply graphical tools to determine a set of discriminators which produce excellent speech vs. non-speech clustering, thereby eliminating the guesswork in selecting good feature vectors. This cluster structure provides a basis for a general multivariate speech vs. non-speech discriminator, which compares very favorably with the TALKATIVE speech extraction algorithm.
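The point about Principal Components can be seen in a tiny synthetic example (all numbers illustrative): the direction of maximum variance need not be the direction that discriminates.

```python
import numpy as np

def sep(u, v):
    """Between-class mean gap in units of pooled standard deviation."""
    return abs(u.mean() - v.mean()) / np.sqrt(0.5 * (u.var() + v.var()))

# Two synthetic classes sharing a high-variance axis (axis 0), while a
# low-variance axis (axis 1) is what actually separates them.
rng = np.random.default_rng(2)
a = np.c_[rng.normal(0, 10, 500), rng.normal(-1, 0.2, 500)]
b = np.c_[rng.normal(0, 10, 500), rng.normal(+1, 0.2, 500)]
X = np.vstack([a, b])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                 # scores on the first principal component
# PCA keeps the high-variance shared axis, so the classes overlap on PC1 ...
print(sep(pc1[:500], pc1[500:]) < 0.5)   # True
# ... while the low-variance axis PCA would discard separates them cleanly.
print(sep(a[:, 1], b[:, 1]) > 5.0)       # True
```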