Mini-project

Human voice separation from a mixed source using feature extraction

Abstract:

This project targets human voice features in order to cluster and extract speech from a mix of human and non-human sounds. As in audio cleaning techniques, time-series audio data is processed to obtain spectral features. We try and compare several approaches: deep learning models such as LSTMs, GMMs, and classic techniques such as Kalman filters. The main technique consists of grouping all the homogeneous speech segments obtained at the end of the segmentation process, using the spatial information provided by the stereophonic speech. The VoxCeleb1 dataset was used, with files manually processed to add background sounds and other human sounds drawn from an urban sounds dataset. The isolation approach may later be used for debate audio processing.
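The step of adding background sounds to clean speech can be sketched as follows. This is a minimal illustration, not the project's actual processing script: the helper names `rms` and `mix_at_snr`, the use of a target signal-to-noise ratio, and the synthetic signals are all assumptions made for the example. It uses only the standard library so it runs without dependencies.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaling the noise so the mix has the requested SNR in dB.

    Hypothetical helper; both inputs are equal-length lists of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise gain.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for a speech clip and a background clip (assumed 16 kHz, 1 second).
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]
    mixed = mix_at_snr(speech, noise, snr_db=5.0)
    added = [m - s for m, s in zip(mixed, speech)]
    print("achieved SNR (dB):", 20 * math.log10(rms(speech) / rms(added)))
```

Fixing the SNR per file is one possible design choice; it keeps the difficulty of each mixed example comparable regardless of how loud the original background recording was.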

Reference papers

A Novel Windowing Technique for Efficient Computation of MFCC for Speaker Recognition

10.1109/LSP.2012.2235067

Abstract:

In this letter, we propose a novel family of windowing technique to compute mel frequency cepstral coefficient (MFCC) for automatic speaker recognition from speech. The proposed method is based on fundamental property of discrete time Fourier transform (DTFT) related to differentiation in frequency domain. Classical windowing scheme such as Hamming window is modified to obtain derivatives of discrete time Fourier transform coefficients. It is mathematically shown that this technique takes into account slope of power spectrum and phase information. Speaker recognition systems based on our proposed family of window functions are shown to attain substantial and consistent performance improvement over baseline single tapered Hamming window as well as recently proposed multitaper windowing technique.
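The DTFT property the abstract refers to can be checked numerically: differentiating X(ω) with respect to ω is equivalent to taking the DTFT of the sequence -j·n·x[n]. The sketch below only illustrates that property on a Hamming-windowed frame; it is not the window family proposed in the paper, and the function names and test signal are assumptions made for the example.

```python
import cmath
import math


def dtft(x, omega):
    """Evaluate the DTFT X(omega) = sum_n x[n] * exp(-j*omega*n) of a finite sequence."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))


def hamming(length):
    """Standard Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def dtft_derivative(x, omega):
    """dX/d(omega), obtained as the DTFT of the sequence -j * n * x[n]."""
    return dtft([-1j * n * xn for n, xn in enumerate(x)], omega)


if __name__ == "__main__":
    window = hamming(32)
    # Illustrative frame: a windowed sinusoid.
    frame = [w * math.sin(0.3 * n) for n, w in enumerate(window)]
    omega, step = 0.7, 1e-5
    # Central finite difference as an independent estimate of the derivative.
    numeric = (dtft(frame, omega + step) - dtft(frame, omega - step)) / (2 * step)
    print("difference between the two estimates:", abs(numeric - dtft_derivative(frame, omega)))
```

Because the factor n can be folded into the window itself, the derivative of a windowed spectrum can be obtained with one extra windowed transform rather than by numerical differentiation.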

Deep Neural Network Approaches to Speaker and Language Recognition

10.1109/LSP.2015.2420092

Abstract:

The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data we demonstrate large gains on performance in both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in Cavg on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.

Improving Short Utterance Speaker Recognition by Modeling Speech Unit Classes

10.1109/TASLP.2016.2544660

Abstract:

Short utterance speaker recognition (SUSR) is highly challenging due to the limited enrollment and/or test data. We argue that the difficulty can be largely attributed to the mismatched prior distributions of the speech data used to train the universal background model (UBM) and those for enrollment and test. This paper presents a novel solution that distributes speech signals into a multitude of acoustic subregions that are defined by speech units, and models speakers within the subregions. To avoid data sparsity, a data-driven approach is proposed to cluster speech units into speech unit classes, based on which robust subregion models can be constructed. Furthermore, we propose a model synthesis approach based on maximum likelihood linear regression (MLLR) to deal with no-data speech unit classes. The experiments were conducted on a publicly available database SUD12. The results demonstrated that on a text-independent speaker recognition task where the test utterances are no longer than 2 seconds and mostly shorter than 0.5 seconds, the proposed subregion modeling offered a 21.51% relative reduction in equal error rate (EER), compared with the standard GMM-UBM baseline. In addition, with the model synthesis approach, the performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.

Short-timed speech dynamics for speaker recognition

10.1049/el:19950962

Abstract:

A temporal transition model of speech is proposed for speaker recognition and verification. The issues of model building, distance measure and implementation are addressed. A set of experiments are conducted, which give a 98.9% recognition rate and 99.5% verification rate. Short-timed dynamics of utterance well encodes the speaker specificity.

Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model

10.1109/TSA.2005.858054

Abstract:

In this paper, a text-independent automatic speaker recognition (ASkR) system is proposed, the SR_Hurst, which employs a new speech feature and a new classifier. The statistical feature pH is a vector of Hurst (H) parameters obtained by applying a wavelet-based multidimensional estimator (M_dim_wavelets) to the windowed short-time segments of speech. The proposed classifier for the speaker identification and verification tasks is based on the multidimensional fBm (fractional Brownian motion) model, denoted by M_dim_fBm. For a given sequence of input speech features, the speaker model is obtained from the sequence of vectors of H parameters, means, and variances of these features. The performance of the SR_Hurst was compared to those achieved with the Gaussian mixture models (GMMs), autoregressive vector (AR), and Bhattacharyya distance (dB) classifiers. The speech database, recorded from fixed and cellular phone channels, was uttered by 75 different speakers. The results have shown the superior performance of the M_dim_fBm classifier and that the pH feature aggregates new information on the speaker identity. In addition, the proposed classifier employs a much simpler modeling structure as compared to the GMM.

Speaker based clustering using the differential energy

10.1109/AICCSA.2014.7073264

Abstract:

A new approach of speaker clustering is presented and discussed in this paper. The main technique consists in grouping all the homogeneous speech segments obtained at the end of the segmentation process, by using the spatial information provided by the stereophonic speech. The proposed system is suitable for debates or multi-conferences for which the speakers are located at fixed positions. The new method uses the differential energy of the two stereophonic signals collected by two cardioid microphones, in order to group all the speech segments that are uttered by the same speaker. The total number of clusters obtained at the end should be equal to the real number of speakers present in the meeting room and each cluster should contain the global intervention of only one speaker. The new proposed approach (which we called Energy Differential based Spatial Clustering or EDSC) has been experimented comparatively with a classic statistical approach called "Mono-Gaussian Sequential Clustering". Experiments of speaker clustering are done on a stereophonic speech corpus called DB15, composed of 15 stereophonic scenarios of about 3.5 minutes each. Every scenario corresponds to a free discussion between several speakers seated at fixed positions in the meeting room. Results show the strong performances of the new approach in terms of precision and speed, especially for short speech segments.
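The idea of grouping segments by the energy difference between the two stereo channels can be sketched as below. This is an assumption-laden illustration and not the EDSC algorithm from the paper: the dB formulation, the greedy one-dimensional clustering, the `tolerance_db` value, and all function names are hypothetical choices made for the example.

```python
import math


def differential_energy_db(left, right, eps=1e-12):
    """Energy of the left channel relative to the right channel, in dB, for one segment."""
    e_left = sum(s * s for s in left)
    e_right = sum(s * s for s in right)
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))


def cluster_segments(diffs_db, tolerance_db=3.0):
    """Greedy 1-D clustering: a segment joins the first cluster whose mean is within tolerance.

    `tolerance_db` is an illustrative value, not taken from the paper.
    Returns one cluster label per segment.
    """
    means, counts, labels = [], [], []
    for d in diffs_db:
        for k, mean in enumerate(means):
            if abs(d - mean) <= tolerance_db:
                # Update the running mean of the matched cluster.
                means[k] = (mean * counts[k] + d) / (counts[k] + 1)
                counts[k] += 1
                labels.append(k)
                break
        else:
            # No existing cluster is close enough: start a new one.
            means.append(d)
            counts.append(1)
            labels.append(len(means) - 1)
    return labels


if __name__ == "__main__":
    # Two toy "speakers": one closer to the left microphone, one closer to the right.
    loud, quiet = [1.0, -1.0, 1.0, -1.0], [0.25, -0.25, 0.25, -0.25]
    segments = [(loud, quiet), (quiet, loud), (loud, quiet)]
    diffs = [differential_energy_db(left, right) for left, right in segments]
    print(cluster_segments(diffs))
```

With speakers seated at fixed positions, each speaker should produce a roughly constant energy difference, which is why a single scalar per segment can be enough to separate them in this toy setting.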

Impact of overlapping speech detection on speaker diarization for broadcast news and debates

10.1109/ICASSP.2013.6639163

Abstract:

The overlapping speech detection systems developed by Orange and LIMSI for the ETAPE evaluation campaign on French broadcast news and debates are described. Using either cepstral features or a multi-pitch analysis, an F1-measure for overlapping speech detection up to 59.2% is reported on the TV data of the ETAPE evaluation set, where 6.7% of the speech was measured as overlapping, ranging from 1.2% in the news to 10.4% in the debates. Overlapping speech segments were excluded during the speaker diarization stage, and these segments were further labelled with the two nearest speaker labels, taking into account the temporal distance. We describe the effects of this strategy for various overlapping speech systems and we show that it improves the diarization error rate in all situations and up to 26.1% relative in our best configuration.

Classification of multi speaker shouted speech and single speaker normal speech

10.1109/TENCON.2017.8228261

Abstract:

This work proposes a method for the shouted and multi speaker's vs normal and single speaker's speech classification, which is the most frequently occurring scenario in news debates. In this work, multi speaker shouted and single speaker normal speech classes are addressed as shouted and normal speech, respectively. Spectral features and source features are explored for the classification task. The source characteristics are studied in terms of strength of excitation (SoE). Spectral flux, spectral tilt, sum of ten largest spectral peaks (STLP), modulation spectrum energy (ModSE) and Mel frequency cepstral coefficients (MFCCs) are explored as the spectral features. Shouted and normal speech are classified using two approaches. In the first approach, these features, except MFCCs, are non-linearly mapped and combined using a threshold based technique. In the second approach, a predefined radial basis function (RBF) kernel based Support Vector Machine (SVM) classifier is used for the classification task on the extracted features. The performance evaluation is done in terms of F-Score. The performance is also evaluated on the basis of leave one out analysis to measure the strength of a particular feature for this task. By leave one out analysis, SoE is the most important feature among all one-dimensional features. When all the features are combined for classification, F-score of forty four dimensional feature is highest.
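Of the spectral features listed in this abstract, spectral flux is the simplest to illustrate. The sketch below assumes one common definition, the Euclidean distance between the magnitude spectra of consecutive frames; the paper may normalise or define it differently. The function names and the naive DFT are choices made to keep the example dependency-free.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_flux(frames):
    """Euclidean distance between the magnitude spectra of consecutive frames."""
    spectra = [magnitude_spectrum(f) for f in frames]
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, cur)))
        for prev, cur in zip(spectra, spectra[1:])
    ]


if __name__ == "__main__":
    # Two identical frames followed by a frame at a different frequency.
    low = [math.sin(2 * math.pi * 4 * i / 32) for i in range(32)]
    high = [math.sin(2 * math.pi * 9 * i / 32) for i in range(32)]
    print(spectral_flux([low, low, high]))
```

The flux stays near zero while the spectrum is stable and jumps when the spectral content changes, which is the behaviour a classifier would exploit.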

Voice source features for cognitive load classification

10.1109/ICASSP.2011.5947654

Abstract:

Previous work in speech-based cognitive load classification has shown that the glottal source contains important information for cognitive load discrimination. However, the reliability of glottal flow features depends on the accuracy of the glottal flow estimation, which is a non-trivial process. In this paper, we propose the use of acoustic voice source features extracted directly from the speech spectrum (or cepstrum) for cognitive load classification. We also propose pre and post-processing techniques to improve the estimation of the cepstral peak prominence (CPP). 3-class classification results on two databases showed CPP as a promising cognitive load classification feature that outperforms glottal flow features. Score-level fusion of the CPP-based classification system with a formant frequency-based system yielded a final improved accuracy of 62.7%, suggesting that CPP contains useful voice source information that complements the information captured by vocal tract features.

The Delta-Phase Spectrum With Application to Voice Activity Detection and Speaker Recognition

10.1109/TASL.2011.2109379

Abstract:

For several reasons, the Fourier phase domain is less favored than the magnitude domain in signal processing and modeling of speech. To correctly analyze the phase, several factors must be considered and compensated, including the effect of the step size, windowing function and other processing parameters. Building on a review of these factors, this paper investigates a spectral representation based on the Instantaneous Frequency Deviation, but in which the step size between processing frames is used in calculating phase changes, rather than the traditional single sample interval. Reflecting these longer intervals, the term delta-phase spectrum is used to distinguish this from instantaneous derivatives. Experiments show that mel-frequency cepstral coefficients features derived from the delta-phase spectrum (termed Mel-Frequency delta-phase features) can produce broadly similar performance to equivalent magnitude domain features for both voice activity detection and speaker recognition tasks. Further, it is shown that the fusion of the magnitude and phase representations yields performance benefits over either in isolation.
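The notion of measuring phase change over the frame step rather than over a single sample can be sketched as follows. This is a phase-vocoder style calculation chosen for illustration and is an assumption about how such a quantity might be computed, not the paper's exact formulation; the function names and parameters are hypothetical.

```python
import cmath
import math


def bin_phase(frame, k):
    """Phase of DFT bin k of one frame."""
    n = len(frame)
    return cmath.phase(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))


def wrap(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def delta_phase(signal, k, frame_len, hop):
    """Wrapped phase change of bin k between consecutive frames `hop` samples apart,
    after removing the advance 2*pi*k*hop/frame_len expected for a tone centred on bin k."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    phases = [bin_phase(signal[s:s + frame_len], k) for s in starts]
    expected = 2 * math.pi * k * hop / frame_len
    return [wrap(b - a - expected) for a, b in zip(phases, phases[1:])]


if __name__ == "__main__":
    # A tone exactly on bin 4 of a 32-sample frame: its deviation should be close to zero.
    signal = [math.cos(2 * math.pi * 4 * n / 32) for n in range(64)]
    print(delta_phase(signal, k=4, frame_len=32, hop=5))
```

No window is applied here to keep the example short; as the abstract notes, the step size and windowing function both affect the phase and would need to be accounted for in a real feature extractor.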


