Abstract
Despite recent advances, improving the accuracy of automatic speaker recognition systems remains an important and challenging area of research. This thesis investigates two-phase based features, namely the frequency modulation (FM) feature and the group delay feature in order to improve the speaker recognition accuracy. Introducing features complementary to spectral envelope-based features is a promising approach for increasing the information content of the speaker recognition system. Although phase-based features are motivated by psychophysics and speech production considerations, they have rarely been incorporated into speaker recognition front-ends.
A theory has been developed and reported in this thesis, to show that the FM component can be extracted using second-order all pole modelling, and a technique for extracting FM features using this model is proposed, to produce very smooth, slowly varying FM features that are effective for speaker recognition tasks. This approach is shown herein to significantly improve speaker recognition performance over other existing FM extraction methods.
A highly computationally efficient FM estimation technique is then proposed and its computational efficiency is shown through a comparative study with other methods with respect to the trade off between computational complexity and performance.
In order to further enhance the FM based front-end specifically for speaker recognition, optimum frequency band allocation is studied in terms of the number of sub-bands and spacing of centre frequencies, and two new frequency band re-allocations are proposed for FM based speaker recognition.
Two group delay features are also proposed: log compressed group delay feature and the sub-band group delay feature, to address problems in group delay caused by the zeros of the z-transform polynomial of a speech signal being close to the unit circle. It has been shown that the combination of group delay and FM, complements Mel Frequency Cepstral Coefficient (MFCC) in speaker recognition tasks.
Furthermore, the proposed FM feature is successfully utilised for automatic forensic speaker recognition, which is implemented based on the likelihood ratio framework with two stage modelling and calibration, and shown to behave in a complementary manner to MFCCs. Notably, the FM based system provides better calibration loss than the MFCC based system, suggesting less ambiguity of FM information than MFCC information in an automatic forensic speaker recognition system.
In order to demonstrate the effectiveness of FM features in a large scale speaker recognition environment, an FM-based speaker recognition subsystem is developed and submitted to the NIST 2008 speaker recognition evaluation as part of the I4U submission. Post evaluation analysis shows a 19.7% relative improvement over the traditional MFCC based subsystem when it is augmented by the FM based subsystem. Consistent improvements in performance are obtained when MFCC is augmented with FM in all sub-categories of NIST 2008, in three development tasks and for the NIST 2001 database, demonstrating the complementary behaviour of MFCC and FM features.