As a unique form of biometric information, the voiceprint has been widely applied in daily life, for example in mobile payment and criminal investigation, and has broad market prospects. At the same time, increasingly sophisticated neural network architectures have been applied to speaker recognition, and recognition accuracy keeps improving, surpassing human performance in some scenarios. The purpose of an automatic speaker verification (ASV) system is to verify whether two test utterances belong to the same speaker, based on speaker characteristics extracted from the raw speech. Speech is a complex signal that conveys many types of information, such as linguistic content, speaker individuality, nationality, gender, and emotion. Some of this information is useful for ASV, while some is not; ASV performance can therefore be improved by enhancing meaningful information and suppressing useless information. However, the learning ability of a deep neural network (DNN) remains limited when only speaker labels are used in the training stage of the ASV system, because this practice ignores the interaction between different types of domain information.

Multitask learning (MTL) has recently been proposed to learn useful information for ASV by using speaker-irrelevant labels, while domain adversarial training (DAT) is designed to eliminate the effect of useless information by applying a gradient reversal layer (GRL) to the domain classification branch. Both methods have significantly improved ASV performance, and what they have in common is that they add more constraints during the DNN training stage. Many types of domain information have been shown to be useful for ASV, such as frame-level phonetic information, channel information, and signal-to-noise-ratio variability. Gender and nationality information is crucial in verifying a speaker's identity because it can serve as an additional check. In contrast, utterances produced by the same speaker in different emotional states vary significantly in their characteristics, which hinders the extraction of speaker-individual features and decreases the accuracy of ASV.

Subjectively, gender and nationality are speaker-invariant: for a given speaker in a training database they do not change, and they can provide additional evidence for authenticating speaker identity. These two types of information should therefore benefit ASV. By contrast, emotion can change across speaking scenarios, which lowers the cosine similarity score of a test pair even when the two utterances come from the same speaker; it therefore has to be suppressed. Based on these considerations, this thesis investigates the effects of gender, nationality, and emotion information on the performance of ASV systems. Four systems are proposed using MTL- and DAT-based methods. Specifically, the MTL-based systems, comprising multitask gender (MTG), multitask nationality (MTN), and multitask gender and nationality (MTGN), enhance the learning of gender and nationality information during network training, while the DAT-based system, emotion domain adversarial training (EDAT), suppresses the learning of emotion information. Experimental results indicate that encouraging gender and nationality information and suppressing emotion information improves the performance of ASV. The proposed systems achieve 16.4% and 22.9% relative improvements in equal error rate for the MTL- and DAT-based systems, respectively.
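The MTL objective described above can be sketched as a weighted sum of the speaker loss and the auxiliary task losses. This is a minimal illustration, not the thesis code; the function name and the weights `alpha` and `beta` are hypothetical placeholders for however the thesis balances its tasks.

```python
def mtl_loss(loss_speaker, loss_gender, loss_nationality,
             alpha=0.1, beta=0.1):
    """Multitask objective: speaker loss plus weighted auxiliary losses.

    With both weights nonzero this corresponds to an MTGN-style setup;
    setting beta=0 leaves only the gender task (MTG-style), and
    setting alpha=0 leaves only the nationality task (MTN-style).
    """
    return loss_speaker + alpha * loss_gender + beta * loss_nationality


total = mtl_loss(loss_speaker=1.0, loss_gender=2.0, loss_nationality=3.0)
```

Because the auxiliary labels are speaker-invariant, gradients from the gender and nationality branches push the shared layers toward representations that encode this extra evidence alongside speaker identity.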
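The gradient reversal layer at the heart of DAT acts as the identity in the forward pass and flips the sign of the gradient in the backward pass. The following is a minimal NumPy sketch of that mechanism (not the thesis implementation); the reversal strength `lambda_` is a hypothetical hyperparameter.

```python
import numpy as np

class GradientReversalLayer:
    """Identity forward; multiplies gradients by -lambda on the way back.

    Placed between a shared feature extractor and a domain classifier
    (here, an emotion classifier), it trains the extractor to *increase*
    the domain loss, so the learned speaker features carry as little
    emotion information as possible.
    """

    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_  # reversal strength hyperparameter

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lambda_ * grad_output  # flip and scale the gradient


grl = GradientReversalLayer(lambda_=0.5)
features = np.array([1.0, -2.0, 3.0])
out = grl.forward(features)               # identical to the input
grad = grl.backward(np.ones_like(out))    # reversed, scaled gradient
```

In a full EDAT-style setup this layer would sit in front of the emotion classifier only, while the speaker classification branch receives ordinary gradients.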
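The cosine-similarity scoring that emotion variability degrades can be sketched as below. The embedding vectors are invented for illustration only; real speaker embeddings are high-dimensional outputs of the trained network.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings: the same speaker in two emotional states,
# plus a different speaker. Emotion shifts the same-speaker embedding.
neutral = np.array([0.9, 0.1, 0.4])
angry   = np.array([0.7, 0.5, 0.1])
other   = np.array([-0.2, 0.9, 0.3])

same_pair = cosine_score(neutral, angry)   # same speaker, emotion mismatch
diff_pair = cosine_score(neutral, other)   # different speakers
```

A verification decision compares such a score against a threshold; emotion mismatch pulls `same_pair` down toward the impostor scores, which is why suppressing emotion information in the embedding helps.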