
US3700815A - Automatic speaker verification by non-linear time alignment of acoustic parameters - Google Patents

Automatic speaker verification by non-linear time alignment of acoustic parameters

Info

Publication number
US3700815A
US3700815A (application US135697A)
Authority
US
United States
Prior art keywords
parameters
sample
time
signals
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US135697A
Inventor
George Rowland Doddington
James Loton Flanagan
Robert Carl Lummis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
Bell Telephone Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bell Telephone Laboratories Inc
Application granted
Publication of US3700815A
Anticipated expiration
Legal status: Expired - Lifetime

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/12: Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G: PHYSICS
    • G07: CHECKING-DEVICES
    • G07C: TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C 9/00: Individual registration on entry or exit
    • G07C 9/30: Individual registration on entry or exit not involving the use of a pass
    • G07C 9/32: Individual registration on entry or exit not involving the use of a pass in combination with an identity check
    • G07C 9/37: Individual registration on entry or exit not involving the use of a pass in combination with an identity check using biometric data, e.g. fingerprints, iris scans or voice recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition

Definitions

  • Values of s(τ) are next compared with the reference samples to determine whether or not the transformed specimen values have been brought into satisfactory time alignment with the reference.
  • The normalized correlation I, as defined above in Equation (2), is used for this comparison. Since I is developed on the basis of root mean square values of the sample functions, the necessary algebraic terms are prepared by developing a product in multiplier 60 and summing the resultant over the period T in accumulator 61. This summation establishes the numerator of Equation (2). Similarly, values of r(t) and s(τ) are squared, integrated, and rooted, respectively, in units 62, 63, 64, and 65, 66, and 67.
  • The two normalized values are delivered to multiplier 68 to form the denominator of Equation (2).
  • Divider network 69 then delivers as its output a value of the normalized correlation function I in accordance with Equation (2); it indicates the similarity of the time-warped sample to the reference for the values of q_i and t_i previously supplied to generator 54. Accordingly, the partial derivative values of I with respect to q_i and with respect to t_i are prepared and delivered to multipliers 72 and 73, respectively. These values are equalized by multiplying by constants Cq and Ct in order to enhance the subsequent evaluation. The products constitute incremental values Δq_i and Δt_i.
  • The mean squares of the sets of values Δq_i and Δt_i are thereupon compared in gates 74 and 75 to selected small constants U and V. Constants U and V are selected to indicate the required degree of correlation that will assure a low error rate in the ultimate decision. If either of the comparisons is unsatisfactory, the incremental values Δq_i or Δt_i, or both, are returned to adders 58 and 59. The previously developed values q_i and t_i are incremented thereby to provide a new set of values as inputs to function generator 54. Values of s(τ) are thereupon developed using the new data and the process is repeated. In essence, the values q_i at times t_i, as shown in FIG. 3, are individually altered to determine a set of values that maximizes the correlation between the reference utterance and the altered sample utterance.
  • FIG. 6 illustrates the correlation operation mathematically. The relationships are those used in a computer program implementing the steps discussed above.
  • Each q_i and t_i is adjusted until a further change in its value produces only a small change in correlation.
  • The last generated value is held, e.g., in generator 54.
  • Gates 74 and 75 then deliver the last values of s(τ) by way of AND gate 78 to store 79. These values are used as the time-registered specimen samples and, in the apparatus of FIG. 1, are delivered to dissimilarity measuring apparatus 25.
  • The values of q_i at times t_i from function generator 54 are similarly delivered, for example, by way of a gate (not shown in the figure to avoid undue complexity) energized by the output of gate 78, to warping function measurement apparatus 26 of FIG. 1.
  • The two registered utterances are examined in a variety of different ways and at a variety of different locations in time.
  • The resultant measures are combined to form a single distance value.
  • One convenient way of assessing dissimilarity comprises dividing the interval of the utterances, 0 to T, into N equal intervals. If T = 2 seconds, as discussed in the above example for a typical application, it is convenient to divide the interval into N = 20 equal parts.
  • FIG. 7 illustrates such a subdivision. Each subdivision i is then treated individually and a number of measures of dissimilarity are developed.
  • The reliability of these measures varies between individual segments of the utterances. That is to say, certain speakers appreciably vary the manner in which they deliver certain portions of an utterance but are relatively consistent in delivering other portions. It is preferable, therefore, to use the most reliable segments for matching purposes and to reduce the relative weight of, or eliminate entirely, the measures in those segments known to be unreliable.
  • The degree of reliability in each segment is based on the variance of the reference speech signal in that segment across the several reference utterances used in preparing the average reference in unit 13 of FIG. 1. The average values are thus compared, and a value σ², representative of the variance, is developed and stored along with the values r(t) in storage unit 15.
  • Dissimilarity measurement apparatus 25 thus is supplied with the functions r(t), s(τ), and σ². It performs the necessary mathematical evaluation to divide the functions into N equal parts and to compute a measure of the squared difference in average values of the reference utterance and the adjusted sample utterance, the squared difference in linear components between the two (also designated slope), the squared difference in quadratic components between the two (also designated curvature), and the correlation between the two (a sketch of this computation in code appears at the end of this list).
  • Each of the measures is scaled in accordance with the reliability factor as measured by the variance σ² discussed above.
  • The equations which define these measures are set forth in FIG. 7.
  • The subscripts r and s refer, respectively, to the reference utterance and the warped sample utterance.
  • The functions x, y, and z are the coefficients of the first three terms of an orthogonal polynomial expansion of the corresponding utterance value.
  • The symbol ρ_rs represents the correlation coefficient between the sample and reference functions computed over the full length of the sample.
  • The function ρ_i represents the correlation coefficient between the sample and reference computed for the i-th segment.
  • σ² represents the variance of the reference parameters computed over the entire set of reference utterances used to produce the average.
  • The numerical evaluation for each of these measures is combined to form a single number, and a signal representative of the number is delivered to combining network 27.
  • In the warping function measurement, an expression for the amount of warping employed is evaluated at each break-point of the warping function, giving a set of values Δ_i. Typically, 10 values of τ_i are employed, so that 10 values of Δ are produced. These values are averaged to obtain a single numerical value X. A value of X is developed for each of the reference speech utterances used to prepare the average, and all such values are averaged in turn to produce a grand average X̄. A first measure of distance for warping is then evaluated from the squared deviation of the sample's X from X̄. In similar fashion, a number Y representative of the linear component of variation in the values of Δ is prepared, and a quadratic component of variation is evaluated as Z; a second measure of distance is evaluated from these. Finally, a third measure of distance is developed as
    D = (1/N) Σ_i [Δ_i - X - Y(t_i - t_m) - Z(t_i - t_m)²]²,
  • where t_m is the value of t at the midpoint of the utterance.
  • In combining the measures, each of the individual distance values is suitably weighted. If the weighting function is equal to one for each distance value, a simple summation is performed. Other weighting systems may be employed in accordance with experience, i.e., the error rate experienced in verifying claims of identity of those references accommodated by the system.
  • The warping function measurements are therefore delivered to combining network 27, where they are combined with the numerical values developed in apparatus 25.
  • The composite distance measure is thereupon used in threshold comparison network 28 to determine whether the sample speech should be accepted or rejected as identical with the reference, i.e., to verify or reject the claim of identity. Since the distance measure is in the form of a numerical value, it may be matched directly against a stored numerical value in apparatus 28.
  • The stored threshold value is selected to distribute the error possibility between the rejection of true claims of identity and the acceptance of false claims of identity, as illustrated in FIG. 4, discussed above. It is also possible that the distance value is too close to the threshold limit to permit a positive decision to be made. In this case, i.e., in an intermediate zone about the threshold, a signal may be used to suggest that additional information about the individual claiming identity is needed, e.g., in the form of other tangible identification.
  • FIGS. 8A, 8B and 8C illustrate the overall performance of the system of the invention, based on data developed in practice.
  • In FIG. 8A, waveforms of the sample sentence "We were away a year ago" are shown for the first three formants, for the pitch period, and for signal gain, both for a sample utterance and for an averaged reference utterance. It will be observed that the waveforms of the sample and reference are not in time registry.
  • FIG. 8B illustrates the same parameters after time adjustment, i.e., after warping, for a sample utterance determined to be substantially identical to the reference. In this case, the dissimilarity measure is sufficiently low to yield an accept signal, thus verifying the claim of identity.
  • In FIG. 8C, the sample and reference utterances of the test sentence have been registered; yet it is evident that severe disparities remain between the two. Hence, the resulting measure of dissimilarity is sufficiently high to yield a reject signal.
  • The block schematic diagrams of FIGS. 1 and 5, together with the mathematical relationships set forth in the specification and figures, constitute in essence a flowchart illustrative of the programming steps used in the practice of the invention.
  • As recited in the claims, said means for developing a plurality of signals representative of selected similarities includes means for measuring a plurality of different speech signal characteristics for similarity in each of a number of time subintervals within the interval of said designated time scale.
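
As a worked illustration of the segment-wise, variance-weighted comparison sketched in the bullets above, the following Python fragment computes two FIG. 7 style distance terms: the squared difference of local averages and a local correlation term. It is a minimal sketch under assumptions of my own; the names are not the patent's, only two of the patent's several measures are shown, and the 1/σ² weighting is one plausible reading of the scaling described above.

    import numpy as np

    def segment_distances(r, s, sigma2, n_seg=20):
        # r: reference control function; s: time-registered (warped) sample;
        # sigma2: per-segment variance of the reference across the specimen
        # utterances.  Segments delivered consistently (small variance)
        # receive the largest weight.
        r_parts = np.array_split(np.asarray(r, dtype=float), n_seg)
        s_parts = np.array_split(np.asarray(s, dtype=float), n_seg)
        w = 1.0 / np.asarray(sigma2, dtype=float)
        d_avg = 0.0   # squared difference in local averages
        d_rho = 0.0   # local correlation term, summed as (1 - rho_i)^2
        for i in range(n_seg):
            d_avg += w[i] * (r_parts[i].mean() - s_parts[i].mean()) ** 2
            rho_i = np.corrcoef(r_parts[i], s_parts[i])[0, 1]
            d_rho += w[i] * (1.0 - rho_i) ** 2
        return d_avg, d_rho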

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

Speaker verification, as opposed to speaker identification, is carried out by matching a sample of a person's speech with a reference version of the same text derived from prerecorded samples of the same speaker. Acceptance or rejection of the person as the claimed individual is based on the concordance of a number of acoustic parameters, for example, formant frequencies, pitch period and speech energy. The degree of match is assessed by time aligning the sample and reference utterance. Time alignment is achieved by a nonlinear process which maximizes the similarity between the sample and reference through a piece-wise linear continuous transformation of the time scale. The extent of time transformation that is required to achieve maximum similarity also influences the decision to accept or reject the identity claim.

Description

United States Patent 3,700,815
Doddington et al. Oct. 24, 1972

[54] AUTOMATIC SPEAKER VERIFICATION BY NON-LINEAR TIME ALIGNMENT OF ACOUSTIC PARAMETERS
[72] Inventors: George Rowland Doddington, Richardson, Tex.; James Loton Flanagan, Somerset, and Robert Carl Lummis, Berkeley Heights, both of N.J.
[73] Assignee: Bell Telephone Laboratories, Incorporated, Murray Hill, N.J.
[22] Filed: April 20, 1971
[21] Appl. No.: 135,697
[52] U.S. Cl. 179/1 SA
[51] Int. Cl. G10L 1/02
[58] Field of Search 179/1 SA, 1 SB, 15.55 R, 15.55
[56] References Cited, UNITED STATES PATENTS:
3,509,280 4/1970 Jones 179/1 SB
3,525,811 8/1970 Trice 179/1 SB
3,466,394 9/1969 French 179/1 SB
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Jon Bradford Leaheey
Attorneys: R. J. Guenther and William L. Keefauver
12 Claims, 10 Drawing Figures

[Drawing sheets 1-5: FIG. 1 block diagram (analyze, mutual time adjustment, store, compute variances, time registration, dissimilarity and warping function measurement, combine, threshold comparison); FIG. 2; FIG. 5 flow chart; FIG. 7 table of distance measures; FIGS. 8A-8C normalized-time traces of the third, second, and first formants, pitch period, and gain, sample versus reference, for "We were away a year ago."]

AUTOMATIC SPEAKER VERIFICATION BY NON-LINEAR TIME ALIGNMENT OF ACOUSTIC PARAMETERS

This invention relates to speech signal analysis and, more particularly, to a system for verifying the identity of an individual on the basis of acoustic cues unique to his speech.
BACKGROUND OF THE INVENTION Many business transactions might be conducted by voice over a telephone if the identity of a caller could be verified. It might, for example, be convenient if a person could telephone his bank and ascertain the balance of his account. He might dial the bank and enter both his identification number and his request by keying the dial. A computer could (via synthetic speech) ask him to speak his verification phrase. If a verification of sufficiently high confidence was achieved, the machine would proceed to read out the requested balance. Other instances are apparent where verification by voice would prove useful.
From the practical point of view, the problem of verification appears both more important and more tractable than the problem of absolute identification. The former consists of deciding whether to accept or reject an identity claim made by an unknown voice; in identification, the problem is to decide which member of a reference set the unknown voice most resembles. In verification, the expected probability of error tends to remain constant regardless of the size of the user population, whereas in identification the expected probability of error tends to unity as the population becomes large. In the usual context of the verification problem one has a closed set of cooperative customers, who wish to be verified and who are willing to pronounce prescribed code phrases (tailored to the individual voices if necessary). The machine may ask for repeats and might adjust its acceptance threshold in accordance with the importance of the transaction. Further, the machine may control the average mix of the two kinds of errors it can make: i.e., accept a false speaker (a miss) or reject a true speaker (a false alarm).
DESCRIPTION OF THE PRIOR ART Verification by ear (i.e., by a human listener) is a possibility, but it is generally inconvenient and it occupies talent that might be better applied otherwise. Also, present indications are that auditory verification is not as reliable as machine verification. Accordingly, several proposals have been made for the automatic recognition of speech sounds based entirely on acoustic information. These have shown some degree of promise, provided that the sample words to be recognized or identified are limited in number. Most of these recognition techniques are based on individual words, with each word being compared to a corresponding reference word. Some work has also been done on comparing selected parameters of a sample utterance, for example, peaks and valleys of pitch periods, against corresponding reference data.
SUMMARY OF THE INVENTION It is, accordingly, an object of this invention to verify the identity of a human being on the basis of certain unique acoustic cues in his speech. In accordance with the invention, verification of a speaker is achieved by comparing the characteristic way in which he utters a test sentence with a previously prepared utterance of the same sentence. A number of different tests are made on the speech signals and a binary decision is then made; the identity claim of the talker is either rejected or accepted.
The problem may be defined as follows. A person asserts a certain identity and then makes a sample utterance of a special test phrase. Previously prepared information about the voice of the person whose identity is claimed, i.e., a reference utterance, embodies the typical way in which that person utters the test phrase, as well as measures of the variability to be expected in separate repetitions of the phrase by that person. The sample utterance is compared with the reference information and a decision is rendered as to the veracity of the identity claim. For the sake of exposition, it is convenient to divide the verification technique into three basic operations: time registration, construction of a reference, and measurement of the distance from the reference to a particular sample utterance.
Time registration is the process in which the time axis of a sample time function is warped so as to make the function most nearly similar to the unwarped version of a reference function. The warped time scale may be specified by any continuous transformation. One suitable function is a piece-wise linear continuous function of unwarped time. In this case the warping is uniquely determined by the two coordinates of each break-point in the piece-wise linear function. Typically, 10 break-points may be used for warping a two-second-long function, so the registration task amounts to the optimal assignment of values to 20 parameters.
The coefficient of correlation between warped sample and unwarped reference may be used as one index of the similarity of the two functions. The 20 warping parameters are iteratively modified to maximize the correlation coefficient. One suitable technique is the method of steepest ascent. That is, in every iteration, each of the 20 parameters is incremented by an amount proportional to the partial derivative of the correlation coefficient with respect to that parameter.
Success of this procedure hinges on the avoidance of certain degenerate outcomes. Accordingly, several constraints on the steepest ascent iteration process are employed. In effect, these constraints prevent the original function from being distorted too severely, and prevent unreasonably large steps on any one iteration.
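To make the constrained steepest-ascent iteration concrete, here is a minimal Python sketch. All names are mine, the gradient is taken by finite differences rather than by analytic sensitivities, only the q-coordinates of the break-points are adjusted (the patent adjusts both coordinates of each break-point), and the two clip limits merely stand in for the constraints just described.

    import numpy as np

    def correlation(r, t, s_t, s, knot_t, knot_q, a, b):
        # Warp the sample onto the reference time axis via the piece-wise
        # linear warp tau(t) = a + b*t + q(t), then score the match by the
        # coefficient of correlation between the two functions.
        tau = a + b * t + np.interp(t, knot_t, knot_q)
        s_w = np.interp(tau, s_t, s)
        return np.mean(r * s_w) / np.sqrt(np.mean(r ** 2) * np.mean(s_w ** 2))

    def register(r, t, s_t, s, a, b, n_knots=10, rate=0.5, steps=100,
                 max_step=0.02, max_q=0.3, eps=1e-4):
        knot_t = np.linspace(t[0], t[-1], n_knots)   # break-point times
        knot_q = np.zeros(n_knots)                   # break-point q values
        for _ in range(steps):
            base = correlation(r, t, s_t, s, knot_t, knot_q, a, b)
            grad = np.zeros(n_knots)
            for i in range(1, n_knots - 1):          # end points stay fixed
                trial = knot_q.copy()
                trial[i] += eps
                grad[i] = (correlation(r, t, s_t, s, knot_t, trial, a, b)
                           - base) / eps
            # Constrained ascent: bound each step, and bound the deviation
            # from the linear warp so the function is not distorted severely.
            step = np.clip(rate * grad, -max_step, max_step)
            knot_q = np.clip(knot_q + step, -max_q, max_q)
        return knot_t, knot_q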
A reference phrase is formed by collecting a number of independent utterances of the phrase by the same speaker. Each is referred to as a "specimen" utterance. A typical phrase which has been used in practice is "We were away a year ago." Each utterance is analyzed to yield, for an all-voiced utterance such as this one, five control functions (so called because they can be used to control a formant synthesizer to generate a signal similar to the original voice signal). It has been found that gain, pitch period, and the first, second, and third formant frequencies are satisfactory as control functions. The gain function is scaled to have a particular peak value independent of the talking level.
The reference consists of a version of each of the five control functions chosen to represent a typical utterance by that speaker. By convention, the length of the reference is always the same; a value of 1.9 seconds may be used as the standard length. Any value may be used that is not grossly different from the natural length of the utterance.
The reference functions are constructed by averaging together the specimen functions after each has been time-warped to bring them all into mutual registration with each other. One way this mutual registration has been achieved is as follows. One of the five control functions is singled out to guide the registration. This control function is called the guide function. Either gain or second formant may be used for this purpose. The guide function from each specimen is linearly expanded (or contracted) to the desired reference length, and then all of the expanded guide functions are averaged together. This average is the first trial reference for the control function serving as guide. Each of the specimen guide functions is then registered to the trial reference by non-linear time-warping, and a new trial reference is generated by averaging the warped specimens. This process is continued iteratively, i.e., warp each specimen guide function for registration with the current trial reference, and then make a new trial reference by averaging the warped guide functions, until the reference does not change significantly. The other four control functions for each specimen utterance are then warped by the final guide warping function for that utterance, and then each control function is averaged across all specimens to form a reference. The reference control functions are stored for future use, along with computed variance values which indicate the reliability of the function as a standard in selected intervals of the utterance.
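The warp-and-average loop for constructing the guide reference can be sketched as follows. This is illustrative only: the names are mine, and warp_to is an assumed routine (for instance, one built from the registration sketch above) that returns a specimen non-linearly warped into registration with the current trial reference.

    import numpy as np

    def build_guide_reference(guides, warp_to, ref_len=190, n_iter=10, tol=1e-3):
        # guides: one guide function (e.g., gain or second formant) per
        # specimen utterance, possibly of differing lengths.
        u = np.linspace(0.0, 1.0, ref_len)
        # First trial reference: linearly stretch or contract each guide to
        # the standard reference length, then average.
        stretched = [np.interp(u, np.linspace(0.0, 1.0, len(g)), g)
                     for g in guides]
        trial = np.mean(stretched, axis=0)
        for _ in range(n_iter):
            warped = [warp_to(trial, g) for g in stretched]
            new = np.mean(warped, axis=0)
            if np.max(np.abs(new - trial)) < tol:   # no significant change
                return new
            trial = new
        return trial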
When a sample of the standard utterance is presented for verification, a distance value is computed that is a measure of the unlikelihood that that sample would have been generated by the person whose identity is claimed. Distances are always positive numbers; a distance value of zero means that the utterance is identical to the reference in every detail.
The sample is first analyzed to generate the five control functions in terms of which the reference is stored. The control functions are then brought into temporal registration with the reference. This is done by choosing one of the control functions (e.g., gain) to serve as the guide. The guide function of the sample utterance is registered with its counterpart in the reference by non-linear warping, and the other control functions are then warped in an identical way.
After registration of the control functions, a variety of distances between the sample and reference utterance are measured. Included are measures of the difference in local average, local linear variation, and local quadratic variation for all control functions; local and global correlation coefficients between sample and reference control functions; and measures that represent the difficulty of time registration. In forming these separate measures, various time segments of the utterance are weighted in proportion to the constancy of the given measure in that time segment across the set of warped specimens. These measures are then combined to form a single overall distance that represents the degree to which the sample utterance differs from the reference.
The verification decision is based on the single overall distance. If it is less than a pre-determined criterion, the claimed identity is accepted ("verified"); if it is greater than the criterion, the identity claim is rejected. In addition, an indeterminate zone may be established around the criterion value within which neither definite decision is rendered. In this event, additional information about the person is sought.
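The decision rule reduces to a comparison against the criterion, with an optional indeterminate zone; a small sketch follows (the criterion and zone width are illustrative values of mine, not figures from the patent):

    def verify(distance, criterion=1.0, zone=0.1):
        # Accept below the criterion, reject above it; within the
        # indeterminate zone no definite decision is rendered and further
        # identifying information is requested.
        if distance < criterion - zone:
            return "accept"
        if distance > criterion + zone:
            return "reject"
        return "indeterminate"

    print(verify(0.42))   # -> accept, for this illustrative distance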
BRIEF DESCRIPTION OF THE DRAWING The invention will be fully apprehended from the following detailed description of a preferred illustrative embodiment thereof taken in connection with the appended drawings.
In the drawings:
FIG. 1 is a block schematic diagram of a speech verification system in accordance with the invention;
FIG. 2 illustrates an alternative analyzer arrangement;
FIG. 3 illustrates graphically the registration technique employed in the practice of the invention;
FIG. 4 is a chart which illustrates the dependence of two kinds of error ratios on the choice of threshold;
FIG. 5 is a block schematic diagram of a time adjustment configuration which may be employed for nonlinearly warping parameter values;
FIG. 6 illustrates a criterion for maximizing similarity of acoustic parameters in accordance with the invention;
FIG. 7 illustrates a number of distance measures used in establishing an identity between two speech samples; and
FIGS. 8A, 8B and 8C are graphic illustrations of speech parameters of an unknown and a reference talker. There is illustrated in A, the time normalized parameters before the nonlinear time warping procedure. B illustrates parameters for a reference and specimen utterance which match after time registration using the second formant as the guide function. C illustrates a fully time normalized set of parameters for an impostor, i.e., a no-match condition.
DETAILED DESCRIPTION A system for verifying an individual's claimed identity is shown schematically in FIG. 1. A library of reference utterances is established to maintain a voice standard for each individual subscriber to the system. A later claim of identity is verified by reference to the appropriate stored reference utterance. Accordingly, an individual speaks a reference sentence, for example, by way of a microphone at a subscriber location, or over his telephone to a central location (indicated generally by the reference numeral 10). Although any reference phrase may be used, the phrase should be capable of representing a number of prosodic characteristics and variations of his speech. Since vowel or voiced sounds contain a considerable number of such features, the reference sentence "We were away a year ago" has been used in practice. This phrase is effective, in part, because of its lack of nasal sounds and its totally voiced character. Moreover, it is long enough to require more than passing attention to temporal registration, and is short enough to afford economical analysis and storage.
Whatever the phrase spoken by the individual to establish a standard, it is delivered to speech analyzer 11, of any known construction, wherein a number of different acoustic parameters are derived to represent it. For example, individual formant frequencies, amplitudes, and pitch, at the Nyquist rate are satisfactory. These speech parameters are commonly used to synthesize speech in vocoder apparatus and the like. One entirely suitable speech signal analyzer is described in detail in a copending application of L. R. Rabiner and R. W. Schafer, Ser. No. 872,050, filed Oct. 29, 1969. In essence, analyzer 11 includes individual channels for identifying formant frequencies F1, F2, F3, pitch period P, and gain G control signals. In addition, fricative identifying signals may be derived if desired.
In order that variations in the manner in which the individual speaks the phrase may be taken into account, it is preferable to have him repeat the reference sentence a number of times in order that an average set of speech parameters may be prepared. It is convenient to analyze the utterance as it is spoken, and to adjust the duration of the utterance to a standard length T. Typically, a two-second sample is satisfactory. Each spoken reference sentence therefore is either stretched or contracted in apparatus 12 to adjust it to the standard duration. Each adjusted set of parameters is then stored, either as analog or, after conversion, as digital signals, for example, in unit 12. When all of the test utterances have been analyzed and brought into mutual time registration, an average set of parameters is developed in averaging apparatus 13. The single resultant set of reference parameter values is then stored for future use in storage unit 15.
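The stretch-or-contract adjustment amounts to linear resampling of each parameter track onto the standard duration. A minimal sketch, assuming the parameters arrive as a 2-D array (the layout is my assumption, not the patent's):

    import numpy as np

    def normalize_duration(tracks, out_frames=200):
        # tracks: rows are the parameter signals (e.g., F1, F2, F3, pitch
        # period P, gain G); columns are analysis frames of one utterance.
        # Returns the tracks linearly resampled to out_frames frames, i.e.,
        # to the standard length T.
        in_frames = tracks.shape[1]
        x_in = np.linspace(0.0, 1.0, in_frames)
        x_out = np.linspace(0.0, 1.0, out_frames)
        return np.vstack([np.interp(x_out, x_in, row) for row in tracks])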
In addition, a set of variance signals is prepared and stored in unit 15. Variance values are developed, in the manner described hereinafter, for parameters in each of a number of time segments within the span of the reference utterance to indicate the extent of any difference in the manner in which the speaker utters that segment of the test phrase. Hence, variance values provide a measure of the reliability with which parameters in different segments may be used as a standard.
It is evident that a non-vocal identification of each individual is also stored, preferably in library store 15. The identification may be either in the form of a separate address or some other key to the storage location of the reference utterance for each individual.
When verification is subsequently requested, the individual first identifies himself, for example, by means of his name and address or his credit card number. This data is entered into reading unit 16, of any desired construction, in order that a request to verify may be initiated. Secondly, upon command, i.e., a ready light from unit 17, the individual speaks the reference sentence. These operations are indicated generally by block 18 in FIG. 1. The sample voice signal is delivered to analyzer 19 where it is broken down to develop parameter values equivalent to those previously stored for him. Analyzer 19, accordingly, should be of identical construction to analyzer 11 and, preferably, is located physically at the central processing station. The resultant set of sample parameter values is thereupon delivered to unit 17 to initiate all subsequent operations.
Since it is unlikely that the sample utterance will be in time registration with the reference sample, it is necessary to adjust its time scale to bring it into temporal alignment with the reference. This operation is carried out in time adjustment apparatus 20. In essence, iterative processing is employed to maximize the similarity between the specimen parameters and the reference parameters. Similarity may be measured by the coefficient of correlation between the sample and reference. Sample parameters are initially adjusted to start and stop in registry with the reference. It is also in accordance with the invention to match the time spread of variables within the speech sample. Internal time registration is achieved by a nonlinear process which maximizes the similarity between the sample and the reference by way of a monotonic continuous transformation of time.
Accordingly, values of the sample signal parameters s(t), alleged to be the same as the reference signal r(t), are delivered to adjustment apparatus 20. They are remapped, i.e., converted, by a substitution process to values s(τ), where
τ(t) = a + bt + q(t).   (1)
In the equation, coefficients a and b are determined so as to cause the end points of the sample to coincide with those of the reference when q(t) is zero. The function q(t) defines the character of the time scale transformation between the end points of the utterance. In practice, q(t) may be a continuous piece-wise linear function. The time adjustment operation is illustrated graphically in FIG. 3. A reference function r(t) extends through the period 0 to T. It is noted, however, that the sample function s(t) is considerably shorter in duration. It is necessary, therefore, to stretch it to the duration T. This is done by means of the substitute function τ(t) shown in the third line of the illustration. A so-called gradient climbing procedure may be employed in which the values q_i at times t_i are varied in order that values of q_i and t_i may be found that maximize the normalized correlation I between the reference speech and the sample speech, where
I = ⟨r(t)s(τ)⟩ / [⟨r²(t)⟩⟨s²(τ)⟩]^(1/2).   (2)
The angle brackets denote a time average value of the enclosed expression over the interval 0 to T. By thus maximizing the correlation between the two, a close match between prominent features in the utterance, e.g., formant, pitch, and intensity values, is achieved.
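Equations (1) and (2) translate directly into code. The following sketch (the toy signals and the names are mine) remaps a shorter sample onto the reference time base with the linear part of the warp and evaluates the normalized correlation I:

    import numpy as np

    def tau_of_t(t, a, b, knot_t, knot_q):
        # Equation (1): tau(t) = a + b*t + q(t), with q(t) the continuous
        # piece-wise linear function through the break-points (t_i, q_i).
        return a + b * t + np.interp(t, knot_t, knot_q)

    def normalized_correlation(r, s_warped):
        # Equation (2): I = <r(t)s(tau)> / [<r^2(t)><s^2(tau)>]^(1/2),
        # the angle brackets denoting time averages over 0..T.
        num = np.mean(r * s_warped)
        return num / np.sqrt(np.mean(r ** 2) * np.mean(s_warped ** 2))

    # Toy data: a 2-second reference and a 1.6-second sample of like shape.
    T = 2.0
    t = np.linspace(0.0, T, 400)
    r = np.sin(2.0 * np.pi * t / T)
    s_t = np.linspace(0.0, 1.6, 320)
    s = np.sin(2.0 * np.pi * s_t / 1.6)

    a, b = 0.0, 1.6 / T                 # end points coincide when q(t) = 0
    knot_t = np.linspace(0.0, T, 10)
    knot_q = np.zeros(10)               # start from the purely linear warp

    s_w = np.interp(tau_of_t(t, a, b, knot_t, knot_q), s_t, s)
    print(normalized_correlation(r, s_w))   # ~1.0: the shapes coincide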
Details of the time normalization process are described hereinafter with reference to FIG. 5. Suffice it to say at this point that the substitute values of the sample s(τ), together with values of r(t) and the variance values, are delivered to measurement apparatus 25. Values of q_i and t_i, which reflect the amount of nonlinear squeezing used to maximize I, are delivered to measurement apparatus 26.
Since the reference speech and the sample speech are now in time registry, it is possible to measure internal similarities between the two. Accordingly, a value is prepared in measurement apparatus 25 which denotes the internal dissimilarities between the two speech signals. Similarly, a measure is prepared in apparatus 26 which denotes the extent of warping required to bring the two into registry. If the dissimilarities are found to be small, it is likely that a match has been found. Yet, if the warping function value is extremely high, there is a likelihood that the match is a false one, resulting solely from extensive registration adjustment. The two measures of dissimilarity are combined in apparatus 27 and delivered to comparison unit 28, wherein a judgment is made in accordance with pre-established rules, i.e., threshold levels, balanced between similarities and inconsistencies. An accept or reject signal is thereupon developed. Ordinarily, this signal is returned to unit 16 to verify or reject the claim of identity made by the speaker.
It is evident that there is redundancy in the apparatus illustrated in FIG. 1. Thus, for example, analyzer 11 is used only to prepare reference samples. It may, of course, be switched as required to analyze identity claim samples. Such an arrangement is illustrated in FIG. 2. Reference and sample information is routed by way of switch 29a to analyzer 11 and delivered by way of switch 29b to the appropriate processing channel. Other redundancies within the apparatus may, of course, be minimized by judicious construction. Moreover, it is evident that all of the operations described may equally well be performed on a computer. All of the steps and all of the apparatus functions may be incorporated in a program for implementation on a general-purpose or dedicated computer. Indeed, in practice, a computer implementation has been found to be most effective. No unusual programming steps are required for carrying out the indicated operations.
FIG. 4 illustrates the manner in which acceptance or rejection of a sample is established. Since absolute discrimination between reference and sample values would require near-perfect matching, it is evident that a compromise must be used. FIG. 4 indicates, therefore, the error rate of the verification procedure as a function of the value of the dissimilarity measure between the reference and sample, taken as a threshold value for acceptance or rejection. A compromise value is selected that then determines the number of true matches that are rejected, i.e., customers whose claim to identity is disallowed, versus the number of impostors whose claim to identity is accepted. Evidently, the crossover point may be adjusted in accordance with the particular identification application.
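The trade-off of FIG. 4 can be tabulated from scored trials. The sketch below (illustrative only; it assumes arrays of distance scores for true customers and for impostors, which the patent does not define in code) computes false-rejection and false-acceptance rates as the acceptance threshold is swept:

    import numpy as np

    def error_rates(customer_dist, impostor_dist, thresholds):
        # A claim is accepted when its distance falls below the threshold.
        false_reject = np.array([(customer_dist >= th).mean() for th in thresholds])
        false_accept = np.array([(impostor_dist < th).mean() for th in thresholds])
        return false_reject, false_accept

The crossover point of FIG. 4 corresponds to the threshold at which the two returned curves intersect, e.g., np.argmin(np.abs(false_reject - false_accept)).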
FIG. 5 illustrates in block schematic form the operations employed in accordance with the invention for registering the time scales of a reference utterance, in parameter value form, with the sample utterance in like parameter value form. It is evident that the Figure illustrates a hardware implementation. The Figure also constitutes a working flow chart for an equivalent computer program. Indeed, FIG. 5 represents the flow chart of the program that has been used in the practice of the invention. As with the overall system, no unusual programming steps are required to implement the arrangement.
The system illustrated in FIG. 5 corresponds generally to the portion of FIG. 1 depicted in unit 20. Reference values of speech signal parameters r(t) from store 15 are read into store 51 as a set. Similarly, samples from analyzer 19 are stored in unit 52. In order to register the time scale of the samples with that of the reference, samples s(t) are converted into a new set of values s(τ) in transformation function generator 53. This operation is achieved by developing, in generator 54, values of τ(t) as discussed above in Equation (1). Coefficients a and b are determined to cause the end points of the sample utterance, as determined for example by speech detector 55, to coincide with those of the reference when q(t) is zero. Detector 55 issues a first marker signal at the onset of the sample and a second marker signal at the cessation of the utterance. These signals are delivered directly to generator 54. Values of q_i and t_i for the interval between the terminal points of the utterance are initially entered into the system as prescribed sets of constants. These values are delivered to OR gates 56 and 57, respectively, and by way of adders 58 and 59 to the input of generator 54. Accordingly, with these initial values, a set of values τ(t) is developed in generator 54 in accordance with Equation (1). Values of the specimen s(t) are thereupon remapped in generator 53 according to the function developed in generator 54 to produce a time-warped version of the sample, designated s(τ).
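For the end-point registration performed with the markers from detector 55, the coefficients of Equation (1) follow directly. A minimal sketch (the helper name and argument conventions are hypothetical; it assumes the sample onset and cessation times and a reference of duration T_ref):

    def endpoint_coefficients(T_ref, onset, cessation):
        # Choose a and b so that, with q(t) = 0, tau(0) = onset and
        # tau(T_ref) = cessation, i.e., the sample end points coincide
        # with those of the reference.
        b = (cessation - onset) / T_ref
        a = onset
        return a, b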
Values of s(τ) are next compared with the reference samples to determine whether or not the transformed specimen values have been brought into satisfactory time alignment with the reference. The normalized correlation I, as defined above in Equation (2), is used for this comparison. Since I is developed on the basis of root mean square values of the sample functions, the necessary algebraic terms are prepared by developing a product in multiplier 60 and summing the resultant over the period T in accumulator 61. This summation establishes the numerator of Equation (2). Similarly, values of r(t) and s(τ) are squared, integrated, and rooted, respectively, in units 62, 63, 64 and 65, 66, 67. The two normalized values are delivered to multiplier 68 to form the denominator of Equation (2). Divider network 69 then delivers as its output a value of the normalized correlation function I in accordance with Equation (2). It indicates the similarity of the sample to the reference for the previous values supplied to generator 54. To determine the indicated changes in the values of q_i and t_i, the partial derivative values of I with respect to q_i and with respect to t_i are prepared and delivered to multipliers 72 and 73, respectively. These values are equalized by multiplying by constants Cq and Ct in order to enhance the subsequent evaluation. The products constitute incremental values of q_i and t_i. The mean squares of the sets of incremental values of q_i and t_i are thereupon compared in gates 74 and 75 to selected small constants U and V. Constants U and V are selected to indicate the required degree of correlation that will assure a low error rate in the ultimate decision. If either of the comparisons is unsatisfactory, incremental values of q_i or t_i, or both, are returned to adders 58 and 59. The previously developed values q_i and t_i are incremented thereby to provide a new set of values as inputs to function generator 54. Values of s(τ) are thereupon developed using the new data and the process is repeated. In essence, the values of q_i at intervals t_i, as shown in FIG. 3, are individually altered to determine an appropriate set of values for maximizing the correlation between the reference utterance and the altered sample utterance.
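The iteration of FIG. 5 can be mimicked in software with finite-difference estimates standing in for the partial-derivative hardware. The following self-contained sketch is illustrative only, not the patent's program; the constants c_q, c_t, U, and V play the roles of Cq, Ct, U, and V above, and their default values are placeholders:

    import numpy as np

    def correlation(r, t, s, t_s, a, b, knot_t, knot_q):
        # Normalized correlation I of Equation (2) after warping by tau(t).
        warped_times = a + b * t + np.interp(t, knot_t, knot_q)
        s_w = np.interp(warped_times, t_s, s)
        return np.mean(r * s_w) / np.sqrt(np.mean(r**2) * np.mean(s_w**2))

    def gradient_climb(r, t, s, t_s, a, b, knot_t, knot_q,
                       c_q=0.01, c_t=0.01, U=1e-9, V=1e-9,
                       eps=1e-4, max_iter=500):
        knot_t = knot_t.astype(float)
        knot_q = knot_q.astype(float)
        for _ in range(max_iter):
            base = correlation(r, t, s, t_s, a, b, knot_t, knot_q)
            dI_dq = np.zeros_like(knot_q)
            dI_dt = np.zeros_like(knot_t)
            for i in range(len(knot_q)):
                qp = knot_q.copy(); qp[i] += eps
                dI_dq[i] = (correlation(r, t, s, t_s, a, b, knot_t, qp) - base) / eps
                tp = knot_t.copy(); tp[i] += eps  # knots must remain ordered
                dI_dt[i] = (correlation(r, t, s, t_s, a, b, tp, knot_q) - base) / eps
            inc_q = c_q * dI_dq   # incremental values of q_i
            inc_t = c_t * dI_dt   # incremental values of t_i
            knot_q = knot_q + inc_q
            knot_t = knot_t + inc_t
            # Stop when the mean-squared increments fall below U and V,
            # i.e., when I no longer improves appreciably.
            if np.mean(inc_q**2) < U and np.mean(inc_t**2) < V:
                break
        return knot_t, knot_q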
FIG. 6 illustrates the mathematical relationships of the correlating operation. The relationships are those used in a computer program implementing the steps discussed above.
Thus, each q_i and t_i is adjusted until a further change in its value produces only a small change in correlation. When the change is sufficiently small, the last generated value is held, e.g., in generator 54. When the sensitivity measures are found to meet the above-discussed criterion, i.e., maximum normalized correlation, gates 74 and 75 deliver the last values of s(τ) by way of AND gate 78 to store 79. These values then are used as the time-registered specimen samples and, in the apparatus of FIG. 1, are delivered to dissimilarity measuring apparatus 25. The values of q_i at times t_i from function generator 54 are similarly delivered, for example by way of a gate (not shown in the Figure to avoid undue complexity) energized by the output of gate 78, to function measurement apparatus 26 of FIG. 1.
With the sample speech utterance appropriately registered with the averaged reference speech utterance, it is then in accordance with the invention to assess the similarities between the two and to develop a single numerical evaluation of them. The numerical evaluation is used to accept or reject the claim of identity. For convenience, it has been found best to generate a measure of dissimilarity such that a numerical value of zero denotes a perfect match between the two, and progressively higher numerical values denote greater degrees of dissimilarity. Such a value is sometimes termed a distance value.
To provide a satisfactory measure of dissimilarity, the two registered utterances are examined in a variety of different ways and at a variety of different locations in time. The resultant measures are combined to form a single distance value. One convenient way of assessing dissimilarity comprises dividing the interval of the utterances, 0 to T, into N equal intervals. If T = 2 seconds, as discussed in the above example for a typical application, it is convenient to divide the interval into N = 20 equal parts. FIG. 7 illustrates such a subdivision. Each subdivision i is then treated individually and a number of measures of dissimilarity are developed. These are based on (1) differences in average values between the reference speech r(t) and the registered sample s(τ), (2) differences between linear components of variation of the two functions, (3) differences between quadratic components of variation of the two functions, and (4) the correlation between the two functions. In addition, a correlation coefficient (5) over the entire interval is obtained. Five such evaluations are made for each of the speech signal parameters used in representing the utterances. Thus, in the example of practice discussed herein, five evaluations are made for each of the formants F1, F2, and F3, for the pitch P of the signal, and for its gain G. Accordingly, 25 individual signal values of dissimilarity are produced.
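One way to realize the five evaluations in software is sketched below. This is an illustrative reconstruction, not the patent's program; it uses a degree-2 Legendre fit so that the three coefficients play the roles of the average, slope, and curvature terms:

    import numpy as np
    from numpy.polynomial import legendre

    def segment_dissimilarities(r, s_warped, n_seg=20):
        per_segment = []
        for r_i, s_i in zip(np.array_split(r, n_seg),
                            np.array_split(s_warped, n_seg)):
            x = np.linspace(-1.0, 1.0, len(r_i))
            xr = legendre.legfit(x, r_i, 2)  # average/slope/curvature terms
            xs = legendre.legfit(x, s_i, 2)
            rho_i = np.corrcoef(r_i, s_i)[0, 1]          # measure (4)
            per_segment.append(((xr - xs) ** 2, rho_i))  # measures (1)-(3)
        rho = np.corrcoef(r, s_warped)[0, 1]             # measure (5)
        return per_segment, rho

Running this over the five parameter tracks F1, F2, F3, P, and G yields the 25 dissimilarity values referred to above.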
It has also been found that the reliability of these measures varies between individual segments of the utterances. That is to say, certain speakers appreciably vary the manner in which they deliver certain portions of an utterance but are relatively consistent in delivering other portions. It is preferable, therefore, to use the most reliable segments for matching purposes and to reduce the relative weight of, or eliminate entirely, the measures in those segments known to be unreliable. The degree of reliability in each segment is based on the variance between the reference speech signal in each segment for each of the several reference utterances used in preparing the average reference in unit 13 of FIG. 1. The average values are thus compared and a value σ², representative of the variance, is developed and stored along with values r(t) in storage unit 15.
Dissimilarity measurement apparatus 25 thus is supplied with the functions r(t), s(τ), and σ². It performs the necessary mathematical evaluation to divide the functions into N equal parts and to compute a measure of the squared difference in average values of the reference utterance and adjusted sample utterance, the squared difference in linear components between the two (also designated slope), the squared difference in quadratic components between the two (also designated curvature), and the correlation between the two. Each of the measures is scaled in accordance with the reliability factor as measured by the variance σ² discussed above.
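The scaling by the reliability factor might look as follows. This is a hedged sketch: the patent does not spell out the exact normalization, so division by the per-segment variance is one plausible reading, and the function name is hypothetical:

    import numpy as np

    def scale_by_reliability(seg_measures, seg_variances, floor=1e-12):
        # De-emphasize segments in which the reference utterances disagreed:
        # a large variance yields a small weight, a small variance a large one.
        seg_measures = np.asarray(seg_measures, dtype=float)
        seg_variances = np.asarray(seg_variances, dtype=float)
        return seg_measures / (seg_variances + floor)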
The equations which define these measures are set forth in FIG. 7. In the equations, the subscripts r and s refer, respectively, to the reference utterance and the warped sample utterance, and the functions x, y, and z are the coefficients of the first three terms of an orthogonal polynomial expansion of the corresponding utterance value. The symbol ρ represents the correlation coefficient between the sample and reference functions computed over the full length of the sample. The symbol ρ_i represents the correlation coefficient between the sample and reference computed for the ith segment. Similarly, σ² represents the variance of the reference parameters computed for the entire set of reference utterances used to produce the average. The numerical evaluations of these measures are combined to form a single number, and a signal representative of the number is delivered to combining network 27.
Although the numerical value of dissimilarity thus prepared is sufficient to permit a reasonably reliable verification decision to be made, it is evident that the sample may have been adjusted severely to maximize the correlation between it and the reference. The degree of adjustment used constitutes another clue as to the likelihood of identity between the sample and the reference. If the warping values q_i and t_i were excessively large, it is less likely that the sample corresponds to the reference than if maximum correlation was achieved with less severe warping. Accordingly, the final values of q_i and t_i developed in generator 24 (FIG. 1) are delivered to measurement apparatus 26. Three measures of warping are thereupon prepared in apparatus 26.
For convenience, an expression for the amount of warping employed at each adjustment time t_i is defined as a value A_i. Typically, 10 values of t_i are employed, so that 10 values of A_i are produced. These values are averaged to get a single numerical value X. A value of X is developed for each of the reference speech utterances used to prepare the average, and all values of X are next averaged over the N reference utterances to produce a value X̄. A first measure of distance for warping is then evaluated as

d_1 = (X − X̄)². (4)

In similar fashion, a number Y representative of the linear component of variation in the values of A_i is prepared, and a quadratic component of variation is evaluated as Z, from which a second measure of distance, d_2, is evaluated. (5) Finally, a third measure of distance is developed as

d_3 = (1/N) Σ_{i=1..N} [A_i − X − Y(t_i − t_m) − Z(t_i − t_m)²]², (6)

where t_m is the value of t at the midpoint of the utterance.
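In code, the three warping measures might be computed as below. This is an illustrative sketch: A_i is taken here as the warp amount at knot time t_i, X_bar stands for the precomputed average over the reference utterances, and the exact form of d_2 is partly obscured in the source, so combining the squared linear and quadratic components is an assumption:

    import numpy as np

    def warp_distances(A, t, t_m, X_bar):
        X = A.mean()                         # average warp for this sample
        # Linear (Y) and quadratic (Z) components of variation about t_m;
        # np.polyfit returns coefficients highest degree first.
        Z, Y, _ = np.polyfit(t - t_m, A, 2)
        d1 = (X - X_bar) ** 2                # deviation from reference average
        d2 = Y ** 2 + Z ** 2                 # assumed combination of Y and Z
        d3 = np.mean((A - X - Y * (t - t_m) - Z * (t - t_m) ** 2) ** 2)
        return d1, d2, d3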
The three warping distance measures, d_1, d_2, and d_3, from system 26 are then delivered, together with the 25 dissimilarity measures from system 25, to combining unit 27, wherein a single distance measure is developed. Preferably, each of the individual distance values is suitably weighted. If the weighting function is equal to one for each distance value, a simple summation is performed. Other weighting systems may be employed in accordance with experience, i.e., the error rate experienced in verifying claims of identity of those references accommodated by the system.
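The combination in unit 27 thus reduces to a weighted sum. A minimal sketch (function name hypothetical; weights default to one, reproducing the simple summation mentioned above):

    import numpy as np

    def combined_distance(dissimilarities, warp_measures, weights=None):
        values = np.concatenate([np.ravel(dissimilarities),
                                 np.ravel(warp_measures)])
        w = np.ones_like(values) if weights is None else np.asarray(weights)
        return float(np.dot(w, values))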
The warping function measurements are therefore delivered to combining network 27, where they are combined with the numerical values developed in apparatus 25. The composite distance measure is thereupon used in threshold comparison network 28 to determine whether the sample speech should be accepted or rejected as being identical with the reference, i.e., to verify or reject the claim of identity. Since the distance measure is in the form of a numerical value, it may be matched directly against a stored numerical value in apparatus 28. The stored threshold value is selected to distribute the error possibility between the rejection of true claims of identity and the acceptance of false claims of identity, as illustrated in FIG. 4, discussed above. It is also possible that the distance value is too close to the threshold limit to permit a positive decision to be made. In this case, i.e., in an intermediate zone, a "no decision" signal may be used to suggest that additional information about the individual claiming identity is needed, e.g., in the form of other tangible identification.
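The three-way decision of network 28 is then a comparison against a guard band around the stored threshold. An illustrative sketch (the band width is a design choice, not specified by the patent):

    def decide(distance, threshold, band):
        # Accept well below threshold, reject well above, and withhold a
        # decision inside the intermediate zone around the threshold.
        if distance < threshold - band:
            return "accept"
        if distance > threshold + band:
            return "reject"
        return "no decision"  # request other tangible identification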
FIGS. 8A, 8B, and 8C illustrate the overall performance of the system of the invention based on data developed in practice. In FIG. 8A, waveforms of the sample sentence "We were away a year ago" are shown for the first three formants, for the pitch period, and for signal gain, both for a sample utterance and for an averaged reference utterance. It will be observed that the waveforms of the sample and reference are not in time registry. FIG. 8B illustrates the same parameters after time adjustment, i.e., after warping, for a sample utterance determined to be substantially identical to the reference. In this case, the dissimilarity measure is sufficiently low to yield an accept signal, thus verifying the claim of identity. In FIG. 8C, the sample and reference utterances of the test sentence have been registered; yet it is evident that severe disparities remain between the two. Hence, the resulting measure of dissimilarity is sufficiently high to yield a reject signal.
Since the basic features of the invention involve the computation of certain numerical values and certain comparison operations, it is evident that the invention may most conveniently be turned to account by way of a suitable program for a computer. Indeed, the block schematic diagrams of FIGS. 1 and 5, together with the mathematical relationships set forth in the specification and figures, constitute in essence a flow chart illustrative of the programming steps used in the practice of the invention.
What is claimed is:
1. In an auditory verification system in which acoustic parameters of a test sample of an individual's speech are matched for identity to like parameters of a reference sample of his speech, that improvement which includes the steps of:
time adjusting said test sample parameters with said reference parameters according to a nonlinear registration schedule,
measuring internal dissimilarities and irregularities between said time adjusted parameters and said average reference parameters, and
verifying said individual's identity on the basis of said measures of dissimilarities and irregularities.
2. In an auditory verification system in which acoustic parameters of a test sample of an individual's speech are matched for identity to corresponding parameters of a reference sample of his speech, that improvement which comprises the steps of:
preparing said reference sample of an individual's speech from sets of parameters developed from a plurality of different utterances of a test phrase by said individual which have been mutually registered in time, and from a plurality of measures of variation between said different utterances,
measuring internal dissimilarities and irregularities between said test sample parameters and said reference parameters, and
verifying said individual's identity on the basis of said measures of dissimilarities and irregularities.
3. In an auditory verification system in which acoustic parameters of a test sample of an individual's speech are matched for identity to like parameters of a reference sample of his speech, that improvement which comprises the steps of:
developing said reference sample from time registered values of a plurality of different speech signal parameters,
developing a like plurality of different speech signal parameters from said test speech sample,
time adjusting said test sample parameters with said reference parameters according to a nonlinear registration schedule,
measuring internal dissimilarities and irregularities between said time adjusted parameters and said average reference parameters, and
verifying said individual's identity on the basis of said measures of dissimilarities and irregularities.
4. In a speech signal verification system wherein selected speech signal parameters derived from a test phrase spoken by an individual to produce a sample are compared to reference parameters derived from the same test phrase spoken by the same individual, and wherein verification or rejection of the identity of the individual is determined by the similarities of said sample and reference parameters,
means for bringing the time span of said sample parameters into temporal registration with the time span of said reference parameters, and
means for temporally adjusting the time distribution of parameters of said sample within said adjusted time span to maximize similarities between said sample parameters and said reference parameters.
5. The speech signal verification system as defined in claim 4, wherein said similarities between said sample parameters and said reference parameters are measured by the coefficient of correlation therebetween.
6. The speech signal verification system as defined in claim 5, wherein said temporal adjustment of parameters within said adjusted time span comprises,
means for iteratively incrementing the time locations of selected parameter features until said measure of correlation between said sample parameters and said reference parameters does not increase significantly for a selected number of consecutive iterations.
7. The speech signal verification system as defined in claim 4, wherein said means for temporally adjusting said parameters within said adjusted time span comprises,
means for temporally transforming said sample parameters, designated s(t), into a set of parameters, designated s(τ), in which τ(t) = a + bt + q(t), in which a and b are constants selected to align the end points of said time span of said sample parameters with the end points of said reference parameters, and in which q(t) is a nonlinear function which defines the distribution of parameter values within said time span.
8. The speech signal verification system as defined in claim 7, wherein said nonlinear function q(t) is a continuous piecewise-linear function described by N selected amplitude values q_i and time values t_i within said time span, wherein i = 0, 1, . . ., N.
9. An auditory speech signal verification system, which comprises, in combination,
means for analyzing a plurality of individual utterances of a test phrase spoken by an individual to develop a prescribed set of acoustic parameter signals for each utterance,
means for developing from each of said sets of parameter signals a reference set of parameter signals and a set of signals which denotes variations between parameter signals used to develop said reference set of signals,
means for storing a set of reference parameter signals and a set of variation signals for each of a number of different individuals,
means for analyzing a sample utterance of said test phrase spoken by an individual purported to be one of said number of different individuals to develop a set of acoustic parameter signals,
means for adjusting selected parameter signals of said sample to bring the time scale of said utterance represented by said parameters into registry with the time scale of a designated one of said stored reference utterances represented by said reference parameters, said means including means for adjusting selected values of said sample parameter signals to maximize similarities between said sample utterance and said reference utterance,
means responsive to said reference parameter signals, said adjusted sample parameter signals, and said variation signals for developing a plurality of signals representative of selected similarities between each of said sample parameters and each of said corresponding reference parameters,
means for developing signals representative of the extent of adjustment employed to register said time scales,
means responsive to said plurality of similarity signals and said signals representative of the extent of adjustment for developing a signal representative of the overall degree of similarity between said sample utterance and said designated reference utterance, and
threshold comparison means supplied with said overall similarity signal for matching the magnitude of said similarity signal to the magnitude of a stored threshold signal, and for issuing an "accept" signal for similarity signals above threshold, a "reject" signal for signals below threshold, and a "no decision" signal for similarity signals within a prescribed narrow range of signal magnitudes near said threshold magnitude.
10. An auditory speech signal verification system, as defined in claim 9, wherein,
said means for developing a plurality of signals representative of selected similarities includes means for measuring a plurality of different speech signal characteristics for similarity in each of a number of time subintervals within the interval of said designated time scale.
11. An auditory speech signal verification system, as defined in claim 10, wherein,
said different speech signal characteristics are based, respectively, on (1) the difference in average values between said reference speech signal parameters and said sample speech signal parameters, (2) the squared difference in linear components between the two, (3) the squared difference in quadratic components between the two, (4) the correlation between the two in each of said subintervals, and (5) the correlation between the two over said entire interval of said designated time scale.
12. An auditory speech signal verification system, as defined in claim 11, wherein,
each of said signals representative of selected similarities between each of said sample parameters and each of said corresponding reference parameters is scaled in accordance with the magnitudes of said variation signals.

US135697A 1971-04-20 1971-04-20 Automatic speaker verification by non-linear time alignment of acoustic parameters Expired - Lifetime US3700815A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13569771A 1971-04-20 1971-04-20

Publications (1)

Publication Number Publication Date
US3700815A true US3700815A (en) 1972-10-24

Family

ID=22469241

Family Applications (1)

Application Number Title Priority Date Filing Date
US135697A Expired - Lifetime US3700815A (en) 1971-04-20 1971-04-20 Automatic speaker verification by non-linear time alignment of acoustic parameters

Country Status (2)

Country Link
US (1) US3700815A (en)
CA (1) CA938725A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3466394A (en) * 1966-05-02 1969-09-09 Ibm Voice verification system
US3509280A (en) * 1968-11-01 1970-04-28 Itt Adaptive speech pattern recognition system
US3525811A (en) * 1968-12-26 1970-08-25 Fred C Trice Remote control voting system

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3896266A (en) * 1971-08-09 1975-07-22 Nelson J Waterbury Credit and other security cards and card utilization systems therefore
US3883850A (en) * 1972-06-19 1975-05-13 Threshold Tech Programmable word recognition apparatus
US3919479A (en) * 1972-09-21 1975-11-11 First National Bank Of Boston Broadcast signal identification system
US4069393A (en) * 1972-09-21 1978-01-17 Threshold Technology, Inc. Word recognition apparatus and method
US3989896A (en) * 1973-05-08 1976-11-02 Westinghouse Electric Corporation Method and apparatus for speech identification
US4060694A (en) * 1974-06-04 1977-11-29 Fuji Xerox Co., Ltd. Speech recognition method and apparatus adapted to a plurality of different speakers
US4059725A (en) * 1975-03-12 1977-11-22 Nippon Electric Company, Ltd. Automatic continuous speech recognition system employing dynamic programming
US4032711A (en) * 1975-12-31 1977-06-28 Bell Telephone Laboratories, Incorporated Speaker recognition arrangement
DE2659083A1 (en) * 1975-12-31 1977-07-14 Western Electric Co METHOD AND DEVICE FOR SPEAKER RECOGNITION
US4053710A (en) * 1976-03-01 1977-10-11 Ncr Corporation Automatic speaker verification systems employing moment invariants
US4092493A (en) * 1976-11-30 1978-05-30 Bell Telephone Laboratories, Incorporated Speech recognition system
JPS53114648U (en) * 1977-02-21 1978-09-12
US4282403A (en) * 1978-08-10 1981-08-04 Nippon Electric Co., Ltd. Pattern recognition with a warping function decided for each reference pattern by the use of feature vector components of a few channels
WO1981002943A1 (en) * 1980-04-08 1981-10-15 Western Electric Co Continuous speech recognition system
US4349700A (en) * 1980-04-08 1982-09-14 Bell Telephone Laboratories, Incorporated Continuous speech recognition system
US4446531A (en) * 1980-04-21 1984-05-01 Sharp Kabushiki Kaisha Computer for calculating the similarity between patterns
JPS56158387A (en) * 1980-05-10 1981-12-07 Fujitsu Ltd Voice recognizing method using adaptive dynamic programming
JPS6120878B2 (en) * 1980-05-10 1986-05-24 Hiroya Fujisaki
US4348553A (en) * 1980-07-02 1982-09-07 International Business Machines Corporation Parallel pattern verifier with dynamic time warping
US4363102A (en) * 1981-03-27 1982-12-07 Bell Telephone Laboratories, Incorporated Speaker identification system using word recognition templates
US4601054A (en) * 1981-11-06 1986-07-15 Nippon Electric Co., Ltd. Pattern distance calculating equipment
US4608708A (en) * 1981-12-24 1986-08-26 Nippon Electric Co., Ltd. Pattern matching system
US4561105A (en) * 1983-01-19 1985-12-24 Communication Intelligence Corporation Complex pattern recognition method and system
US4573196A (en) * 1983-01-19 1986-02-25 Communications Intelligence Corporation Confusion grouping of strokes in pattern recognition method and system
US4752957A (en) * 1983-09-07 1988-06-21 Kabushiki Kaisha Toshiba Apparatus and method for recognizing unknown patterns
US4739398A (en) * 1986-05-02 1988-04-19 Control Data Corporation Method, apparatus and system for recognizing broadcast segments
US4910782A (en) * 1986-05-23 1990-03-20 Nec Corporation Speaker verification system
US5548647A (en) * 1987-04-03 1996-08-20 Texas Instruments Incorporated Fixed text speaker verification method and apparatus
US5091948A (en) * 1989-03-16 1992-02-25 Nec Corporation Speaker recognition with glottal pulse-shapes
US5581650A (en) * 1990-11-27 1996-12-03 Sharp Kabushiki Kaisha Learning dynamic programming
US5167004A (en) * 1991-02-28 1992-11-24 Texas Instruments Incorporated Temporal decorrelation method for robust speaker verification
US5271088A (en) * 1991-05-13 1993-12-14 Itt Corporation Automated sorting of voice messages through speaker spotting
US5617507A (en) * 1991-11-06 1997-04-01 Korea Telecommunication Authority Speech segment coding and pitch control methods for speech synthesis systems
US5414755A (en) * 1994-08-10 1995-05-09 Itt Corporation System and method for passive voice verification in a telephone network
US5625747A (en) * 1994-09-21 1997-04-29 Lucent Technologies Inc. Speaker verification, speech recognition and channel normalization through dynamic time/frequency warping
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
WO1997039420A1 (en) * 1996-04-18 1997-10-23 Sarnoff Corporation Computationally efficient digital image warping
US6061477A (en) * 1996-04-18 2000-05-09 Sarnoff Corporation Quality image warper
US6298323B1 (en) * 1996-07-25 2001-10-02 Siemens Aktiengesellschaft Computer voice recognition method verifying speaker identity using speaker and non-speaker data
WO1998034216A2 (en) * 1997-01-31 1998-08-06 T-Netix, Inc. System and method for detecting a recorded voice
WO1998034216A3 (en) * 1997-01-31 2001-12-20 T Netix Inc System and method for detecting a recorded voice
US6263308B1 (en) * 2000-03-20 2001-07-17 Microsoft Corporation Methods and apparatus for performing speech recognition using acoustic models which are improved through an interactive process
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
US20040249639A1 (en) * 2001-10-11 2004-12-09 Bernhard Kammerer Method for producing reference segments describing voice modules and method for modelling voice units of a spoken test model
US7398208B2 (en) * 2001-10-11 2008-07-08 Siemens Atkiengesellschaft Method for producing reference segments describing voice modules and method for modeling voice units of a spoken test model
US20050096900A1 (en) * 2003-10-31 2005-05-05 Bossemeyer Robert W. Locating and confirming glottal events within human speech signals
US20100145697A1 (en) * 2004-07-06 2010-06-10 Iucf-Hyu Industry-University Cooperation Foundation Hanyang University Similar speaker recognition method and system using nonlinear analysis
US20060020458A1 (en) * 2004-07-26 2006-01-26 Young-Hun Kwon Similar speaker recognition method and system using nonlinear analysis
US9165555B2 (en) 2005-01-12 2015-10-20 At&T Intellectual Property Ii, L.P. Low latency real-time vocal tract length normalization
US8909527B2 (en) * 2005-01-12 2014-12-09 At&T Intellectual Property Ii, L.P. Low latency real-time vocal tract length normalization
US20090259465A1 (en) * 2005-01-12 2009-10-15 At&T Corp. Low latency real-time vocal tract length normalization
US20060178885A1 (en) * 2005-02-07 2006-08-10 Hitachi, Ltd. System and method for speaker verification using short utterance enrollments
US7490043B2 (en) * 2005-02-07 2009-02-10 Hitachi, Ltd. System and method for speaker verification using short utterance enrollments
US20090248412A1 (en) * 2008-03-27 2009-10-01 Fujitsu Limited Association apparatus, association method, and recording medium
US20140081638A1 (en) * 2008-12-10 2014-03-20 Jesus Antonio Villalba Lopez Cut and paste spoofing detection using dynamic time warping
US9002706B2 (en) * 2008-12-10 2015-04-07 Agnitio Sl Cut and paste spoofing detection using dynamic time warping
US20120232899A1 (en) * 2009-09-24 2012-09-13 Obschestvo s orgranichennoi otvetstvennost'yu "Centr Rechevyh Technologij" System and method for identification of a speaker by phonograms of spontaneous oral speech and by using formant equalization
US9047866B2 (en) * 2009-09-24 2015-06-02 Speech Technology Center Limited System and method for identification of a speaker by phonograms of spontaneous oral speech and by using formant equalization using one vowel phoneme type
US20110112838A1 (en) * 2009-11-10 2011-05-12 Research In Motion Limited System and method for low overhead voice authentication
US8321209B2 (en) * 2009-11-10 2012-11-27 Research In Motion Limited System and method for low overhead frequency domain voice authentication
US8510104B2 (en) * 2009-11-10 2013-08-13 Research In Motion Limited System and method for low overhead frequency domain voice authentication
US20140025376A1 (en) * 2012-07-17 2014-01-23 Nice-Systems Ltd Method and apparatus for real time sales optimization based on audio interactions analysis
US8914285B2 (en) * 2012-07-17 2014-12-16 Nice-Systems Ltd Predicting a sales success probability score from a distance vector between speech of a customer and speech of an organization representative
US9098467B1 (en) * 2012-12-19 2015-08-04 Rawles Llc Accepting voice commands based on user identity

Also Published As

Publication number Publication date
CA938725A (en) 1973-12-18

Similar Documents

Publication Publication Date Title
US3700815A (en) Automatic speaker verification by non-linear time alignment of acoustic parameters
Chauhan et al. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database
Davis et al. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences
Burton Text-dependent speaker verification using vector quantization source coding
Hasan et al. Speaker identification using mel frequency cepstral coefficients
US4908865A (en) Speaker independent speech recognition method and system
Peacocke et al. An introduction to speech and speaker recognition
US5339385A (en) Speaker verifier using nearest-neighbor distance measure
US4032711A (en) Speaker recognition arrangement
CA1175570A (en) Speaker recognizer in which a significant part of a preselected one of input and reference patterns is pattern matched to a time normalized part of the other
JPS61262799A (en) Hydon type markov model voice recognition equipment
JPS6217240B2 (en)
US6308153B1 (en) System for voice verification using matched frames
JPS6226039B2 (en)
Akhmetov et al. Determination of input parameters of the neural network model, intended for phoneme recognition of a voice signal in the systems of distance learning
Pandit et al. Feature selection for a DTW-based speaker verification system
Dash et al. Speaker identification using mel frequency cepstralcoefficient and bpnn
Kekre et al. Performance comparison of speaker recognition using vector quantization by LBG and KFCG
Chetouani et al. A New Nonlinear speaker parameterization algorithm for speaker identification
Sharma et al. A modified MFCC feature extraction technique for robust speaker recognition
Asda et al. Development of Quran reciter identification system using MFCC and neural network
Charisma et al. Speaker recognition using mel-frequency cepstrum coefficients and sum square error
Ozaydin Design of a text independent speaker recognition system
Li et al. Recent advancements in automatic speaker authentication
CA1252567A (en) Individual recognition by voice analysis