
CN103474067B - speech signal transmission method and system - Google Patents


Info

Publication number
CN103474067B
CN103474067B · CN201310361783.1A · CN201310361783A
Authority
CN
China
Prior art keywords
model
fundamental frequency
unit
spectrum
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310361783.1A
Other languages
Chinese (zh)
Other versions
CN103474067A (en)
Inventor
江源
周明
凌震华
何婷婷
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201310361783.1A priority Critical patent/CN103474067B/en
Publication of CN103474067A publication Critical patent/CN103474067A/en
Application granted granted Critical
Publication of CN103474067B publication Critical patent/CN103474067B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech signal transmission method and system. The method comprises: determining the text content corresponding to a continuous speech signal to be sent; determining a speech synthesis parameter model for each synthesis unit according to the text content and the continuous speech signal; splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence; determining the sequence number string corresponding to the speech synthesis parameter model sequence; and sending the sequence number string to a receiving terminal, so that the receiving terminal recovers the continuous speech signal from the sequence number string. With the present invention, signal transmission at an extremely low bit rate can be realized while ensuring minimal loss of sound quality when the speech is recovered.

Description

Voice signal transmission method and system
Technical Field
The invention relates to the technical field of signal transmission, in particular to a voice signal transmission method and system.
Background
With the spread of the internet and the popularity of portable devices, various chat applications for handheld devices have emerged. The naturalness of voice interaction cannot be matched by other interaction means, particularly on handheld small-screen devices that are poorly suited to handwriting or key input. Such products support voice interaction by transmitting a voice signal received at one terminal to a destination terminal; for example, WeChat, the messaging product released by Tencent, supports a voice message function. However, the amount of data in a directly transmitted voice signal is often very large, imposing a substantial cost on the user over channels charged by traffic, such as the internet or a communication network. Clearly, compressing the transmitted data as much as possible without degrading voice quality is a prerequisite for improving the application value of voice signal transmission.
In order to solve the problem of voice signal transmission, researchers have tried various voice coding methods to perform digital quantization and compression transmission on voice signals, so as to reduce coding rate and improve transmission efficiency under the condition of improving voice quality recovery of voice signals. The current common speech signal compression methods include waveform coding and parameter coding. Wherein:
Waveform coding forms a digital signal by sampling, quantizing and coding the time-domain analog waveform. This coding mode has the advantages of strong adaptability and high voice quality. However, since the waveform of the original speech signal must be restored, the scheme requires a high code rate: good sound quality generally requires 16 kb/s or more.
Parameter coding extracts parameters representing pronunciation characteristics from the original voice signal and codes those parameters. The goal of this approach is to preserve the semantics of the original speech and ensure intelligibility. Its advantage is a lower code rate, but the recovered sound quality is more impaired.
In the traditional voice communication era, charging was typically by call time, so coding methods mainly considered algorithmic delay and communication quality. In the mobile internet era, voice is one kind of data signal among others and is generally charged by traffic, so the code rate of the coded voice directly affects the user's cost. In addition, conventional telephone-channel speech uses only an 8 kHz sampling rate; it is narrowband speech, and its sound quality is impaired with a hard upper limit. Obviously, if traditional coding is applied to wideband or ultra-wideband speech, the code rate must be increased and traffic consumption multiplies.
Disclosure of Invention
The embodiment of the invention provides a voice signal transmission method and a voice signal transmission system, which can realize signal transmission with extremely low code flow rate on the premise of ensuring the minimum voice quality loss during voice recovery.
The embodiment of the invention provides a voice signal transmission method, which comprises the following steps:
determining text content corresponding to a continuous voice signal to be sent;
determining a voice synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal;
splicing the voice synthesis parameter models of all the synthesis units to obtain a voice synthesis parameter model sequence;
determining a sequence number string corresponding to the speech synthesis parameter model sequence;
and sending the sequence number string to a receiving end so that the receiving end recovers the continuous voice signal according to the sequence number string.
An embodiment of the present invention further provides a voice signal transmission system, including:
the text acquisition module is used for determining text contents corresponding to continuous voice signals to be sent;
the parameter model determining module is used for determining a voice synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal;
the splicing module is used for splicing the voice synthesis parameter models of the synthesis units to obtain a voice synthesis parameter model sequence;
a serial number string determining module, configured to determine a serial number string corresponding to the speech synthesis parameter model sequence;
and the sending module is used for sending the sequence number string to a receiving end so that the receiving end recovers the continuous voice signal according to the sequence number string.
The voice signal transmission method and the voice signal transmission system provided by the embodiment of the invention adopt the statistical analysis model coding, the processing mode is irrelevant to the voice sampling rate, the transmission code flow rate is greatly reduced on the premise of ensuring the minimum voice quality loss during voice recovery, the flow consumption is reduced, the problem that the traditional voice coding method cannot give consideration to both voice quality and flow is solved, and the user communication demand experience in the mobile network era is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
Fig. 1 is a flowchart of a voice signal transmission method according to an embodiment of the present invention;
FIG. 2 is a flow chart of determining a speech synthesis parameter model for each synthesis unit in an embodiment of the present invention;
FIG. 3 is a flow chart of the construction of a binary decision tree in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a binary decision tree in an embodiment of the present invention;
FIG. 5 is a flow chart of joint optimization of the initial fundamental frequency model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech signal transmission system according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an exemplary embodiment of a parametric model determination module;
FIG. 8 is a schematic structural diagram of a binary decision tree building block in the speech signal transmission system according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a first optimization unit in an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a second optimization unit in the embodiment of the present invention.
Detailed Description
In order to make the technical field of the invention better understand the scheme of the embodiment of the invention, the embodiment of the invention is further described in detail with reference to the drawings and the implementation mode.
To address the problems that traditional coding modes, when processing wideband or ultra-wideband voice, require an increased code rate and consume large amounts of traffic, the embodiment of the invention provides a voice signal transmission method and system, which are suitable for coding various kinds of speech (such as ultra-wideband speech at a 16 kHz sampling rate and narrowband speech at an 8 kHz sampling rate) and realize signal transmission at an extremely low code rate on the premise of minimizing the sound-quality loss of voice recovery.
As shown in fig. 1, it is a flowchart of a method for transmitting a voice signal according to an embodiment of the present invention, and the method includes the following steps:
step 101, determining text content corresponding to a continuous voice signal to be sent.
Specifically, the text content may be obtained automatically through a speech recognition algorithm, or, of course, by manual labeling. In addition, to further ensure the correctness of the text content obtained by voice recognition, it may be manually edited and corrected.
And 102, determining a speech synthesis parameter model of each synthesis unit according to the text content and the continuous speech signal.
The synthesis unit is a preset minimum synthesis object, such as a syllable unit, a phoneme unit, even a state unit in a phoneme HMM model, and the like.
In order to reduce the loss of tone quality recovery of the receiving end as much as possible and enable the receiving end to recover continuous voice signals in a voice synthesis mode, a voice synthesis parameter model obtained by the sending end from original voice signals should conform to the characteristics of original voice signals as much as possible so as to reduce the loss of signal compression and recovery.
Specifically, the continuous speech signal may be segmented according to the text content to obtain speech segments corresponding to the synthesis units, further obtain durations and initialized speech synthesis parameter models corresponding to the synthesis units, and then perform joint optimization on the initialized speech synthesis parameter models by using the collected speech signal, where a specific process will be described in detail later.
And 103, splicing the voice synthesis parameter models of the synthesis units to obtain a voice synthesis parameter model sequence.
And 104, determining a serial number string corresponding to the speech synthesis parameter model sequence.
And 105, sending the sequence number string to a receiving end so that the receiving end recovers the continuous voice signal according to the sequence number string.
Correspondingly, after receiving the serial number string sent by the sender, the receiver can obtain the speech synthesis parameter model sequence from the codebook according to the serial number string.
Because each speech synthesis parameter model has a unique serial number, and the same codebook is stored in both the sender and the receiver, the codebook contains all the speech synthesis parameter models. Therefore, after receiving the sequence number string, the receiving party can obtain the speech synthesis parameter models corresponding to the sequence numbers from the codebook according to the sequence number string, and then the speech synthesis parameter models are spliced to obtain the speech synthesis parameter model sequence. And then, determining a speech synthesis parameter sequence according to the speech synthesis parameter model sequence, and recovering the speech signal in a speech synthesis mode.
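The receiver-side decoding described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the codebook contents, model representation and function names are all assumptions made for the example.

```python
# Hypothetical sketch of the receiver side: map a received sequence-number
# string back to speech synthesis parameter models via the codebook shared
# by sender and receiver, splicing them in transmitted order.

def decode_sequence_numbers(seq_numbers, codebook):
    """Look up each sequence number in the shared codebook and splice
    the resulting models into a parameter model sequence."""
    return [codebook[n] for n in seq_numbers]

# Toy codebook: each entry is a (mean, variance) Gaussian for one unit/state.
codebook = {
    0: {"mean": 5.1, "var": 0.2},
    1: {"mean": 4.8, "var": 0.3},
    2: {"mean": 5.4, "var": 0.1},
}

model_sequence = decode_sequence_numbers([2, 0, 1], codebook)
print(len(model_sequence))  # 3 models recovered, in transmitted order
```

The recovered model sequence would then drive parameter generation and speech synthesis, as the description states.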
The voice signal transmission method provided by the embodiment of the invention adopts statistical analysis model coding. The processing is independent of the voice sampling rate, so coding 16 kHz ultra-wideband voice incurs no extra code-rate cost, the sound quality is good, and the coding rate is low. Taking a typical Chinese speech segment as an example: the valid speech lasts 10 s and contains 80 initials and finals (phonemes); each phoneme has 5 fundamental frequency states, 5 spectrum states and 1 duration state, and each state is encoded with 1 byte (8 bits). The code rate m is then: m = [80 × (5 + 5 + 1)] × 8 bit / 10 s = 704 b/s, less than 1 kb/s. This is an extremely low code rate, far below the coding standards in the mainstream voice communication field, so network traffic is greatly reduced. Compared with current mainstream speech coding methods in the communication field, this method can process ultra-wideband speech (16 kHz sampling rate) with higher sound quality, and has a lower code rate (below 1 kb/s), effectively reducing network communication traffic.
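The code-rate arithmetic in the example above can be reproduced directly; the figures (80 phonemes, 5 + 5 + 1 states, 1 byte per state, 10 s) come from the description itself.

```python
# Reproducing the patent's code-rate arithmetic for the 10 s example:
# 80 phonemes, each with 5 fundamental frequency states, 5 spectrum states
# and 1 duration state, and 1 byte (8 bits) per state index.

phonemes = 80
states_per_phoneme = 5 + 5 + 1   # F0 + spectrum + duration
bits_per_state = 8               # one byte per codebook sequence number
duration_s = 10

total_bits = phonemes * states_per_phoneme * bits_per_state
rate_bps = total_bits / duration_s
print(rate_bps)  # 704.0 b/s, well under 1 kb/s
```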
According to the voice signal transmission method, the voice synthesis parameter models corresponding to the continuous voice signals are extracted and the signals are synthesized, so that the voice signals are greatly compressed, the signal loss is minimized, and the signal distortion is effectively reduced.
Fig. 2 is a flowchart of determining a speech synthesis parameter model of each synthesis unit according to an embodiment of the present invention, which includes the following steps:
step 201, performing voice segment segmentation on the continuous voice signal according to the text content to obtain the voice segments corresponding to each synthesis unit.
Specifically, the continuous speech signal may be forced-aligned against the acoustic model sequence corresponding to the preset synthesis units, i.e., a speech recognition decoding of the speech signal against the acoustic model sequence is computed, to obtain the speech segment corresponding to each synthesis unit.
It should be noted that the synthesis unit may select different specifications according to different application requirements. Generally, if the demand on the code flow rate is high, larger phonetic units, such as syllable units, phoneme units, etc., are selected; otherwise, if the requirement for the sound quality is higher, smaller speech units, such as state units and feature stream units of the model, may be selected.
Under the acoustic Model setting based on HMM (Hidden Markov Model), each state of the HMM Model can be further selected as a synthesis unit, and a corresponding speech segment based on a state layer is obtained. And then respectively determining a fundamental frequency model and a spectrum model corresponding to each state from the corresponding fundamental frequency binary decision tree and spectrum binary decision tree for each state. Therefore, the acquired speech synthesis parameter model can describe the characteristics of the speech signal more finely.
Step 202, sequentially determining the duration of the speech segment corresponding to each synthesis unit and an initial speech synthesis parameter model, wherein the initial speech synthesis parameter model comprises: and the initial fundamental frequency model and the initial spectrum model are used for obtaining a fundamental frequency model sequence and a spectrum model sequence corresponding to the continuous voice signal.
Specifically, the fundamental frequency binary decision tree corresponding to the currently investigated synthesis unit is first obtained. Text analysis is then performed on the synthesis unit to obtain its context information, such as the phoneme unit's context, tone, part of speech, prosodic hierarchy and the like. Finally, a path decision is made in the fundamental frequency binary decision tree according to the context information to reach the corresponding leaf node, and the fundamental frequency model of that leaf node is taken as the initial fundamental frequency model of the synthesis unit.
The path decision process is as follows:
according to the context information of the synthesis unit, sequentially answering the split questions of each node from the root node of the fundamental frequency binary decision tree; acquiring a top-down matching path according to the answer result; and obtaining leaf nodes according to the matching paths.
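The top-down path decision above can be sketched in a few lines. The tree shape, question names and model labels here are toy assumptions for illustration, not values from the patent.

```python
# Minimal sketch of the path decision: starting at the root, answer each
# node's split question from the unit's context information and follow the
# yes/no branch until a leaf node (carrying an F0 model) is reached.

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question   # a key into the context dict (internal nodes)
        self.yes, self.no = yes, no
        self.model = model         # only set on leaf nodes

def path_decision(node, context):
    """Walk the binary decision tree top-down using the context answers."""
    while node.model is None:
        node = node.yes if context[node.question] else node.no
    return node.model

# Toy tree: a single split question with two leaf models.
tree = Node(question="right_neighbor_is_nasal",
            yes=Node(model="f0_model_A"),
            no=Node(model="f0_model_B"))

print(path_decision(tree, {"right_neighbor_is_nasal": True}))  # f0_model_A
```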
Similarly, the spectrum model corresponding to the leaf node can be obtained by querying in the spectrum binary decision tree corresponding to the currently investigated synthesis unit, and the spectrum model corresponding to the leaf node is used as the initial spectrum model of the currently investigated synthesis unit. Specifically, a spectrum binary decision tree corresponding to the synthesis unit is obtained first; and analyzing the text of the synthesis unit to obtain the context information of the synthesis unit. Then according to the context information, performing path decision in the spectrum binary decision tree to obtain corresponding leaf nodes; and taking the spectrum model corresponding to the leaf node as an initial spectrum model corresponding to the synthesis unit.
The path decision process is as follows:
According to the context information of the synthesis unit, the split question at each node is answered in turn starting from the root node of the spectrum binary decision tree; a top-down matching path is obtained from the answers, and the decision yields the leaf node.
The sequence of fundamental frequency models corresponding to the continuous speech signal is a sequence composed of initial fundamental frequency models corresponding to the respective synthesis units, and similarly, the sequence of spectrum models corresponding to the continuous speech signal is a sequence composed of initial spectrum models corresponding to the respective synthesis units.
And 203, performing joint optimization on the initial fundamental frequency models corresponding to the synthesis units by using the continuous voice signals and the fundamental frequency model sequence to obtain the fundamental frequency models of the synthesis units.
And 204, performing joint optimization on the initial spectrum models corresponding to the synthesis units by using the continuous voice signals and the spectrum model sequence to obtain the spectrum models of the synthesis units.
In the embodiment of the present invention, the quality of the initial speech synthesis parameter model corresponding to the synthesis unit has a direct relationship with the construction of the binary decision tree (including the fundamental frequency binary decision tree and the spectrum binary decision tree). In the embodiment of the invention, a binary decision tree is constructed by adopting a bottom-up clustering method.
As shown in fig. 3, it is a flowchart of binary decision tree construction in the embodiment of the present invention, and includes the following steps:
step 301, training data is obtained.
Specifically, a large amount of speech training data may be collected and subjected to text labeling, then the speech segments of the basic speech unit and even the synthesis unit (e.g., the state unit of the basic speech unit model) are segmented according to the labeled text content, the speech segment set corresponding to each synthesis unit is obtained, and the speech segment in the speech segment set corresponding to each synthesis unit is used as the training data corresponding to the synthesis unit.
Step 302, extracting synthesis parameters of the speech segment set corresponding to the synthesis unit from the training data.
The synthesis parameters include: fundamental frequency features and spectral features, etc.
Step 303, initializing the binary decision tree corresponding to the synthesis unit according to the extracted synthesis parameters, and setting a root node as a current investigation node.
And initializing the binary decision tree, namely constructing the binary decision tree only with the root node.
And step 304, judging whether the current investigation node needs to be split. If so, go to step 305; otherwise, step 306 is performed.
A splitting attempt is made on the data of the current investigation node using the remaining questions in a preset question set, where the remaining questions are those that have not yet been asked.
Specifically, the sample dispersion of the current investigation node, i.e., the degree of scatter of the samples in its voice segment set, may first be calculated. In general, the greater the dispersion, the more likely the node is to be split. The sample variance can be used as the measure of a node's dispersion: the mean of the (squared) distances of all samples under the node from the class center. The dispersions of the candidate child nodes are then calculated, and the question yielding the largest decrease in dispersion is selected as the preferred question.
The node is then split according to the preferred question to obtain child nodes. If the decrease in dispersion under the preferred question is smaller than a set threshold, or the amount of training data in a split child node falls below a set threshold, the current investigation node is not split further.
And 305, splitting the current investigation node, and acquiring the split child nodes and training data corresponding to the child nodes. Then, step 307 is executed.
In particular, the current investigation node may be split according to the preference problem.
Step 306, marking the current investigation node as a leaf node.
Step 307, determine whether there are any unexplored non-leaf nodes in the binary decision tree. If so, go to step 308; otherwise, step 309 is performed.
And 308, acquiring the next unexplored non-leaf node as the current investigation node. Then, return to step 304.
Step 309, output binary decision tree.
It should be noted that, in the embodiment of the present invention, both the fundamental frequency binary decision tree and the spectrum binary decision tree may be built according to the flow shown in fig. 3.
Fig. 4 is a schematic diagram of a binary decision tree according to an embodiment of the present invention.
Fig. 4 shows the construction of a binary decision tree for the third state of the context-dependent phoneme "*-aa+*". As shown in fig. 4, when the root node is split, its training data may be split according to the answer to the preset question "is the right adjacent phoneme a nasal?"; when a next-layer node is split, e.g. the left child, its training data may be further split according to the question "is the left adjacent phoneme a voiced consonant?". Finally, when a node cannot be split further, it is set as a leaf node, and a statistical model, such as a Gaussian model, is trained on its corresponding training data and taken as the synthesis parameter model of that leaf node.
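The split-selection criterion described above (dispersion as mean squared distance to the class center, and the question with the largest dispersion drop preferred) can be sketched as follows. The question set and sample data are toy assumptions; real features would be multi-dimensional F0/spectrum vectors.

```python
# Hedged sketch of the splitting criterion: measure a node's sample
# dispersion as the mean squared distance to the class centre, and pick
# the question whose split yields the largest dispersion drop.

def dispersion(samples):
    """Mean squared distance of the samples from their centre (variance)."""
    if not samples:
        return 0.0
    centre = sum(samples) / len(samples)
    return sum((s - centre) ** 2 for s in samples) / len(samples)

def best_split(samples, questions):
    """questions: name -> predicate on a sample. Returns the question with
    the largest decrease in size-weighted child dispersion, and the drop."""
    parent = dispersion(samples)
    best, best_drop = None, 0.0
    for name, pred in questions.items():
        yes = [s for s in samples if pred(s)]
        no = [s for s in samples if not pred(s)]
        child = (len(yes) * dispersion(yes)
                 + len(no) * dispersion(no)) / len(samples)
        drop = parent - child
        if drop > best_drop:
            best, best_drop = name, drop
    return best, best_drop

samples = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]           # two clear clusters
questions = {"gt_3": lambda s: s > 3, "gt_1": lambda s: s > 1}
print(best_split(samples, questions)[0])  # "gt_3" separates the clusters
```

Splitting stops, per the description, when the best drop or a child's data count falls below a preset threshold, and the node becomes a leaf.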
Obviously, in the embodiment shown in fig. 2, the selection of the initial speech synthesis parameter model mainly depends on a binary decision tree based on text analysis, such as the phoneme category of the context of the currently investigated synthesis unit, the pronunciation type of the current phoneme, etc., so that the initial speech synthesis parameter model can be obtained conveniently and quickly.
Further, based on the principle that the loss of the synthesized speech signal of the actual speech signal and the coding model is minimized, in the embodiment of the present invention, the initial fundamental frequency model and the initial spectrum model need to be jointly optimized, and the joint optimization process is described in detail below.
As shown in fig. 5, it is a flowchart of performing joint optimization on the initial fundamental frequency model in the embodiment of the present invention, and the flowchart includes the following steps:
step 501, extracting an original fundamental frequency characteristic sequence corresponding to a continuous voice signal.
Step 502, the first synthesis unit is obtained to be the currently optimized synthesis unit.
Step 503, obtaining an initial fundamental frequency model and a related fundamental frequency model set corresponding to the currently optimized synthesis unit, where the related fundamental frequency model set includes all or part of leaf nodes of a fundamental frequency binary decision tree corresponding to the currently optimized synthesis unit.
Step 504, selecting a preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence.
That is, the initial fundamental frequency model is jointly optimized according to the original fundamental frequency feature sequence and the relevant fundamental frequency model set.
Specifically, a fundamental frequency model in the relevant fundamental frequency model set may be sequentially selected to replace a corresponding initial fundamental frequency model in the fundamental frequency model sequence, so as to obtain a new fundamental frequency model sequence; and then determining a synthesized new fundamental frequency characteristic sequence according to the new fundamental frequency model sequence. Then calculating the distance between the new fundamental frequency characteristic sequence and the original fundamental frequency characteristic sequence; and selecting a fundamental frequency model corresponding to the minimum distance as a preferred model of the initial fundamental frequency model.
When determining the synthesized new fundamental frequency feature sequence according to the new fundamental frequency model sequence, specifically, the fundamental frequency model parameters may be determined according to the new fundamental frequency model sequence and the time length sequence corresponding to the synthesis unit, so as to generate the synthesized new fundamental frequency feature sequence.
For example, the synthesized new fundamental frequency feature sequence is obtained according to the following formula:
Omax = argmax_O P(O | λ, T)
where O is a feature sequence, λ is the given fundamental frequency model sequence, and T is the duration sequence of the synthesis units.
Omax, the finally generated fundamental frequency feature sequence, is the feature sequence O with the maximum likelihood under the given fundamental frequency model sequence λ and the unit duration sequence T.
When calculating the distance between the new fundamental frequency feature sequence and the original fundamental frequency feature sequence, a Euclidean distance may be used, that is:
D(O, C) = Σ_{i=1..N} (Oi − Ci)^T (Oi − Ci)
where Oi and Ci are the i-th original and the i-th new fundamental frequency feature vectors, respectively.
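The preferred-model selection of steps 502-505 can be sketched as below. This is an illustrative simplification: the synthesized sequence is reduced to the candidate model's mean repeated over the unit's frames, standing in for full maximum-likelihood parameter generation, and all names and values are assumptions.

```python
# Sketch of preferred-model selection: substitute each candidate F0 model,
# synthesise a new feature sequence (simplified here to the model mean over
# the unit's frames), and keep the candidate minimising the Euclidean
# distance to the original feature sequence.

import math

def euclidean_distance(orig, synth):
    """D(O, C) = sum_i (O_i - C_i)^T (O_i - C_i), on scalar features."""
    return sum((o - c) ** 2 for o, c in zip(orig, synth))

def prefer_model(original_f0, candidates, n_frames):
    """candidates: name -> model mean. Returns the minimum-distance name."""
    best_name, best_dist = None, math.inf
    for name, mean in candidates.items():
        synth = [mean] * n_frames            # crude stand-in for generation
        d = euclidean_distance(original_f0, synth)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name

original = [5.0, 5.1, 4.9, 5.0]
print(prefer_model(original, {"A": 5.0, "B": 4.0}, len(original)))  # "A"
```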
And 505, taking the preferred model as a fundamental frequency model of the currently optimized synthesis unit, and replacing the preferred model with a corresponding initial fundamental frequency model in the fundamental frequency model sequence.
Step 506, determine if there are any more non-optimized synthesis units. If yes, go to step 507; otherwise, step 508 is performed.
Step 507, acquiring the next synthesis unit as the currently optimized synthesis unit. Then, the process returns to step 503.
And step 508, outputting the fundamental frequency model of each synthesis unit.
As mentioned above, the relevant fundamental frequency model set may contain all leaf nodes of the fundamental frequency binary decision tree corresponding to the synthesis unit. However, the number of leaf nodes is often large, and computing and comparing them one by one would consume substantial computational resources, which is unfavorable for real-time coding. Therefore, some leaf nodes that are more likely to be preferred can be selected from all leaf nodes to form the relevant fundamental frequency model set used in the subsequent fundamental frequency model optimization. The specific process can be as follows:
(1) firstly, calculating the likelihood between the original fundamental frequency characteristic sequence corresponding to the synthesis unit and the fundamental frequency models of all leaf nodes of the fundamental frequency binary decision tree.
Let the original fundamental frequency feature sequence be o = (o1, ..., oN), where N is the number of frames of the speech signal, and let the currently investigated fundamental frequency model be λj (j = 1, ..., J, where J is the size of the whole model set). The likelihood between the two is:
p(o | λj) = Π_{i=1..N} N(oi; mj, Σj)
where N(oi; mj, Σj) = (2π)^(−d/2) |Σj|^(−1/2) exp(−(oi − mj)^T Σj^(−1) (oi − mj) / 2), d is the dimension of the feature sequence o, mj is the mean of the fundamental frequency model, and Σj is the corresponding variance.
In this embodiment, a gaussian model is selected as the synthetic parametric model, so the model is determined from two parameters, the mean and the variance.
(2) Selecting the fundamental frequency model with the largest possible frequency to form the relevant fundamental frequency model set.
Specifically, the M fundamental frequency models with the maximum likelihood may be preferred as the pre-selected models, or all fundamental frequency models whose likelihood is larger than a preset threshold may be selected as the pre-selected models. The parameter M (typically a positive integer, such as 50) and the threshold (typically a negative number when the log likelihood is used, such as -200) are preset by the system, and the number of pre-selected models is usually controlled through M.
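The preselection described above can be sketched in Python. This is a minimal illustration rather than the patented implementation: `log_likelihood` assumes diagonal-covariance Gaussian models stored as hypothetical `{"mean": ..., "var": ...}` dictionaries, and `preselect` keeps the top-M candidates (or, optionally, only those above a log-likelihood threshold):

```python
import math

def log_likelihood(seq, mean, var):
    """Diagonal-covariance Gaussian log likelihood of a feature
    sequence: sum over frames of log N(o_t; mean, var)."""
    d = len(mean)
    ll = 0.0
    for frame in seq:
        ll += -0.5 * d * math.log(2 * math.pi)
        for x, m, v in zip(frame, mean, var):
            ll += -0.5 * math.log(v) - 0.5 * (x - m) ** 2 / v
    return ll

def preselect(seq, models, top_m=50, threshold=None):
    """Rank candidate models by likelihood against the original
    feature sequence; keep the top-M (or all above a threshold)."""
    scored = [(log_likelihood(seq, m["mean"], m["var"]), j)
              for j, m in enumerate(models)]
    scored.sort(reverse=True)  # largest likelihood first
    if threshold is not None:
        scored = [(ll, j) for ll, j in scored if ll > threshold]
    return [j for _, j in scored[:top_m]]
```

In this sketch the returned list of model indices would then define the relevant fundamental frequency model set for the subsequent joint optimization.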
It should be noted that the aforementioned synthesis unit may be of different granularities according to different application requirements. Generally, if the demand on the code stream rate is higher, a larger synthesis unit, such as a syllable unit or a phoneme unit, is selected; conversely, if the requirement on sound quality is higher, a smaller synthesis unit, such as a state unit of a model or a feature stream unit, may be selected. Under an acoustic model setting based on the HMM, each state of the HMM model may further be selected as a basic synthesis unit, and a corresponding state-level speech segment is obtained. In this way, the obtained synthesis parameter model can describe the characteristics of the speech signal more finely, further improving the transmission quality of the speech signal.
In addition, in the embodiment of the present invention, the process of jointly optimizing the initial spectrum model is similar to the process of jointly optimizing the initial fundamental frequency model, and is not described in detail here.
Therefore, on the premise of minimizing the loss of sound quality during voice recovery, the voice signal transmission method of the embodiment of the present invention greatly reduces the transmission code stream rate and traffic consumption, solves the problem that traditional speech coding methods cannot balance sound quality and data traffic, and improves the user's communication experience in the mobile network era.
Correspondingly, an embodiment of the present invention further provides a speech signal transmission system, as shown in fig. 6, which is a schematic structural diagram of the system.
In this embodiment, the system includes:
a text obtaining module 601, configured to determine text content corresponding to a continuous voice signal to be sent;
a parameter model determining module 602, configured to determine a speech synthesis parameter model of each synthesis unit according to the text content and the continuous speech signal;
a splicing module 603, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;
a serial number string determining module 604, configured to determine a serial number string corresponding to the speech synthesis parameter model sequence;
a sending module 605, configured to send the sequence number string to a receiving end, so that the receiving end recovers the continuous speech signal according to the sequence number string.
In practical applications, the text obtaining module 601 may obtain the text content automatically through a speech recognition algorithm, and certainly, the text content may also be obtained in a manual labeling manner. For this purpose, a speech recognition unit and/or a label information acquisition unit may be disposed in the text acquisition module 601, so that the user can select different modes to obtain text contents corresponding to the continuous speech signal to be transmitted. The voice recognition unit is used for determining text contents corresponding to continuous voice signals to be sent through a voice recognition algorithm; the marking information acquisition unit is used for acquiring text contents corresponding to continuous voice signals to be sent in a manual marking mode.
The synthesis unit is a preset minimum synthesis object, such as a syllable unit, a phoneme unit, even a state unit in a phoneme HMM model, and the like.
In order to reduce the loss of tone quality recovery at the receiving end as much as possible, so that the receiving end can recover continuous speech signals through a speech synthesis mode, the speech synthesis parameter model obtained by the parameter model determining module 602 should conform to the characteristics of the original speech signal as much as possible, so as to reduce the loss of signal compression and recovery. Specifically, the continuous speech signal may be segmented according to the text content to obtain speech segments corresponding to the synthesis units, and further obtain durations, initial fundamental frequency models, and initial spectrum models of the synthesis units, and then the acquired speech signal is used to perform joint optimization on the initialized speech synthesis parameter models to obtain the fundamental frequency models and the spectrum models of the synthesis units.
Accordingly, the receiving party can obtain the speech synthesis parameter model sequence from the codebook according to the serial number string. Each speech synthesis parameter model has a unique serial number, and the sender and the receiver store the same codebook, which contains all the speech synthesis parameter models. Therefore, after receiving the serial number string, the receiving party can obtain the speech synthesis parameter model corresponding to each serial number from the codebook, and then splice these models to obtain the speech synthesis parameter model sequence. A speech synthesis parameter sequence is then determined from the model sequence, and the speech signal is recovered through speech synthesis.
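The shared-codebook lookup on both ends can be illustrated with a toy sketch. The codebook contents and the dictionary model representation here are hypothetical; the serial number is simply the model's index in the codebook that both sides store:

```python
# Sender and receiver share the same codebook: a list of parameter
# models in which the list index serves as the model's serial number.
codebook = [
    {"mean": [1.0], "var": [0.5]},   # serial number 0
    {"mean": [2.0], "var": [0.4]},   # serial number 1
    {"mean": [3.0], "var": [0.3]},   # serial number 2
]

def encode(model_sequence, codebook):
    """Sender side: map each optimized model to its serial number."""
    return [codebook.index(m) for m in model_sequence]

def decode(serial_string, codebook):
    """Receiver side: look up each serial number to rebuild the model
    sequence, which then drives the parametric synthesizer."""
    return [codebook[s] for s in serial_string]
```

Only the short serial number string crosses the network, which is what yields the very low code stream rate described above.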
Therefore, on the premise of minimizing the loss of sound quality during voice recovery, the voice signal transmission system of the embodiment of the present invention greatly reduces the transmission code stream rate and traffic consumption, solves the problem that traditional speech coding methods cannot balance sound quality and data traffic, and improves the user's communication experience in the mobile network era.
Fig. 7 is a schematic structural diagram of a parametric model determining module according to an embodiment of the present invention.
The parametric model determination module comprises:
and the segmentation unit 701 is configured to perform speech segment segmentation on the continuous speech signal according to the text content to obtain speech segments corresponding to each synthesis unit.
Specifically, the continuous speech signal may be forcibly aligned with the acoustic model sequence corresponding to the synthesis unit in the text content, that is, the speech recognition decoding of the speech signal corresponding to the acoustic model sequence is calculated, so as to obtain the speech segment corresponding to each synthesis unit.
It should be noted that the synthesis unit may select different specifications according to different application requirements. Generally, if the demand on the code flow rate is high, larger phonetic units, such as syllable units, phoneme units, etc., are selected; otherwise, if the requirement for the sound quality is higher, smaller speech units, such as state units and feature stream units of the model, may be selected. Under the acoustic Model setting based on HMM (Hidden Markov Model), each state of the HMM Model can be further selected as a synthesis unit, and a corresponding speech segment based on a state layer is obtained. And then respectively determining a fundamental frequency model and a spectrum model corresponding to each state from the corresponding fundamental frequency binary decision tree and spectrum binary decision tree for each state. Therefore, the acquired speech synthesis parameter model can describe the characteristics of the speech signal more finely.
A duration determining unit 702, configured to sequentially determine durations of the speech segments corresponding to the synthesizing units.
A model determining unit 703, configured to sequentially determine an initial speech synthesis parameter model corresponding to each synthesis unit, where the initial speech synthesis parameter model includes: an initial fundamental frequency model and an initial spectral model.
A model sequence obtaining unit 704, configured to obtain a fundamental frequency model sequence and a spectral model sequence corresponding to the continuous speech signal.
A first optimizing unit 705, configured to perform joint optimization on the initial fundamental frequency models corresponding to the synthesizing units by using the continuous speech signal and the fundamental frequency model sequence, so as to obtain a fundamental frequency model of each synthesizing unit.
A second optimization unit 706, configured to perform joint optimization on the initial spectrum models corresponding to the synthesis units by using the continuous speech signal and the spectrum model sequence, so as to obtain a spectrum model of each synthesis unit.
It should be noted that, in the embodiment of the present invention, the model determining unit 703 may determine the initial speech synthesis parameter model corresponding to each synthesis unit based on a binary decision tree.
For this reason, in the embodiment of the present invention, the method may further include: and a binary decision tree construction module.
Fig. 8 is a schematic structural diagram of a binary decision tree building module in the speech signal transmission system according to the embodiment of the present invention.
The binary decision tree construction module comprises:
a training data acquisition unit 801 for acquiring training data;
a parameter extracting unit 802, configured to extract, from the training data, synthesis parameters of the speech segment set corresponding to the synthesizing unit, where the synthesis parameters include: fundamental frequency features and spectral features;
an initializing unit 803, configured to initialize the binary decision tree corresponding to the synthesizing unit according to the synthesizing parameters, that is, construct a binary decision tree with only root nodes;
a node examining unit 804, configured to examine each non-leaf node in sequence from a root node of the binary decision tree; if the current investigation node needs to be split, splitting the current investigation node, and acquiring a split child node and training data corresponding to the child node; otherwise, marking the current investigation node as a leaf node;
a binary decision tree output unit 805, configured to output the binary decision tree of the synthesis unit after the node investigation unit completes investigation of all non-leaf nodes.
In this embodiment, the training data obtaining unit 801 may specifically collect a large amount of speech training data and perform text labeling on it, then perform speech segment segmentation by basic speech unit or even by synthesis unit (e.g., a state unit of a basic speech unit model) according to the labeled text content, obtain the speech segment set corresponding to each synthesis unit, and use the speech segments in that set as the training data corresponding to the synthesis unit.
When the node examining unit 804 determines whether the currently examined node needs to be split, it may, according to the sample clustering degree of the node, select the question that yields the largest reduction in clustering degree as the preferred question and attempt the split to obtain child nodes. If the reduction in clustering degree achieved by splitting on the preferred question is smaller than a set threshold, or the amount of training data in a split child node falls below a set threshold, it is determined that the currently examined node is not split further.
The above examination and splitting process can refer to the description in the speech signal transmission method according to the embodiment of the present invention, and will not be described herein again.
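The examination-and-splitting criterion just described can be sketched as follows, using the variance of one-dimensional samples as a stand-in for the clustering degree; `min_gain` and `min_samples` are illustrative names for the two preset thresholds mentioned above, not identifiers from the patent:

```python
def variance(samples):
    """Clustering degree of a set of 1-D samples (here: variance)."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / n

def try_split(samples, questions, min_gain=0.1, min_samples=2):
    """Pick the yes/no question whose split most reduces the pooled
    clustering degree; refuse to split if the gain or the child
    sample counts fall below the preset thresholds."""
    best = None
    for q in questions:
        yes = [x for x in samples if q(x)]
        no = [x for x in samples if not q(x)]
        if len(yes) < min_samples or len(no) < min_samples:
            continue
        pooled = (len(yes) * variance(yes)
                  + len(no) * variance(no)) / len(samples)
        gain = variance(samples) - pooled
        if gain >= min_gain and (best is None or gain > best[0]):
            best = (gain, q, yes, no)
    return best  # None means: mark the node as a leaf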
It should be noted that, in the embodiment of the present invention, both the fundamental frequency binary decision tree and the spectrum binary decision tree can be established by the binary decision tree building module, and the implementation processes thereof are similar, and are not described in detail herein.
In this embodiment of the present invention, the model determining unit 703 may include: an initial fundamental frequency model determining unit and an initial spectrum model determining unit (not shown).
The initial fundamental frequency model determination unit includes:
the first acquisition unit is used for acquiring a base frequency binary decision tree corresponding to the synthesis unit;
the first parsing unit is used for performing text parsing on the synthesis unit to obtain context information of the synthesis unit, such as context information of a phoneme unit, a tone, a part of speech, a prosody hierarchy and the like;
a first decision unit, configured to perform a path decision in the fundamental frequency binary decision tree according to the context information to obtain a corresponding leaf node; the path decision process is as follows:
according to the context information of the synthesis unit, sequentially answering the split questions of each node from the root node of the fundamental frequency binary decision tree; acquiring a top-down matching path according to the answer result; obtaining leaf nodes according to the matching paths;
and the first output unit is used for taking the fundamental frequency model corresponding to the leaf node as the initial fundamental frequency model corresponding to the synthesis unit.
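The path decision performed by the first decision unit can be sketched as a simple tree walk. `Node` and the callable split questions below are illustrative constructs rather than the system's actual data structures:

```python
class Node:
    """Binary decision-tree node: internal nodes hold a yes/no split
    question over context information; leaves hold a parameter model."""
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # callable(context) -> bool
        self.yes = yes            # child followed on "yes"
        self.no = no              # child followed on "no"
        self.model = model        # set only on leaf nodes

def path_decision(root, context):
    """Walk from the root, answering each node's split question from
    the unit's context information, until a leaf node is reached."""
    node = root
    while node.model is None:
        node = node.yes if node.question(context) else node.no
    return node.model
```

The model held by the reached leaf node then serves as the initial fundamental frequency model of the synthesis unit; the spectrum decision tree is walked the same way.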
Similar to the initial fundamental frequency model determination unit, the initial spectrum model determination unit includes:
the second obtaining unit is used for obtaining the spectrum binary decision tree corresponding to the synthesizing unit;
the second analysis unit is used for performing text analysis on the synthesis unit to obtain context information of the synthesis unit;
a second decision unit, configured to perform a path decision in the spectrum binary decision tree according to the context information to obtain a corresponding leaf node; the path decision process is as follows:
according to the context information of the synthesis unit, sequentially answering the split questions of each node from the root node of the spectrum binary decision tree, acquiring a top-down matching path according to the answer results, and obtaining a leaf node according to the matching path;
and the second output unit is used for taking the spectrum model corresponding to the leaf node as the initial spectrum model corresponding to the synthesis unit.
It should be noted that, the initial fundamental frequency model determining unit and the initial spectrum model determining unit may be implemented by separate physical units, or may be implemented by one physical unit in a unified manner, which is not limited in this embodiment of the present invention.
In this embodiment of the present invention, the first optimization unit 705 and the second optimization unit 706 perform joint optimization on the initial fundamental frequency model and the initial spectrum model respectively based on the principle that the loss of the actual speech signal and the synthesized speech signal of the coding model is minimized.
Fig. 9 is a schematic structural diagram of a first optimization unit in the embodiment of the present invention.
The first optimization unit includes:
a fundamental frequency feature sequence extracting unit 901, configured to extract an original fundamental frequency feature sequence corresponding to the continuous speech signal;
a first obtaining unit 902, configured to sequentially obtain an initial fundamental frequency model and a related fundamental frequency model set corresponding to each synthesis unit, where the related fundamental frequency model set includes all or part of leaf nodes of a fundamental frequency binary decision tree corresponding to the synthesis unit;
a first selecting unit 903, configured to select a preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence; that is, the initial fundamental frequency model is jointly optimized according to the original fundamental frequency feature sequence and the relevant fundamental frequency model set;
a first replacing unit 904, configured to use the preferred model as the fundamental frequency model of the synthesizing unit, and replace the corresponding initial fundamental frequency model in the fundamental frequency model sequence with the preferred model.
Wherein the first selecting unit 903 comprises:
a fundamental frequency model sequence updating unit, configured to sequentially select a fundamental frequency model in the relevant fundamental frequency model set to replace a corresponding initial fundamental frequency model in the fundamental frequency model sequence, so as to obtain a new fundamental frequency model sequence; determining a synthesized new fundamental frequency characteristic sequence according to the new fundamental frequency model sequence;
the first calculation unit is used for calculating the distance between the new fundamental frequency characteristic sequence and the original fundamental frequency characteristic sequence;
and the fundamental frequency model selection unit is used for selecting a fundamental frequency model corresponding to the minimum distance as a preferred model of the initial fundamental frequency model.
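The select-synthesize-compare loop implemented by these sub-units can be sketched as follows. `synthesize` (parameter generation from a model sequence) and `distance` (e.g., a Euclidean or absolute distance between feature sequences) are caller-supplied placeholders, since the patent does not fix a particular implementation of either:

```python
def select_preferred(idx, model_seq, candidates,
                     original_feats, synthesize, distance):
    """For position idx, try each candidate model in place of the
    initial one, synthesize the resulting feature sequence, and keep
    the candidate whose output is closest to the original features."""
    best_model, best_dist = model_seq[idx], None
    for cand in candidates:
        trial = list(model_seq)
        trial[idx] = cand                       # swap in the candidate
        d = distance(synthesize(trial), original_feats)
        if best_dist is None or d < best_dist:  # keep the minimum distance
            best_model, best_dist = cand, d
    result = list(model_seq)
    result[idx] = best_model  # replace the initial model in the sequence
    return result
```

Repeating this per synthesis unit, with the updated sequence feeding the next unit's optimization, gives the joint optimization of the whole fundamental frequency (or spectrum) model sequence.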
Fig. 10 is a schematic structural diagram of a second optimization unit in the embodiment of the present invention.
The second optimization unit includes:
a spectral feature sequence extracting unit 1001 configured to extract an original spectral feature sequence corresponding to the continuous speech signal;
a second obtaining unit 1002, configured to sequentially obtain an initial spectrum model and a related spectrum model set corresponding to each synthesis unit, where the related spectrum model set includes all or part of leaf nodes of a spectrum binary decision tree corresponding to the synthesis unit;
a second selecting unit 1003, configured to select a preferred model of the initial spectrum model from the set of relevant spectrum models according to the original spectrum feature sequence;
a second replacing unit 1004, configured to use the preferred model as the spectrum model of the synthesizing unit, and replace the corresponding initial spectrum model in the spectrum model sequence with the preferred model.
The second selecting unit 1003 includes:
the spectrum model sequence updating unit is used for sequentially selecting the spectrum models in the relevant spectrum model set to replace the corresponding initial spectrum models in the spectrum model sequence to obtain a new spectrum model sequence; determining a synthesized new frequency spectrum characteristic sequence according to the new frequency spectrum model sequence;
the second calculating unit is used for calculating the distance between the new spectrum characteristic sequence and the original spectrum characteristic sequence;
and a spectrum model selection unit for selecting a spectrum model corresponding to the minimum distance as a preferred model of the initial spectrum model.
It should be noted that the first optimization unit 705 and the second optimization unit 706 may be implemented by separate physical units, or may be implemented by one physical unit, which is not limited in this embodiment of the present invention.
It should be noted that, the synthesis unit described in the embodiment of the present invention may select different specifications according to different application requirements. Generally, if the code flow rate requirement is higher, a larger synthesis unit, such as syllable unit, phoneme unit, etc., is selected; on the contrary, if the requirement for sound quality is higher, smaller synthesis units, such as state units of models, feature stream units, etc., may be selected. Under the condition of adopting acoustic model setting based on the HMM, each state of the HMM model can be further selected as a basic synthesis unit, and a corresponding speech segment based on the state layer is obtained. Therefore, the obtained synthesis parameter model can describe the characteristics of the voice signal more finely, and the transmission quality of the voice signal is further improved.
Therefore, on the premise of minimizing the loss of sound quality during voice recovery, the voice signal transmission system of the embodiment of the present invention greatly reduces the transmission code stream rate and traffic consumption, solves the problem that traditional speech coding methods cannot balance sound quality and data traffic, and improves the user's communication experience in the mobile network era.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing describes embodiments of the present invention in detail; specific examples are used herein merely to facilitate understanding of the methods and apparatuses of the present invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make variations to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (19)

1. A method for transmitting a voice signal, comprising:
determining text content corresponding to a continuous voice signal to be sent;
determining a voice synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal;
splicing the voice synthesis parameter models of all the synthesis units to obtain a voice synthesis parameter model sequence;
determining a sequence number string corresponding to the speech synthesis parameter model sequence;
and sending the sequence number string to a receiving end so that the receiving end recovers the continuous voice signal according to the sequence number string.
2. The method of claim 1, wherein the determining the text content corresponding to the continuous speech signal to be transmitted comprises:
determining text contents corresponding to continuous voice signals to be sent through a voice recognition algorithm; or
And acquiring text contents corresponding to the continuous voice signals to be sent in a manual labeling mode.
3. The method of claim 1, wherein determining a speech synthesis parameter model for each synthesis unit based on the text content and the continuous speech signal comprises:
performing voice fragment segmentation on the continuous voice signal according to the text content to obtain voice fragments corresponding to each synthesis unit;
sequentially determining the duration of the speech segment corresponding to each synthesis unit and an initial speech synthesis parameter model, wherein the initial speech synthesis parameter model comprises: an initial fundamental frequency model and an initial spectrum model; and obtaining a fundamental frequency model sequence and a spectrum model sequence corresponding to the continuous speech signal;
performing joint optimization on the initial fundamental frequency models corresponding to the synthesis units by using the continuous voice signals and the fundamental frequency model sequence to obtain fundamental frequency models of the synthesis units;
and performing joint optimization on the initial spectrum model corresponding to each synthesis unit by using the continuous voice signal and the spectrum model sequence to obtain the spectrum model of each synthesis unit.
4. The method of claim 3, wherein the determining the initial fundamental frequency model corresponding to the synthesis unit comprises:
obtaining a base frequency binary decision tree corresponding to the synthesis unit;
analyzing the text of the synthesis unit to obtain the context information of the synthesis unit;
according to the context information, performing path decision in the fundamental frequency binary decision tree to obtain corresponding leaf nodes;
and taking the fundamental frequency model corresponding to the leaf node as an initial fundamental frequency model corresponding to the synthesis unit.
5. The method of claim 3, wherein the determining the initial spectrum model corresponding to the synthesis unit comprises:
acquiring a spectrum binary decision tree corresponding to the synthesis unit;
analyzing the text of the synthesis unit to obtain the context information of the synthesis unit;
according to the context information, performing path decision in the spectrum binary decision tree to obtain corresponding leaf nodes;
and taking the spectrum model corresponding to the leaf node as an initial spectrum model corresponding to the synthesis unit.
6. The method according to claim 4 or 5, characterized in that the method further comprises: constructing a binary decision tree corresponding to the synthesis unit according to the following modes:
acquiring training data;
extracting synthesis parameters of the voice segment set corresponding to the synthesis unit from the training data, wherein the synthesis parameters comprise: a fundamental frequency characteristic or a spectral characteristic;
initializing a binary decision tree corresponding to the synthesis unit according to the synthesis parameters;
sequentially investigating each non-leaf node from a root node of the binary decision tree;
if the current investigation node needs to be split, splitting the current investigation node, and acquiring a split child node and training data corresponding to the child node; otherwise, marking the current investigation node as a leaf node;
and when all the non-leaf nodes are examined, obtaining a binary decision tree of the synthesis unit.
7. The method of claim 3, wherein jointly optimizing the initial fundamental frequency model corresponding to each synthesis unit by using the continuous speech signal and the sequence of fundamental frequency models to obtain the fundamental frequency model of each synthesis unit comprises:
extracting an original fundamental frequency characteristic sequence corresponding to the continuous voice signal;
performing the following processing for each synthesis unit in sequence:
acquiring an initial fundamental frequency model and a related fundamental frequency model set corresponding to the synthesis unit, wherein the related fundamental frequency model set comprises all or part of leaf nodes of a fundamental frequency binary decision tree corresponding to the synthesis unit;
selecting a preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence;
and taking the preferred model as a fundamental frequency model of the synthesis unit, and replacing the preferred model with a corresponding initial fundamental frequency model in the fundamental frequency model sequence.
8. The method according to claim 7, wherein said selecting a preferred model of the initial fundamental frequency model from the set of related fundamental frequency models according to the original fundamental frequency feature sequence comprises:
sequentially selecting the fundamental frequency models in the relevant fundamental frequency model set to replace the corresponding initial fundamental frequency models in the fundamental frequency model sequence to obtain a new fundamental frequency model sequence;
determining a synthesized new fundamental frequency characteristic sequence according to the new fundamental frequency model sequence;
calculating the distance between the new fundamental frequency characteristic sequence and the original fundamental frequency characteristic sequence;
and selecting a fundamental frequency model corresponding to the minimum distance as a preferred model of the initial fundamental frequency model.
9. The method of claim 3, wherein the jointly optimizing the initial spectrum model corresponding to each synthesis unit by using the continuous speech signal and the spectrum model sequence to obtain the spectrum model of each synthesis unit comprises:
extracting an original frequency spectrum characteristic sequence corresponding to the continuous voice signal;
performing the following processing for each synthesis unit in sequence:
acquiring an initial spectrum model and a related spectrum model set corresponding to the synthesis unit, wherein the related spectrum model set comprises all or part of leaf nodes of a spectrum binary decision tree corresponding to the synthesis unit;
selecting a preferred model of the initial spectrum model from the set of related spectrum models according to the original spectrum feature sequence;
and taking the preferred model as a spectrum model of the synthesis unit, and replacing the preferred model with a corresponding initial spectrum model in the spectrum model sequence.
10. The method according to claim 9, wherein the selecting a preferred model of the initial spectral model from the set of related spectral models according to the original sequence of spectral features comprises:
sequentially selecting the spectrum models in the relevant spectrum model set to replace the corresponding initial spectrum models in the spectrum model sequence to obtain a new spectrum model sequence;
determining a synthesized new spectrum characteristic sequence according to the new spectrum model sequence;
calculating the distance between the new spectrum characteristic sequence and the original spectrum characteristic sequence;
and selecting the spectrum model corresponding to the minimum distance as the preferred model of the initial spectrum model.
11. A voice signal transmission system, comprising:
the text acquisition module is used for determining text contents corresponding to continuous voice signals to be sent;
the parameter model determining module is used for determining a voice synthesis parameter model of each synthesis unit according to the text content and the continuous voice signal;
the splicing module is used for splicing the voice synthesis parameter models of the synthesis units to obtain a voice synthesis parameter model sequence;
a serial number string determining module, configured to determine a serial number string corresponding to the speech synthesis parameter model sequence;
and the sending module is used for sending the sequence number string to a receiving end so that the receiving end recovers the continuous voice signal according to the sequence number string.
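The sender/receiver exchange of claim 11 amounts to transmitting codebook indices instead of waveforms: both ends share the set of leaf-node models, so only a serial number string crosses the channel. A sketch under that assumption (the shared `codebook` list and the `synthesize` callable are illustrative, not part of the claim language):

```python
def encode_model_sequence(model_sequence, codebook):
    """Sender side: map each selected model to its serial number in the
    codebook shared with the receiver; only these indices are transmitted."""
    return [codebook.index(model) for model in model_sequence]

def decode_serial_string(serial_numbers, codebook, synthesize):
    """Receiver side: look the models back up by serial number and
    regenerate the continuous speech signal from the recovered sequence."""
    models = [codebook[n] for n in serial_numbers]
    return synthesize(models)
```

Because each index costs only a few bits (log2 of the codebook size), this is where the extremely low bit rate claimed in the abstract comes from.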
12. The system of claim 11, wherein the text acquisition module comprises:
the voice recognition unit is used for determining text contents corresponding to continuous voice signals to be sent through a voice recognition algorithm; or
and the annotation information acquisition unit is used for acquiring the text content corresponding to the continuous voice signal to be sent by means of manual annotation.
13. The system of claim 11, wherein the parametric model determination module comprises:
the segmentation unit is used for segmenting the continuous voice signal into voice fragments according to the text content, so as to obtain the voice fragment corresponding to each synthesis unit;
the time length determining unit is used for sequentially determining the time length of the voice fragment corresponding to each synthesis unit;
a model determining unit, configured to sequentially determine an initial speech synthesis parameter model corresponding to each synthesis unit, where the initial speech synthesis parameter model includes: an initial fundamental frequency model and an initial spectrum model;
a model sequence obtaining unit, configured to obtain a fundamental frequency model sequence and a spectrum model sequence corresponding to the continuous speech signal;
the first optimization unit is used for carrying out joint optimization on the initial fundamental frequency models corresponding to the synthesis units by using the continuous voice signals and the fundamental frequency model sequences to obtain fundamental frequency models of the synthesis units;
and the second optimization unit is used for performing joint optimization on the initial spectrum model corresponding to each synthesis unit by using the continuous voice signal and the spectrum model sequence to obtain the spectrum model of each synthesis unit.
14. The system of claim 13, wherein the model determination unit comprises: an initial fundamental frequency model determining unit and an initial spectrum model determining unit;
the initial fundamental frequency model determination unit includes:
the first acquisition unit is used for acquiring a fundamental frequency binary decision tree corresponding to the synthesis unit;
the first analysis unit is used for carrying out text analysis on the synthesis unit to obtain the context information of the synthesis unit;
a first decision unit, configured to perform a path decision in the fundamental frequency binary decision tree according to the context information to obtain a corresponding leaf node;
a first output unit, configured to use the fundamental frequency model corresponding to the leaf node as an initial fundamental frequency model corresponding to the synthesis unit;
the initial spectrum model determination unit includes:
the second obtaining unit is used for obtaining the spectrum binary decision tree corresponding to the synthesizing unit;
the second analysis unit is used for performing text analysis on the synthesis unit to obtain context information of the synthesis unit;
a second decision unit, configured to perform a path decision in the spectrum binary decision tree according to the context information to obtain a corresponding leaf node;
and the second output unit is used for taking the spectrum model corresponding to the leaf node as the initial spectrum model corresponding to the synthesis unit.
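The path decision recited in claim 14 is an ordinary decision-tree walk: each non-leaf node asks a yes/no question about the unit's context information, and the leaf reached holds the initial model. A minimal sketch (the `Node` class and predicate-style questions are assumptions; the claim does not fix a tree representation):

```python
class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate over context info; None at a leaf
        self.yes, self.no = yes, no
        self.model = model        # synthesis parameter model stored at a leaf

def decide_leaf(root, context):
    """Walk the binary decision tree with the synthesis unit's context
    information until a leaf is reached; its model becomes the unit's
    initial fundamental frequency (or spectrum) model."""
    node = root
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.model
```

The same walk serves both the first and second decision units; only the tree (fundamental frequency vs. spectrum) differs.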
15. The system of claim 14, further comprising: a binary decision tree construction module, the binary decision tree construction module comprising:
a training data acquisition unit for acquiring training data;
a parameter extracting unit, configured to extract, from the training data, a synthesis parameter of the speech segment set corresponding to the synthesizing unit, where the synthesis parameter includes: a fundamental frequency characteristic or a spectral characteristic;
the initialization unit is used for initializing the binary decision tree corresponding to the synthesis unit according to the synthesis parameters;
a node examination unit, configured to examine each non-leaf node in turn, starting from the root node of the binary decision tree; if the currently examined node needs to be split, splitting it and acquiring the resulting child nodes and the training data corresponding to each child node; otherwise, marking the currently examined node as a leaf node;
and the binary decision tree output unit is used for obtaining the binary decision tree of the synthesis unit after all non-leaf nodes have been examined.
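The top-down construction of claim 15 can be sketched recursively: split a node while the splitting criterion says to, otherwise mark it as a leaf and train its model on the data that reached it. The `should_split` and `best_split` callables below are hypothetical stand-ins, since the claim leaves the splitting criterion (typically a likelihood-gain threshold in HMM-based synthesis) open; training data items are assumed to be `(context, feature)` pairs and the leaf model is just the feature mean.

```python
from statistics import mean

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question, self.yes, self.no, self.model = question, yes, no, model

def build_tree(data, should_split, best_split):
    """Grow a binary decision tree top-down over (context, feature) pairs.

    `should_split(data)` decides whether this node's training data justifies
    splitting; `best_split(data)` returns (question, yes_data, no_data).
    Both are illustrative stand-ins for the criteria the claim leaves open."""
    if not should_split(data):
        # Mark as leaf and train its model on the data that reached it.
        return Node(model=mean(feature for _, feature in data))
    question, yes_data, no_data = best_split(data)
    return Node(question=question,
                yes=build_tree(yes_data, should_split, best_split),
                no=build_tree(no_data, should_split, best_split))
```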
16. The system of claim 13, wherein the first optimization unit comprises:
a fundamental frequency characteristic sequence extracting unit, configured to extract an original fundamental frequency characteristic sequence corresponding to the continuous speech signal;
the first acquisition unit is used for sequentially acquiring an initial fundamental frequency model and a related fundamental frequency model set corresponding to each synthesis unit, wherein the related fundamental frequency model set comprises all or part of leaf nodes of a fundamental frequency binary decision tree corresponding to the synthesis unit;
a first selecting unit, configured to select a preferred model of the initial fundamental frequency model from the relevant fundamental frequency model set according to the original fundamental frequency feature sequence;
and the first replacing unit is used for taking the preferred model as the fundamental frequency model of the synthesis unit, and replacing the corresponding initial fundamental frequency model in the fundamental frequency model sequence with the preferred model.
17. The system of claim 16, wherein the first selection unit comprises:
a fundamental frequency model sequence updating unit, configured to sequentially select a fundamental frequency model in the relevant fundamental frequency model set to replace a corresponding initial fundamental frequency model in the fundamental frequency model sequence, so as to obtain a new fundamental frequency model sequence; determining a synthesized new fundamental frequency characteristic sequence according to the new fundamental frequency model sequence;
the first calculation unit is used for calculating the distance between the new fundamental frequency characteristic sequence and the original fundamental frequency characteristic sequence;
and the fundamental frequency model selection unit is used for selecting a fundamental frequency model corresponding to the minimum distance as a preferred model of the initial fundamental frequency model.
18. The system of claim 13, wherein the second optimization unit comprises:
the spectrum characteristic sequence extraction unit is used for extracting an original spectrum characteristic sequence corresponding to the continuous voice signal;
the second acquisition unit is used for sequentially acquiring the initial spectrum model and the related spectrum model set corresponding to each synthesis unit, and the related spectrum model set comprises all or part of leaf nodes of the spectrum binary decision tree corresponding to the synthesis unit;
a second selecting unit, configured to select a preferred model of the initial spectrum model from the set of related spectrum models according to the original spectrum feature sequence;
and a second replacing unit, configured to use the preferred model as the spectrum model of the synthesizing unit, and replace the corresponding initial spectrum model in the spectrum model sequence with the preferred model.
19. The system according to claim 18, wherein the second selection unit comprises:
the spectrum model sequence updating unit is used for sequentially selecting the spectrum models in the relevant spectrum model set to replace the corresponding initial spectrum models in the spectrum model sequence to obtain a new spectrum model sequence; and determining a synthesized new spectrum characteristic sequence according to the new spectrum model sequence;
the second calculating unit is used for calculating the distance between the new spectrum characteristic sequence and the original spectrum characteristic sequence;
and a spectrum model selection unit for selecting a spectrum model corresponding to the minimum distance as a preferred model of the initial spectrum model.
CN201310361783.1A 2013-08-19 2013-08-19 speech signal transmission method and system Active CN103474067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310361783.1A CN103474067B (en) 2013-08-19 2013-08-19 speech signal transmission method and system

Publications (2)

Publication Number Publication Date
CN103474067A CN103474067A (en) 2013-12-25
CN103474067B true CN103474067B (en) 2016-08-24

Family

ID=49798888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310361783.1A Active CN103474067B (en) 2013-08-19 2013-08-19 speech signal transmission method and system

Country Status (1)

Country Link
CN (1) CN103474067B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart household voice broadcasting system and voice broadcasting method
CN106373581A (en) * 2016-09-28 2017-02-01 成都奥克特科技有限公司 Data encoding processing method for speech signals
CN108346423B (en) * 2017-01-23 2021-08-20 北京搜狗科技发展有限公司 Method and device for processing speech synthesis model
CN109064789A (en) * 2018-08-17 2018-12-21 重庆第二师范学院 A kind of adjoint cerebral palsy speaks with a lisp supplementary controlled system and method, assistor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0360265A2 (en) * 1988-09-21 1990-03-28 Nec Corporation Communication system capable of improving a speech quality by classifying speech signals
CN1964244A (en) * 2005-11-08 2007-05-16 厦门致晟科技有限公司 A method to receive and transmit digital signal using vocoder
JP2008139631A (en) * 2006-12-04 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> Voice synthesis method, device and program
CN102592594A (en) * 2012-04-06 2012-07-18 苏州思必驰信息科技有限公司 Incremental-type speech online synthesis method based on statistic parameter model
CN102664021A (en) * 2012-04-20 2012-09-12 河海大学常州校区 Low-rate speech coding method based on speech power spectrum
CN102867516A (en) * 2012-09-10 2013-01-09 大连理工大学 Speech coding and decoding method using high-order linear prediction coefficient grouping vector quantization



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666, Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant after: Iflytek Co., Ltd.

Address before: Xunfei Building, No. 666, Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant