
CN118197315A - Cabin voice interaction method, system and computer readable medium - Google Patents


Info

Publication number
CN118197315A
Authority
CN
China
Prior art keywords
cabin
interaction
voice
information
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410609091.2A
Other languages
Chinese (zh)
Inventor
蒋磊
蔡勇
刘新
陆晨昱
蔡超
葛德发
方露雨
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hozon New Energy Automobile Co Ltd
Original Assignee
Hozon New Energy Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hozon New Energy Automobile Co Ltd
Priority to CN202410609091.2A
Publication of CN118197315A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Remote Sensing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a cabin voice interaction method, a cabin voice interaction system, and a computer readable medium. The cabin voice interaction method is suitable for a vehicle cabin and comprises the following steps: acquiring a voice instruction of a user, determining the user's position from the voice instruction, and converting the voice instruction into text; acquiring visual orientation information of the user, determining the interaction object of the voice instruction from the visual orientation information, and acquiring interaction information, wherein the interaction object comprises multimedia content, multimedia devices, and vehicle cabin hardware; and inputting the text and the interaction information to a cloud control center, which derives an interaction instruction from the text and the interaction information and outputs it to the vehicle head unit for execution.

Description

Cabin voice interaction method, system and computer readable medium
Technical Field
The invention relates to the technical field of vehicle-mounted voice interaction, and in particular to a cabin voice interaction method, a cabin voice interaction system, and a computer readable medium.
Background
With the rapid development of the vehicle industry and of vehicle intelligence, vehicles with a voice interaction function are increasingly popular. The functional areas of a traditional automobile cabin, however, are laid out in a fragmented way, and information overload hinders human-vehicle interaction, so the value of the automobile as an interaction entry point has been underestimated. As voice technology is applied ever more widely in automobiles, the modes of human-vehicle interaction have been enriched and the riding experience of users improved. A vehicle interior may be provided with several intelligent terminal display devices, such as a large central control screen in the front row and display devices mounted on seat backs, all with a voice interaction function. As interactive devices and voice instructions in the cabin multiply, current voice interaction methods place high demands on the precision of the user's voice instructions and understand those instructions poorly, which can make the interaction process halting and degrade the user experience.
Disclosure of Invention
The invention aims to provide a cabin voice interaction method, a cabin voice interaction system, and a computer readable medium that recognize voice instructions more accurately and make the interaction smoother.
In order to solve the above technical problems, the invention provides a cabin voice interaction method suitable for a vehicle cabin, comprising the following steps: acquiring a voice instruction of a user, determining the user's position from the voice instruction, and converting the voice instruction into text; acquiring visual orientation information of the user, determining the interaction object of the voice instruction from the visual orientation information, and acquiring interaction information, wherein the interaction object comprises multimedia content, multimedia devices, and vehicle cabin hardware; and inputting the text and the interaction information to a cloud control center, which derives an interaction instruction from the text and the interaction information and outputs it to the vehicle head unit for execution.
In one embodiment of the present invention, determining the user's position through the voice instruction includes: collecting sound signals in the vehicle cabin through a microphone array, preprocessing the sound signals, and estimating the time delay; and determining the user's position with a beamforming algorithm according to the time delay estimation result, wherein the microphone array comprises a plurality of microphones.
In one embodiment of the invention, the voice instruction is converted into text using automatic speech recognition (ASR) technology.
In an embodiment of the present invention, acquiring the visual orientation information of the user includes: performing face tracking and eye tracking of the user via visual signals from the OMS/DMS (occupant/driver monitoring system) devices within the vehicle cabin.
In an embodiment of the present invention, when the interaction object is multimedia content, the cabin voice interaction method further includes: determining at least one first display device corresponding to the visual orientation information among at least one display device within the vehicle cabin; determining an interactable area in the at least one first display device; and, when the interaction information in the interactable area can be acquired through the vehicle head unit, inputting the interaction information to the cloud control center.
In an embodiment of the present invention, when the interaction information in the interactable area cannot be obtained through the vehicle head unit, the step of the cloud control center obtaining the interaction information includes: acquiring an image of the interactable area and performing entity recognition on the image to obtain the interaction information.
In an embodiment of the present invention, when there are a plurality of first display devices, the cloud control center obtaining an interaction instruction according to the text and the interaction information includes: comparing the similarity of the text and the interaction information to obtain a plurality of candidate operation items; and ranking the candidate operation items by similarity, with the candidate operation item of highest similarity serving as the interaction instruction.
In an embodiment of the present invention, when the interaction object is a multimedia device, the cabin voice interaction method further includes: determining a first display device corresponding to the visual orientation information among at least one display device within the vehicle cabin; and acquiring interaction information of the first display device, wherein the interaction information comprises opening, closing, and brightness adjustment.
In an embodiment of the present invention, when the interaction object is cabin hardware, the cabin voice interaction method further includes: determining first cabin hardware corresponding to the visual orientation information among at least one piece of cabin hardware in the vehicle cabin and acquiring interaction information of the first cabin hardware, wherein the first cabin hardware includes windows, seats, rearview mirrors, and sound and lighting systems.
The invention also provides a cabin voice interaction system, comprising: a memory for storing instructions executable by a processor; and a processor for executing the instructions to implement the method of any of the previous embodiments.
The invention also provides a computer readable medium storing computer program code which, when executed by a processor, implements the cabin voice interaction method of any of the previous embodiments.
Compared with the prior art, the invention has the following advantages: for the control of multimedia content, multimedia devices, and vehicle cabin hardware, the user's voice instruction is combined with visual orientation information, which improves the understanding of the voice instruction, the accuracy of the interaction, and the user experience; and when the multimedia content cannot be retrieved directly, an image of the interactable area can be obtained by means such as a camera or a screen capture, and entity recognition is performed on the image to obtain the interaction information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the accompanying drawings:
Fig. 1 is a flow chart of a cabin voice interaction method according to an embodiment of the invention.
Fig. 2 is a partial flow chart of a cabin voice interaction method in accordance with another embodiment of the invention.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and it is apparent to those of ordinary skill in the art that the present application may be applied to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
As used in the specification and in the claims, the terms "a," "an," and "the" do not denote the singular only but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
The relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise. Meanwhile, it should be understood that, for convenience of description, the sizes of the parts shown in the drawings are not drawn to actual scale. Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate. In all examples shown and discussed herein, any specific value should be construed as merely illustrative and not as a limitation; other examples of the exemplary embodiments may therefore have different values. It should be noted that like reference numerals and letters denote like items in the figures, so once an item is defined in one figure, it need not be discussed further in subsequent figures.
In the description of the present application, it should be understood that the orientations or positional relationships indicated by terms such as "front, rear, upper, lower, left, right," "lateral, vertical, horizontal," and "top, bottom" are generally based on the orientations or positional relationships shown in the drawings, merely to facilitate and simplify the description of the present application; these terms do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the scope of protection of the present application. The terms "inner" and "outer" refer to inner and outer relative to the contour of the respective component itself.
Spatially relative terms, such as "above," "over," "on the upper surface of," and the like, may be used herein for ease of description to describe the spatial position of one device or feature relative to another device or feature as illustrated in the figures. It will be understood that these spatially relative terms are intended to encompass different orientations in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "above" or "over" other devices or structures would then be oriented "below" or "beneath" them; the exemplary term "above" may thus encompass both an orientation of "above" and of "below." The device may also be positioned in other ways (rotated 90 degrees or at other orientations), and the spatially relative descriptors used herein are to be interpreted accordingly.
In addition, the terms "first," "second," etc. are used only to distinguish between corresponding components; unless otherwise stated they carry no special meaning and should not be construed as limiting the scope of the present application. Furthermore, although the terms used in the present application are selected from publicly known and commonly used terms, some of them may have been chosen by the applicant at his or her discretion, and their detailed meanings are given in the relevant parts of the description herein. The present application should therefore be understood not simply through the actual terms used but through the meaning each term carries.
Flowcharts are used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed precisely in that order; the various steps may instead be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
Fig. 1 is a flow chart of a cabin voice interaction method according to an embodiment of the invention. Referring to fig. 1, the present invention provides a cabin voice interaction method 10, suitable for a vehicle cabin, comprising the following steps:
S11: acquiring a voice instruction of a user, determining the user's position from the voice instruction, and converting the voice instruction into text;
S12: acquiring visual orientation information of the user, determining the interaction object of the voice instruction from the visual orientation information, and acquiring interaction information;
S13: inputting the text and the interaction information to a cloud control center, which derives an interaction instruction from the text and the interaction information and outputs it to the vehicle head unit for execution.
In this embodiment, determining the user's position through the voice instruction in step S11 employs beamforming, specifically including: collecting sound signals in the vehicle cabin through a microphone array, preprocessing the sound signals, estimating the time delay, and determining the user's position with a beamforming algorithm according to the time delay estimation result, wherein the microphone array comprises a plurality of microphones.
Specifically, the microphone array disposed in the vehicle cabin includes a plurality of microphones, preferably two or four: the two-microphone layout typically places the microphones between the driver's seat and the front passenger seat, while the four-microphone layout adds two more microphones, at the left rear and right rear of the cabin. The microphones separately collect the sound signals in the vehicle cabin, which are then synchronized and preprocessed (e.g., filtered and denoised) to reduce the influence of environmental noise and interference on positioning accuracy.
It will be appreciated that, because the microphones sit at different locations, they receive the sound waves of even the same speech signal from the same user at different times, and the position of the sound source can be estimated from the time difference of arrival (TDOA) of the sound at the different microphones. Further, a beamforming algorithm such as delay-and-sum, driven by the delay estimates, adjusts the signal of each microphone in the array so that the array forms a main beam in the direction of the sound source (i.e., the direction of the user issuing the voice instruction); the peak direction of the beam is taken as the bearing of the sound source.
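As an illustration only (not part of the patent), the following Python sketch estimates the inter-microphone delay with GCC-PHAT and aligns the channels with a delay-and-sum beam; the sampling rate, the two-channel toy signal, and the function names are assumptions made for the example.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay of `sig` relative to `ref`, in seconds, via GCC-PHAT."""
    n = 2 * max(len(sig), len(ref))            # zero-pad to avoid circular wrap
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                     # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))   # lags -n/2 .. n/2-1
    return (np.argmax(cc) - n // 2) / fs

def delay_and_sum(channels, delays, fs):
    """Advance each channel by its estimated delay and average (main beam)."""
    out = np.zeros_like(channels[0], dtype=float)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(round(d * fs)))
    return out / len(channels)

# Toy usage: a noise burst arriving 0.5 ms later at the second microphone.
fs = 16000
rng = np.random.default_rng(0)
mic1 = rng.standard_normal(fs)
mic2 = np.roll(mic1, int(0.0005 * fs))
tau = gcc_phat(mic2, mic1, fs)                 # ~= 0.0005 s
beam = delay_and_sum([mic1, mic2], [0.0, tau], fs)
```

In a real cabin, the pairwise delays from several microphones would be combined over the known array geometry to resolve which seat the speaker occupies.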
In this embodiment, the voice instruction is converted into text using ASR technology, which preprocesses the voice instruction, feeds the processed audio features sequentially into an acoustic model and a language model, and then decodes and post-processes the result to output a continuous text stream. Specifically, the preprocessing includes: sampling and quantization, in which the audio signal is first sampled and quantized into a digital signal; noise elimination, in which a noise suppression technique reduces the influence of background noise; and feature extraction, in which features such as mel-frequency cepstral coefficients (MFCCs), filter-bank energies (FBANK), or spectrograms are extracted from the audio signal.
The acoustic model and the language model must be trained in advance before being put into use. Training the acoustic model requires a large amount of labeled speech data with corresponding text; the model is trained to recognize the relations between different sound features and language units (such as phonemes), so that when the preprocessed audio features are fed in, it outputs, for each time frame, the probability distribution over the corresponding phonemes or states. The language model is trained to predict the probability distribution of the next word given the preceding words, which helps recognize whole sentences rather than isolated words; combined with the output of the acoustic model, it helps identify the most likely word sequence.
Further, the decoder uses a search algorithm (such as the Viterbi algorithm) to find and output, under the guidance of the acoustic model and the language model, the most probable word sequence, i.e., the sequence that most likely matches what the speaker said. Post-processing then adds punctuation and capitalization to the converted text according to grammar rules and context, and uses grammar rules or additional models to correct errors that may have occurred during recognition.
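Purely to illustrate the front end described above, this sketch computes the MFCC features that would feed the acoustic model, using librosa; the file name and frame parameters are assumptions, and the acoustic model, language model, and decoder are left abstract.

```python
import librosa

# Load an utterance (hypothetical file) and resample to 16 kHz.
audio, sr = librosa.load("command.wav", sr=16000)

# 13 MFCCs per 25 ms frame with a 10 ms hop, a conventional ASR front end.
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# Per-coefficient mean/variance normalization, a common robustness step
# before the features are fed frame by frame into the acoustic model.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
print(mfcc.shape)   # (13, number_of_frames)
```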
In one embodiment of the present invention, acquiring the user's visual orientation information in step S12 includes performing face tracking and eye tracking of the user via visual signals from the OMS/DMS devices within the vehicle cabin. Specifically, face orientation tracking includes the following steps (a minimal pose-estimation sketch follows the list):
(1) Face detection: the on-board OMS/DMS camera captures face images, typically using a face detection algorithm such as Haar cascades or a deep learning model (e.g., a convolutional neural network, CNN);
(2) Feature point localization: once a face is detected, the next step is to locate its key feature points, such as the eyes, nose, mouth, and cheekbones, to describe the geometry of the face;
(3) Face model fitting: a face model is fitted to the detected face using a set of predefined facial feature points;
(4) Pose estimation: the pose of the head, including its pitch, yaw, and roll angles, is estimated by analyzing the relative positions of the feature points and the parameters of the face model;
(5) Three-dimensional reconstruction: if a depth camera or a stereo camera is used, the three-dimensional shape of the face may also be reconstructed by triangulation or other stereo vision techniques;
(6) Data smoothing and filtering: to reduce noise and tracking instability, the tracking data usually needs to be smoothed with a Kalman filter or another smoothing algorithm;
(7) Tracking and updating: the appearance and pose of the face change over time, so the tracking system continually updates the face model and the feature point locations to reflect these changes.
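A minimal sketch of step (4), head pose estimation, assuming the 2D landmarks have already been located by a detector; the 3D reference coordinates, pixel values, and camera intrinsics below are generic placeholders rather than values from the patent.

```python
import cv2
import numpy as np

# Generic 3D face-model reference points (nose tip, chin, eye corners,
# mouth corners) in millimetres; placeholder values for illustration.
model_points = np.array([
    (0.0, 0.0, 0.0), (0.0, -330.0, -65.0),
    (-225.0, 170.0, -135.0), (225.0, 170.0, -135.0),
    (-150.0, -150.0, -125.0), (150.0, -150.0, -125.0),
])

# The matching 2D landmarks from the face detector (hypothetical pixels).
image_points = np.array([
    (359.0, 391.0), (399.0, 561.0), (337.0, 297.0),
    (513.0, 301.0), (345.0, 465.0), (453.0, 469.0),
])

# Approximate pinhole intrinsics for a 640x480 camera, no lens distortion.
w, h = 640, 480
camera_matrix = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=float)
dist_coeffs = np.zeros((4, 1))

ok, rvec, _ = cv2.solvePnP(model_points, image_points,
                           camera_matrix, dist_coeffs,
                           flags=cv2.SOLVEPNP_ITERATIVE)

# Rotation vector -> rotation matrix -> pitch / yaw / roll in degrees.
R, _ = cv2.Rodrigues(rvec)
pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
yaw = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))
roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
print(f"pitch={pitch:.1f} yaw={yaw:.1f} roll={roll:.1f}")
```

The yaw angle, in particular, indicates which display the occupant's head is turned toward.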
Eye tracking comprises the following steps (a pupil-detection sketch follows the list):
Image capture: an image of the eye is captured using the on-board OMS/DMS camera, which may be an infrared camera, a visible-light camera, or a combination of both;
Pupil detection: image processing algorithms detect the position of the pupil, typically by exploiting the contrast between the pupil and the iris, sclera, and eyelid;
Feature point localization: once the pupil is detected, the system further locates other key feature points of the eye, such as the pupil edge and the corneal reflection points;
Eye model fitting: an eye model is fitted to the captured image using the located feature points, which helps explain the geometric and optical characteristics of the eye;
Gaze estimation: the gaze direction is estimated from the positions of the pupil and the corneal reflection points together with the eye model, typically via geometric or optical calculations, to determine the three-dimensional point the eye is fixating;
Head pose correction: if necessary, the system corrects for changes in head pose to keep the eye tracking accurate; this step can be realized with an additional head tracking technique;
Data smoothing and filtering: to reduce noise and tracking instability, the tracking data usually needs to be smoothed with a Kalman filter or another smoothing algorithm;
Tracking and updating: the eye position and the gaze point change over time, so the tracking system continually updates the eye model and the feature point locations to reflect these changes.
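As a sketch of the pupil-detection step alone, assuming a cropped grayscale eye image is already available; the threshold value and the synthetic test image are illustrative, not from the patent.

```python
import cv2
import numpy as np

def detect_pupil(eye_gray):
    """Return the (x, y) centre and radius of the darkest round blob,
    taken here to be the pupil, or None if nothing plausible is found."""
    blur = cv2.GaussianBlur(eye_gray, (7, 7), 0)
    # The pupil is the darkest region: keep pixels below an illustrative threshold.
    _, mask = cv2.threshold(blur, 40, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    (x, y), r = cv2.minEnclosingCircle(largest)
    return (int(x), int(y)), int(r)

# Hypothetical usage: a synthetic 100x100 eye patch with a dark disc as pupil.
eye = np.full((100, 100), 180, dtype=np.uint8)
cv2.circle(eye, (48, 52), 12, 20, -1)
print(detect_pupil(eye))   # -> approximately ((48, 52), 12)
```

Corneal-reflection localisation and the eye-model fit would follow the same pattern, but on the bright specular points rather than the dark pupil.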
Further, the interaction object in step S12 includes multimedia content, multimedia devices, and vehicle cabin hardware. Fig. 2 is a partial flow chart of a cabin voice interaction method according to another embodiment of the invention. Referring to figs. 1-2 in combination, when the interaction object is multimedia content, the cabin voice interaction method further includes:
S21: determining at least one first display device corresponding to the visual orientation information among at least one display device within the vehicle cabin;
S22: determining an interactable area in the at least one first display device;
S23: judging whether the interaction information in the interactable area can be acquired through the vehicle head unit; if so, executing step S25, otherwise executing step S24;
S24: acquiring an image of the interactable area, performing entity recognition on the image to obtain the interaction information, and then executing step S25;
S25: inputting the interaction information to the cloud control center.
Specifically, multimedia content can be understood as the content displayed on a display device in the vehicle cabin (including the main display screen, secondary display screens, etc.), which contains some interactable objects. For example, when the user looks at the main display screen, the main display screen is determined to be the first display device, and the interactable areas in it, such as the UIs of various applications, are determined. In some cases the interaction information in the interactable area can be acquired directly through the system, or through an interface provided by a third-party application; in other cases it cannot be acquired directly, and an image of the interactable area is then obtained by a camera, a screen capture, or similar means, and entity recognition is performed on the image to obtain the interaction information. For example, if different video applications are present on several screens, the user can start the desired one simply by looking at the target screen and saying "video".
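One way to realise the screen-capture fallback is sketched below, with Pillow and Tesseract OCR standing in for the unspecified capture and entity-recognition components; the patent does not name these tools, and the capture coordinates are invented.

```python
from PIL import ImageGrab      # screen capture (Windows/macOS; X11 on Linux)
import pytesseract             # requires a local Tesseract OCR installation

# Capture the interactable area (illustrative left, top, right, bottom pixels).
region = ImageGrab.grab(bbox=(0, 0, 1280, 400))

# Entity-recognition stand-in: OCR the capture and keep distinct words as
# candidate interaction entities (app names, button labels, media titles).
text = pytesseract.image_to_string(region, lang="eng")
entities = sorted({w.strip(".,:;") for w in text.split() if len(w) > 1})
print(entities)
```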
In an embodiment of the present invention, when there are a plurality of first display devices, the cloud control center obtaining an interaction instruction according to the text and the interaction information includes: comparing the similarity between the text and the interaction information to obtain a plurality of candidate operation items, and ranking the candidate operation items by similarity, with only the candidate operation item of highest similarity output as the interaction instruction. For example, if the interactable area at which the user's gaze points spans several adjacent screens, all of them are first display devices; when the user issues the voice instruction "open video", the method acquires the interaction information (i.e., the operable video applications) on these screens, compares the similarity between the text and each piece of interaction information, and finally outputs only the candidate operation item with the highest similarity as the interaction instruction.
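A minimal sketch of that ranking step, using the standard library's difflib as a stand-in for the unspecified similarity measure; the command string and candidate labels are invented for illustration.

```python
from difflib import SequenceMatcher

def rank_candidates(command_text, candidates):
    """Rank candidate operation items by string similarity to the command."""
    scored = [(SequenceMatcher(None, command_text, c).ratio(), c) for c in candidates]
    scored.sort(reverse=True)
    return scored

# Hypothetical interaction information gathered from two adjacent screens.
candidates = ["open video player", "open music player", "open navigation map"]
ranked = rank_candidates("open video", candidates)
print(ranked[0][1])   # only the highest-similarity item becomes the
                      # interaction instruction sent back to the head unit
```

In practice the cloud control center might well use an embedding-based semantic similarity rather than raw string matching, but the ranking logic is the same.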
Further, in an embodiment of the present invention, when the interaction object is a multimedia device, the cabin voice interaction method further includes: determining a first display device corresponding to the visual orientation information among at least one display device in the vehicle cabin, and acquiring interaction information of the first display device, wherein the interaction information comprises opening, closing, and brightness adjustment.
Still further, in an embodiment of the present invention, when the interaction object is cabin hardware, the cabin voice interaction method further includes: determining first cabin hardware corresponding to the visual orientation information among at least one piece of cabin hardware within the vehicle cabin and acquiring interaction information of the first cabin hardware, wherein the first cabin hardware includes windows, seats, rearview mirrors, and sound and lighting systems.
The invention also provides a cabin voice interaction system, comprising: a memory for storing instructions executable by a processor; and a processor for executing the instructions to implement the method of any of the previous embodiments.
Some aspects of the application may be performed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. Furthermore, aspects of the application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media. For example, computer-readable media can include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact discs (CD), digital versatile discs (DVD)), smart cards, and flash memory devices (e.g., cards, sticks, key drives).
The invention also provides a computer readable medium storing computer program code which, when executed by a processor, implements the cabin voice interaction method of any of the previous embodiments.
The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, in baseband or as part of a carrier wave. The propagated signal may take a variety of forms, including electromagnetic or optical forms, or any suitable combination thereof. A computer readable medium can be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer readable medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, radio frequency signals, or the like, or a combination of any of the foregoing.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the application may occur to those skilled in the art; such modifications, improvements, and adaptations are suggested by this disclosure and are intended to remain within the spirit and scope of its exemplary embodiments.
Meanwhile, the present application uses specific words to describe embodiments of the present application. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the application may be combined as suitable.
Similarly, it should be noted that, in order to simplify the description of the present disclosure and thereby aid the understanding of one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure does not imply that the claimed subject matter requires more features than are recited in the claims; indeed, claimed subject matter may lie in less than all features of a single disclosed embodiment.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers are, in some examples, qualified by the modifiers "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending on the desired properties sought by the individual embodiment. In some embodiments, the numerical parameters should take into account the specified significant digits and employ ordinary rounding. Although the numerical ranges and parameters set forth herein are approximations in some embodiments, in particular embodiments the numerical values are set forth as precisely as practicable.
While the application has been described with reference to specific embodiments, it will be appreciated by those skilled in the art that the foregoing embodiments are merely illustrative; various equivalent changes and substitutions may be made without departing from the spirit of the application, and all such changes and modifications to the embodiments are intended to fall within the scope of the appended claims.

Claims (9)

1. A cabin voice interaction method, suitable for a vehicle cabin, characterized by comprising the following steps:
acquiring a voice instruction of a user, determining the user's position through the voice instruction, and converting the voice instruction into text;
acquiring visual orientation information of the user, determining the interaction object of the voice instruction through the visual orientation information, and acquiring interaction information, wherein the interaction object comprises multimedia content, multimedia devices, and vehicle cabin hardware; and
inputting the text and the interaction information to a cloud control center, the cloud control center obtaining an interaction instruction according to the text and the interaction information and outputting the interaction instruction to a vehicle head unit for execution,
wherein the cabin voice interaction method further comprises:
determining at least one first display device corresponding to the visual orientation information among at least one display device within the vehicle cabin;
determining an interactable area in the at least one first display device; and
when the interaction information in the interactable area can be obtained through the vehicle head unit, inputting the interaction information to the cloud control center; and
when the interaction information in the interactable area cannot be obtained through the vehicle head unit, the step of the cloud control center obtaining the interaction information further comprising: acquiring an image of the interactable area and performing entity recognition on the image to obtain the interaction information.
2. The cabin voice interaction method of claim 1, wherein determining the user's position through the voice instruction comprises:
collecting sound signals in the vehicle cabin through a microphone array, preprocessing the sound signals, and estimating time delay; and
determining the user's position through a beamforming algorithm according to the time delay estimation result; wherein
the microphone array comprises a plurality of microphones.
3. The cabin voice interaction method of claim 1, wherein the voice instruction is converted into text using ASR technology.
4. The cabin voice interaction method of claim 1, wherein acquiring the visual orientation information of the user comprises: performing face tracking and eye tracking of the user via visual signals from the OMS/DMS devices within the vehicle cabin.
5. The cabin voice interaction method of claim 1, wherein, when there are a plurality of first display devices, the cloud control center obtaining an interaction instruction according to the text and the interaction information comprises:
performing a similarity comparison between the text and the interaction information to obtain a plurality of candidate operation items; and
ranking the plurality of candidate operation items by similarity, and taking the candidate operation item with the highest similarity as the interaction instruction.
6. The cabin voice interaction method of any one of claims 1-4, wherein when the interaction object is a multimedia device, the cabin voice interaction method further comprises:
determining a first display device corresponding to the visual orientation information among at least one display device within the vehicle cabin; and
acquiring the interaction information of the first display device, wherein the interaction information comprises opening, closing, and brightness adjustment.
7. The cabin voice interaction method of any one of claims 1-4, wherein when the interaction object is cabin hardware, the cabin voice interaction method further comprises:
determining first cabin hardware corresponding to the visual orientation information among at least one piece of cabin hardware within the vehicle cabin and acquiring the interaction information of the first cabin hardware; wherein
the first cabin hardware includes windows, seats, rearview mirrors, and sound and lighting systems.
8. A cabin voice interaction system, comprising:
a memory for storing instructions executable by a processor; and
a processor for executing the instructions to implement the method of any one of claims 1-7.
9. A computer readable medium storing computer program code which, when executed by a processor, implements the cabin voice interaction method of any one of claims 1-7.
CN202410609091.2A 2024-05-16 2024-05-16 Cabin voice interaction method, system and computer readable medium Pending CN118197315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410609091.2A CN118197315A (en) 2024-05-16 2024-05-16 Cabin voice interaction method, system and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410609091.2A CN118197315A (en) 2024-05-16 2024-05-16 Cabin voice interaction method, system and computer readable medium

Publications (1)

Publication Number Publication Date
CN118197315A true CN118197315A (en) 2024-06-14

Family

ID=91412618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410609091.2A Pending CN118197315A (en) 2024-05-16 2024-05-16 Cabin voice interaction method, system and computer readable medium

Country Status (1)

Country Link
CN (1) CN118197315A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170235361A1 (en) * 2016-01-20 2017-08-17 Panasonic Automotive Systems Company Of America, Division Of Panasonic Corporation Of North America Interaction based on capturing user intent via eye gaze
CN107123421A (en) * 2017-04-11 2017-09-01 广东美的制冷设备有限公司 Sound control method, device and home appliance
US20190139547A1 (en) * 2017-11-08 2019-05-09 Alibaba Group Holding Limited Interactive Method and Device
CN109941231A (en) * 2019-02-21 2019-06-28 初速度(苏州)科技有限公司 Vehicle-mounted terminal equipment, vehicle-mounted interactive system and exchange method
CN111816189A (en) * 2020-07-03 2020-10-23 斑马网络技术有限公司 Multi-tone-zone voice interaction method for vehicle and electronic equipment
CN112073639A (en) * 2020-09-11 2020-12-11 Oppo(重庆)智能科技有限公司 Shooting control method and device, computer readable medium and electronic equipment
CN112951216A (en) * 2021-05-11 2021-06-11 宁波均联智行科技股份有限公司 Vehicle-mounted voice processing method and vehicle-mounted information entertainment system
CN113352986A (en) * 2021-05-20 2021-09-07 浙江吉利控股集团有限公司 Vehicle voice atmosphere lamp partition interaction control method and system
CN116978372A (en) * 2022-04-22 2023-10-31 华为技术有限公司 Voice interaction method, electronic equipment and storage medium
CN117995184A (en) * 2023-12-29 2024-05-07 华人运通(上海)云计算科技有限公司 Man-machine interaction method, device and equipment under low attention and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118398011A (en) * 2024-06-26 2024-07-26 广州小鹏汽车科技有限公司 Voice request processing method, server device and storage medium

Similar Documents

Publication Publication Date Title
US20210035586A1 (en) System and method of correlating mouth images to input commands
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
US11854550B2 (en) Determining input for speech processing engine
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN118197315A (en) Cabin voice interaction method, system and computer readable medium
CN106157956A (en) The method and device of speech recognition
WO2014062521A1 (en) Emotion recognition using auditory attention cues extracted from users voice
US11508374B2 (en) Voice commands recognition method and system based on visual and audio cues
KR102290186B1 (en) Method of processing video for determining emotion of a person
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
CN115169507A (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
Ivanko et al. Multimodal speech recognition: increasing accuracy using high speed video data
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
US20240135956A1 (en) Method and apparatus for measuring speech-image synchronicity, and method and apparatus for training model
Navarathna et al. Multiple cameras for audio-visual speech recognition in an automotive environment
CN114466179B (en) Method and device for measuring synchronism of voice and image
Loh et al. Speech recognition interactive system for vehicle
Saudi et al. Improved features and dynamic stream weight adaption for robust Audio-Visual Speech Recognition framework
CN114466178A (en) Method and device for measuring synchronism of voice and image
CN114494930A (en) Training method and device for voice and image synchronism measurement model
Ibrahim A novel lip geometry approach for audio-visual speech recognition
Yasui et al. Multimodal speech recognition using mouth images from depth camera
CN110338747A (en) Householder method, storage medium, intelligent terminal and the auxiliary device of eye test
Murai et al. Face-to-talk: audio-visual speech detection for robust speech recognition in noisy environment
Pao et al. A motion feature approach for audio-visual recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination