VoiceFirst Lingo

Most of the definitions below were written by other companies, including Witlingo and Pragmatic. Please see the sources section at the bottom for links to these sites.

One of the first steps to learning about any subject is understanding the common lingo/jargon used by people in the industry. Without that basic knowledge, it is often hard to follow along when reading, listening or watching anything related to the topic.

So... we suggest you start by learning some basic VoiceFirst lingo, and then come back here throughout your journey. Many of these terms may not make sense now, but they will become clearer later.

Voice-First

Let's start with breaking down the term Voice-First...

The input and interaction with a Voice-First device is via voice commands and conversation. There are various types of Voice-First devices.

Voice-Only Devices with voice as the only input and output, such as smart speakers, are categorized as Voice-Only. Examples include Amazon Echo, Sonos One and Google Home Mini.

Multimodal Devices which support input and output via voice and screens, such as smart displays, are categorized as Multimodal. Each input changes the way a customer can interact with the experience, but the two should work together fluidly. Examples include Amazon Echo Show, Google Home Hub, and Mercedes-Benz's MBUX (powered by SoundHound's Houndify).

Voice-Overlay Devices where voice is not the primary method for input or output but is instead offered as an option for assisting with input, such as speech-to-text on mobile devices and smart watches, are categorized as Voice-Overlay. Examples include Siri on an Apple Watch and Dictation on a MacBook.

https://dzone.com/articles/voice-user-interfaces-vuithe-ultimate-ux-guide

Platform Specific Terms

Amazon - Alexa Skills An Alexa skill is to Alexa what a mobile app is to the iOS and Android mobile platforms. Skills enhance and personalize Alexa devices by enabling users to do things such as play a game, get information, or even read a story.

Amazon - Alexa Flash Briefing A Flash Briefing is a news update that Alexa can read or play. Users can customize their Flash Briefings by choosing from a list of news sources and setting the order in which they are read.

Google Assistant - Actions Actions enable users to extend the functionality of the Google Assistant, which powers the Google Home. They let users do things that range from a quick command, such as turning on a light or playing music, to a longer conversation, such as playing a game.

Samsung Bixby - Capsules Developers teach Bixby by building Bixby Capsules. Each Capsule contains everything Bixby needs to know, from defining use cases and models to dialog and layouts. You connect your Capsule to your existing APIs and add natural language training to create conversational experiences.

Apple Siri - Shortcuts Shortcuts help users quickly accomplish tasks related to your app with their voice or with a tap with the Shortcuts API. Siri intelligently pairs users’ daily routines with your apps to suggest convenient shortcuts right when they’re needed on the lock screen, in Search or from the Siri watch face.

Microsoft Cortana - Skills Cortana is a personal digital assistant that keeps users informed and productive, helping them get things done across devices and platforms. Skills define the tasks that Cortana can accomplish. You can extend Cortana by adding your own skills that let your users interact with your service via Cortana. Cortana invokes the skills based on input from the user, either spoken or typed.

Basic Voice User Interface Terms

Always Listening Device A device that is always listening for a “wake word” and that sends the audio captured after the wake word has been detected for additional processing.

Voice User Interface (VUI) A Voice User Interface (VUI) is the voice equivalent of a Graphical User Interface (GUI) for computers. VUIs allow users to interact with devices by speaking and listening.

Interaction Model Alexa calls their VUI an "Interaction Model." Alexa breaks a user request down into components that can then be used to determine the correct response, prompt and/or action. Definitions for each part follow. Other platforms may use different terms, but they typically have a similar syntax.

Wake Word The spoken word or phrase that “wakes up” an always listening device.

Launch Phrase A word or phrase that tells Alexa what kind of action she's about to take. For example, "open" and "load" are used to start a skill, "play" is used for music and "turn on" is used for smart home products.

Invocation Name A name that represents the custom skill the user wants to use. The user says a supported launch phrase in combination with the invocation name for a skill to begin interacting with that skill. For example, "Alexa, open Kids Court." In this example, Kids Court is the invocation name.

Utterance The words the user says to Alexa to convey what they want to do, or to provide a response to a question Alexa asks. For custom skills, you provide a set of sample utterances mapped to intents as part of your custom interaction model. For smart home skills, the smart home skill API Message Reference provides a predefined set of utterances.

Slot Value An argument to an intent that gives Alexa more information about that request. For example, "Alexa, ask History Buff what happened on June third". In this statement, "…June third" is the value of a date slot that refines the request.

Intents An intent represents an action that fulfills a user's spoken request. Intents can optionally have arguments called slots.
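To make the parts of a request concrete, here is a minimal Python sketch that splits an already-transcribed request into wake word, launch phrase, invocation name, utterance and a slot value. It is not how Alexa works internally. The word lists, the ParsedRequest class, the parse_request function and the toy date pattern are all hypothetical, chosen only to match the examples in the definitions above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration; real platforms define their own lists.
WAKE_WORDS = ("alexa",)
LAUNCH_PHRASES = ("ask", "open", "load")
INVOCATION_NAMES = ("history buff", "kids court")


@dataclass
class ParsedRequest:
    wake_word: str
    launch_phrase: str
    invocation_name: str
    utterance: str             # what the user said after the invocation name
    date_slot: Optional[str]   # value of a toy "date" slot, if one was found


def parse_request(text: str) -> Optional[ParsedRequest]:
    """Split a transcribed request into the parts named in the glossary."""
    words = text.lower().replace(",", "").rstrip(".?!").split()
    if len(words) < 3 or words[0] not in WAKE_WORDS or words[1] not in LAUNCH_PHRASES:
        return None
    rest = " ".join(words[2:])
    for name in INVOCATION_NAMES:
        if rest.startswith(name):
            utterance = rest[len(name):].strip()
            # Toy slot extraction: treat the two words after a final "on" as a date.
            match = re.search(r"\bon (\w+ \w+)$", utterance)
            return ParsedRequest(words[0], words[1], name, utterance,
                                 match.group(1) if match else None)
    return None


parsed = parse_request("Alexa, ask History Buff what happened on June third")
print(parsed)
```

In a real skill the platform performs this breakdown for you and hands your code the intent and its slot values.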

Dialog Model

Dialog Model A structure that identifies the steps for a multi-turn conversation between your skill and the user to collect all the information needed to fulfill each intent. This simplifies the code you need to write to ask the user for information.

Turn A single request to or a response from Alexa.

Prompt The instruction or response that a system “speaks” to the user.

Confirmation A voice-activated assistant's response to make sure the user knows it understood correctly.
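As a rough illustration of how a dialog model ties prompts and a confirmation to an intent, the Python sketch below describes one intent as plain data and picks what the system should say on its next turn. The BookFlight intent, the slot names, the prompt wording and the next_prompt function are all assumptions made for this example. They are not taken from any platform's actual schema.

```python
from typing import Dict

# Hypothetical dialog model: each required slot is paired with the prompt
# used to ask for it, plus a confirmation spoken once everything is filled.
DIALOG_MODEL = {
    "intent": "BookFlight",
    "slots": [
        {"name": "destination", "prompt": "Where are you flying to?"},
        {"name": "date", "prompt": "What day are you planning to fly out?"},
    ],
    "confirmation": "You want to fly to {destination} on {date}, right?",
}


def next_prompt(model: dict, filled: Dict[str, str]) -> str:
    """Return the prompt for the first missing slot, or the confirmation."""
    for slot in model["slots"]:
        if slot["name"] not in filled:
            return slot["prompt"]
    return model["confirmation"].format(**filled)


# Each call represents the system's side of one turn.
print(next_prompt(DIALOG_MODEL, {}))
print(next_prompt(DIALOG_MODEL, {"destination": "Denver"}))
print(next_prompt(DIALOG_MODEL, {"destination": "Denver", "date": "Friday"}))
```

Keeping the prompts in data rather than in code is what lets the dialog model reduce the amount of code you write to collect information.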

If you are new to VoiceFirst, or are just looking for a basic understanding of Voice User Interfaces, you may want to stop reading this page and come back once you've explored other aspects of VoiceFirst. The next two sections get pretty technical. If you continue reading, you may want to come back and read them again later. It may take a few times for all of this to really sink in.

Artificial Intelligence (AI) Terms

Automatic Speech Recognition (ASR) Software that is able to take audio input and map that input to a word or a language utterance.

Far Field Speech Recognition Speech recognition technology that is able to process speech spoken by a user from a distance (usually 10 feet away or more) to the receiving device, usually in a context where there is ambient noise. The first mainstream Far Field Speech Recognition device and system was the Amazon Echo, which came to market in November 2014. The technology that handles speech recognition on handheld mobile devices (e.g., Siri) is called Near Field Speech Recognition.

Natural Language Processing (NLP) A branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It refers to the process of analyzing text and extracting data (basically translating voice into text into data that the software can understand).

Natural Language Understanding (NLU) The aspect of NLP that deals with machine reading comprehension. A combination of components including a comprehensive lexicon, grammar rules, and semantic theory work together to break down sentences and guide comprehension.

Natural Language Generation (NLG) The natural language processing task of generating natural language from a machine representation system. Basically the reverse of NLU, this is the process that translates data back into a written narrative that becomes the spoken response back to a user.

Near Field Speech Recognition In contrast to “Far Field” speech recognition, which processes speech spoken by a human to a device from a distance (usually of 10 feet or more), Near Field speech recognition technology is used for handling spoken input from handheld mobile devices (e.g., Siri on the iPhone) that are used within inches of the speaker, or two feet away at most.

Speech To Text (STT) Software that converts an audio signal to words (text). “Speech to Text” is a term that is less frequently used in the industry than “Speech Recognition,” “Speech Reco,” or “ASR.”

Text to Speech (TTS) Technology that converts text to audio that is spoken by the system. TTS is usually used in the context of dynamically retrieved information (e.g., a product ID), or when the list of possible items to be spoken by the system (e.g., full addresses) is very large, and therefore recording all of the options is not practical.

Voice Biometrics Technology that identifies specific markers within a given piece of audio that was spoken by a human being and uses those markers to uniquely model the speaker’s voice. The technology is the voice equivalent of technology that takes a visual fingerprint of a person and associates that unique fingerprint with the person’s identity. Voice Biometrics technology is used for both Voice Identification and Voice Verification.

Voice Identification (Voice ID) The capability of discriminating a speaker’s identity among a list of possible speaker identities based on the characteristics of the speaker’s voice input. Voice ID systems are usually trained by being provided with samples of speaker voices.

Voice Verification The capability of confirming an identity claim based on a speaker’s voice input. Unlike Voice Identification, which attempts to match a given speaker’s voice input against a universe of speaker voices, Voice Verification compares a voice input against a given speaker’s voice and provides a likelihood match score. Voice Verifications are usually done in an “Identity Claim” setting: the user claims to be someone and then is “challenged” to verify their identity by speaking.
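The difference between identification (one voice against many) and verification (one voice against one claimed identity) can be sketched in a few lines of Python. This is only an illustration under strong assumptions. The two-number "voiceprints" are toy stand-ins for the models a real biometric system would build. Cosine similarity and the 0.8 threshold are illustrative choices, and every function name here is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Similarity of two non-zero vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify(sample: List[float], enrolled: Dict[str, List[float]]) -> str:
    """Voice Identification: pick the closest speaker among all enrolled ones."""
    return max(enrolled, key=lambda name: cosine_similarity(sample, enrolled[name]))


def verify(sample: List[float], claimed: List[float],
           threshold: float = 0.8) -> Tuple[float, bool]:
    """Voice Verification: score the sample against one claimed identity."""
    score = cosine_similarity(sample, claimed)
    return score, score >= threshold


# Toy enrolled voiceprints (assumed values).
enrolled = {"ana": [1.0, 0.0], "ben": [0.0, 1.0]}
sample = [0.9, 0.1]

print(identify(sample, enrolled))
print(verify(sample, enrolled["ana"]))
print(verify(sample, enrolled["ben"]))
```

Note that identify always returns some enrolled speaker, while verify returns a match score and an accept or reject decision for a single identity claim.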

Conversational AI Design Terms

Barge-in The ability of the user to interrupt system prompts while those prompts are being played. If barge-in is enabled in an application, then as soon as the user begins to speak, the system stops playing its prompt and begins processing the user’s input.

Confirmation Explicit A prompt that repeats back what was heard and explicitly asks the user to confirm whether the assistant is correct. Example: User: Hey Google, ask Astrology Daily for my horoscope. Astrology Daily: You wanted a horoscope from Astrology Daily, right?

Confirmation Implicit A prompt that subtly repeats back what was heard to give the user assurance that they were correctly understood. Example: User: Hey Google, ask Astrology Daily for my horoscope. Astrology Daily: Horoscope for what sign?

Contextual Speech Refers to the context, or surrounding words, phrases, and paragraph of writing, that defines the meaning of a particular word. In contextual speech, this definition expands to also include the pronunciation in speech. For example, "Forty times" when spoken can be interpreted as 40 times, 4 tee times, or even 4 tea times, but the meaning can be derived from the words around it in requests such as “tell me 40 times” or “book me 4 tee times at Cog Hill.”

Conversation Repair In conversational analysis, repair refers to the process by which a speaker recognizes a speech error and repeats what has been said with some sort of correction. In the context of voice, repair refers to the process by which the software recognizes an error in understanding or request and then seeks to correct it.

Directed Dialog Interactions where the exchange between the user and the system is guided by the application: the system asks questions or offers options and the user responds to them. Directed dialogs stand in contrast to “Mixed Initiative” dialogs, since they require the user to specifically answer the question asked and won’t accept any other piece of information, whether additive (the user provided an answer to the question, but also an additional piece of information) or substitutive (the user provided instead an altogether different piece of information that is relevant and will be asked for by the system at some point).

Earcon The audio equivalent of an “icon” in graphical user interfaces. Earcons are used to signal conversation marks (e.g., when the system starts listening, when the system stops listening) as well as to communicate brand, mood, and emotion during a Voice-First interaction.

End-pointing The marking of the start and the end of a speaker’s utterance for the purposes of ASR processing.
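One simple way to picture end-pointing is an energy threshold over audio frames. Speech starts at the first loud frame and ends once enough quiet frames follow. The Python sketch below assumes the audio has already been reduced to one energy value per frame. The find_endpoints function, the threshold and the trailing-silence count are hypothetical, and production systems generally use more robust detectors.

```python
from typing import List, Optional, Tuple


def find_endpoints(energies: List[float], threshold: float = 0.1,
                   trailing_silence: int = 3) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame) of the utterance, or None if no speech.

    A frame counts as speech when its energy is at or above `threshold`.
    The utterance ends once `trailing_silence` quiet frames occur in a row.
    """
    start = None
    last_speech = None
    quiet_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i          # first speech frame marks the start
            last_speech = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= trailing_silence:
                break              # enough silence: the utterance has ended
    if start is None:
        return None
    return (start, last_speech)


# A short dip (frame 4) does not end the utterance; three quiet frames do.
frames = [0.0, 0.0, 0.5, 0.6, 0.05, 0.4, 0.0, 0.0, 0.0, 0.7]
print(find_endpoints(frames))  # (2, 5)
```

The trailing-silence setting is the trade-off to watch. Too short and the system cuts users off mid-pause; too long and it feels slow to respond.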

False Accept An instance where the ASR mistakenly accepted an utterance as a valid response.

False Reject An instance where the ASR mistakenly rejected an utterance as an invalid response.
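False accepts and false rejects usually trade off against each other. The Python sketch below assumes that the recognizer attaches a confidence score to each utterance and accepts anything at or above a threshold. That accept rule, the count_errors function and the sample data are assumptions for illustration and are not part of the definitions above.

```python
from typing import List, Tuple


def count_errors(results: List[Tuple[float, bool]],
                 threshold: float) -> Tuple[int, int]:
    """Count (false_accepts, false_rejects) for a given confidence threshold.

    Each result is (confidence, was_actually_valid).
    """
    false_accepts = sum(1 for conf, valid in results
                        if conf >= threshold and not valid)
    false_rejects = sum(1 for conf, valid in results
                        if conf < threshold and valid)
    return false_accepts, false_rejects


# Made-up results: (recognizer confidence, whether the utterance was valid).
results = [(0.9, True), (0.8, False), (0.4, True), (0.2, False)]

print(count_errors(results, 0.3))   # (1, 0): lenient, lets a bad one through
print(count_errors(results, 0.5))   # (1, 1)
print(count_errors(results, 0.95))  # (0, 2): strict, rejects valid speech
```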

Mixed-initiative Dialog Interactions where the user may unilaterally issue a request rather than simply provide exactly the information asked for by system prompts. For instance, while making a flight reservation, the system may ask the user, “What day are you planning to fly out?” Instead of answering that question, the user may say, “I’m flying to Denver, Colorado.” A Mixed-initiative system would recognize that the user did not give the exact answer to the question asked but, either in addition to it (additive) or instead of it (substitutive), volunteered information that was going to be requested by the system later on. Such a system would accept this information, remember it, and continue the conversation. In contrast, a “Directed Dialog” system would rigidly insist on the departure date and won’t proceed successfully unless it received that piece of information.
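The flight example can be sketched in Python to show how the two dialog styles treat the same reply. The sketch assumes an NLU step has already turned the user's reply into a dictionary of slot values (called heard here). The slot names and all three functions are hypothetical.

```python
from typing import Dict, List

REQUIRED_SLOTS = ("date", "destination")  # assumed slots for the flight example


def directed_turn(asked: str, heard: Dict[str, str],
                  filled: Dict[str, str]) -> Dict[str, str]:
    """Directed dialog: keep only the answer to the question that was asked."""
    updated = dict(filled)
    if asked in heard:
        updated[asked] = heard[asked]
    return updated


def mixed_initiative_turn(heard: Dict[str, str],
                          filled: Dict[str, str]) -> Dict[str, str]:
    """Mixed initiative: keep any relevant slot the user volunteered."""
    updated = dict(filled)
    for slot, value in heard.items():
        if slot in REQUIRED_SLOTS:
            updated[slot] = value
    return updated


def missing_slots(filled: Dict[str, str]) -> List[str]:
    return [slot for slot in REQUIRED_SLOTS if slot not in filled]


# System asked for the date; the user said "I'm flying to Denver, Colorado."
heard = {"destination": "Denver, Colorado"}

print(directed_turn("date", heard, {}))      # {} -> nothing kept, must re-ask
print(mixed_initiative_turn(heard, {}))      # destination is remembered
print(missing_slots(mixed_initiative_turn(heard, {})))  # ['date']
```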

No-input Error A situation where the system did not detect any speech input from the user.

No-match Error A situation where the system was not able to match the user’s response to the responses that it expected the user to provide.

Progressive Prompting The technique of beginning an exchange by providing the user with minimal instructions and elaborating on those instructions only if encountering response errors (e.g., no-input, no-match, etc.).
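The three ideas above fit together naturally. The system classifies each reply as a match, a no-input error or a no-match error, and gives a more detailed prompt after each error. The Python sketch below is one possible arrangement. The prompt wording, the expected answers and both function names are made up for illustration.

```python
from typing import Optional, Set

# Hypothetical prompts, from minimal to most detailed.
PROMPTS = [
    "What sign?",
    "Which zodiac sign would you like a horoscope for?",
    "Please say a zodiac sign, such as Aries, Taurus, or Gemini.",
]


def classify_response(heard: Optional[str], expected: Set[str]) -> str:
    """Label a reply as 'match', 'no-input' or 'no-match'."""
    if heard is None or not heard.strip():
        return "no-input"
    if heard.strip().lower() not in expected:
        return "no-match"
    return "match"


def prompt_for(error_count: int) -> str:
    """Progressive prompting: more errors so far means a more detailed prompt."""
    return PROMPTS[min(error_count, len(PROMPTS) - 1)]


expected = {"aries", "taurus", "gemini"}
errors = 0
for reply in [None, "pizza", "Gemini"]:   # simulated user replies
    print("System:", prompt_for(errors))
    outcome = classify_response(reply, expected)
    print("Outcome:", outcome)
    if outcome == "match":
        break
    errors += 1
```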

Speech Synthesis Markup Language (SSML) SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech.
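As a small example of what SSML looks like, the Python sketch below builds a short document using the speak and break elements from the SSML standard and then parses it to check that it is well-formed XML. The build_ssml helper and the sample sentences are hypothetical. Each platform supports its own subset of SSML, so check its documentation before relying on a given tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(before: str, after: str, pause: str = "500ms") -> str:
    """Wrap two pieces of text in <speak>, separated by a pause."""
    return (f"<speak>{escape(before)}"
            f"<break time={quoteattr(pause)}/>"
            f"{escape(after)}</speak>")


ssml = build_ssml("Welcome back.", "Here is your horoscope.")
print(ssml)
# <speak>Welcome back.<break time="500ms"/>Here is your horoscope.</speak>

# Parsing confirms the markup is well-formed before it is sent to a TTS engine.
root = ET.fromstring(ssml)
print(root.tag, root.find("break").get("time"))
```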

Tapered Prompting The technique of eliding a prompt or a piece of a prompt in the context of a multistep interaction or a multi-part system response. For example, instead of the system asking repetitively, “What is your level of satisfaction with our service?” “What is your level of satisfaction with our pricing?” “What is your level of satisfaction with our cleanliness,” the system would ask: “What is your level of satisfaction with our service?” “How about our pricing?” “And our cleanliness?” The technique is used to provide a more natural and less robotic-sounding user experience.
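The satisfaction survey example can be generated with a few lines of Python. This is only a sketch of the idea. The tapered_prompts function is hypothetical, and the wording of the shortened prompts is copied from the example above.

```python
from typing import List


def tapered_prompts(topics: List[str]) -> List[str]:
    """Ask the full question once, then shorten each follow-up."""
    prompts = []
    for i, topic in enumerate(topics):
        if i == 0:
            prompts.append(f"What is your level of satisfaction with our {topic}?")
        elif i == 1:
            prompts.append(f"How about our {topic}?")
        else:
            prompts.append(f"And our {topic}?")
    return prompts


for prompt in tapered_prompts(["service", "pricing", "cleanliness"]):
    print(prompt)
# What is your level of satisfaction with our service?
# How about our pricing?
# And our cleanliness?
```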

Sources

The definitions above were pulled from a variety of sources, including:

https://www.pragmatic.digital/blog/basics-of-voice

https://www.witlingo.com/voice-first-glossary-of-terms

https://developer.amazon.com/docs/alexa-design/glossary.html

https://developer.amazon.com/docs/ask-overviews/alexa-skills-kit-glossary.html

https://www.voxprotocol.com/faqs