As technology evolves, the way people interact with it also changes. With a slew of industry leaders spearheading innovation in digital virtual assistants, it is hardly a stretch to say that intelligent voice assistants will be the defining technological trend of this decade. Voice assistants have already taken over some of the most mundane tasks in my life. As a Curious PM, I take a peek under the hood of Alexa, one of the most popular intelligent voice assistants.
The age of touch could soon come to an end. From smartphones and smartwatches, to home devices, to in-car infotainment systems, touch is no longer the primary user interface.
Donn Morill, Solutions Architect, Amazon Alexa
Alexa, play me a song. Sure…
What happens when you ask Alexa to play a song? Let’s break it down into steps and then dive deeper into some of the frameworks that make it such a scalable and powerful service.

Echo devices use far-field voice recognition technology that lets users speak to Alexa from across the room.
Decoding the task
Echo devices are always listening for the wake word. As soon as the device detects it, it starts recording the rest of your sentence. No processing takes place on the device itself. Once your sentence has been recorded, the entire recording is sent to the Alexa Voice Service (AVS) in the cloud for processing.
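The device-side logic is not public, but conceptually it boils down to a simple loop. The sketch below is purely illustrative; detect_wake_word, record_until_silence, and send_to_avs are hypothetical stand-ins for the firmware's internals, passed in as callables.

```python
# Purely illustrative sketch of the on-device flow; the real Echo firmware
# is not public. The three callables are hypothetical stand-ins.
def run_device_loop(microphone, detect_wake_word, record_until_silence, send_to_avs):
    while True:
        frame = microphone.read()                         # continuously sample audio
        if detect_wake_word(frame):                       # e.g. heard "Alexa"
            utterance = record_until_silence(microphone)  # capture the rest of the sentence
            send_to_avs(utterance)                        # no local processing; hand off to the cloud
```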

Automatic Speech Recognition (ASR)
Once in the cloud, AVS uses a module called ASR to convert speech to text. Most smartphones use a similar service today. ASR can understand different accents and tones and works across varying loudness.
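Alexa’s ASR models are proprietary, but the basic idea of turning an audio clip into text is easy to demonstrate with the open-source SpeechRecognition package. This is only an analogy for what AVS does, and the file name is a placeholder.

```python
# A minimal speech-to-text sketch using the open-source SpeechRecognition
# package; an analogy for what AVS's ASR module does, not the real thing.
import speech_recognition as sr

recognizer = sr.Recognizer()

# "request.wav" is a placeholder for the audio clip captured by the Echo device
with sr.AudioFile("request.wav") as source:
    audio = recognizer.record(source)

# Send the audio to a speech-to-text engine and get back plain text
text = recognizer.recognize_google(audio)
print(text)  # e.g. "play songs by Drake"
```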
Natural Language Understanding (NLU)
Next, AVS uses NLU to figure out what the user actually wants. This is where the magic really happens. A user might say “play music”, “play a song”, or “shuffle songs”; your device might support video or audio only; you may have three skills that give similar results. NLU can handle it all.
In this case of playing music on an Echo Dot, NLU understands that your Echo Dot does not have a screen, so you want to listen to an album by Drake rather than watch a video by Drake.
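For custom skills, developers teach NLU which phrasings belong to which intent through an interaction model. The sketch below shows, as a Python dict, roughly what such a model looks like; the intent, slot, and invocation names are illustrative rather than Spotify’s actual model.

```python
# A simplified interaction model, expressed as a Python dict for illustration.
# It tells NLU that several different phrasings all map to one intent,
# with the artist name captured as a slot. All names here are illustrative.
interaction_model = {
    "languageModel": {
        "invocationName": "my music player",
        "intents": [
            {
                "name": "playMusic",
                "slots": [{"name": "artist", "type": "AMAZON.Musician"}],
                "samples": [
                    "play music by {artist}",
                    "play a song by {artist}",
                    "shuffle songs by {artist}",
                ],
            }
        ],
    }
}
```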
Skills
Once NLU understands your request, it invokes a skill to fulfill it. Skills are similar to apps on your phone. Amazon offers a variety of built-in and third-party skills.
NLU goes through your list of skills and determines that Spotify is the best skill to satisfy your request. It then sends some details to the Spotify skill:
- What does the user want to do? playMusic (also called the Intent)
- Which kind of music? Drake (also called the Slot Value)
The Spotify skill then returns the text “Playing songs by Drake”, along with a URL to Drake’s latest album, back to AVS.
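Under the hood, this hand-off is a JSON exchange between AVS and the skill. Here is a simplified sketch, written as Python dicts, of the request a skill receives and the response it returns; real Alexa requests and responses carry more fields, and the stream URL below is a placeholder.

```python
# Simplified sketch of the intent request a skill receives from AVS
# and the response it returns. Real requests/responses contain more
# fields (session, context, card, etc.); the URL is a placeholder.
skill_request = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "playMusic",  # what the user wants to do (the Intent)
            "slots": {"artist": {"name": "artist", "value": "Drake"}},  # the Slot Value
        },
    }
}

skill_response = {
    "version": "1.0",
    "response": {
        "outputSpeech": {"type": "PlainText", "text": "Playing songs by Drake"},
        # A music skill also attaches an audio directive pointing at the stream URL
        "directives": [
            {
                "type": "AudioPlayer.Play",
                "playBehavior": "REPLACE_ALL",
                "audioItem": {
                    "stream": {
                        "url": "https://example.com/drake-latest-album.mp3",
                        "token": "drake-album",
                        "offsetInMilliseconds": 0,
                    }
                },
            }
        ],
    },
}
```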
Text to Speech (TTS)
AVS then uses the TTS module to do exactly what the name says: convert the text from Spotify into speech. Finally, the speech and the URL are sent to the Echo Dot to play on the device.
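Alexa’s TTS engine lives inside AVS, but Amazon exposes the same capability as a standalone service, Amazon Polly. The boto3 sketch below shows the idea (text in, audio out); it assumes AWS credentials are configured and is an analogy, not Alexa’s actual pipeline.

```python
# Text-to-speech with Amazon Polly via boto3; an analogy for what AVS's
# TTS module does. Assumes AWS credentials are configured locally.
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Playing songs by Drake",  # the text returned by the skill
    OutputFormat="mp3",
    VoiceId="Joanna",               # any available Polly voice
)

# Save the synthesized speech to a local file
with open("speech.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```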
Learning
AVS also stores metadata that is used by separate machine learning modules to improve the entire suite of services.
The Alexa Ecosystem
The Alexa ecosystem is an open platform with two sets of APIs: the Alexa Skills Kit and the Alexa Voice Service. This open ecosystem is what makes Alexa one of the most powerful and scalable voice assistants on the market.

The Alexa Skills Kit allows developers to add new capabilities to Alexa. There are now more than 80,000 Alexa skills worldwide. I recently created one for this blog. Just say, “Alexa, open Curious Product Manager” to listen to the articles on my blog.
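Building a skill mostly means writing request handlers. Below is a minimal sketch using the ASK SDK for Python: a single launch handler that greets the listener when the skill is opened. The welcome text is illustrative and is not the actual Curious Product Manager skill code.

```python
# Minimal custom skill sketch using the ASK SDK for Python (ask-sdk-core).
# It handles the LaunchRequest fired when a user says
# "Alexa, open curious product manager". The speech text is illustrative.
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_request_type


class LaunchRequestHandler(AbstractRequestHandler):
    """Handles the request sent when the skill is opened without a specific intent."""

    def can_handle(self, handler_input):
        return is_request_type("LaunchRequest")(handler_input)

    def handle(self, handler_input):
        speech = "Welcome to Curious Product Manager. Which article would you like to hear?"
        return handler_input.response_builder.speak(speech).ask(speech).response


sb = SkillBuilder()
sb.add_request_handler(LaunchRequestHandler())

# Entry point for an AWS Lambda deployment
lambda_handler = sb.lambda_handler()
```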
The Alexa Voice Service gives the digital assistant its scalability. You don’t need an Amazon device to use Alexa. Any device that has a microphone, an internet connection, and some output capability (voice or video) can become an Alexa-enabled device, thanks to AVS. AVS lets you bring the Alexa persona to any device, be it your smart car, smart refrigerator, or even a smart toaster.
I am personally very excited about this space. I hope Amazon now starts focusing on the voice user experience and makes the assistant more conversational. Here is an interesting article that talks about the future of Alexa.