On Thursday, an Amazon AWS blog post announced that the company has moved the majority of the cloud processing for its Alexa personal assistant off of Nvidia GPUs and onto its own Inferentia application-specific integrated circuit (ASIC). Amazon's Sébastien Stormacq describes Inferentia's hardware design as follows:
AWS Inferentia is a custom chip built by AWS to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput.
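For readers unfamiliar with the term, a systolic matrix multiply engine is fixed-function hardware dedicated to the blocked matrix multiplications at the heart of convolution and transformer layers. The Python sketch below is purely conceptual – the tile size, loop order, and `blocked_matmul` helper are illustrative assumptions, not Inferentia's actual design – but it shows the kind of operation a NeuronCore accelerates.

```python
# Conceptual sketch only: the kind of blocked matrix multiply that a systolic
# array engine accelerates in hardware. Tile size and loop structure here are
# illustrative assumptions, not Inferentia's actual design.
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Multiply a (MxK) by b (KxN) one tile at a time, accumulating partial sums."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Each tile-by-tile product maps onto a fixed-size hardware array;
                # partial results accumulate in place, much like on-chip accumulators.
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

# Quick check against NumPy's reference implementation
a = np.random.rand(64, 128).astype(np.float32)
b = np.random.rand(128, 96).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-4)
```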
When an Amazon customer uses the Alexa personal assistant, usually via an Echo or Echo Dot, very little of the processing is done on the device itself. The workload for a typical Alexa request looks something like this (a rough code sketch follows the list):
- A human speaks to an Amazon Echo, saying: “Alexa, what is the special ingredient in Earl Grey tea?”
- The Echo uses its own on-board processing to detect the wake word – Alexa
- The Echo streams the request to Amazon data centers
- Within the Amazon data center, the voice stream is converted into phonemes (inference AI workload)
- Still in the data center, phonemes are converted into words (inference AI workload)
- Words are assembled into phrases (inference AI workload)
- Phrases are distilled into intent (inference AI workload)
- The intent is routed to an appropriate fulfillment service, which returns its response as a JSON document
- The JSON document is parsed, including the text of Alexa’s reply
- The text form of Alexa’s reply is converted into natural-sounding speech (inference AI workload)
- Natural speech audio is streamed back to the Echo device for playback: “It’s bergamot orange oil.”
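The code sketch below is a hypothetical illustration of that flow – every function name and return value is made up for clarity, and each `*_infer` stub stands in for a neural-network inference job running in the AWS data center rather than on the Echo itself.

```python
# Hypothetical sketch of the Alexa request flow described above.
# Names and placeholder return values are illustrative only; each *_infer stub
# represents an inference workload running in the AWS data center.

def speech_to_phonemes_infer(audio: bytes) -> list:        # inference workload
    return ["W", "AH", "T", "..."]                          # placeholder phonemes

def phonemes_to_words_infer(phonemes: list) -> list:        # inference workload
    return ["what", "is", "the", "special", "ingredient", "in", "earl", "grey", "tea"]

def words_to_phrases_infer(words: list) -> str:             # inference workload
    return " ".join(words)

def phrases_to_intent_infer(phrase: str) -> dict:           # inference workload
    return {"intent": "IngredientQuery", "subject": "earl grey tea"}

def route_to_fulfillment(intent: dict) -> dict:             # conventional service call, returns JSON
    return {"response_text": "It's bergamot orange oil."}

def text_to_speech_infer(text: str) -> bytes:               # inference workload
    return text.encode("utf-8")                              # placeholder for synthesized audio

def handle_alexa_request(audio_stream: bytes) -> bytes:
    """Everything below runs in the cloud; only wake-word detection happens on the Echo."""
    phonemes = speech_to_phonemes_infer(audio_stream)
    words = phonemes_to_words_infer(phonemes)
    phrase = words_to_phrases_infer(words)
    intent = phrases_to_intent_infer(phrase)
    reply = route_to_fulfillment(intent)                     # JSON document with the reply text
    return text_to_speech_infer(reply["response_text"])      # audio streamed back to the Echo

print(handle_alexa_request(b"...audio captured after the wake word..."))
```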
As you can see, almost all of the real work done to fulfill an Alexa request happens in the cloud – not on the Echo or Echo Dot device itself. And the vast majority of that cloud work is done not by traditional if-then logic but by inference – the answer-providing side of neural network processing.
According to Stormacq, shifting this inference workload from Nvidia GPU hardware to Amazon’s own Inferentia chip resulted in 30 percent lower cost and a 25 percent improvement in end-to-end latency for Alexa’s text-to-speech workloads. Amazon isn’t the only company using the Inferentia processor – the chip powers Amazon AWS Inf1 instances, which are available to the general public and compete with Amazon’s GPU-powered G4 instances.
Amazon’s AWS Neuron software development kit allows machine-learning developers to use Inferentia as a target for popular frameworks, including TensorFlow, PyTorch, and MXNet.
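As a rough idea of what that looks like in practice, here is a minimal sketch of compiling a PyTorch model for Inferentia through the Neuron SDK's `torch_neuron` integration. Exact package names, APIs, and the example model are assumptions that may vary by SDK version, so treat this as an illustration rather than canonical AWS sample code.

```python
# Minimal sketch: compiling a PyTorch model for an Inf1 (Inferentia) target.
# Assumes the torch-neuron package from the AWS Neuron SDK is installed;
# check current AWS Neuron documentation for exact package and API versions.
import torch
import torch_neuron                     # AWS Neuron SDK integration for PyTorch
from torchvision import models

# Load a pretrained model and build an example input on CPU
model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Trace and compile the model for Inferentia; unsupported operators
# typically fall back to running on the host CPU
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Save the compiled artifact for deployment on an Inf1 instance
model_neuron.save("resnet50_neuron.pt")
```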