Amazon’s cloud-based voice service Alexa is about to become much more powerful, as the Amazon Alexa team has migrated the vast majority of its inference workloads from GPU-based machines to Amazon EC2 Inf1 instances.
These new instances are powered by AWS Inferentia, and the move has resulted in 25% lower end-to-end latency and 30% lower cost compared to GPU-based instances for Alexa’s text-to-speech workloads.
As a result of the switch to EC2 Inf1 instances, Alexa engineers will now be able to start using more complex algorithms to improve the overall experience for owners of the new Amazon Echo and other Alexa-powered devices.
In addition to Amazon Echo devices, more than 140,000 models of smart speakers, lights, plugs, smart TVs, and cameras are powered by Amazon’s cloud-based voice service. Every month, tens of millions of customers interact with Alexa to control their home devices, listen to music and the radio, stay informed, or be educated and entertained with the more than 100,000 Alexa Skills available for the platform.
In a press release, AWS Technical Evangelist Sébastien Stormacq explained why the Amazon Alexa team decided to move its inference workloads off GPU-based instances, saying:
“Alexa is one of the most popular hyperscale machine learning services in the world, with billions of inference requests every week. Of Alexa’s top three inference workloads (ASR, NLU, and TTS), the TTS workloads were initially run on GPU-based instances. But the Alexa team decided to move to Inf1 instances as quickly as possible to improve the customer experience and reduce the cost of processing the service.”
AWS Inferentia
AWS Inferentia is a custom chip created by AWS to accelerate machine learning inference workloads while optimizing their cost.
Each chip contains four NeuronCores, and each core implements a high-performance systolic-array matrix-multiply engine that greatly accelerates deep learning operations such as convolution and transformers. NeuronCores also come equipped with a large on-chip cache that cuts external memory accesses, dramatically reducing latency while increasing throughput.
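To see why a matrix-multiply engine helps with both convolutions and transformers, note that each of these operations can be expressed as large matrix multiplications. The following is an illustrative sketch in plain NumPy (not AWS or Neuron code, and the helper names are hypothetical), showing attention as two matmuls and convolution rewritten as a single matmul via the common im2col trick:

```python
# Illustrative sketch (not AWS code): both transformer attention and
# convolution reduce to large matrix multiplications, the operation a
# systolic matrix-multiply engine is built to accelerate.
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: two matmuls dominate the cost."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # matmul 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # matmul 2

def conv2d_as_matmul(image, kernel):
    """2-D convolution via im2col: sliding windows become one matmul."""
    kh, kw = kernel.shape
    h, w = image.shape
    patches = np.array([image[i:i + kh, j:j + kw].ravel()
                        for i in range(h - kh + 1)
                        for j in range(w - kw + 1)])        # im2col
    return (patches @ kernel.ravel()).reshape(h - kh + 1, w - kw + 1)

Q = K = V = np.random.rand(4, 8)
print(attention(Q, K, V).shape)                                        # (4, 8)
print(conv2d_as_matmul(np.random.rand(6, 6), np.ones((3, 3))).shape)   # (4, 4)
```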
For users who want to take advantage of AWS Inferentia, the custom chip can be used natively from popular machine learning frameworks, including TensorFlow, PyTorch, and MXNet, via the AWS Neuron Software Development Kit (SDK).
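As a rough idea of the workflow, here is a minimal sketch of compiling a PyTorch model for Inferentia, assuming the torch-neuron package from the AWS Neuron SDK is installed on an Inf1 instance; exact API names, signatures, and supported versions may differ between SDK releases, so treat this as an outline rather than a definitive recipe:

```python
# Sketch: compiling a PyTorch model for Inferentia with the AWS Neuron SDK.
# Assumes torch-neuron is installed on an Inf1 instance (APIs may vary by release).
import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.zeros([1, 3, 224, 224])          # representative input shape

# Trace/compile the model so supported operators run on the NeuronCores.
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save("resnet50_neuron.pt")

# The compiled artifact loads and runs like any TorchScript model.
loaded = torch.jit.load("resnet50_neuron.pt")
print(loaded(example).shape)                     # torch.Size([1, 1000])
```

The same pattern applies on the TensorFlow and MXNet sides, where the Neuron SDK provides analogous compile steps that target the NeuronCores.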
In addition to the Alexa team, Amazon Rekognition is also embracing the new chip: running models such as object classification on Inf1 instances resulted in eight times lower latency and doubled performance compared to running the same models on GPU instances.