Instead of making one chip do all the work, the system splits inference into two phases: Trainium3 handles the "prefill" (processing the entire input prompt in one pass), while Cerebras chips take care of the "decode" (generating output tokens one at a time).
This handoff between chips relies on custom networking designed for low-latency responses.
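The split works because the two phases stress hardware differently: prefill is one compute-heavy pass over the whole prompt, while decode is a sequential loop that repeatedly reads the cached attention state. The toy Python sketch below illustrates that structure only; the class names (`PrefillWorker`, `DecodeWorker`, `KVCache`) are illustrative assumptions, not an actual AWS or Cerebras API.

```python
# Toy sketch of disaggregated inference: a prefill worker processes the
# full prompt once and hands off a KV cache; a decode worker then
# generates tokens one step at a time from that cache. All names here
# are hypothetical, for illustration only.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # A real cache holds per-layer key/value tensors; here we just
    # record the tokens processed so far.
    tokens: list = field(default_factory=list)

class PrefillWorker:
    """Stage 1: runs once over the whole prompt (compute-bound)."""
    def run(self, prompt_tokens):
        return KVCache(tokens=list(prompt_tokens))

class DecodeWorker:
    """Stage 2: emits one token per step (memory-bandwidth-bound)."""
    def step(self, cache):
        # Toy "model": the next token is the current sequence length.
        next_token = len(cache.tokens)
        cache.tokens.append(next_token)
        return next_token

def generate(prompt_tokens, max_new_tokens):
    cache = PrefillWorker().run(prompt_tokens)  # prefill happens once
    decoder = DecodeWorker()
    return [decoder.step(cache) for _ in range(max_new_tokens)]

print(generate([10, 11, 12], 3))  # -> [3, 4, 5]
```

In a real disaggregated deployment the two workers run on different hardware, so the KV cache must be shipped across the interconnect after prefill, which is why the custom networking matters.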
Cerebras's WSE-3 chip is seriously powerful: with 4 trillion transistors and 900,000 AI cores, a CS-3 system can generate up to 1,800 tokens per second on Llama 3.1 8B, roughly 20 times faster than comparable NVIDIA GPU-based solutions on some models.
Meanwhile, AWS's Trainium3 handles the compute-heavy prefill stage, a role well suited to its high-throughput design.
Amazon says Trainium3, along with the future Trainium4, is expected to lead in price-performance versus merchant GPUs.