Space to Watch: Specialized Hardware for AI Inference
Could there be an opening for smaller, more nimble companies to compete with NVIDIA by designing more efficient hardware to run transformer models? What multiples of performance per watt and performance per dollar would a challenger need to make this interesting?
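One rough way to frame that question is to roll the two multiples up into cost per token served. The sketch below is a back-of-envelope model with made-up, illustrative numbers (prices, wattage, throughput, and lifetime are all assumptions, not vendor figures), comparing an incumbent GPU against a hypothetical challenger with 3x performance per dollar and 3x performance per watt:

```python
# Back-of-envelope TCO comparison for an inference deployment.
# All numbers are illustrative placeholders, not vendor specs.

def cost_per_token(chip_price_usd, power_watts, tokens_per_sec,
                   lifetime_years=3, electricity_usd_per_kwh=0.10):
    """Amortized hardware cost plus electricity, divided by tokens served."""
    seconds = lifetime_years * 365 * 24 * 3600
    tokens = tokens_per_sec * seconds
    energy_kwh = (power_watts / 1000) * (lifetime_years * 365 * 24)
    return (chip_price_usd + energy_kwh * electricity_usd_per_kwh) / tokens

# Hypothetical incumbent GPU vs. a challenger offering 3x perf/$ and 3x perf/W
# at the same price and power draw.
incumbent  = cost_per_token(chip_price_usd=30_000, power_watts=400, tokens_per_sec=1_000)
challenger = cost_per_token(chip_price_usd=30_000, power_watts=400, tokens_per_sec=3_000)

print(f"incumbent : ${incumbent:.2e} per token")
print(f"challenger: ${challenger:.2e} per token")
print(f"advantage : {incumbent / challenger:.1f}x cheaper per token")
```

Under these assumptions the hardware cost dominates electricity, so the challenger's cost-per-token advantage lands close to the raw perf-per-dollar multiple.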
Demand for deploying and using AI models currently seems to outstrip capacity: researchers and companies doing innovative work are hungry for chips that can handle their projects. The result looks like a bottleneck, since the industry is relying largely on a single provider.
Existing NVIDIA chips are designed to handle intense training workloads, but they don't do as well on inference. During training the chips are cranking at high utilization; during inference, utilization drops because the workload becomes memory-bandwidth bound, and much of the electricity used to power the chips ends up as wasted heat.
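To see why, here is a minimal sketch of the usual memory-bandwidth argument. The bandwidth, peak-FLOPs, and model-size figures are approximate assumptions (not exact specs), but they show the shape of the problem: at small batch sizes, every weight has to be streamed from memory for each generated token, so the arithmetic units spend most of their time idle.

```python
# Why small-batch inference underutilizes a big training GPU:
# at batch size 1, every weight must be streamed from memory for each
# generated token, so throughput is capped by memory bandwidth, not FLOPs.
# Bandwidth and model-size numbers are approximate / illustrative.

hbm_bandwidth_gb_s = 1_555            # A100 40GB HBM bandwidth, roughly
peak_fp16_tflops   = 312              # A100 tensor-core FP16 peak, roughly
model_params_b     = 13               # hypothetical 13B-parameter model
bytes_per_param    = 2                # FP16 weights

model_gb = model_params_b * bytes_per_param                  # ~26 GB of weights
tokens_per_sec_bound = hbm_bandwidth_gb_s / model_gb         # bandwidth-limited decode rate

flops_per_token = 2 * model_params_b * 1e9                   # ~2 FLOPs per parameter per token
achieved_tflops = tokens_per_sec_bound * flops_per_token / 1e12

print(f"bandwidth-bound decode rate: ~{tokens_per_sec_bound:.0f} tokens/sec")
print(f"implied compute utilization: ~{achieved_tflops / peak_fp16_tflops:.1%}")
```

With these assumptions the chip's compute units are busy well under 1% of the time during single-stream decoding, which is the sense in which the electricity largely turns into heat rather than tokens.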
NVIDIA is focused on training, while AMD and Intel are showing more interest in the AI inference market. The need for training is clear; for inference, however, there is no clear way yet to gauge what customers will be willing to pay. [1]
Another vector to think about is the rapid advancement in algorithms. It reportedly cost OpenAI around $5M to train GPT-3 in 2020; roughly two years later, MosaicML demonstrated GPT-3-quality training for about a tenth of that cost. [2]
MosaicML trained their model on NVIDIA A100 clusters in 2022, while OpenAI trained GPT-3 on V100s, the top-of-the-line product in 2020. The A100 is about 3x faster than the V100, so MosaicML's cost reduction came from both faster hardware and better algorithms; how that balance plays out in the future is uncertain, but interesting to watch. [3] It is also worth noting that Meta is building its own custom inference chip. [4]
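A rough decomposition, taking the ~$5M and ~$500K figures above at face value and attributing a ~3x factor to the A100-vs-V100 speedup from [3], suggests the remaining improvement came from algorithms and systems work:

```python
# Rough decomposition of the ~10x training-cost reduction into hardware
# speedup vs. algorithmic/systems improvements. Figures are the approximate
# ones cited above, not exact accounting.

gpt3_2020_cost_usd   = 5_000_000    # reported order of magnitude for GPT-3 on V100s
mosaic_2022_cost_usd = 500_000      # MosaicML's GPT-3-quality run on A100s [2]

total_improvement  = gpt3_2020_cost_usd / mosaic_2022_cost_usd   # ~10x
hardware_speedup   = 3.0                                         # A100 vs V100, roughly [3]
algorithmic_factor = total_improvement / hardware_speedup        # remainder

print(f"total cost improvement:      ~{total_improvement:.0f}x")
print(f"from faster hardware:        ~{hardware_speedup:.0f}x")
print(f"from algorithms and systems: ~{algorithmic_factor:.1f}x")
```

In other words, under these rough numbers, hardware and algorithms each contributed roughly a 3x factor over those two years.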
What is Groq up to? They know inference, and they are often in the news for incredible speed. A Groq chip has 230MB of on-chip SRAM [5], versus 40GB of HBM on an A100, so most models are far too large to fit on a single Groq chip; the workaround is to link many Groq chips together. The speed is impressive, but because so many more chips must be tied together to run a large model, the hardware price and the electricity to power it go up significantly, by more than 10x. Models keep getting larger, so the cost of running them on Groq infrastructure may become prohibitive.
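To make the scale concrete, here is a minimal sketch of the chip-count math. The model size and precision are illustrative assumptions (Groq's actual deployments differ in precision, sharding, and what gets kept on-chip), but the order of magnitude is the point:

```python
# How many 230 MB-SRAM chips does it take just to hold a large model's weights?
# The model size and precision here are illustrative assumptions.

sram_per_chip_mb = 230
model_params_b   = 70               # hypothetical 70B-parameter model
bytes_per_param  = 2                # FP16

model_mb     = model_params_b * 1e9 * bytes_per_param / 1e6      # ~140,000 MB
chips_needed = model_mb / sram_per_chip_mb

print(f"model weights: ~{model_mb / 1000:.0f} GB")
print(f"chips needed just to hold weights: ~{chips_needed:.0f}")
# For comparison, the same weights span roughly four 40 GB A100s
# (ignoring KV cache and activations).
```

Hundreds of chips per model instance is what drives the hardware and power bill up, even if each individual chip is fast and efficient.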
Many thanks to my partner Ryan Cunningham for putting most of these materials together as we think through possible investments in the space.
[1] https://digitstodollars.com/2023/12/08/is-anyone-going-to-make-money-in-ai-inference/
[2] https://www.databricks.com/blog/gpt-3-quality-for-500k
[3] https://lambdalabs.com/blog/nvidia-a100-vs-v100-benchmarks