Cerebras Challenges Nvidia with AI Model on Its Massive Chip

Image: A technician displays Cerebras Systems' Wafer Scale Engine, an unusually large computer chip.

Artificial Intelligence is now ubiquitous, with chatbots responding to queries as if they were all-knowing or creating impressive visuals. These responses are termed inferences, and the extensive software systems generating them operate from vast data centers commonly known as the cloud.

Get ready for a significant change.

Cerebras Systems, known for its groundbreaking wafer-scale computer chip roughly the size of a dinner plate, is preparing to run one of the leading AI models, Meta’s open-source LLaMA 3.1, directly on that chip. The approach could leave conventional GPU-based inference far behind.

Additionally, Cerebras asserts that its inference costs are only one-third of those associated with Microsoft’s Azure cloud services, while consuming just one-sixth of the energy.

“Cerebras Inference is especially attractive for developers of AI applications requiring real-time processing or handling substantial volumes, offering speeds that redefine performance and competitive pricing,” noted Micah Hill-Smith, co-founder and CEO of Artificial Analysis Inc., an independent evaluator of AI models.

Raindrops On The Water

This could set off ripples throughout the AI landscape. As inference becomes quicker and more efficient, developers may extend the possibilities of AI. Applications previously constrained by hardware limitations could now thrive, fostering innovations once deemed unattainable.

In natural language processing, for instance, models could produce more precise and coherent replies. This advancement could transform automated customer service, where grasping the complete context of a conversation is essential for effective responses. Similarly, in healthcare, AI models could analyze larger datasets faster, leading to quicker diagnoses and tailored treatment plans.

In the corporate arena, the capability to conduct inference at unprecedented speeds presents new possibilities for real-time analytics and decision-making. Businesses could implement AI systems that assess market trends, consumer behavior, and operational data in real-time, enabling them to adapt swiftly and accurately to market fluctuations. This could spur a fresh wave of AI-driven business strategies, with companies leveraging real-time insights for a competitive advantage.

However, whether this represents a light rain or a torrential downpour remains uncertain.

As AI workloads shift toward inference and away from training operations, the demand for more efficient processors becomes crucial. Numerous companies are addressing this challenge.

“Cerebras’ wafer-scale integration presents a unique approach that mitigates some of the limitations faced by generic GPUs and shows considerable promise,” stated Jack Gold, founder of J. Gold Associates, a technology analysis firm. He cautions, however, that Cerebras is still a startup amidst larger competitors.

Inference As A Service

Cerebras' AI inference service not only accelerates the execution of AI models but may also transform how businesses deploy and interact with AI in practical applications.

In standard AI inference workflows, large language models such as Meta’s LLaMA or OpenAI’s GPT-4 are hosted in data centers, where application programming interfaces (APIs) call on them to generate responses to user queries. These models are massive and demand substantial computational resources to run efficiently. Graphics processing units (GPUs), currently the backbone of AI inference, do the heavy lifting but often struggle with the constant data transfer between the model’s memory and compute cores.
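For readers who have not worked with these services, here is a minimal sketch of what a conventional hosted-inference call looks like from the application side, using the widely adopted OpenAI-style chat-completions interface; the endpoint, model name, and key are placeholders rather than details from the article.

```python
# Minimal sketch of a conventional cloud-inference call (all identifiers are placeholders).
# The application sends a prompt to a data-center API; GPUs behind that API run the
# model and return the generated tokens.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-cloud.com/v1",  # hypothetical hosted endpoint
    api_key="YOUR_API_KEY",                       # placeholder credential
)

response = client.chat.completions.create(
    model="llama-3.1-8b",                         # illustrative hosted-model identifier
    messages=[{"role": "user",
               "content": "Summarize wafer-scale inference in one sentence."}],
)

print(response.choices[0].message.content)
```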

Cerebras’ new inference service, however, stores all layers of a model—currently the 8 billion parameter and 70 billion parameter versions of LLaMA 3.1—directly on the chip. When a prompt is sent, the data can be processed almost instantaneously since it doesn’t have to traverse long distances within the hardware.

For instance, while a state-of-the-art GPU might process around 260 tokens (data pieces such as words) per second for an 8-billion-parameter LLaMA model, Cerebras claims it can manage 1,800 tokens per second. This level of performance, verified by Artificial Analysis Inc., sets a new benchmark for AI inference.

“Cerebras is achieving speeds significantly faster than GPU-based solutions for Meta’s LLaMA 3.1 8B and 70B models,” stated Hill-Smith. “We are recording speeds above 1,800 output tokens per second for LLaMA 3.1 8B and over 446 output tokens per second for LLaMA 3.1 70B—new records in these benchmarks.”
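To put those rates in perspective, here is a quick back-of-the-envelope comparison using only the throughput figures quoted above; the 500-token reply length is an illustrative assumption.

```python
# Rough generation-time comparison using the throughput figures quoted above.
REPLY_TOKENS = 500          # illustrative medium-length response

gpu_rate = 260              # tokens/second (state-of-the-art GPU, per the article)
cerebras_rate = 1_800       # tokens/second (Cerebras claim for LLaMA 3.1 8B)

print(f"GPU:      {REPLY_TOKENS / gpu_rate:.1f} s")       # ~1.9 s
print(f"Cerebras: {REPLY_TOKENS / cerebras_rate:.2f} s")  # ~0.28 s
```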

Cerebras is launching its inference service via an API to its own cloud, but it is already engaging with major cloud providers to deploy its model-equipped chips elsewhere. This opens a vast new market for the company, which has faced challenges in encouraging adoption of its Wafer Scale Engine.

Breaking The Bottleneck

The speed of inference today is hampered by bottlenecks in the connections between GPUs and memory/storage. The electrical pathways linking memory to cores can only transmit a limited amount of data at any given time. Although electrons travel rapidly in conductors, the actual data transfer rate is constrained by the frequency at which signals can be reliably sent and received, influenced by signal degradation, electromagnetic interference, material properties, and the length of the wires over which data must travel.

In conventional GPU setups, model weights are stored in memory separate from processing units. This separation necessitates constant data transfer between memory and compute cores via narrow wires during inference. Nvidia and others have explored various configurations to minimize this distance, such as stacking memory vertically atop compute cores in a GPU package.

Cerebras’ approach fundamentally alters this paradigm. Instead of etching many small chips onto a silicon wafer and then cutting the wafer into individual processors, Cerebras etches up to 900,000 cores onto a single wafer and keeps it whole, eliminating the need for external wiring between separate chips. Each core on the Wafer Scale Engine combines computation (processing logic) and memory (static random access memory), forming a self-sufficient unit that can function independently or collaborate with other cores.

The model weights are distributed across these cores, with each core retaining a portion of the total model. As a result, no single core contains the entire model; instead, the model is partitioned and distributed across the entire wafer.

“We effectively load the model weights onto the wafer, so they are right next to the core,” explains Andy Hock, Cerebras’ senior vice president of product and strategy.

This configuration facilitates much quicker data access and processing since the system does not need to shuttle data back and forth over relatively slow interfaces.
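As a rough illustration of that partitioning idea (a toy sketch, not Cerebras’ actual scheduler or dataflow), the snippet below splits a layer’s weight matrix across a handful of simulated cores, each of which multiplies the activations against the weight shard it holds locally:

```python
# Toy illustration of weight partitioning across many cores.
import numpy as np

N_CORES = 8                       # tiny stand-in for the wafer's ~900,000 cores
D_IN, D_OUT = 1024, 4096          # illustrative layer dimensions

rng = np.random.default_rng(0)
W = rng.standard_normal((D_OUT, D_IN))   # full layer weight matrix
x = rng.standard_normal(D_IN)            # incoming activation vector

# Each simulated "core" keeps its own row-slice of the weights next to its compute.
shards = np.array_split(W, N_CORES, axis=0)

# Every core multiplies its local shard by the activations; no weights move between cores.
partials = [shard @ x for shard in shards]

# Stitching the per-core partial outputs together reproduces the monolithic result.
y = np.concatenate(partials)
assert np.allclose(y, W @ x)
```

The only point of the sketch is that once each shard lives beside its own compute, the full-layer result can be assembled without shuttling weights over an external memory bus.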

According to Cerebras, its architecture can deliver performance “10 times faster than anything else on the market” for inference on models like LLaMA 3.1, though this claim requires further validation. Importantly, Hock asserts that due to the memory bandwidth limitations in GPU architectures, “there’s actually no number of GPUs that you could stack up to match our speed” for these inference tasks.

By optimizing for inference on large models, Cerebras is positioning itself to meet a rapidly expanding market demand for fast, efficient AI inference capabilities.

Picking Nvidia’s Lock

One reason for Nvidia's stronghold on the AI market is the prevalence of its Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model. CUDA offers a software layer that grants developers direct access to the GPU’s virtual instruction set and parallel computational elements.

For years, Nvidia’s CUDA programming environment has served as the standard for AI development, fostering a vast ecosystem of tools and libraries. This has led to a scenario where developers are often confined to the GPU ecosystem, even when alternative hardware solutions might provide better performance.

Cerebras’ Wafer Scale Engine (WSE) features a fundamentally different architecture compared to traditional GPUs, necessitating adaptations or rewrites of software to fully leverage its capabilities. Developers and researchers must familiarize themselves with new tools and potentially new programming paradigms to effectively work with the WSE.

Cerebras has attempted to mitigate this by supporting high-level frameworks like PyTorch, easing the transition for developers to utilize its WSE without mastering a new low-level programming model. It has also created its own software development kit to enable lower-level programming, potentially presenting an alternative to CUDA for specific applications.

By offering an inference service that is not only faster but also user-friendly—allowing developers to interact via a straightforward API, similar to other cloud services—Cerebras is enabling organizations new to the AI space to bypass the complexities of CUDA while still achieving top-tier performance.
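If the service exposes an OpenAI-compatible endpoint, as the "straightforward API" framing suggests (an assumption on my part, not something the article confirms), moving an existing application over could be as small a change as repointing the same client; every identifier below is a placeholder.

```python
# Sketch of provider portability under an assumed OpenAI-compatible interface.
# Only the base URL, key, and model identifier change; the application code does not.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_PROVIDER_KEY",                           # placeholder credential
)

reply = client.chat.completions.create(
    model="llama-3.1-70b",                                 # illustrative model identifier
    messages=[{"role": "user",
               "content": "Classify this support ticket: 'My order never arrived.'"}],
)
print(reply.choices[0].message.content)
```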

This aligns with an industry trend towards open standards, granting developers the freedom to select the most suitable tools for their needs rather than being restricted by their existing infrastructure.

It’s All About Context

The implications of Cerebras’ innovation, if the claims hold up and production can scale, are significant. Primarily, consumers will benefit from drastically quicker responses. Whether it’s a chatbot handling customer inquiries, a search engine fetching information, or an AI-powered assistant producing content, reduced latency will create a smoother, more immediate user experience.

However, the advantages may extend beyond just speed. A major challenge in AI today is the concept of the “context window”—the volume of text or data a model can consider simultaneously when generating an inference. Inference processes requiring extensive context, such as summarizing lengthy documents or analyzing intricate datasets, face limitations.

Larger context windows force the system to hold and repeatedly access more data in memory at once, which raises memory bandwidth requirements. As the model processes each token in the context, it needs rapid retrieval and manipulation of the model weights and intermediate results stored in memory.

In high-inference applications with numerous concurrent users, the system must manage multiple inference requests simultaneously, compounding the memory bandwidth requirements since each user’s request necessitates access to model weights and intermediate computations.

Even the most advanced GPUs, like Nvidia’s H100, can transfer only about 3 terabytes of data per second between high-bandwidth memory and compute cores. This is significantly lower than the 140 terabytes per second required to run a large language model efficiently at high throughput without encountering substantial bottlenecks.
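That 140-terabyte-per-second figure is consistent with simple arithmetic: if every weight must be read from memory for each generated token, the required bandwidth is roughly parameters × bytes per weight × tokens per second. Here is a sketch under one set of assumptions (16-bit weights, the 70-billion-parameter model, a 1,000-token-per-second target) that reproduces the article's numbers:

```python
# Back-of-the-envelope memory-bandwidth requirement for autoregressive inference.
# Assumes every weight is read once per generated token and 2-byte (FP16/BF16) weights.
params = 70e9            # LLaMA 3.1 70B parameters
bytes_per_weight = 2     # FP16/BF16
target_rate = 1_000      # desired tokens per second

required_bw = params * bytes_per_weight * target_rate         # bytes per second
print(f"Required bandwidth: {required_bw / 1e12:.0f} TB/s")   # -> 140 TB/s

hbm_bw = 3e12            # ~3 TB/s between HBM and compute on an H100, per the article
ceiling = hbm_bw / (params * bytes_per_weight)
print(f"Single-GPU bandwidth-bound ceiling: ~{ceiling:.0f} tokens/s")  # -> ~21 tokens/s
```

The 70B-at-1,000-tokens-per-second pairing is my assumption; the article itself gives only the 3 TB/s and 140 TB/s endpoints.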

Hock counters: “Our effective bandwidth between memory and compute isn’t just 140 terabytes; it’s 21 petabytes per second.”

Of course, without industry benchmarks, it's difficult to assess the company's claims, and independent testing will be crucial to validating this performance.

By eliminating the memory bottleneck, Cerebras’ system can accommodate much larger context windows and enhance token throughput. If the performance assertions are accurate, this could revolutionize applications that require extensive information analysis, such as legal document review, medical research, or large-scale data analytics. With the capacity to process more data in less time, these applications can function more efficiently.

More Models To Come

Hock indicated that the company will soon make the larger 405-billion-parameter LLaMA 3.1 model available on its WSE, followed by models from Mistral and Cohere’s Command-R. Organizations with proprietary models (such as OpenAI) are welcome to work with Cerebras to load their models onto the chips as well.

Moreover, since Cerebras’ solution is offered as an API-based service, it can be easily integrated into existing workflows. Organizations that have already invested in AI development can switch to Cerebras’ service without needing to overhaul their entire infrastructure. This ease of adoption, coupled with the anticipated performance improvements, could position Cerebras as a strong contender in the AI landscape.

“However, until we have more concrete real-world benchmarks and operations at scale,” cautioned analyst Gold, “it’s too early to determine just how superior it will be.”
