Amid the fight between AMD and Nvidia (and soon Intel), a new China-based player is emerging: Biren Technology, founded in 2019 and based in Shanghai. At Hot Chips 34, Biren Co-Founder and President Lingjie Xu and Biren CTO Mike Hong took the (virtual) stage to detail the company’s inaugural product: the Biren BR100 general-purpose GPU (GPGPU).
“It is an honor for me to introduce our first-generation computing product: BR100,” said Xu. “The BR100 is dedicated to addressing the challenges of AI training and inference in the data center, with the goals of increasing productivity and reducing total cost of ownership.”
At 1,074mm², the 77 billion-transistor, dual-die Biren BR100 (shown in the header) will be fabricated using TSMC’s 7nm process and will be capable of 256 teraflops at FP32. The die-to-die interconnect provides 896 GB/s of bandwidth. The BR100 comes with up to 64 GB of HBM2E memory (across four stacks) and can handle up to 2.3 TB/s of external I/O bandwidth across its eight BLink connections. This all adds up to a maximum TDP of 550W and a targeted clock rate of 1GHz.
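As a back-of-the-envelope check, the ratio of peak compute to off-chip bandwidth (the "machine balance") can be worked out directly from the figures above. The numbers below are taken from this article; the script is only illustrative arithmetic, not a Biren-published metric.

```python
# Machine-balance arithmetic from the stated BR100 specs (per this article).
fp32_flops = 256e12   # 256 teraflops peak FP32
blink_bw = 2.3e12     # 2.3 TB/s aggregate external I/O over eight BLink links
d2d_bw = 896e9        # 896 GB/s die-to-die interconnect

# FLOPs the chip can issue per byte moved over external I/O: kernels below
# this arithmetic intensity would be bandwidth-bound at the peak FP32 rate.
balance = fp32_flops / blink_bw
print(f"~{balance:.0f} FP32 FLOPs per external byte")              # ~111
print(f"per-link BLink bandwidth: ~{blink_bw / 8 / 1e9:.0f} GB/s")  # ~288 GB/s
```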
Given the BR100’s targeted use cases, the point of comparison was unsurprising: Nvidia’s A100 GPU, which has become the de facto benchmark in the broader accelerator field. The BR100’s peak teraflops, of course, compare extraordinarily favorably to the A100 – 19.5 for the A100, 256 for the BR100 (“one of the fastest GPUs in the world,” Xu said). Looking beyond the flops, Xu said he saw promising results on workloads and benchmarks.
“Compared to the Nvidia A100, at the current stage, we see an average speedup of 2.6× across a wide range of benchmarks in different areas, including computer vision, natural language processing and conversational AI,” he said. “Performance will continue to increase in the coming months as we continue to optimize hardware and software.”
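The article does not say how that 2.6× average is computed; accelerator vendors conventionally report a geometric mean across benchmarks. The sketch below illustrates that convention with made-up per-workload speedups for a computer-vision, an NLP, and a conversational-AI model (all names and numbers are placeholders, not Biren’s data):

```python
import math

# Hypothetical per-benchmark speedups vs. a baseline GPU (placeholder values,
# NOT Biren's measurements).
speedups = {"cv_model": 2.1, "nlp_model": 3.0, "conv_ai_model": 2.8}

# The geometric mean is the standard way to average speedup ratios: it is
# symmetric under inversion, so a 2x win and a 2x loss cancel out exactly.
geomean = math.exp(sum(math.log(s) for s in speedups.values()) / len(speedups))
print(f"geomean speedup: {geomean:.2f}x")  # prints "geomean speedup: 2.60x"
```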
“We had two design goals for the BR100,” Hong said. “The first: it must deliver a petaflops of compute. The second: it should be a GPGPU, not a pure AI accelerator.”
With that, back to the flops for a moment: the BR100 supports FP32, BF16, FP16, INT32, INT16 and others – but there are two additional points to note. First, the BR100 does not support FP64 (“We decided to dedicate the chip area to our target markets and use cases,” Xu commented); second, the BR100 supports a new 24-bit data type called TF32+. And with 1,024 teraflops of performance at BF16, the BR100 clears Biren’s petaflops design goal.
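Bit width explains much of that throughput ladder: halving the operand size roughly doubles (or better) the achievable rate per unit of silicon. The sketch below tabulates the sign/exponent/mantissa splits of the standard formats named above; note that the 24-bit split shown for TF32+ is an assumption on our part (Biren has said only that it is a 24-bit type), and the petaflops check uses the article’s BF16 figure.

```python
# Floating-point layouts as (sign, exponent, mantissa) bits. FP32/BF16/FP16
# are standard industry definitions; the TF32+ split is an ASSUMPTION, since
# Biren has only disclosed that it is a 24-bit format.
formats = {
    "FP32":  (1, 8, 23),
    "TF32+": (1, 8, 15),  # assumed: FP32-style exponent, 15-bit mantissa
    "BF16":  (1, 8, 7),
    "FP16":  (1, 5, 10),
}
for name, (s, e, m) in formats.items():
    print(f"{name:>6}: {s + e + m:2d} bits (exponent {e}, mantissa {m})")

# Sanity check on the petaflops claim: 1024 TFLOPS at BF16 is 1.024 PFLOPS.
bf16_flops = 1024e12
assert bf16_flops >= 1e15
```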
The BR100 will also be available in another version: the BR104, a single-die variant designed for use in PCIe cards. Xu said Biren is also working with manufacturers to build reference cluster designs. The chip has already been brought up on real silicon. Moreover, “We have already submitted to the latest round of MLPerf inference, and you should be able to see our results in two or three weeks,” Xu said. (Biren is a member of MLCommons.)
Biren Technology has launched the Hearten 8-way OAM server in partnership with Inspur. The companies plan to start sampling the hardware in the fourth quarter of this year.
The devices will ship with Biren’s software platform and programming model, called BIRENSUPA. “Developers familiar with CUDA (from Nvidia) can easily write code for SUPA,” Hong said. Supported AI frameworks include PyTorch, TensorFlow and PaddlePaddle, and the company also provides an OpenCL compiler. The dual-die BR100 appears to the software layer as a single GPU.
Since its Series B funding round, Biren has raised over CNY 5 billion (~US$730 million).