Two weeks ago marked the end of the OCP 2022 Global Summit at the San Jose Convention Center. I personally had an incredible time connecting with the community, but OCP isn’t just about connecting; it is really about having the opportunity to see technology trends across the leading consumers and producers of Silicon and Systems: what is important today, and what they are building for tomorrow. Given that, I wanted to write up the key takeaways I saw (with the caveat that I could only attend so many sessions). Many great sessions will be released after the fact by the OCP Foundation, and when I find awesome nuggets in them, I promise to write those up, too.
So starting with key announcements and data from the Hyperscalers:
- Meta’s Alexis Bjorlin spoke about contributions in the domain of a new rack specification (ORV3), needed due to the thermal design requirements of AI (GPU- and ASIC-based solutions) and the higher-power CPU solutions forthcoming. They also contributed their next AI system specification, Grand Teton, which delivers 4x the performance and 2x the network bandwidth of their previous Zion EX solution, and their storage platform for AI, Grand Canyon, which offers security, power, and performance improvements and uses HDDs for AI! I followed up with some folks: the HDD usage is in the storage platform, not the main AI servers; they are not using HDDs for the AI training or inference work itself (which was how I originally interpreted it; thanks to Vineet for helping clarify!). Grand Teton is primarily SSD/Flash. For those listening in the room, the key nugget was definitely that AI performance, at least for DLRMs at Meta, is gated SIGNIFICANTLY by network I/O (see the back-of-envelope sketch after this list). If we want to “feed the beast” we need faster network solutions, which isn’t necessarily a fancy fabric: it could be standard Ethernet with better congestion management and control, plus likely (eventually, no timeline specified) optics to the node to manage the power footprint of such high-bandwidth SerDes. Alexis used to run the Silicon Photonics division for BRCM, and for Intel before that, so it is no surprise that optics is on her mind, but this was an impressive case study for why the future of AI requires better network design and management.
- Dr. Partha Ranganathan of Google spoke passionately about the inflection point in our industry (at one point calling it “a career-making time”), where the rate at which systems get cheaper and faster is slowing just as computational demand is increasing due to cloud computing, machine learning, data analytics, video, and a massively more intelligent IoT edge. What we have done historically cannot achieve the scale we need, and it is an exciting time that will require leaders to come together. He spoke about four major contribution swimlanes to OCP:
  1. Server: DC-MHS, done in conjunction with Dell, HP, Intel, and Microsoft. This contribution helps ensure we build modular building blocks for every server to maximize reuse, design isolation, supply optionality, and reliability, with standardized management via the OpenBMC and Redfish contributions.
  2. Security: a standardized RoT implementation (Caliptra, donated in collaboration with AMD, Microsoft, and NVIDIA). This reusable IP block for RoT measurement is being actively hardened in the ecosystem to ensure chip-level attestation at the package or SoC is done effectively.
  3. Reliability: along with ARM, Intel, Meta, Microsoft, and NVIDIA, they are leading the effort to create metrics for silent data errors and corruption that the broader industry can track. Google is contributing execution frameworks and test suites for environments with faulty devices (a minimal sketch of the core idea follows after this list). I have spoken about the DCDIAG effort before, when I was at Intel; this is an important approach for the industry to take as complexity rises, because better design and management of aging systems requires automation and testing, the same way tune-ups and proactive maintenance occur on cars.
  4. Sustainability: specifically, how Google is sharing best practices with OCP and the broader community to standardize sustainability measurement and optimization techniques.
- Microsoft’s Zaid Kahn spoke on similar topics to Google (given that Caliptra and DC-MHS are contributions they coauthored), but went further and focused more specifically on the future of security for Open Compute projects. They announced Hydra, a secure BMC SoC developed with Nuvoton, which enables fine-grained control of BMC interface authorization, so only trustworthy devices can be granted a debug interface, and that access is temporary. They also announced Project Kirkland, which demonstrates how, using firmware only, one can update the TPM and CPU RoT in a way that prevents substitution attacks, interposing, and eavesdropping. On the topic of modularity, the Mt. Shasta design was contributed: an ORV3-compliant design that supports high-power devices with a 48V power feed and hot-swappable modules.
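To make the “feed the beast” point from Meta’s talk concrete, here is a rough back-of-envelope sketch in Python. Every number in it is a hypothetical placeholder I chose for illustration (not a figure anyone shared at the summit); the point is only the shape of the comparison between the time an accelerator spends on dense compute and the time it spends exchanging embedding data over the network.

```python
# Back-of-envelope: when does the network gate a DLRM-style training step?
# Every number below is a hypothetical placeholder, not a figure from the talk.

accel_flops_per_s  = 300e12   # ~300 TFLOP/s sustained per accelerator (assumed)
flops_per_sample   = 5e9      # dense compute per training sample (assumed)
embedding_bytes    = 0.5e6    # embedding data exchanged per sample, in bytes (assumed)
samples_per_step   = 4096     # local batch size per accelerator (assumed)
nic_gbps           = 200      # NIC line rate per accelerator, Gbit/s (assumed)
network_efficiency = 0.6      # achievable fraction of line rate under congestion (assumed)

# Time the accelerator spends on dense compute for one step.
compute_s = samples_per_step * flops_per_sample / accel_flops_per_s

# Time spent moving embedding data (e.g. in an all-to-all) for the same step.
effective_bytes_per_s = nic_gbps * 1e9 / 8 * network_efficiency
network_s = samples_per_step * embedding_bytes / effective_bytes_per_s

print(f"compute per step : {compute_s * 1e3:6.1f} ms")
print(f"network per step : {network_s * 1e3:6.1f} ms")
print("step gated by:", "network" if network_s > compute_s else "compute")
```

With these made-up numbers the embedding exchange takes roughly twice as long as the math, which is the gating behavior Alexis described: better congestion management effectively raises network_efficiency, and more bandwidth (eventually optics to the node) raises nic_gbps.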
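On the silent data error point from Google’s talk, the core idea behind fleet screening (the DCDIAG-style automation and the test suites being contributed) can be boiled down to known-answer testing: run a deterministic workload across the machine and flag anything that silently disagrees with the expected result. The Python below is a minimal sketch of that idea and nothing more; real screeners pin work to specific cores, sweep data patterns and operating corners, and run continuously across production fleets.

```python
# Minimal sketch of a known-answer screen for silent data errors: run the same
# deterministic kernel on many worker processes and flag any digest that
# disagrees with the golden result. Real screeners pin work to specific cores,
# sweep data patterns and voltage/frequency corners, and run far longer.
import hashlib
import multiprocessing as mp

def known_answer_workload(_worker_id: int) -> str:
    """Deterministic integer-mixing kernel; any corrupted bit changes the digest."""
    acc = 1
    for i in range(1, 200_000):
        acc = (acc * 6364136223846793005 + i) & 0xFFFFFFFFFFFFFFFF
        acc ^= acc >> 29
    return hashlib.sha256(acc.to_bytes(8, "little")).hexdigest()

if __name__ == "__main__":
    golden = known_answer_workload(0)             # reference digest
    workers = mp.cpu_count()
    with mp.Pool(processes=workers) as pool:      # one worker per logical CPU
        digests = pool.map(known_answer_workload, range(workers))
    suspects = [w for w, d in enumerate(digests) if d != golden]
    print("silent-data-error suspects:", suspects if suspects else "none detected")
```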
In terms of manufacturers, whether Silicon or Systems, the theme was Sustainability. Samsung spoke about their renewable energy and water reduction goals. Intel and the major OxMs (HP, Dell, Mitac, etc.) showed up in modularity (leadership roles with DC-MHS and DC-SCM) and open system firmware, to ensure a circular economy/second life for servers can exist and to reduce embodied carbon by amortizing it over a longer period of time (given that fewer server/system components require upgrades when new CPUs, memory technologies, etc. come to market).
There was a lot of great discussion and debate on the future of AI fabrics (started by Alexis), and Ram Velaga was quite eloquent in his advocacy for Ethernet as the fabric for HPC and AI, bringing in the brilliant Dr. Mark Handley to speak about innovation in congestion management on Ethernet to unlock best-in-class performance (a toy sketch of that style of congestion control follows below). There was a fair amount of push on the interrelationship between compute, network, and storage for different workload scaling, and some poking at the proprietary solutions addressing this (InfiniBand and NVLink specifically, which seems fitting at a conference that co-hosted the CXL Consortium and clearly advocates for coherent memory/accelerator pooling to move to open, standard interfaces). Finally, there were several sessions on optics and innovation in packaging (from Broadcom, Marvell, and Intel), demonstrated in person at the Celestica and Ragile booths, which again reinforces the attempt to use open standards to drive innovation so vendors don’t make big investments on bets that won’t have market alignment.
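I did not capture Dr. Handley’s proposal in enough detail to reproduce it here, so as a stand-in for the flavor of congestion management being debated, below is a toy DCTCP-style sender in Python: it tracks the fraction of ECN-marked acknowledgements and backs off in proportion to that fraction instead of halving on any congestion signal. This is my own illustrative simplification, not anything contributed or presented at the summit.

```python
# Toy DCTCP-style congestion controller: the sender tracks the fraction of
# ECN-marked acknowledgements (alpha) and backs off proportionally to it,
# instead of halving the window on any sign of congestion.
# Illustrative simplification only (no slow start, timers, or pacing).

class DctcpLikeSender:
    def __init__(self, init_cwnd: float = 10.0, gain: float = 1.0 / 16):
        self.cwnd = init_cwnd      # congestion window, in packets
        self.alpha = 0.0           # moving estimate of the ECN-marked fraction
        self.gain = gain           # EWMA gain 'g' from the DCTCP paper

    def on_ack_window(self, acked: int, ecn_marked: int) -> None:
        """Update state once per window of acknowledgements."""
        marked_fraction = ecn_marked / max(acked, 1)
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.gain) * self.alpha + self.gain * marked_fraction
        if ecn_marked:
            # Back off in proportion to the extent of congestion, not a blunt halving.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0       # additive increase when the path is clean

# Example: mild persistent marking trims the window gently instead of collapsing it.
sender = DctcpLikeSender()
for _ in range(20):
    sender.on_ack_window(acked=10, ecn_marked=2)
print(f"cwnd={sender.cwnd:.1f}, alpha={sender.alpha:.2f}")
```

The design point worth noticing is that the window shrinks gently under light marking and sharply only under heavy marking, which keeps utilization high while keeping queues short: the kind of behavior Ethernet needs if it is going to carry HPC and AI traffic.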
Conversations were also vibrant about the expansion of alternate architectures to x86 in the server ecosystem (teams from ARM, ARM-based server-class CPU vendors, and RISC-V server-class CPU vendors were all there), and about open networking solutions (specifically a large-scale SONiC workshop at the event). The feeling I got from the collective sessions was that SONiC’s time has come; while it still has a long way to go for feature parity, optimization, and usability compared to proprietary NOS options, the partnership with the Linux Foundation for a more open and agile contribution model puts SONiC on the right track to real adoption in the industry. On the alternate-architecture point, I feel a certain amount of conflation is occurring. Out-of-order execution exists on both ARM and x86 products, as do SIMD execution units and branch prediction. From a core perspective, a lot has converged between RISC and CISC architectures since RISC was introduced to the world; where I think particular implementations have shone is in making the uncore portion of a server-class CPU power efficient. That is where we see real divergence in certain players (e.g. Ampere). There are still a lot of things I personally want to see in the “many cores, more power-efficient architecture” processing units (a personal frustration has been that security is an “upsell” on top-bin parts rather than being available across the SKU stack; to me this is like saying one only gets a key for a Lexus because Toyotas are too cheap to warrant them…ummm…tell that to the person getting carjacked). Security has to be ubiquitous for consumers to trust their providers, and hardware companies need to view these solutions as foundational, without significant performance impact as a foregone conclusion.
Anyway, those are my quick notes from the edge…it felt great to be there with everyone and see the innovation, from Chiplets to cloud service models, from disaggregated memory to further hardening of server and CPU root-of-trust technology. In my biased opinion the future is open, not because innovation is slowing, but because the complexity of the problems facing our world demands collaboration: we cannot solve sustainable datacenter design, security, performance, and reliability in isolation, and the generations to come are relying on us to succeed.