
Where I’ll be at OCP Global Summit

Every year I try to do a write-up on the content I’m most excited about for OCP’s Global Summit. Full disclosure: I’m a former Board Member and Chairperson, still involved in the Future Technologies Initiative/Symposium, and I was one of the content advisors for the Artificial Intelligence track this year, so I am definitely biased.

The full schedule for OCP Global Summit is jam-packed with sessions covering Security, Reliability, Artificial Intelligence, Open Networking, Sustainability, Composable Memory, Chiplets, Optics, Coherent Interconnects, Automation, Facilities Innovation, Cooling, and a lot more. This three-day industry event brings thousands of people and hundreds of companies together to discuss the future of our industry, and how we need to collaborate to drive the changes that no one company alone can facilitate.

So first thing on Tuesday, we have Keynotes. There is an overwhelming theme across them: how Artificial Intelligence is changing the computing landscape, exacerbating challenges with power, cooling, and networking, and creating new threats for security. If you want to hear from a broad range of hyperscalers and leading semiconductor companies on how they are preparing for this explosion of generative AI, you should not miss it.

Unlike many other conferences, the Keynotes are not the sole reason to attend. This conference is full of breakout sessions led by the engineers who are actually building systems, solutions, silicon, and software to solve novel challenges. On Tuesday afternoon I will be torn between the AI track and the SONiC track, but since I was on the content advisory committee for the AI track, I will definitely be there to see it live. There is everything from processor-in-memory inferencing solutions to open source efforts to align on a hardware abstraction layer that unlocks AI accelerator innovation while keeping model development rapid and consistent. Google, NVIDIA, Meta, Microsoft, and many more will be presenting on use cases, challenges, and opportunities across the data center, from silicon to software and facilities to network observability.

On Wednesday, Andy Bechtolsheim is kicking it off with network architecture and optics for large-scale AI clusters. Then we will transition to the Future Technologies Symposium, one of my favorite sections of the conference because it isn’t just about the challenges of today; it is about the challenges facing us as an industry, from researchers to startups and well-established companies. We will hear about several critical projects: storage disaggregation, the evolution of the network operating system under the Linux Foundation, the journey to composable memory, how hardware management solutions must continue to evolve in increasingly complex systems, and much more. DENT will have its first sessions at OCP this year alongside SONiC, highlighting the breadth and depth of open source NOS solutions available, and how we will continue to evolve to optimize for usability. Wednesday also brings updates on optics, open edge servers, memory fault management, and RAS (reliability, availability, and serviceability): efforts on a standard RAS API, RAS to contain the impact of PCIe uncorrected errors, and RAS enhancements in MCTP, SPDM, PLDM, OpenBMC, and Redfish support for GPUs and accelerators (a critical workstream that needs significant standardization). Last but in no way least, we will hear a lot about improvements in the sustainability and usability of immersion cooling solutions, and circularity initiatives from Open System Firmware to eWaste reclamation improvements.

On Wednesday afternoon, we will go even further down the future-looking data center “rabbit hole,” discussing Quantum Computing, advancements in composable memory including much-needed real-time telemetry, Silent Data Corruption research progress, automation and robotics in the data center, and hacking away at AI opportunities within the data center domain (predictive analytics, resource optimization using AI for sustainable operations, and localization). My team will also be presenting our DC-SCM 2.0-compliant solution for hardware management, designed in conjunction with Lenovo, and I definitely won’t miss that!

Thursday ushers in security, chiplets, time synchronization, more on modularity, and networking (beyond the Optics and Open Networking content well covered in the first two days). Expect updates on Caliptra, DC-MHS progress, SONiC, QUIC offload in the Linux kernel, and significantly more user updates in the domains of sustainability, storage, and facilities enhancements.

OCP Global Summit is designed for teams, which can divide and conquer to expand insight across security, manageability, reliability, modularity, networking, thermal/mechanical design, and future-looking initiatives such as chiplets and composable memory. The opportunity to learn, connect, and share is how OCP continues to empower open communities, and I look forward to seeing you there!


OCP 2023 Regional Summit in Prague: What Not to Miss!

It is hard to imagine, but OCP’s Regional Summit is coming up in just two weeks. Our last Regional Summit in Europe was in 2019, before the COVID-19 pandemic, and ever since joining the board I’ve been eager to connect with our European leaders to hear their perspectives and the specific challenges of running data centers and building servers in the European market.

As with every OCP event, I like to write my own “unofficial” guide to what I’m most excited about, and where you’ll find me if you are there. As always, I want to meet with you, so please do reach out!

So, let’s start with Day 1:

  • We will start with Keynotes on the morning of the 19th, and it is interesting to see the regional flavor of this event. At the Global Summit the topics were very much centered on AI (including connectivity for workloads that cannot fit on a single node), Security, and Sustainability. Here in Europe, Sustainability is taking center stage (for silicon providers and cloud companies), along with Ethernet for AI/ML and HPC, but we are also seeing discussions about the role and future of Open Source and Open Empowerment, standardization of Edge Computing (which is going to be fascinating, since I can think of nothing less standardized today), and Quantum Computing.
  • In the afternoon we will transition into several engineering workshops (and yes, as wonderful as the keynotes are, THESE are the heart of the conference). I’m excited to see the SONiC sessions from the vibrant European community (Criteo, STORDIS, Deutsche Telekom, Broadcom, Weaveworks, and Credo), and new OCP-Ready systems and contributions from Mitac, Inspur, HPE, Giga Computing, Murata, and 9elements.
  • The other track occurring during the engineering sessions is the Future Technologies Symposium, which once again has a very different regional flavor than FTS at the Global Summit. We are seeing sessions on quantum computing, neuromorphic computing, AI/HPC techniques for data center energy optimization, the impact of temperature and location on overall IT efficiency, heat reuse techniques, and more.

That evening we will have a welcome reception, and no, my band is not playing, but we will have an incredible time connecting in beautiful Prague.

Day 2 will bring additional sessions on system management, modularity, and security, as well as expanding on the key networking and sustainability activities. Here are the ones I’m most excited about:

  • DC-SCM with the OpenBMC compliance suite (I find this personally relevant), TEE-agnostic attestation research, fault management, leveraging ChatGPT for SSD development, Scope 3 emissions standards proposals, Caliptra updates, OSF, attestation with Redfish, CXL memory expansion, and immersion (from more sustainable fluids to the system designs and reuse methodologies).
  • There will also be updates on chiplets, a session on “SONiC Lite” (which I think is critical: most of us want to start with a low-risk SONiC use case, but if SONiC needs significant memory to run, we are greatly impacted in our ability to leverage it for those lower-risk scenarios, namely console/management switches), and Precision Time Protocol (PTP) options, another project I think is so important for standardization and improved global network management (see the sketch after this list).
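Since PTP keeps coming up across these summits, here is a minimal sketch of the four-timestamp exchange at the heart of IEEE 1588. The timestamps are made-up nanosecond values of my own, not anything from the sessions:

```python
# Minimal sketch of the IEEE 1588 (PTP) offset/delay math. Real deployments
# get these timestamps from hardware timestamping; accuracy of t1..t4 is
# exactly what the standardization and management work aims to improve.

def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int) -> tuple[int, int]:
    """t1: master sends Sync; t2: client receives it;
    t3: client sends Delay_Req; t4: master receives it."""
    offset = ((t2 - t1) - (t4 - t3)) // 2  # client clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) // 2   # estimated one-way path delay
    return offset, delay

# Hypothetical exchange: client runs 500 ns ahead, path delay is 1500 ns.
offset, delay = ptp_offset_and_delay(t1=0, t2=2_000, t3=10_000, t4=11_000)
print(offset, delay)  # -> 500 1500
```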

Fundamentally, our European community is thinking about the future of innovation from silicon to systems, software, system-level firmware, management, and so much more. I really look forward to learning from everyone and meeting our local leaders and experts.


OCP Global Summit 2022: My Key Takeaways

Two weeks ago marked the end of the OCP 2022 Global Summit at the San Jose Convention Center. I personally had an incredible time connecting with the community, but OCP isn’t just about connecting; it is really about having the opportunity to see technology trends across leading consumers and producers of silicon and systems: what is important today, and what they are building for tomorrow. Given that, I wanted to write up the key takeaways I saw (with the caveat that I could only attend so many sessions). There are many great sessions which will be released after the fact by the OCP Foundation, and when I find awesome nuggets, I promise to write those up, too.

So, starting with key announcements and data from the hyperscalers:

  • Meta’s Alexis Bjorlin spoke about contributions in the domain of a new rack specification (ORV3), needed due to the thermal design requirements of AI (GPU- and ASIC-based solutions) and forthcoming higher-power CPU solutions. They also contributed their next AI system specification, Grand Teton, which offers 4x the performance and 2x the network bandwidth of their previous Zion EX solution, and their storage platform for AI, Grand Canyon, which offers security, power, and performance improvements, and uses HDDs for AI! I followed up with some folks: the HDD usage is on the storage side; they are not using HDDs for AI training or inference work (which was how I originally interpreted it; thanks to Vineet for helping clarify!), and Grand Teton is primarily SSD/flash. For those listening in the room, the key nugget was definitely that AI performance, at least for DLRMs at Meta, is gated SIGNIFICANTLY by network I/O. If we want to “feed the beast,” we need faster network solutions (not necessarily a fancy fabric; it could be standard Ethernet with better congestion management and control) and likely, eventually (no timeline specified), optics to the node to manage the power footprint of such high-bandwidth SerDes. Alexis used to run the Silicon Photonics division at Broadcom, and at Intel before that, so it is no surprise that optics is on her mind, but this was an impressive case study for where and why the future of AI requires better network design and management.
  • Dr. Partha Ranganathan of Google spoke passionately about the inflection point in our industry (at one point calling it “a career making time”) where the rate of cheaper/faster systems is slowing just as computational demand is increasing due to cloud computing, machine learning, data analytics, video, and a more massively intelligent IoT edge. What we have done historically cannot achieve the scale we need, and it is an exciting time that will require leaders to come together. He spoke about four major contribution swimlanes to OCP: 1. Server: DC-MHS, done in conjunction with Dell, HP, Intel, and Microsoft, helps ensure we build modular building blocks for every server to maximize reuse, design isolation, supply optionality, and reliability, with standardized management via OpenBMC and Redfish contributions; 2. Security: Caliptra, a standardized RoT implementation donated in collaboration with AMD, Microsoft, and NVIDIA; this reusable IP block for RoT measurement is being actively hardened in the ecosystem to ensure chip-level attestation at the package or SoC is done effectively; 3. Reliability: along with ARM, Intel, Meta, Microsoft, and NVIDIA, they are leading the effort to create metrics about silent data errors and corruption for the broader industry to track, and Google is contributing execution frameworks and suites to test environments with faulty devices. I have spoken about the DCDIAG effort before, when I was at Intel; this is an important approach for the industry to take as complexity rises, because better design and management of aging systems requires automation and testing, the same way tune-ups and proactive maintenance occur on cars (a minimal sketch of the underlying idea follows this list); and 4. Sustainability: specifically, how Google is sharing best practices with OCP and the broader community to standardize sustainability measurement and optimization techniques.
  • Microsoft’s Zaid Kahn spoke on similar topics to Google (given that Caliptra and DC-MHS are contributions they coauthored), but went even further, focusing specifically on the future of Security for Open Compute projects. They announced Hydra, a secure BMC SoC developed with Nuvoton, which enables fine-grained control of BMC interface authorization, so only trustworthy devices can be granted a debug interface, and the access is temporary. They also announced Project Kirkland, which demonstrates how, using firmware only, one can update the TPM and CPU RoT in a way that prevents substitution attacks, interposing, and eavesdropping. On the topic of modularity, the Mt. Shasta design was contributed: an ORV3-compliant design that supports high-power devices with a 48V power feed and hot-swappable modules.
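As an aside on the silent data corruption work: the core idea behind the screening suites Google described is simple, even if production implementations are far more sophisticated. Here is an illustrative sketch; every name and constant below is mine, not from any OCP contribution:

```python
# Illustrative silent-data-corruption screening: run a deterministic
# kernel repeatedly and flag any run whose result drifts from the
# reference. A healthy device always produces the same digest.
import hashlib

def reference_kernel(seed: int) -> bytes:
    """Deterministic compute; any divergence between runs implies a fault."""
    x = seed
    buf = bytearray()
    for _ in range(100_000):
        # 64-bit linear congruential step exercises integer multiply/add.
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        buf += x.to_bytes(8, "little")
    return hashlib.sha256(buf).digest()

def screen(runs: int = 10, seed: int = 42) -> bool:
    golden = reference_kernel(seed)
    return all(reference_kernel(seed) == golden for _ in range(runs))

print("ok" if screen() else "FAULT: silent corruption suspected")
```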

In terms of manufacturers, whether silicon or systems, the theme was Sustainability. Samsung spoke about their renewable energy and water reduction goals. Intel and the major OxMs (HP, Dell, Mitac, etc.) showed up on modularity (leadership roles with DC-MHS and DC-SCM) and Open System Firmware, to ensure a circular economy and second life for servers can exist, and to reduce embodied carbon by amortizing it over a longer period of time (given that fewer server/system components require upgrades when new CPUs, memory technologies, etc. come to market).
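To make the amortization point concrete, here is a back-of-the-envelope sketch; the 1,500 kgCO2e figure and the lifetimes are illustrative assumptions of mine, not numbers from the talks:

```python
# Back-of-the-envelope view of embodied-carbon amortization: the longer a
# server's useful life, the smaller its embodied carbon per year of service.

def annualized_embodied_carbon(embodied_kg_co2e: float, lifetime_years: float) -> float:
    return embodied_kg_co2e / lifetime_years

baseline = annualized_embodied_carbon(1_500, 4)  # retire servers after 4 years
extended = annualized_embodied_carbon(1_500, 6)  # modular reuse stretches to 6
print(f"{baseline:.0f} vs {extended:.0f} kgCO2e/year")  # 375 vs 250: one third less
```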

There was a lot of great discussion and debate on the future of AI fabrics (started by Alexis), and Ram Velaga was quite eloquent in his advocacy for Ethernet as the fabric for HPC and AI, bringing in the brilliant Dr. Mark Handley to speak about innovation in congestion management on Ethernet to unlock best-in-class performance. There was a fair amount of push on the interrelationship between compute, network, and storage for different workload scaling, and some poking at proprietary solutions addressing this (InfiniBand and NVLink specifically, which seems fitting at a conference that co-hosted the CXL Consortium and clearly advocates for coherent memory/accelerator pooling to move to open standard interfaces). Finally, there were several sessions on optics and innovation in packaging (from Broadcom, Marvell, and Intel), demonstrated in person at the Celestica and Ragile booths, which again reinforces this push to use open standards to drive innovation so vendors don’t make big investments on bets that won’t have market alignment.

Conversations were also vibrant about the expansion of alternatives to x86 in the server ecosystem (teams from ARM, ARM-based server-class CPUs, and RISC-V server-class CPUs were all there), and Open Networking solutions (specifically a large-scale SONiC Workshop at the event). The feeling I got from the collective sessions was that SONiC’s time has come, and while it still has a long way to go for feature parity, optimization, and usability enhancements compared to proprietary NOS options, the partnership with the Linux Foundation for a more open and agile contribution model puts SONiC on the right track to real adoption in the industry. On the alternate-architecture side, I feel a certain amount of conflation is occurring. Out-of-order execution exists on both ARM and x86 products, as do SIMD execution units and branch prediction. From a core perspective, a lot has converged between RISC and CISC architectures since the time RISC was introduced to the world; where I think particular implementations have shone is in making the uncore portion of server-class CPUs power-efficient. This is where we see real divergence in certain players (e.g., Ampere). There are a lot of things I personally still want to see in the “many cores, more power-efficient architecture” processing units (a personal frustration has been that security is an “upsell” on top-bin parts versus being available across the SKU stack; to me this is like saying one only gets a key for a Lexus because Toyotas are too cheap to warrant them… ummm… tell that to the person getting carjacked). Security has to be ubiquitous for consumers to trust their providers, and hardware companies need to view these solutions as foundational, without significant performance impact as a foregone conclusion.

Anyway, those are my quick notes from the edge… it felt great to be there with everyone and see the innovation, from chiplets to cloud service models, and disaggregated memory to further hardening of server and CPU root-of-trust technology. In my biased opinion, the future is open, not because innovation is slowing, but because the complexity of the problems facing our world demands collaboration; we cannot solve sustainable data center design, security, performance, and reliability in isolation, and the generations to come are relying on us to succeed.