It is somewhat amazing to me how often I get the question, “Where do I learn more about systems engineering?” The books for learning the fundamentals of computer architecture and systems engineering are excellent. Patterson and Hennessy’s Computer Architecture is the bible of computer architecture, but other essentials include Applied Cryptography: Protocols, Algorithms and Source Code in C; Compilers: Principles, Techniques, and Tools; Signals and Systems; and Site Reliability Engineering: How Google Runs Production Systems. I must also give a shout-out to my alma mater for beating the Covid “remote learning” trend by launching MIT OpenCourseWare long before it was cool. It has free content from some of the best minds in computer science and architecture for learning the basics.
Truly though, one of the things I find most incredible about this industry is how many innovations are driven by practicing engineers, discussed openly at conferences, and published on corporate blogs and research pages. This field is constantly evolving, in a Renaissance of creative reimagining, and some of the greatest minds in the world are actively creating silicon, and the specifications to build it into systems, based not on theory but on the hard-fought lessons of operation.
So this is a list of some of my FAVORITE papers / blogs / videos, in no particular order: the ones I think are seminal to understanding critical innovation in distributed systems, from the silicon to systems.
1. Datacenter Networks are in my Way – by James Hamilton
What I love most about this is the economic lens it applies to the technical problem. So rarely do architects have that insight: it isn’t just solving the problem elegantly, it is doing it in a way that actually makes economic sense. Industry innovators know this and hold their teams accountable. If you’ve never read James Hamilton’s blog, you’re welcome. Grab your popcorn and get ready to be schooled!
2. Towards a Next Generation Data Center Architecture: Scalability and Commoditization – by Albert Greenberg, Parantap Lahiri, Dave Maltz, Parveen Patel, Sudipta Sengupta
Dating from around the same time as the blog post above, this is in my view one of the seminal papers that started the Open Networking movement, speaking about the real challenges of managing large-scale distributed systems with a chassis-based network design. It is BRILLIANT.
This paper shares how hardware systems have hit a level of complexity (shrinking die sizes, 2D/3D stacking techniques, leakage issues with substrate thickness, aging effects, etc.) at which you cannot always trust them to be “correct,” and how software designers must change their methodologies to embrace the next phase of “chaos engineering” and prepare for our industry’s transformation. I found this paper enlightening, and somewhat terrifying.
4. The Tail at Scale – by Jeffrey Dean, Luiz André Barroso
If someone has ever spoken about “tail at scale” and you wondered what it meant, read this paper.
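For a quick feel for why the tail matters, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each server is independently "slow" on 1% of requests and that a user request must wait for every server it fans out to. The function name and the numbers are illustrative choices, not anything taken from a production system.

```python
def prob_request_hits_slow_server(p_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` independent servers is slow.

    Assumes each server is slow with probability `p_slow`, independently
    of the others, and that the request waits for all of them.
    """
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Hypothetical setup: each server is slow on 1% of its requests.
    for fanout in (1, 10, 100, 1000):
        p = prob_request_hits_slow_server(0.01, fanout)
        print(f"fan-out {fanout:>4}: {p:.1%} of requests wait on a slow server")
```

Under these assumptions, a rare per-server hiccup turns into the common case once the fan-out is large: at a fan-out of 100, roughly 63% of requests end up waiting on at least one slow server.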
5. Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software Defined Networking – by Leon Poutievski, Omid Mashayekh, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, Amin Vahdat
If you wonder what Software Defined Networking really means, read this paper. It is probably the most logical introduction to SDN as applied to a real problem.
6. Maglev: A Fast and Reliable Software Network Load Balancer – by Daniel E. Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, Jinnah Dylan Hosein
If you start with Jupiter, you can then dive deep here into dynamically reconfigurable load balancing, a critical aspect of SDN.
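One idea from the Maglev paper that is easy to play with is its consistent-hashing lookup table: every backend gets its own preference order over the table slots, and the backends take turns claiming their next preferred empty slot. The sketch below is a minimal Python illustration of that population step, not Google's implementation. The hash function (SHA-256 with a salt), the tiny table size, and the names `build_lookup_table` and `pick_backend` are all assumptions made for the example.

```python
import hashlib


def _hash(value: str, salt: str) -> int:
    """Stable 64-bit hash; SHA-256 is an arbitrary choice for this sketch."""
    digest = hashlib.sha256((salt + ":" + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_lookup_table(backends: list[str], table_size: int = 13) -> list[str]:
    """Fill a lookup table by letting backends claim slots round-robin.

    `table_size` should be prime so that every (offset, skip) pair walks
    through all slots before repeating.
    """
    if not backends:
        return []
    offsets = [_hash(b, "offset") % table_size for b in backends]
    skips = [_hash(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_pref = [0] * len(backends)  # position in each backend's preference list
    table: list = [None] * table_size
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Walk this backend's preference list until an empty slot is found.
            while True:
                slot = (offsets[i] + next_pref[i] * skips[i]) % table_size
                next_pref[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == table_size:
                return table


def pick_backend(table: list[str], flow_key: str) -> str:
    """Map a connection identifier (e.g. a 5-tuple string) to a backend."""
    return table[_hash(flow_key, "flow") % len(table)]


if __name__ == "__main__":
    table = build_lookup_table(["backend-a", "backend-b", "backend-c"])
    print(table)
    print(pick_backend(table, "10.0.0.1:5555->10.0.0.2:443/tcp"))
```

Because the backends claim slots in turns, each one ends up with an almost equal share of the table. A real deployment would use a much larger prime table size than the 13 used here.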
7. Large-Scale Cluster Management at Google with Borg – by Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, John Wilkes
Before there was Kubernetes, there was Borg. Distributed computing at scale forces innovation not just in application development, but in the orchestration and management of workload distribution.
8. TMO: Transparent Memory Offloading in Datacenters – by Johannes Weiner, Niket Agarwal, Dan Schatzberg, Leon Yang, Hao Wang, Blaise Sanouillet, Bikash Sharma, Tejun Heo, Mayank Jain, Chunqiang Tang, Dimitrios Skarlatos
This is one of the many papers starting to really address the mounting memory problems facing us in datacenters. Memory continues to rise as a percentage of our cost of ownership: core counts increase while cost per core consistently falls, but memory just keeps growing as a share of the cost structure. Add to that the growing percentage of memory that gets stranded in systems, and our inability to unlock it more dynamically, and the picture gets worse. MUCH more innovation needs to be done in this domain, but Meta is absolutely leading the charge in attempting to address the rising cost of memory as a portion of the bill of materials in an easily adoptable fashion. If we don’t have innovation in memory disaggregation hand-in-hand with transparent software adoption, it will be yet another buffer strategy gone awry.
9. Azure Accelerated Networking: Smart-NICs in the Public Cloud – by Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, Albert Greenberg
Much has been made of the IPU, DPU, etc., but it all started with this paper on SmartNICs. While many in the industry were innovating with NPUs, Microsoft was the first I know of to publish on the topic. I love this paper, and the writers are the brains behind infrastructure acceleration, SONiC, and many more innovations in Open Networking.
10. A New Golden Age for Computer Architecture: History, Challenges and Opportunities – by David Patterson
I often wax poetic about David Patterson. His book on computer architecture, which was part of the hardest course in my entire formal education, is the bible for electrical engineering. This presentation should be required for any student who wants to build something innovative. We don’t have enough Electrical Engineers graduating in the world… I think those folks haven’t been listening to Dr. Patterson, and if only they would, we would see a world of Electrical Engineers, Hardware Designers, Systems Engineers, SREs, and so much more. The time to embrace this industry is NOW. I cannot wait to see what we all will continue to do. If you have additional papers you recommend, please add them in the comments.