In this profile series, we interview different mentors from across all walks of life - those who strive to solve the greatest challenges of our time, who work to spearhead technology advancements, and who collaborate with the developer community to enable scientific discoveries.
Interested in becoming a mentor? Apply today.
Meet Max Katz, Senior Solutions Architect at NVIDIA. Max works with the US Department of Energy on the deployment and use of their GPU-powered supercomputers, including Summit and Sierra. An astrophysicist who researches explosions of stars using fluid dynamics simulations on supercomputers, he holds a PhD in Physics from Stony Brook University and a BS in Physics from Rensselaer Polytechnic Institute.
How did you get started in High Performance Computing (HPC)?
My background is in computational physics, and I did my PhD at Stony Brook University working under Mike Zingale. I originally didn't intend to go into HPC; I was studying to be an astrophysicist and didn't know about the connection with supercomputing. My PhD advisor was doing his work in computational astrophysics on some of the largest supercomputing systems, Titan at Oak Ridge National Laboratory and also the systems at NERSC at Lawrence Berkeley Lab, and that was what originally got me interested in HPC, especially at scale.
I hadn't had any previous experience with HPC, or even any particular interest in it, but I started to really enjoy the experience of working on these very large systems, and I liked the way that it forced me to up my game as a computational scientist. Since I'm a very conscientious person, burning millions of hours on the world's fastest supercomputer wasn't something that I took lightly. I wanted to ensure that the way I was using the system was effective. The combination of wanting to use the supercomputing system effectively and learning how to write my scientific code in a better way to do that was what really got me interested in high performance computing, and it is partly what still drives me today.
How did you get involved with the GPU hackathons and Boot Camp Program?
I got involved when the GPU Hackathon Program first started. For context, my team from Stony Brook University was really interested in running on the Titan supercomputer at Oak Ridge National Laboratory. That system was one of the first large-scale systems to have GPUs. At Stony Brook University, we were limited by compute power for the problems that we wanted to solve, so it made sense to see if we could get a performance benefit from using the GPUs. The team also had a sense that GPUs were not a passing phase, that they would be a relevant part of future supercomputing systems, and so we were “future-proofing” ourselves, so to speak.
So, we started attending these GPU hackathons. The team not only had a lot of fun, but also accomplished a lot of work since we were able to sit in a room with experts on GPUs and GPU programming and ask them questions. That's how we got involved, and in fact, that team from Stony Brook still avails themselves of the GPU Hackathon series to this day to get their codes running effectively on these systems.
What kind of benefits do GPU Hackathons offer to the community?
In my mind, there are two benefits that apply to the people who attend a GPU Hackathon.
The first benefit is that developers can sit in a room with people who write the compilers or have been doing GPU programming for years. This is important because it helps to overcome an activation barrier. For example, when my team at Stony Brook University first started using OpenACC, we didn't really have a good mental model of how GPUs worked, and that made it very hard for us to get started. Our ability to use the tools effectively was limited by our lack of understanding of how our code would actually map to the hardware that we were using. Being in the room with people who do understand this and can help translate these concepts is really nice. It allows you to ask questions without having to go through the psychological roadblock of writing an email that you think sounds stupid. It may sound trivial, but it's very beneficial to be able to just ask questions of people whose job is to answer them.
The second benefit that I would cite is the motivation. It feels like everybody is there for the same reason and working towards the same goal. Even though teams are working on different codes and their insights may not directly apply to your code, it feels like you are part of a community that is enthusiastic about working effectively on GPU-powered supercomputers. This is important because often it can seem like a lonely task, especially when you're frustrated. When things aren't working, you have the temptation to just give up and work on something else. Being in the room with people who are all engaged in the same larger project helps give you that energy to keep going and keep tackling these hard problems.
What is the benefit to the community at large?
Part of the benefit to the community at large is that many of the teams who attend aren't just focused on their own narrow project; often they are developing codes that are used by the whole community. In that sense, the hackathons are an investment in computational science for the community. For example, if a team that attends works on optimizing GROMACS or LAMMPS or NAMD or some other code that is used by a large section of the community and consumes 10% or 20% of the FLOPs on their facility’s supercomputers, that is an investment in the community. I think that is a big part of this program: the people who don't attend the hackathons still benefit.
The other part that benefits the community is that the people who attend then become experts at GPU computing themselves. The program seeds the community with people who have some experience running on GPUs and are empowered to answer questions from people who are just getting started. There's a virtuous cycle there where just by attending a hackathon you gain some facility in GPU programming and are now a resource to the community.
How did you make the transition from attendee to becoming a mentor?
As I mentioned, at Stony Brook University we really struggled to get our codes working effectively on GPUs when we attended our first two GPU Hackathons. We had some nice proof points in standalone “mini-app” type codes or unit tests with specific physics modules, but we really struggled to get things working on our real science code. We knew GPUs had lots of raw compute advantages, but we didn't know the GPU programming model well enough at the time to understand why that was. Even though the team invested quite a bit of time trying to get the code to run on GPUs, it didn't work for us. As a scientist I had to use my time effectively, and at some point, you just cut your losses and move on. At least that's how I felt at the time.
Part of my goal, both in working at NVIDIA and in returning to be a mentor at these GPU Hackathons, was to inject some realism into this process. I know that this is a struggle and that as a developer or scientist, you don't just go to a hackathon and get a 100X speedup by Day 3, or even Day 300 in many cases. I wanted to work with people experiencing the same frustration that I did, not only to let them know that this is normal and will take some time, but also to share my previous experiences to help prevent some of that frustration.
Finally, my motivation is to help give back to the community because the community had really invested in me. I strongly believe in the concept of community science. Scientists should not be siloed; they should be continually contributing their codes back and working collaboratively on projects at these GPU Hackathons.
What are some of the successes and challenges as a mentor?
In terms of successes, what I've realized over time is that one of the most effective things that you can do as a mentor is to guide the way that your participants are thinking about their problem.
Many times, very experienced computational scientists will come to a GPU Hackathon with an identified problem and a predetermined idea of how they want to approach that problem. If they don't have that mental model of both how the GPU works and how to use the tools that help identify performance bottlenecks in the code, then they might waste time chasing down the wrong problem because the standard ways of optimizing code on CPUs just don't apply to running it on GPUs effectively in many cases. It's really resetting expectations.
I would say that the biggest challenge for mentors, and simultaneously the biggest success, is getting people to be very empirical about what they're doing and what the tools tell them. If you can do that, then you can change the way they think about performance. They attend a GPU Hackathon and they not only learn about how GPUs work, but they also come away with much broader skills through the process of becoming a more effective computational scientist.
What advice would you then give to someone who wants to become a mentor at a GPU Hackathon or a Bootcamp?
I would recommend hands-on learning: take a code that you have not written and have no experience with, and run it through the NVIDIA profiling tools or another third-party profiling tool that is GPU-aware. Gain proficiency in using the tool to identify where time is being spent in the application and how to identify potential performance issues, both in the large-scale application and for individual GPU kernels.
That process is very important because the most important thing that you can do as a mentor is to empower your team to think better about their problem. The way that they do that is almost always by using profiling tools to understand what's going on in their code and identifying where to fix it. Once they know where to look, it's often a much more constrained and easier problem to identify how to optimize the code.
In terms of soft skills, I would say that you must be confident in yourself and trust that you can contribute to your team’s efforts even if you don't know their domain or even their programming model. Have the confidence that simply understanding how to think about performance optimization or the porting process, and understanding the general principles, will help you as a mentor.
What do you recommend to start the process?
The best way to get started is to do it yourself. If you can attend a GPU Hackathon or Bootcamp as a participant, that is the most effective thing you can do. If you can't, then find training materials for running on GPUs and just dive into that. For example, NVIDIA’s Deep Learning Institute offers self-paced courses that allow you to learn CUDA or learn OpenACC. You don't need to be a GPU expert to get started as a mentor. You just need the willingness and the passion to do it.
As I mentioned before, a very effective way to train yourself in being a mentor is to take somebody else's code, learn how to build and run it, analyze its performance and then contribute code back. The valuable part of what mentors do is guide the process of their participants—they guide where to look and what kinds of changes to make—and so just familiarizing yourself with that process is very beneficial.
If you don't have somebody at your company or organization who does this, there are plenty of open science codes that welcome community contributions and would be glad to have people joining their effort.