Stephen Herbein

I am a software engineer specializing in distributed systems and job scheduling. I am passionate about learning and making the world a better place.

Stephen's Profile Picture



Ph.D. in Computer Science

University of Delaware, 2018

4.0 GPA

M.S. in Computer Science

University of Delaware, 2016

4.0 GPA

B.S. in Computer Science

University of Delaware, 2014

3.932 GPA

Online Courses

Self-Driving Cars Specialization

Coursera, 2022


Senior Software Engineer

2022 - Present

Maglev: Creating MLOps infrastructure for autonomous vehicle development
Computer Scientist
Lawrence Livermore Nat'l Lab

2018 - 2022

  • Flux: developed features, supported users, and readied code for future Exascale system - El Capitan
  • HPC+Cloud: PI of $175K project to find gaps in LLNL’s converged computing capabilities
  • Exaworks: created a portable, performant, and hardened workflow SDK
  • ZFP: developed python bindings for this LLNL compression library
Research Assistant
University of Delaware

2012 - 2018

  • 2014-2018: Created next-generation I/O-aware and hierarchical job schedulers for HPC clusters utilizing the Flux resource manager.
  • 2014: Developed a suite of tools to profile, auto-tune, and optimize applications developed with the parallel I/O library ADIOS.
  • 2013: Integrated in-transit analysis and staging into the scientific application QMCPack to improve I/O performance and scalability.
  • 2012: Developed a crowdsourcing web application, ExSciTecH, to complement the volunteer computing project Docking@Home.
Research Intern
Lawrence Livermore Nat'l Lab


  • 2017: Integrated my hierarchical scheduler with the Uncertainty Quantification Pipeline (UQP), resulting in a 37% improvement in workload runtime
  • 2016: Added dynamicity to my hierarchical scheduler to eliminate resource fragmentation
  • 2015: Developed an automatic job aggregator and hierarchical scheduler, resulting in a 4x speed up over the existing scheduler
  • 2014: Developed a discrete-event simulator for the next-generation resource manager, Flux
Science Undergrad Lab Intern (SULI)
Oak Ridge Nat'l Lab


  • Integrated ADIOS, ORNL’s IO framework, into QMCPack, a quantum monte-carlo simulator.
  • Examined the performance of various IO methods and techniques on peta-scale systems like Titan.
Research Experience for Undergraduates (REU) Intern
University of Houston


  • Optimized VolpexMPI library for use on large-scale clusters.



Best Paper Award

International Conference on e-Science, 2022

R&D 100 Award - Flux

R&D World, 2021

Spot Award

LLNL ATDM Next Generation Computing Enablement, 2019

Dissertation Award

IEEE Technical Committee on Scalable Computing, 2019

Best Poster

Annual LLNL Poster Symposium, 2016

Undergraduate Student Research Competition Award

ACM, 2013


Outstanding Graduate Student Award

CIS Department, 2016

Outstanding Senior Student Award

CIS Department, 2014

Outstanding Junior Student Award

CIS Department, 2013

General Honors Award

University of Delaware, Fall 2012

Presidential Achievement Scholarship

University of Delaware, 2011 - 2014

Top Publications

Scalable Composition and Analysis Techniques for Massive Scientific Workflows
D. Ahn, X. Zhang, J. Mast, S. Herbein, F. Di Natale, D. Kirshner, S. Ade Jacobs, I. Karlin, D. Milroy, B. De Supinski, B. Van Essen, J. Allen, and F. C. Lightstone
Best Paper
18th IEEE International Conference on e-Science (e-Science), Oct 2022


Composite science workflows are gaining traction to manage the combined effects of (1) extreme hardware heterogeneity in new High Performance Computing (HPC) systems and (2) growing software complexity – effects necessitated by the convergence of traditional HPC with data sciences. Composing, analyzing, and optimizing a composite workflow remains highly challenging as the component technologies are generally developed in isolation and often feature widely varying levels of performance, scalability, and interoperability. In this paper, we propose novel workflow composition and analysis techniques to create and optimize a scalable and effective composite workflow for heterogeneous HPC centers, and define the performance space of variables that impact composite workflow performance. We present PerfFlowAspect, an Aspect Oriented Programming (AOP)-based tool to perform cross-cutting performance analysis of composite workflows and better understand the impact of key performance variables on workflows. Our solution directly addresses AOP concerns that can affect workflow performance and covers the full software lifecycle, ranging from the workflow’s initial composition through performance analysis and optimization. We use our science workflow composition techniques to implement the American Heart Association Molecule Screening (AHA MoleS) workflow. Through experimentation, we demonstrate that tuning a single performance variable can improve AHA MoleS workflow performance by a factor of up to 2.45x. Our evaluation suggests that our techniques can significantly enhance the ability of a multi-disciplinary research and development team to create a high performance composite workflow.
Flux: Overcoming Scheduling Challenges for Exascale Workflows
D.H. Ahn, N. Bass, A. Chu, J. Garlick, M. Grondona, S. Herbein, J. Koning, T. Patki, T.R.W. Scogland, B. Springmeyer, and M. Taufer
13th Workshop on Workflows in Support of Large-Scale Science (WORKS), November 2018


Many emerging scientific workflows that target high-end HPC systems require complex interplay with the resource and job management software~(RJMS). However, portable, efficient and easy-to-use scheduling and execution of these workflows is still an unsolved problem. We present Flux, a novel, hierarchical RJMS infrastructure that addresses the key scheduling challenges of modern workflows in a scalable, easy-to-use, and portable manner. At the heart of Flux lies its ability to be nested seamlessly within batch allocations created by other schedulers as well as itself. Once a hierarchy of Flux instance is created within each allocation, its consistent and rich set of well-defined APIs portably and efficiently support those workflows that can often feature non-traditional execution patterns such as requirements for complex co-scheduling, massive ensembles of small jobs and coordination among jobs in an ensemble.
Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters
S. Herbein, D. H. Ahn, D. Lipari, T. R.W. Scogland, M. Stearman, M. Grondona, J. Garlick, B. Springmeyer, and M. Taufer
25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), June 2016


The economics of flash vs. disk storage is driving HPC centers to incorporate faster solid-state burst buffers into the storage hierarchy in exchange for smaller parallel file system (PFS) bandwidth. In systems with an underprovisioned PFS, avoiding I/O contention at the PFS level will become crucial to achieving high computational efficiency. In this paper, we propose novel batch job scheduling techniques that reduce such contention by integrating I/O awareness into scheduling policies such as EASY backfilling. We model the available bandwidth of links between each level of the storage hierarchy (i.e., burst buffers, I/O network, and PFS), and our I/O-aware schedulers use this model to avoid contention at any level in the hierarchy. We integrate our approach into Flux, a next-generation resource and job management framework, and evaluate the effectiveness and computational costs of our I/O-aware scheduling. Our results show that by reducing I/O contention for underprovisioned PFSes, our solution reduces job performance variability by up to 33% and decreases I/O-related utilization losses by up to 21%, which ultimately increases the amount of science performed by scientific workloads.
PRIONN: Predicting Runtime and IO using Neural Networks
M. Wyatt, S. Herbein, T. Gamblin, A. Moody, D.H. Ahn, and M. Taufer
47th International Conference on Parallel Processing (ICPP), August 2018


For job allocation decision, current batch schedulers have access to and use only information on the number of nodes and runtime because it is readily available at submission time from user job scripts. User-provided runtimes are typically inaccurate because users overestimate or lack understanding of job resource requirements. Beyond the number of nodes and runtime, other system resources, including IO and network, are not available but play a key role in system performance. There is the need for automatic, general, and scalable tools that provide accurate resource usage information to schedulers so that, by becoming resource-aware, they can better manage system resources.

We tackle this need by presenting a tool for Predicting Runtime and IO using Neural Networks (PRIONN). PRIONN automates prediction of per-job runtime and IO resource usage, enabling IO-aware scheduling on HPC systems. The novelty of our tool is the input of whole job scripts into deep learning models that allows complete automation of runtime and IO resource predictions. We demonstrate the power of PRIONN with runtime and IO resource predictions applied to IO-aware scheduling for real HPC data. Specifically, we achieve over 75% mean and 98% median accuracy for runtime and IO predictions across 300,000 jobs from a real HPC machine. We combine our per-job runtime and IO predictions with queue and system simulations to predict future system IO usage accurately. We predict over 50% of IO bursts in advance on a real HPC system.

Advanced Schedulers For Next-Generation HPC Systems
S. Herbein
University of Delaware, 2018


High performance computing (HPC) is undergoing many changes at both the system and workload levels. At the system level, data movement is becoming more costly in relation to computation and HPC centers are becoming increasingly power-constrained. In an effort to adapt to these trends, HPC systems are including new resources such as burst buffers and GPUs which makes the resource set larger and more diverse. At the workload level, new ensemble workloads,such as uncertainty quantification (UQ), are emerging within HPC, driving up the workload scale in terms of the number of jobs. Existing HPC scheduling models are unable to adapt to these changes, leading to degraded system efficiency and application performance. In this thesis, we claim that new schedulers are needed to overcome the challenges mentioned above and efficiently manage the next-generation of HPC systems. To this end we design, implement, and evaluate three fundamental transformations to the existing scheduling models. First, we integrate I/O-awareness into existing scheduling policies and demonstrate that I/O-aware scheduling increase the efficiency of burst buffer-enabled HPC systems. Second,we expand our I/O-aware scheduler to incorporate the accurate knowledge of application I/O utilization patterns provided by machine learning models. Third, we design a prototype scheduler based on the fully hierarchical scheduling model and show that it reduces scheduler overhead and increases job throughput on synthetic and real-world ensemble workloads, such as UQ. Our work is the first step towards a new generation of scheduling models for HPC.