ABSTRACT
The Center for Computational Research (CCR) at the University at Buffalo has developed Grendel: a fast, easy-to-use, bare-metal provisioning system for High Performance Computing (HPC). Grendel simplifies network booting racks of compute nodes by providing a robust PXE boot server, a REST API, and node management in a single binary for easy installation. In this paper, we describe CCR’s HPC network architecture and how Grendel was used to provision the center’s Linux-based compute clusters. We also present some modern features built into Grendel, including automatic host discovery, deployment of Live OS images to bare-metal compute nodes, and delivery of kernel, initramfs, and other provisioning assets using access tokens and trusted HTTPS.