Research
My research sits at the intersection of computer systems and high-performance computing (HPC), with a focus on transparent checkpointing. My recent work demonstrated:
-
novel techniques for transparent checkpoint-restart of hardware-accelerated (GPU, RDMA networks) applications; and
-
techniques for scalable checkpointing protocols and their use in efficient scheduling of processes on large-scale computing platforms (supercomputers and large data centers).
In the past, I have also worked on classifying proteins of unknown function. Our work demonstrated a general and efficient computational method for protein classification.
Selected Publications
-
Patel, T., Garg, R., and Tiwari, D., GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems, FAST-2020.
-
Garg, R., Price, G., and Cooperman, G., MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing, HPDC-2019.
(Transparent checkpointing of MPI using a novel “split process” approach)
-
Garg, R., Mohan, A., Sullivan, M., and Cooperman, G., CRUM: Checkpoint-Restart for CUDA’s Unified Memory, CLUSTER-2018.
(Transparent checkpointing of CUDA, including UVM (Unified Virtual Memory), using a proxy approach)
-
Mills, C., Garg, R., Lee, J., Tian, L., Suciu, A., Cooperman, G., Beuning, P., and Ondrechen, M., Functional Classification of Protein Structures by Local Structure Matching in Graph Representation, Protein Science, 2018.
(Computational technique for large-scale classification of proteins using Delaunay triangulation)
-
Garg, R., Arya, K., Cao, J., Cooperman, G., Evans, J., Garg, A., Rosenberg, N., and Suresh, K., Adapting the DMTCP Plugin Model for Checkpointing of Hardware Emulation, SELSE-2017.
-
Cao, J., Arya, K., Garg, R., Matott, S., Panda, D.K., Subramoni, H., Vienne, J., and Cooperman, G., System-level Scalable Checkpoint-Restart for Petascale Computing, ICPADS-2016.
(Scalable transparent checkpointing: MPI-based HPCG over 32,752 CPU cores, and MPI-based NAMD over 16,368 CPU cores)
-
Arya, K., Garg, R., Polyakov, A., and Cooperman, G., Design and Implementation for Checkpointing of Distributed Resources using Process-level Virtualization, CLUSTER-2016.
(Extensible, adaptable, transparent checkpointing via a plugin implementation of process virtualization)
-
Garg, R., Vienne, J., and Cooperman, G., Scalable System-level Transparent Checkpointing for OpenSHMEM, OpenSHMEM-2016.