My research lies at the intersection of computer systems and high-performance computing (HPC), with a particular focus on transparent checkpointing. My recent work demonstrated:

  • novel techniques for transparent checkpoint-restart of hardware-accelerated (GPU, RDMA networks) applications; and

  • techniques for scalable checkpointing protocols and their use in efficient scheduling of processes on large-scale computing platforms (supercomputers and large data centers).

In the past, I have also worked on classifying proteins of unknown function. Our work demonstrated a general and efficient computational method for protein classification.

Selected Publications

  • Patel, T., Garg, R., and Tiwari, D., GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems, FAST-2020.

  • Garg, R., Price, G., and Cooperman, G., MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing, HPDC-2019.
    (Transparent checkpointing of MPI using a novel “split process” approach)

  • Garg, R., Mohan, A., Sullivan, M., and Cooperman, G., CRUM: Checkpoint-Restart for CUDA’s Unified Memory, CLUSTER-2018.
    (Transparent checkpointing of CUDA, including UVM (Unified Virtual Memory), using a proxy approach)

  • Mills, C., Garg, R., Lee, J., Tian, L., Suciu, A., Cooperman, G., Beuning, P., and Ondrechen, M., Functional Classification of Protein Structures by Local Structure Matching in Graph Representation, Protein Science, 2018.
    (Computational technique for large-scale classification of proteins using Delaunay triangulation)

  • Garg, R., Arya, K., Cao, J., Cooperman, G., Evans, J., Garg, A., Rosenberg, N., and Suresh, K., Adapting the DMTCP Plugin Model for Checkpointing of Hardware Emulation, SELSE-2017.

  • Cao, J., Arya, K., Garg, R., Matott, S., Panda, D.K., Subramoni, H., Vienne, J., and Cooperman, G., System-level Scalable Checkpoint-Restart for Petascale Computing, ICPADS-2016.
    (Scalable transparent checkpointing: MPI-based HPCG over 32,752 CPU cores, and MPI-based NAMD over 16,368 CPU cores)

  • Arya, K., Garg, R., Polyakov, A., and Cooperman, G., Design and Implementation for Checkpointing of Distributed Resources using Process-level Virtualization, CLUSTER-2016.
    (Extensible, adaptable, transparent checkpointing via a plugin implementation of process virtualization)

  • Garg, R., Vienne, J., and Cooperman, G., Scalable System-level Transparent Checkpointing for OpenSHMEM, OpenSHMEM-2016.