National Prominence

Published April 27, 2015

CCR develops monitoring tools for NSF's XSEDE program

“While XDMoD has made reporting a much simpler and less time-consuming task, the range of metrics available has also provided insight into the operation of XSEDE that was not readily available, and in some cases not even possible previously.”

In 2010, CCR was awarded a prestigious 5-year NSF Technology Audit Service grant for XSEDE to develop an active set of tools and services to monitor (audit) the XSEDE (formerly TeraGrid) cyberinfrastructure. In 2015, based on its performance under the initial award, CCR was awarded a 5-year renewal.

XSEDE is the world’s largest distributed cyberinfrastructure for open scientific research, and as such plays a crucial role in advancing simulation-based engineering and science in the United States. As NSF develops its strategy for XSEDE and its continued investment in advanced cyberinfrastructure to support scientific research, it is important that the process be informed by reliable, extensive usage and performance data. Until recently, this would have been difficult. However, through the NSF Technology Audit Service award, CCR has developed the XDMoD (XD Metrics on Demand) tool to make this sort of data and data analysis readily accessible.

XDMoD (https://xdmod.ccr.buffalo.edu) is a comprehensive auditing framework for use by high-performance computing (HPC) centers. It provides metrics on resource utilization, resource performance, application performance, quality of service, and impact on scholarship and research. In addition to the XSEDE version of XDMoD, an open source version (Open XDMoD), targeted at academic and industrial HPC centers, has also been developed and is available for download at http://xdmod.sourceforge.net/.

The XDMoD and Open XDMoD frameworks include a computationally lightweight application kernel auditing system that uses performance kernels to measure overall system performance. This allows continuous resource auditing of all aspects of system performance, including file-system performance, processor and memory performance, and network latency and bandwidth. The frameworks also provide job-level performance data for every job running on the cluster (without the need to recompile the application codes), giving system personnel the ability to identify poorly performing codes and subsequently tune them for optimal performance. XDMoD and Open XDMoD are designed to meet the following objectives:

(1) provide the user community with a tool to more effectively and efficiently use their allocations and optimize their use of HPC resources,

(2) provide operational staff with the ability to monitor, diagnose, and tune system performance as well as measure the performance of all applications running on their system,

(3) provide software developers with the ability to easily obtain detailed analysis of application performance to aid in optimizing code performance,

(4) provide stakeholders with a diagnostic tool to facilitate HPC planning and analysis, and

(5) provide metrics to help measure scientific impact.

While XDMoD and Open XDMoD have made reporting a much simpler and less time-consuming task, the range of metrics available has also provided insight into the operation of XSEDE and HPC resources that was not readily available and, in some cases, not even possible previously.

This work was sponsored by NSF under grant number OCI 1025159 for the development of a technology audit service for XSEDE.

PI: Tom Furlani (CCR)

Co-Is: Matt Jones, Steve Gallo, Abani Patra, Gregor von Laszewski, Mark Green, Vipin Chaudhary

Key Personnel: Robert DeLeon, Nikolay Simakov, Joseph White, Jeff Palmer, Tom Yearke, Ryan Rathsam, Jeanette Sperhac, Martins Innus, Cynthia Cornelius