Published April 27, 2015
CCR develops monitoring tools for NSF's XSEDE program
In 2010, CCR was awarded a prestigious 5-year NSF Technology Audit Service for XSEDE grant to develop an active set of tools and services to monitor (audit) the XSEDE (formerly TeraGrid) cyberinfrastructure.
XSEDE is the world’s largest distributed cyberinfrastructure for open scientific research, and as such plays a crucial role in advancing simulation-based engineering and science in the United States. As NSF develops its strategy for XSEDE and the continued investment in advanced cyberinfrastructure to support scientific research, it is important that the process be informed by reliable, extensive usage and performance data. Until recently, this would have been difficult. However, through the NSF Technology Audit Service award, CCR has developed the XDMoD (XD Metrics on Demand) tool to make this sort of data and data analysis readily accessible.
XDMoD is a comprehensive auditing framework for use by high-performance
computing centers, which provides metrics regarding resource
utilization, resource performance, application performance, quality
of service, and impact on scholarship and research. In
addition to the XSEDE version of XDMoD, an open source version (Open
XDMoD), targeted at academic and industrial HPC centers, has
also been developed and is available for download at http://xdmod.sourceforge.net/.
The XDMoD and Open XDMoD frameworks include a computationally
lightweight application kernel auditing system that utilizes
performance kernels to measure overall system performance.
This allows continuous resource auditing to measure all aspects of
system performance including file-system performance, processor and
memory performance, and network latency and
bandwidth. The frameworks also provide job-level
performance data for every job running on the cluster (without the
need to recompile the application codes) and therefore provide
system personnel with the ability to identify poorly performing
codes and subsequently tune them for optimal performance. XDMoD and
Open XDMoD are designed to meet the following objectives:
(1) provide the user community with a tool to more effectively and efficiently use their allocations and optimize their use of HPC resources,
(2) provide operational staff with the ability to monitor, diagnose, and tune system performance as well as measure the performance of all applications running on their system,
(3) provide software developers with the ability to easily
obtain detailed analysis of application performance to aid in
optimizing code performance,
(4) provide stakeholders with a diagnostic tool to facilitate HPC planning and analysis, and
(5) provide metrics to help measure scientific impact.
While XDMoD and Open XDMoD have made reporting a much simpler
and less time-consuming task, the range of metrics available has
also provided insight into the operation of XSEDE and HPC resources
that was not readily available, and in some cases not even possible.
This work was sponsored by NSF under grant number OCI 1025159 for the development of a technology audit service for XSEDE.
PI: Tom Furlani (CCR)