Published June 19, 2014
CCR develops monitoring tools for NSF's XSEDE program
In 2010, CCR was awarded a prestigious 5-year NSF Technology Audit Service grant to develop an active set of tools and services to monitor (audit) the XSEDE (formerly TeraGrid) cyberinfrastructure.
XSEDE is the world’s largest distributed cyberinfrastructure for open scientific research, and as such plays a crucial role in advancing simulation-based engineering and science in the United States. As NSF develops its strategy for XSEDE and for continued investment in advanced cyberinfrastructure to support scientific research, it is important that the process be informed by reliable, extensive usage and performance data. Until recently, such data would have been difficult to obtain. Through the NSF Technology Audit Service award, however, CCR has developed the XDMoD (XSEDE Metrics on Demand) tool to make this data and its analysis readily accessible.
XDMoD (https://xdmod.ccr.buffalo.edu) is a comprehensive auditing framework for high-performance computing centers that provides metrics on resource utilization, resource performance, and impact on scholarship and research. This role-based framework is designed to meet the following objectives:
(1) provide the user community with a tool to more effectively and efficiently use their allocations and optimize their use of resources,
(2) provide operational staff with the ability to monitor and tune resource performance,
(3) provide management with a diagnostic tool to facilitate cyberinfrastructure (CI) planning and analysis as well as monitor resource utilization and performance, and
(4) provide metrics to help measure scientific impact.
XDMoD has made reporting a much simpler and less time-consuming task. The range of metrics available has also provided insight into the operation of XSEDE that was previously not readily available and, in some cases, not possible to obtain.
The XDMoD framework includes a computationally lightweight application kernel auditing system that uses performance kernels to measure overall system performance. This allows continuous auditing of many aspects of system performance, including file-system performance, processor and memory performance, and network latency and bandwidth.
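As a rough illustration of the idea behind lightweight performance kernels, the sketch below times two small, fixed workloads (an in-memory copy and a synced file write) and reports the best wall-clock time for each. It is not XDMoD's actual implementation: the function names, workload sizes, and the choice of these two kernels are all assumptions made for this example.

```python
import os
import tempfile
import time


def time_kernel(kernel, repeats=3):
    """Run a kernel several times; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - start)
    return best


def memory_copy_kernel(size_bytes=8 * 1024 * 1024):
    """Hypothetical memory kernel: copy a fixed-size buffer once."""
    src = bytearray(size_bytes)
    dst = bytes(src)  # forces a full copy of the buffer
    assert len(dst) == size_bytes


def file_write_kernel(size_bytes=1024 * 1024):
    """Hypothetical file-system kernel: write a fixed-size file and sync it."""
    with tempfile.NamedTemporaryFile() as handle:
        handle.write(b"\0" * size_bytes)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to storage


def run_audit():
    """Time every kernel and return a mapping of kernel name to seconds."""
    kernels = {
        "memory_copy": memory_copy_kernel,
        "file_write": file_write_kernel,
    }
    return {name: time_kernel(fn) for name, fn in kernels.items()}


if __name__ == "__main__":
    for name, seconds in run_audit().items():
        print(f"{name}: {seconds:.6f} s")
```

Run on a schedule, a script of this kind would yield a time series per resource, so that a sudden slowdown in one kernel points to the subsystem (memory or file system, in this sketch) that has degraded.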
This work was sponsored by NSF under grant number OCI 1025159 for the development of a technology audit service for XSEDE.
PI: Tom Furlani (CCR)