CCR has been at the forefront of the development of open source
tools for use by HPC centers to provide quantitative and
qualitative metrics relevant to HPC, including resource
utilization, resource performance, and impact on scholarship and
research. These tools help ensure the optimal
operation of such centers and their resources, and help
demonstrate the utility, service, competitive advantage, and return
on investment that these centers provide.
XDMoD (XSEDE Metrics on Demand) is an NSF-funded open source tool designed to audit and facilitate the utilization of the XSEDE cyberinfrastructure by providing a wide range of metrics on XSEDE resources, including resource utilization, resource performance, and impact on scholarship and research. The XDMoD framework (https://xdmod.ccr.buffalo.edu) is designed to meet the following objectives: (1) provide the user community with a tool to manage their allocations and optimize their resource utilization, (2) provide operational staff with the ability to monitor and tune resource performance, (3) provide management with a tool to monitor utilization, user base, and performance of resources, and (4) provide metrics to help measure scientific impact. While initially focused on the XSEDE program, future versions of XDMoD will be adaptable to any HPC environment.
The framework includes a computationally lightweight application kernel auditing system that utilizes performance kernels chosen from both low-level benchmarks and actual scientific and engineering applications to measure overall system performance from the user’s perspective. This allows continuous resource monitoring to measure all aspects of system performance including file-system, processor, and memory performance, and network latency and bandwidth. Current and past utilization metrics, coupled with application kernel-based performance analysis, can be used to help guide future cyberinfrastructure investment decisions, plan system upgrades, tune machine performance, improve user job throughput, and facilitate routine system operation and maintenance.
This work was sponsored by NSF under grant number OCI 1025159 for the development of a technology audit service for XSEDE.
UBMoD (UB Metrics on Demand) is an open source tool for collecting and mining statistical data from cluster resource managers (such as Torque, OpenPBS, and SGE) commonly found in high-performance computing environments. It was designed to meet the following objectives: (1) provide the user community with an easy-to-use tool to manage their accounts and optimize their use of resources, (2) provide staff with a diagnostic tool to monitor and tune resource performance for the benefit of the users, (3) provide senior management with a tool to easily monitor utilization, user base, and distribution of resources among decanal units, and (4) help ensure that the resources are effectively enabling research and scholarship.
Developed by the Center for Computational Research, UBMoD presents resource utilization, including CPU cycles consumed, total jobs, and average wait time, for individual users, research groups, departments, and decanal units. The web-based user interface provides a dashboard for displaying resource consumption along with fine-grained control over the time period and resources displayed. The data warehouse can easily be customized to support new resource managers. The information is presented in easy-to-understand charts and tables and provides system administrators, users, and directors of HPC centers with a rich set of metrics to better understand how their resources are being utilized. The current release includes the ability to apply custom tags to users and jobs and then filter all reports using those tags. This provides complete flexibility for organizing users into departments, projects, and groups. For example, users can be tagged as members of one or more projects, and reports can be dynamically generated for those projects. UBMoD can be downloaded from http://ubmod.sf.net/
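The tag-based reporting described above can be illustrated with a minimal sketch. It is not UBMoD's actual code or data model: the Job record, its field names, and the summarize_by_tag function are all hypothetical, chosen only to show how per-tag totals for jobs, CPU time, and average wait time might be derived once jobs carry one or more tags.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Job:
    """One completed job. Every field name here is hypothetical."""
    user: str
    cpu_hours: float      # CPU time consumed by the job
    wait_seconds: float   # time between submission and start
    tags: set = field(default_factory=set)


def summarize_by_tag(jobs):
    """Return {tag: {"jobs", "cpu_hours", "avg_wait_seconds"}}.

    A job with several tags is counted once under each of its tags,
    so per-tag totals may sum to more than the overall total.
    """
    totals = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0, "wait": 0.0})
    for job in jobs:
        for tag in job.tags:
            entry = totals[tag]
            entry["jobs"] += 1
            entry["cpu_hours"] += job.cpu_hours
            entry["wait"] += job.wait_seconds
    return {
        tag: {
            "jobs": t["jobs"],
            "cpu_hours": t["cpu_hours"],
            "avg_wait_seconds": t["wait"] / t["jobs"],
        }
        for tag, t in totals.items()
    }


# Made-up sample records, for illustration only.
jobs = [
    Job("alice", 12.0, 300.0, {"chemistry", "project-a"}),
    Job("bob", 4.0, 100.0, {"chemistry"}),
    Job("carol", 8.0, 600.0, {"project-a"}),
]
report = summarize_by_tag(jobs)
for tag in sorted(report):
    print(tag, report[tag])
```

Because a job may belong to more than one tag, the same record contributes to each matching report, which mirrors the idea of tagging users as members of one or more projects.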