
Log Analysis and Fault Tolerance

The high-performance computing (HPC) community is deeply concerned about the high failure rates expected in extreme-scale machines. Efficient fault-tolerance strategies are essential if future supercomputers are to remain useful. This project aims to analyze different types of log files from supercomputers and to apply that knowledge to improve fault-tolerance techniques for HPC.

To achieve the final goal, we will rely on the following methods:

  • Log Analysis: apply data mining and statistical techniques to uncover hidden patterns in failure log files and job submission log files.
  • Modeling: derive formulas that describe the different variables of a failure-prone supercomputing environment.
  • Simulation: design a set of representative scenarios to compare competing techniques for reducing the cost that failures add to an application's total execution time.
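As a rough illustration of how the log analysis and modeling steps connect, the sketch below estimates the mean time between failures (MTBF) from a list of failure timestamps. It is only a minimal example under stated assumptions: the timestamps, their format, and the function names are hypothetical and do not come from the project's code or from any real log. The exponential failure model mentioned in the comments is a common simplifying assumption, not necessarily the model used in this project.

```python
from datetime import datetime
from statistics import mean

# Hypothetical failure timestamps for illustration only.
# Real failure logs would need their own parser and format string.
SAMPLE_FAILURES = [
    "2014-01-01 00:00:00",
    "2014-01-01 06:00:00",
    "2014-01-01 08:00:00",
    "2014-01-02 00:00:00",
]


def interarrival_hours(timestamps):
    """Return the gaps, in hours, between consecutive failures."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(times, times[1:])]


def estimate_mtbf(timestamps):
    """Estimate the MTBF in hours as the mean gap between failures.

    If failures are assumed to follow an exponential model, this mean is
    also the maximum-likelihood estimate of the model's mean.
    """
    gaps = interarrival_hours(timestamps)
    if not gaps:
        raise ValueError("need at least two failure timestamps")
    return mean(gaps)


if __name__ == "__main__":
    print("Estimated MTBF (hours):", estimate_mtbf(SAMPLE_FAILURES))
```

An estimate like this could then serve as an input parameter to a simulation scenario, for example when comparing how different fault-tolerance techniques behave under the same failure rate.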

This project is a collaboration between faculty of the Computer Science Department and the Center for Simulation and Modeling (SaM).

Documentation

  • A very short presentation on one of the goals of the project: making the case for automatic restart (PDF).
  • A paper characterizing jobs on Jaguar (PDF).

Moab Log File Format

Other Failure Data Repositories

Code Repository

Execute the following command to clone the Git repository:

git clone git@web.sam.pitt.edu:logft

For more information on the code repository, please visit this page.

Last modified on 10/08/14 08:54:57
