AI-assisted, Online Root Cause Analysis of Faults in Enterprise Systems

31 October 2025

Krishna Kant
Department of Computer & Information Sciences, Temple University, Philadelphia, PA

Meeting Room 6th Floor
Dipartimento di Ingegneria dell'Informazione
Largo L. Lazzarino 1

Abstract:
In this talk, I will present our work on online diagnosis of misconfigurations in large enterprise systems, with a particular focus on networks. Misconfigurations are ubiquitous in large computing environments and account for most performance and security issues. Our goal is to identify the misconfigured configuration variables online, starting from a user complaint about some system misbehavior. We designed an LLM conversation agent that holds a multi-step dialog with the user to obtain relevant details and automatically constructs a “trouble ticket”, which is then used to start a sequential testing procedure that combines our ad hoc testing algorithm with AI-suggested tests to achieve fast diagnosis. The mechanism uses only widely available tests and can thus be deployed directly in real enterprise systems. We also extend the methodology to geographically distributed enterprises, where the visibility from any given vantage point is limited and the diagnosis could suffer from the usual distributed-systems issues of lost, delayed, or duplicate updates. We demonstrate that in all cases the diagnosis takes under 15 seconds, even in large systems, and that the faulty configuration parameter is identified accurately in 93% of the cases. Furthermore, combining the algorithmically suggested and AI-suggested tests substantially reduces the number of required tests, to well below the number needed by human experts.

Bio:
Krishna Kant is a professor in the Computer and Information Science Department at Temple University in Philadelphia, PA. Previously, he was a research professor in the Center for Secure Information Systems (CSIS) at George Mason University. From 2008 to 2013, he served as a program director at the National Science Foundation (NSF), where he managed the Computer Systems Research (CSR) program and was instrumental in the development and running of an NSF-wide sustainability initiative called SEES (Science, Engineering, and Education for Sustainability). Earlier, he served at Intel Corporation for 11 years, working on a variety of data center architecture and technology issues. From 1991 to 1997, he held a consultant position at Ericsson (formerly Bellcore) and worked on many broadband and narrowband telecommunications technologies. Before 1991, he was an Associate Professor of Computer Science at the Pennsylvania State University, with research contributions in performance modeling and distributed systems. He received his Ph.D. degree in Mathematical Sciences from the University of Texas at Dallas in 1981. He has extensive experience in academia, industry, and government and has published in a wide variety of areas in computer science. He has authored a graduate textbook on performance modeling of computer systems and coedited several other books. He is a Fellow of the IEEE and an IEEE Distinguished Visitor.