Fast dimensional analysis for root cause analysis at scale
以下为 快照 页面，建议前往来源网站查看，会有更好的阅读体验。
Nikolay Pavlovich Laptev Fred Lin Keyur Muzumdar Mihai-Valentin Curelea
What the research is:
A fast dimensional analysis (FDA) framework that automates root cause analysis on structured logs with improved scalability. When a failure event happens in a large-scale distributed production environment, performing root cause analysis can be challenging. Various hardware, software, and tooling logs are often maintained separately, making it difficult to detect issues across multiple logs. Additionally, at scale, there could easily be millions of entities, each with hundreds of features, making it difficult to debug issues.
Our proposed FDA framework combines structured logs from a number of sources and provides a meaningful combination of features. That information arms engineers with actionable insights and helps them determine where to begin their investigation. And improved Apriori/FP-Growth algorithms sustain analysis at Facebook scale.In the figure above, our system finds highly correlated features that explain the exception spike (red line).
How it works:
The FDA framework first fetches structured logs from various sources. Log data is deduplicated at query time (deduping significantly improves algorithm performance). Each duplicated row will have a samples column that counts original row frequency. The resulting data is then one-hot encoded (a Boolean 0/1 value, depending on whether a given feature is present in a given row) to transform it into a schema that fits the frequent pattern mining formulation. Frequent pattern mining is applied to identify frequent item-sets.
Support and lift are metrics that are used for filtering. Support of feature X with respect to data set D refers to the portion of transactions that contain X within D. Support has a downward closure property, which is important because downward closure implies that all subsets of a frequent item-set are also frequent, and all supersets of an infrequent item-set can be safely pruned because they will never be frequent.Formal definition of support and lift used by the fast dimensional analysis framework during the pruning of rules.
To measure significance or importance, we used lift, which measures how much more likely X and Y are to occur together relative to if they were independent. A lift value of 1 means independence between X and Y, and a value greater than 1 signifies dependence. Due to the focus on scalability, we propose pre- and post-processing, parallelism, and filters to further speed up the analysis and improve interpretability.
Why it matters:
During system outages, it is important for engineers to accurately isolate the problem and quickly find a solution. As we’ve mentioned, the challenges of performing root cause analysis in a large-scale distributed production environment make outage detection and mitigation difficult. We have successfully implemented this approach for a large-scale infrastructure environment and hope this work motivates further research and automation for similarly large and complex environments.The proposed framework for the fast dimensional analysis.
Read the full paper:
We’d like to thank Seunghak Lee and Sriram Sankar for their contributions to this work.
Root Cause Analysis (RCA) or simply “Root Cause” are terms often used when troubleshooting enterprise application behavior. A quick web...
Summary Millions of businesses rely on Stripe. Reliability is a top priority for us and we invest in it heavily. However, on 2019-07-10 from 16:35 to 17:02 UTC, and again from 21:14 to 22:47 UTC, the Stripe API w...
Instagram Server is entirely Python powered. Well, mostly. There’s also some Cython, and our dependencies include a fair amount of C++ code exposed to Python as C extensions. Our server app is a monolith, o...
What bugs cause production cloud incidents? Liu et al., HotOS’19 Last time out we looked at
Robin Sloan’s post about how an app can be a home-cooked meal really resonated with me, and it’s something I’ve been doing for a long time. Like a whittler...
Here’s a troubling fact. Aself-driving car hurtling along the highway and weaving through traffic has less understanding of what might cause an accident than a child who’s just learning to walk. A new experime...
vizuka - Explore high-dimensional datasets and how your algo handles specific regions.
kepler-mapper - KeplerMapper is a Python class for visualization of high-dimensional data and 3-D point cloud data.
readme.md "To deal with hyper-...
GitHub is where people build software. More than 28 million people use GitHub to discover, fork, and contribute to over 85 million projects.