Algorithms for correlation of events/issues - machine-learning

We are working on a system that helps development, SRE, and DevOps team members by debugging many of the well-known infrastructure issues (Kubernetes to begin with) on their behalf and generating a detailed report that describes the specifics of the issue, possible root causes, and clear next steps for the users facing the problem. In short, instead of you having to open a terminal and run several commands to pin down an issue, the system does it for you and shows the result in a neat UI. We plan to leverage AI to provide a better user experience.
Questions:
1. There are several potential use cases such as predictive analytics, anomaly detection, forecasting, etc. We will not analyze application logs or metrics (we may include metrics in the future). Unlike application-level logs, the platform logs are more uniform. What is a good starting point for using AI, especially on platform-level logs?
2. We plan to use AI to analyze issue correlations. We ran Apyori and FP-Growth and got output that looks like this:
| antecedent                 | consequent         | confidence | lift |
|----------------------------|--------------------|------------|------|
| [Failed, FailedScheduling] | [BackOff]          | 0.75       | 5.43 |
| [NotTriggerScaleUp]        | [FailedScheduling] | 0.64       | 7.29 |
| [Failed]                   | [BackOff]          | 0.52       | 3.82 |
| [FailedCreatePodSandBox]   | [FailedScheduling] | 0.51       | 5.88 |
FP-Growth is a data-mining algorithm, and from its output we can see patterns among events. One potential use case is to save the previous output and compare it with the latest output to detect abnormal patterns in the latest run. Can we use this output to infer issue correlations, and are there any other scenarios where it is useful?
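For reference, a rough sketch of the kind of pipeline that produces a table like the one above, and of the save-and-compare idea, assuming Python with mlxtend (just one possible library alongside Apyori; the transactions below are invented):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Each "transaction" is the set of event reasons seen together, e.g. within one
# time window or for one workload. These example transactions are made up.
transactions = [
    ["FailedScheduling", "NotTriggerScaleUp"],
    ["Failed", "BackOff", "FailedScheduling"],
    ["FailedCreatePodSandBox", "FailedScheduling"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
itemsets = fpgrowth(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.0)

# Save-and-compare idea: rules present in the latest run but absent from a
# previously saved baseline are candidates for "abnormal" patterns to surface.
baseline = set()  # e.g. (antecedents, consequents) pairs loaded from the last run
new_rules = [r for _, r in rules.iterrows()
             if (r["antecedents"], r["consequents"]) not in baseline]
```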
3. Some logs seem unrelated but are actually connected. For example, when a host has an issue it will affect the applications running on it, and the time span between cause and effect may be long. How can we discover this kind of relationship?
Any comments and suggestions will be greatly appreciated, thank you in advance.

Related

How to set up an alert on the Machine Learning Toolkit for historical data

I am working on a Splunk time-series forecasting PoC and need to show how Splunk sends an alert when the prediction returns a result above a threshold.
The search is | inputlookup internet_traffic.csv | timechart span=120min avg("bits_transferred") as bits_transferred | eval bits_transferred=round(bits_transferred). If the predicted bits_transferred goes above the condition given in the alert, an email should be sent to the specified address.
Currently the condition given is per result of the search.
Kindly let me know how to set up the alert, or which condition to use.
I'll give you an example using the Splunk core function predict, but you should be able to apply it to the Machine Learning Toolkit as well:
| inputlookup internet_traffic.csv | timechart span=120min avg("bits_transferred") as bits_transferred | eval bits_transferred=round(bits_transferred) | predict bits_transferred | where bits_transferred > 'upper95(prediction(bits_transferred))'
The Machine Learning Toolkit actually has a showcase example that you can tweak which illustrates detecting anomalies with MLTK:
/en-US/app/Splunk_ML_Toolkit/detect_numeric_outliers?ml_toolkit.dataset=Employee%20Logins%20(prediction%20errors)

Keep Gcov Test name in GCDA files

After having measured test coverage on my product using lcov (for C++ development), I'd like to build a matrix showing the correspondence between each test name and the files it covers.
The idea is to have a quick view of the code covered by 1 test file.
e.g.:
| xxxx   | file 1 | file 2 | file 3 | file 4 | file 5 |
|--------|--------|--------|--------|--------|--------|
| test 1 | YES    | NO     | YES    | YES    | YES    |
| test 2 | YES    | NO     | NO     | NO     | NO     |
| test 3 | YES    | YES    | NO     | NO     | YES    |
In my project, I need to run thousands of tests to check the coverage of thousands of files, so the matrix will be huge.
Unfortunately, it seems that by design gcov does not work this way: there is only one set of .gcda files covering the whole code, and it does not look possible to determine which test covers which part of the code.
The only solution I could imagine is the following one:
for current_test in all_tests; do
    lcov --zerocounters --directory .                                  # reset the counters left by the previous test
    run "$current_test"                                                # run the single test
    lcov --capture --directory . --output-file "$current_test.info"   # retrieve gcda -> .info file
    grep '^SF:' "$current_test.info"                                   # extract the names of the covered code files
    # append current_test / covered file names to the matrix
done
The problem is that this will be extremely slow: at around 5 minutes per test, I would spend weeks waiting.
Any idea would be very welcomed.
Thanks a lot for your help.
Regards,
Thomas
Unfortunately the gcov data does not include test names, and they must be added in post-processing. Therefore, your sequential loop is the sensible approach if you stay within gcov-based coverage collection.
Workarounds you can try:
Run your tests with an appropriate GCOV_PREFIX variable so that the coverage is written into a different directory, rather than next to your object files.
Use a different coverage tool. E.g. kcov performs runtime instrumentation and writes the coverage results into a directory you specify. However, the coverage data formats are not usable for gcov-based tools.
Distribute your tests across multiple machines.
My guess is that GCOV_PREFIX is likely to work in your scenario so that you can easily run your tests in parallel. This variable is a bit fiddly because you need to know the absolute paths of your object files, but it's probably easier to figure that out than it is to wait multiple days for your coverage matrix.
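A minimal sketch of the GCOV_PREFIX idea, in Python for illustration; the test binary names and the /tmp/coverage output root are placeholders:

```python
import os
import subprocess

# Run each test with its own GCOV_PREFIX so the .gcda files land in a per-test
# directory; the tests can then safely run in parallel and be post-processed
# with lcov one directory at a time.
tests = ["./test_foo", "./test_bar"]  # placeholder test executables

for test in tests:
    env = dict(os.environ,
               GCOV_PREFIX=f"/tmp/coverage/{os.path.basename(test)}",  # per-test output root
               GCOV_PREFIX_STRIP="0")  # strip no leading directories, i.e. keep full object paths
    subprocess.run([test], env=env, check=True)
```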

Optimizing repeated transformations in Apache Beam/DataFlow

I wonder if Apache Beam / Google Dataflow is smart enough to recognize repeated transformations in the dataflow graph and run them only once. For example, if I have two branches:
p | GroupByKey() | FlatMap(...)
p | combiners.Top.PerKey(...) | FlatMap(...)
both will involve grouping elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually ensure that a single GroupByKey() precedes all branches where its result gets used?
As you may have inferred, this behavior is runner-dependent. Each runner implements its own optimization logic.
The Dataflow Runner does not currently support this optimization.
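One way to avoid the duplicated grouping yourself is to apply GroupByKey() once and fan out from its result. A minimal Python SDK sketch, with made-up stage names and toy data:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    keyed = p | "Read" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])

    # Group once, then reuse the grouped PCollection in both branches instead of
    # relying on the runner to de-duplicate two separate shuffles.
    grouped = keyed | "GroupOnce" >> beam.GroupByKey()

    sums = grouped | "SumBranch" >> beam.FlatMap(lambda kv: [(kv[0], sum(kv[1]))])
    tops = grouped | "TopBranch" >> beam.FlatMap(lambda kv: [(kv[0], max(kv[1]))])
```

Note that combiners.Top.PerKey does its own combining per key, so this restructuring only helps where both branches can be expressed on top of the same grouping.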

Can Machine Learning help classify data

I have a data set as below,
| Code   | Description     |
|--------|-----------------|
| AB123  | Cell Phone      |
| B467A  | Mobile Phone    |
| 12345  | Telephone       |
| WP9876 | Wireless Phone  |
| SP7654 | Satellite Phone |
| SV7608 | Sedan Vehicle   |
| CC6543 | Car Coupe       |
I need to create an automated grouping based on the Code and Description. Let's assume I already have a lot of such data classified into groups 0-99. Whenever a new record comes in with a Code and Description, the machine-learning algorithm needs to classify it automatically based on the previously available data.
| Code   | Description     | Group |
|--------|-----------------|-------|
| AB123  | Cell Phone      | 1     |
| B467A  | Mobile Phone    | 1     |
| 12345  | Telephone       | 1     |
| WP9876 | Wireless Phone  | 1     |
| SP7654 | Satellite Phone | 1     |
| SV7608 | Sedan Vehicle   | 2     |
| CC6543 | Car Coupe       | 3     |
Can this be achieved with some level of accuracy? Currently this process is entirely manual. If there are any ideas or references for this, please share them.
Try reading up on Supervised Learning. You need to provide labels for your training data so that the algorithms know what are the correct answers - and are able to generate appropriate models for you.
Then you can "predict" the output classes for your new incoming data using the generated model(s).
Finally, you may wish to circle back to check the accuracy of the predicted results. If you then enter the labels for the newly received and predicted data, those data can be used for further training of your model(s).
Yes, it's possible with supervised learning. You pick yourself a model which you "train" with the data you already have. The model/algorithm then "generalizes" to previously unseen data from the known data.
What you specify as a group would be called class or "label" which needs to be predicted based on 2 input features (code/description). Whether you input these features directly or preprocess them into more abstract features which suits the algorithm better, depends on which algorithm you choose.
If you have no experience with Machine Learning, you might start with learning some basics while testing already implemented algorithms in tools such as RapidMiner, Weka or Orange.
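As a concrete illustration of the supervised approach described above, here is a minimal sketch assuming Python and scikit-learn (neither of which the question specifies), using only the description text as the input feature:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training data: the already-classified rows (descriptions and their groups).
descriptions = ["Cell Phone", "Mobile Phone", "Telephone", "Wireless Phone",
                "Satellite Phone", "Sedan Vehicle", "Car Coupe"]
groups = [1, 1, 1, 1, 1, 2, 3]

# Character n-grams cope reasonably well with short, noisy descriptions.
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      LogisticRegression(max_iter=1000))
model.fit(descriptions, groups)

# Classify a new, unseen description based on the previously available data.
print(model.predict(["Smart Phone"]))
```

In practice you would train on far more than seven rows, and you could add the Code column as an extra feature if it carries any signal.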
I don't think machine-learning methods are the most appropriate solution for this problem, because text-based machine-learning algorithms tend to be quite complicated. From the examples you provided I'm not sure how
I think the simplest way of solving, or attempting to solve, this problem is the following, which can be implemented in many free programming languages such as Python. Each description can be stored as a string. You could collect, in a list, all the substrings of all the strings that belong to a particular group (if 'Phone' is your string, the substrings are 'P', 'h', 'Ph', ..., 'e'; see this question for how to implement it in Python: Substrings of a string using Python). Then, for each stored substring, check which ones are unique to a certain group, and keep only substrings over a certain length (say 3 characters, to get rid of random letter combinations) as your classification criteria. When new data comes in, check whether its description contains a substring unique to a certain group. With this, for instance, you would be able to classify all objects in group 1 based on whether their description contains the word "phone".
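A rough sketch of that substring idea, assuming Python; the labelled examples are taken from the question and the 3-character minimum follows the suggestion above:

```python
def substrings(text, min_len=3):
    """All substrings of `text` that are at least `min_len` characters long."""
    text = text.lower()
    return {text[i:j] for i in range(len(text))
            for j in range(i + min_len, len(text) + 1)}

labelled = {"Cell Phone": 1, "Mobile Phone": 1, "Sedan Vehicle": 2, "Car Coupe": 3}

# Collect the substrings seen in each group.
by_group = {}
for description, group in labelled.items():
    by_group.setdefault(group, set()).update(substrings(description))

# Keep only the substrings that occur in exactly one group.
unique = {g: subs - set().union(*(s for other, s in by_group.items() if other != g))
          for g, subs in by_group.items()}

def classify(description):
    subs = substrings(description)
    for group, markers in unique.items():
        if subs & markers:
            return group
    return None  # no group-unique substring matched

print(classify("Wireless Phone"))  # 1, via the substring "phone"
```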
It's hard to provide concrete code that solves your problem without knowing which languages you are familiar with or are feasible for you to use. I hope this helps anyway. Yves

Automatic people counting + twittering

I want to develop a system that accurately counts people going through a normal 1-2 m wide door, tweets whenever someone goes in or out, and reports how many people remain inside.
Now, the Twitter part is easy, but the people counting is difficult. There are some existing counting solutions, but they do not quite fit my needs.
My idea/algorithm:
Should I mount an infrared camera on top of my door, monitor it constantly, divide the camera image into a grid, and work out who enters and who leaves?
Can you give me some suggestions and a starting point?
How about having two sensors about 6 inches apart? They could be those little beam sensors (you know, the ones that chime when you walk into some shops) placed on either side of the door jamb. We'll call the sensors S1 and S2; a small code sketch of the counting logic follows the diagram below.
If they are triggered in the order of S1 THEN S2 - this means a person came in
If they are triggered in the order of S2 THEN S1 - this means a person left.
-----------------------------------------------------------
| sensor | door jam | sensor |
-----------------------------------------------------------
| |
| |
| |
| |
S1 S2 this is inside the store
| |
| |
| |
| |
-----------------------------------------------------------
| sensor | door jam | sensor |
-----------------------------------------------------------
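Here is a minimal sketch of that S1-then-S2 ordering logic in Python; the sensor callback wiring, the 1-second pairing window, and the print placeholder (where you would post the tweet) are all assumptions for illustration:

```python
import time

occupancy = 0
first_trigger = None  # (sensor_name, timestamp) of the first beam broken


def on_trigger(sensor):
    """Call with "S1" or "S2" from whatever polls or interrupts on the beam sensors."""
    global occupancy, first_trigger
    now = time.monotonic()
    if first_trigger and now - first_trigger[1] < 1.0:  # second beam within 1 second
        first, _ = first_trigger
        if first == "S1" and sensor == "S2":
            occupancy += 1                      # S1 then S2: a person came in
        elif first == "S2" and sensor == "S1":
            occupancy = max(0, occupancy - 1)   # S2 then S1: a person left
        first_trigger = None
        print(f"People inside: {occupancy}")    # here you would post the tweet
    else:
        first_trigger = (sensor, now)
```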
If you would like to have the people filmed by a camera you can try to segment the people in the image and track them using a Particle Filter for multi-object tracking.
http://portal.acm.org/citation.cfm?id=1561072&preflayout=flat
This is a paper by one of my professors. Maybe you wanna have a look at it.
If your camera is mounted and doesn't move, you can use a subtraction method to segment the moving people (basically just subtract two consecutive frames; what remains is the things that moved). Then do some morphological operations on it so that only big parts (people) stay. Maybe even identify them by checking for rectangularity, so you only keep "standing" objects.
Then use a Particle Filter to track the people in the scene automatically... And each new object would increase the counter...
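A minimal sketch of that frame-subtraction pipeline, assuming Python and OpenCV (my choice of tooling, not something the answer prescribes); the threshold, kernel size, and blob-area cutoff are made-up values to tune:

```python
import cv2

cap = cv2.VideoCapture(0)              # assumed fixed, door-mounted camera
ok, prev = cap.read()
assert ok, "could not read from the camera"
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)                       # what moved between frames
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # morphology: drop small noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    people = [c for c in contours if cv2.contourArea(c) > 5000]  # keep only big moving blobs
    prev_gray = gray
    # `people` would then be handed to the tracker (e.g. the particle filter above)
```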
If you want, I could send you a presentation I gave a while ago (unfortunately it's in German, but you can translate it).
Hope that helps...
