There are some companies like Zebrium around saying that they get an AI to do root cause analysis of logs. I think root cause analysis is essentially a multi-class classification problem. So I check for research on classifying logs but data sets and works I manager to find are all binary classifications (normal vs abnormal).
Program: https://github.com/wuyifan18/DeepLog
Dataset: https://github.com/logpai/loghub
Can anyone enlighten me by providing resources on multi-class (root cause analysis) rather than binary class (anomaly detection)? Many thanks.
Related
I'm looking for metrics to compare various regressions models (e.g. SVM, Decision Tree, Neural Network etc), to decide the merits of each for solving a specific problem.
For my problem I have just over 80,000 training samples with 12 variables, all of which are independent and identically distributed.
I've done most of my research into neural networks but I'm drawing a blank when trying to compare them against other models.
Any input (including reading suggestions) would be greatly appreciated, thanks!
You can compare regression models by calculating the mean squared error for each model over a test set. The best model will simply be the one with the least error.
Sadly, there ist nothing like roc curves for regression models. Except your output is a binary variable like with logistic regression.
I am experimenting with classification using neural networks (I am using tensorflow).
And unfortunately the training of my neural network gets stuck at 42% accuracy.
I have 4 classes, into which I try to classify the data.
And unfortunately, my data set is not well balanced, meaning that:
43% of the data belongs to class 1 (and yes, my network gets stuck predicting only this)
37% to class 2
13% to class 3
7% to class 4
The optimizer I am using is AdamOptimizer and the cost function is tf.nn.softmax_cross_entropy_with_logits.
I was wondering if the reason for my training getting stuck at 42% is really the fact that my data set is not well balanced, or because the nature of the data is really random, and there are really no patterns to be found.
Currently my NN consists of:
input layer
2 convolution layers
7 fully connected layers
output layer
I tried changing this structure of the network, but the result is always the same.
I also tried Support Vector Classification, and the result is pretty much the same, with small variations.
Did somebody else encounter similar problems?
Could anybody please provide me some hints how to get out of this issue?
Thanks,
Gerald
I will assume that you have already double, triple and quadruple checked that the data going in is matching what you expect.
The question is quite open-ended, and even a topic for research. But there are some things that can help.
In terms of better training, there's two normal ways in which people train neural networks with an unbalanced dataset.
Oversample the examples with lower frequency, such that the proportion of examples for each class that the network sees is equal. e.g. in every batch, enforce that 1/4 of the examples are from class 1, 1/4 from class 2, etc.
Weight the error for misclassifying each class by it's proportion. e.g. incorrectly classifying an example of class 1 is worth 100/43, while incorrectly classifying an example of class 4 is worth 100/7
That being said, if your learning rate is good, neural networks will often eventually (after many hours of just sitting there) jump out of only predicting for one class, but they still rarely end well with a badly skewed dataset.
If you want to know whether or not there are patterns in your data which can be determined, there is a simple way to do that.
Create a new dataset by randomly select elements from all of your classes such that you have an even number of all of them (i.e. if there's 700 examples of class 4, then construct a dataset by randomly selecting 700 examples from every class)
Then you can use all of your techniques on this new dataset.
Although, this paper suggests that even with random labels, it should be able to find some pattern that it understands.
Firstly you should check if your model is overfitting or underfitting, both of which could cause low accuracy. Check the accuracy of both training set and dev set, if accuracy on training set is much higher than dev/test set, the model may be overfiiting, and if accuracy on training set is as low as it on dev/test set, then it could be underfitting.
As for overfiiting, more data or simpler learning structures may work while make your structure more complex and longer training time may solve underfitting problem
EDIT1: My code is the same as here, https://github.com/tensorflow/models/blob/master/inception/inception. The only difference is that I pack my files into TFRecords and feed it bactch wise. Also, the ratio of Class 0 : Class 1 is 70:30.
I'm currently working on a project in which I'm making use of inception-V3 CNN model to train a classifier. Currently, I am working on a binary classifier (either predict 1 or 0) but, my model only predicts class 0 for everything. While troubleshooting I've found that the probability of prediction is 100% for class 0 all the time. I have verified everything from the input queuing system to the eval and testing, everything seems to be working well too.
Strangely, the loss value reduces in a perfect semi-parabolic fashion which makes me think that the loss has converged to a local minima. Upon testing the script only churns out class 0(with 100% probability) each time. Another thing I've noticed is that the activation across various Conv layers are always constant which could imply that the neurons are just not firing at all.
My question is,
1. Is my model working ? The loss seems to converge but the activation across various layers seems to be stagnant.
2. I am using the training code available from the models section of the tensorflow (https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py)
I am reusing the train, eval and supporting code to train my model with a custom input pipeline created by me (which is also working). Can someone help guide me in the right direction on this?
Thanks.
Ik I am a little late in answering this question 😅
First of all your model didn't learn anything at all. All it did (cleverly 😂) was to predict class 0 for all cases so that it achieves a baseline accuracy of 70% without any effort. (Probably the model was lazy 😪😋) JK. This is a very well known problem in machine learning. This is called as class imbalance problem. Refer this http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.
Apart from the techniques mentioned there. The one technique that works wonders is using class weights. That is, basically telling the network to be biased towards the weaker class. In your case class weights will be class0:class1 = 3:7. This is a hyperparameter too! But this is a good point to start.
Moreover, you didn't give any info about your dataset size. Whether you are fine tuning or training from scratch. Without them it's hard to speculate. By default I would suggest fine tuning.
Moreover, by loss you mean the training loss or validation loss? Because training loss has literally no info regarding the performance of the model. Moreover, in my opinion both training loss, and validation losses have very little info to derive meaningful insights about the model's performance. Use other metrics like confusion matrix,f1 score, recall, precision etc.
Finally, there is absolutely no single answer to your question. The only way is the hard way - you will learn along with the model 😉. Because I consider training a NN especially a CNN an art. In which, intuition plays a very crucial role coz, most of the times, the least expected changes would give the best results. Anyway that's the fun part of training a NN.
Happy training 💪
P.S: Try using the visualisation tools like gradcam to know whether the model is looking at the correct part of the image for classification. This is very important!
I'm new to Topic Models, Classification, etc… now I'm already a while doing a project and read a lot of research papers. My dataset consists out of short messages that are human-labeled. This is what I have come up with so far:
Since my data is short, I read about Latent Dirichlet Allocation (and all it's variants) that is useful to detect latent words in a document.
Based on this I found a Java implementation of JGibbLDA http://jgibblda.sourceforge.net but since my data is labeled, there is an improvement of this called JGibbLabeledLDA https://github.com/myleott/JGibbLabeledLDA
In most of the research papers, I read good reviews about Weka so I messed around with this on my dataset
However, again, my dataset is labeled and therefore I found an extension of Weka called Meka http://sourceforge.net/projects/meka/ that had implementations for Multi-labeled data
Reading about multi-labeled data, I know most used approaches such as one-vs-all and chain classifiers...
Now the reason me being here is because I hope to have an answer to following questions:
Is LDA a good approach for my problem?
Should LDA be used together with a classifier (NB, SVM, Binary Relevance, Logistic Regression, …) or is LDA 'enough' to function as a classifier/estimator for new, unseen data?
How do I need to interpret the output coming from JGibbLDA / JGibbLabeledLDA. How do I get from these files to something which tells me what words/labels are assigned to the WHOLE message (not just to each word)
How can I use Weka/Meka do get to what I want in previous question (in case LDA is not what I'm looking for)
I hope someone, or more than one person, can help me figure out how I need to do this. The general idea of all is not the issue here, I just don't know how to go from literature to practice. Most of the papers don't give enough description of how they perform their experiments OR are too technical for my background about the topics.
Thanks!
I have a dataset of a particular domain (say sports - 1 class). What I want to do is when I fed a web page to the classifier/clusterer I want to get a result whether that instance (web page) is related to sports or not.
Most of the classifiers in weka are not capable of dealing with unary class datasets except the LibSVM (wrapper). I did some tests with the LibSVM, but the problem is during tests on a unrelated dataset, I get all of them correctly classified, even if the instances are empty! Any suggestions?
What if I use the cosine similarity measure here?
Have you seen this thread unary class text classification in weka? and this post https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2007-October/011631.html ?
I'm assuming you meant that when you run the classifier against another dataset that is not "sports" it gets the results incorrectly classified (i.e. false positives) e.g. "this is sports".
Are you certain your dataset only contains one class? Did you make sure the dataset does not contain any empty instances? (don't mock, this has happened to me before).
In the comments of the previously mentioned thread there is a linked to a PDF on tuning SVM: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf - I would say SVMs are a bit harder than other common classifiers.
As an alternative, can't you switch the problem to binary classification? It's much easier to get good results and for most problems there are plenty of examples of things that are not in that class e.g. sports websites vs funny image web sites, programming websites, etc ...
PS: you can use other algorithms for outlier detection: http://en.wikipedia.org/wiki/Outlier_detection