I am creating an H2O autoencoder anomaly detection model in H2O Python. When calculating anomalies using test_rec_error = model.anomaly(test.hex, per_feature=False), I get one reconstruction error for each record. But when I try to predict (find anomalies) on any test data in H2O Flow, I get a reconstruction error per feature. Is there an option in H2O Flow to get only one reconstruction error (not per feature)?
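For context, a minimal sketch of my setup in the Python API (the file paths and network shape are illustrative):

import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()
train = h2o.import_file("train.csv")   # illustrative paths
test = h2o.import_file("test.csv")

model = H2OAutoEncoderEstimator(hidden=[10], epochs=5)
model.train(x=train.columns, training_frame=train)

per_record = model.anomaly(test, per_feature=False)   # one MSE per record
per_feature = model.anomaly(test, per_feature=True)   # one error per feature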
Also, what is the REST API endpoint for getting the reconstruction error from an anomaly model in H2O? Just as there is an API for scoring predictions on test data with classification models (POST /3/Predictions/models/{model}/frames/{frame}), I would like to know the REST API for getting the reconstruction error from an anomaly model in H2O.
Thanks in advance.
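For reference, a hedged sketch of what such a call might look like, assuming the same Predictions endpoint also serves autoencoder models; the reconstruction_error parameter is an assumption that should be verified against the REST API docs for your H2O version:

import requests

# Assumption: the standard Predictions endpoint accepts a reconstruction_error
# flag for autoencoder models; the model and frame ids below are illustrative.
resp = requests.post(
    "http://localhost:54321/3/Predictions/models/my_autoencoder/frames/test.hex",
    data={"reconstruction_error": "true"},
)
print(resp.json())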
I'm trying to perform multivariate time-series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on a test set that contains both normal and anomalous data. My understanding is that it would be wrong to tweak the model hyperparameters based on results from the test set.
What would the train/validate/test set look like to train and evaluate a time-series anomaly detector?
Nothing here is very specific to anomaly detection. You need to split the testing data into one or more validation sets and a test set, while making sure they are reasonably independent (no information leakage between them).
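A minimal sketch of such a split, assuming the labeled series is time-ordered (all sizes and arrays here are illustrative): tune hyperparameters on the validation block only, and touch the test block once at the end.

import numpy as np

train_normal = np.random.rand(5000, 3)      # normal-only data used for training
labeled = np.random.rand(2000, 3)           # normal + anomalous, time-ordered
labels = np.random.randint(0, 2, 2000)      # 1 = anomaly (illustrative)

# Contiguous blocks avoid information leakage across the split boundary.
split = len(labeled) // 2
X_val, y_val = labeled[:split], labels[:split]     # for hyperparameter tuning
X_test, y_test = labeled[split:], labels[split:]   # held out for final evaluation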
I have a doubt: after clustering with any algorithm, is it possible to segment new data based on what was learned from the previous data?
The issue is that clustering algorithms are unsupervised learning algorithms: they don't need a dependent variable to predict classes; they are used to find structures/similarities in the data points. What you can do is treat the clustered data as your supervised data.
The approach would be to cluster and assign labels in the train data, treat it as multi-class classification data, train a new multi-class classification model on it, and validate that model on the test data.
Let train and test be the datasets. In Python with scikit-learn (the choice of KMeans and RandomForestClassifier here is illustrative; any clusterer/classifier pair works):

from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

clusters = KMeans(n_clusters=5).fit_predict(train)      # label train via clustering
model = RandomForestClassifier().fit(train, clusters)   # treat labels as classes
prediction = model.predict(test)                        # classify the new data
Interestingly, however, KMeans in sklearn provides both fit and predict methods, so with KMeans you can predict on new data. DBSCAN, on the other hand, doesn't have predict, which is quite obvious from its working mechanism.
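A quick sketch of that difference (the data shapes are illustrative): KMeans keeps its learned centroids around, so predict simply assigns each new point to the nearest one.

from sklearn.cluster import KMeans
import numpy as np

X_train = np.random.rand(100, 2)   # illustrative training data
X_new = np.random.rand(10, 2)      # new, unseen points

km = KMeans(n_clusters=3).fit(X_train)
labels = km.predict(X_new)         # nearest learned centroid for each new point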
Clustering is an unsupervised mechanism where the number of clusters and the identity of the segments which need to be clustered are not known to the system.
Hence, what you can do is take the learning of a model that was trained for clustering, classification, identification, or verification, and apply that learning to your own clustering use case.
If the new data is from the same domain as the training data, you will most probably end up with better clustering accuracy. (You need to choose the clustering methodology appropriately for the type of data; e.g., for voice clustering, dominant sets and hierarchical clustering are the most promising candidates.)
If the new data is from a different domain, the selected model may fail, since the features it learned correspond to the domain of the training data.
I have previously worked with shallow (one- or two-layer) neural networks, so I understand how they work, and it is quite easy to visualize the derivations for the forward and backward passes during training. Currently I am studying deep neural networks (more precisely, CNNs). I have read lots of articles about how they are trained, but I am still unable to see the big picture of CNN training: in some cases people use pre-trained layers, where the convolution weights are extracted using autoencoders, while in other cases random weights are used for the convolutions and are then trained via backpropagation. Can anyone give me the full picture of the training process, from the input to the fully connected layer (forward pass) and from the fully connected layer back to the input layer (backward pass)?
Thank You
I'd like to recommend a very good explanation of how to train a multilayer neural network using backpropagation. The tutorial is the fifth post of a very detailed series on how backpropagation works, and it also has Python examples of different types of neural nets so you can fully understand what's going on.
As a summary of Peter Roelants' tutorial, I'll try to explain a little bit of what backpropagation is.
As you have already said, there are two ways to initialize a deep NN: with random weights or with pre-trained weights. In the case of random weights and a supervised learning scenario, backpropagation works as follows:
Initialize your network parameters randomly.
Feed forward a batch of labeled examples.
Compute the error (given by your loss function) between the desired output and the actual one.
Compute the partial derivative of the output error w.r.t. each parameter.
These derivatives are the gradients of the error w.r.t. the network's parameters. In other words, they tell you how to change the value of the weights in order to get the desired output instead of the produced one.
Update the weights according to those gradients and the chosen learning rate.
Perform another forward pass with different training examples, and repeat the previous steps until the error stops decreasing.
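A minimal, self-contained sketch of this loop for a tiny two-layer network (sigmoid activations, squared-error loss; the data and layer sizes are made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                      # toy inputs
y = np.array([[0.0], [1.0], [1.0], [0.0]])  # toy targets
W1 = rng.standard_normal((3, 5))            # random initialization (step 1)
W2 = rng.standard_normal((5, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(1000):
    h = sigmoid(X @ W1)                    # forward pass (step 2)
    out = sigmoid(h @ W2)
    d_out = (out - y) * out * (1 - out)    # error gradient at the output (steps 3-4)
    d_h = (d_out @ W2.T) * h * (1 - h)     # backpropagated to the hidden layer
    W2 -= 0.5 * (h.T @ d_out)              # gradient-descent update (step 5)
    W1 -= 0.5 * (X.T @ d_h)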
Starting with random weights is not a problem for the backpropagation algorithm; given enough training data and iterations, it will tune the weights until they work for the given task.
I really encourage you to follow the full tutorial I linked, because you'll get a very detailed view of how and why backpropagation works for multilayer neural networks.
Suppose I have 1 billion data points with which we have already trained our machine learning model and obtained our parameters/weights. Now I receive another 100 data points; how do I train on this new data? Moving beyond linear regression, how do we train on new examples of spam/not spam in spam filtering if we have already trained on, say, 2 billion mails?
It seems to me that you should use a different algorithm (i.e. an online algorithm).
I've never tried this in practice, but here's a paper from NIPS (a well-respected ML conference) that you may find useful: Online Linear Regression and Its Application to Model-Based Reinforcement Learning. (This same algorithm was suggested in an answer to a similar question on Cross Validated.)
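In practice, several libraries also support this kind of incremental update directly. A sketch with scikit-learn's SGDRegressor (all data here is illustrative), where partial_fit updates the existing weights from a new batch without revisiting the old points:

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
X_old, y_old = np.random.rand(1000, 5), np.random.rand(1000)
model.partial_fit(X_old, y_old)   # stand-in for the original training run

X_new, y_new = np.random.rand(100, 5), np.random.rand(100)
model.partial_fit(X_new, y_new)   # update with the 100 new points only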
What system should I use for anomaly detection?
I see that systems like Mahout do not list anomaly detection, only problems like classification, clustering, and recommendation.
Any recommendations as well as tutorials and code examples would be great, since I haven't done it before.
There is an anomaly detection implementation in scikit-learn, which is based on the one-class SVM. You can also check out the ELKI project, which has spatial outlier detection implemented.
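A minimal sketch of the scikit-learn route (all data here is illustrative): the one-class SVM is fit on mostly normal data and then labels new points as inliers (+1) or outliers (-1).

import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.randn(200, 2)   # mostly "normal" points (illustrative)
X_test = np.random.randn(20, 2)

clf = OneClassSVM(nu=0.05, kernel="rbf").fit(X_train)
labels = clf.predict(X_test)        # +1 = inlier, -1 = outlier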
In addition to "anomaly detection", you can also expand your search with "outlier detection", "fraud detection", and "intrusion detection" to get more results.
There are three categories of outlier detection approaches, namely, supervised, semi-supervised, and unsupervised.
Supervised: Requires fully labeled training and testing datasets. An ordinary classifier is trained first and applied afterward.
Semi-supervised: Uses training and test datasets, where the training data consists only of normal data without any outliers. A model of the normal class is learned, and outliers are detected afterward as deviations from that model.
Unsupervised: Does not require any labels; there is no distinction between a training and a test dataset. Data is scored solely based on intrinsic properties of the dataset.
If you have unlabeled data the following unsupervised anomaly detection approaches can be used to detect abnormal data:
Use an autoencoder that captures a feature representation of the data and flags as outliers the data points that are not well explained by the new representation. The outlier score for a data point is calculated from its reconstruction error (i.e., the squared distance between the original data and its projection). You can find implementations in H2O and TensorFlow.
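For instance, a minimal sketch with Keras (the data, layer sizes, and the 99th-percentile threshold are all illustrative choices):

import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")   # illustrative unlabeled data

ae = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(20,)),   # encoder
    tf.keras.layers.Dense(20, activation="sigmoid"),                  # decoder
])
ae.compile(optimizer="adam", loss="mse")
ae.fit(X, X, epochs=10, batch_size=32, verbose=0)

recon = ae.predict(X, verbose=0)
scores = np.mean((X - recon) ** 2, axis=1)       # per-record reconstruction error
outliers = scores > np.quantile(scores, 0.99)    # flag the top 1% as anomalies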
Use a clustering method, such as a self-organizing map (SOM) or k-prototypes, to cluster your unlabeled data into multiple groups. You can then detect external and internal outliers in the data: external outliers are records positioned in the smallest cluster, while internal outliers are records positioned far from the center of their own cluster. Implementations of SOM and k-prototypes are available online.
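To illustrate the two notions, here is a hedged sketch with plain k-means standing in for SOM/k-prototypes (the data and the 99th-percentile cutoff are made up):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 4)          # illustrative unlabeled data
km = KMeans(n_clusters=5).fit(X)

# Internal outliers: points far from their own cluster center.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
internal = dists > np.quantile(dists, 0.99)

# External outliers: members of the smallest cluster.
sizes = np.bincount(km.labels_)
external = km.labels_ == np.argmin(sizes)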
If you have labeled data, there are plenty of supervised classification approaches that you can try for detecting outliers. Examples are neural networks, decision trees, and SVMs.