How to cluster sequences? [closed] - machine-learning

How would you cluster sequential information? I have about 500 sequences, and some share the same characteristics. Is there anything like k-means for categorical sequential (temporal) data, or what would your approach look like?
These sequences consist of one-hot-encoded vectors representing classes. Consider, for example, the nurse-rostering problem with four classes: early-shift, day-shift, night-shift, home. A sequence like [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1] means this nurse works the day-shift for two days and is home on the third day. Such a "schedule" could depend on the parameters of the hospital, so I would like to cluster similar data. I have about 500 "schedules". Any ideas?

I will mention three "levels" at which you could solve this problem, assuming you can frame your problem statement accordingly. Please treat this answer as a way to get direction on how to solve the problem, since the question you ask is not very specific and covers a very wide scope (usually against SO guidelines).
Traditional approaches apply a dimensionality-reduction (DR) technique such as PCA, followed by a clustering method such as k-means, Gaussian mixtures, density-based methods, etc.
The issue with these approaches is that they assume the observed data was generated from a lower-dimensional latent space via simple linear transformations. E.g. when using PCA, you assume that the data you see comes from linear combinations of a few principal components. This works for a lot of datasets, but more complex data is usually the result of non-linear transformations of lower-dimensional latent spaces.
More modern approaches handle this to some extent by using DNNs as a pre-processing step, followed by clustering methods. DNNs handle the non-linearity and also allow better low-dimensional representations for data types such as sequences and images. This is what the majority of baseline benchmark models are built on:
Train an auto-encoder to regenerate the sequence
Take the bottleneck embedding/latent vector and use a clustering algorithm to cluster in this latent space.
While these approaches work well, they have a flaw too: since no clustering-driven objective is explicitly incorporated in the learning process, the learned DNNs do not necessarily output low-dimensional representations that are suitable for clustering.
The latest research trains DNNs together with a clustering loss, which ensures that the latent space is clustering-friendly. These algorithms give superior results to any of the above approaches. One of the SOTA approaches in this category is DCN (Deep Clustering Network), which combines the reconstruction loss of an autoencoder with a clustering loss: it defines a centroid-based target probability distribution (very similar to k-means but with a Student's t-distribution) and minimizes its KL divergence against the model's clustering result.
Specific to your case: you have sequences of vectors with 4 features. You can build an LSTM-based autoencoder to create initial embeddings and then use a clustering method on the latent vectors. Or, if you are interested in DCNs, you can build a similar autoencoder setup and then use the clustering loss along with the reconstruction loss to further train the encoder to generate clustering-friendly embeddings.
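For the first option, here is a minimal sketch, assuming Keras/TensorFlow and scikit-learn, sequences padded to a common length, and illustrative layer sizes and cluster count (not tuned values):

```python
# Hedged sketch: LSTM autoencoder on one-hot shift sequences, then k-means on the bottleneck.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.cluster import KMeans

timesteps, n_features, latent_dim = 30, 4, 8   # illustrative sizes

# Encoder: compress a one-hot schedule into a fixed-size latent vector.
inputs = keras.Input(shape=(timesteps, n_features))
encoded = layers.LSTM(latent_dim)(inputs)

# Decoder: repeat the latent vector and reconstruct the original sequence.
decoded = layers.RepeatVector(timesteps)(encoded)
decoded = layers.LSTM(latent_dim, return_sequences=True)(decoded)
decoded = layers.TimeDistributed(layers.Dense(n_features, activation="softmax"))(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")

# Placeholder data standing in for the 500 one-hot "schedules".
X = np.eye(n_features)[np.random.randint(0, n_features, size=(500, timesteps))].astype("float32")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)

# Cluster the learned embeddings.
embeddings = encoder.predict(X)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)
```

The number of clusters is a free choice here; something like the silhouette score on the embeddings can help pick it.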


Stream normalization for online clustering in evolving environments [closed]

TL;DR: how to normalize stream data, given that the whole data set is not available and you are dealing with clustering for evolving environments
Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data because all features should have the same impact on the final clustering, but I don't know how to do it.
I need to apply a standard normalization. My initial approach was to:
Fill a buffer with initial data points
Use those data points to get mean and standard deviation
Use those measures to normalize the current data points
Send those points normalized to the algorithm one by one
Use the previous measures to keep normalizing incoming data points for a while
Every so often, recalculate the mean and standard deviation
Re-express the current micro-cluster centroids using the new measures (since I keep the older ones, it shouldn't be a problem to go back and renormalize)
Use the new measures to keep normalizing incoming data points for a while
And so on.
The thing is that normalizing the data should not interfere with what the clustering algorithm does. You are not able to tell the clustering algorithm "ok, the micro-clusters you have so far need to be normalized with this new mean and stdev". I developed my own algorithm where I could do this, but I am also using existing algorithms (CluStream and DenStream) and it does not feel right to modify them for this.
Any ideas?
TIA
As more data streams in, the estimated standardization parameters (e.g., mean and std) are updated and converge further toward the true values [1, 2, 3]. In evolving environments this effect is even more pronounced, as the data distributions are now time-varying too [4]. Therefore, the more recent streamed samples, standardized using the more recent estimated parameters, are more accurate and representative.
A solution is to merge the present with a partial reflection of the past by embedding a new decay parameter in the update rule of your clustering algorithm. It boosts the contribution of the more recent samples that have been standardized using the more recent distribution estimates. You can see an implementation of this idea in Apache Spark's MLlib [5, 6, 7]:
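Roughly, the forgetful streaming k-means update described in the Spark documentation looks like this (reconstructed here as a hedged approximation; c_t is a cluster centre, n_t the number of points assigned to it so far, x_t the centre of the current batch's points and m_t their count):

```
c_{t+1} = (c_t * n_t * α + x_t * m_t) / (n_t * α + m_t)
n_{t+1} = n_t + m_t
```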
where α is the decay parameter; a lower α makes the algorithm favor the more recent samples more.
Data normalization affects clustering for algorithms that depend on the L2 distance. Therefore you can't really have a global solution to your question.
If your clustering algorithm supports it, one option would be to use clustering with a warm-start in the following way:
at each step, find the "evolved" clusters from scratch, using the samples re-normalized according to the new mean and std dev
do not initialize clusters randomly, but instead, use the clusters found in the previous step as represented in the new space.
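A hedged scikit-learn sketch of that warm start in a windowed/batch setting (real stream algorithms such as CluStream or DenStream work differently; the data generator and cluster count below are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

def stream_of_windows(n_windows=10, n_points=200, n_features=3, seed=0):
    """Placeholder generator yielding raw (un-normalized) windows of drifting stream data."""
    rng = np.random.default_rng(seed)
    for _ in range(n_windows):
        centre = rng.uniform(-2, 2, n_features)
        yield rng.normal(loc=centre, scale=1.0, size=(n_points, n_features))

def renormalize(points, old_mean, old_std, new_mean, new_std):
    """Map points expressed in the old standardized space into the new one."""
    raw = points * old_std + old_mean          # undo the old normalization
    return (raw - new_mean) / new_std          # apply the new normalization

k = 3
prev_centers = old_mean = old_std = None

for window in stream_of_windows():
    new_mean, new_std = window.mean(axis=0), window.std(axis=0) + 1e-12
    X = (window - new_mean) / new_std          # samples re-normalized with the new stats

    if prev_centers is None:
        km = KMeans(n_clusters=k, n_init=10)
    else:
        # Warm start: previous centroids, re-expressed in the new normalized space.
        init = renormalize(prev_centers, old_mean, old_std, new_mean, new_std)
        km = KMeans(n_clusters=k, init=init, n_init=1)

    km.fit(X)
    prev_centers, old_mean, old_std = km.cluster_centers_, new_mean, new_std
```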

Sigmoid activation for multi-class classification?

I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks, and therefore I strongly suggest you start here (Michael Nielsen's book).
It is a Python-oriented book with graphical, textual and formula-based explanations, great for beginners. I am confident that you will find it useful for your understanding. Look at chapters 2 and 3 to address your problems.
Addressing your question about sigmoids: it is possible to use them for multiclass predictions, but it is not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)), where z is the dot product of the previous hidden layer's output (or the inputs) with a row of the weight matrix, plus a bias (reminder: z = w_i . x + b, where w_i is the i-th row of the weight matrix). This activation is independent of the other rows of the matrix.
Classification tasks concern categories. Without any prior knowledge (and, most of the time, even with it), categories have no order-value interpretation: predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding for categories usually performs better than predicting a category number with a single activation function.
To recap, we want an output layer whose number of neurons equals the number of categories, and sigmoids are independent of each other given the previous layer's values. We also want to predict the most probable category, which implies that we want the activations of the output layer to behave like a probability distribution. But sigmoids are not guaranteed to sum to 1, while the softmax activation is.
Using the L2 loss function is also problematic, due to the vanishing-gradient issue. In short, the derivative of the loss is (sigmoid(z) - y) * sigmoid'(z) (error times the activation's derivative), which makes this quantity small, even more so when the sigmoid is close to saturation. You can choose cross-entropy (log-loss) instead.
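To make the softmax route concrete, here is a hedged NumPy sketch (not tied to your own implementation) of softmax plus cross-entropy and the gradient that makes the backward pass simple: with one-hot targets, dL/dz = softmax(z) - y.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y_onehot, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

# Forward pass on a toy batch of 3 samples, 4 classes.
z = np.random.randn(3, 4)                       # logits from the last linear layer
y = np.eye(4)[[0, 2, 1]]                        # one-hot targets
probs = softmax(z)
loss = cross_entropy(probs, y)

# Backward pass: gradient of the mean loss w.r.t. the logits.
dz = (probs - y) / z.shape[0]
```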
EDIT:
Corrected the phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we use today as categorical predictions over definite, finite sets of values. As of today, using softmax, one-hot encoding and cross-entropy in deep models to predict these categories in a general "dog/cat/horse" classifier is very common practice. It is reasonable to use this when the assumptions above hold. However, there are (many) cases where it doesn't apply, for instance when trying to balance the data. For some tasks, e.g. semantic segmentation, categories can have a meaningful ordering/distance between them (or their embeddings). So please choose the tools for your application wisely, understanding what they do mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when there are only 2 classes, softmax reduces to the sigmoid, so yes, they are related. Cross-entropy may be the best loss function.
For the backpropagation, it is not easy to find the formula; there are many ways. With the help of CUDA-backed frameworks, I don't think it is necessary to spend much time on it if you just want to use NNs or CNNs in the future. Maybe trying a framework like TensorFlow or Keras (highly recommended for beginners) will help you.
There are also many other factors, like the gradient descent method and the setting of hyperparameters.
Like I said, the topic is very broad. Why not try the machine learning/deep learning courses on Coursera or the Stanford online course?

Techniques to improve the accuracy of SVM classifier

I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my most sincere efforts to improve upon the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the most optimal C and gamma using grid search.
2. Finding the most discriminative feature using F-score.
Can someone suggest techniques to improve the accuracy? I am aiming for at least 99%.
1. The data are already normalized to the range [0,10]. Will normalizing it to [0,1] help?
2. Some other method to find the best C and gamma?
For SVM, it's important to have the same scaling for all features. Normally this is done by scaling the values in each feature (column) so that the mean is 0 and the variance is 1. Another way is to scale them so that the min and max are, for example, 0 and 1. However, there isn't any meaningful difference between [0, 1] and [0, 10]; both will show the same performance.
If you insist on using SVM for classification, another way that may result in improvement is ensembling multiple SVMs. If you are using Python, you can try BaggingClassifier from sklearn.ensemble.
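A hedged sketch of that ensembling idea (load_breast_cancer is scikit-learn's copy of the UCI Wisconsin data, which may not be exactly your dataset, and the hyperparameters are illustrative, not the values your grid search would find):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Bag several RBF-kernel SVMs, each trained on a bootstrap sample of the data.
model = make_pipeline(
    StandardScaler(),
    BaggingClassifier(
        estimator=SVC(kernel="rbf", C=1.0, gamma="scale"),  # use base_estimator= on older scikit-learn
        n_estimators=10,
        random_state=0,
    ),
)

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```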
Also notice that you can't expect arbitrarily high performance on a real training set. I think 97% is very good performance; it is likely that you would overfit the data if you pushed higher than this.
Some thoughts that came to my mind when reading your question and the argument you put forward about the author claiming to have achieved an accuracy of 99.51%.
My first thought was OVERFITTING. I could be wrong, because it might depend on the dataset, but overfitting is the first thing to check. Now my questions:
1. Has the author stated in the article whether the dataset was split into training and testing sets?
2. Is this 99.51% accuracy achieved on the training set or the testing one?
On the training set you can hit 99.51% accuracy when your model is overfitting.
Generally, in that case, the performance of the SVM classifier on unseen data is poor.

Is employing BPNN for water quality management an overkill? [closed]

I'm developing a device for Freshwater Quality Management which can be used for freshwater bodies such as lakes and rivers. The project is spread in three parts:
The first part deals with acquiring parameters such as pH, turbidity etc.
The second part deals with taking corrective measures based on the parameters. For instance, if the pH is too low, the device will inject basic solution to maintain a pH of 7-7.5.
Now the third part deals with predicting the health of the lake based on the parameters acquired (pH/turbidity etc.). The predictive algorithm shall take the parameters into account and develop a correlation between them to estimate how long the lake will sustain. To achieve this, I'm currently biased toward using a Back Propagation Neural Network (BPNN), as I have found that multiple other people/institutes prefer NNs for water quality management.
Now my concern is whether using a BPNN would be overkill for this project. If so, which method/tool should I go for?
Doing something the way "it has always been done" is not always the best idea. In general, if you do not have strong, analytical reasons to choose a neural network, you should not start with one. Neural networks are tricky to train, have a huge number of hyperparameters, are non-deterministic and are computationally expensive. Always start with the simplest model, and only if it yields poor results move to more complex ones. From a theoretical perspective this is strongly justified by Vapnik's theorems, and from a practical one it is similar to the agile approach in programming.
So where to start?
Linear regression (Ridge regression, Lasso)
Polynomial regression
KNN regression
RBF Networks
Random Forest Regressor
If all of them fail, think about a "classical" neural network. But the chances of that are rather... small.
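A hedged scikit-learn sketch of working through a few of the simpler options above before touching a neural network (X and y are placeholders for your sensor history and the quantity to predict; hyperparameters are illustrative, and RBF networks are skipped since plain scikit-learn has no ready-made one):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.random.rand(500, 6)     # placeholder: 500 samples, 6 water-quality features
y = np.random.rand(500)        # placeholder: target, e.g. days the lake will sustain

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "poly2 + ridge": make_pipeline(StandardScaler(), PolynomialFeatures(2), Ridge(alpha=1.0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.3f}")
```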
A neural network is a function approximator. If what you have is a real-valued vector of inputs, and associated with each of those vectors, you have a target real number or classification such as "good", "bad", "red", etc. then a neural network can be used to solve your problem.
Neural networks are, in their simplest form, functions of the form n(x) := g(Wh(Ax + b)+ c), where A and W are matrices, and b and c are vectors, h is a component-wise nonlinear function, generally a sigmoid function, and g is a function taking the same values as your target space.
In your case, your input vector, denoted x above, would contain pH, turbidity, etc, and your targets would be how long the lake will sustain. If your network is "trained" properly, it will be able to, given an unseen input u (new measurements for pH and turbidity etc), compute a good approximation to how long the lake will sustain.
"Training" a neural network consists of choosing the parameters for A, W, b, c. How many of these parameters there are depends on how many columns you chose for A and W (and therefore also for b and c). One way to choose these parameters is such that the function n(x) is close to your actual, measured targets on all of the historical (training) examples you have. More specifically, A,W,b,c are chosen to minimize E(A,W,b,c) := (n(x) - t(x))^2 where t(x) is your historically measured target (how long the lake sustained when the pH and turbidity were as measured in x). One way to try to minimize E over A,W,b,c is to compute the gradient of E with respect to each of the parameters and then take a step toward the negative of the gradient via an algorithm called back-propagation.
I want to note that the computation of a neural network, when the parameters are fixed, is deterministic, but that there are some algorithms for computing the gradient of E which aren't deterministic. Some other algorithms are deterministic.
So, with all that as background, are neural networks overkill for your project? That depends on the function you're trying to approximate from your observations to the output you're trying to predict. Whether a neural network will give you good prediction accuracy depends on many factors, perhaps the most important of which is how many examples you have to train on. If you don't have very many training examples relative to the number of predictors, a neural network may not be what you're looking for, but for the most part, that's an empirical question more than a theoretical one.
The nice thing is that, if you're willing to use Python, there are good libraries to make all of this testing very easy for you. If you try a neural network and it doesn't give you very good predictions, there are many other methods of regression you could try: linear regression (which is a special case of a neural network), or a random forest, for example. All of these are easy to code up in Python if you use sklearn for your linear regression and your random forest. There are a few libraries for neural networks which make playing with them pretty easy as well; I recommend TensorFlow for neural networks.
My recommendation would be to spend a little bit of time trying several methods. For a relatively simple prediction problem like this, the time to train your network should be pretty short. The longer times of days or weeks you may have heard about are for massive datasets with millions or billions of training examples and millions of parameters.
Here http://pastebin.com/KrUAX9je is a toy neural network I created to "learn" to approximate a function f(a,b,c) = abc.
Backpropagation (BP) is a method for learning the parameters of an artificial neural network model using gradient descent; it computes the gradients in an efficient manner. There are also other methods to train such models, but BP is more commonly used for many reasons. I do not know anything about the scale of the project and the amount of data collected, but neural networks are more effective when the number of examples is large. If you have, say, 10 attributes (pH, turbidity, ...) and maybe more than 2-3k examples, then neural networks could be helpful.
However, you should not think neural networks are the BEST model ever. You need to try out different models and choose the one giving you the best performance.

Which machine learning classifier to choose, in general? [closed]

Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.)
How do I know which classifier I should use?
Decision tree
SVM
Bayesian
Neural network
K-nearest neighbors
Q-learning
Genetic algorithm
Markov decision processes
Convolutional neural networks
Linear regression or logistic regression
Boosting, bagging, ensembling
Random hill climbing or simulated annealing
...
In which cases is one of these the "natural" first choice, and what are the principles for choosing that one?
Examples of the type of answers I'm looking for (from Manning et al.'s Introduction to Information Retrieval book):
a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes).
I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.
b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability.
What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.
Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)?
First of all, you need to identify your problem. It depends upon what kind of data you have and what your desired task is.
If you are predicting a category:
    If you have labeled data, follow a classification approach and its algorithms.
    If you don't have labeled data, go for a clustering approach.
If you are predicting a quantity:
    Go for a regression approach.
Otherwise:
    You can go for a dimensionality reduction approach.
There are different algorithms within each approach mentioned above. The choice of a particular algorithm depends upon the size of the dataset.
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Model selection using cross validation may be what you need.
Cross validation
What you do is simply split your dataset into k non-overlapping subsets (folds), train a model using k-1 folds, and estimate its performance on the fold you left out. You do this for each possible choice of held-out fold (first leave the 1st fold out, then the 2nd, ..., then the kth, training on the remaining folds each time). After finishing, you estimate the mean performance across all folds (and maybe also the variance/standard deviation of the performance).
How you choose the parameter k depends on the time you have. Usual values for k are 3, 5, 10 or even N, where N is the size of your data (that's the same as leave-one-out cross validation). I prefer 5 or 10.
Model selection
Let's say you have 5 methods (ANN, SVM, KNN, etc) and 10 parameter combinations for each method (depending on the method). You simply have to run cross validation for each method and parameter combination (5 * 10 = 50) and select the best model, method and parameters. Then you re-train with the best method and parameters on all your data and you have your final model.
There are some more things to say. If, for example, you use a lot of methods and parameter combinations for each, it's very likely you will overfit. In cases like these, you have to use nested cross validation.
Nested cross validation
In nested cross validation, you perform cross validation on the model selection algorithm.
Again, you first split your data into k folds. At each step, you take k-1 folds as your training data and the remaining fold as your test data. Then you run model selection (the procedure I explained above) on the training folds, for each possible combination of those k folds. After finishing, you will have k models, one for each combination of folds. You then test each model on its held-out test data and choose the best one. Again, after having the last model, you train a new one with the same method and parameters on all the data you have. That's your final model.
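A hedged scikit-learn sketch of this nested procedure, with SVC standing in for one of the methods and a deliberately small grid (the dataset and grid are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # inner loop: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)           # outer loop: performance estimate
print(outer_scores.mean())

# Final model: re-run the selection on all data and keep the best estimator.
final_model = inner.fit(X, y).best_estimator_
```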
Of course, there are many variations of these methods and other things I didn't mention. If you need more information about these look for some publications about these topics.
The book "OpenCV" has a great two pages on this on pages 462-463. Searching the Amazon preview for the word "discriminative" (probably google books also) will let you see the pages in question. These two pages are the greatest gem I have found in this book.
In short:
Boosting - often effective when a large amount of training data is available.
Random trees - often very effective and can also perform regression.
K-nearest neighbors - simplest thing you can do, often effective but slow and requires lots of memory.
Neural networks - slow to train but very fast to run; still an optimal performer for letter recognition.
SVM - among the best with limited data, losing to boosting or random trees only when large data sets are available.
Things you might consider in choosing which algorithm to use would include:
Do you need to train incrementally (as opposed to batched)?
If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use a Bayesian classifier. Neural nets and SVMs need to work on the training data in one go.
Is your data composed of categorical only, or numeric only, or both?
I think Bayesian classifiers work best with categorical/binomial data. Decision trees can't predict numerical values.
Do you or your audience need to understand how the classifier works?
Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVM are "black boxes" in the sense that you can't really see how they are classifying data.
How much classification speed do you need?
SVMs are fast when it comes to classifying, since they only need to determine which side of the "line" your data is on. Decision trees can be slow, especially when they're complex (e.g. lots of branches).
Complexity.
Neural nets and SVMs can handle complex non-linear classification.
As Prof Andrew Ng often states: always begin by implementing a rough, dirty algorithm, and then iteratively refine it.
For classification, Naive Bayes is a good starter: it has good performance, is highly scalable and can adapt to almost any kind of classification task. Also, 1NN (k-nearest neighbours with only 1 neighbour) is a no-hassle, best-fit algorithm (because the data is the model, so you don't have to worry about fitting the dimensionality of your decision boundary); the only issue is the computation cost (quadratic, because you need to compute the distance matrix, so it may not be a good fit for high-dimensional data).
Another good starter is Random Forests (composed of decision trees): they scale to any number of dimensions and generally have quite acceptable performance. Finally, there are genetic algorithms, which scale admirably to any dimension and any data with minimal knowledge of the data itself; the most minimal and simplest implementation is the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and among the most complex are CMA-ES and MOGA/e-MOEA.
And remember that, often, you can't really know what will work best on your data before you try the algorithms for real.
As a side-note, if you want a theoretical framework to test your hypotheses and the theoretical performance of algorithms for a given problem, you can use the PAC (probably approximately correct) learning framework (beware: it's very abstract and complex!). To summarize, the gist of PAC learning says that you should use the least complex, but still complex enough, algorithm that can fit your data (complexity being the maximum dimensionality the algorithm can fit). In other words, apply Occam's razor.
Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.
My take on it is that you always run the basic classifiers first to get some sense of your data. More often than not (in my experience at least) they've been good enough.
So, if you have supervised data, train a Naive Bayes classifier. If you have unsupervised data, you can try k-means clustering.
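For the comment-spam case specifically, a hedged scikit-learn sketch of that Naive Bayes baseline (the comments and labels below are toy placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

comments = ["buy cheap meds now", "great article, thanks",
            "click here to win", "I agree with the author"]
labels = [1, 0, 1, 0]                       # placeholder: 1 = spam, 0 = not spam

# Bag-of-words features + multinomial Naive Bayes, a common first baseline for text.
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
print(cross_val_score(baseline, comments, labels, cv=2).mean())
```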
Another resource is one of the lecture videos of the series of videos Stanford Machine Learning, which I watched a while back. In video 4 or 5, I think, the lecturer discusses some generally accepted conventions when training classifiers, advantages/tradeoffs, etc.
You should always take into account the inference vs. prediction trade-off.
If you want to understand the complex relationships occurring in your data, then you should go with an interpretable, inference-friendly algorithm (e.g. linear regression or the lasso). On the other hand, if you are only interested in the predictions, you can go with high-dimensional and more complex (but less interpretable) algorithms, like neural networks.
Selection of an algorithm depends on the scenario and on the type and size of the data set; there are many other factors as well.
This is a brief cheat sheet for basic machine learning.
