Stream normalization for online clustering in evolving environments [closed] - machine-learning

TL;DR: how do you normalize stream data when the whole data set is not available and you are dealing with clustering in evolving environments?
Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data because all features should have the same impact on the final clustering, but I don't know how to do it.
I need to apply a standard normalization. My initial approach was to:
Fill a buffer with initial data points
Use those data points to get mean and standard deviation
Use those measures to normalize the current data points
Send the normalized points to the algorithm one by one
Use the previous measures to keep normalizing incoming data points for a while
Every so often, recompute the mean and standard deviation
Re-express the current micro-cluster centroids with the new measures (since I still have the old ones, it shouldn't be a problem to go back and re-normalize them)
Use the new measures to keep normalizing incoming data points for a while
And so on ....
The thing is, normalizing the data should not interfere with what the clustering algorithm does. You can't tell the clustering algorithm "OK, the micro-clusters you have so far need to be re-normalized with this new mean and stdev". I developed an algorithm of my own where I could do this, but I am also using existing algorithms (CluStream and DenStream), and it doesn't feel right to modify them just to make this possible.
Any ideas?
TIA

As more data streams in, the estimated standardization parameters (e.g., mean and std) are updated and converge further to the true values [1, 2, 3]. In evolving environments this is even more pronounced, because the data distributions are time-varying too [4]. Therefore, the more recent samples, which have been standardized using the more recent parameter estimates, are the more accurately standardized and representative ones.
A solution is to merge the present with a partial reflection of the past by embedding a new decay parameter in the update rule of your clustering algorithm. It boosts the contribution of the more recent samples, which have been standardized using the more recent distribution estimates. You can see an implementation of this idea in Apache Spark's MLlib [5, 6, 7]:
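The streaming k-means update rule implemented there has roughly the following form (reproduced from memory of the Spark documentation, so verify it against the linked references):

c_{t+1} = (c_t * n_t * α + x_t * m_t) / (n_t * α + m_t)
n_{t+1} = n_t + m_t

with c_t the current centroid, n_t the number of points assigned to it so far, x_t the centroid of the points in the newest batch, and m_t the number of points in that batch,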
and where α is the new decay parameter; a lower α makes the algorithm favor the more recent samples more.

Data normalization affects clustering for algorithms that depend on the L2 distance. Therefore you can't really have a global solution to your question.
If your clustering algorithm supports it, one option would be to use clustering with a warm-start in the following way:
At each step, find the "evolved" clusters from scratch, using the samples re-normalized according to the new mean and std dev.
Do not initialize the clusters randomly; instead, use the clusters found in the previous step, represented in the new space (see the sketch below).
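A minimal sketch of that warm start with scikit-learn's KMeans (the function and variable names are made up for illustration; your streaming setup will differ): the previous centroids are mapped back to raw units and then into the newly scaled space before being passed as the initialization.

import numpy as np
from sklearn.cluster import KMeans

def recluster(X_window, prev_centers=None, prev_mean=None, prev_std=None, k=3):
    # Re-estimate the standardization parameters on the current window.
    mean = X_window.mean(axis=0)
    std = X_window.std(axis=0) + 1e-12          # guard against zero variance
    X_scaled = (X_window - mean) / std

    if prev_centers is None:
        init = "k-means++"                      # cold start on the first window
    else:
        # Warm start: old centroids -> raw units -> new scaled space.
        init = (prev_centers * prev_std + prev_mean - mean) / std

    km = KMeans(n_clusters=k, init=init, n_init=1).fit(X_scaled)
    return km.cluster_centers_, mean, std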

Related

Scaling data for a dataset with numerical (both continuous and discrete) and categorical variables [closed]

I am practicing regression and classification techniques on different datasets. Now I came to this dataset and I am going to practice regression algorithms.
I want to try the algorithms on raw data, normalized data, and standardized data (just as practice).
Now, I know that for categorical variables it is enough to define them as categories, and no scaling is needed.
I learnt that the dependent variable should not be scaled.
I learnt that if continuous numeric variables have different units or a large difference in values, I can try scaling them.
But what about discrete numerical variables? Should I scale them? I read in some content that if there is a small range of values, I'd better define that discrete variable as a categorical variable and not scale it. Is this a common approach in machine learning?
I would appreciate any help understanding how to treat discrete variables.
Suppose you have a feature that is the number of students in a university, and you want to use that feature to predict the cost per year to study at that university (a simple linear regression task). In this case, it makes sense to take the feature as a continuous variable, not a categorical one.
Again, suppose you have a feature which is the pH of an acid, and you want to use it to predict the color produced when the acid is added to universal indicator solution (a solution that shows a different color for different pH values); now it's a classification problem: classify between colors. In this case, it does not make sense to take the feature as a continuous variable, but rather as a categorical one, since different pH values map to different colors with no gradual change.
So, when should you define a discrete feature as a categorical variable and when to define it as a continuous variable? It actually depends on what the feature represents.
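A small sketch of the two treatments with scikit-learn (the column names and values are made up for illustration): scale the count-like feature as continuous, and one-hot encode the small-range discrete feature as categorical.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"n_students": [12000, 35000, 8000],   # discrete, treated as continuous
                   "ph": [3, 7, 10]})                     # discrete, treated as categorical

pre = ColumnTransformer([
    ("continuous", StandardScaler(), ["n_students"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["ph"]),
])
X = pre.fit_transform(df)   # ready to feed into a regression or classification model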

What are the performance metrics for Clustering Algorithms? [closed]

I'm working on k-means clustering, but unlike in supervised learning I cannot figure out the performance metrics for clustering algorithms. How do I measure accuracy after training on the data?
For k-means you can look at its inertia_, which gives you an idea of how well the algorithm has worked.
from sklearn.cluster import KMeans

kmeans = KMeans(...)  # choose your parameters here
kmeans.fit(X)         # assuming you already have data X to fit on
kmeans.inertia_       # lower is better
Alternatively, you can call the score() function, which gives you the same value but with a negative sign. The convention is that a bigger score is better, while for k-means a lower inertia_ is better, so an extra negation is applied to keep them consistent.
# Call score with data X
kmeans.score(X) # greater is better
This is the most basic way of analyzing the performance of k-means. In reality, if you make the number of clusters too high, score() will increase accordingly (in other words, inertia_ will decrease), because inertia_ is nothing but the sum of squared distances from each point to the centroid of the cluster it is assigned to. If you increase the number of clusters too much, that sum keeps shrinking as each point gets a centroid very near to it, even though the quality of the clustering may be terrible. So for a better analysis you should compute the silhouette score, or even better, use a silhouette diagram.
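A short sketch of the silhouette score mentioned above (scikit-learn); unlike inertia_, it does not automatically improve just because you add more clusters.

from sklearn.metrics import silhouette_score

labels = kmeans.predict(X)          # or kmeans.labels_ right after fitting on X
silhouette_score(X, labels)         # in [-1, 1]; higher means better-separated clusters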
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

How to cluster sequences? [closed]

How would you cluster sequential information? I have about 500 sequences and some share the same characteristics. Is there anything like k-means for categorical sequential (temporal) data, or what would your approach look like?
These sequences are made up of one-hot-encoded vectors representing classes. Consider, for example, the nurse-rostering problem with four classes: early shift, day shift, night shift, home. The vectors look like this: [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]; this nurse works two days on the day shift and is home on the third day. But this "schedule" could depend on the parameters of the hospital, so I would like to cluster similar data. I have about 500 "schedules". Any ideas?
I will mention 3 "levels" at which you could solve this problem, assuming that you will be able to frame your problem statement accordingly. Please consider this answer as something you can use to get direction on how to solve this problem since the question you ask is not that specific and covers a very wide scope (usually against SO guidelines).
Traditional approaches involved using a dimensionality reduction (DR) method such as PCA, followed by clustering such as k-means, Gaussian mixtures, density-based methods, etc.
The issue with these approaches was that they assumed the observed data was generated from a lower-dimensional latent space via simple linear transformations. For example, when using PCA on data, you assume that the data you see comes from linear combinations of the 2 principal components. This works for a lot of datasets, but more complex data is usually a result of non-linear transformations of lower-dimensional latent spaces.
More modern approaches handled this to some extent by using DNNs as pre-processing followed by clustering methods. DNNs helped with the non-linearity as well as allowing better low-dimensional representations for data types such as sequences and images. This is usually what the majority of the baseline benchmark models are built on:
Train an auto-encoder to regenerate the sequence
Take the bottleneck embedding/latent vector and use a clustering algorithm to cluster in this latent space.
While these approaches work well, they have a flaw as well. Since no clustering-driven objective is explicitly incorporated in the learning process, the learned DNNs do not necessarily output low-dimensional data that is suitable for clustering.
The latest research involves training DNNs along with a clustering loss so that it ensures that the latent space is clustering friendly. These algorithms give superior results to any of the above approaches. One of the SOTA approaches in this category is DCN (Deep clustering networks). DCNs combine the reconstruction loss of an autoencoder with a clustering loss. It defines a centroid-based target probability distribution (very similar to Kmeans but with student-t distribution) and minimizes its KL divergence against the model clustering result.
Find more information here and here.
Specific to your case: you have sequences of vectors with 4 features. You can build an LSTM-based autoencoder to create initial embeddings and then use a clustering method to cluster the latent vectors. Or, if you are interested in DCNs, you can build a similar setup with an autoencoder and then use the clustering loss along with the reconstruction loss to further train the encoder to generate clustering-friendly embeddings.
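A rough sketch of the first option (a Keras LSTM autoencoder plus k-means on the latent vectors; the sequence length, latent size, number of clusters, and the random placeholder data are all made-up values you would replace):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.cluster import KMeans

T, n_classes, latent_dim = 30, 4, 8                               # assumed schedule length and embedding size
X = np.eye(n_classes)[np.random.randint(0, n_classes, (500, T))]  # placeholder one-hot schedules

# Encoder: compress each schedule into a fixed-length latent vector.
inputs = keras.Input(shape=(T, n_classes))
encoded = layers.LSTM(latent_dim)(inputs)

# Decoder: reconstruct the sequence from the latent vector.
decoded = layers.RepeatVector(T)(encoded)
decoded = layers.LSTM(latent_dim, return_sequences=True)(decoded)
decoded = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)

# Cluster the schedules in the learned latent space.
embeddings = encoder.predict(X)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)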

How can neural networks learn functions with a variable number of inputs? [closed]

A simple example: Given an input sequence, I want the neural network to output the median of the sequence. The problem is, if a neural network learnt to compute the median of n inputs, how can it compute the median of even more inputs? I know that recurrent neural networks can learn functions like max and parity over a sequence, but computing these functions only requires constant memory. What if the memory requirement grows with the input size like computing the median?
This is a follow up question on How are neural networks used when the number of inputs could be variable?.
One idea I had is the following: treat each weight as a function of the number of inputs instead of a fixed value. So a weight may have many parameters that define a function, and we train these parameters. For example, if we want the neural network to compute the average of n inputs, we would like each weight function to behave like 1/n. Again, the average per se can be computed using recurrent neural networks or a hidden Markov model, but I was hoping this kind of approach can be generalized to solve certain problems where the memory requirement grows.
If a neural network learnt to compute the median of n inputs, how can it compute the median of even more inputs?
First of all, you should understand the use of a neural network. We generally use neural networks for problems where a mathematical solution is not feasible. For this problem, the use of an NN is not really advisable.
There are other problems of such nature, like forecasting, in which continuous data arrives over time.
One solution to such problems can be a Hidden Markov Model (HMM). But again, such models depend on the correlation between inputs over a period of time, so this approach is not effective for problems where the input is completely random.
So, if the input is completely random and the memory requirement grows, there is not much you can do about it; one possible solution could be growing your memory size.
Just remember one thing: NNs and similar machine learning models aim to extract meaningful information from the data. If the data is just random values, then all models will generate random output.
One more idea: a data transformation. Choose N big enough that it is always bigger than n, and build a net with 2*N inputs. The first N inputs are for the data; if n is less than N, the remaining inputs are set to 0. The last N inputs specify which of the first N positions hold real data: 1 means data, 0 means padding. In Matlab notation: if v is an input vector of length 2*N, we put the original data into v(1:n), zeros into v(n+1:N), ones into v(N+1:N+n), and zeros into v(N+n+1:2*N). It is just an idea which I have not checked. If you are interested in the application of neural networks, take a look at the example of how we have chosen an appropriate machine learning algorithm to classify EEG signals for BCI.
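The same padding-plus-mask transformation as a quick Python sketch (N and the example values are arbitrary):

import numpy as np

def pad_with_mask(x, N):
    # Build a fixed-length 2*N input: zero-padded data followed by a 0/1 validity mask.
    n = len(x)
    assert n <= N, "choose N larger than any sequence length n"
    v = np.zeros(2 * N)
    v[:n] = x             # first N slots: the data, zero-padded
    v[N:N + n] = 1.0      # last N slots: 1 marks a real value, 0 marks padding
    return v

pad_with_mask([3.0, 1.0, 2.0], N=5)   # -> [3. 1. 2. 0. 0. 1. 1. 1. 0. 0.]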

What is the relation between the number of Support Vectors and training data and classifiers performance? [closed]

I am using LibSVM to classify some documents. The documents seem to be a bit difficult to classify, as the final results show. However, I have noticed something while training my models: if my training set has, for example, 1000 samples, around 800 of them are selected as support vectors.
I have looked everywhere to find out whether this is a good thing or a bad one. I mean, is there a relation between the number of support vectors and the classifier's performance?
I have read this previous post but I am performing a parameter selection and also I am sure that the attributes in the feature vectors are all ordered.
I just need to know the relation.
Thanks.
p.s: I use a linear kernel.
Support Vector Machines are an optimization problem. They are attempting to find a hyperplane that divides the two classes with the largest margin. The support vectors are the points which fall within this margin. It's easiest to understand if you build it up from simple to more complex.
Hard Margin Linear SVM
In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points which lie along the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane at the edges of the margin).
All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or size of data set, the number of support vectors could be as little as 2.
Soft-Margin Linear SVM
But what if our dataset isn't linearly separable? We introduce the soft-margin SVM. We no longer require that our data points lie outside the margin; we allow some of them to stray over the line into the margin. We use the slack parameter C (nu in nu-SVM) to control this. This gives us a wider margin and greater error on the training dataset, but it improves generalization and/or allows us to find a linear separation of data that is not linearly separable.
Now, the number of support vectors depends on how much slack we allow and the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to get a high level of accuracy; we must simply find the best fit we can.
Non-Linear SVM
This brings us to the non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher-dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is non-linear.
Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn in the final model in our input space requires one or more support vectors to define. Ultimately, the output of an SVM is the support vectors and an alpha for each, which in essence defines how much influence that specific support vector has on the final decision.
Here, accuracy depends on the trade-off between a high-complexity model which may over-fit the data and a large-margin which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This tradeoff is controlled via C and through the choice of kernel and kernel parameters.
I assume when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to test a data point using an SVM model, you need to compute the dot product of each support vector with the test point. Therefore the computational complexity of the model is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
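A hedged sketch (scikit-learn's SVC with a linear kernel on synthetic data) that makes this concrete: the decision value is a sum over the support vectors, so the cost of classifying a test point grows with their number.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

x_test = X[:1]
# Manual decision value: sum_i (alpha_i * y_i) * <sv_i, x> + b
manual = (clf.dual_coef_ @ clf.support_vectors_ @ x_test.T + clf.intercept_).ravel()
np.allclose(manual, clf.decision_function(x_test))   # True
clf.n_support_.sum()                                 # number of terms in that sum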
A good resource:
A Tutorial on Support Vector Machines for Pattern Recognition
800 out of 1000 basically tells you that the SVM needs to use almost every single training sample to encode the training set, which tells you that there isn't much regularity in your data.
Sounds like you have major issues with not enough training data. Also, maybe think about some specific features that separate this data better.
Both the number of samples and the number of attributes may influence the number of support vectors, making the model more complex. I believe you use words or even n-grams as attributes, so there are quite a lot of them, and natural language models are very complex themselves. So 800 support vectors out of 1000 samples seems to be OK. (Also pay attention to @karenu's comments about the C/nu parameters, which also have a large effect on the number of SVs.)
To get intuition about this, recall the main idea of SVM. SVM works in a multidimensional feature space and tries to find a hyperplane that separates all given samples. If you have a lot of samples and only 2 features (2 dimensions), you may end up with only 3 support vectors; all the others lie behind them and thus don't play any role. Note that these support vectors are defined by only 2 coordinates.
Now imagine that you have a 3-dimensional space, so the support vectors are defined by 3 coordinates.
This means that there's one more parameter (coordinate) to be adjusted, and this adjustment may need more samples to find the optimal hyperplane. In other words, in the worst case the SVM finds only 1 hyperplane coordinate per sample.
When the data is well-structured (i.e. holds patterns quite well), only a few support vectors may be needed; all the others will stay behind them. But text is very, very badly structured data. The SVM does its best, trying to fit the samples as well as possible, and thus takes even more samples as support vectors than it drops. With an increasing number of samples this "anomaly" is reduced (more insignificant samples appear), but the absolute number of support vectors stays very high.
SVM classification is linear in the number of support vectors (SVs). The number of SVs is in the worst case equal to the number of training samples, so 800/1000 is not yet the worst case, but it's still pretty bad.
Then again, 1000 training documents is a small training set. You should check what happens when you scale up to 10000s or more documents. If things don't improve, consider using linear SVMs, trained with LibLinear, for document classification; those scale up much better (model size and classification time are linear in the number of features and independent of the number of training samples).
There is some confusion between sources. In the textbook ISLR 6th Ed, for instance, C is described as a "boundary violation budget", from which it follows that a higher C allows more boundary violations and more support vectors.
But in the SVM implementations in R and Python, the parameter C is implemented as a "violation penalty", which is the opposite, so you will observe that for higher values of C there are fewer support vectors.
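A quick way to check that behavior with scikit-learn (synthetic data; the exact counts will vary): a larger C penalizes violations more heavily and typically leaves fewer support vectors.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")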
