Best way to transition embeddings from 300 to 2 dimensions - embedding

I have a question to which I can't find the answer. I want to build 300-dimensional sentence embeddings by averaging the embeddings of their components and then present them in 2D. I'm looking for a way to reduce the overlap of the clusters, because I want to see how these sentences are grouped.
History:
I started by reducing the 300 dimensions to 2 using t-SNE with the default parameters from scikit-learn. Then I read that it is better to first narrow 300 down to 50 with PCA and then go from 50 to 2 with t-SNE using PCA initialization, which really improves the grouping. Do you know of any other reduction methods, other parameters to tune in these tools, or maybe other libraries or even other reduction tools?
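For reference, here is a minimal sketch of the pipeline described above, assuming a matrix of averaged 300-dimensional sentence vectors (the data below is only a random placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

sentence_vecs = np.random.rand(1000, 300)   # placeholder: one averaged 300-d vector per sentence

vecs_50 = PCA(n_components=50).fit_transform(sentence_vecs)                        # 300 -> 50
vecs_2d = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(vecs_50)   # 50 -> 2
```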
Question
Do you know of any better method for obtaining sentence embeddings?
Best regards
P.S. I performed the above steps in Python.

Related

How to cluster sequences? [closed]

How would you cluster sequential information? I have about 500 sequences and some have the same characteristics. Is there anything like k-means for categorical sequential (temporal) data, or what would your approach look like?
These sequences are sequences of one-hot-encoded vectors representing classes. Consider, for example, the nurse-rostering problem with four classes: early-shift, day-shift, night-shift, home. The vectors look like this: [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], i.e. this nurse works two days on the day-shift and is home the third day. But this "schedule" could depend on the parameters of the hospital, so I would like to cluster similar data. I have about 500 "schedules". Any ideas?
I will mention 3 "levels" at which you could solve this problem, assuming that you will be able to frame your problem statement accordingly. Please consider this answer as something you can use to get direction on how to solve this problem since the question you ask is not that specific and covers a very wide scope (usually against SO guidelines).
Traditional approaches use a dimensionality reduction (DR) step such as PCA, followed by clustering with k-means, Gaussian mixtures, density-based methods, etc.
The issue with these approaches was that they assumed that the observed data was generated from a lower-dimensional latent space via simple linear transformations. E.g. When using PCA on data, you assume that the data that you see comes from linear combinations of the 2 principal components. This works for a lot of datasets but more complex data is usually a result of non-linear transformations of lower-dimensional latent spaces.
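A minimal sketch of that traditional pipeline (scikit-learn, with a random placeholder feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 64)                    # placeholder feature matrix

X_low = PCA(n_components=2).fit_transform(X)   # linear projection to a low-dimensional space
km_labels = KMeans(n_clusters=4, n_init=10).fit_predict(X_low)
gmm_labels = GaussianMixture(n_components=4).fit_predict(X_low)
```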
More modern approaches handle this to some extent by using DNNs as pre-processing followed by clustering methods. DNNs help with the non-linearity as well as allowing for better low-dimensional representations for data types such as sequences and images. This is usually how the majority of the baseline benchmark models are built:
Train an auto-encoder to regenerate the sequence
Take the bottleneck embedding/latent vector and use a clustering algorithm to cluster in this latent space.
While these approaches work well, there is a flaw in these as well. Since no clustering-driven objective is explicitly incorporated in the learning process, the learned DNNs do not necessarily output low dimensional data that are suitable for clustering.
The latest research involves training DNNs along with a clustering loss so that it ensures that the latent space is clustering friendly. These algorithms give superior results to any of the above approaches. One of the SOTA approaches in this category is DCN (Deep clustering networks). DCNs combine the reconstruction loss of an autoencoder with a clustering loss. It defines a centroid-based target probability distribution (very similar to Kmeans but with student-t distribution) and minimizes its KL divergence against the model clustering result.
Find more information here and here.
Specific to your case: You have a sequence vector with 4 features. You can build an LSTM based autoencoder to create initial embeddings and then use a clustering method to cluster the latent vector. Or if you are interested in DCNs, you can build a similar setup with an autoencoder and then use the clustering loss along with reconstruction loss to further train the encoder to generate clustering-friendly embeddings.
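A hedged sketch of that setup (Keras, with assumed sequence length, latent size, and cluster count; your 500 one-hot schedules would replace the random placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans
from tensorflow.keras import layers, models

n_seq, seq_len, n_classes, latent_dim = 500, 30, 4, 16   # assumed shapes
X = np.random.randint(0, 2, size=(n_seq, seq_len, n_classes)).astype("float32")  # placeholder sequences

inputs = layers.Input(shape=(seq_len, n_classes))
z = layers.LSTM(latent_dim)(inputs)                         # bottleneck embedding
x = layers.RepeatVector(seq_len)(z)
x = layers.LSTM(latent_dim, return_sequences=True)(x)
outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, z)
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)  # train to regenerate the sequence

latent = encoder.predict(X)                                   # 500 x latent_dim embeddings
labels = KMeans(n_clusters=5, n_init=10).fit_predict(latent)  # assumed number of clusters
```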

Feature engineering gaussian distributed input

I am designing a NN classifier where most of the input features are estimations of gaussian distributions. I.e. one feature has a mu and a sigma value.
The classifier has about 30 input features, 60 if you consider each mu and sigma their own feature.
The number of outputs are 15, i.e. there are 15 possible classifications.
I have about 50k examples to use for training/verification.
I can think of a few different scenarios of how to transform these features into something useful but I am not clever enough to come to any conclusions on how they would impact my results.
First scenario is to just scale and blindly pass each mu and sigma individually. I don't really see how sigma would help the classifier in this case, since it's just a measure of uncertainty. Optimally this would lead to slightly "fuzzier" classifications which possibly could be used for estimating some certainty metric of a classification result.
The second scenario is to generate more examples by drawing a value from the Gaussian of each of the 30 input features and then normalizing these random values. This would give me more training data, which could be useful.
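A hedged sketch of that second scenario (NumPy, with placeholder mu/sigma arrays standing in for the 30 estimated features):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=(50_000, 30))             # placeholder feature means
sigma = np.abs(rng.normal(size=(50_000, 30)))  # placeholder feature sigmas

def augment(mu, sigma, n_draws=3):
    """Draw n_draws noisy copies of every example from N(mu, sigma) and stack them."""
    X = np.vstack([rng.normal(mu, sigma) for _ in range(n_draws)])
    # normalize each drawn example, as described above
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

X_aug = augment(mu, sigma)                     # 150k synthetic training examples
```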
As a side note, I have the possibility to get more data (about 50k more examples), but I am not sure how accurate that data is, so I would like to try with this smaller set first to see if it converges.
The question is: Is there any consensus or interesting paper in the community, describing how to deal with estimated uncertainty in input features?
Thanks!
P.S. Sorry for my bad wording, ML is not my professional domain nor is English my native language.

Topic model as a dimension reduction method for text mining -- what to do next?

My understanding of the workflow is: run LDA -> extract keywords (e.g. the top few words for each topic), and hence reduce the dimension -> some subsequent analysis.
My question is: if my overall purpose is to assign topics to articles in an unsupervised way, or to cluster similar documents together, then running LDA takes you directly to the goal. Why reduce the dimension and then pass it to subsequent analysis? If you do, what sort of subsequent analysis can you do after LDA?
Also, a bit unrelated question -- is it better to ask this question here or at cross validated?
I think Cross Validated is a better place for these kinds of questions. Anyhow, there are simple explanations of why we need dimension reduction:
Without dimension reduction, vector operations become impractical. Imagine a dot product between two vectors whose dimension is the size of your dictionary!
After reducing the dimension, each number carries a denser amount of information, which usually leads to less noise; intuitively, you keep only the useful information.
You should rethink your approach, since you are mixing probabilistic methods (LDA) with linear algebra (dimensionality reduction). When you feel more comfortable with linear algebra, consider Non-negative Matrix Factorization.
Also note that your topics already constitute the reduced dimensions; there is no need to jump back to the extracted top words of the topics.
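To make that concrete, here is a minimal sketch (scikit-learn, placeholder corpus) where the per-document topic distribution itself is the reduced representation used for clustering:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["first article text ...", "second article text ...", "third article text ..."]  # placeholder corpus
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(counts)          # n_docs x 5 topic proportions = the reduced space

clusters = KMeans(n_clusters=2, n_init=10).fit_predict(doc_topics)
```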

Best classifiers for a large number of attributes

I have a dataset built from 940 attributes and 450 instances, and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggests (such as J48, costSensitive, combinations of several classifiers, etc.).
The best solution I have found is a J48 tree with an accuracy of 91.7778%,
and the confusion matrix is:
  a   b   <-- classified as
394  27 |  a = NON_C
 10  19 |  b = C
I want to get better results in the confusion matrix: at least 90% accuracy for both the TN and TP rates.
Is there something I can do to improve this (such as long-running classifiers that scan all options), or another idea I didn't think of?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is good to think about the problem:
- Find and work only with relevant features (attributes); otherwise the task can be noisy. Relevant features = features that have a high correlation with the class (NON_C, C).
- Your dataset is imbalanced, i.e. the number of NON_C examples is much higher than C. Sometimes it can be helpful to train your algorithm on equal portions of positive and negative (in your case NON_C and C) examples, and cross-validate it on the natural (real) proportions.
- The size of your training data is small in comparison with the number of features. Maybe increasing the number of instances would help.
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severely imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm.
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower-dimensional PCA space, say one containing 90% of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no-no.
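A hedged sketch combining those points (scikit-learn, random placeholder data standing in for the 450x940 set): a proper train/test split, scaling, PCA keeping ~90% of the variance, and an L2-regularized classifier with class weights for the imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = np.random.rand(450, 940)                                    # placeholder for the 940 attributes
y = np.random.choice(["NON_C", "C"], size=450, p=[0.93, 0.07])  # placeholder imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),       # keep the components explaining ~90% of the variance
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))      # evaluate on held-out data only
```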

How to approach machine learning problems with high dimensional input space?

How should I approach a situation where I try to apply some ML algorithm (classification, to be more specific, an SVM in particular) to some high-dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea of how to approach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m (if w^T y + b separates them in R^n, then substituting y = Ax shows that (A^T w)^T x + b separates them in R^m). Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
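A minimal sketch of those tips using scikit-learn's SVM wrapper around libsvm (placeholder data; the C/gamma grid is just an assumed starting point):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 500)                 # placeholder high-dimensional features
y = np.random.randint(0, 2, size=200)

pipe = make_pipeline(StandardScaler(),       # scale features before classification
                     SVC(kernel="rbf", class_weight="balanced"))
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10, 100],
                                "svc__gamma": [1e-3, 1e-2, 1e-1, 1]},
                    cv=5)                    # cross-validate the parameter combinations
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```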
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem with PCA or a similar technique. Beware that PCA has two important caveats: (1) it assumes that the data it is applied to is normally distributed, and (2) the resulting data loses its natural meaning (resulting in a black box). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVMs were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM), in which they used a linear SVM to pre-select "interesting features" and then used an RBF-based SVM on the selected features. If you are familiar with Orange, a Python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness", might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to turn to heuristic methods that try to select the best possible combinations of predictors.
The main pitfall of this kind of approach is the high potential for overfitting. Make sure you have a bunch of "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model; you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
Oh, and one more thing about SVMs. An SVM is a black box. You'd better figure out the mechanism that generates the data and model that mechanism, not the data. On the other hand, if that were possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this stochastic method to determine, in silico, the drug-like character of molecules.
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model (an SVM in this case) is not well suited for this problem. This can be diagnosed by trying other models (AdaBoost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods for attacking the problem. To gain more insight into the problem, you may use visualization methods by projecting the data into lower dimensions, or look for models better suited to the problem domain as you understand it (for example, if you know the data is normally distributed you can use GMMs to model it ...)
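As a rough sketch of that diagnosis (scikit-learn, placeholder data), comparing training and validation accuracy is often the quickest way to tell the two failure modes apart:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

X = np.random.rand(300, 100)                 # placeholder high-dimensional data
y = np.random.randint(0, 2, size=300)

scores = cross_validate(SVC(kernel="rbf", C=1.0), X, y, cv=5, return_train_score=True)
print("train:", scores["train_score"].mean(), "validation:", scores["test_score"].mean())
# large gap        -> overfitting: constrain the model or add data
# both scores low  -> the model or the representation is not suited to the task
```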
If I'm not wrong, you are trying to see which parameters to the SVM give you the best result. Your problem is model/curve fitting.
I worked on a similar problem a couple of years ago. There are tons of libraries and algorithms to do the same. I used the Newton-Raphson algorithm and a variation of a genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for through a real-world experiment (or, if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algorithms I mentioned earlier reiterate this process until the result of your model (the SVM in this case) somewhat matches the expected values (note that this process would take some time depending on your problem/data size; it took about 2 months for me on a 140-node Beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources