How can I normalize data to have the same average sum of squares? - normalization

In many articles in my field, this sentence is repeated: "The 2 matrices has been normalized to have the same average sum-of-squares (computed across all subjects and all voxels for each modality)". Suppose we have two matrices in which the rows are different subjects and the columns are features (voxels). These articles give little explanation of the normalization method. Does anybody know how I should normalize data to have the "same average sum-of-squares"? I don't understand it at all. Thanks

For a start, normalization in this context is also known as feature scaling, which pretty much sums it up. You scale your features so that differences in variance and value range do not distort your algorithm and, in the end, your results.
https://en.wikipedia.org/wiki/Feature_scaling
In data processing, normalization is quite useful (depending on the application). E.g. in distance-based machine learning algorithms you should normalize your features so that each one contributes proportionally to the outcome of your algorithm, independent of the range of values it covers.
To do so, you can use different statistical measures, such as the sum of squares:
SUM_i (Xi - Xbar)²
Other than that, you could use the variance or the standard deviation of your data.
https://www.westgard.com/lesson35.htm#4
Those statistical terms can then be used to normalize your data, e.g. to improve the clustering quality of your algorithm. Which term and which method to use depends heavily on the algorithms and data you're using and on what you're aiming at.
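One plausible reading of "same average sum-of-squares" is that each matrix is divided by a single scalar so that the mean of its squared entries, taken over all subjects and all voxels, comes out equal (here 1) for both modalities. This reading is an assumption, not something the quoted articles confirm, and the function and variable names below are made up for illustration:

```python
import math

def average_sum_of_squares(matrix):
    """Mean of the squared entries over all rows (subjects) and columns (voxels)."""
    count = sum(len(row) for row in matrix)
    return sum(value * value for row in matrix for value in row) / count

def scale_to_unit_average_square(matrix):
    """Divide the whole matrix by one scalar so its average squared entry becomes 1."""
    scale = math.sqrt(average_sum_of_squares(matrix))
    if scale == 0:
        raise ValueError("an all-zero matrix cannot be rescaled")
    return [[value / scale for value in row] for row in matrix]

# Two toy modalities on very different scales (made-up numbers).
modality_a = [[120.0, 80.0, 100.0], [90.0, 110.0, 130.0]]
modality_b = [[0.2, 0.1], [0.4, 0.3]]

scaled_a = scale_to_unit_average_square(modality_a)
scaled_b = scale_to_unit_average_square(modality_b)
```

After scaling, average_sum_of_squares gives 1 (up to floating-point rounding) for both matrices. Because only one global factor is applied per modality, the relative pattern of values within each matrix is unchanged.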
Here is a paper which compares some of the approaches you could choose from for clustering:
http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
I hope this can help you a little.

Related

dimensional time series data clustering

I have a time-series data set with three dimensions: acceleration, speed and grade. I want to apply clustering to identify clusters that have similar speed (acceleration zero, positive or negative) varying with grade. I do not know what type of clustering I should use. Surely k-means cannot help me, because there is serial correlation between my data points: each point is affected by the previous one. Could you please help me choose a type of clustering?
Popular time series similarity measures such as DTW can be implemented for multiple variables the same way as for a single variable. The most challenging part is normalization.
You can then run hierarchical clustering on the resulting distances trivially. Do not use KMeans.
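A minimal sketch of that idea, assuming each series is a list of (acceleration, speed, grade) tuples and that each variable is z-normalized per series before comparison (one possible normalization, not the only one). The function names are illustrative, and the DTW below is the plain quadratic-time textbook version without a warping window:

```python
import math
from statistics import mean, pstdev

def z_normalize(series):
    """Scale every variable (column) of one series to zero mean and unit variance."""
    normalized = []
    for column in zip(*series):
        mu, sigma = mean(column), pstdev(column)
        # A constant column carries no shape information; map it to zeros.
        normalized.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in column])
    return list(zip(*normalized))

def dtw_distance(a, b):
    """Dynamic time warping cost between two series of equal-dimension points."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between the two multivariate points.
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def pairwise_dtw(all_series):
    """Square matrix of DTW distances between z-normalized series."""
    prepared = [z_normalize(s) for s in all_series]
    return [[dtw_distance(p, q) for q in prepared] for p in prepared]
```

The resulting matrix of pairwise distances can be handed to any hierarchical clustering routine that accepts precomputed distances.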

Reducing a matrix of feature vectors to a single, meaningful vector

I have matrices of feature vectors, each vector 200 features long, in which the feature vectors within a matrix are temporally related, and I wish to reduce each matrix to a single, meaningful vector. I have applied PCA to the matrix to reduce its dimensionality while keeping the high-variance components, and am considering concatenating its rows into one feature vector to summarize the data.
Is this a sensible approach, or are there better ways of achieving this?
So you have an n x 200 feature matrix, where n is your number of samples, and 200 features per sample, and each feature is temporally related to all others? Or you have individual feature matrices, one for each time point, and you want to run PCA on each of these individual feature matrices to find a single eigenvector for that time point, and then concatenate those together?
PCA seems more useful in the second case.
While this is doable, this is maybe not the best way to go about it because you lose temporal sensitivity by collapsing together features from different times. Even if each feature in your final feature matrix represents a different time, most classifiers cannot learn about the fact that feature 2 follows feature 1 etc. So you lose the natural temporal ordering by doing this.
If you care about the temporal relationship between these features, you may want to take a look at recurrent neural networks, which allow you to feed information from t-1 into a node at the same time as feeding in your current features at t. So in a sense they learn about the relationship between the t-1 and t features, which will help you preserve temporal ordering. See this for an explanation: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
If you don't care about time and just want to group everything together, then yes PCA will help reduce your feature count. Ultimately it depends what type of information you think is more relevant to your problem.

Difference between similarity strategies in Mahout recommenditembased

I am using the Mahout recommenditembased algorithm. What are the differences between all the --similarity classes available? How do I know which is the best choice for my application? These are my choices:
SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_EUCLIDEAN_DISTANCE
What does each one mean?
I'm not familiar with all of them, but I can help with some.
Cooccurrence is how often two items occur with the same user. http://en.wikipedia.org/wiki/Co-occurrence
Log-likelihood, as used for item similarity, is based on the log-likelihood ratio test: it scores how unlikely it is that the overlap in users between two items is due to chance. http://en.wikipedia.org/wiki/Log-likelihood
Not sure about Tanimoto.
City block is the distance between two instances if you assume you can only move around as in a checkerboard-style city. http://en.wikipedia.org/wiki/Taxicab_geometry
Cosine similarity is the cosine of the angle between the two feature vectors. http://en.wikipedia.org/wiki/Cosine_similarity
Pearson correlation is the covariance of two feature vectors divided by the product of their standard deviations. http://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Euclidean distance is the standard straight line distance between two points. http://en.wikipedia.org/wiki/Euclidean_distance
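To make the definitions above concrete, here is a small sketch of the textbook formulas. These are not Mahout's implementations, which may differ in details such as how distances are turned into similarity scores; the function names are chosen only for this example. The set-based measures take the collections of users who interacted with each item, the others take two equal-length rating vectors:

```python
import math

def cooccurrence(users_a, users_b):
    """Number of users who interacted with both items."""
    return len(set(users_a) & set(users_b))

def tanimoto(users_a, users_b):
    """Size of the intersection divided by size of the union (Jaccard index)."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def city_block(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))

def euclidean(u, v):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between the two vectors (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([x - mean_u for x in u], [y - mean_v for y in v])
```

Note that city block and Euclidean are distances (smaller means more alike), while the other four are similarities (larger means more alike).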
To determine which is best for your application, you most likely need to have some intuition about your data and what it means. If your data consists of continuous-valued features, then something like Euclidean distance or Pearson correlation makes sense. If you have more discrete values, then something along the lines of city block or cosine similarity may make more sense.
Another option is to set up a cross-validation experiment where you see how well each similarity metric works to predict the desired output values and select the metric that works the best from the cross-validation results.
The Tanimoto coefficient and the Jaccard index are closely related (for sets they are the same thing): a statistic used for comparing the similarity and diversity of sample sets.
https://en.wikipedia.org/wiki/Jaccard_index

What machine learning algorithms can be used in this scenario?

My data consists of objects as follows.
Obj1 - Color - shape - size - price - ranking
So I want to be able to predict which combination of color/shape/size/price is a good combination for getting a high ranking. Even a partial combination would work, e.g. the algorithm predicts the best performance for this color and this shape. Something like that.
What are the advisable algorithms for such a prediction?
Also, if you could briefly explain how I can approach building the model, I would really appreciate it. Say, for example, my data looks like
Blue pentagon small $50.00 #5
Red Square large $30.00 #3
So what is a useful prediction model that I should look at? What algorithm should I try in order to predict, say, that the highest weight goes to price, followed by color and then size? What if I wanted to predict in combinations, e.g. that a red small shape is less likely to rank higher than a pink small shape? (In essence, I am trying to combine more than one nominal-valued column to make the prediction.)
Sounds like you want to learn models that you can interpret as a human. Depending on what type your ranking variable is, a number of different learners are possible.
If ranking is categorical (e.g. stars), a classifier is probably best. There are many in Weka. Some that produce models that are understandable by humans are the J48 decision tree learner and the OneR rule learner.
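To illustrate why a rule learner of this kind is easy to read, here is a from-scratch sketch of the basic OneR idea for nominal attributes: for each attribute, predict the majority class for each of its values, then keep the attribute whose rule makes the fewest errors. It is not Weka's implementation, and the rows and the "high"/"low" labels are made up for the example:

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (attribute, rule, errors) for the single best attribute.

    rows is a list of dicts with nominal values; rule maps each value of the
    chosen attribute to the majority class seen with that value.
    """
    best = None
    for attribute in (key for key in rows[0] if key != target):
        class_counts = defaultdict(Counter)
        for row in rows:
            class_counts[row[attribute]][row[target]] += 1
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in class_counts.items()}
        errors = sum(rule[row[attribute]] != row[target] for row in rows)
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    return best

# Made-up training rows; "ranking" is treated as a categorical label here.
rows = [
    {"color": "blue", "shape": "pentagon", "size": "small", "ranking": "high"},
    {"color": "red",  "shape": "square",   "size": "large", "ranking": "low"},
    {"color": "blue", "shape": "square",   "size": "small", "ranking": "high"},
    {"color": "red",  "shape": "pentagon", "size": "large", "ranking": "low"},
    {"color": "pink", "shape": "square",   "size": "small", "ranking": "high"},
    {"color": "red",  "shape": "square",   "size": "small", "ranking": "low"},
]

attribute, rule, errors = one_r(rows, "ranking")
```

On this toy data the learned model is a single lookup table on one attribute, which a person can read directly. A decision tree extends the same idea by splitting on further attributes within each branch.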
If the ranking is continuous (e.g. a score), regression might be more appropriate. Suitable algorithms are, for example, SimpleLinearRegression and LinearRegression (SimpleLogistic, despite the name, is a classifier for nominal targets).
Alternatively, you could try clustering your examples with any of the algorithms in Weka and then analyzing the clusters. That is, ideally examples in a cluster would all be of the same (or very similar) ranking and you can have a look at the range of values of the other attributes and draw your own conclusions.
Treat the combination as a linear equation, and apply a randomized search algorithm (such as a genetic algorithm) to tune the parameters of the equation.
Code the color/shape/size/price/rankings as numeric values.
Treat the combination as a linear equation, say a*color + b*shape + c*size + d*price = ranking.
Apply a genetic algorithm to tune a/b/c/d so that the calculated rankings are as close to the ground truth as possible.
Once you have the equation, you can use it to:
1) find the maximal ranking by simple linear programming;
2) predict rankings by plugging in other parameter values.
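The steps above can be sketched roughly as follows. This is a deliberately small evolutionary search that uses only mutation and selection (no crossover), so it is a simplification of a full genetic algorithm. The integer codes for color, shape and size, the extra data rows, and all parameter values are assumptions made for the example. Note that integer codes impose an arbitrary order on nominal values, so a one-hot encoding may be a better fit in practice.

```python
import random

# Hypothetical numeric encoding: (color, shape, size, price) -> ranking.
data = [
    ((0, 0, 0, 50.0), 5.0),  # blue pentagon small $50.00 #5
    ((1, 1, 1, 30.0), 3.0),  # red square large $30.00 #3
    ((0, 1, 1, 40.0), 4.0),  # made-up extra row
    ((1, 0, 0, 20.0), 2.0),  # made-up extra row
]

def squared_error(weights):
    """Sum of squared differences between a*color + ... + d*price and the ranking."""
    return sum(
        (sum(w * x for w, x in zip(weights, features)) - ranking) ** 2
        for features, ranking in data
    )

def evolve(generations=200, population_size=30, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(4)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=squared_error)
        survivors = population[: population_size // 2]
        children = [
            [w + rng.gauss(0, 0.05) for w in rng.choice(survivors)]
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=squared_error)

a, b, c, d = evolve()
```

Because the survivors are carried over unchanged, the best error never gets worse from one generation to the next. For a purely linear model like this one, ordinary least squares would find the optimal a/b/c/d directly; a randomized search mainly pays off when the objective is not that simple.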

Bad clustering results with mahout on Reuters 21578 dataset

I've used a part of the Reuters 21578 dataset and Mahout k-means for clustering. To be more specific, I extracted only the texts that have a unique value for the category 'topics'. So I've been left with 9494 texts, each belonging to one of 66 categories. I've used seqdirectory to create sequence files from the texts and then seq2sparse to create the vectors. Then I ran k-means with the cosine distance measure (I've tried Tanimoto and Euclidean too, with no better luck), cd=0.1 and k=66 (the same as the number of categories). I then tried to evaluate the results with the silhouette measure, using custom Java code and the MATLAB implementation of silhouette (just to be sure that there is no error in my code), and I get an average silhouette of 0.0405 for the clustering. Knowing that the best clustering could give an average silhouette value close to 1, I see that the clustering result I get is no good at all.
So is this due to Mahout, or is the quality of categorization in the Reuters dataset low?
PS: I'm using Mahout 0.7
PS2: Sorry for my bad English..
I've never actually worked with Mahout, so I cannot say what it does by default, but you might consider checking what sort of distance metric it uses. For example, if the metric is Euclidean distance on unnormalized document word counts, you can expect very poor cluster quality, as document length will dominate any meaningful comparison between documents. On the other hand, something like cosine distance on normalized or tf-idf weighted word counts can do much better.
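A tiny illustration of the document-length effect, using made-up word-count vectors over a four-word vocabulary. The long document has exactly the same word proportions as the short one, while the third document uses entirely different words:

```python
import math

def euclidean(u, v):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    """1 minus the cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

short_doc = [2, 1, 0, 0]
long_doc = [20, 10, 0, 0]   # same topic, ten times longer
other_doc = [0, 0, 2, 1]    # different topic, same length as short_doc

# Euclidean distance on raw counts puts the short document closer to the
# unrelated one; cosine distance ignores length and groups the two
# same-topic documents together.
euclid_same_topic = euclidean(short_doc, long_doc)
euclid_other_topic = euclidean(short_doc, other_doc)
cosine_same_topic = cosine_distance(short_doc, long_doc)
cosine_other_topic = cosine_distance(short_doc, other_doc)
```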
One other thing to look at is the distribution of topics in Reuters 21578. It is very skewed towards a few topics such as "acq" or "earn", while others are used only a handful of times. This can make it difficult to achieve good external clustering metrics.
