I'm working on replicating results from a paper and when the authors are describing their setup for SVM they say this:
To increase the dimensionality of our feature vectors to be better
suited to SVMs, we expanded the feature space by taking the polynomial
combinations of degree less than or equal to 2, of all features. This
increased the number of features from 12 to 91.
How would you do this in the gui version of weka?
I really can't figure out what setting they changed to increase the number of attributes by 79. I've looked through the internet and through the weka documentation and even just clicking around on the gui but I can't seem to find any functionality that would do this.
Thank you for your help!
It seems the authors of the paper do not really understand how SVM works. Simply train SVM with polynomial kernel of degree 2 and you will get the same expressive power.
Related
I've written a program that reads word vectors from a particular website and makes conclusary classifications.
I'm getting the highest accuracy and F Score for a RandomForestClassifier. I'm not sure what I can do to make this accuracy higher except changing the model. I tried to use MLPs but landed with a lower accuracy. Should I use some other neural network?
Does anyone know what models generally work for such NLP based programs?
In a nutshell, what the program does is look through the HTML of a given webpage for certain features, vectorize the words it can find (using predefined vector spaces) and make classifications based on that. I'm getting an accuracy over 90% for the RandomForestClassifier. Any help?
Building a classifier for classical problems, like image classification, is quite straightforward, since by visualization on the image we know the pixel values do contain the information about the target.
However, for the problems in which there is no obvious visualizable pattern, how should we evaluate or to see if the features collected are good enough for the target information? Or if there are some criterion by which we can conclude the collected features does not work at all. Otherwise, we have to try different algorithms or classifiers to verify the predictability of the collected data. Or if there is a thumb rule saying that if apply classical classifiers, like SVM, random forest and adaboost, we cannot get a classifier with a reasonable accuracy (70%) then we should give up and try to find some other more related features.
Or by some high dim visualization tool, like t-sne, if there is no clear pattern presented in some low dim latent space, then we should give up.
First of all, there might be NO features that explain the data well enough. The data may simply be pure noise without any signal. Therefore speaking about "reasonable accuracy" of any level e.g. 70% is improper. For some data sets a model that explains 40 % of its variance will be fantastic.
Having said that, the simplest practical way to evaluate the input features is to calculate correlations between each of them and the target.
Models have their own ways of evaluating features importance.
I've built an algorithm for pedestrian detection using openCV tools. To perform classification I use a boosted classifier trained with the CvBoost class.
The problem of this implementation is that I need to feed my classifier the whole set of features I used for training. This makes the algorithm extremely slow, so much that each image takes around 20 seconds to be fully analysed.
I need a different detection structure, and openCV has this Soft Cascade class that seems like exactly what I need. Its basic principle is that there is no need to examine all the features of a testing sample, since a detector can reject most negative samples using a small number of features. The problem is that I have no idea how to train one given a fully labeled set of negative and positive examples.
I find no information about this online, so I am looking for any tips you can give me on how to use this soft cascade to make classification.
Best regards
I have my own java based implementation of clustering (knn). However I am facing scalability issues. I do not plan to use Mahout because my requirements are very simple and mahout requires lot of work. I am looking for java based Canopy clustering implementation which i can plug into my algo and do parellel processing.
Mahout based Canopy libraries are coupled with Vectors and indexes and does not work on plain strings. If you know of the way, where i can use canopy clustering on strings using simple library, it would fix my issue.
My requirement is to pass list of strings(say 10K) to Canopy clustering algo and it should return sublists based on T1 and T2.
Canopy clustering is mostly useful as a preprocessing step for parallelization. I'm not sure how much it will get you on a single node. I figure you might as well compute the actual algorithm right away, or build an index such as an M-tree.
The strength of Canopy clustering is that you can run it independently on a number of nodes and then just overlap their results.
Also check if it actually is compatible to your approach. I figure that canopy might need metric properties to be correct. Is your string distance a proper metric (i.e. triangle inequality)?
10,000 data points, if that's all you're concerned with, should be no problem with standard k-means. I'd look at optimising that before you consider canopy clustering (which is really designed for millions or even billions of examples). Some things you may have missed:
pre-compute the feature vectors for each string. Don't do it every time you want to compare s_1 to s_2 or s_1 to cluster centroid
you only need to keep the summary statistics in memory: the sum of all points assigned to a cluster and the number of points assigned to a cluster. When you're done with an iteration, divides sums by ns and you have your new centroids.
what's the dimensionality of your feature space? be aware that you should use a distance metric where the dimensions where both vectors are zero have no impact, so you should only need to compute for non-zero dimensions. Store your points as sparse vectors to facilitate this.
Can you do some analysis and determine where the bottle-neck in your implementation is? I'm a little perplexed by your comment about Mahout not working with plain strings.
You should give the clustering algorithms in ELKI a try. Sorry for so shamelessly promoting a project I'm closely affiliated with. But it is the largest collection of clustering and outlier detection algorithms that are implemented in a comparable fashion. (If you'd take all the clustering algorithms available in some R package, you might end up with more algorithms, but they won't be really comparable because of implementation differences)
And benchmarking showed enormous speed differences with different implementations of the same algorithm. See our benchmarking web site on how much performance can vary even on simple algorithms such as k-means.
We do not yet have Canopy Clustering. The reason is that it's more of a preprocessing index than actually a clustering algorithm. Kind of like a primitive variant of the M-tree, or of DBSCAN clustering. However, we should would like to see a contributed canopy clustering as such a preprocessing step.
ELKIs abilities to process strings are also a bit limited so far. You can load typical TF-IDF vectors just fine and we have somewhat optimized sparse vector classes and similarity functions. They don't fully exploit sparsity for k-means yet, though, and there is no spherical k-means yet either. But there are various reasons why k-means results on sparse vectors cannot be expected to be very meaningful; it's more of a heuristic.
But it would be interesting if you could give it a try for your problem and report back your experiences. Was the performance somewhat competitive with your implementation? And we would love to see contributed modules for text processing, such as e.g. further optimized similarity functions, or a spherical k-means variant.
Update: ELKI now actually includes CanopyClustering: CanopyPreClustering (will be part of 0.6.0 then). But as of now, it's just another clustering algorithm, and not yet used to accelerate other algorithms such as k-means. I need to check how to best use it as some kind of index to accelerate algorithms. I can imagine it also helps for speeding up DBSCAN if you set T1=epsilon and T2=0.5*T1. The big issue with CanopyClustering IMHO is how to choose a good radius.
I am implementing K nearest neighbour algorithm for a very sparse data. I want to calculate the distance between a test instance and each sample in the training set, but I am confused.
Because most of the features in training samples don't exist in test instance or vice versa (missing features).
How can I compute the distance in this situation?
To make sure I'm understanding the problem correctly: each sample forms a very sparsely filled vector. The missing data is different between samples, so it's hard to use any Euclidean or other distance metric to gauge similarity of samples.
If that is the scenario, I have seen this problem show up before in machine learning - in the Netflix prize contest, but not specifically applied to KNN. The scenario there was quite similar: each user profile had ratings for some movies, but almost no user had seen all 17,000 movies. The average user profile was quite sparse.
Different folks had different ways of solving the problem, but the way I remember was that they plugged in dummy values for the missing values, usually the mean of the particular value across all samples with data. Then they used Euclidean distance, etc. as normal. You can probably still find discussions surrounding this missing value problem on that forums. This was a particularly common problem for those trying to implement singular value decomposition, which became quite popular and so was discussed quite a bit if I remember right.
You may wish to start here:
http://www.netflixprize.com//community/viewtopic.php?id=1283
You're going to have to dig for a bit. Simon Funk had a little different approach to this, but it was more specific to SVDs. You can find it here: http://www.netflixprize.com//community/viewtopic.php?id=1283
He calls them blank spaces if you want to skip to the relevant sections.
Good luck!
If you work in very high dimension space. It is better to do space reduction using SVD, LDA, pLSV or similar on all available data and then train algorithm on trained data transformed that way. Some of those algorithms are scalable therefor you can find implementation in Mahout project. Especially I prefer using more general features then such transformations, because it is easier debug and feature selection. For such purpose combine some features, use stemmers, think more general.