Multidimensional hyperparameter search with vw-hypersearch in Vowpal Wabbit

vw-hypersearch is the Vowpal Wabbit wrapper intended to optimize hyperparameters in vw models: regularization rates, learning rates and decays, minibatch sizes, bootstrap sizes, etc. The tutorial for vw-hypersearch gives the following example:
vw-hypersearch 1e-10 5e-4 vw --l1 % train.dat
Here % marks the parameter to be optimized, and 1e-10 and 5e-4 are the lower and upper bounds of the interval over which to search. The library uses the golden-section search method to minimize the number of iterations.
But what if I want to search over multiple hyperparameters? From sources like this github issue discussion, I get the hint that no multidimensional search methods are currently implemented in vw. Thus, the only way out is to write one's own task-specific optimizer. Am I right?

Now this can be done with the module vw-hyperopt.py, which lives at /vowpal_wabbit/utl/ in the repository.
See my pull-request here: https://github.com/JohnLangford/vowpal_wabbit/pull/867
In the near future this will be better documented.
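Until then, if you only need a couple of extra dimensions, one stop-gap is to drive vw yourself from a short script. The sketch below is just an illustration, not vw-hyperopt.py itself: it assumes a train.dat file exists, grid-searches --l1 and --l2 by invoking vw, and parses the "average loss" summary that vw reports at the end of training.

import itertools
import re
import subprocess

# Hypothetical two-dimensional grid over --l1 and --l2; vw-hyperopt.py explores
# the space far more cleverly, but a plain grid shows the idea.
l1_values = [1e-8, 1e-6, 1e-4]
l2_values = [1e-8, 1e-6, 1e-4]

def train_loss(l1, l2):
    # vw prints a final "average loss = ..." summary on stderr.
    result = subprocess.run(
        ["vw", "--l1", str(l1), "--l2", str(l2), "-d", "train.dat"],
        capture_output=True, text=True,
    )
    match = re.search(r"average loss = ([0-9.eE+-]+)", result.stderr)
    return float(match.group(1))

best = min(itertools.product(l1_values, l2_values), key=lambda p: train_loss(*p))
print("best (l1, l2):", best)

For an honest comparison you would evaluate on a held-out set rather than the training loss, but the wiring stays the same.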

Related

How to Customize Metric for GridSearchCV in Scikit Learn to tune for specific class?

I have a use case in ML where I have 2 classes, 0 and 1, for a given text.
Class-0: Can afford some misclassifications
Class-1: Very Important, can't afford any misclassifications
There's a huge imbalance in samples between the two classes: about 30000 for class-0 and only 1000 for class-1.
While doing the train-test split, I'm stratifying the split based on the labels, such that the 70% train / 30% test ratio is maintained for each label class.
I want to tune parameters in such a way that precision or recall for class-1 improves. I tried 'f1_macro', 'precision', and 'recall' as individual metrics, and all of them combined, to tune with GridSearchCV, but none of it helps much because the majority of samples belong to class-0.
I'm exploring safe ways to reduce the class-0 data, although there is only so much it can be reduced; in any case, even without tuning, class-0 always has an F1-score above 98%.
So all I care about tuning for is class-1.
Can you please suggest, perhaps, a custom callable metric that focuses only on class-1's precision, recall or F1-score?
I'm using the latest stable version of scikit-learn.
A similar problem is described here, where the author is trying to tune class-1's F1-score using neural networks (MLP) in Keras.
It's been suggested to try a custom metric, but it wasn't explained how.
Whoever can answer this for scikit-learn can also answer the Keras question linked below.
Hyperparameter tuning in Keras (MLP) via RandomizedSearchCV
Using class_weight='balanced' is helping here.
I referred to these articles in scikit-learn's official documentation pages:
Understanding how the class_weight parameter works:
https://scikit-learn.org/stable/modules/svm.html#unbalanced-problems
https://stackoverflow.com/a/30982811/3149277
Understanding what values to use for class_weight:
https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use
How does the class_weight parameter in scikit-learn work?
Although, due to time limits, I didn't end up defining a custom scoring function, since this already worked close to my expectations.
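For completeness, here is a rough sketch of the custom-scorer approach for anyone who does want it. The SVC estimator and the small parameter grid are just placeholders for your own setup; the key piece is make_scorer with pos_label=1, which makes GridSearchCV optimize class-1's F1-score only.

from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Score only class 1: pos_label=1 means class-0 performance is ignored entirely.
class1_f1 = make_scorer(f1_score, pos_label=1)

# Placeholder estimator and grid; substitute your own model and parameters.
param_grid = {"C": [0.1, 1, 10], "class_weight": ["balanced", {0: 1, 1: 30}]}
search = GridSearchCV(SVC(), param_grid, scoring=class1_f1, cv=5)
# search.fit(X_train, y_train)  # your stratified training split

Swapping f1_score for precision_score or recall_score (with the same pos_label) gives the other class-1-only variants.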

Pairwise Ranking for Images

I wanted to know whether there are any machine-learning algorithms that can rank a particular set of images given their quality and other features using pairwise comparison, like the Learning to Rank algorithms (RankNet, LambdaRank and LambdaMART). Can these LTR algorithms be used for image ranking too, and are there any good sources with implementation-level explanations?
In the past I have used the BRISQUE method to assess image quality. You can combine that with any other model of your choice and ensemble the results to form a ranking. Here is a detailed implementation and explanation of the BRISQUE method:
https://towardsdatascience.com/automatic-image-quality-assessment-in-python-391a6be52c11
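If you want to see the pairwise idea itself in code, here is a minimal sketch of the classic reduction behind RankNet-style methods: turn pairs of images into a binary classification problem on feature differences, then rank by the learned score. The feature vectors and "true" quality scores below are synthetic placeholders; in practice they would come from BRISQUE or your other image features.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))          # 20 images, 5 features each (placeholder)
true_scores = features @ rng.normal(size=5)  # hidden "quality" used only to build labels

# Pairwise training data: label 1 if image i should rank above image j.
pairs, labels = [], []
for i in range(len(features)):
    for j in range(len(features)):
        if i != j:
            pairs.append(features[i] - features[j])
            labels.append(int(true_scores[i] > true_scores[j]))

clf = LogisticRegression().fit(np.array(pairs), np.array(labels))

# Rank images by the learned linear score.
ranking = np.argsort(-(features @ clf.coef_.ravel()))
print(ranking)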

What kind of feature extractor is used in vowpal wabbit?

In sklearn, when we pass sentences to algorithms, we can use text feature extractors such as CountVectorizer, the tf-idf vectorizer, etc., and we get back an array of floats.
But what do we get when we pass Vowpal Wabbit an input file like this one:
-1 |Words The sun is blue
1 |Words The sun is yellow
What is used in the internal implementation of Vowpal Wabbit? How is this text transformed?
There are two separate questions here:
Q1: Why can't you (and shouldn't you) use transformations like tf-idf when using vowpal wabbit?
A1: vowpal wabbit is not a batch learning system, it is an online-learning system. In order to compute measures like tf-idf (term frequency in each document vs the whole corpus) you need to see all the data (corpus) first, and sometimes do multiple passes over the data. vowpal wabbit as an online/incremental learning system is designed to also work on problems where you don't have the full data ahead of time. See This answer for a lot more details.
Q2: How does vowpal wabbit "transform" the features it sees?
A2: It doesn't. It simply maps each word feature on the fly to its hashed location in memory. The online learning step is driven by an iterative optimization loop (SGD or BFGS), example by example, to minimize the modeling error. You may select the loss function to optimize for.
However, if you already have the full data you want to train on, nothing prevents you from transforming it (using any other tool) before feeding the transformed values to vowpal wabbit. It's your choice. Depending on the particular data, you may get better or worse results using a transformation pre-pass, than by running multiple passes with vowpal wabbit itself without preliminary transformations (check-out the vw --passes option).
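To make the "no transformation, just hashing" point concrete, here is a toy illustration of the hashing trick. This is not vw's actual code (vw uses murmurhash with namespace handling); it only shows the idea of mapping a feature name straight to an index in a fixed-size weight table.

import hashlib

BITS = 18  # vw's default weight table has 2**18 slots (-b 18)

def feature_index(word, bits=BITS):
    # Toy stand-in for vw's hashing; vw really uses murmurhash3 plus namespaces.
    digest = hashlib.md5(word.encode()).hexdigest()
    return int(digest, 16) % (2 ** bits)

for w in ["The", "sun", "is", "yellow"]:
    print(w, "->", feature_index(w))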
To complete the answer, let's add another related question:
Q3: Can I use pre-transformed (e.g. tf-idf) data with vowpal wabbit?
A3: Yes, you can. Just use the following (post-transformation) form: instead of words, use integers as feature IDs, and since any feature can have an optional explicit weight, use the tf-idf floating-point values as weights, following the : separator, in typical SVMlight format:
-1 | 1:0.534 15:0.123 3:0.27 29:0.066 ...
1 | 3:0.1 102:0.004 24:0.0304 ...
The reason this works is that vw has a nice feature of distinguishing between string and integer features. It doesn't hash feature names that look like integers (unless you use the --hash_all option explicitly). Integer feature numbers are used directly, as if they were the hash result of the feature.
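As a small illustration of A3, here is one way (among many) to produce such a file from scikit-learn's TfidfVectorizer. The output filename and the +1 shift of the feature indices are arbitrary choices of this sketch, not anything vw requires.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The sun is blue", "The sun is yellow"]
labels = [-1, 1]

tfidf = TfidfVectorizer().fit_transform(docs)

# Write each row as a vw example: label | feature_id:weight ...
with open("train_tfidf.vw", "w") as out:
    for label, row in zip(labels, tfidf):
        feats = " ".join(f"{j + 1}:{v:.4f}" for j, v in zip(row.indices, row.data))
        out.write(f"{label} | {feats}\n")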

Improvements of Random Search for Hyperparameter Optimization

Random search is one possibility for hyperparameter optimization in machine learning. I have applied random search to find the best hyperparameters of an SVM classifier with an RBF kernel. In addition to the continuous cost (C) and gamma parameters, I have one discrete parameter and also an equality constraint over some parameters.
Now I would like to develop random search further, e.g. through adaptive random search: that means, for example, adapting the search direction or the search range.
Does somebody have an idea how this can be done, or could point to some existing work on this? Other ideas for improving random search are also welcome.
Why try to reinvent the wheel? Hyperparameter optimization is a well-studied topic, with at least a few state-of-the-art methods that simply solve the problem for SVMs, including:
Bayesian optimization (usually through modeling model quality with Gaussian processes); see for example bayesopt: http://rmcantin.bitbucket.org/html/
Tree of Parzen Estimators (sometimes better for discrete, complex hyperparameter spaces), included (in particular) in hyperopt: http://hyperopt.github.io/hyperopt/
To improve the random search procedure, you can also look at Hyperband.
Hyperband is a method proposed by the UC Berkeley AMP Lab that aims to improve the efficiency of tuning methods like random search.
I'd like to add that Bayesian optimization is a perfect example of adaptive random search, so it looks like exactly what you want to apply.
The idea of Bayesian optimization is to model the target function using Gaussian Processes (GP), select the best next point according to the current model, and update the model after seeing the actual outcome. So, effectively, Bayesian optimization starts like a random search, gradually builds a picture of what the function looks like, and shifts its focus to the most promising areas (note that "promising" can be defined differently by particular methods - PI, EI, UCB, etc.). There are further techniques to help it find the right balance between exploration and exploitation, for example the portfolio strategy. If that's what you mean by adaptive, then Bayesian optimization is your choice.
If you'd like to extend your code without external libraries, it's totally possible because Bayesian optimization is not that hard to implement. You can take a look at sample code that I used in my research, for example here is the bulk of GP-related code.
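To make the TPE/Bayesian-style suggestion concrete, here is a small sketch with hyperopt tuning C and gamma of an RBF SVM. The iris dataset and the log-uniform bounds are just placeholders for your own data and ranges; the discrete parameter and the equality constraint would need extra handling (e.g. hp.choice plus a reparametrization inside the objective).

import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # placeholder dataset

# Log-uniform priors over C and gamma, as is common for RBF SVMs.
space = {
    "C": hp.loguniform("C", np.log(1e-3), np.log(1e3)),
    "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e1)),
}

def objective(params):
    clf = SVC(kernel="rbf", **params)
    # hyperopt minimizes, so return the negative cross-validated accuracy.
    return -cross_val_score(clf, X, y, cv=5).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)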

Clustering of news articles

My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment), and I know that some of them cover the same story/topic. I would now like to group these articles based on shared story/topic, i.e., based on their similarity.
What I have done so far is apply basic NLP techniques, including stopword removal and stemming. I also calculated the tf-idf vector for each article, and with this I can also calculate, e.g., the cosine similarity based on these tf-idf vectors. But now I struggle a bit with the grouping of the articles. I see two principal ways -- probably related -- to do it:
1) Machine Learning / Clustering: I have already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive (for me) to specify.
2) Graph algorithms: I can represent my data as a graph, with the articles being the nodes and weighted edges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then apply graph algorithms to look for strongly connected subgraphs.
In short, I'm not sure where best to go from here -- I'm still pretty new in this area. I wonder if there are some best practices for this, or some kind of guidelines about which methods/algorithms can (or cannot) be applied in certain scenarios.
(EDIT: forgot to link to a related question of mine)
Try the class of Hierarchical Agglomerative Clustering (HAC) algorithms with single and complete linkage.
These algorithms do not need the number of clusters as input.
The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stopping based on a threshold criterion. A closely related class is the divisive clustering algorithms, which first build up the minimal spanning tree and then prune off a branch of the tree based on inter-cluster similarity ratios.
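A quick sketch of that with scikit-learn's AgglomerativeClustering, fed with the cosine distances you already have. The three toy documents and the 0.8 distance cutoff are only placeholders to tune on your data; note that scikit-learn versions before 1.2 call the metric parameter "affinity".

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "striker scores late winner in cup final",
    "cup final decided by late winning goal",
    "central bank raises interest rates again",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
D = cosine_distances(X)

# n_clusters=None plus distance_threshold lets the cutoff decide the cluster count.
hac = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.8,   # assumed cutoff; tune on your data
    metric="precomputed",     # "affinity" in scikit-learn < 1.2
    linkage="complete",
)
print(hac.fit_predict(D))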
You can also try a canopy variation on k-means to create a relatively quick estimate of the number of clusters (k).
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Will you be recomputing over time, or do you only care about a static set of news? I ask because your k may change a bit over time.
Since you can model your dataset as a graph, you could apply stochastic clustering based on Markov models. Here are links to resources on the MCL algorithm:
Official thesis description and code base
Gephi plugin for MCL (to experiment and evaluate the method)
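If you want to try the graph route from your point 2) before reaching for MCL, a bare-bones variant is to threshold the similarity matrix and take connected components with networkx. The tiny similarity matrix and the 0.5 threshold below are placeholders; in practice the matrix comes from your tf-idf cosine similarities.

import networkx as nx
import numpy as np

# Pairwise cosine similarities between articles (placeholder values).
similarities = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

threshold = 0.5  # assumed cutoff below which articles are considered unrelated
G = nx.Graph()
G.add_nodes_from(range(len(similarities)))
for i in range(len(similarities)):
    for j in range(i + 1, len(similarities)):
        if similarities[i, j] >= threshold:
            G.add_edge(i, j, weight=similarities[i, j])

# Each connected component is one candidate story/topic group.
print([sorted(c) for c in nx.connected_components(G)])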
