Is Hadoop installation mandatory when using Apache Mahout?

Does Apache Mahout work without Hadoop? If not, which parts of Mahout specifically depend on Hadoop? I am trying the Mahout clustering implementations.
Thanks.
Shahid.

According to the Mahout FAQ, Hadoop is not required for all of the algorithms implemented in Mahout. In particular, the following algorithms can run without Hadoop:
User-based collaborative filtering
Item-based collaborative filtering
Matrix factorization with alternating least squares
Matrix factorization with alternating least squares on implicit feedback
Weighted matrix factorization
Logistic regression
Hidden Markov models
Canopy clustering
k-means clustering
Fuzzy k-means
Streaming k-means
Singular value decomposition
Lanczos algorithm

Related

Can we specify which algorithm to use (e.g., decision tree, SVM, ensemble, NNs) in Vowpal Wabbit? Or does it select the algorithm itself, AutoML-style?

I am trying to read the documentation of Vowpal Wabbit, and it doesn't specify how to select specific learning algorithms (not loss functions) like SVM, NN, decision trees, etc. How does one select a specific learning algorithm?
Or does it select the algorithm itself depending on the problem type (regression/classification), like an AutoML-style or low-code ML library?
There are some blogs showing how to use neural networks with the --nn option, but that isn't part of the documentation. Is this because VW doesn't focus on specific algorithms, as noted above? If so, what is Vowpal Wabbit in essence?
Vowpal Wabbit is based on online learning (SGD-like updates, but there is also --bfgs if you really need batch optimization) and (machine learning) reductions. See some of the tutorials or papers to understand the idea of reductions. Many VW papers are also about Contextual Bandit, which is implemented as a reduction to cost-sensitive one-against-all (OAA) classification (which is further reduced to regression). See a simple intro to reductions or a simple example of how binary classification is reduced to regression.
As far as I know, VowpalWabbit does not support decision trees or ensembles, but see --boosting and --bootstrap. It does not support SVM, but see --loss_function hinge (hinge loss is one of the two key concepts of SVM) and --ksvm. It does not support NN, but --nn (and related options) provides very limited support, simulating a single hidden layer (feed-forward with a tanh activation function), which can be added to the reduction stack.
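To make the flags above concrete, here is a minimal sketch using the official vowpalwabbit Python bindings; the exact class name differs across versions (pyvw.vw in older releases, Workspace in 9.x), and the data is purely illustrative.

```python
# A minimal sketch, assuming the vowpalwabbit Python bindings (9.x API).
from vowpalwabbit import Workspace

# Hinge loss approximates a linear-SVM-style objective; swapping in
# "--nn 5" would instead add a small hidden layer to the reduction stack.
model = Workspace("--loss_function hinge --binary --quiet")

# VW consumes examples as text: "<label> | <feature>[:<value>] ..."
train = [
    "1 | height:0.9 weight:0.8",
    "-1 | height:0.2 weight:0.1",
]
for ex in train:
    model.learn(ex)

# With --binary the prediction is -1 or 1.
print(model.predict("| height:0.85 weight:0.7"))
```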

Is Random Forest a special case of AdaBoost?

What is the difference if we use a decision tree as the base estimator in the AdaBoost algorithm?
Is Random Forest a special case of AdaBoost?
Most certainly not; Random Forest is a bagging (short for bootstrap aggregating) ensemble algorithm, which is different from boosting; check here for their differences.
What is the difference if we use a decision tree as the base estimator in the AdaBoost algorithm?
You don't get a Random Forest, but a Gradient Tree Boosting Machine, available in several packages like xgboost (R/Python), gbm (R), scikit-learn (Python) etc.
Check chapter 8 of the excellent (and freely available) book An Introduction to Statistical Learning for more, or The Elements of Statistical Learning (heavy in math & theory, not for the faint-hearted)...
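As a side-by-side illustration of the bagging-vs-boosting distinction, here is a minimal scikit-learn sketch; the synthetic data and hyperparameters are illustrative, not a recommendation.

```python
# A minimal sketch contrasting bagging (Random Forest) with boosting
# (AdaBoost over decision trees) in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: deep trees grown independently on bootstrap samples, then averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: shallow trees (stumps here) fitted sequentially, each one
# reweighting the examples the previous ones got wrong.
# (Older scikit-learn versions use base_estimator= instead of estimator=.)
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200, random_state=0)

for name, clf in [("random forest", rf), ("adaboost", ada)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```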

Which machine learning algorithms are supported by Spark MLlib but not by Mahout, and vice versa?

I want a list of ML algorithms that are supported by Spark MLlib but not by Mahout, and a list of ML algorithms that are supported by Mahout but not by Spark MLlib. Thanks.
I think this page gives a good overview. It lists all of the algorithms Spark MLlib supports; if an algorithm is not shown there, you can assume it is not supported.
Classification: logistic regression, naive Bayes, ...
Regression: generalized linear regression, survival regression, ...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs), ...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
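For a feel of how one of the listed algorithms is used, here is a minimal sketch of k-means with Spark MLlib's DataFrame API (pyspark); the data and parameters are illustrative only.

```python
# A minimal sketch of k-means in Spark MLlib (pyspark DataFrame API).
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-kmeans").getOrCreate()

# Two obvious groups of 2-D points, stored in the default "features" column.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(df)
print(model.clusterCenters())

spark.stop()
```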

Use sklearn DBSCAN model to classify new entries

I have a huge "dynamic" dataset and I'm trying to find interesting clusters on it.
After running a lot of different unsupervised clustering algorithms I have found a configuration of DBSCAN which gives coherent results.
I would like to extrapolate the model that DBSCAN creates on my test data so I can apply it to other datasets, without re-running the algorithm. I cannot run the algorithm over the whole dataset because it would run out of memory, and the model might not make sense at a later point in time, since the data is dynamic.
Using sklearn, I have found that other clustering algorithms - like MiniBatchKMeans - have a predict method, but DBSCAN does not.
I understand that for MiniBatchKMeans the centroids uniquely define the model. But such a thing might not exist for DBSCAN.
So my question is: what is the proper way to extrapolate the DBSCAN model? Should I train a supervised learning algorithm using the output that DBSCAN gave on my test dataset? Or is there something intrinsic to the DBSCAN model that can be used to classify new data without re-running the algorithm?
DBSCAN and other 'unsupervised' clustering methods can be used to automatically propagate labels used by classifiers (a 'supervised' machine learning task) in what is known as 'semi-supervised' machine learning. I'll break down the general steps for doing this (a sketch in code follows the steps) and cite a series of semi-supervised papers that motivated this approach.
1. By some means, label a small portion of your data.
2. Use DBSCAN or another clustering method (e.g. k-nearest neighbors) to cluster your labeled and unlabeled data.
3. For each cluster, determine the most common label (if any) among its members and re-label all members of the cluster with that label. This effectively increases the amount of labeled training data.
4. Train a supervised classifier using the dataset from step 3.
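Here is a minimal sketch of steps 1-4 with scikit-learn; the synthetic data, the DBSCAN parameters, and the choice of k-nearest neighbors as the final classifier are all illustrative assumptions, not prescriptions from the papers below.

```python
# A minimal sketch of semi-supervised label propagation via DBSCAN clusters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

X = np.random.RandomState(0).rand(1000, 5)            # all data (mostly unlabeled)
y = np.full(len(X), -1)                                # -1 = unlabeled
y[:50] = np.random.RandomState(1).randint(0, 2, 50)    # step 1: a few known labels

clusters = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)   # step 2

# Step 3: give every member of a cluster that cluster's most common known label.
for c in set(clusters) - {-1}:                         # -1 is DBSCAN noise
    member = clusters == c
    known = y[member][y[member] != -1]
    if len(known):
        y[member] = np.bincount(known).argmax()

# Step 4: train a supervised classifier on the enlarged labeled set.
labeled = y != -1
clf = KNeighborsClassifier(n_neighbors=15).fit(X[labeled], y[labeled])
```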
The following papers propose some extensions to this general process to improve classification performance. As a note, all of the following papers have found that k-means is a consistent, efficient, and effective clustering method for semi-supervised learning compared to about a dozen other clustering methods. They then use k-nearest neighbors with a large K value for classification. One paper that specifically covered DBSCAN based clustering is:
- Erman, J., & Arlitt, M. (2006). Traffic classification using clustering algorithms. In Proceedings of the 2006 SIGCOMM workshop on Mining network data (pp. 281–286). https://doi.org/10.1145/1162678.1162679
NOTE: These papers are listed in chronological order and build upon each other. The 2016 Glennan paper is what you should read if you only want to see the most successful/advanced iteration.
Erman, J., & Arlitt, M. (2006). Traffic classification using clustering algorithms. In Proceedings of the 2006 SIGCOMM workshop on Mining network data (pp. 281–286). https://doi.org/10.1145/1162678.1162679
Wang, Y., Xiang, Y., Zhang, J., & Yu, S. (2011). A novel semi-supervised approach for network traffic clustering. In 5th International Conference on Network and System Security (NSS) (pp. 169–175). Milan, Italy: IEEE. https://doi.org/10.1109/ICNSS.2011.6059997
Zhang, J., Chen, C., Xiang, Y., & Zhou, W. (2012). Semi-supervised and compound classification of network traffic. In Proceedings - 32nd IEEE International Conference on Distributed Computing Systems Workshops, ICDCSW 2012 (pp. 617–621). https://doi.org/10.1109/ICDCSW.2012.12
Glennan, T., Leckie, C., & Erfani, S. M. (2016). Improved Classification of Known and Unknown Network Traffic Flows Using Semi-supervised Machine Learning. In J. K. Liu & R. Steinfeld (Eds.), Information Security and Privacy: 21st Australasian Conference (Vol. 2, pp. 493–501). Melbourne: Springer International Publishing. https://doi.org/10.1007/978-3-319-40367-0_33
Train a classifier based on your model.
DBSCAN is not easy to adapt to new objects, because you would eventually need to adjust minPts. Adding points to DBSCAN can cause clusters to merge, which you probably do not want to happen.
If you consider the clusters found by DBSCAN to be useful, train a classifier to put new instances into the same classes. You now want to perform classification, not rediscover structure.
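One common way to realize this "classify into the same classes" idea is to assign each new point to the cluster of its nearest DBSCAN core sample, or to noise if that core sample is farther than eps. Below is a minimal scikit-learn sketch; the data and parameters are illustrative.

```python
# A minimal sketch: nearest-core-sample classification of new points
# after fitting DBSCAN once.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(500, 2)
eps = 0.1
db = DBSCAN(eps=eps, min_samples=5).fit(X)

core_points = db.components_                           # coordinates of core samples
core_labels = db.labels_[db.core_sample_indices_]      # their cluster ids
nn = NearestNeighbors(n_neighbors=1).fit(core_points)

def predict(X_new):
    dist, idx = nn.kneighbors(X_new)
    labels = core_labels[idx.ravel()]
    labels[dist.ravel() > eps] = -1                     # too far from any core point: noise
    return labels

print(predict(np.array([[0.5, 0.5], [5.0, 5.0]])))
```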

Which Regression methods are suitable for binary valued features and continuous output?

I want to build a machine learning model for regression on a continuous output given binary-valued features (0, 1). The dimensionality of my problem is around 200.
Which of the following methods seems suitable for this kind of problem?
SVR with different Kernels
Regression random forest
MARS
Gradient boosting with regression tree
Kernel regression (Nadaraya-Watson kernel regression)
LSR and LARS
Stochastic gradient boosting
Intuitively speaking, anything requiring the calculation of a gradient is going to struggle on binary values. From your list, SVR and Forests would be the first place I'd look for a benchmark solution.
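If you want such a benchmark, here is a minimal scikit-learn sketch comparing the two suggestions on 200 binary features; the synthetic data and hyperparameters are purely illustrative.

```python
# A minimal benchmark sketch: SVR vs. a regression forest on binary features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 200))             # binary features
y = X[:, :10].sum(axis=1) + rng.normal(size=1000)   # continuous target

for name, reg in [("SVR (RBF)", SVR(kernel="rbf", C=1.0)),
                  ("random forest", RandomForestRegressor(n_estimators=200,
                                                          random_state=0))]:
    score = cross_val_score(reg, X, y, cv=5, scoring="r2").mean()
    print(name, round(score, 3))
```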
You can also look at expectation maximization for Bernoulli mixture models.
It deals with binary inputs. You can find the theory in the book:
Christopher M. Bishop, "Pattern Recognition and Machine Learning".
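For concreteness, here is a minimal EM sketch for a Bernoulli mixture model along the lines of Bishop's section 9.3.3; the data, the number of components K, and the iteration count are illustrative assumptions.

```python
# A minimal EM sketch for a Bernoulli mixture model (binary features).
import numpy as np

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(500, 20)).astype(float)   # binary feature matrix
N, D = X.shape
K = 3                                                  # number of mixture components

pi = np.full(K, 1.0 / K)                               # mixing weights
mu = rng.uniform(0.25, 0.75, size=(K, D))              # per-component Bernoulli means

for _ in range(50):
    # E-step: responsibilities r[n, k] ∝ pi_k * prod_d mu_kd^x_nd (1 - mu_kd)^(1 - x_nd)
    log_p = (X @ np.log(mu).T) + ((1 - X) @ np.log(1 - mu).T) + np.log(pi)
    log_p -= log_p.max(axis=1, keepdims=True)          # numerical stability
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixing weights and Bernoulli means
    Nk = r.sum(axis=0)
    pi = Nk / N
    mu = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)

print(pi)   # learned mixing proportions
```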
