How does the Google Prediction API work? [closed] - machine-learning

Can someone predict :) or guess how the Google Prediction API works under the hood?
I know there are some machine learning techniques:
decision trees, neural networks, naive Bayes classification, etc.
Which technique do you think Google is using?

The single answer to the question on Stats SE is good, given limited information from Google itself. It concludes with the same thought I had, that Google isn't telling regarding the innards of the Google Prediction API.
There was a Reddit discussion about this too. The most helpful response was from a user who is credible due to his prior work in that field (in my opinion). He wasn't certain what the Google Prediction API was using, but he had some ideas about what it was NOT using, based on discussions on the Google Group for the Prediction API:
the current implementation is not able to deal correctly with non-linear separable data sets (XOR and Circular). That probably means that they are fitting linear models such as regularized logistic regression or SVMs but not neural networks or kernel SVMs. Fitting linear models is very scalable to both wide problems (many features) and long problems (many samples) provided that you use... stochastic gradient descent with truncated gradients to handle sparsity inducing regularizers.
There was a little more, and of course some other responses. Note that Google has since released a new version of the Prediction API, but it is no more obvious (to me) how it works "under the hood".
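Purely as an illustration of the hypothesis quoted above, and emphatically not Google's actual implementation: a regularized linear model fit with stochastic gradient descent, such as scikit-learn's SGDClassifier with a sparsity-inducing penalty, fails on exactly the XOR-style data mentioned in the quote. The dataset and parameters below are invented for the sketch.

    # Sketch of the *hypothesis* above: a regularized linear model trained with SGD.
    # This is NOT Google's implementation -- just an illustration of why such a model
    # struggles with the XOR pattern mentioned in the quoted discussion.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Tiny XOR-style dataset: not linearly separable.
    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(400, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # label depends on the sign product (XOR-like)

    # Logistic regression fit by stochastic gradient descent with a sparsity-inducing
    # (L1 / elastic-net) penalty, as suggested in the quote.
    clf = SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.5, max_iter=1000)
    clf.fit(X, y)

    print("training accuracy:", clf.score(X, y))  # hovers around 0.5 -- a linear model can't fit XOR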

Related

Is it a bad idea to always standardize all features by default? [closed]

Is there a reason not to standardize all features by default? I realize it may not be necessary for, e.g., decision trees, but it is for certain algorithms such as KNN, SVM, and K-Means. Would there be any harm in just routinely doing this for all of my features?
Also, it seems to be the consensus that standardization is preferable to normalization? When would this not be a good idea?
Standardization and normalization, in my experience, have the most (positive) impact when your dataset consists of features with very different ranges (for instance, age vs. the dollar price of a house).
In my professional experience, while working on a project with car sensor data (time series), I noticed that normalization (min-max scaling), even when applied before a neural network, had a negative impact on the training process and of course on the final results. Admittedly, the sensor values were already very close to one another. It was a very interesting result to observe, considering that I was working with time series, where most data scientists resort to scaling by default (it is a neural network in the end, so the theory goes).
In principle, standardization is the better choice when there are specific outliers in the dataset, since normalization squeezes the data into a fixed range and generates smaller standard deviation values. To my knowledge, this robustness to outliers is the main reason standardization tends to be favored over normalization.
Three years ago, if someone asked me this question, I would have said "standardization" is the way to go. Now I say, follow the principles, but test every hypothesis prior to jumping to a certain conclusion.
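A minimal sketch of the "very different ranges" point, using scikit-learn; the age/price features and the KNN model are my own illustration, not part of the original answers.

    # Illustration (assumed setup): two features with very different ranges -- "age" and
    # "house price in dollars" -- fed to a distance-based model (KNN) with and without scaling.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(42)
    age = rng.uniform(18, 90, 500)                 # range ~ tens
    price = rng.uniform(50_000, 900_000, 500)      # range ~ hundreds of thousands
    X = np.column_stack([age, price])
    y = (age > 50).astype(int)                     # synthetic target driven by age only

    for name, model in [
        ("raw KNN", KNeighborsClassifier()),
        ("standardized KNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("min-max scaled KNN", make_pipeline(MinMaxScaler(), KNeighborsClassifier())),
    ]:
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name}: {score:.3f}")
    # Without scaling, the distance metric is dominated by the price feature,
    # which carries no information about the target here.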

Which supervised machine learning classification method suits for randomly spread classes? [closed]

If the classes are randomly spread or the data has a lot of noise, which type of supervised ML classification model will give better results, and why?
It is difficult to say which classifier will perform best on general problems. It often requires testing of a variety of algorithms on a given problem in order to determine which classifier performs best.
Best performance also depends on the nature of the problem. There is a great answer in this Stack Overflow question which looks at various scoring metrics. For each problem, one needs to understand and consider which scoring metric will be best.
All of that said, neural networks, Random Forest classifiers, Support Vector Machines, and a variety of others are all candidates for creating useful models given that classes are, as you indicated, equally distributed. When classes are imbalanced, the rules shift slightly, as most ML algorithms assume balance.
My suggestion would be to try a few different algorithms and tune their hyperparameters, to compare them for your specific application. You will often find one algorithm is better, but not remarkably so. In my experience, what is often of far greater importance is how your data are preprocessed and how your features are prepared. Once again, this is a highly generic answer, as it depends greatly on your given application.
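As a rough sketch of "try a few algorithms and tune the hyperparameters": the dataset, models, parameter grids, and metric below are all placeholders to be swapped for your own.

    # Sketch only: compare candidate classifiers with hyperparameter tuning inside a
    # shared preprocessing pipeline. Noisy synthetic classes stand in for "randomly
    # spread" data; choose a scoring metric that suits YOUR problem.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.3, random_state=0)

    experiments = [
        (RandomForestClassifier(random_state=0),
         {"model__n_estimators": [100, 300], "model__max_depth": [None, 5]}),
        (SVC(),
         {"model__C": [0.1, 1, 10], "model__gamma": ["scale", 0.1]}),
    ]

    for estimator, grid in experiments:
        pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
        search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
        search.fit(X, y)
        print(type(estimator).__name__, round(search.best_score_, 3), search.best_params_)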

How to choose which model to fit to data? [closed]

My question is: given a particular dataset and a binary classification task, is there a way we can choose a particular type of model that is likely to work best? E.g., consider the Titanic dataset on Kaggle here: https://www.kaggle.com/c/titanic. Just by analyzing graphs and plots, are there any general rules of thumb to pick Random Forest vs. KNNs vs. neural nets, or do I just need to test them out and then pick the best-performing one?
Note: I'm not talking about image data, since CNNs are obviously best for those.
No, you need to test different models to see how they perform.
Based on papers and Kaggle, the top algorithms seem to be boosting algorithms (XGBoost, LightGBM, AdaBoost), stacks of all of those together, or just Random Forests in general. But there are instances where Logistic Regression can outperform them.
So just try them all. If the dataset is under 100k samples, you're not going to lose that much time, and you might learn something valuable about your data.
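Purely as a sketch of "just try them all": a quick bake-off with cross-validation. GradientBoostingClassifier stands in for XGBoost/LightGBM so that scikit-learn is the only dependency, and the synthetic data is a stand-in for something like the Titanic features.

    # Sketch only: quick bake-off of common tabular classifiers.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=2000, n_features=15, random_state=1)  # stand-in data

    models = {
        "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "random forest": RandomForestClassifier(n_estimators=300, random_state=1),
        "gradient boosting": GradientBoostingClassifier(random_state=1),
        "adaboost": AdaBoostClassifier(random_state=1),
    }

    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: ROC AUC = {auc:.3f}")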

Does the use of kernels in SVMs increase the chances of overfitting? [closed]

When using kernels to delimit non-linear domains in SVMs, we introduce new features based on the training examples. We then have as many features as training examples. But having as many features as examples increases the chances of overfitting, right? Should we drop some of these new features?
You really can't drop any of the kernel-generated features; in many cases you don't even know which features are being used or what weight is being given to them. In addition to the use of kernels, SVMs use regularization, and this regularization decreases the possibility of overfitting.
You can read about the connection between the formulation of SVMs and statistical learning theory, but the high-level summary is that the SVM doesn't just find a separating hyperplane, it finds the one that maximizes the margin.
The Wikipedia article on SVMs is very good and provides excellent links on regularization, parameter search, and many other important topics.
Increasing the number of features does increase the chances of overfitting. Maybe you should use cross-validation (which libsvm includes) to test whether the model you trained is overfitting or not, and use feature selection tools to select features: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/fselect/fselect.py
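To illustrate both answers above (my own example, using scikit-learn rather than libsvm directly): the kernel adds capacity, the regularization parameter C limits how much of it is used, and cross-validation shows when a particular setting starts to overfit.

    # Illustration only: an RBF-kernel SVM implicitly uses one "feature" per training
    # example, but the regularization parameter C keeps capacity in check.
    # Cross-validation reveals when a setting starts to overfit.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_validate

    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)  # non-linearly separable toy data

    for C in (0.1, 1, 100, 10_000):
        res = cross_validate(SVC(kernel="rbf", C=C), X, y, cv=5, return_train_score=True)
        gap = res["train_score"].mean() - res["test_score"].mean()
        print(f"C={C:>7}: train={res['train_score'].mean():.3f} "
              f"test={res['test_score'].mean():.3f} gap={gap:.3f}")
    # A growing train/test gap at large C is the overfitting the question worries about;
    # tuning C by cross-validation is the usual remedy, not dropping kernel features.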

Survey to determine satisfaction: how to find the questions that mattered? [closed]

If a survey is given to determine overall customer satisfaction, and there are 20 general questions and a final summary question: "What's your overall satisfaction 1-10", how could it be determined which questions are most significantly related to the summary question's answer?
In short, which questions actually mattered and which ones were just wasting space on the survey...
Information about the relevance of individual features is given by the linear classification or regression weights associated with those features.
For your specific application, you could try training an L1 or L0 regularized regressor (http://en.wikipedia.org/wiki/Least-angle_regression, http://en.wikipedia.org/wiki/Matching_pursuit). These regularizers force many of the regression weights to zero, which means that the features associated with these weights can be effectively ignored.
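A hedged sketch of that L1 idea using scikit-learn's Lasso; the survey data below is simulated, and with real responses you would load the 20 question columns and the overall score instead.

    # Sketch of the L1-regularization idea (simulated survey; swap in your real answers).
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    n_respondents, n_questions = 500, 20
    answers = rng.randint(1, 11, size=(n_respondents, n_questions)).astype(float)  # 1-10 answers
    # Pretend only questions 2, 7 and 15 actually drive overall satisfaction.
    overall = (0.5 * answers[:, 2] + 0.3 * answers[:, 7] + 0.2 * answers[:, 15]
               + rng.normal(0, 0.5, n_respondents))

    model = Lasso(alpha=0.1)
    model.fit(StandardScaler().fit_transform(answers), overall)

    for i, w in enumerate(model.coef_):
        if abs(w) > 1e-6:
            print(f"question {i}: weight {w:.2f}")  # questions with zero weight can be ignored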
There are many different approaches for answering this question and at varying levels of sophistication. I would start by calculating the correlation matrix for all pair-wise combinations of answers, thereby indicating which individual questions are most (or most negatively) correlated with the overall satisfaction score. This is pretty straightforward in Excel with the Analysis ToolPak.
Next, I would look into clustering techniques, starting simple and moving up in sophistication only if necessary. Not knowing anything about the domain to which this survey data applies, it is hard to say which algorithm would be the most effective, but for starters I would look at k-means and variants if your clusters are likely to all be similarly sized. However, if the vast majority of responses are very similar, I would look into expectation-maximization-based algorithms. A good open-source toolkit for exploring data and testing the efficacy of various algorithms is Weka.
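The correlation step that the answer does in Excel is essentially a one-liner in pandas. A hypothetical sketch (the column names and data are invented):

    # Hypothetical sketch of the correlation step in pandas instead of Excel.
    # Assumes a DataFrame with columns "q1" .. "q20" plus "overall" (1-10 answers).
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    survey = pd.DataFrame(rng.integers(1, 11, size=(500, 21)),
                          columns=[f"q{i}" for i in range(1, 21)] + ["overall"])

    # Correlation of every question with the overall satisfaction score, strongest first.
    corr = survey.corr()["overall"].drop("overall").sort_values(key=abs, ascending=False)
    print(corr)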
