Machine learning models don't work with continuous data - machine-learning

I'm attempting to get a machine learning model to predict a baseball players Batting Average based on their At Bats and Hits. Since:
Batting Average = Hits/At Bats
I would think this relationship would be relatively easier to discover. However, since Batting Average is float (i.e. 0.300), all the models I try return the following error:
ValueError: Unknown label type: 'continuous'
I'm using sklearns models. I've tried LogisticRegression, RandomForestClassifier, LinearRegression. They all have the same problem.
From reading other StackOverflow posts on this error, I began doing this:
lab_enc = preproccessing.LabelEncoder()
y = pd.DataFrame(data=lab_enc.fit_transform(y))
Which seems to change values such as 0.227 to 136 which seems odd to me. Probably just because I don't quite understand what the transform is doing. I would, if possible, prefer just using the actual Batting Average values.
Is there a way to get the models I tried to work when predicting continuous values?

The problem you are trying to solve falls into the regression (i.e. numeric prediction) context, and it can certainly be dealt with ML algorithms.
I'm using sklearns models. I've tried LogisticRegression, RandomForestClassifier, LinearRegression. They all have the same problem.
The first two algorithms you mention here (Logistic Regression and Random Forest Classifier) are for classification problems, and thus are not suitable for your (regression) setting (they expectedly produce the error you mention). Linear Regression however is suitable and it should work fine here.
Please, for starters, stick to Linear Regression, in order to convince yourself that it can indeed handle the problem; you can subsequently extend to other scikit-learn algorithms like RandomForestRegressor etc. If you face any issues, open a new question with the specific code & error(s)...

Related

What is interpretability in machine learning?

I read this line today :
Every regression gets better with the addition of more features or variables... But adding more features increases complexity and reduces interpretability of the model as well.
I am unable to understand what is interpretability? (searched it on google but still did not get it)
Please help thank you
I would say that interpretability in a regression problems is when you can explain the result of your model to non statistician / domain experts.
For example: you try to predict the size of people depending on many variable, including sex. If you use linear regression, you will be able to say that the model will add 20cm (again, for example) to the predicted size if the person is a man (compared to a woman). The domain expert will understand the relationship between explanatory variable and the predicted result, without understanding statistics or how a linear regression works.
In addition, I disagree with the fact that the addition of more features or variables always improve regression result.
What is a better regression ? Improvement in choosen metrics ? For training or test set ? A "better regression" doesn't mean anything...
If we assume that a better regression is a regression which is better to predict the target for a new dataset, more variable doesn't always improve prediction power, especially when there is no regularization, if the added feature contains futures variables or many others cases.

Prediction of Stock Returns with ML Algrithm

I am working on a prediction model for stock returns over a fixed period of time (say n days). I am was hoping to gather a few ideas ahead of time. My questions are:
1) Would it be best to turn this into a classification problem, say create a dummy variable with returns larger than x%? Then I could try the entire arsenal of ML Algorithms.
2) If I don't turn it into a classification problem but use say a regression model, would it make sense or be necessary to transform the returns into logs?
Any thoughts are appreciated.
EDIT: My goal with this is relatively broadly defined, in the sense that I would simple like to improve performance of the selection process (pick positive returns and avoid negative ones)
Best under what quality? Turning it into a thresholding problem simply means translating the problem space to a much simpler one. Your problem definition is your own; you can turn it into a binary classification problem (>x or not), a multi-class classification problem (binning into ranges) or simply keep it as a prediction task. If you do the latter, you can still apply binning or classification as a post-processing step.
Classification is just a subclass of prediction. The log transformation employed by logistic regression is no more than a neat trick to turn the outputs into something that resembles a probability distribution; don't put too much thought into it. That said, applying transformations on your output is not necessarily bad (you could for instance apply some normalization to keep your output within the range of some activation function).

Best Learning model for high numerical dimension data? (with Rapidminer)

I have a dataset of approx. 4800 rows with 22 attributes, all numerical, describing mostly the geometry of rock / minerals, and 3 different classes.
I tried out a cross validation with k-nn Model inside it, with k= 7 and Numerical Measure -> Camberra Distance as parameters set..and I got a performance of 82.53% and 0.673 kappa. Is that result representative for the dataset? I mean 82% is quite ok..
Before doing this, I evaluated the best subset of attributes with a decision table, I got out 6 different attributes for that.
the problem is, you still don't learn much from that kind of models, like instance-based k-nn. Can I get any more insight from knn? I don't know how to visualize the clusters in that high dimensional space in Rapidminer, is that somehow possible?
I tried decision tree on the data, but I got too much branches (300 or so) and it looked all too messy, the problem is, all numerical attributes have about the same mean and distribution, therefore its hard to get a distinct subset of meaningful attributes...
ideally, the staff wants to "Learn" something about the data, but my impression is, that you cannot learn much meaningful of that data, all that works best is "Blackbox" Learning models like Neural Nets, SVM, and those other instance-based models...
how should I proceed?
Welcome to the world of machine learning! This sounds like a classic real-world case: we want to make firm conclusions, but the data rows don't cooperate. :-)
Your goal is vague: "learn something"? I'm taking this to mean that you're investigating, hoping to find quantitative discriminations among the three classes.
First of all, I highly recommend Principal Component Analysis (PCA): find out whether you can eliminate some of these attributes by automated matrix operations, rather than a hand-built decision table. I expect that the messy branches are due to unfortunate choice of factors; decision trees work very hard at over-fitting. :-)
How clean are the separations of the data sets? Since you already used Knn, I'm hopeful that you have dense clusters with gaps. If so, perhaps a spectral clustering would help; these methods are good at classifying data based on gaps between the clusters, even if the cluster shapes aren't spherical. Interpretation depends on having someone on staff who can read eigenvectors, to interpret what the values mean.
Try a multi-class SVM. Start with 3 classes, but increase if necessary until your 3 expected classes appear. (Sometimes you get one tiny outlier class, and then two major ones get combined.) The resulting kernel functions and the placement of the gaps can teach you something about your data.
Try the Naive Bayes family, especially if you observe that the features come from a Gaussian or Bernoulli distribution.
As a holistic approach, try a neural net, but use something to visualize the neurons and weights. Letting the human visual cortex play with relationships can help extract subtle relationships.

machine learning: how to generate regression model that outputs a multivariate instead of a univarite?

Given D=(x,y), y=F(x), it seems most machine learning methods only outputs y as a univariate, either a label or a real value. But I am facing a situation that x vector may only have 5~9 dimensions while I need y to be a multinomial distribution vector which can have up to 800 dimensions. This makes the problem really tricky.
I looked into a lot of things in multitask machine learning methods, where I can train all these y_i at the same time. And of course, another stupid way is that I can also train all these dimensions separately without considering the linkage between tasks. But the problem is, after reviewing many papers, seem that most MTL experiments only deal with 10~30 tasks, which means 800 tasks can be crazy and bad to train. Maybe clustering could be a solution, but I am really curious that can anyone give some suggestions about other ways to deal with this problem, not from a MTL perspective.
When the input is so "small" and the output so big, I would expect there to be a different representation of those output values. You could analyze if they are a linear or nonlinear combination of some sort, so to estimate the "function parameters" instead of the values itself. Example: We once have estimated a time series which could be "reduced" to a weighted sum of normal distributions, so we just had to estimate the weights and parameters.
In the end you will reach only a 6-to-12-dimensional subspace in some sense (not linear, probably) when you have only 6 input parameters. They can of course be a bit complicated, but to avoid the chaos in a 800-dim space I would really look into parametrizing the result.
And as I commented the machine learning that I know produce vectors. http://en.wikipedia.org/wiki/Bayes_estimator

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources