I am new to support vector regression (SVR) and its use for prediction. Can anyone help me understand whether SVR can predict future events (in time and space) and, if so, how? Thanks.
Which evaluation metric should I use for a classification problem? On what factors should I base the decision?
1. Accuracy
2. F1 Score
3. AUC ROC Score
4. Log Loss
Accuracy is a good metric when you are working with a balanced dataset. It is the number of correct predictions divided by the total number of predictions.
F1 Score is a good metric when you want both precision and recall to be high: it is their harmonic mean, so it drops when either one is low. It is also well suited to unbalanced datasets.
AUC ROC Score measures how well the model ranks positive examples above negative ones: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. I really like using this evaluation metric; it works well for both balanced and unbalanced datasets.
Log Loss is the logarithmic loss of the prediction, based on the cross-entropy between the predicted probabilities and the true labels. I have never used this metric myself.
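As a rough illustration of the four metrics above, here is a minimal sketch using scikit-learn. The labels and probabilities are made up for the example, and the 0.5 threshold is an assumption, not a recommendation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.2, 0.9, 0.7])

# Accuracy and F1 need hard labels, so threshold the probabilities (0.5 assumed).
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # AUC and log loss are computed from the probabilities, not the hard labels.
    "roc_auc": roc_auc_score(y_true, y_prob),
    "log_loss": log_loss(y_true, y_prob),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that accuracy and F1 change if you move the threshold, while AUC ROC and log loss do not depend on it.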
I am currently developing a model in Python and Keras for a binary classification task (success/failure). My aim is to generate success probabilities for each observation so that I can use them later on in another task.
Do you know of any metric that quantifies the accuracy of these probabilities individually (and not the overall accuracy of the model)?
Thank you in advance.
I am new to machine learning and am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way to get a score along with each prediction, meaning a confidence score that accompanies the predicted class. For example, in the standard age-salary-buy classification problem (predicting from age and salary whether a customer will buy the product), I want a score out of 100 for how likely the customer is to buy, in addition to the prediction of whether he will buy or not.
Currently, I am using LibSVM. Is there an algorithm that provides this information?
Thanks.
What you are looking for is the support (confidence) behind the decision. Many classifiers choose the class of x from the label set Y according to:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is the classifier's internal estimate of the probability that x has label y. Such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
The outputs of these methods are easily converted to your 0-100 scale, since a probability lies on a 0-1 scale.
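A minimal sketch of that conversion, assuming scikit-learn's LogisticRegression as the probabilistic classifier. The age/salary rows are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (age, salary in thousands) rows with buy (1) / no-buy (0) labels.
X = np.array(
    [[22, 20], [25, 32], [47, 85], [52, 110],
     [46, 28], [56, 90], [23, 18], [60, 120]],
    dtype=float,
)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

new_customers = np.array([[30, 40], [55, 100]], dtype=float)
predictions = clf.predict(new_customers)

# predict_proba returns one column per class, ordered as in clf.classes_;
# column 1 is p(buy | x), rescaled here from 0-1 to 0-100.
scores = 100 * clf.predict_proba(new_customers)[:, 1]

for row, label, score in zip(new_customers, predictions, scores):
    print(row, "->", label, f"(score {score:.1f}/100)")
```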
Others (such as SVM) use a measure that grows with confidence but is unbounded. You can obtain this value (often called the decision function), but you cannot convert it directly to a 0-100 score, because there is no "maximum" value. This is a significant drawback, so several modifications have been proposed. For SVM in particular there is Platt scaling, which fits a logistic regression on top of the SVM outputs so that you get a probability estimate. In libSVM you can set -b 1 to get probability estimates.
From the libsvm website:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
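If you work from Python, scikit-learn's SVC wraps libsvm and exposes the same idea through its probability parameter. The sketch below uses synthetic data from make_classification as a stand-in, and is one possible setup rather than the only way to calibrate an SVM.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# probability=True asks the libsvm backend to fit Platt scaling
# on top of the SVM outputs during training (this makes fit slower).
svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

margins = svm.decision_function(X[:5])   # unbounded signed scores
probs = svm.predict_proba(X[:5])[:, 1]   # estimates in [0, 1]
scores = 100 * probs                     # the 0-100 scale from the question

for m, s in zip(margins, scores):
    print(f"decision value {m:+.3f} -> score {s:.1f}/100")
```

Because the probabilities come from a separate calibration step, they can occasionally disagree with the label returned by predict for points close to the decision boundary.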
I have a cyclic method that collects a data set of 15,000 feature vectors with 30 dimensions every 200 ms. My current setup simply feeds all raw feature vectors to an SVM with an RBF (radial basis function) kernel. The classification results are unconvincing, and the process is costly in terms of time. I know that the dataset isn't that big, so real-time classification should be possible with the right subsampling of the feature vectors or something similar. The goal is to speed up the entire classification process (training and prediction) so that it takes a few milliseconds. To make the approach unsupervised, I currently run k-means to label the feature vectors: I pick a few of the resulting clusters and assign them class 1, and all others class 0.
The idea is now the following:
collect all 15,000 (N) feature vectors with 30 (D) dimensions
run PCA on all N feature vectors
use the eigenvalues to determine a reduced feature vector with d dimensions (d < D)
feed the new set of n < N feature vectors (or: the eigenvectors?) to train the SVM
Maybe a KNN approach instead of the SVM would give similar results?
Does this approach make sense?
Any ideas to improve the process or change it in order to speed it up?
How do I determine the best value of d?
The classification accuracy shouldn't suffer too much from the time reduction.
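The PCA-then-SVM steps above can be sketched as a scikit-learn pipeline. This is only an illustration under assumptions: the data is random stand-in data (1,500 vectors rather than 15,000, to keep it quick), the labels are hypothetical, and the 95% explained-variance threshold is one common heuristic for picking d, not a rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: D = 30 observed dimensions driven by 5 latent factors plus noise.
latent = rng.normal(size=(1500, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.05 * rng.normal(size=(1500, 30))
y = (latent[:, 0] > 0).astype(int)  # hypothetical labels (e.g. from k-means)

# A float n_components keeps the smallest d whose components explain
# at least that fraction of the variance (requires the full SVD solver).
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    SVC(kernel="rbf"),
)
model.fit(X, y)

pca = model.named_steps["pca"]
d = pca.n_components_
print("chosen d:", d)
print("cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_))

# New vectors go through the same scaling and projection before the SVM.
labels = model.predict(X[:3])
```

Note that the SVM is trained on the projected samples (each reduced to d dimensions), not on the eigenvectors themselves; the eigenvectors only define the projection.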
EDIT: Data stream mining
I was just reading about Data Stream Mining. I think this topic fits my setup quite well since I have to extract knowledge structures from continuous, rapid data records. Maybe I should replace the SVM with a Gradient Boosted Tree?
Thanks!
I have been working on sentiment analysis prediction using the Rotten Tomatoes movie reviews dataset.
The dataset has 5 classes {0,1,2,3,4}, where 0 is very negative and 4 is very positive.
The dataset is highly unbalanced:
total samples = 156061
'0': 7072 (4.5%),
'1': 27273 (17.4%),
'2': 79583 (50.9%),
'3': 32927 (21%),
'4': 9206 (5.8%)
As you can see, class 2 has about 50% of the samples, while classes 0 and 4 together contribute only ~10% of the training set.
So there is a very strong bias towards class 2, which reduces the classification accuracy for classes 0 and 4.
What can I do to balance the dataset? One solution would be to use an equal number of samples per class by reducing each class to only 7072 samples, but that shrinks the dataset drastically!
How can I optimize and balance the dataset without hurting the overall classification accuracy?
You should not balance the dataset; you should train the classifier in a balanced manner. Nearly all existing classifiers can be trained with some cost-sensitive objective. For example, SVMs let you "weight" your samples, so simply give samples of the smaller classes more weight. Similarly, Naive Bayes has class priors: change them! Random forests, neural networks and logistic regression all let you "weight" samples in some way; it is the core technique for getting more balanced results.
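As a worked example, here are per-class weights for the counts given in the question, computed with the common "balanced" heuristic n_samples / (n_classes * class_count). The choice of this particular formula is an assumption; any weighting that favours the rare classes follows the same idea.

```python
# Class counts taken from the question.
counts = {0: 7072, 1: 27273, 2: 79583, 3: 32927, 4: 9206}

n_samples = sum(counts.values())
n_classes = len(counts)

# Rare classes get weights above 1, the dominant class a weight below 1.
class_weight = {
    label: n_samples / (n_classes * n) for label, n in counts.items()
}

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these weights every class contributes the same total weight to the training objective, without discarding any samples.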
For classification problems, you can try the class_weight='balanced' option of your estimator, such as LogisticRegression, SVC, etc. For example:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
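A minimal sketch of that option, using synthetic unbalanced data from make_classification as a stand-in (the 90/10 split and the comparison on the training set are assumptions made to keep the example short):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic two-class data with roughly 90% / 10% class proportions.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Recall on the minority class (label 1) is where the weighting should show up.
recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))

print(f"minority recall without weights: {recall_plain:.3f}")
print(f"minority recall with class_weight='balanced': {recall_balanced:.3f}")
```

Higher minority recall usually comes at the cost of some precision, so it is worth checking both on held-out data.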