confused between choosing linear or nonlinear regression to model this data - machine-learning

I plotted the data I wanted to make a model for and the result is as shown in the picture. I tried modeling it using a Sinc-function but i failed so
if anyone has an idea that would help. https://i.stack.imgur.com/QY17L.jpg

First, please note that while using linear regression for this problem may let you fit these datapoints, it's not going to provide any information really about any future data. It may fit your test data if all of it lies on this same curve. If you're looking for something to predict future price you might want to consider a time series model.
However, if you're just trying use a linear regression to fit this data to that curve you have to get slightly creative. If all your features are linear and you're using linear reg, then one way or another you'll end up with a linear answer, which won't fit this model. So you will need to make your own custom features from your data. You can probably get a pretty good approximation for this using a 10th degree polynomial. So your features could be X (Years since 1992), X^2 (Years since 1992, squared), X^3, X^4, X^5, X^6 ... X^10.
A variety of other classifiers would also work fine, but you'll probably need to use some sort of time series model (LSTM for example) to get anything you can generalize to predict the future.

Related

new features in dataset

I'm now in the middle of the semester and trying to understand the background of the algorithms and features.
I would like to understand some theory.
If I have a dataset with N samples.
each sample has 5 features for example.
I have done 3 kinds of classifications algorithms for example : SVM, decision tree and kMeans.
In all 3, I got nice results
In a mystery way, a new feature added to the dataset. The value of the features for every sample selected randomly.
I restarted the algorithms on the dataset ( with the new feature)
Are the classification results gonna change from the first results without the new feature? If yes, why are they gonna change and by how much ?
In addition, if I do not have the dataset how can I know how to recognize that new feature?
The results of your classification algorithm are going to either change or stay the same depending on how much information the model gains from the feature. If the feature for instance is random noise then it will have little to no effect on your model, other than slowing it down. If it contains useful information it might be able to increase parameters such as recall and precision. Hope this might help.

How do I create a feature vector if I don’t have all the data?

So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all this data- but for the actual thing I want to classify I might only have what kind of house it is, and a couple other things- not all the data ie.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using a SVM linear classifier- I don’t have any core to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused with feature and label.
You said that you want to predict whether a new item is "gasHeated", then "gasHeated" should be a label rather than a feature.
btw, one of the most-common ways to deal with missing value is to set it as "zero" (or some unused value, say -1). But normally, you should have missing value in both training data and testing data to make this trick be effective. If this only happened in your testing data but not in your training data, it means that your training data and testing data are not from the same distribution, which basically violated the basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think SVM is a good approach if you have missing values in your testing dataset. Just like I have mentioned above, although you can get two new predictions, but what if each one has different predictions? It is difficult to assign a probability to results of SVM in my opinion unless you use logistic regression or Naive Bayes. I would prefer Random Forest in this situation.

How to model multiple inputs to single output in classification?

Purpose:
I am trying to build a model to classify multiple inputs to a single output class, which is something like this:
{x_i1, x_i2, x_i3, ..., x_i16} (features) to y_i (class)
I am using a SVM to make the classification, but the 0/1-loss was bad (half of the data a misclassified), which leads me to the conclusion that the data might be non-linear. This is why I played around with polynomial basis function. I transformed each coefficient such that I get any combinations of polynomials up to degree 4, in the hope that my features are linear in the transformed space. my new transformed input looks like this:
{x_i1, ..., x_i16, x_i1^2, ..., x_i16^2, ... x_i1^4, ..., x_i16^4, x_i1^3, ..., x_i16^3, x_i1*x_i2, ...}
The loss was minimized but still not quite where I want to go. Since with the number of polynomial degree the chance of overfitting rises, i added regularization in order to counter balance that. I also added a forward greedy algorithm in order to pick up the coefficients which leads to minimal cross-validation error, but with no great improvement.
Question:
Is there a systematic way to figure out which transform leads to linear feature behaviour in the transformed space? Seems little odd to me that I have to try out every polynomial until it "fits". Are there perhaps better basis functions except polynomials? I understand that in low dimensional feature space, one can simply plot the data out and estimate the transform visually, but how can I do it in high dimensional space?
Maybe a little off topic but I also informed myself about PCA in order to throw away the components which doesnt provide much informations in the first place. Is this worth a try?
Thank you for your help.
Did you try other kernel functions such as RBF other than linear and polynomial? Since different dataset may have different characteristics, some kernel functions may work better than others do, especially in non-linear cases.
I don't know which tools you are using, but the following one also provides a guide for beginners on how to build SVM models:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
It is always a good idea to have a feature selection step first, especially for high-dimensional data. Those noisy or irrelevant features should be taken away, leading to a better performance and higher efficiency.

Beginners guide to troubleshooting badly performing models

Im creating my first predictive model and its results are absolutely awful.
Im in need of some help identifying how i troubleshoot this.
Im doing linear regression & logistic regression classification, to predict if a student will pass a course, 1 for yes, 0 for no.
The dataset is tiny, as we only have complete data for one class, 16 features just under 60 rows, 35 passed and 25 failed.
I'm wondering if my dataset is simply too small.
I dont want to share the dataset just yet, but will clean it up so its completely anonymous.
The ROC is very very jagged and mostly (for log regression), and predicts more false positives than anything else.
Id appreciate some general troubleshooting advice for a beginner that i can try before we hire in a professional.
Thanks for any help provided.
Id suggest some tips:
In Azure ML theres a module called "filter based feature selection", you can use it to score your features and check if there is really predictive power in them or even select just the ones with the highest score.
If you haven't ,splitt in train/cross validation set and evaluate your model in both and use it as a diagnosis to identify underfitting(high bias) or overfitting(high variance), and depending on the diagnosis perform actions like:
For overfitting: get more data, use less features, use a less complex model , add or increase regularization
For underfitting: add more features, use a more complex model, decrease regularization.
And don't forget ,before start training to explore and evaluate your data, use scatter plots to see if indeed its separable, perform feature engineering and preprocessing for this ask yourself: given this features, would a human expert be able to perform predictions?, if your answer is not, transform or drop features so that the answer is positive

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources