Let's say we have a dataset (in .csv format) for supervised machine learning. It has 60 data points (rows of data), and each data point has 100 variables.
Does it make sense to train machine learning models using all 100 variables from 60 data points? To me, it seems mathematically wrong. It is like solving a system of equations with 100 unknowns but only 60 equations.
In a dataset with n variables, what is the minimum number of data points needed to train a machine learning model?
Is there any statistical theory for this?
Thank you very much.
To answer your first question, you are right, it does not make sense to try to generalize a model with 100 features but only 60 examples.
The statistical reason is explained at length in "Statistical Learning Theory" by Vladimir Vapnik. I do not really suggest reading the whole book: it is large, heavy on math, and light on examples. The concept you need to know is the Vapnik-Chervonenkis dimension, usually called the VC dimension.
Long story short, when the number of dimensions is bigger than the number of training examples, what you get is not generalization but overfitting.
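The "100 unknowns, 60 equations" intuition can be checked numerically. The following is a minimal sketch under stated assumptions: the data are synthetic Gaussian noise, the "model" is a minimum-norm least-squares fit, and the helper names (`dot`, `solve`, `make_data`, `mse`) and the seed are made up for illustration. Because the labels are pure noise, there is nothing to learn, yet the fit reproduces the training labels almost exactly.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - dot(M[i][i + 1:n], x[i + 1:])) / M[i][i]
    return x

rng = random.Random(0)            # fixed seed, assumed for repeatability
n_rows, n_features = 60, 100      # sizes taken from the question

def make_data(n):
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # labels are pure noise
    return X, y

X_train, y_train = make_data(n_rows)
X_test, y_test = make_data(n_rows)

# Minimum-norm solution of X w = y: w = X^T a, where (X X^T) a = y.
gram = [[dot(r1, r2) for r2 in X_train] for r1 in X_train]
a = solve(gram, y_train)
w = [sum(a[i] * X_train[i][j] for i in range(n_rows)) for j in range(n_features)]

def mse(X, y):
    return sum((dot(row, w) - t) ** 2 for row, t in zip(X, y)) / len(y)

train_mse = mse(X_train, y_train)
test_mse = mse(X_test, y_test)
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

The training error is zero up to rounding, while the error on fresh data is no better than guessing. A perfect training fit here says nothing about generalization.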
I have hourly data on the number of minutes people spent online over 2 years. The values are therefore distributed between 0 and 60, and most of the data is either 0 or 60. My goal is to predict the number of minutes a person will spend online in the future (next hour/day/month etc.). What kind of approach or machine learning model can I use to predict this data? Can this be modelled as a regression/forecasting problem in spite of the skewness?
For time series data and its prediction, it's better to use a regression model than a classification or clustering model, because the task is to predict specific numeric values.
It can be modeled as a regression problem to some extent, but more skewness means moving further from the normal distribution, which can affect how well the model fits and lower the prediction accuracy. Data with significant skewness cannot be regarded as well-refined data, so you might need to rearrange the samples so that the skewness of the data decreases.
cross_val_score: what does it return? The score for the training set or the test set? I have a model with 5 folds. What does the cross_val_score value correspond to? Can someone explain in layman's terms?
Great question! Cross-validation splits your training dataset into 5 parts (aka folds). It then rotates which part of the dataset gets used for testing.
It's basically like this:
You have your training data. You have no clue what machine learning algorithm to use, but you have a hunch that it might be either deep learning or linear regression.
So you take your training data and you divide it up into 5 equal sections (randomize first though). You use 4 of those sections to train your deep learning model. Then you test it by comparing the answers it gives you to the answers you know to be true, from the 5th section (the test/validation section). You do this another 4 times, rotating which part gets to be the test part. Then you take the average score of all 5 times, and that is your cross-validation score.
You repeat the process for linear regression. Whichever algorithm gives you the best cross-validation score is the one you will pick, because it's the best for that problem.
Imagine you are picking between two cars: a Honda and a Toyota. You don't want to just test-drive a car once before buying it. It's a big decision. So for the Honda, you test-drive it 5 times and you average your experience over those 5 times. Same for the Toyota, you test-drive it 5 different times and average your experience so you can make an informed decision.
Cross-validation is basically taking a machine learning algorithm (or its hyperparameters, etc.) for a test-drive and seeing how it does.
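The rotation described above can be sketched in plain Python. This is a rough illustration, not scikit-learn's implementation: the toy data, the helper names (`k_fold_scores`, `fit_mean`, `fit_line`, `neg_mse`) and the seeds are assumptions. It also bears on the original question: scikit-learn's `cross_val_score` returns one score per fold, each computed on the held-out fold rather than on the training part, and leaves the averaging to you.

```python
import random
from statistics import mean

def k_fold_scores(X, y, fit, score, k=5, seed=0):
    """Return one held-out score per fold (k scores in total)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)        # randomize first
    folds = [idx[i::k] for i in range(k)]   # k roughly equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]                 # this fold is held out
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model,
                            [X[j] for j in test_idx],
                            [y[j] for j in test_idx]))
    return scores

# Two toy "algorithms" to compare (stand-ins for the two candidates).
def fit_mean(X, y):
    m = mean(y)
    return lambda x: m                      # always predicts the training mean

def fit_line(X, y):
    mx, my = mean(X), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(X, y))
             / sum((a - mx) ** 2 for a in X))
    return lambda x: my + slope * (x - mx)  # least-squares line

def neg_mse(model, X, y):
    # Negative mean squared error, so that higher is better.
    return -mean([(model(a) - b) ** 2 for a, b in zip(X, y)])

rng = random.Random(1)
X = [rng.uniform(0, 10) for _ in range(50)]
y = [2 * x + 1 + rng.gauss(0, 1) for x in X]   # synthetic linear data

line_scores = k_fold_scores(X, y, fit_line, neg_mse)
mean_scores = k_fold_scores(X, y, fit_mean, neg_mse)
print("line:", mean(line_scores), "mean-only:", mean(mean_scores))
```

Each call returns five numbers, one per test-drive. Averaging them gives the single cross-validation score used to compare the two candidates.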
I recently watched a video explaining that for deep learning, if you add more data, you don't need as much regularization, which sort of makes sense.
That said, does this statement hold for "normal" machine learning algorithms like Random Forest, for example? And if so, when searching for the best hyperparameters for the algorithm, in theory your input dataset (which of course gets further divided into cross-validation sets etc.) should be as much data as you have, and not just a sample of it. This of course means a much longer training time, as for every combination of hyperparameters you have X cross-validation sets which need to be trained, and so on.
So basically, is it fair to assume that the parameters found for a decently sized sample of your dataset are the "best" ones to use for the entire dataset, or isn't it?
Speaking from a statistician's point of view: it really depends on the quality of your estimator. If it's unbiased and low-variance, then a sample will be fine. If the variance is high, you'll want to use all the data you can.
I have a dataset of approx. 4800 rows with 22 attributes, all numerical, describing mostly the geometry of rock / minerals, and 3 different classes.
I ran a cross-validation with a k-NN model inside it, with k = 7 and Numerical Measure -> Canberra Distance as the parameters, and I got a performance of 82.53% and a kappa of 0.673. Is that result representative for the dataset? I mean, 82% is quite OK.
Before doing this, I evaluated the best subset of attributes with a decision table, which gave me 6 attributes.
The problem is, you still don't learn much from instance-based models like k-NN. Can I get any more insight from k-NN? I don't know how to visualize the clusters in that high-dimensional space in RapidMiner. Is that somehow possible?
I tried a decision tree on the data, but I got too many branches (300 or so) and it all looked too messy. The problem is that all the numerical attributes have about the same mean and distribution, so it's hard to get a distinct subset of meaningful attributes.
Ideally, the staff wants to "learn" something about the data, but my impression is that you cannot learn much that is meaningful from it. All that works well is "black box" models like neural nets, SVMs, and instance-based models.
How should I proceed?
Welcome to the world of machine learning! This sounds like a classic real-world case: we want to make firm conclusions, but the data rows don't cooperate. :-)
Your goal is vague: "learn something"? I'm taking this to mean that you're investigating, hoping to find quantitative discriminations among the three classes.
First of all, I highly recommend Principal Component Analysis (PCA): find out whether you can eliminate some of these attributes by automated matrix operations, rather than a hand-built decision table. I expect that the messy branches are due to unfortunate choice of factors; decision trees work very hard at over-fitting. :-)
How clean are the separations between the classes? Since you already used k-NN, I'm hopeful that you have dense clusters with gaps. If so, perhaps spectral clustering would help; these methods are good at classifying data based on gaps between the clusters, even if the cluster shapes aren't spherical. Interpretation depends on having someone on staff who can read eigenvectors and explain what the values mean.
Try a multi-class SVM. Start with 3 classes, but increase if necessary until your 3 expected classes appear. (Sometimes you get one tiny outlier class, and then two major ones get combined.) The resulting kernel functions and the placement of the gaps can teach you something about your data.
Try the Naive Bayes family, especially if you observe that the features come from a Gaussian or Bernoulli distribution.
As a holistic approach, try a neural net, but use something to visualize the neurons and weights. Letting the human visual cortex play with relationships can help extract subtle relationships.
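On the question of whether 82.53% accuracy with a kappa of 0.673 is representative, it may help to see what the two numbers measure. The sketch below assumes the reported kappa is Cohen's kappa and uses the usual definition of the Canberra distance; RapidMiner's exact implementations are not confirmed here, and the toy labels are made up.

```python
from collections import Counter

def canberra(a, b):
    """Canberra distance: sum of |x - y| / (|x| + |y|) over the attributes."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:                      # a 0/0 term contributes nothing
            total += abs(x - y) / denom
    return total

def cohen_kappa(actual, predicted):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(actual)
    p_observed = sum(a == p for a, p in zip(actual, predicted)) / n
    count_actual, count_pred = Counter(actual), Counter(predicted)
    # Agreement expected if predictions were independent of the true labels.
    p_chance = sum(count_actual[c] * count_pred[c] for c in count_actual) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Toy example: always predicting the majority class of an imbalanced set.
actual = ["quartz"] * 8 + ["feldspar"] * 2
predicted = ["quartz"] * 10
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
kappa = cohen_kappa(actual, predicted)
print(accuracy, kappa)
```

In this toy case the accuracy is 80% but kappa is 0, because the classifier does no better than chance given the class proportions. A kappa well above 0 therefore says more than the raw accuracy about whether the classes are really being separated.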
I am new to machine learning. My problem is to build a system that selects a university for a student according to his location and area of interest, i.e. it should select a university in the same city as the student's address. I am confused about which algorithm to choose. Can I use the perceptron algorithm for this task?
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, maybe it would be best to think about what data you would have to start with. The thing that comes to mind is a vector for each university, with one feature per subject/area. Then compute the distance from each of these to an ideal feature vector for the student, and pick the university that minimizes this distance.
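A minimal sketch of this distance idea follows. The subject list, the university names and their vectors are all hypothetical, and Euclidean distance is just one possible choice of metric.

```python
import math

# Hypothetical feature order: one entry per subject area.
SUBJECTS = ["computer science", "medicine", "law", "engineering"]

# Hypothetical universities: 1 if the subject is offered, 0 otherwise.
universities = {
    "University A": [1, 0, 0, 1],
    "University B": [0, 1, 1, 0],
    "University C": [1, 1, 0, 0],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_university(student_vector, universities):
    """Return the name whose subject vector is nearest to the student's ideal."""
    return min(universities,
               key=lambda name: euclidean(universities[name], student_vector))

# A student interested in computer science and engineering.
student = [1, 0, 0, 1]
best = closest_university(student, universities)
print(best)
```

Note that nothing is trained here: this is a nearest-match lookup, which fits the remark that the task may not need learning at all.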
The first and foremost thing you need is a labeled dataset.
It sounds like the problem could be framed as an ML problem; however, you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions you can select an algorithm that best fits the features of your data.
I would suggest using decision trees for this problem, as they resemble a set of if-else rules. You can just take the location and area of interest of the student as the conditions of if and else-if statements and then suggest a university for him. Since it's a direct mapping of inputs to outputs, a rule-based solution would work and there is no learning required here.
Maybe you can use a "recommender system" or a clustering approach. You can look more deeply into techniques like "collaborative filtering" (recommender systems) or k-means (clustering), but again, as some people said, first you need data to learn from, and maybe your problem can be solved without ML.
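The rule-based version can be as small as a filter over a table. This is only a sketch: the records, city names and the `suggest` helper are invented for illustration.

```python
# Hypothetical catalogue of universities.
UNIVERSITIES = [
    {"name": "University A", "city": "Springfield",
     "areas": {"computer science", "engineering"}},
    {"name": "University B", "city": "Springfield",
     "areas": {"medicine", "law"}},
    {"name": "University C", "city": "Riverside",
     "areas": {"computer science", "medicine"}},
]

def suggest(city, interest, universities=UNIVERSITIES):
    """Return universities in the student's city that offer the interest."""
    return [u["name"] for u in universities
            if u["city"].lower() == city.lower()
            and interest.lower() in u["areas"]]

matches = suggest("Springfield", "Computer Science")
print(matches)
```

An empty list simply means no university in that city offers the area, which the calling code would have to handle, for example by widening the search.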
Well, there is no straightforward and sure-shot answer to this question. The answer depends on many factors like the problem statement and the kind of output you want, type and size of the data, the available computational time, number of features, and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts a response value for a given observation, which is close to the true response value for that observation. A highly interpretable algorithm (restrictive models like Linear Regression) means that one can easily understand how any individual predictor is associated with the response while the flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and linear and logistic regression are easy to implement and quick to run. Algorithms like SVMs, which involve parameter tuning, neural networks, which can take long to converge, and random forests need a lot of time to train on the data.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite well.
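The effect of the linearity assumption can be seen with a tiny linear classifier. The sketch below uses a perceptron, swapped in here for brevity in place of logistic regression or an SVM; the `train_perceptron` helper, the learning rate and the epoch limit are assumptions for illustration. It is trained on AND, which a straight line can separate, and on XOR, which no straight line can.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Train on 2-D inputs with 0/1 labels; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), label in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            update = lr * (label - pred)    # zero when the prediction is right
            if update:
                w[0] += update * x1
                w[1] += update * x2
                b += update
                errors += 1
        if errors == 0:                     # a full pass with no mistakes
            return w, b, True
    return w, b, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND)[2]
xor_converged = train_perceptron(XOR)[2]
print("AND separable by a line:", and_converged)
print("XOR separable by a line:", xor_converged)
```

The perceptron settles on AND within a few passes but never completes an error-free pass on XOR, whatever the epoch limit, because no single line classifies all four XOR points correctly.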
Number of features
The dataset may have a large number of features, not all of which are relevant and significant. For certain types of data, such as genetic or textual data, the number of features can be very large compared to the number of data points.