Sorry if my question sounds too naive... I am really new to machine learning and regression.
I have recently joined a machine learning lab as a master's student. My professor wants me to write the "experiments and analysis" section of a paper the lab is about to submit, about a regression algorithm that they have developed.
The problem is that I don't know what I have to do. He said the algorithm is stable and complete, that they have written the first part of the paper, and that I need to write the evaluation part.
I really don't know what to do. I participated in coding the algorithm and I understand it pretty well, but I don't know which tasks I must carry out in order to evaluate and analyze its performance.
- Where do I get data?
- What is the testing process?
- What analyses need to be done?
I am new to research and paper writing and really don't know what to do.
I have read a lot of papers recently, but I have no experience in analyzing ML algorithms.
Could you please guide me and explain the process (at a newbie level)?
Detailed answers are appreciated.
Thanks
You will need a test dataset to evaluate the performance. If you
don't have one, divide the dataset you are currently running the
algorithm on into a training set and a cross-validation set
(non-overlapping).
Create the test set by stripping the target values (y values) out of
the cross-validation set and keeping them aside.
Run the algorithm on the training set to train the model.
Once your model is trained, test its performance using the stripped
'test set'.
To evaluate the performance, you can use the RMSE (Root Mean Squared
Error) metric. You will need the predictions that your algorithm
made for each sample in the test set and the corresponding actual
values (the ones you stripped off earlier to build the test set).
You can find more information here.
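The workflow described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, `fit_line` is a stand-in least-squares model (not the lab's algorithm), and the helper names are made up for this example.

```python
import math
import random


def train_test_split(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and split them into two non-overlapping sets."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(xs) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([xs[i] for i in train_idx], [ys[i] for i in train_idx],
            [xs[i] for i in test_idx], [ys[i] for i in test_idx])


def fit_line(xs, ys):
    """Stand-in model (an assumption): least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error between actual values and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic data (assumption): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]

# 1. Split, 2. train on the training part only, 3. predict on the held-out part.
x_train, y_train, x_test, y_test = train_test_split(xs, ys)
slope, intercept = fit_line(x_train, y_train)
predictions = [slope * x + intercept for x in x_test]

# 4. Compare the predictions with the actual values that were set aside.
test_rmse = rmse(y_test, predictions)
print(f"Test RMSE: {test_rmse:.3f}")
```

To evaluate your own algorithm, replace `fit_line` and the prediction line with your training and prediction calls; the splitting and the RMSE computation stay the same.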
Machine learning model evaluation
Take a look at this paper. It has been written for people without a computer science background, so it should be fairly easy to follow. It covers:
model evaluation workflow
holdout validation
cross-validation
k-fold cross-validation
stratified k-fold cross-validation
leave-one-out cross-validation
leave-p-out cross-validation
leave-one-group-out cross-validation
nested cross-validation
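As a rough illustration of k-fold cross-validation from the list above, here is a minimal sketch using only the standard library. The synthetic data, the `fit_line` stand-in model, and all helper names are assumptions made for this example, not part of the linked paper.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint groups of indices
    for i in range(k):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, folds[i]


def fit_line(xs, ys):
    """Stand-in model (an assumption): least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error between actual values and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validate(xs, ys, k=5):
    """Train on k-1 folds, score on the remaining fold, and repeat for every fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        preds = [slope * xs[i] + intercept for i in val_idx]
        scores.append(rmse([ys[i] for i in val_idx], preds))
    return scores


# Synthetic data (assumption): y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]

scores = cross_validate(xs, ys, k=5)
mean_rmse = sum(scores) / len(scores)
print("RMSE per fold:", [round(s, 3) for s in scores])
print("Mean RMSE:", round(mean_rmse, 3))
```

Setting `k` equal to the number of samples turns the same loop into leave-one-out cross-validation, since every fold then holds exactly one sample.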
Related
I have a dataset of 300 respondents (hours studied vs. grade). I load the dataset in Excel, run the data analysis add-in, and run a linear regression. I get my results.
So the question is: am I doing statistical analysis or am I doing machine learning? I know the question may seem simple, but I think it deserves some debate.
Maybe your question is better suited for Data Science, as it is not related to app/program development. Running formulas in Excel through an add-on is not really considered anywhere close to "programming".
Statistical analysis is when you compute statistical metrics of your data, like the mean, standard deviation, confidence interval, p-value...
Supervised machine learning is when you try to classify or predict something. For these problems you use features as input to the model in order to assign a class or predict a value.
In this case you are doing machine learning, because you use the hours-studied feature to predict the student's grade.
In the proper context, you're actually doing Statistical Analysis... (Which is part of Machine Learn
What are the fundamental criteria for choosing between supervised and unsupervised learning?
When is one better than the other?
Are there specific cases where you can only use one of them?
Thanks
1. If you have a labeled dataset you can use both. If you have no labels you can only use unsupervised learning.
2. It's not a question of "better". It's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
3. See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
Depends on your needs. If you have a set of existing data that includes the target values you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false, or does this data represent a fish, a cat or a dog?). Simply put, you already have examples of right answers and you are just telling the algorithm what to predict. You also need to distinguish whether you need classification or regression. Classification is when you need to assign the predicted values to given classes (e.g. is this person likely to develop diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23, etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge...).
On the contrary, use unsupervised learning if you don't have the labels (or target values). You're simply trying to identify the clusters in the data as they come, e.g. with k-means, DBSCAN or spectral clustering.
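The difference can be illustrated with a toy example. The sketch below is only an illustration: the one-dimensional points, the "cat"/"dog" labels, and both helper functions are assumptions made up here. The first function needs the labels (supervised); the second never sees them (unsupervised).

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: return the label of the closest labeled training point."""
    distances = [abs(p - query) for p in train_points]
    return train_labels[distances.index(min(distances))]


def two_means(points, iterations=10):
    """Unsupervised: group 1-D points into two clusters (k-means with k=2)."""
    centers = [min(points), max(points)]  # simple initialisation
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

# With labels: predict the class of a new, unseen point.
predicted = nearest_neighbor_predict(points, labels, 4.5)
print("predicted label:", predicted)

# Without labels: the algorithm only reports which points belong together.
centers, clusters = two_means(points)
print("cluster centers:", centers)
print("clusters:", clusters)
```

Note that the clustering output contains no class names: it tells you that there are two groups, but naming them is still up to you.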
So it depends and there's no exact answer but generally speaking you need to:
Collect and look at your data. You need to know your data, and only then decide which way to go or which algorithm will best suit your needs.
Train your algorithm. Be sure to have clean, good data, and bear in mind that in unsupervised learning there are no target values to train against, so you fit the algorithm to the data and inspect the result right away.
Test your algorithm. Run it and see how well it behaves. In supervised learning you can hold out part of your labeled data to evaluate how well your algorithm is doing.
There are many books online about machine learning and many online lectures on the topic as well.
Depends on the data set that you have.
If you have a target feature available, then you should go for supervised learning. If you don't, then it is an unsupervised problem.
Supervised is like teaching the model with examples. Unsupervised learning is mainly used to group similar data, it plays a major role in feature engineering.
Thank you..
I have used an extreme learning machine for classification and found that my classification accuracy is only around 70%. This led me to try an ensemble method: I create several classification models, and the test data is classified according to the majority of the models' predictions. However, this method only increases classification accuracy by a small margin. May I ask which other methods can be used to improve the classification accuracy on a two-dimensional, linearly inseparable dataset?
Your question is very broad... There's no way to help you properly without knowing the real problem you are working on. But, generally speaking, some methods to improve classification accuracy are:
1 - Cross validation: Split your training dataset into groups, always hold one group out for prediction, and change the held-out group in each run. Then you will know which data trains a more accurate model.
2 - Cross dataset: The same as cross validation, but using different datasets.
3 - Tuning your model: This basically means changing the parameters you're using to train your classification model (I don't know which classification algorithm you're using, so it's hard to help more).
4 - Improve the normalization process (or introduce it, if you're not using it yet): Find out which preprocessing techniques (changing the geometry, colors, etc.) give you cleaner, more consistent data to train on.
5 - Understand the problem you're working on better... Try implementing other methods to solve the same problem. There's almost always more than one way to solve a problem, and you may not be using the best approach.
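For reference, the majority-vote ensemble described in the question can be sketched as follows. The three prediction lists are made up for illustration and the function names are hypothetical; in practice each list would come from one trained classifier.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by taking the most common label.

    With an even number of models, ties go to the label that appears first
    among the votes for that sample.
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions (assumption) for five test samples.
y_true = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]
model_b = [1, 1, 1, 1, 0]
model_c = [0, 0, 1, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
for name, preds in [("a", model_a), ("b", model_b), ("c", model_c),
                    ("ensemble", ensemble)]:
    print(name, accuracy(y_true, preds))
```

In this toy case the three models err on different samples, so the vote corrects every error. When the models tend to make the same mistakes, voting changes little, which may be one reason the gain reported in the question was small.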
Enhancing a model's performance can be challenging at times. I’m sure a lot of you would agree with me if you’ve found yourself stuck in a similar situation. You try all the strategies and algorithms that you’ve learnt, yet you fail to improve the accuracy of your model. You feel helpless and stuck. And this is where many data scientists give up. Let’s dig deeper now and check out some proven ways to improve the accuracy of a model:
Add more data
Treat missing and Outlier values
Feature Engineering
Feature Selection
Multiple algorithms
Algorithm Tuning
Ensemble methods
Cross Validation
If you feel this information is lacking, you can learn more from this link, which will hopefully help: https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Sorry if the information I gave is less than satisfactory.
So this question may seem a little stupid but I couldn't wrap my head around it.
What is the purpose of test data? Is it only to calculate accuracy of the classifier? I'm using Naive Bayes for sentiment analysis of tweets. Once I train my classifier using training data, I use test data just to calculate accuracy of the classifier. How can I use the test data to improve classifier's performance?
In general supervised machine learning, the test data set plays a critical role in determining how well your model is performing. You typically build a model with, say, 90% of your input data, leaving 10% aside for testing. You then check the accuracy of that model by seeing how well it does against the 10% test set. The performance of the model against the test data is meaningful because the model has never "seen" this data. If the model is statistically valid, then it should perform well on both the training and test data sets. This general procedure is called holdout validation (repeating it over several different splits is cross validation), and you can read more about it here.
You don't -- like you surmise, the test data is used for testing, and mustn't be used for anything else, lest you skew your accuracy measurements. This is an important cornerstone of any machine learning -- you only fool yourself if you use your test data for training.
If you are considering desperate measures like that, the proper way forward is usually to re-examine your problem space and the solution you have. Does it adequately model the problem you are trying to solve? If not, can you devise a better model which captures the essence of the problem?
Machine learning is not a silver bullet. It will not solve your problem for you. Too many failed experiments prove over and over again, "garbage in -- garbage out".
For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a way to measure this?
It is not easy to know how many samples you need to collect. However, you can follow these steps.
For solving a typical ML problem:
1. Build a dataset with a few samples. How many? It will depend on the kind of problem you have; don't spend a lot of time on it now.
2. Split your dataset into training, cross-validation and test sets, and build your model.
3. Now that you've built the ML model, you need to evaluate how good it is. Calculate your test error.
4. If your test error is worse than you expected, collect new data and repeat steps 1-3 until you hit a test error rate you are comfortable with.
This method will work if your model is not suffering from "high bias".
This video from Coursera's Machine Learning course explains it.
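The collect-train-evaluate loop above can be sketched as follows. Everything here is an assumption for illustration: `make_samples` stands in for real data collection, the model is a simple nearest-centroid classifier, and the 10% target error and batch size of 50 are arbitrary.

```python
import random


def make_samples(n, rng):
    """Hypothetical data source: two noisy 1-D classes centred at 0 and 3."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        samples.append((rng.gauss(3.0 * label, 1.0), label))
    return samples


def train_centroids(samples):
    """Stand-in classifier: remember the mean feature value of each class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in samples if y == label]
        centroids[label] = sum(values) / len(values) if values else 0.0
    return centroids


def error_rate(centroids, samples):
    """Fraction of samples whose nearest centroid has the wrong label."""
    wrong = sum(1 for x, y in samples
                if min(centroids, key=lambda c: abs(x - centroids[c])) != y)
    return wrong / len(samples)


rng = random.Random(0)
test_set = make_samples(500, rng)  # held out, never used for training
train_set = []
target_error = 0.10                # the error rate you are comfortable with
history = []

while len(train_set) < 2000:       # hard cap so the loop always ends
    train_set += make_samples(50, rng)  # "collect new data"
    model = train_centroids(train_set)
    err = error_rate(model, test_set)
    history.append((len(train_set), err))
    if err <= target_error:
        break

for size, err in history:
    print(f"{size} training samples -> test error {err:.3f}")
```

If the error in `history` stops improving while still above the target, collecting more data is unlikely to help, which is the "high bias" caveat mentioned above.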
Unfortunately, there is no simple method for this.
The rule of thumb is the bigger, the better, but in practical use, you have to gather the sufficient amount of data. By sufficient I mean covering as big part of modeled space as you consider acceptable.
Also, amount is not everything. The quality of the samples is very important too; for example, training samples should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and then train a classifier. Then, if the classifier quality is not acceptable, I gather more data, and so on.
Here is a piece of research on estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see
https://medium.com/#malay.haldar/how-much-training-data-do-you-need-da8ec091e956
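As a quick worked example of this rule of thumb, the helper below (a hypothetical function written for this illustration) multiplies the number of model parameters by 10. It is a starting estimate only, not a guarantee.

```python
def rule_of_ten_samples(n_parameters, factor=10):
    """Rough starting estimate: about `factor` training instances per model parameter."""
    return factor * n_parameters


# Logistic regression with N features, counted as N parameters as in the answer above.
for n_features in (5, 20, 100):
    print(n_features, "features ->", rule_of_ten_samples(n_features), "training instances")
```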