I have a dataset of 300 respondents (hours studied vs. grade). I load the dataset into Excel, run the Data Analysis add-in, and fit a linear regression. I get my results.
So the question is: am I doing Statistical Analysis or am I doing Machine Learning? I know the question may seem simple, but I think we should get some debate out of this.
Maybe your question is better suited for Data Science, as it is not related to app/program development. Running formulas in Excel through an add-in is not really considered anywhere close to "programming".
Statistical Analysis is when you compute statistical measures of your data, like the mean, standard deviation, confidence interval, p-value...
Supervised Machine Learning is when you try to classify or predict something. For these problems you use features as input to the model in order to assign a class or predict a value.
In this case you are doing machine learning, because you use the hours-studied feature to predict the student's grade.
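To make the contrast concrete, here is a minimal sketch in Python (with made-up toy numbers, since I don't have your 300 rows): the same fitted line read as statistical analysis (coefficients, p-values, confidence intervals via statsmodels) and as machine learning (predicting an unseen case via scikit-learn).

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])        # toy hours studied
    grades = np.array([52.0, 58.0, 64.0, 71.0, 75.0, 83.0])  # toy grades

    # Statistical-analysis view: fit OLS and inspect the estimates.
    X_sm = sm.add_constant(hours)            # add an intercept column
    ols = sm.OLS(grades, X_sm).fit()
    print(ols.summary())                     # coefficients, p-values, CIs, R^2

    # Machine-learning view: fit the same line, then predict an unseen case.
    model = LinearRegression().fit(hours.reshape(-1, 1), grades)
    print(model.predict(np.array([[7.0]])))  # predicted grade for 7 hours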
In the proper context, you're actually doing Statistical Analysis... (which is part of Machine Learning).
Related
I have been reading many articles on Machine Learning and Data Mining over the past few weeks: articles on the differences between ML and DM, their similarities, and so on. But I still have one question, and it may look like a silly question:
How do we determine when to use ML algorithms and when to use DM?
I have performed some DM practicals using Weka on time series analysis (future population prediction, sales prediction), text mining using R/Python, etc. The same can be done using ML algorithms too, like future population prediction using linear regression.
So how do I determine whether, for a given problem, ML or DM is more suitable?
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is the split between supervised learning and unsupervised methods.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.
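As a rough sketch of that rule of thumb in code (Python/scikit-learn, made-up data; the supervised/unsupervised framing follows the distinction above):

    import numpy as np
    from sklearn.linear_model import LinearRegression  # supervised: target known
    from sklearn.cluster import KMeans                 # exploratory: no target

    X = np.random.rand(100, 2)                                   # toy features
    y = X @ np.array([2.0, -1.0]) + 0.1 * np.random.randn(100)   # toy target

    # "ML" in the sense above: we have training data for the target function.
    reg = LinearRegression().fit(X, y)

    # "DM" in the sense above: explore the data's structure without any target.
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)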
What are the fundamental criteria for using supervised or unsupervised learning?
When is one better than the other?
Are there specific cases where you can only use one of them?
Thanks
If you have a labeled dataset you can use both. If you have no labels, you can only use unsupervised learning.
It's not a question of "better". It's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
Depends on your needs. If you have a set of existing data including the target values that you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false; or does this data represent a fish, a cat, or a dog? Simply put, you already have examples of right answers and you are just telling the algorithm what to predict). You also need to distinguish whether you need classification or regression. Classification is when you need to categorize the predicted values into given classes (e.g. is it likely that this person develops diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23, etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge regression, ...).
On the contrary, use unsupervised learning if you don't have the labels (or target values). You're simply trying to identify the clusters in the data as they come (e.g. k-means, DBSCAN, spectral clustering, ...).
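To make the classification vs. regression split concrete, a small sketch (Python/scikit-learn, made-up data; discrete target for the classifier, continuous target for the regressor):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier  # predicts discrete classes
    from sklearn.linear_model import Ridge              # predicts continuous values

    X = np.random.rand(200, 3)                      # toy features

    # Classification: discrete target (e.g. diabetes yes/no, encoded as 1/0).
    y_class = (X[:, 0] + X[:, 1] > 1.0).astype(int)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)
    print(clf.predict(X[:2]))                       # class labels, e.g. [0 1]

    # Regression: continuous target (e.g. 4.56, 12.99, ...).
    y_reg = 10 * X[:, 0] + 3 * X[:, 2]
    reg = Ridge(alpha=1.0).fit(X, y_reg)
    print(reg.predict(X[:2]))                       # real-valued predictions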
So it depends and there's no exact answer, but generally speaking you need to:
Collect and inspect your data. You need to know your data, and only then decide which approach to take or which algorithm will best suit your needs.
Train your algorithm. Be sure to have clean, good data, and bear in mind that in the case of unsupervised learning you can skip this step, as you don't have the target values; you test your algorithm right away.
Test your algorithm. Run it and see how well it behaves. In the case of supervised learning you can hold out some of the labeled data to evaluate how well your algorithm is doing (see the sketch below).
There are many books online about machine learning and many online lectures on the topic as well.
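A minimal sketch of that collect/train/test loop for the supervised case (Python/scikit-learn, toy data):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    X = np.random.rand(300, 2)                # 1. collect and inspect your data
    y = (X[:, 0] > X[:, 1]).astype(int)       #    toy binary target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)  # hold out data for testing

    model = GaussianNB().fit(X_train, y_train)              # 2. train
    print(accuracy_score(y_test, model.predict(X_test)))    # 3. test / evaluate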
Depends on the data set that you have.
If you have a target feature in hand, then you should go for supervised learning. If you don't, then it is an unsupervised problem.
Supervised learning is like teaching the model with examples. Unsupervised learning is mainly used to group similar data; it plays a major role in feature engineering.
Thank you.
I have been using an SVM for training and testing one-dimensional data (15,000 sample points for training, 7,500 sample points for testing) and it has produced satisfactory results so far. But to improve on the results, I am thinking of using Deep Learning for the same task. Will it be able to improve the results? What should I study for a quick implementation of Deep Learning algorithms? I am new to the DL field but want a quick implementation, if it is justifiable at all.
In machine learning applications it is hard to say if an algorithm will improve the results or not because the results really depend on the data. There is no best algorithm. You should follow the steps given below:
Analyze your data
Apply the appropriate algorithms with the help of your machine learning background
Evaluate the results
There are many machine learning libraries for different programming languages, e.g. Weka for Java and scikit-learn for Python. The implementations may have specific names other than abstract names like "Deep Learning". Thus, search for the implementation you are looking for in the library you are using.
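For instance, one quick way to try a neural network, assuming you work in Python, is scikit-learn's MLPClassifier. This is only a sketch on toy stand-in data, not a full deep learning setup; dedicated frameworks (e.g. TensorFlow, PyTorch) give more control if the simple version looks promising:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Toy stand-in for the one-dimensional data described above.
    X = np.random.randn(22500, 1)
    y = (X[:, 0] > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=15000, test_size=7500, random_state=0)

    # A small feed-forward network; widen/deepen hidden_layer_sizes to taste.
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
    mlp.fit(X_train, y_train)
    print(mlp.score(X_test, y_test))   # mean accuracy on the held-out data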
Sorry if my question sounds too naive... I am really new to machine learning and regression.
I have recently joined a machine learning lab as a master's student. My professor wants me to write the "experiments and analysis" section of a paper the lab is about to submit about a regression algorithm they have developed.
The problem is I don't know what I have to do. He said the algorithm is stable and complete, they have written the first part of the paper, and I need to write the evaluation part.
I really don't know what to do. I have participated in coding the algorithm and I understand it pretty well, but I don't know what tasks I must undertake in order to evaluate and analyze its performance.
- Where do I get data?
- What is the testing process?
- What analyses need to be done?
I am new to research and paper writing and really don't know what to do.
I have read a lot of papers recently, but I have no experience in analyzing ML algorithms.
Could you please guide me and explain the process (at newbie level)?
Detailed answers are appreciated.
Thanks.
You will need a test dataset to evaluate the performance. If you don't have one, divide your training dataset (the one you're currently running this algorithm on) into a training set and a cross-validation set (non-overlapping).
Create the test set by stripping out the predictions (y values) from the cross-validation set.
Run the algorithm with the training dataset to train the model.
Once your model is trained, test its performance using the stripped 'test set'.
To evaluate the performance, you can use the RMSE (Root Mean Squared Error) metric. You will need the predictions that your algorithm made for each sample in the test set and the corresponding actual values (those you stripped off earlier to create the test set). You can find more information here.
Machine learning model evaluation
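A minimal sketch of that RMSE step (Python/scikit-learn; y_true and y_pred stand for the stripped-off actual values and your algorithm's predictions, shown here with made-up numbers):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # y_true: the actual target values stripped off to build the test set.
    # y_pred: the predictions your algorithm made for those same samples.
    y_true = np.array([3.0, 5.5, 7.2, 4.1])
    y_pred = np.array([2.8, 5.9, 6.8, 4.4])

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # Root Mean Squared Error
    print(rmse)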
Take a look at this paper. It has been written for people without a computer science background, so it should be fairly easy to follow. It covers:
model evaluation workflow
holdout validation
cross-validation
k-fold cross-validation
stratified k-fold cross-validation
leave-one-out cross-validation
leave-p-out cross-validation
leave-one-group-out cross-validation
nested cross-validation
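As a quick illustration of plain k-fold cross-validation from that list (Python/scikit-learn, toy data; cross_val_score handles the splitting and scoring):

    import numpy as np
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LinearRegression

    X = np.random.rand(100, 4)                  # toy features
    y = X @ np.array([1.0, 2.0, 0.0, -1.0])     # toy continuous target

    # 5-fold CV: train on 4 folds, validate on the 5th, rotate through all folds.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv)
    print(scores.mean(), scores.std())          # average score across folds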
I have two dependent continuous variables and I want to use their combined values to predict the value of a third, binary variable. How do I go about discretizing/categorizing the values? I am not looking for clustering algorithms; I'm specifically interested in obtaining 'meaningful' discrete categories I can subsequently use in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!
That is the essence of machine learning, and one of the most studied problems.
Least-squares regression, logistic regression, SVMs, and random forests are widely used for this type of problem, which is called binary classification.
If your goal is to pragmatically classify your data, several libraries are available, like scikit-learn in Python and Weka in Java. They have great documentation.
But if you want to understand the internals of machine learning, just search (here or on Google) for machine learning resources.
If you wanted to be a real nerd, you could generate a bunch of different possible discretizations, train a classifier on each, then characterize the discretizations by features, run a classifier on that, and see what sort of discretizations are best!?
In general, discretizing stuff is more of an art, and comes down to having a good understanding of what the input variable ranges mean.
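One concrete starting point, as a sketch (Python/scikit-learn, toy data; quantile binning and the bin count of 4 are assumptions you'd tune in that 'art' sense):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.naive_bayes import CategoricalNB

    # Toy stand-in for the two dependent continuous variables.
    X = np.random.randn(500, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy binary target

    # Equal-frequency (quantile) bins; 'ordinal' yields integer bin indices.
    disc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
    X_binned = disc.fit_transform(X).astype(int)

    # A Bayesian classifier over the resulting discrete categories.
    clf = CategoricalNB().fit(X_binned, y)
    print(clf.predict(X_binned[:5]))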