Recently I started working on a learning-to-rank algorithm that involves feature extraction as well as ranking. The well-known learning-to-rank datasets I found on the Microsoft Research website contain only query IDs and features already extracted from the documents. Can someone suggest a good learning-to-rank dataset that has query-document pairs in their original form, with good relevance judgments?
Alex Rogozhnikov keeps track of a few datasets that you can use for learning to rank; check his blog post.
You can also use the DBLP dataset, which has been used for a learning-to-rank task; see this paper: https://arxiv.org/pdf/1501.05132.pdf
I have a dataset of 300 respondents (hours studied vs. grade). I load the dataset into Excel, run the Data Analysis add-in, and run a linear regression. I get my results.
So the question is: am I doing statistical analysis or am I doing machine learning? I know the question may seem simple, but I think we can get a good discussion out of it.
Maybe your question is better suited for Data Science, as it is not related to app/program development. Running formulas in Excel through an add-in is not really considered anywhere close to "programming".
Statistical analysis is when you compute statistical measures of your data, like the mean, standard deviation, confidence interval, p-value, and so on.
Supervised machine learning is when you try to classify or predict something. For these problems you use features as input to a model in order to predict a class label or a value.
In this case you are doing machine learning, because you use the hours-studied feature to predict the student's grade.
In the proper context, you're actually doing statistical analysis... (which is part of machine learning).
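For what it's worth, the same exercise can be reproduced outside Excel. Here is a minimal sketch in Python with scikit-learn; the synthetic data and variable names are illustrative stand-ins, not the actual 300-respondent dataset. It shows that the identical fit can be read either as a statistical summary (inspect the coefficients) or as a predictive model (score new inputs):

```python
# Hedged sketch: hours-studied vs. grade regression, as in the question above.
# The data here is synthetic; "hours" and "grade" are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=(300, 1))               # hours studied (300 "respondents")
grade = 50 + 4 * hours[:, 0] + rng.normal(0, 5, 300)    # noisy linear relationship

model = LinearRegression().fit(hours, grade)

# "Statistical analysis" view: inspect the fitted coefficients.
print("intercept:", model.intercept_, "slope:", model.coef_[0])

# "Machine learning" view: use the fitted model to predict an unseen input.
print("predicted grade for 6 study hours:", model.predict([[6.0]])[0])
```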
I have been reading many articles on machine learning and data mining over the past few weeks: articles on the differences between ML and DM, their similarities, and so on. But I still have one question, and it may look like a silly one:
How do I determine when to use ML algorithms and when to use DM?
I have done some practical DM work using Weka on time series analysis (future population prediction, sales prediction) and text mining using R/Python. The same can be done with ML algorithms, e.g. future population prediction using linear regression.
So how do I determine, for a given problem, whether ML or DM is better suited?
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is unsupervised methods vs. supervised learning.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.
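A minimal illustration of that split, using scikit-learn on a toy dataset (nothing here is specific to any particular "ML" or "DM" tool; it just contrasts the two settings):

```python
# Hedged sketch of supervised learning (labels available) vs. unsupervised
# exploration (no labels used), on the same data.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: a target function (the labels y) exists, so train a classifier.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Exploratory: ignore the labels and look for structure in the data.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```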
I have been using an SVM to train and test one-dimensional data (15,000 sample points for training, 7,500 for testing) and it has produced satisfactory results so far. To improve on these results, I am thinking of using deep learning for the same task. Will it improve the results? What should I study for a quick implementation of deep learning algorithms? I am new to the DL field but want a quick implementation, if it is justifiable at all.
In machine learning applications it is hard to say whether an algorithm will improve the results or not, because the results really depend on the data. There is no single best algorithm. You should follow the steps below:
Analyze your data
Apply the appropriate algorithms with the help of your machine learning background
Evaluate the results
There are many machine learning libraries for different programming languages, e.g. Weka for Java and scikit-learn for Python. The implementations may have more specific names than umbrella terms like "deep learning", so search for the implementation you need in the library you are using.
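As one possible way to follow those steps, here is a hedged sketch that compares an SVM baseline with a small neural network on the same split and lets the evaluation decide. The synthetic one-dimensional data is a stand-in for the 15,000/7,500 points in the question, and MLPClassifier is only a lightweight proxy for a full deep learning setup:

```python
# Hedged comparison: SVM baseline vs. a small neural network on synthetic 1-D data.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(22500, 1))                                    # one-dimensional feature
y = (np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, 22500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=7500, random_state=0)

svm = SVC(kernel="rbf").fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0).fit(X_train, y_train)

print("SVM test accuracy:", svm.score(X_test, y_test))
print("MLP test accuracy:", mlp.score(X_test, y_test))
```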
Many machine learning competitions are held on Kaggle, where a training set (with features and labels) and a test set are given, and the output labels for the test set have to be predicted using the training set.
It is pretty clear that supervised learning algorithms like decision trees, SVMs, etc. are applicable here. My question is: how should I start approaching such problems? Should I start with a decision tree, an SVM, or some other algorithm? In other words, how do I decide?
So, I had never heard of Kaggle until reading your post--thank you so much, it looks awesome. Upon exploring the site, I found a part that will guide you well. On the competitions page (click "all competitions"), you will see Digit Recognizer and Facial Keypoints Detection, both of which are competitions but are there for educational purposes, with tutorials provided (a tutorial isn't available for Facial Keypoints Detection yet, as that competition is in its infancy). In addition to the general forums, each competition has its own forum, which I imagine is very helpful.
If you're interested in the mathematical foundations of machine learning, and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a Python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheat sheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience, using several algorithms at the same time can often give better results, e.g. logistic regression and SVM, where the results of each one have a predefined weight. And test, test, test ;)
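A minimal sketch of that weighted-combination idea, using scikit-learn's VotingClassifier to blend logistic regression and an SVM (the weights 2 and 1 are arbitrary placeholders, not recommended values):

```python
# Hedged sketch: combine logistic regression and SVM with predefined weights.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC(kernel="rbf", probability=True)),  # probability=True enables soft voting
    ],
    voting="soft",      # average predicted probabilities...
    weights=[2, 1],     # ...with a predefined weight per model
)
ensemble.fit(X_train, y_train)
print("ensemble test accuracy:", ensemble.score(X_test, y_test))
```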
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.
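A hedged sketch of that PCA step, feeding the reduced attributes into a learner that is slow with many features; the dataset and the choice of 20 components are illustrative only:

```python
# Hedged sketch: shrink 200 attributes to 20 principal components before training.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=200, n_informative=20, random_state=0)

model = make_pipeline(
    PCA(n_components=20),   # keep 20 components instead of 200 raw attributes
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
print("CV accuracy with PCA:", cross_val_score(model, X, y, cv=3).mean())
```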
Many models in machine learning have hyperparameters. What is the best practice for finding those hyperparameters using held-out data? Or how do you do it?
Grid search and manual search are the most widely used techniques for optimizing the hyperparameters of machine learning algorithms. However, a paper by James Bergstra and Yoshua Bengio argues that random search is better than grid and manual search for hyperparameter optimization. For more information about random (as well as grid and manual) search, see their paper:
Random Search for Hyper-Parameter Optimization
Recently, I submitted a paper (now accepted) to Pattern Recognition Letters. For that paper I used the random search technique.
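A minimal sketch of random search in scikit-learn, scoring candidate hyperparameters on held-out folds via RandomizedSearchCV. The model, parameter ranges, and budget of 20 samples are placeholders, not values from the paper:

```python
# Hedged sketch: random search over SVM hyperparameters with cross-validation.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_distributions = {
    "C": loguniform(1e-2, 1e2),        # sample C log-uniformly
    "gamma": loguniform(1e-4, 1e0),    # sample gamma log-uniformly
}
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=20,       # 20 random configurations instead of a full grid
    cv=5,            # each configuration scored on held-out folds
    random_state=0,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```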