I’m reading towards M.Sc. in Computer Science and just completed first year of the source. (This is a two year course). Soon I have to submit a proposal for the M.Sc. Project. I have selected following topic.
“Suitability of machine learning for document ranking in information retrieval system”. Researchers have been using various machine learning algorithms for ranking documents. So as the first phase of the project I will be doing a complete literature survey and finding out advantages/disadvantages of current approaches. In the second phase of the project I will be proposing a new (modified) algorithm in order to overcome the limitations of current approaches.
Actually my question is whether this type of project is suitable as a M.Sc. project? Moreover if somebody has some interesting idea in information retrieval filed, is it possible to share those ideas with me.
Thanks
Ranking is always the hardest part of any of Information Retrieval systems. I think it is a very good topic but you have to take care to -- as soon as possible -- to define a scope of the work. Probably you will not be able to develop a new IR engine but rather build a prototype based on, e.g., apache lucene.
Currently there is a lot of dataset including stackoverflow data dump, which provide you all information you need to define a rich feature vector (number of points, time, you can mine topics of previous question etc., popularity of a tag) for you machine learning ranking algorithm. In this part of the work you could, e.g., classify types of features (e.g., user specific, semantic feature - software name in the title) and perform series of experiments to learn which features are most important and which are not for a given dataset.
The second direction of such a project can be how to perform learning efficiently. The reason behind is the quantity of data within web or community forums and changes in the forum (this would be important if you take a community specific features), e.g., changes in technologies, new software release, etc.
There are many other topics related to search and machine learning. The best idea is to search on scholar.google.com for the recent survey papers on ranking, machine learning, and search to learn what is the state-of-the-art. The very next step would be to talk with your MSc supervisor.
Good luck!
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-value, confidence interval).
If you do that and convince people that your algorithm is useful you surely will not fail :)
Related
I am looking for solutions where I can automatically approve or disapprove different supplier invoices based on historical data.
Let's say, I got an invoice from an HP laptop supplier and based on the previous data, I have to approve or reject that invoice.
Basically, I want to make a decision or prediction based on the data already available based on the history with artificial intelligence, machine learning or any other cloud service
This isn't a direct question though but you can start by looking into various methods of classifications. There is a huge amount of material available online. Try reading about K-Nearest Neighbors, Naive Bayes, K-means, etc. to get an idea about how algorithms in Machine Learning domain work. Once you start understanding what is written in the documentation then start implementing them. You will face a lot of problems which you can search online and I'm sure you will find most of them answered here in this portal.
I have developed a ML model for a classification (0/1) NLP task and deployed it in production environment. The prediction of the model is displayed to users, and the users have the option to give a feedback (if the prediction was right/wrong).
How can I continuously incorporate this feedback in my model ? From a UX stand point you dont want a user to correct/teach the system more than twice/thrice for a specific input, system shld learn fast i.e. so the feedback shld be incorporated "fast". (Google priority inbox does this in a seamless way)
How does one build this "feedback loop" using which my system can improve ? I have searched a lot on net but could not find relevant material. any pointers will be of great help.
Pls dont say retrain the model from scratch by including new data points. Thats surely not how google and facebook build their smart systems
To further explain my question - think of google's spam detector or their priority inbox or their recent feature of "smart replies". Its a well known fact that they have the ability to learn / incorporate (fast) user feed.
All the while when it incorporates the user feedback fast (i.e. user has to teach the system correct output atmost 2-3 times per data point and the system start to give correct output for that data point) AND it also ensure it maintains old learnings and does not start to give wrong outputs on older data points (where it was giving right output earlier) while incorporating the learning from new data point.
I have not found any blog/literature/discussion w.r.t how to build such systems - An intelligent system that explains in detaieedback loop" in ML systems
Hope my question is little more clear now.
Update: Some related questions I found are:
Does the SVM in sklearn support incremental (online) learning?
https://datascience.stackexchange.com/questions/1073/libraries-for-online-machine-learning
http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/
https://en.wikipedia.org/wiki/Concept_drift
Update: I still dont have a concrete answer but such a recipe does exists. Read the section "Learning from the feedback" in the following blog Machine Learning != Learning Machine. In this Jean talks about "adding a feedback ingestion loop to machine". Same in here, here, here4.
There could be couple of ways to do this:
1) You can incorporate the feedback that you get from the user to only train the last layer of your model, keeping the weights of all other layers intact. Intuitively, for example, in case of CNN this means you are extracting the features using your model but slightly adjusting the classifier to account for the peculiarities of your specific user.
2) Another way could be to have a global model ( which was trained on your large training set) and a simple logistic regression which is user specific. For final predictions, you can combine the results of the two predictions. See this paper by google on how they do it for their priority inbox.
Build a simple, light model(s) that can be updated per feedback. Online Machine learning gives a number of candidates for this
Most good online classifiers are linear. In which case we can have a couple of them and achieve non-linearity by combining them via a small shallow neural net
https://stats.stackexchange.com/questions/126546/nonlinear-dynamic-online-classification-looking-for-an-algorithm
I want to program a robot which will sense obstacles and learn whether to cross over them or bypass around them.
Since my project, must be realized in week and a half period, I must use an online learning algorithm (GA or such would take a lot time to test because robot needs to try to cross over the obstacle in order to determine is it possible to cross).
I'm really new to online learning so I don't really know which online learning algorithm to use.
It would be a great help if someone could recommend me a few algorithms that would be the best for my problem and some link with examples wouldn't hurt.
Thanks!
I think you could start with A* (A-Star)
It's simple and robust, and widely used.
There are some nice tutorials on the web like this http://www.raywenderlich.com/4946/introduction-to-a-pathfinding
Online algorithm is just the one that can collect new data and update a model incrementally without re-training with full dataset (i.e. it may be used in online service that works all the time). What you are probably looking for is reinforcement learning.
RL itself is not a method, but rather general approach to the problem. Many concrete methods may be used with it. Neural networks have been proved to do well in this field (useful course). See, for example, this paper.
However, to create real robot being able to bypass obstacles you will need much then just knowing about neural networks. You will need to set up sensors carefully, preprocess data from them, work out your model and collect a dataset. Not sure it's possible to even learn it all in a week and a half.
I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
Go through the HTML source of pages
from a certain site and "understand"
which sections form the content,
which the advertisements and which
form the metadata ( neither the
content, nor the ads - for eg. -
TOC, author bio etc )
Go through the HTML source of pages
from disparate sites and "classify"
whether the site belongs to a
predefined category or not ( list of
categories will be supplied
beforhand )1.
... similar classification tasks on
text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot of training and maintainance than putting SVMs to use?
I understand that SVMs are well suited to ( binary ) classification tasks like mine, and open source framworks like libSVM are fairly mature?
In that case, what subjects and topics
does a computer science graduate need
to learn right now, so that the above
requirements can be solved, putting
these frameworks to use?
I would like to stay away from Java, is possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations from you on learning specific portions of statistics and probability theory is nothing unexpected from my side, so say that if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learn is the equivalent of having a model. The model can be for example a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods work best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will like to preprocess the data first, this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge and you'll have to extract useful information, or otherwise your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with features for each row) from which you'll be able to create your model.
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you will use at the end to decide on what is the best method. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint on the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models seem to appear good, but this is just memorization.
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik wrote "The Nature of Statistical Learning Theory", he is the inventor of SVMs. You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are now entering machine learning field, so I'd really like to suggest to have a look at this book: not only it provides a deep and vast overview on the most common machine learning approaches and algorithms (and their variations) but it also provides a very good set of exercises and scientific paper links. All of this is wrapped in an insightful language starred with a minimal and yet useful compendium about statistics and probability
People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
Of the terms you mentioned, "Machine Learning" is the one most used by Academic Departments to describe their Curricula, their academic departments, and their research programs, as well as the term most used in academic journals and conferences proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications, often are, but that's not a formal requirement. In addition, the term Data Mining seems usually to refer to application of some process flow on big data (i.e, > 2BG) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval. But it's only one source of tools. But IR doesn't depend on ML--for instance, a particular IR project might be storage and rapid retrieval of the fully-indexed data responsive to a user's search query IR, the crux of which is optimizing performance of the data flow, i.e., the round-trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps) which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc. on the variables (columns).
Lastly consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the 'Netflix Prize' were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately access the algorithm's performance in the actual commercial setting--so for instance overall execution speed (how quickly are the recommendations delivered to the user) would probably be considered along with accuracy.
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning are techniques to generalize existing knowledge to new data, as accurate as possible.
Data mining is primarly about discovering something hidden in your data, that you did not know before, as "new" as possible.
They intersect and often use techniques of one another. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques, for example a pattern in the data set that is useful for generalization might be a new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used
for decision making by people.
Machine learning is about learning a model to classify new objects.