I have been doing research on feature selection and I'm failing to understand the difference between these two approaches.
According to most authors in the literature, feature selection algorithms fall into three categories. The first two, filter and wrapper, are easy to understand and there is general agreement on them. For the last category, however, there seems to be some disagreement. Some authors, such as H. Liu, call the last category hybrid. In contrast, V. Kumar calls it embedded. In addition, there are cases where authors define four categories, including both embedded and hybrid algorithms, as P. Abinaya does.
Authors describe hybrid algorithms as the combination of a filter algorithm and a wrapper approach. The main idea behind these algorithms is to use a filter approach to reduce the search space for a wrapper approach.
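As I understand it, in scikit-learn terms that would look something like the sketch below (my own illustration, not taken from any of the cited authors): a cheap filter prunes the features, then a wrapper searches only within what survived.

```python
# Sketch of a "hybrid" selector: filter first, wrapper second.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter step: keep the 15 features with the highest chi-squared score.
filter_step = SelectKBest(chi2, k=15)
X_reduced = filter_step.fit_transform(X, y)

# Wrapper step: recursive feature elimination with cross-validation,
# searching only within the 15 pre-filtered features.
wrapper = RFECV(LogisticRegression(max_iter=5000), cv=5)
wrapper.fit(X_reduced, y)

print("Features kept by the wrapper:", wrapper.n_features_)
```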
On the other hand, the definition of embedded algorithms in the literature varies considerably depending on the source. Some use almost the same definition as for hybrid algorithms, as the Wikipedia page does. Others give more abstract definitions, such as: methods that perform feature selection during the learning of optimal parameters, and methods that incorporate knowledge about the specific structure of the class of functions used by a certain learning machine.
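If I take the "selection during learning of optimal parameters" definition literally, I picture something like L1 regularization, where fitting the model itself drives some coefficients exactly to zero. Again, this is only my own sketch and may be exactly what I am misunderstanding:

```python
# Sketch of an "embedded" selector: the L1 penalty applied while fitting the
# model itself decides which features survive (non-zero coefficients).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)

coef = model[-1].coef_.ravel()
print("Selected feature indices:", np.flatnonzero(coef))
```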
So I would appreciate it if anyone could explain the difference between these two approaches, or give a less abstract definition of embedded methods.
Thanks.
The goal of my project is to correctly match medications. I have a large catalog at my disposal for this purpose. However, the medications do not appear there in exactly the same spelling: sometimes additional information was added, and sometimes parts of the prescription were abbreviated.
I was already able to implement a possible algorithm using the Levenshtein distance (token_set_ratio).
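For illustration, the kind of matching I mean is roughly the sketch below (a simplified example with made-up catalog entries, using rapidfuzz's token_set_ratio; my actual code differs):

```python
# Simplified sketch: fuzzy-match a prescription string against a catalog,
# scoring candidates with token_set_ratio. Catalog entries are made up.
from rapidfuzz import fuzz, process

catalog = [
    "Ibuprofen 400 mg film-coated tablets",
    "Paracetamol 500 mg tablets",
    "Amoxicillin 250 mg/5 ml oral suspension",
]

prescription = "Ibuprofen 400mg (take after meals)"

best_match, score, index = process.extractOne(
    prescription, catalog, scorer=fuzz.token_set_ratio
)
print(best_match, score)
```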
Because of the sometimes long additional information, this algorithm matches the wrong medications, so I wanted to ask whether there are better algorithms for comparing strings. For example, does it make sense to use machine learning algorithms or NLP techniques? This is a relatively new area for me. I would appreciate any ideas or inspiration.
This sounds like a classic deduplication task. For example, have a look at dedupe. This tool lets you annotate training examples and learns when two items refer to the same thing. It can be used with as few as 10 training samples and has an active-learning approach implemented.
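A rough outline of that workflow is sketched below; note that the field/variable declaration syntax differs between dedupe versions (this follows the 2.x style), so treat it as a shape of the process rather than copy-paste code:

```python
# Rough outline of active-learning deduplication with the dedupe library.
# NOTE: the field declaration syntax differs between dedupe versions;
# this follows the 2.x style and may need adjusting for your install.
import dedupe

# Records keyed by an id; here only a single "name" field is used.
data = {
    0: {"name": "Ibuprofen 400 mg film-coated tablets"},
    1: {"name": "Ibuprofen 400mg (take after meals)"},
    2: {"name": "Paracetamol 500 mg tablets"},
}

fields = [{"field": "name", "type": "String"}]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Interactive labelling: you are shown pairs and answer yes/no/unsure.
dedupe.console_label(deduper)

deduper.train()
clusters = deduper.partition(data, threshold=0.5)
print(clusters)
```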
I used machine learning to classify depression-related sentences, and it was LinearSVC that performed best. In addition to LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. What I want is to be able to reason in advance about which model will fit, like the ml_map provided by scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try working with different example datasets of different data types, using different algorithms. There are hundreds to explore. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
And here are my thoughts; I used to ask questions like this, and I hope they help if you are struggling: the more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum up, in order to achieve good out-of-sample performance (i.e., good generalization) on a problem, you have to look at the training/testing process with different setting combinations and be mindful of your data (for example, answer this question: does it cover most samples in terms of the distribution in the wild, or just a portion of it?).
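As a concrete starting point for your three candidates, a cross-validated bake-off like the sketch below is a cheap way to let the data decide; the map only narrows the field, the experiment picks the winner. (This uses scikit-learn with a built-in dataset standing in for your depression-related sentences.)

```python
# Compare three common text classifiers with cross-validation and report
# mean accuracy. The 20 newsgroups data stands in for the real sentences.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])

models = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, data.data, data.target, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```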
I am looking for an approach in NLP where I can generate a concept tree from a set of keywords.
Here is the scenario: I have extracted a set of keywords from a research paper. Now I want to arrange these keywords in the form of a tree where the most general keyword comes at the top. The next level of the tree will have keywords that are important for understanding the upper-level concept and are more specific than the upper-level keywords, and the tree will keep growing in the same way.
I know there are resources that can help me solve this problem, like the Wikipedia dataset and WordNet, but I do not know how to proceed with them.
My preferred programming language is Python. Do you know any Python library or package that can generate this?
I am also very interested in seeing a machine learning approach to this problem.
I would really appreciate any kind of help.
One way of looking at the problem is: given a set of documents, identify the topics in them and also the dependencies between those topics.
So, for example, if you have some research papers as input (a large set of documents), the output would be which topics the papers cover and how those topics are related in a hierarchy/tree. One research area that tries to tackle this is hierarchical topic modeling; you can read more about it here and here.
But if you are just looking at creating a tree out of a bunch of keywords (that are somehow obtained) and no other information is available, then it needs knowledge of real world relationships and can perhaps be a rule-based system where we define Math --> Algebra and so on.
There is no way for a system to understand that algebra comes under math other than by looking at a large number of documents and inferring that relationship (see the first suggestion), or by manually mapping that relationship (perhaps in a rule-based system). That is how even humans learn those relationships.
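If you do want to try the WordNet route you mentioned, its hypernym ("is-a") paths already encode exactly that kind of manually curated real-world knowledge. A minimal NLTK sketch (the keywords are made up, and taking the first synset per word is a big simplification, since real keywords are often ambiguous):

```python
# Sketch: use WordNet hypernym paths to place keywords on a
# general-to-specific chain, from the WordNet root down to the keyword.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

keywords = ["algebra", "geometry", "calculus"]

for word in keywords:
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        continue
    # First path from the WordNet root down to this concept.
    path = synsets[0].hypernym_paths()[0]
    print(" -> ".join(s.name().split(".")[0] for s in path))
```

Merging the shared prefixes of those paths across all keywords would give you a rough concept tree, which you could then prune by hand.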
I want to teach myself enough machine learning so that, to begin with, I understand enough to put available open-source ML frameworks to use, allowing me to do things like:
- Go through the HTML source of pages from a certain site and "understand" which sections form the content, which are the advertisements, and which form the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.)
- Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand)
- ... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and that open-source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (I do not know enough to decide which, though), and I should be able to fix things should they go wrong.
Recommendations to learn specific portions of statistics and probability theory would not be unexpected on my side, so do say so if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learn is the equivalent of having a model. The model can be for example a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods work best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML pages, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information from the page you're looking at. This is a difficult step, because it requires domain knowledge and you have to extract useful information, otherwise your classifiers will not be able to make good distinctions. Feature extraction gives you a dataset (a matrix with one row of features per page) from which you can create your model.
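Just to make "feature extraction" less abstract, here is a toy sketch of my own (the features are illustrative only, using BeautifulSoup) that turns one page into a row of numbers; the real work is deciding which features actually separate content from ads:

```python
# Toy feature extraction: turn one HTML page into a row of numeric features.
# The chosen features are illustrative, not a recommendation.
from bs4 import BeautifulSoup

def page_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)
    links = soup.find_all("a")
    return {
        "n_chars": len(text),
        "n_links": len(links),
        "n_images": len(soup.find_all("img")),
        "link_text_ratio": sum(len(a.get_text()) for a in links) / max(len(text), 1),
    }

html = "<html><body><p>Some article text.</p><a href='#'>Buy now!</a></body></html>"
print(page_features(html))
```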
Generally in machine learning it is advisable to also keep a "test set" that you do not train your models with, but that you use at the end to decide which method is best. It is extremely important that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint about the "generalization error" your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with; machine learners say that the model "overfits" the training data. Such overfitted models appear to do well, but this is just memorization.
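In practice that usually just means splitting the data once, up front, and not touching the held-out part until the very end. A short sketch (scikit-learn, with synthetic data standing in for your extracted features):

```python
# Hold out a test set up front and evaluate on it only once, at the end.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LinearSVC().fit(X_train, y_train)   # train and tune on the training data only
print("generalization estimate:", model.score(X_test, y_test))  # look at this once
```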
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned, Weka is a good free tool for applying different methods once you have your dataset. I would also recommend reading a few books. Vladimir Vapnik, the inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of this terminology helps you find your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now, Naive Bayes, decision trees (J48 in particular, I think), and SVM approaches give the best results. However, each is better suited to slightly different applications. Off the top of my head I'm not sure which would suit you best. With a tool like WEKA you could try all three approaches on some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
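The practical face of that difference, in scikit-learn for instance, is that a Naive Bayes model exposes class probabilities while a linear SVM only exposes a signed distance to the decision boundary. A small sketch (synthetic data, GaussianNB standing in for whichever Naive Bayes variant fits your features):

```python
# Naive Bayes yields per-class probabilities; a linear SVM yields only a
# signed distance to the separating hyperplane (a margin, not a probability).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

nb = GaussianNB().fit(X, y)
svm = LinearSVC().fit(X, y)

print(nb.predict_proba(X[:3]))       # probabilities per class
print(svm.decision_function(X[:3]))  # raw margins, not probabilities
```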
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are just now entering the machine learning field, so I'd really suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), it also provides a very good set of exercises and links to scientific papers. All of this is presented in clear, insightful language, together with a minimal yet useful compendium of statistics and probability.
I'm mainly just looking for a discussion of approaches for going from decentralized, non-normalized, completely open user-submitted tags to making sense of it all by combining them into the semantic groups they called "clusters".
Does it take actual people to figure out what people actually mean by the tags used, or can it be done simply by automatically analyzing how often the tags go together?
That kind of stuff. Feel free to elaborate wildly :) (Also, if this has been discussed elsewhere, I'd love to hear about it).
Read this article: Automated Tag Clustering. It provides a good overview of the existing approaches and describes the algorithms for tag clustering.
Algorithms of the Intelligent Web (Manning) (esp. Chapter 4) and a book with a similar title from O'Reilly cover clustering algorithms. The Manning book starts with naive SQL approaches and moves to K-means, ROCK, and DBSCAN. It's more generalized than just focusing on tags, but easy to apply in that context. Code is presented in Java but is easily adapted to Ruby (sometimes more easily than adapting the Java code to your problem).
Chapter 5 covers classification, which is about building topologies, and discusses Bayesian algorithms.
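To make the co-occurrence idea from the question concrete, here is a small sketch of the purely automatic route: build a tag co-occurrence matrix and cluster it, here with K-means from scikit-learn (the tag lists are made up, and the algorithm and cluster count are arbitrary choices; DBSCAN would slot in the same way).

```python
# Sketch: cluster tags by how often they co-occur on the same items.
# Tag lists are invented; K-means and n_clusters=2 are arbitrary choices.
import numpy as np
from sklearn.cluster import KMeans

tagged_items = [
    ["python", "web", "django"],
    ["python", "numpy", "scipy"],
    ["web", "css", "html"],
    ["numpy", "scipy", "math"],
    ["django", "web", "html"],
]

tags = sorted({t for item in tagged_items for t in item})
index = {t: i for i, t in enumerate(tags)}

# Co-occurrence matrix: cooc[i, j] = how often tags i and j share an item.
cooc = np.zeros((len(tags), len(tags)))
for item in tagged_items:
    for a in item:
        for b in item:
            if a != b:
                cooc[index[a], index[b]] += 1

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cooc)
for cluster in range(2):
    print(cluster, [t for t, l in zip(tags, labels) if l == cluster])
```

Whether those automatic clusters match what people "actually mean" by the tags is exactly the open question; in practice a human usually still names and sanity-checks the groups.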