ML - Instance-based learning [closed]

I am quite new to machine learning and I've been reading a book where the author describes instance-based learning as follows:
Possibly the most trivial form of learning is simply to learn by heart. If you were to create a spam filter
this way, it would just flag all emails that are identical to emails that have already been flagged by users
— not the worst solution, but certainly not the best.
Instead of just flagging emails that are identical to known spam emails, your spam filter could be
programmed to also flag emails that are very similar to known spam emails. This requires a measure of
similarity between two emails. A (very basic) similarity measure between two emails could be to count
the number of words they have in common. The system would flag an email as spam if it has many words
in common with a known spam email.
This is called instance-based learning: the system learns the examples by heart, then generalizes to new
cases using a similarity measure.
But I couldn't understand it completely, as he used the words "similar" and "identical" and I didn't understand the difference. Any explanation would be appreciated. Thank you.

Identical literally means identical - there are zero differences and it is an exact match.
The strings "aaaaa" and "aaaaa" are identical. There is no other string that can ever exist that is also identical to "aaaaa" other than itself.
Similar, again, is being used in the literal sense. "aaaaa" and "aaaab" are not identical; they differ by one character. But they are similar in that they share 4 out of 5 characters. There are many, many possible strings that are similar to "aaaaa".
Naively looking at the number of different characters in a string is one way to define similarity.
The trick to all instance-based learning is answering the question: how do we explicitly define "similar" for this application? Every application would likely benefit from a different measure of similarity; some common ones do exist and get re-used frequently, but that doesn't mean they are optimal.
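As a minimal sketch of the word-overlap measure the book describes (a Jaccard-style ratio; the emails and the threshold here are invented for illustration):

    # Hedged sketch: Jaccard-style word overlap between two emails.
    def word_overlap(email_a, email_b):
        words_a = set(email_a.lower().split())
        words_b = set(email_b.lower().split())
        if not words_a or not words_b:
            return 0.0
        # Fraction of shared words out of all distinct words used.
        return len(words_a & words_b) / len(words_a | words_b)

    known_spam = ["win a free prize now", "free money claim your prize"]
    incoming = "claim your free prize now"

    # Identical match: overlap == 1.0. Similar match: overlap above a cut-off.
    THRESHOLD = 0.4  # arbitrary, chosen only for this illustration
    flagged = any(word_overlap(incoming, s) >= THRESHOLD for s in known_spam)
    print(flagged)  # True: the email shares many words with known spam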

Related

Generating ques-answer pairs from unstructured text

I have to create a system that generates all possible question-answer pairs from unstructured text in a specific domain. Many questions may have the same answer, but the system should generate all possible types of questions that an answer can have. The questions formed should be meaningful and grammatically correct.
For this purpose, I used NLTK and trained an NER model, creating entities according to my domain, and then I created some rules to identify the question word using a combination of NER-identified entities and POS-tagged words. But this approach isn't working well, as I am not able to create meaningful questions from the text. Moreover, some question words are wrongly identified and some question words are missed. I also read research papers on using RNNs for this purpose, but I don't have large training data since the domain is pretty small. Can anyone suggest a better approach?
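To make the rule-based idea concrete, here is a minimal sketch assuming NLTK's stock NER chunker; the sentence, the entity-to-question-word mapping, and the template are invented for illustration:

    # Hedged sketch: map NER entity types to question words (rules invented).
    # Assumes the nltk data packages punkt, averaged_perceptron_tagger,
    # maxent_ne_chunker, and words have been downloaded via nltk.download().
    import nltk

    QUESTION_WORD = {"PERSON": "Who", "GPE": "Where", "ORGANIZATION": "What"}

    sentence = "Marie Curie discovered radium in Paris."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

    for subtree in tree.subtrees():
        if subtree.label() in QUESTION_WORD:
            entity = " ".join(token for token, tag in subtree.leaves())
            # Naive template: swap the entity for its question word.
            question = sentence.replace(entity, QUESTION_WORD[subtree.label()])
            print(question.rstrip(".") + "?", "->", entity)

Such templates break down quickly on complex sentences, which matches the problems described above.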

Categorize customer questions based on content

I'm working on a web app where users can ask questions. These questions should be categorized by some criteria based on the question content, title, user data, region, and so on. Next, these questions should be processed in some way: for some, additional information requests should be sent; others should be deleted or marked as spam; and some should be sent directly to a specialist.
The problem is that users can't choose the right category themselves: it's a pretty complex thing, and users can cheat.
Are there any approaches to doing that automatically? For now, a few people do this job, filtering the questions. Perhaps some ready-made solutions exist.
This is a really complex task. You should take a look at supervised machine learning classification algorithms. You can try something similar to a spam-filtering algorithm (https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering):
Gather some number of previously categorized questions (labeled examples).
Gather some number of words (a vocabulary) used for question classification (to identify the group).
Process the question text, removing "stop words" and replacing words with their stems.
Map the question text, title, user data, and so on to numbers (a question vector).
Use some algorithm like SVM to create and apply a classifier (model).
But this is a very general approach you can look at; a minimal sketch of these steps follows below. It's hard to say anything more specific without additional details. I don't think you can find a ready-made solution, as it's a pretty specific task. But of course you can use a lot of machine-learning frameworks.
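A minimal sketch of those steps, assuming scikit-learn and NLTK are available; the categories and the example questions are invented for illustration:

    # Hedged sketch of the pipeline above: stem, vectorize, train an SVM.
    from nltk.stem import SnowballStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    stemmer = SnowballStemmer("english")

    def preprocess(text):
        # Step 3: lowercase and replace words with their stems.
        return " ".join(stemmer.stem(word) for word in text.lower().split())

    # Step 1: previously categorized questions (labels are invented).
    texts = ["how do i reset my password",
             "i was charged twice for my subscription",
             "cheap pills buy now limited offer",
             "where is my order it has not arrived"]
    labels = ["account", "billing", "spam", "shipping"]

    # Steps 2 and 4: build a vocabulary and map each question to a vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform([preprocess(t) for t in texts])

    # Step 5: train an SVM classifier on the question vectors.
    classifier = LinearSVC()
    classifier.fit(X, labels)

    new_question = vectorizer.transform([preprocess("please refund the double charge")])
    print(classifier.predict(new_question))  # likely ['billing'] via the shared 'charg' stem

In practice you would need far more labeled examples per category than this toy set.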

Best Features for Term Level Clustering [closed]

At the moment, I am working on a project related to mining Twitter data. The aim of the project is to find themes that can be used to represent a set of tweets. To help us find the themes, we came up with the idea of doing term-level clustering. The terms are some important concepts that were already extracted using some text-mining tools.
Well, my main question is: what are the best features to define term similarity? In this project, due to an insufficient amount of data, I am doing unsupervised learning, namely clustering using the k-means algorithm.
I do have some extracted features. As I understand it, one way to know the (approximate) semantic meaning of a term is by looking at the context in which the term is mentioned. Therefore, what I have at the moment are the preceding and following word and POS tag of each term. For instance:
I drink a cup of XYZ
She had a spoon of ABC yesterday.
By seeing the preceding word and POS - cup/NN and of/IN for XYZ, and spoon/NN and of/IN for ABC - I knew that XYZ and ABC might be a liquid material or component. Well, it sounds very naive, and in fact I don't get good clusters. In addition to the previous features, I have some named entity types that I considered as features, for instance entity types like Person, Location, Problem (in medical), MEDTERM, etc.
So, what are the common features for term-level clustering? Any comments and suggestions would be appreciated. I am open to any guidance, such as papers, links, etc. Thanks.
EDIT: In addition to those features, I've extracted the head noun of each term and considered it as one of my features. I am thinking of using the head noun in the case of multi-word terms.
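For concreteness, a minimal sketch of clustering terms by such context features, assuming scikit-learn; the terms, contexts, and cluster count are invented for illustration:

    # Hedged sketch: one-hot encode context features, then cluster with k-means.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction import DictVectorizer

    # One context per term: preceding/following word and POS, as described above.
    contexts = {
        "XYZ": {"prev_word": "of", "prev_pos": "IN", "next_word": "</s>", "next_pos": "."},
        "ABC": {"prev_word": "of", "prev_pos": "IN", "next_word": "yesterday", "next_pos": "NN"},
        "Paris": {"prev_word": "in", "prev_pos": "IN", "next_word": "</s>", "next_pos": "."},
    }

    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(contexts.values())  # sparse one-hot vectors

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(contexts, labels)))  # terms with similar contexts share a label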
Well, let me see if I understood correctly what you need. You have already extracted/found the terms you want as the centres of your clusters, and now you want to find all terms which are similar to them, so they get grouped into the proper cluster?
In general you need to define a similarity measure (a distance), and here is the main point: what do you want that similarity distance to measure or determine? If you are looking for term-to-term similarity, just at the level of letters, then you can try things like the Levenshtein distance, for example. But if what you want to find are contextually similar terms, which may be written in very different ways but could mean the same thing, that's different from Levenshtein and pretty much harder to do.
What is important to keep in mind is that you need a measure of similarity to find the similar terms. I see that some of what you call features are named entity types; note that k-means normally deals badly with non-continuous data.
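A minimal sketch of the letter-level option mentioned above, the classic dynamic-programming Levenshtein distance:

    # Edit distance between two strings (insertions, deletions, substitutions).
    def levenshtein(a, b):
        # prev[j] holds the distance between the processed prefix of a and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # delete ca
                                curr[j - 1] + 1,      # insert cb
                                prev[j - 1] + cost))  # substitute ca -> cb
            prev = curr
        return prev[-1]

    print(levenshtein("aaaaa", "aaaab"))  # 1: the strings differ by one character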

Publicly Available Spam Filter Training Set [closed]

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).
I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.
Here is what I was looking for: http://untroubled.org/spam/
This archive has around a gigabyte of compressed accumulated spam messages dating from 1998 to 2011. Now I just need to get non-spam email, so I'll query my own Gmail for that using the getmail program and the tutorial at mattcutts.com.
Sure, there's Spambase, which, as far as I'm aware, is the most widely cited spam data set in the machine learning literature.
I have used this data set many times; each time I am impressed by how much effort has been put into the formatting and documentation of this data set.
A few characteristics of the Spambase set:
4601 data points, all complete
each comprised of 58 features (attributes)
each data point is labelled 'spam' or 'no spam'
approx. 40% are labeled spam
all of the features are continuous (vs. discrete)
a representative feature: the average length of an uninterrupted sequence of capital letters
Spambase is archived in the UCI Machine Learning Repository; it's also available on the website for the excellent ML/statistical computation treatise, The Elements of Statistical Learning by Hastie et al.
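As a hedged sketch of putting Spambase to work for the naive Bayes project described above (the UCI URL and the label-in-last-column layout are assumptions to verify against the repository's documentation):

    # Load Spambase from UCI and train a naive Bayes classifier (scikit-learn).
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
           "spambase/spambase.data")  # assumed layout: 57 features + 0/1 label

    data = pd.read_csv(URL, header=None)        # 4601 rows x 58 columns
    X, y = data.iloc[:, :-1], data.iloc[:, -1]  # features, spam label

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # GaussianNB suits Spambase since all of its features are continuous.
    model = GaussianNB().fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))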
SpamAssassin has a public corpus of both spam and non-spam messages, although it hasn't been updated in a few years. Read the readme.html file to learn what's there.
You might consider taking a look at the TREC spam/ham corpus (which I think is the collection of emails from Enron that was made public from the court case). TREC generally runs a bunch of competitive text processing tasks, so it might give you some references for comparison.
The downside is that they're stored in raw mbox format, though there are parsers available in many languages (Apache Tika is a good example).
The webpage isn't TREC, but this seems to be a good overview of the task with links to the data: http://plg.uwaterloo.ca/~gvcormac/spam/
A more modern spam training set can be found at Kaggle. Moreover, you can test the accuracy of your classifier on their website by uploading your results.
I also have an answer: here you can find a daily refreshed Bayesian database for initial training, and also a daily created archive containing captured spam. You will find instructions on how to use it on the site.

How does Collective Intelligence beat Experts' view? [closed]

I am interested in doing some Collective Intelligence programming, but I wonder: how can it work?
It is said to be able to give accurate predictions: the O'Reilly book Programming Collective Intelligence, for example, says that a collection of traders' actions can actually predict future prices (such as corn prices) better than an expert can.
Now we also know from statistics class that if there is a room of 40 students taking an exam, there will be 3 to 5 students who get an "A" grade, perhaps 8 who get a "B", 17 who get a "C", and so on. That is, basically, a bell curve.
So from these two standpoints, how can a collection of "B" and "C" answers give a better prediction than the answer that got an "A"?
Note that the corn price, for example, is the accurate price factoring in weather, demand from food companies using corn, etc., rather than a "self-fulfilling prophecy" (more people buy corn futures, the price goes up, and more people buy the futures again). It is actually predicting supply and demand accurately enough to give an accurate future price.
How is it possible?
Update: can we say Collective Intelligence won't work in stock market euphoria and panic?
The Wisdom of Crowds wiki page offers a good explanation.
In short, you don't always get good answers. A few conditions need to hold for it to occur.
Well, you might want to think of the following "model" for a guess:
guess = right answer + error
If we ask a lot of people a question, we'll get lots of different guesses. But if, for some reason, the distribution of errors is symmetric around zero (actually it just has to have zero mean) then the average of the guesses will be a pretty good predictor of the right answer.
Note that the guesses don't necessarily have to be good -- i.e., the errors could indeed be large (grade B or C, rather than A) as long as there are grade B and C answers distributed on both sides of the right answer.
Of course, there are cases where this is a terrible model for our guesses, so collective intelligence won't always work...
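A tiny simulation of that model, with invented numbers, shows the effect: individually poor guesses whose errors are zero-mean still average out close to the right answer.

    # Many mediocre guessers, each off by up to +/- 30 (zero-mean error).
    import random

    random.seed(0)
    RIGHT_ANSWER = 100.0

    guesses = [RIGHT_ANSWER + random.uniform(-30, 30) for _ in range(1000)]

    crowd_estimate = sum(guesses) / len(guesses)
    print(crowd_estimate)  # lands close to 100 even though single guesses are poor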
Crowd Wisdom techniques, like prediction markets, work well in some situations and poorly in others, just as other approaches (experts, for instance) have their strengths and weaknesses. The optimal arenas, therefore, are ones where no other approach does very well and prediction markets can do well. Some examples include predicting public elections, estimating project completion dates, and predicting the prevalence of epidemics. These are areas where information is spread around sparsely and experts haven't found effective models that reliably predict.
The general idea is that market participants make up for one another's weaknesses. The expectation isn't that the markets will always predict every outcome correctly, but that, because people notice other people's mistakes, they won't miss crucial information as often, and that over the long haul they'll do better. In cases where the experts actually know the answer, they'll be able to influence the outcome. Different experts can weigh in on different questions, so each has more influence where they have the most knowledge. And as markets continue over time, each participant gets feedback from their gains and losses that makes them better informed about which kinds of questions they actually understand and which ones they should stay away from.
In a classroom, people are often graded on a curve, so the distribution of grades doesn't tell you much about how good the answers were. Prediction markets calibrate all the answers against actual outcomes. This public record of successes and failures does a lot to reinforce the mechanism, and is missing in most other approaches to forecasting.
Collective intelligence is really good at coming up with answers to problems that have complex behavior behind them, because it is able to take multiple sources of opinions/attributes into account to determine the end result. With a setup like this, training helps to optimize the end result of the process.
The fault is in your analogy: the two kinds of opinions are not equal. Traders predict direct profit for their transactions (the little part of the market they have an overview of), while the expert tries to predict the overall field.
In other words, the overall traders' position is pieced together like a jigsaw puzzle from a large number of small opinions, each covering its own piece of the pie (where the traders are assumed to be experts).
A single mind can't process that kind of detail, which is why the overall position MIGHT overshadow the real expert. Note that this phenomenon is usually restricted to a fairly static market, not periods of turmoil. Experts usually do better then, since they are often better trained and motivated to avoid going with the general sentiment (which is often comparable to that of a lemming in times of turmoil).
The problem with the class analogy is that the grading system doesn't assume that the students are masters in their (difficult to predict) terrain, so it is not comparable.
P.S. Note that the base axiom depends on all players being experts in a small piece of the field. One can debate whether this requirement actually translates well to a Web 2.0 environment.
