Best way to detect features based on text [closed] - machine-learning

I have a "simple" problem: I have text sections and based on this it should be decided its whether "Category A" or "Category B".
As training data I have classified sections of text, which the algorithm can be trained.
The text sections look something like this:
Category A
a blue car drives
or
the blue bus stops
or
the blue bike drives
Category B
a red bike drives
or
the red bus stops
(The section text contains up to 20 words and the variation is large.)
If I have trained the algorithm with this example data, it should decide that if a text contains "blue" it is Category A, if it contains "red" it is Category B, and so on.
The algorithm should learn from the training data whether, based on word frequencies, a text is more likely Category A or Category B.
What's the best way to do this, and which tool should I use?

You can try the Fisher method, in which the probability of both the positive (A) and the negative (B) category is calculated for each feature word (red, blue) in the document. The probability that a sentence containing one of the specified words (red, blue) belongs to the specified category (A, B) is obtained, assuming there is an equal number of items in each category. Then a combined probability is obtained.
Since the features are not independent, this won’t be a real probability, but it works much like the Bayesian classifier. The value returned by the Fisher method is a much better estimate of probability, which can be very useful when reporting results or deciding cutoffs.
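A minimal sketch of this idea, assuming the tiny training set from the question; the helper names and the smoothing choices are illustrative, not a fixed recipe:

```python
import math
from collections import defaultdict

# word -> {category: number of training sentences containing the word}
feature_counts = defaultdict(lambda: defaultdict(int))
category_counts = defaultdict(int)

def train(sentence, category):
    category_counts[category] += 1
    for word in set(sentence.lower().split()):
        feature_counts[word][category] += 1

def weighted_prob(word, category):
    # P(category | word), assuming equal priors for the categories,
    # smoothed towards 0.5 so rare/unseen words are not decisive
    total = sum(feature_counts[word].values())
    if total == 0 or category_counts[category] == 0:
        return 0.5
    clf = feature_counts[word][category] / category_counts[category]
    freq_sum = sum(feature_counts[word][c] / category_counts[c]
                   for c in category_counts if category_counts[c])
    basic = clf / freq_sum
    return (0.5 + total * basic) / (1.0 + total)

def inv_chi2(chi, df):
    # cumulative probability used by Fisher's combining method
    m = chi / 2.0
    term = s = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        s += term
    return min(s, 1.0)

def fisher_prob(sentence, category):
    words = set(sentence.lower().split())
    p = 1.0
    for word in words:
        p *= weighted_prob(word, category)
    return inv_chi2(-2.0 * math.log(p), len(words) * 2)

# train on the examples from the question
for s in ["a blue car drives", "the blue bus stops", "the blue bike drives"]:
    train(s, "A")
for s in ["a red bike drives", "the red bus stops"]:
    train(s, "B")

print(fisher_prob("the blue train stops", "A"))  # noticeably higher than for "B"
print(fisher_prob("the blue train stops", "B"))
```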

I think a first try should be logistic regression, since you have a binary classification problem. As soon as you have defined your feature vector (e.g., the frequencies of a set of chosen words), you can optimize the parameters of a cost function suited to binary classification (e.g., one based on the sigmoid function).
A step you will probably need is to eliminate 'stop words'.
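A minimal sketch with scikit-learn, assuming the toy sentences and labels from the question; CountVectorizer builds the word-frequency feature vector and drops English stop words, and LogisticRegression then learns a weight per word:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a blue car drives", "the blue bus stops", "the blue bike drives",
         "a red bike drives", "the red bus stops"]
labels = ["A", "A", "A", "B", "B"]

model = make_pipeline(
    CountVectorizer(stop_words="english"),  # word counts, stop words removed
    LogisticRegression()                    # one weight per word
)
model.fit(texts, labels)

print(model.predict(["the blue train stops"]))  # expected: ['A']
```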
I really recommend the Coursera Machine Learning classes.

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

The task is seemingly straightforward -- given a list of classes and some samples/rules of what belongs in each class, assign all relevant text samples to it. The classes are arguably dissimilar, but they have a high degree of overlap in terms of vocabulary.
Precision is most important, but acceptable recall is about 80%.
Here is what I have done so far:
Checked whether any of the text samples have direct word matches/lemma matches to the samples in each class's corpus of words. (High precision but low recall -- this covered about 40% of the text.)
Formed a cosine-similarity matrix between every class's corpus of words and the remaining text samples. Cut off at an empirical threshold, it helped me identify a couple of new texts that are very similar. (Covered maybe 10% more text.)
I appended each sample picked up by the word match/lemma match/embedding match (using SBERT) to the class's corpus of words.
Essentially I increased the number of samples per class. Note that there are 35+ classes, and even with this method I only got to about 200-250 samples per class.
I converted each class's samples to embeddings via SBERT and then used UMAP to reduce the dimensionality. UMAP also has a secondary, less used, use case: it can learn a representation and transform new data into that same representation. I used this to convert the text to embeddings, reduce them via UMAP, and save the UMAP transformation. Using this reduced representation, I built a voting classifier (with XGB, RF, KNearestNeighbours, SVC and Logistic Regression) set to a hard voting criterion.
The unclassified texts then went through the prediction pipeline (SBERT embeddings -> lower-dimensional embeddings via the saved UMAP -> class predicted by the voter).
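A rough sketch of that pipeline, assuming sentence-transformers, umap-learn and scikit-learn are installed; the SBERT model name, the number of UMAP components and the estimator settings below are illustrative placeholders, not the exact values I used:

```python
from sentence_transformers import SentenceTransformer
import umap
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT model

def build_classifier(train_texts, train_labels):
    X = encoder.encode(train_texts)                        # SBERT embeddings
    reducer = umap.UMAP(n_components=20, random_state=42).fit(X)
    X_low = reducer.transform(X)                           # learned low-dim representation
    voter = VotingClassifier(
        estimators=[                                       # xgboost's XGBClassifier could be
            ("rf", RandomForestClassifier()),              # added here as a fifth estimator
            ("knn", KNeighborsClassifier()),
            ("svc", SVC()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",                                     # hard voting, as described above
    )
    voter.fit(X_low, train_labels)
    return reducer, voter

def predict(texts, reducer, voter):
    X = encoder.encode(texts)                              # sbert -> saved UMAP -> voter
    return voter.predict(reducer.transform(X))
```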
Is this the right approach when trying to classify among a large number of classes with a small training data size?

difference between classification and regression in k-nearest neighbor? [closed]

What is the difference between using k-nearest neighbours for classification and using it for regression?
And when KNN is used in a recommendation system, is that considered classification or regression?
In classification tasks, the user seeks to predict a category, which is usually represented as an integer label, but represents a category of "things". For instance, you could try to classify pictures between "cat" and "dog" and use label 0 for "cat" and 1 for "dog".
The KNN algorithm for classification will look at the k nearest neighbours of the input you are trying to make a prediction on. It will then output the most frequent label among those k examples.
In regression tasks, the user wants to output a numerical value (usually continuous), for instance estimating the price of a house or giving an evaluation of how good a movie is.
In this case, the KNN algorithm would collect the values associated with the k closest examples to the one you want to make a prediction on and aggregate them to output a single value. Usually you would choose the average of the k neighbours' values, but you could choose the median or a weighted average (or anything else that makes sense for the task at hand).
For your specific problem, you could use both, but regression makes more sense to me in order to predict some kind of "matching percentage" between the user and the item you want to recommend to them.
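As a small illustration of the same neighbours being used both ways, assuming scikit-learn; the toy ratings data is made up purely for the example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
liked = np.array([0, 0, 0, 1, 1, 1])                 # discrete labels -> classification
rating = np.array([2.0, 2.5, 3.0, 8.0, 9.0, 9.5])    # continuous values -> regression

clf = KNeighborsClassifier(n_neighbors=3).fit(X, liked)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, rating)

print(clf.predict([[2.5]]))   # majority label of the 3 nearest neighbours -> 0
print(reg.predict([[2.5]]))   # average rating of the 3 nearest neighbours -> ~2.5
```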

Difference between Regression and classification in Machine Learning? [closed]

I am new to Machine Learning. Can anyone tell me the major difference between classification and regression in machine learning?
Regression aims to predict a continuous output value. For example, say that you are trying to predict the revenue of a certain brand as a function of many input parameters. A regression model would literally be a function which can output potentially any revenue number based on certain inputs. It could even output revenue numbers which never appeared anywhere in your training set.
Classification aims to predict which class (a discrete integer or categorical label) the input corresponds to. e.g. let us say that you had divided the sales into Low and High sales, and you were trying to build a model which could predict Low or High sales (binary/two-class classification). The inputs might even be the same as before, but the output would be different. In the case of classification, your model would output either "Low" or "High," and in theory every input would generate only one of these two responses.
(This answer is true for any machine learning method; my personal experience has been with random forests and decision trees).
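A minimal sketch of the sales example above, assuming scikit-learn and using made-up numbers purely for illustration; the same inputs feed a regressor and a classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

X = np.array([[1, 20], [2, 35], [3, 50], [4, 65]])     # e.g. ad spend, store count
revenue = np.array([100.0, 180.0, 260.0, 340.0])        # continuous target
label = np.array(["Low", "Low", "High", "High"])        # discrete target

reg = RandomForestRegressor(random_state=0).fit(X, revenue)
clf = RandomForestClassifier(random_state=0).fit(X, label)

print(reg.predict([[2.5, 42]]))   # a revenue number, possibly unseen in training
print(clf.predict([[2.5, 42]]))   # only ever "Low" or "High"
```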
Regression - the output variable takes continuous values.
Example :Given a picture of a person, we have to predict their age on the basis of the given picture
Classification - the output variable takes class labels.
Example: Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
I am a beginner in the Machine Learning field, but as far as I know, regression is for "continuous values" and classification is for "discrete values". With regression, there is a line fitted to your continuous values and you can see whether your model is a good or bad fit. With classification, on the other hand, you can see how discrete values gain meaning "discretely". If I am wrong, please feel free to correct me.

What are the algorithms which could be used to match sentences? [closed]

Let's say we have a list of 50 sentences and an input sentence. How can I choose the sentence from the list that is closest to the input sentence?
I have tried many methods/algorithms, such as averaging the word2vec vector representations of each token in a sentence and then taking the cosine similarity of the resulting vectors.
For example I want the algorithm to give a high similarity score between "what is the definition of book?" and "please define book".
I am looking for a method (probably a combination of methods) which
1. looks for semantics
2. looks for syntax
3. gives different weights for different tokens with different role (e.g. in the first example 'what' and 'is' should get lower weights)
I know this might be a bit general but any suggestion is appreciated.
Thanks,
Amir
Before computing a distance between sentences, you need to clean them.
For that:
Lemmatization of the words is needed to get the root of each word, so your sentence "what is the definition of book" would become "what be the definition of book".
You need to delete all prepositions, forms of the verb "to be" and other words without much meaning, so "what be the definition of book" would become "definition book".
Then you transform your sentences into vectors of numbers using the tf-idf method or word2vec.
Finally you can compute the distance between your sentences using the cosine between the vectors: if the cosine distance is small (i.e. the cosine similarity is high), the two sentences are similar.
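A minimal sketch of these steps, assuming NLTK (with the wordnet and stopwords data downloaded via nltk.download) for lemmatization and stop words, and scikit-learn for tf-idf and cosine similarity; the candidate sentences are toy examples:

```python
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

def clean(sentence):
    # lowercase, keep alphabetic tokens, lemmatize, drop stop words
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return " ".join(lemmatizer.lemmatize(w) for w in tokens if w not in stop)

candidates = ["please define book", "the weather is nice today"]
query = "what is the definition of book?"

matrix = TfidfVectorizer().fit_transform([clean(query)] + [clean(s) for s in candidates])
scores = cosine_similarity(matrix[0], matrix[1:])
print(candidates[scores.argmax()])   # -> "please define book"
```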
Hope that helps.
Your sentences are too sparse to compare the two documents directly. Aggressive morphological transformations (such as stemming, lemmatization, etc) might help some, but will probably fall short given your examples.
What you could do is compare the 'search results' of the two sentences in a large document collection with a number of methods. According to the distributional hypothesis, similar sentences should occur in similar contexts (see the distributional hypothesis, but also Rocchio's algorithm, co-occurrence and word2vec). Those contexts (when gathered in a smart way) could be large enough to do some comparison (such as cosine similarity).

NLP: Calculating probability a document belongs to a topic (with a bag of words)? [closed]

Given a topic, how can I calculate the probability that a document "belongs" to that topic (e.g. sports)?
This is what I have to work with:
1) I know the common words in documents associated with that topic (eliminating all stop words), and the % of documents that contain each word.
For instance if the topic is sports, I know:
75% of sports documents have the word "play"
70% have the word "stadium"
40% have the word "contract"
30% have the word "baseball"
2) Given this, and a document with a bunch of words, how can I calculate the probability that this document belongs to that topic?
This is a fuzzy classification problem with topics as classes and words as features. Normally you don't have a bag of words for each topic, but rather a set of documents and associated topics, so I will describe this case first.
The most natural way to find a probability (in the same sense it is used in probability theory) is to use a naive Bayes classifier. This algorithm has been described many times, so I'm not going to cover it here. You can find a quite good explanation in this synopsis or in the associated Coursera NLP lectures.
There are also many other algorithms you can use. For example, your description naturally fits tf*idf based classifiers. tf*idf (term frequency * inverse document frequency) is a statistic used in modern search engines to calculate the importance of a word in a document. For classification, you may calculate an "average document" for each topic and then find how close a new document is to each topic with cosine similarity.
If you have the case exactly as you've described - only topics and associated words - just consider each bag of words as a single document, possibly with frequent words duplicated.
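A hedged sketch of the naive Bayes idea for this exact setting; the background probabilities P(word | not sports) and the prior below are assumptions made up for the example, and in practice they would be estimated from documents outside the topic:

```python
# P(word | sports) from the question; P(word | other) is assumed for illustration
p_word_given_sports = {"play": 0.75, "stadium": 0.70, "contract": 0.40, "baseball": 0.30}
p_word_given_other = {"play": 0.10, "stadium": 0.02, "contract": 0.15, "baseball": 0.01}
p_sports = 0.5  # prior; assumes equal numbers of sports and non-sports documents

def prob_sports(document_words):
    p_s, p_o = p_sports, 1.0 - p_sports
    for word in p_word_given_sports:
        present = word in document_words
        # multiply by P(word present/absent | class) for each known word
        p_s *= p_word_given_sports[word] if present else 1 - p_word_given_sports[word]
        p_o *= p_word_given_other[word] if present else 1 - p_word_given_other[word]
    return p_s / (p_s + p_o)   # Bayes' rule over the two classes

doc = set("the team signed a new contract to play at the stadium".split())
print(prob_sports(doc))
```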
Check out topic modeling (https://en.wikipedia.org/wiki/Topic_model) and, if you are coding in Python, check out Radim's implementation, gensim (http://radimrehurek.com/gensim/tut1.html). Otherwise there are many other implementations at http://www.cs.princeton.edu/~blei/topicmodeling.html
There are many approaches to solving a problem like this. I suggest starting with simple logistic regression and looking at the results. If you already have predefined ontology sets, you can add them as features in the next stage to improve accuracy.
