How to cluster/classify search queries - machine-learning

Background:
I have a dataset from a small search portal site. The dataset includes all the search queries / keywords users searched for.
The format is like this:
Keyword num_of_searches
Yahoo 5098
Google 8873
エロ動画 (adult videos) 98982
... ...
(The portal site is in JP, so there are lots of Japanese keywords in the dataset.)
Question:
Are there any existing machine learning models that would allow me to classify all the keywords into a few categories?
(I have heard of "keyword clustering", but I don't know which model to use.)

You could try using pretrained word embeddings and then clustering the embedding vectors. The embeddings can also be visualized with the Embedding Projector (https://projector.tensorflow.org) using t-SNE or PCA to gain more insight.
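A rough sketch of that idea, assuming gensim and scikit-learn are available; the pretrained vector file name (cc.ja.300.vec, fastText's Japanese vectors, which would also cover Japanese keywords) and the cluster count are illustrative choices only.

    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans

    # Load pretrained vectors (illustrative file name; any word2vec-format file works).
    vectors = KeyedVectors.load_word2vec_format("cc.ja.300.vec", binary=False)

    keywords = ["yahoo", "google", "news", "weather", "mail", "maps"]  # your query strings
    in_vocab = [w for w in keywords if w in vectors]
    embeddings = np.array([vectors[w] for w in in_vocab])

    # Group the keywords into a small number of categories.
    kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
    for word, label in zip(in_vocab, kmeans.labels_):
        print(label, word)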

Related

How to deal with out-of-vocabulary words in NLP?

I have a multi-label dataset that contains a lot of out-of-vocabulary words. The dataset is basically from a user forum site. The columns are post_title, post_description and tags. I want to predict the tags using machine learning models. But as the dataset contains many out-of-vocabulary words, the models are giving me very poor results. So what should I do in this case?

Questions about the feature vector of a tweet in a Twitter sentiment analysis task.

https://aclanthology.info/pdf/S/S17/S17-2132.pdf
This paper describes one of the systems used in the SemEval-2017 shared task and explains how an SVM classifier was implemented for a Twitter sentiment analysis task.
I am trying to imitate this system, but there are some parts that I, as a beginner, don't understand.
It lists a bunch of features that describe a single tweet. To implement a feature matrix X (where a single row represents a single tweet and the columns contain the values of all the features listed in the paper), am I supposed to compute a vector for each feature and concatenate them at the end?
One of the features is the Tf-Idf vectorizer. As far as I know, Tf-Idf gives us the weight of a word per document. But in this case, what would the document be? Wouldn't using Tf-Idf as one of the features cause the final X to have rows with different numbers of columns? I just want to know how Tf-Idf would work here as one of the feature vectors.
What does the 2.8 Topic and Hashtag feature indicate? I don't quite understand what kind of value it describes.
Any kind of advice would be great! Please help!
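Regarding the concatenation question, here is a rough sketch of how such a matrix is commonly built with scikit-learn (an assumption about the general technique, not necessarily what the paper's authors did): each tweet is treated as one tf-idf "document", so every row has the same number of columns, and hand-crafted per-tweet features are stacked next to the tf-idf block.

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer

    tweets = ["i love this movie", "worst service ever", "not bad at all"]

    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(tweets)  # shape: (n_tweets, vocabulary_size)

    # Placeholder hand-crafted features, one row per tweet (e.g. token count,
    # number of exclamation marks); the paper's actual features would go here.
    extra = np.array([[len(t.split()), t.count("!")] for t in tweets])

    # Concatenate the per-feature blocks column-wise into the final matrix X.
    X = hstack([X_tfidf, csr_matrix(extra)])
    print(X.shape)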

What methods are there to classify documents?

I am trying to do document classification, but I am really confused about the difference between feature selection and tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can tell me; I am not really sure that my question will make sense to you.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing features (0 or 1). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is the following (a sketch follows the list):
1. Extract features (e.g. TF)
2. Select features (e.g. remove stopwords)
3. Weight features (e.g. IDF)
4. Train a classifier on the resulting numerical vectors
5. Predict the classes of new/unlabeled documents
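A minimal sketch of those five steps, assuming scikit-learn; the toy documents, labels, and the Naive Bayes classifier are illustrative choices only.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    docs = ["the stock market fell today", "the team won the match",
            "parliament passed the new budget", "the striker scored twice"]
    labels = ["economics", "sports", "politics", "sports"]

    clf = Pipeline([
        ("tf", CountVectorizer(stop_words="english")),  # steps 1-2: extract and select features
        ("idf", TfidfTransformer()),                    # step 3: weight features
        ("model", MultinomialNB()),                     # step 4: train a classifier
    ])
    clf.fit(docs, labels)

    # Step 5: predict the class of a new/unlabeled document.
    print(clf.predict(["the goalkeeper made a great save"]))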
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not necessarily assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging over politics, economics, computer science, and the arts. The documents belonging to each subject are separated into the appropriate directory (you have a labeled dataset). Now you receive a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now you receive a query regarding computer science, for instance "Good methods for finding textual similarity". Which document in the computer science directory can provide the best response to that query? TF-IDF would be a good approach to figure that out.
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tfidf is probably a good feature to select.

How can Topic Modeling noise be removed?

I am working on topic modeling where the given text corpus has a lot of noise in the form of supporting words that remain after stop-word removal. These words have a high term frequency but, unlike other useful high-frequency words, do not help LDA form topic terms. How can this noise be removed?
LDA takes a bag of words as input, not tf-idf weights; however, you could first filter words from your corpus based on their tf-idf scores and then feed the filtered texts to your LDA program.
The basic approach is to compute TF-IDF and clean the corpus based on the scores; if that still doesn't help, you can create a domain-specific custom stopword list. For example, if I'm in a jobs domain, the word "job" is not a regular stopword, but in the jobs domain it is, and a company name can be a stopword too, since it repeats across many documents. So building a custom stopword list is another way to go.
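A rough sketch of both ideas, assuming gensim; the toy corpus, the domain stopword, and the tf-idf cut-off are illustrative only.

    from gensim import corpora, models

    texts = [["job", "python", "developer", "remote"],
             ["job", "data", "scientist", "salary"],
             ["job", "java", "engineer", "remote"]]

    # Domain-specific custom stopwords (e.g. "job" in a jobs domain).
    domain_stopwords = {"job"}
    texts = [[w for w in doc if w not in domain_stopwords] for doc in texts]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

    # Drop tokens whose tf-idf score falls below an (illustrative) threshold,
    # then feed the filtered bag-of-words corpus to LDA.
    tfidf = models.TfidfModel(bow_corpus)
    low_value = 0.1
    filtered_corpus = []
    for doc in bow_corpus:
        scores = dict(tfidf[doc])
        filtered_corpus.append([(tid, n) for tid, n in doc if scores.get(tid, 0) >= low_value])

    lda = models.LdaModel(filtered_corpus, id2word=dictionary, num_topics=2)
    print(lda.print_topics())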

Good training data for text classification by LDA?

I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, and Science.
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science, etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document to see how well the classifier is trained.
My Question is,
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should I be looking for "pure" topical content to fill each of these files? Sources that are not too large (text data > 1 GB would be too much).
Classification is only on "generic" topics such as the ones above.
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest you use a bag-of-words (BoW) representation for each class, or vectors where each column is the frequency of important keywords related to the class you want to target.
Regarding dictionaries, you have DBPedia, as yves mentioned, or WordNet.
a.) The simplest solution is surely the k-nearest neighbors algorithm (kNN). It will classify new texts by comparing their categorical content with an overlap metric; see the sketch below.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
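A small sketch of the kNN idea, assuming scikit-learn; tf-idf with the default distance stands in for the overlap metric here, and the toy training texts are illustrative only.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    train_texts = ["guitar album concert tour", "cpu gpu benchmark review",
                   "painting gallery sculpture exhibit", "quantum physics experiment results"]
    train_topics = ["Music", "Technology", "Arts", "Science"]

    # One neighbor, because this toy set has a single example per topic.
    knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
    knn.fit(train_texts, train_topics)

    print(knn.predict(["new gpu benchmark released"]))  # -> ['Technology']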
Dataset issue:
If you are classifying live user feeds, then I guess no single dataset will satisfy your requirement.
For example, if a new movie X is released, your classifier might not catch it, because the training dataset is already out of date.
To stay up to date with the latest data, I suggest using Twitter training datasets: develop a dynamic process that updates the classifier with the latest tweets. You could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset per category.
Classifier:
Most classifiers use a bag-of-words model; you can try out various classifiers and see which gives the best result. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html
