Predicting product's category by search term - machine-learning

Problem: User performs product's search using search term, we should define most related category(categories is descending order) to that search term.
Given: Products set, around 50000(Could be ten times more) products. Products contains title, description, and list of categories it belongs to.
Model:
Pre-processing Perform stemming and remove stopwords from product's title and description. Put all unique stemmed words in WORDS list of size N. Put all categories to CATEGORIES list of size M.
Fitting Use neural network which has N input neurons and M outputs.
Training For product which has words w1, w3, w4, w6 input will be x=[1 0 1 1 0 1 ...] in which elements which index corresponds to thouse words index in WORDS will be set to 1. If product belongs to categories c1, c3, c25 it corresponds to y =[1 0 1 ... 1(25-th position)...] Predicting step. As input put user search term stemmed tokens that should as output give us prediction of most related category.
Is this model correct way for solving such a problem? What are the recommendation for hidden NN layers configuration. Any advice will be helpful, I'm completely new to Machine Learning.
Thank you!

I think that's the correct way of solving the problem, since you're dealing with a multi-label classification problem. That is, a sample can belong to several classes simultaneously, or to a single class, or to none of the classes (categories).
This is a good example on Python: multi-label classification.
You can get more details here.
As for hidden layers configuration, the first approach is to use cross-validation to test the accuracy on the test set. But if you want to go further, please read this.

Related

How to build separate classifiers for each label in the dataset?

I have a list of columns and each column is to be labelled by a label from another list of labels.
Eg: Two columns namely, ALT_ID and MTRC_NM are matched with labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features(like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label as training more than 10 000 models isn't really feasible. Two possible things that come to my mind are:
Create a supervised learning model with one label as input and probability of each of 10 000 labels as output which only uses correct examples for predictions.
Create a reinforcement learning model with the same input but with output which maximises reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions but will be able to learn from incorrect predictions at the same time i.e. predict -1 score for an incorrect pair (x,y).

Loss function for Question Answering posed as Multiclass Classification?

I'm working on a question answering problem with limited data (~10,000s of data points) and very few features for both the context/question as well as the options/choices. Given:
a question Q and
options A, B, C, D, E (each characterized by some features, say, string similarity to Q or number of words in each option)
(while training) a single correct answer, say B.
I wish to predict exactly one of these as the correct answer. But I'm stuck because:
If I arrange ground truth as [0 1 0 0 0], and give the concatenation of QABCDE as input, then the model will behave as if classifying an image into dog, cat, rat, human, bird, i.e. each class will have a meaning, however that's not true here. If I switched the input to QBCDEA, the prediction should be [1 0 0 0 0].
If I split each data point into 5 data points, i.e. QA:0, QB:1, QC:0, QD:0, QE:0, then the model fails to learn that they're in fact interrelated, and only one of them must be predicted as 1.
One approach that seems viable is to make a custom loss function which penalizes multiple 1s for a single question, and which penalizes no 1s as well. But I think I might be missing something very obvious here :/
I'm also aware of how large models like BERT do this over SQuAD like datasets. They add positional embeddings to each option (eg. A gets 1, B gets 2), and then use a sort of concatenation over QA1 QB2 QC3 QD4 QE5 as input, and [0 1 0 0 0] as output. Unfortunately, I believe this will not work in my case given the very small dataset I have.
The problem you're having is that you removed all useful information from your "ground truth". The training target is not the ABCDE labels -- the target is the characteristics of the answers that those letters briefly represent.
Those five labels are merely array subscripts for classifications that are a 5Pn (5 objects chosen from n) shuffled subset of your training space. Bottom line: there is no information in those labels.
Rather, extract the salient characteristics from those answers. Your training needs to find the answer (characteristic set) that sufficiently matches the question. As such, what you're doing is close to multi-label training.
Multi-label models should handle this situation. This will include those that label photos, identifying multiple classes represented in the input.
Does that get you moving?
Response to OP comment
You understand correctly: predicting 0/1 for five arbitrary responses is meaningless to the model; the single-letter variables are of only transitory meaning, and have no relation to anything trainable.
A short thought experiment will demonstrate this. Imagine that we sort the answers such that A is always the correct answer; this doesn't change the information in the inputs and outputs; it's a valid arrangement of the multiple-choice test.
Train the model; we'll get to 100% accuracy in short order. Now, consider the model weights: what has the model learned from the input? Nothing -- the weights will train to ignore the input and select A, or will have absolutely arbitrary values that come to the A conclusion.
You need to ignore the ABCDE designations entirely; the target information is in the answers themselves, not in those letters. Since you haven't posted any sample cases, we have little to guide us for an alternate approach.
If your paradigm is a typical multiple-choice examination, with few restrictions on the questions and answers, then the problem you're tackling is far larger than your project is likely to solve -- you're in "Watson" territory, requiring a large knowledge base and a strong NLP system to parse the inputs and available responses.
If you have a restricted paradigm for the answers, perhaps you can parse them into phrases and relations, yielding a finite set of classes to consider in your training. In this case, a multi-label model might well be able to solve your problem.
If your application is open-ended, i.e. open topic, then I expect that you need a different model class (such as BERT), but you'll still need to consider the five answers as text sequences, not as letters. You need a holistic match to the subject at hand. If this is a typical multiple-choice exam, then your model will still have classification troubles, as all five answers are likely to be on topic; finding the correct answer should depend on some level of semantic insight into question and answer, something stronger than "bag of words" processing.

Complex or several simple models must be used?

I can't understand how models may be organised.
I want try create some algorithms which helps analyze product name (description) and gets product properties (category, some parameters).
I have tree structured data:
Category (name, null parent)
|Category (name, parent)
|Product (name+description)
|Param(key-value)
|Param(key-value)
|Param(key-value)
|...
I use model which will clasify top category for product, and after I use an other model which is trained on products which are belongs to the classified top category (so I can classify second level category).
For next step I use own models for every param key for param value classification
In general, do I need a model for every leaf of my tree structure for next classification step?
Are my thoughts right?
That's one way of proceeding. However I seed 2 problem of in the approach:
First, you segment training data and the final classifiers may not have enough data to be trained.
Second, I guess that the Param Key-Value can be repeated across different categories and products. So you are training different classifiers for the same things on different products and categories may not be a good idea because of the training data segmentation.
It is to have a classifier for the categories and one classifier for the products. But having a classifier for each property may be too much. I would recommend you to take a look into multi class classification. These algorithms can handle several classes for each input. So you could use them to model all the Param Key-Value
http://scikit-learn.org/stable/modules/multiclass.html
And if you really have a huge amount of leafs then you can try Extreme Multi Label
"extreme multi label learning text classification"

Can a list of websites be considered a corpus for a particular category?

I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science and etc... This will be for automatic web page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question. How does this differentiate from a lexicon wherein instead of a list of websites it shows a list of words with weights such as 0 or 1 to particular categories? Example would be a sentiment lexicon with words that has weights for positive and negative. But instead of positive and negative, categories such as Math, Science are used.
You say you want to make some web page categorization, then the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you actually extract their content as text. You work with textual input data. Since you want to categorize them, each of your input data has one or more corresponding labels, which are the outputs you want to predict. You have multiple label so you want to do multi-label classification
To tackle this problem, since most machine learning algorithms work with numerical vector, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag of word technique which first build a dictionary or lexicon and then count the occurrences of each word of the dictionary in each text. Actually, you can transform your output label in the same way, attributing an index of you output vector for each category.
The final pipeline would be something like this:
[input_text] --bag_of_word--> [input_vector] --prediction--> [output_vector] --label_matchnig--> [labels]

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html

Resources