machine learning from negative tagging - machine-learning

I apply machine learning on free text in order to classify it to one of 120 classes. The algorithm learns from set of X Strings that are pre-classified.
"this is text 1" (class 1)
"this is text 2" (class 2)
...
Given new text the algorithm returns match rank for each of the possible classes.
In order to make the model better I must feed it with more and more data of positive classified text.
I wonder if there are models that can also learn from negatively classified texts. Meaning "text X" is definitely not (class 3,4,5) - but without knowing what is the positive class.
For example: say that I only take the top 5 ranked classes, and all 5 are wrong, I want to be able to give negative feedback so that the next time I ask for the top 5 on the same text, the machine will return something else.

Related

How to build separate classifiers for each label in the dataset?

I have a list of columns and each column is to be labelled by a label from another list of labels.
Eg: Two columns namely, ALT_ID and MTRC_NM are matched with labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features(like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label as training more than 10 000 models isn't really feasible. Two possible things that come to my mind are:
Create a supervised learning model with one label as input and probability of each of 10 000 labels as output which only uses correct examples for predictions.
Create a reinforcement learning model with the same input but with output which maximises reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions but will be able to learn from incorrect predictions at the same time i.e. predict -1 score for an incorrect pair (x,y).

Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?

I know the general rule that we should test a trained classifier only on the testing set.
But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?
And what if I predict a label column of a time series (edited later: I do not mean to create a classical time series analysis here, but just a broad selection of columns from a typical database, weekly, monthly or randomly stored data that I convert into separate feature columns, each for one week / month / year ...), do I have to shift all of the features (not just the past columns of the time series label column, but also all other normal features) of the training+testing set back to a point in time where the data has no "knowledge" interception with the predicting set?
I would then train and test the classifier on features shifted to the past by n months, scoring against a label column that is unshifted and most recent, and then predicting from most recent, unshifted features. Shifted and unshifted features have the same number of columns, I align shifted and unshifted features by assigning the column names of the shifted features to the unshifted features.
p.s.:
p.s.1: The general approach on https://en.wikipedia.org/wiki/Dependent_and_independent_variables
In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.
p.s.2: In this basic tutorial we can see that the predicting set is made different: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
We select the training set with the [:-1] Python syntax, which produces a new array that contains all > but the last item from digits.data: […] Now you can predict new values. In this case, you’ll predict using the last image from digits.data [-1:]. By predicting, you’ll determine the image from the training set that best matches the last image.
I think you are mixing up some concepts, so I will try to give a general explanation for Supervised Learning.
The training set is what your algorithm LEARNS on. You split it in X (features) and Y (target variable).
The test set is a set that you use to SCORE your model, and it must contain data that was not in the training set. This means that a test set also has X and Y (meaning that you know the value of the target). What happens is that you PREDICT f(Y) based on X, and compare it with the Y you have, and see how good your predictions are
A prediction set is simply new data! This means that usually you DO NOT have a target, since the whole point of supervised learning is predicting it. You will only have your X (features) and you will predict f(X) (your estimate of the target Y) and use it for whatever you need.
So, in the end a test set is simply a prediction set for which you have a target to compare your estimation to.
For time series, it is a bit more complicated, because often the features (X) are transformations on past data of the target variable (Y). For example, if you want to predict today's SP500 price, you might want to use the average of the last 30 days as a feature. This means that for every new day, you need to recompute this feature over the past days.
In general though, I would suggest starting with NON time series data if you're new to ML, as Time Series is much harder in terms of feature engineering and data management and it is easy to make mistakes.
The question above When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? has the simple answer: No.
The question above Do I have to shift all of the features has the simple answer: Yes.
In short, if I predict a month's class column: I have to shift all of the non-class columns also back in time in addition to the previous class months I converted to features, all data must have been known before the month in that the class is predicted.
This also means: the predicting set has to be different from the dataset that contains the testing set. If you included the testing set, the training set loses valuable up-to-date data of the latest month(s) available! The term of a final "predicting set" is meant to be the "most current input to be used without a testing set" to get the "most current results" for the prediction.
This is confirmed by the following overview offered by this user who seems to have made the image, using days instead of months here, but the idea is the same:
Source: Answer on "Cross Validated" - Splitting Time Series Data into Train/Test/Validation Sets; the whole Q/A is recommended (!).
See the last line of the image and the valuable comments of that answer on "Cross Validated" to understand this.
230106:
The image shows that the last step is a training on the whole dataset, this is the "predicting set" that is the newest and that does not have a testing set.
On that image, there is one "mistake" which shows that this seemingly easy question of taking former labels as features for upcoming labels seems to be hard to be understood. I myself did not see this and posted the image without this remark: The "T&V" is in the past of the "Test". And that would be a wrong validation for a model that shall predict the future, the V must be in the "future" test block (unless you have a dataset that is not dynamically changing over time, like in physics).
You would have to change it to a "walk-forward" model, with the validation set - if at all - split k-fold from the testing set, not from the training set. That would look like this:
See also:
Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)? with the "walk-forward" main image,
Splitting Time Series Data into Train/Test/Validation Sets with more insight into this and the comment that brought up the model name "walk-forward".

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html

classification algorithms that return confidences?

Given a machine learning model built on top of scikit-learn, how can I classify new instances but then choose only those with the highest confidence? How do we define confidence in machine learning and how to generate it (if not generated automatically by scikit-learn)? What should I change in this approach if I had more that 2 potential classes?
This is what I have done so far:
# load libraries
from sklearn import neighbors
# initialize NearestNeighbor classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
# train model
knn.fit([[1],[2],[3],[4],[5],[6]], [0,0,0,1,1,1])
# predict ::: get class probabilities
print(knn.predict_proba(1.5))
print(knn.predict_proba(37))
print(knn.predict_proba(3.5))
Example:
Let's assume that we have created a model using the XYZ machine learning algorithm. Let's also assume that we are trying to classify users based on their gender using information such as location, hobbies, and income. Then, we have 10 say new instances that we want to classify. As normal, upon the applying of the model, we get 10 outputs, either M (for male) or F (for female). So far so good. However, I would like to somehow measure the precision of these results and then, by using a hard-coded threshold, leave out those with low precision. My question is on how to measure the precession. Is probability (as given by the predict_proba() function) a good measure? For example, can I say that if probably is between 0.9 and 1 then "keep" (otherwise "omit")? Or I should use a more sophisticated method for doing that? As you can see, I lack theoretical background so any help would be highly appreciated.
While this is more of a stats question I can give answers relative to scikit-learn.
Confidence in machine learning depends on the method used for the model. For exemple with 3-NN (what you used), predict_proba(x) will give you n/3 with x the number of "class 1" among the 3 nearest neighbours from x. You can easily say that if n/3 is smaller than 0.5 that means there are less than 2 "class 1" among the nearest neighbours and that there are more than 2 "class 0". That means your x is more likely to be from "class 0". (I assume you knew that already)
For another method like SVM the confidence can be the distance from the point considered to the hyperplan or for ensemble models it could be the number of aggregated votes towards a certain class. Scikit-learn's predict_proba() uses what is available from the model.
For multiclass problems (imagine Y can be equal to A, B or C) ypu have two main approach that are sometimes directly taken into consideration in scikit learn.
The first approach is OneVsOne. It basically compute every new sample as a AvsB AvsC and BvsC model and takes the most probable (imagine if A wins against B and against C it is very likely that the right class is A, the annoying cases are resolved by taking the class that has the highest confidence in the match ups e.g. if A wins against B, B wins against C and C wins against C, if the confidence of A winning against B is higher than the rest it will most likely be A).
The second approach is OneVsAll, in wich you compute A vs B and C, B vs A and C, C vs A and B and take the class that is the most likely by looking at the confidence scores.
Using scikit-learn's predict() will always give the most likely class based on the confidence scores that predict_proba would give.
I suggest you read this http://scikit-learn.org/stable/modules/multiclass.html very carefully.
EDIT :
Ah I see what you are trying to do. predict_proba() has a big flaw : let's assume you have a big outlier in your new instances (e.g. female with video games and guns as hobbies, software developper as a job etc.) if you use for instance k-NN and your outlier will be in the flock of the other classe's cloud of point predict_proba() could give 1 as a confidence score for Male while the instance is Female. However it will well for undecisive cases (e.g. male or female, with video games and guns as hobbies, and works in a nursery) as predict_proba() will give something around ~0.5.
I don't know if something better can be used tought. If you have enough training samples for doing cross validation I suggest you maybe look toward ROC and PR curves for optimizing your threshold.

machine learning predict classification

I have the following problem. I have a training dataset comprising of a range of numbers. Each number belongs to a certain class. There are five classes.
Range:
1...10
Training Dataset:
{1,5,6,6,10,2,3,4,1,8,6,...}
Classes:
[1,2][3,4][5,6][7,8][9,10]
Is it possible to use a machine learning algorithm to find likelihoods for class prediction and what algorithm would be suited for this?
best, US
As described in the question's comment, I want to calculate the likelihood of a certain class to appear based on the given distribution of the training set,
the problem is trivial and hardly a machine learning one:
Simply count the number of occurrences of each class in the "training set", Count_12, Count_34, ... Count_910. The likelihood that a given class xy would appear is simply given by
P(xy) = Count_xy / Total Number of elements in the "training set"
= Count_xy / (Count_12 + Count_34 + Count_56 + Count_78 + Count_910)
A more interesting problem...
...would be to consider the training set as a sequence and to guess what would the next item in that sequence be. The probability that the next item be from a given category would then not only be based on the prior for that category (the P(xy) computed above), but it would also be taking into account the items which precede it in the sequence. One of the interesting parts of this problem would then be to figure out how "far back" to look and much "weight" to give to the preceding sequences of items.
Edit (now that OP indicated his/her interest for the "more interesting problem").
This "prediction-given-preceding-sequence" problem maps almost directly to the
machine-learning-algorithm-for-predicting-order-of-events StackOverflow question.
The slight differences being that the alphabet here has 10 distinct code (4 in the other question) and the fact that here we try and predict a class of codes, rather that just the code itself. With regards to this aggregation of, here, 2 codes per class, we have several options:
work with classes from the start, i.e. replace each code read in the sequence by its class, and only consider and keep track of classes then on.
work with codes only, i.e. create a predictor of 1-thru-10 codes, and only consider the class at the very end, adding the probability of the two codes which comprise a class, to produce the likelihood of the next item being of that class.
some hybrid solution: consider / work with the codes but sometimes aggregate to the class.
My personal choice would be to try first with the code predictor (only aggregating at the very end), and maybe adapt from there if somehow insight gained from this initial attempt were to tell us that the logic or its performance could be simplified or improved would we aggregate earlier. Indeed the very same predictor could be used to try both approaches, one would simply need to alter the input stream, replacing all even numbers by the odd number preceding it. I'm guessing that valuable information (for the purpose of guessing upcoming codes) is lost when we aggregate early.

Resources