Loss function for Question Answering posed as Multiclass Classification?

I'm working on a question answering problem with limited data (~10,000s of data points) and very few features for both the context/question as well as the options/choices. Given:
a question Q and
options A, B, C, D, E (each characterized by some features, say, string similarity to Q or number of words in each option)
(while training) a single correct answer, say B.
I wish to predict exactly one of these as the correct answer. But I'm stuck because:
If I arrange the ground truth as [0 1 0 0 0] and give the concatenation QABCDE as input, then the model will behave as if it were classifying an image as dog, cat, rat, human, or bird, i.e. as if each class position carried meaning. That's not true here: if I reorder the input to QBCDEA, the prediction should be [1 0 0 0 0].
If I split each data point into 5 data points, i.e. QA:0, QB:1, QC:0, QD:0, QE:0, then the model fails to learn that they're in fact interrelated, and only one of them must be predicted as 1.
One approach that seems viable is to make a custom loss function which penalizes multiple 1s for a single question, and which penalizes no 1s as well. But I think I might be missing something very obvious here :/
I'm also aware of how large models like BERT handle this on SQuAD-like datasets. They add positional embeddings to each option (e.g. A gets 1, B gets 2), then use a sort of concatenation QA1 QB2 QC3 QD4 QE5 as input and [0 1 0 0 0] as output. Unfortunately, I believe this will not work in my case given the very small dataset I have.
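For concreteness, the custom-loss idea could be sketched as a shared scorer applied to each (question, option) feature pair, with a softmax across the options so exactly one answer "wins". This is only a rough sketch; the feature shapes, the linear scorer, and the function names are all made up for illustration:

```python
import numpy as np

def option_scores(W, question_feats, option_feats):
    """Score each (question, option) pair with the same weight vector W.

    question_feats: shape (dq,); option_feats: shape (n_options, do).
    Each option is scored independently, so the letter/position of an
    option carries no meaning; only its features do.
    """
    pairs = np.hstack([np.tile(question_feats, (option_feats.shape[0], 1)),
                       option_feats])
    return pairs @ W  # shape (n_options,)

def listwise_loss(W, question_feats, option_feats, correct_idx):
    """Softmax cross-entropy across the candidate set: exactly one option
    is correct, which encodes the mutual exclusivity that independent
    per-pair 0/1 labels lose."""
    s = option_scores(W, question_feats, option_feats)
    s = s - s.max()                   # numerical stability
    p = np.exp(s) / np.exp(s).sum()   # softmax over the options
    return -np.log(p[correct_idx]), p
```

Because the scorer is shared, permuting the options simply permutes the probabilities: shuffling QABCDE to QBCDEA changes nothing about what the model learns.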

The problem you're having is that you removed all useful information from your "ground truth". The training target is not the ABCDE labels -- the target is the characteristics of the answers that those letters briefly represent.
Those five labels are merely array subscripts for answers that are an nP5 shuffle (5 items drawn from a pool of n) of your training space. Bottom line: there is no information in those labels.
Rather, extract the salient characteristics from those answers. Your training needs to find the answer (characteristic set) that sufficiently matches the question. As such, what you're doing is close to multi-label training.
Multi-label models should handle this situation; these include the models that label photos, identifying the multiple classes represented in a single input.
Does that get you moving?
Response to OP comment
You understand correctly: predicting 0/1 for five arbitrary responses is meaningless to the model; the single-letter variables are of only transitory meaning, and have no relation to anything trainable.
A short thought experiment will demonstrate this. Imagine that we sort the answers such that A is always the correct answer; this doesn't change the information in the inputs and outputs; it's a valid arrangement of the multiple-choice test.
Train the model; we'll get to 100% accuracy in short order. Now, consider the model weights: what has the model learned from the input? Nothing -- the weights will train to ignore the input and select A, or will have absolutely arbitrary values that come to the A conclusion.
You need to ignore the ABCDE designations entirely; the target information is in the answers themselves, not in those letters. Since you haven't posted any sample cases, we have little to guide us for an alternate approach.
If your paradigm is a typical multiple-choice examination, with few restrictions on the questions and answers, then the problem you're tackling is far larger than your project is likely to solve -- you're in "Watson" territory, requiring a large knowledge base and a strong NLP system to parse the inputs and available responses.
If you have a restricted paradigm for the answers, perhaps you can parse them into phrases and relations, yielding a finite set of classes to consider in your training. In this case, a multi-label model might well be able to solve your problem.
If your application is open-ended, i.e. open topic, then I expect that you need a different model class (such as BERT), but you'll still need to consider the five answers as text sequences, not as letters. You need a holistic match to the subject at hand. If this is a typical multiple-choice exam, then your model will still have classification troubles, as all five answers are likely to be on topic; finding the correct answer should depend on some level of semantic insight into question and answer, something stronger than "bag of words" processing.

Related

Is splitting a long document of a dataset for BERT considered bad practice?

I am fine-tuning a BERT model on a labeled dataset with many documents longer than the 512 token limit set by the tokenizer.
Since truncating would lose a lot of data I would rather use, I started looking for a workaround. However, I noticed that simply splitting the documents after 512 tokens (or by another heuristic) and creating new entries in the dataset with the same label was never mentioned.
In this answer, someone mentioned that you would need to recombine the predictions, is that necessary when splitting the documents?
Is this generally considered bad practice or does it mess with the integrity of the results?
You have not mentioned if your intention is to classify, but given that you refer to an article on classification I will refer to an approach where you classify the whole text.
The main question is: which part of the text is the most informative for your purpose? In other words, does it make sense to use more than the first/last split of the text?
When considering long passages of text, it is frequently enough to consider the first (or last) 512 tokens to correctly predict the class in a substantial majority of cases (say 90%). Even though you may lose some precision, you gain speed and simplicity in the overall solution, and you avoid the nasty problem of figuring out the correct class from a set of conflicting piece-level classifications. Why?
Consider an example of text 2100 tokens long. You split it by 512 tokens, obtaining pieces: 512, 512, 512, 512, 52 (notice the small last piece - should you even consider it?). Your target class for this text is, say, A, however you get the following predictions on the pieces: A, B, A, B, C. So you have now a headache to figure out the right method to determine the class. You can:
use majority voting, but it is not conclusive here (2 × A vs. 2 × B).
weight the predictions by the length of the piece. Again, inconclusive.
check the prediction of the last piece: it is class C, but barely above the threshold, and class C is similar to A. So you are leaning towards A.
re-classify, starting the split from the end. In the same order as before you get: A, B, C, A, A. So, clearly A. You also get A when you majority-vote over all of the classifications combined (forward and backward splits).
consider the confidence of the classifications, e.g. A: 80%, B: 70%, A: 90%, B: 60%, C: 55%: avg. 85% for A vs. 65% for B.
reconfirm the labelling of the last piece manually: if it turns out to be B, then it changes all of the above.
train an additional network to classify based on the raw piece-level classifications. Then you are back in trouble figuring out what to do with particularly long sequences, or with inconclusive combinations of predictions that leave the additional classification layer with poor confidence.
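To make the recombination concrete, here is one rough sketch of splitting plus confidence-and-length-weighted voting. The function names, the minimum-piece-length cutoff, and the weighting scheme are arbitrary illustrative choices, not established practice:

```python
from collections import defaultdict

def split_tokens(tokens, size=512):
    """Split a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def combine_predictions(piece_preds, min_len=64):
    """Combine per-piece predictions into one document label.

    piece_preds: list of (label, confidence, piece_length) tuples.
    Pieces shorter than `min_len` tokens are dropped (e.g. the 52-token
    leftover from the example above); the remaining pieces vote with
    weight confidence * length.
    """
    votes = defaultdict(float)
    for label, conf, length in piece_preds:
        if length >= min_len:
            votes[label] += conf * length
    return max(votes, key=votes.get)
```

With the example above (A: 80%, B: 70%, A: 90%, B: 60%, C: 55%), the short C piece is dropped and A wins the weighted vote.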
It turns out that there is no easy way. Text is strange classification material, exhibiting all of the above issues (and more), while the difference in agreement with the annotation between the first-piece prediction and the ultimate, perfect classifier is typically slim at best.
So, spare the effort and strive for simplicity, performance, and heuristic... and clip it!
On details of the best practices you should probably refer to the article from this answer.

How to evaluate word2vec build on a specific context files

Using gensim word2vec, I built a CBOW model on a bunch of litigation files to represent words as vectors for a Named-Entity-Recognition problem, but I want to know how to evaluate my representation of words. If I use datasets like wordsim353 (NLTK) or other online datasets from Google, it doesn't work, because I built the model specifically for my domain dataset of files. How do I evaluate my word2vec word vectors? I want words belonging to similar contexts to be closer in vector space. How do I ensure that the built model is doing this?
I started by using a technique called odd-one-out. E.g.:
model.wv.doesnt_match("breakfast cereal dinner lunch".split()) --> 'cereal'
I created my own validation dataset using the words from the word2vec training data, and started evaluating by taking three words of a similar context and one odd word out of context. But the accuracy of my model is only 30%.
Will the above method really help in evaluating my w2v model? Or is there a better way?
I want to go with a word-similarity measure, but I need a (human-assessed) reference score to evaluate my model. Is there any technique to do this? Please do suggest any ideas or techniques.
Ultimately this depends on the purpose you intend for the word-vectors – your evaluation should mimic the final use as much as possible.
The "odd one out" approach may be reasonable. It's often done with just 2 words that are somehow, via external knowledge/categorization, known to be related (in the aspects that are important for your end use), then a 3rd word picked at random.
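A library-free harness for that evaluation might look like the sketch below. Here doesnt_match re-implements the mean-vector cosine rule that gensim's wv.doesnt_match uses, so the harness runs on any {word: vector} dict; the function names and data are illustrative:

```python
import numpy as np

def doesnt_match(vectors, words):
    """Pick the word least similar to the mean of the group, by cosine
    similarity. This mirrors gensim's wv.doesnt_match, but works on a
    plain {word: vector} dict so the harness stays library-free."""
    unit = {w: vectors[w] / np.linalg.norm(vectors[w]) for w in words}
    mean = np.mean(list(unit.values()), axis=0)
    return min(words, key=lambda w: unit[w] @ mean)

def odd_one_out_accuracy(vectors, triples):
    """triples: list of (related_a, related_b, odd) word tuples, where the
    first two are known-related terms and the third is picked at random.
    Returns the fraction of triples where the odd word is identified."""
    hits = sum(doesnt_match(vectors, list(t)) == t[2] for t in triples)
    return hits / len(triples)
```

With gensim you would simply pass model.wv.doesnt_match in place of the local implementation; the accuracy bookkeeping is the same.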
If you think your hand-crafted evaluation set is of high-quality for your purposes, but your word-vectors aren't doing well, it may just be that there are other problems with your training: too little data, errors in preprocessing, poorly-chosen metaparameters, etc.
You'd have to look at individual failure cases in more detail to pick what to improve next. For example, even when it fails at one of your odd-one-out tests, do the lists of most-similar words, for each of the words included, still make superficial sense in an eyeball-test? Does using more data or more training iterations significantly improve the evaluation scoring?
A common mistake during both training and evaluation/deployment is to retain too many rare words, on the (mistaken) intuition that "more info must be better". In fact, words with only a few occurrences can't get very high-quality vectors. (Compared to more-frequent words, their end vectors are more heavily influenced by the random original initialization, and by the idiosyncracies of their few occurrences available rather than their most-general meaning.) And further, their presence tends to interfere with the improvement of other nearby more-frequent words. Then, if you include the 'long tail' of weaker vectors in your evaluations, they tend to somewhat arbitrarily intrude in rankings ahead of common words with strong vectors, hiding the 'right' answers to your evaluation questions.
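In gensim this trimming is controlled by Word2Vec's min_count parameter; its effect can be sketched library-free (trim_rare_words is an illustrative helper, not a gensim function):

```python
from collections import Counter

def trim_rare_words(sentences, min_count=5):
    """Drop words with fewer than `min_count` total occurrences across the
    corpus, the same pruning gensim's Word2Vec applies via min_count."""
    freq = Counter(w for s in sentences for w in s)
    return [[w for w in s if freq[w] >= min_count] for s in sentences]
```

Raising min_count shrinks the vocabulary to words with enough occurrences to earn stable vectors, which also keeps the long tail of weak vectors out of your evaluation rankings.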
Also, note that the absolute value of an evaluation score may not be that important, because you're just looking for something that points your other optimizations in the right direction for your true end-goal. Word-vectors that are just slightly-better at precise evaluation questions might still work well-enough in other fuzzier information-retrieval contexts.

How to classify text with Knime

I'm trying to classify some data using knime with knime-labs deep learning plugin.
I have about 16,000 products in my DB, but I only know the category of about 700 of them.
I'm trying to classify as much as possible using some DM (data mining) technique. I've downloaded some plugins to knime, now I have some deep learning tools as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then applying the learner to it.
Then I train a DL4J learner with a DeepMLP (I don't really understand it all; it was the one that I thought gave the best results). Then I try to apply the model to the same data set.
I thought I would get the predicted classes as the result. But I'm getting an output_activations column that looks like it contains a pair of doubles, and when sorting this column I get related data close to each other. But I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In the column selection it's getting just the converted_document, and I selected des_categoria as the Label Column (learner node config). In the Predictor node I checked "Append SoftMax Predicted Label?".
The nom_produto is the text column that I'm trying to use to predict the des_categoria column that it the product category.
I'm really a newbie at DM and DL. If you could give me some help to solve this, that would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straightforward data mining task, because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train, though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
Introduction on Information Retrieval, in particular chapter 13;
Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (a single character such as "g" may be a useful token), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
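If you want to sanity-check this baseline outside KNIME, here is a rough Python sketch, with difflib's similarity ratio standing in for the Levenshtein/Jaro-Winkler distances; the function names and sample data are made up:

```python
from difflib import SequenceMatcher

def knn_predict(train, label_of, query, k=3):
    """Baseline k-NN over raw product names using a string similarity.

    train: list of product names with known categories;
    label_of: dict mapping name -> category;
    query: an unlabeled product name to classify.
    """
    sims = sorted(train,
                  key=lambda name: SequenceMatcher(None, query.lower(),
                                                   name.lower()).ratio(),
                  reverse=True)
    top = [label_of[name] for name in sims[:k]]  # k nearest neighbours
    return max(set(top), key=top.count)          # majority vote
```

Even this crude version gives you a number to beat before investing in a deep model.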
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k, you can include a Cross-Validation meta node inside of the said loop to obtain an estimate of the expected performance given k instead of only one point estimate per value of k. Use Cohen's Kappa as an optimization criterion, as proposed by the resource number (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate descriptive statistics on the performance metric(s) per iteration, and finally use a Statistics node. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
What next ?
Should lookup table or k-nn work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree. I'd like to find how to combine the scores into an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem can fit into the category of one-vs-all (OnevsAll) classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each cell representing one). Note that here your number of candidates will have to be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and the corresponding actual correct candidates. You will either need a function to generate the labels (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data the same way you did. This way it will maximize the number of correct outputs, but the definition of "correct" here will be however you classified the training data.
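As one illustrative (not definitive) sketch of that fixed-candidate, one-hot setup, here is a minimal softmax regression in plain numpy; the function names, shapes, and hyperparameters are all assumptions:

```python
import numpy as np

def train_onevsall(X, y, n_classes, lr=0.5, epochs=500):
    """Minimal softmax-regression stand-in for the one-output-cell-per-
    candidate network described above.

    X: (n_samples, n_features) rows of concatenated candidate scores;
    y: integer index of the correct candidate (a dedicated index can
       serve as the 'none' class). Returns a weight matrix.
    """
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]          # e.g. candidate 3 -> 0 0 1 0 0
    for _ in range(epochs):
        z = X @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)  # per-row softmax
        W -= lr * X.T @ (p - onehot) / len(X)  # gradient step
    return W

def predict(W, x):
    """Index of the output cell with the highest activation."""
    return int(np.argmax(x @ W))
```

A real network would add hidden layers and sigmoid/softmax units, but the training-target encoding (one-hot per candidate) is exactly the one described above.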
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each perform binary classification in between one of the classes vs all others. At the end the sigmoid with the highest probability is assigned 1 and rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you. :)

machine learning predict classification

I have the following problem. I have a training dataset comprising of a range of numbers. Each number belongs to a certain class. There are five classes.
Range:
1...10
Training Dataset:
{1,5,6,6,10,2,3,4,1,8,6,...}
Classes:
[1,2][3,4][5,6][7,8][9,10]
Is it possible to use a machine learning algorithm to find likelihoods for class prediction and what algorithm would be suited for this?
best, US
As described in the question's comment, I want to calculate the likelihood of a certain class to appear based on the given distribution of the training set,
the problem is trivial and hardly a machine learning one:
Simply count the number of occurrences of each class in the "training set", Count_12, Count_34, ... Count_910. The likelihood that a given class xy would appear is simply given by
P(xy) = Count_xy / Total Number of elements in the "training set"
= Count_xy / (Count_12 + Count_34 + Count_56 + Count_78 + Count_910)
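Using the sample from the question, the counting looks like this (class_of and class_likelihoods are illustrative helper names):

```python
from collections import Counter

def class_of(n):
    """Map a number 1..10 to its class index 0..4, per the question's
    classes [1,2][3,4][5,6][7,8][9,10]."""
    return (n - 1) // 2

def class_likelihoods(training):
    """P(xy) = Count_xy / total, exactly as in the formula above."""
    counts = Counter(class_of(n) for n in training)
    total = len(training)
    return {c: counts[c] / total for c in range(5)}
```

For the partial training set {1,5,6,6,10,2,3,4,1,8,6} this gives, e.g., P([5,6]) = 4/11.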
A more interesting problem...
...would be to consider the training set as a sequence and to guess what the next item in that sequence would be. The probability that the next item is from a given category would then not only be based on the prior for that category (the P(xy) computed above), but would also take into account the items which precede it in the sequence. One of the interesting parts of this problem would then be to figure out how "far back" to look and how much "weight" to give to the preceding sequences of items.
Edit (now that OP indicated his/her interest for the "more interesting problem").
This "prediction-given-preceding-sequence" problem maps almost directly to the
machine-learning-algorithm-for-predicting-order-of-events StackOverflow question.
The slight differences being that the alphabet here has 10 distinct codes (4 in the other question) and the fact that here we try to predict a class of codes, rather than just the code itself. With regard to this aggregation of (here) 2 codes per class, we have several options:
work with classes from the start, i.e. replace each code read in the sequence by its class, and only consider and keep track of classes then on.
work with codes only, i.e. create a predictor of 1-thru-10 codes, and only consider the class at the very end, adding the probability of the two codes which comprise a class, to produce the likelihood of the next item being of that class.
some hybrid solution: consider / work with the codes but sometimes aggregate to the class.
My personal choice would be to try the code predictor first (only aggregating at the very end), and to adapt from there if insight gained from this initial attempt suggests that the logic or its performance could be simplified or improved by aggregating earlier. Indeed, the very same predictor could be used to try both approaches; one would simply alter the input stream, replacing each even number by the odd number preceding it. I'm guessing that valuable information (for the purpose of guessing upcoming codes) is lost when we aggregate early.
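A minimal sketch of that code-level predictor, looking back a single item (a bigram model) and aggregating into the [1,2][3,4]... classes only at the final step; the helper name and the one-item lookback are illustrative simplifications:

```python
from collections import Counter, defaultdict

def next_class_probs(sequence, current, n_codes=10):
    """Estimate P(next code | current code) from the observed sequence,
    then sum code probabilities pairwise into classes. Aggregating only
    at this final step keeps the per-code information intact."""
    follow = Counter((a, b) for a, b in zip(sequence, sequence[1:]))
    total = sum(c for (a, _), c in follow.items() if a == current)
    probs = defaultdict(float)
    for code in range(1, n_codes + 1):
        if total:
            probs[(code - 1) // 2] += follow[(current, code)] / total
    return dict(probs)
```

Extending the lookback to longer preceding subsequences (with some weighting scheme) is exactly the "how far back, how much weight" question raised above.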
