Lex confidence score is not accurate - amazon-lex

I have a Lex bot with one intent (PolicyIntent), and one training phrase. The training phrase is "What is the policy system status?"
No matter what I type, such as "What time is it?", Lex matches this intent with confidence scores as high as 0.8, even though the phrase is completely different.
Has anyone dealt with this type of problem before? What did you do? Is there some way to correct the confidence score (e.g. reduce anything less than 1.0 by 20%)?
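In case it helps, here is a minimal sketch of one workaround, assuming the Lex V2 runtime API via boto3; the bot IDs, the 0.85 threshold, and the exact response field names are placeholders/assumptions. The idea is to reject low-confidence matches rather than rescale the score:

```python
# Hedged sketch: treat low-confidence Lex matches as "no intent" rather than
# trying to rescale the score. Bot IDs and the threshold are placeholders.
import boto3

lex = boto3.client("lexv2-runtime")

def classify(text, threshold=0.85):
    resp = lex.recognize_text(
        botId="BOT_ID",             # placeholder
        botAliasId="BOT_ALIAS_ID",  # placeholder
        localeId="en_US",
        sessionId="test-session",
        text=text,
    )
    top = resp["interpretations"][0]            # highest-confidence interpretation
    score = top.get("nluConfidence", {}).get("score", 0.0)
    intent = top["intent"]["name"]
    # With one intent and one sample utterance, Lex has little to compare
    # against, so thresholding (plus adding more intents/utterances) tends to
    # be more robust than subtracting a fixed 20% from every score below 1.0.
    if score < threshold:
        return None, score
    return intent, score
```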

Related

What do confidence scores mean in speech recognition?

A lot of speech-to-text services (such as Google's) provide a confidence score. At least for Google it is between 0 and 1, but it is clearly not the probability that a particular transcription is correct, since the confidences for the alternative transcriptions add up to more than 1. Also, a higher-confidence result is sometimes ranked lower.
So, what is it? Is there a recognized meaning of 'confidence score' in the speech recognition community? I have seen references to minimum Bayes risk but even if that is what they are doing, this doesn't much answer the question since that depends on a choice of auxiliary loss function.
"but is clearly not the probability that a particular transcription is correct, as confidences for alternative transcriptions add up to more than 1"
Statistical algorithms never give you the true value of a probability, they give you estimates. An estimate might not be accurate in individual cases; rather, on average the estimates approach the ideal. The confidence therefore has to be calibrated. You can check some of the theory in:
Calibration of Confidence Measures in Speech Recognition. Dong Yu, Jinyu Li, Li Deng (IEEE).
https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/ConfidenceCalibration.pdf
"Is there a recognized meaning of 'confidence score' in the speech recognition community?"
Not really, everyone uses their own algorithms, from simple Bayes risk (which is not the best estimate at all) to much more advanced methods. It is not really possible to know what Google does. Kaldi also ships an implementation of a good calibration algorithm: https://github.com/kaldi-asr/kaldi/blob/master/egs/ami/s5/local/confidence_calibration.sh
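To make the calibration idea concrete, here is a minimal sketch of post-hoc calibration in the style of Platt scaling, assuming you have a held-out set of (raw score, was-the-transcription-correct) pairs; the numbers below are made up for illustration:

```python
# Platt-style calibration sketch: learn a mapping from raw confidence scores
# to an estimate of P(correct | score) on held-out labelled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_scores = np.array([0.31, 0.55, 0.62, 0.74, 0.81, 0.90, 0.95]).reshape(-1, 1)
correct    = np.array([0,    0,    1,    0,    1,    1,    1])   # made-up labels

calibrator = LogisticRegression()
calibrator.fit(raw_scores, correct)

# Calibrated confidences for new raw scores.
print(calibrator.predict_proba(np.array([[0.5], [0.8]]))[:, 1])
```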

How to handle <UKN> tokens in text generation

In my text-generation dataset, I have converted all infrequent words into the <ukn> token (unknown word), as suggested by most of the text-generation literature.
However, when training an RNN to take part of a sentence as input and predict the rest of the sentence, I am not sure how I should stop the network from generating <ukn> tokens.
When the network encounters an unknown (infrequent) word in the training set, what should its output be?
Example:
Sentence: I went to the mall and bought a <ukn> and some groceries
Network input: I went to the mall and bought a
Current network output: <ukn> and some groceries
Desired network output: ??? and some groceries
What should it be outputting instead of the <ukn>?
I don't want to build a generator that outputs words it does not know.
An RNN gives you a probability distribution over the tokens most likely to appear next in your text. In your code you choose the token with the highest probability, in this case <ukn>.
In this case you can simply omit the <ukn> token and take the next most likely token that the RNN suggests, based on the probability values it produces.
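A minimal sketch of that idea at generation time, assuming `logits` is the model's output over the vocabulary and `vocab` maps tokens to indices (both names are illustrative):

```python
# Pick the most likely token while masking out <ukn>.
import numpy as np

def next_token(logits, vocab, banned=("<ukn>",)):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    for tok in banned:
        if tok in vocab:
            probs[vocab[tok]] = 0.0           # zero out the unknown token
    probs /= probs.sum()                      # renormalize
    inverse = {i: t for t, i in vocab.items()}
    return inverse[int(np.argmax(probs))]     # greedy pick; sampling works too
```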
I've seen <UNK> occasionally, but never <UKN>.
Even more common in word-embedding training is dropping rare words entirely, to keep vocabularies compact and to avoid having words without sufficient examples serve as 'noise' in the training of other words. (Folding them all into a single magic unknown token – which then becomes more frequent than real tokens! – just tends to throw a big unnatural pseudo-word with no clear meaning into every other word's contexts.)
So I'm not sure it's accurate to describe this as "suggested by most text-generation literature". And to the extent it might be, wouldn't any source suggesting this also suggest what to do when a prediction is the UNK token?
If your specific application needs a real known word instead, even when the NN has low confidence that the right word is any known word, it would seem you'd just read the next-best non-<UKN> prediction from the NN, as suggested by @petezurich's answer.

Identifying positivity and negativity separately from a negative dataset

First of all, I would like you to know that I am new to machine learning (ML). I am working on a project that detects how positive or negative a set of words is, so I have created a database containing possible negative words, so that ML can take place and predict an overall score for how positive or negative the whole set of words is.
My questions are: is it possible to classify positive words with only negative words in the dataset? And if it is possible, does it affect the accuracy of prediction?
No, it's not generally possible. The model will have no way to differentiate among (1) new negative phrases; (2) neutral phrases; (3) positive phrases. In fact, with only negative phrases, the model will have a hard time learning that "bad" and "not bad" are opposites, as it has seen plenty of "not" references in the negative literature, such as "not worth watching, even for free."

How to calculate precision and recall of a web service ranking algorithm?

I want to calculate the precision and recall of a web service ranking algorithm. I have different web services in a database.
A customer specifies some conditions in his/her search. According to the customer's requirements, my algorithm should assign a score to each web service in the database and retrieve the web services with the highest scores.
I have searched the net and read all the questions on this site about this topic, and I know about precision and recall, but I don't know how to calculate them in my case. The most relevant result was this link:
http://ijcsi.org/papers/IJCSI-8-3-2-452-460.pdf
According to this article,
Precision = Highest rank score / Total rank score of all services
Recall= Highest rank score / Score of 2nd highest service
But, I think it is not true. Can you help me please?
Thanks a lot.
There is no such thing as "precision and recall for ranking". Precision and recall are defined for binary classification tasks and extended to multi-label tasks. Rankings require different measures, as this is a much more complex problem. There are numerous ways to compute something similar to precision and recall; I will summarize some basic approaches for precision (recall goes similarly):
limit the search algorithm to some K best results and count a true positive for each query where the desired result is among those K results, so precision is the fraction of queries for which you can find a relevant result in the K best outputs
a very strict variation of the above: set K=1, meaning the result has to come back as "the best of all"
assign weights to each position, so for example you give 1/T of a "true positive" to each query where the valid result came T'th; in other words, if the valid result was not returned you assign 1/inf = 0, if it was first on the list then 1/1 = 1, if second 1/2, etc., and precision is simply the mean of these scores (a short sketch of the K-best and 1/T ideas follows below)
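Here is a small sketch of those two approaches, assuming `results` maps each query to its ranked list of service IDs and `relevant` maps each query to its single desired service (both names and the toy data are illustrative):

```python
# Precision@K over queries, plus the 1/T (reciprocal-rank) weighting.
def precision_at_k(results, relevant, k=5):
    hits = sum(1 for q, ranked in results.items() if relevant[q] in ranked[:k])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    total = 0.0
    for q, ranked in results.items():
        if relevant[q] in ranked:
            total += 1.0 / (ranked.index(relevant[q]) + 1)   # 1/T, 0 if missing
    return total / len(results)

results = {"q1": ["s3", "s1", "s7"], "q2": ["s2", "s9", "s4"]}   # toy rankings
relevant = {"q1": "s1", "q2": "s4"}
print(precision_at_k(results, relevant, k=2), mean_reciprocal_rank(results, relevant))
```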
As lejlot pointed out, using "precision and recall for ranking" is a weird way to measure ranking performance. The definitions of "precision" and "recall" are very "customized" in the paper you referenced:
"It is a measure of the tradeoff between the precision and recall of the particular ranking algorithm. Precision is the accuracy of the ranks, i.e. how well the algorithm has ranked the services according to the user preferences. Recall is the deviation between the top ranked service and the next relevant service in the list. Both these metrics are used together to arrive at the f-measure which then tests the algorithm efficiency."
Probably the original authors had some specific motivation for using such a definition. Some usual metrics for evaluating ranking algorithms include:
Normalized discounted cumulative gain, or nDCG (used in a lot of Kaggle competitions); a small sketch follows below
Precision@K, Recall@K
This paper also listed a few common ranking measures.
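For reference, a minimal nDCG@K sketch, where `gains` are graded relevance scores in the order your algorithm ranked the services (the data is illustrative):

```python
# nDCG@K: discounted cumulative gain normalized by the ideal ordering.
import math

def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 0, 2, 1], k=4))   # toy relevance grades in ranked order
```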
This is what I could think of:
Recall could be the fraction of queries where the user's click falls within the top 5 results, and precision could be the fraction where the user clicks the very first result rather than any of the rest. I don't know, though; it seems very vague to speak about precision and recall in such a scenario.

Neural networks for email spam detection

Let's say you have access to an email account with the history of received emails from the last few years (~10k emails), classified into two groups:
genuine email
spam
How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam?
Let's assume that the email fetching is already in place and we need to focus on classification part only.
The main points which I would hope to get answered would be:
Which parameters to choose as the input for the NN, and why?
What structure of the NN would most likely work best for such task?
Also any resource recommendations, or existing implementations (preferably in C#) are more than welcome
Thank you
EDIT
I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection.
Also, it is a "toy problem" simply to explore the subject of neural networks and spam.
If you insist on NNs... I would calculate some features for every email.
Character-based, word-based, and vocabulary features (about 97 as I count them):
Total no of characters (C)
Total no of alpha chars / C (ratio of alpha chars)
Total no of digit chars / C
Total no of whitespace chars/C
Frequency of each letter / C (36 letters of the keyboard – A-Z, 0-9)
Frequency of special chars / C (10 chars, e.g. *, _, +, =, %, $, #, \, /)
Total no of words (M)
Total no of short words / M (two letters or less)
Total no of chars in words/C
Average word length
Avg. sentence length in chars
Avg. sentence length in words
Word length frequency distribution / M (ratio of words of length n, for n between 1 and 15)
Type-token ratio (no. of unique words / M)
Hapax legomena (frequency of once-occurring words)
Hapax dislegomena (frequency of twice-occurring words)
Yule’s K measure
Simpson’s D measure
Sichel’s S measure
Brunet’s W measure
Honore’s R measure
Frequency of punctuation (18 punctuation chars: . , ; ? ! : ( ) – “ « » < > [ ] { })
You could also add some more features based on the formatting: colors, fonts, sizes, ... used.
Most of these measures can be found online, in papers, or even Wikipedia (they're all simple calculations, probably based on the other features).
So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
The inputs would need to be normalized according to your current pre-classified corpus.
I'd split it into two groups, use one as the training group and the other as the testing group, never mixing them; maybe at a 50/50 train/test ratio, with similar spam/non-spam proportions in each.
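A rough sketch of that pipeline, with only a handful of the ~100 features listed above, a placeholder corpus, and scikit-learn's MLP standing in for the hand-built network:

```python
# Sketch: character/word features -> normalization -> small feed-forward net.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

def features(text):
    c = max(len(text), 1)
    words = text.split()
    m = max(len(words), 1)
    return [
        len(text),                              # total no of characters (C)
        sum(ch.isalpha() for ch in text) / c,   # alpha chars / C
        sum(ch.isdigit() for ch in text) / c,   # digit chars / C
        sum(ch.isspace() for ch in text) / c,   # whitespace chars / C
        len(words),                             # total no of words (M)
        sum(len(w) <= 2 for w in words) / m,    # short words / M
        len(set(words)) / m,                    # type/token ratio
    ]

emails = ["cheap meds now !!!", "meeting moved to noon",        # placeholder corpus
          "win $$$ fast, click here", "attached the quarterly report"]
labels = [1, 0, 1, 0]                                           # 1 = spam

X = np.array([features(e) for e in emails])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, stratify=labels)

scaler = StandardScaler().fit(X_tr)                 # normalize against the corpus
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
print(clf.score(scaler.transform(X_te), y_te))      # accuracy on the held-out half
```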
Are you set on doing it with a Neural Network? It sounds like you're set up pretty well to use Bayesian classification, which is outlined well in a couple of essays by Paul Graham:
A Plan for Spam
Better Bayesian Filtering
The classified history you have access to would make a very strong corpus to feed to a Bayesian algorithm; you'd probably end up with quite an effective result.
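For comparison with the NN route, here is a minimal sketch of that Bayesian approach using a bag-of-words model and multinomial naive Bayes; the tiny corpus is a placeholder for the ~10k classified emails:

```python
# Bag-of-words counts + multinomial naive Bayes as a spam/not-spam classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "see you at the standup tomorrow",   # placeholders
          "free viagra best price", "minutes from today's meeting"]
labels = [1, 0, 1, 0]                                                  # 1 = spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["free prize meeting"]))   # predicted class for a new email
```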
You'll basically have an entire sub-problem of feature extraction, of similar scope to designing and training the neural net itself. Where I would start, if I were you, is slicing and dicing the input text in a large number of ways, each one being a potential feature input along the lines of "this neuron signals 1.0 if 'price' and 'viagra' occur within 3 words of each other", and culling those according to best absolute correlation with spam identification.
I'd start by taking my best 50 to 200 input feature neurons and hooking them up to a single output neuron (values trained for 1.0 = spam, -1.0 = not spam), i.e. a single-layer perceptron. I might try a multi-layer backpropagation net if that worked poorly, but wouldn't be holding my breath for great results.
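A tiny sketch of that single-layer idea, with made-up co-occurrence features and scikit-learn's Perceptron standing in (all names and data are illustrative):

```python
# Binary co-occurrence features ("do these two words appear within 3 words of
# each other?") fed to a single-layer perceptron.
from sklearn.linear_model import Perceptron

def cooccur(text, a, b, window=3):
    words = text.lower().split()
    idx_a = [i for i, w in enumerate(words) if w == a]
    idx_b = [i for i, w in enumerate(words) if w == b]
    return 1.0 if any(abs(i - j) <= window for i in idx_a for j in idx_b) else 0.0

pairs = [("price", "viagra"), ("free", "offer")]          # illustrative feature pairs
emails = ["low price viagra here", "offer letter attached",
          "free offer price cut", "lunch at noon"]        # placeholder corpus
labels = [1, 0, 1, 0]                                     # 1 = spam

X = [[cooccur(e, a, b) for a, b in pairs] for e in emails]
clf = Perceptron().fit(X, labels)
print(clf.predict(X))
```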
Generally, my experience has led me to believe that neural networks will show mediocre performance at best in this task, and I'd definitely recommend something Bayesian as Chad Birch suggests, if this is something other than a toy problem for exploring neural nets.
Chad, the answers you've gotten so far are reasonable, but I'll respond to your update that:
I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection.
Well, then you have a problem: an empirical test like this can't prove unsuitability.
You're probably best off learning a bit about what NNs actually do and don't do, to see why they are not a particularly good idea for this sort of classification problem. A helpful way to think about them is as universal function approximators. But for some idea of how this all fits together in the area of classification (which is what the spam filtering problem is), browsing an intro text like Pattern Classification might be helpful.
Failing that, if you are dead set on seeing it run, just use any general NN library for the network itself. Most of your effort is going to go into how to represent the input data anyway. The 'best' structure is non-obvious, and it probably doesn't matter that much. The inputs will have to be a number of (normalized) measurements (features) on the corpus itself. Some are obvious (counts of 'spam' words, etc.), some much less so. This is the part you can really play around with, but you should expect to do poorly compared to Bayesian filters (which have their own problems here), due to the nature of the problem.

Resources