MS Luis.ai | Maximum number of utterances per intent/app - machine-learning

Following up on my question on the project's GitHub page (https://github.com/Microsoft/Cognitive-LUIS-Windows/issues/17), my question is:
"Is there a maximum number of utterances that LUIS can handle per app/per intent?"
So, if we throw 2000 utterances at a single intent, and we have, say, 20 intents per LUIS app, are we at risk of 'overloading/overfitting' LUIS? Will its training-time performance degrade considerably?
I know we could just go ahead and do that, throw thousands of utterances at each intent and back up a copy of our LUIS apps, but if someone who helped make LUIS already knows, that would be great.

Yes, training will be ridiculously slow, especially if you have custom features (phrase features).
The idea of LUIS is to enable getting a reasonable model from few examples, so you might be overloading the app for insignificant gain.
If you want to improve prediction accuracy, you might need to check the quality of the utterances you provide, not just add more.
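If you do want to audit what you already have before adding thousands more, a quick pass over an exported app can show per-intent counts and exact duplicates. A minimal sketch, assuming the app was exported to JSON from the LUIS portal and that the export contains an utterances array with text and intent fields (the file name here is hypothetical):

```python
import json
from collections import Counter

# Load an app exported from the LUIS portal (file name is hypothetical).
with open("my_luis_app.json", encoding="utf-8") as f:
    app = json.load(f)

# Count utterances per intent; assumes the export carries an "utterances"
# array whose items have "text" and "intent" fields.
per_intent = Counter(u["intent"] for u in app.get("utterances", []))
for intent, count in per_intent.most_common():
    print(f"{intent}: {count} utterances")

# Flag exact duplicates, which add training time without adding signal.
texts = Counter(u["text"].strip().lower() for u in app.get("utterances", []))
dupes = {t: n for t, n in texts.items() if n > 1}
print(f"{len(dupes)} utterances appear more than once")
```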

Related

How do I represent the chromosome using a genetic algorithm?

My task is to calculate clashes between an alert time schedule and the user's calendar schedule, in order to generate a clash-free alert time schedule.
How should I represent the chromosome for this problem?
How should I represent the time slots (binary or numeric)?
Thank you.
(Please consider that I'm a beginner in genetic algorithm studies.)
The first questions would be: what have you tried so far, and how good are your results? Your problem is also stated rather unspecifically, so here is what I can give:
The chromosome should probably be the start times of the alerts in your schedule (if I understood your problem correctly).
Just as important is thinking about how you want to evaluate and calculate the fitness of your individuals: here, clashes (e.g. the number of overlapping appointments, or the total overlap time), though you may well find better heuristics that give better solutions or faster convergence.
Binary or numeric encodings can both work. I usually go for numbers whenever there is no strong reason not to (they are easier to interpret, debug, etc.), but binary encodings offer some nice opportunities with respect to mutation and recombination.
I strongly recommend experimenting with and reading about these choices. They might look like a lot of extra work to implement, but you should come to see them as hyperparameters that need to be tuned to get the best outcome.
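To make the representation concrete, here is a minimal Python sketch, not a full GA: the chromosome is a list of alert start times (one integer slot per alert) and the fitness counts clashes with the user's calendar. The slot granularity, the hypothetical busy_slots calendar, the mutation rate, and the population sizes are all assumptions you would tune.

```python
import random

SLOTS_PER_DAY = 96                   # assumed 15-minute slots; tune as needed
busy_slots = {20, 21, 22, 40, 41}    # hypothetical calendar: occupied slot indices

def random_chromosome(num_alerts):
    """One gene per alert: the slot index at which that alert fires."""
    return [random.randrange(SLOTS_PER_DAY) for _ in range(num_alerts)]

def fitness(chromosome):
    """Lower is better: number of alerts landing on a busy calendar slot."""
    return sum(1 for slot in chromosome if slot in busy_slots)

def mutate(chromosome, rate=0.1):
    """Move each alert to a random slot with probability `rate`."""
    return [random.randrange(SLOTS_PER_DAY) if random.random() < rate else gene
            for gene in chromosome]

def crossover(a, b):
    """Single-point crossover between two parent schedules."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

# Tiny generational loop: keep the best half, refill with mutated offspring.
population = [random_chromosome(5) for _ in range(30)]
for _ in range(100):
    population.sort(key=fitness)
    parents = population[:15]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(15)]
    population = parents + children

best = min(population, key=fitness)
print("best schedule:", best, "clashes:", fitness(best))
```

Selection here is plain truncation (keep the best half); tournament or roulette selection would slot into the same place if you want more diversity.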

Is LIBSVM suitable for many categories and samples?

I'm building a text classifier, which should be able to give probabilities that a document belongs to certain categories (e.g. 80% fiction, 30% marketing, etc.).
I believe LIBSVM does this via the "predict" method, but the problem is that I have approximately 20 categories to test for. Also, I have several hundred documents that can be used for training.
The problem is that the training file grows to 1-2 GB, and this makes LIBSVM extremely slow.
How can this issue be solved? Should I go for LIBLINEAR instead, or are there better options?
Regarding this specific question, I had to use LIBLINEAR, as LIBSVM kept running forever.
But if anyone wants to know how it eventually turned out:
I switched from PHP/C++ to Python, which was tremendously easier, and I did not encounter any memory issues.
My case was "multi-labelling". This article put me in the right direction, and the magpie project helped me accomplish the task.
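For anyone hitting the same wall, here is a hedged sketch of the linear route in Python with scikit-learn, whose LinearSVC wraps LIBLINEAR: TF-IDF features plus a one-vs-rest linear classifier handles ~20 categories and a few hundred documents easily, and CalibratedClassifierCV turns the SVM margins into per-category probabilities. The corpus, labels, and parameters below are placeholders, not the original poster's setup.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; in practice this would be your few hundred documents.
docs = [
    "a thrilling tale of dragons and lost kingdoms",
    "the detective followed the trail through the fog",
    "buy now, limited offer, huge discounts",
    "sign up today and boost your brand reach",
    "quarterly earnings rose on strong demand",
    "the central bank held interest rates steady",
]
labels = ["fiction", "fiction", "marketing", "marketing", "finance", "finance"]

# LinearSVC uses LIBLINEAR under the hood; wrapping it in
# CalibratedClassifierCV turns margins into probabilities, and
# one-vs-rest gives an independent score per category.
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(CalibratedClassifierCV(LinearSVC(), cv=2)),
)
clf.fit(docs, labels)

probs = clf.predict_proba(["a story about spaceships and ancient magic"])[0]
for category, p in zip(clf.classes_, probs):
    print(f"{category}: {p:.2f}")
```

For true multi-label data (a document carrying several labels at once), the labels would be binarized with MultiLabelBinarizer, but the shape of the pipeline stays the same.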

Evaluation metrics for algorithms to calculate topic hotness

How do you evaluate algorithms that calculate the hotness of a post? That is, how would you know which performs better: an exponential decay or Reddit's algorithm? I understand the question may be a bit naive, but I am looking for performance metrics or cost functions to help with this.
As with the evaluation of any piece of software, you first have to set out the problems you want it to solve and from those derive the goals you want to achieve. Once you have those, you can start to determine which metrics will provide a useful approximation of progress towards the goals.
Perhaps you want your site to be great at breaking news. You probably will derive goals from that like "given sufficient votes, a new post should be able to make it to the top 30 listings in the first 10 minutes after it's posted". Then you can build out some test cases and see if you meet them.
Or perhaps you want to be the place with the "best" stuff from across the web. Your goals will weight more heavily towards user approval than newness.
You have to evaluate your own situation to come up with reasonable performance metrics.
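As a concrete illustration of the "breaking news" goal above, here is a small Python sketch: it scores posts with a simple exponential-decay formula (the half-life and the formula itself are illustrative assumptions, not Reddit's actual algorithm) and then runs a goal-derived test checking that a well-voted ten-minute-old post surfaces near the top.

```python
import math

HALF_LIFE_HOURS = 6.0   # assumed decay half-life; a tunable hyperparameter

def hotness(votes, age_hours):
    """Simple exponential-decay score: votes discounted by post age."""
    return votes * math.exp(-math.log(2) * age_hours / HALF_LIFE_HOURS)

# Hypothetical snapshot: (post_id, votes, age in hours)
posts = [("old_hit", 5000, 48.0), ("steady", 300, 6.0), ("breaking", 120, 10 / 60)]

ranked = sorted(posts, key=lambda p: hotness(p[1], p[2]), reverse=True)
top_ids = [p[0] for p in ranked[:2]]

# Goal-derived test: a well-voted 10-minute-old post should rank near the top.
assert "breaking" in top_ids, "new post failed to surface quickly enough"
print("ranking:", [p[0] for p in ranked])
```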

Should I remove test samples that are identical to some training sample?

I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know whether you're talking about a couple of repetitions in a million samples or 10 repetitions in 15 samples.
In general I don't find what you're doing reasonable. I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control: you can't just assume you're going to be evaluated on a data point you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, and it would be highly questionable to train and test with the same picture of the same subject. In fact, I wouldn't be comfortable training and testing with frames of the same video, let alone the same frame.
EDIT:
There are two separate issues:
1. The distribution permits these repetitions to happen naturally. I believe you; you know your experiment, you know your data, you're the expert.
2. You are getting a boost from the repeated points, and that boost is possibly unfair. One way to address your advisor's concern is to evaluate how much leverage you're actually getting from the repeated data points. Generate 20 test cases: 10 in which you train on 1000 samples and test on 33, making sure there are no repetitions in the 33, and another 10 in which you train on 1000 and test on 33 with repetitions allowed as they occur naturally. Report the mean and standard deviation of both experiments.
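A rough Python sketch of that experiment, assuming the feature vectors and labels live in NumPy arrays and using scikit-learn's LogisticRegression as a stand-in classifier (the classifier choice and the random placeholder data are assumptions; the 1000/33 split sizes are just the numbers above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def run_trial(X, y, dedup, train_size=1000, test_size=33):
    """Train on `train_size` points, test on `test_size`; optionally drop
    test points whose feature vector also appears in the training set."""
    idx = rng.permutation(len(X))
    train_idx, rest = idx[:train_size], idx[train_size:]
    X_train, y_train = X[train_idx], y[train_idx]

    train_rows = {tuple(row) for row in X_train} if dedup else set()
    test_idx = []
    for i in rest:
        if dedup and tuple(X[i]) in train_rows:
            continue                       # skip exact repetitions of training vectors
        test_idx.append(i)
        if len(test_idx) == test_size:
            break

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X[test_idx], y[test_idx])

# X, y would be your real data; shown here as hypothetical random placeholders.
X = rng.integers(0, 3, size=(1100, 20))
y = rng.integers(0, 2, size=1100)

for dedup in (True, False):
    scores = [run_trial(X, y, dedup) for _ in range(10)]
    print(f"dedup={dedup}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```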
It depends... Your adviser suggested the common practice: you usually test a classifier on samples which have not been used for training. If only a few samples in the test set match the training set, your results are not going to differ statistically because of the reappearance of the same vectors. If you want to be formal and still keep your logic, you have to show that the reappearance of the same vectors has no statistically significant effect on the testing process. If you showed this, I would accept your logic. See this ebook on statistics in general, and this chapter as a starting point on statistical significance and null hypothesis testing.
Hope I helped!
In as much as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is where it's possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values for each of your features, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
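To put a rough number on that last point: the standard error of an accuracy estimate measured on n test samples is about sqrt(p(1-p)/n), so with n = 33 the uncertainty is large. A quick back-of-the-envelope check in Python (the 0.80 accuracy is just an illustrative value):

```python
import math

def accuracy_std_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

for n in (33, 330):
    se = accuracy_std_error(0.80, n)
    print(f"n={n}: accuracy 0.80 +/- {1.96 * se:.2f} (approx. 95% interval)")
```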
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).

Quality vs. ROI - When is Good Enough, good enough? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
UPDATED: I'm asking this from a development perspective; however, to illustrate, a canonical non-development example that comes to mind is that if it costs, say, $10,000 to keep an uptime rate of 99%, then it can theoretically cost $100,000 to keep a rate of 99.9%, and possibly $1,000,000 to keep a rate of 99.99%.
Somewhat like a limit in calculus, as we get ever closer to 100% the cost can increase exponentially. Therefore, as a developer or PM, where do you decide that the deliverable is "good enough" given the time and monetary constraints? For example, are you getting a good ROI at 99%, 99.9%, or 99.99%?
I'm using a non-development example because I'm not sure of a solid metric for development. Maybe in the above example "uptime" could be replaced with "function point to defect ratio", or some other reasonable measure of the rate of bugs vs. the complexity of the code. I would also welcome input regarding all stages of the software development lifecycle.
Keep the classic Project Triangle constraints in mind (quality vs. speed vs. cost). And let's assume that the customer wants the best quality you can deliver given the original budget.
There's no way to answer this without knowing what happens when your application goes down.
If someone dies when your application goes down, uptime is worth spending millions or even billions of dollars on (aerospace, medical devices).
If someone may be injured if your software goes down, uptime is worth hundreds of thousands or millions of dollars (industrial control systems, auto safety devices).
If someone loses millions of dollars if your software goes down, uptime is worth spending millions on (financial services, large e-commerce apps).
If someone loses thousands of dollars if your software goes down, uptime is worth spending thousands on (retail, small e-commerce apps).
If someone will swear at the computer and lose productivity while it reboots when your software goes down, then uptime is worth spending thousands on (most internal software).
etc.
Basically take (cost of going down) x (number of times the software will go down) and you know how much to spend on uptime.
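A back-of-the-envelope version of that arithmetic in Python; all the figures below are made up purely for illustration:

```python
# Hypothetical figures: cost of a single outage and expected outages per year
# at each uptime tier. Spend on uptime up to the expected annual loss it avoids.
cost_per_outage = 25_000                                   # dollars per incident (assumed)
outages_per_year = {"99%": 12, "99.9%": 4, "99.99%": 1}    # assumed incident rates

for tier, n in outages_per_year.items():
    expected_loss = cost_per_outage * n
    print(f"{tier} uptime: ~${expected_loss:,} expected annual downtime loss")
```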
The quality vs. good enough discussion I've seen puts the practical ROI at fixing 95% of defects. Obviously show-stoppers and critical defects are fixed (and there are always exceptions, like airplane autopilots, that cannot tolerate so many defects).
I can't seem to find the reference for the 95% figure; it is either in Rapid Development or in Applied Software Measurement by Capers Jones.
Here is a link to a useful strategy for attacking code quality:
http://www.gamedev.net/reference/articles/article1050.asp
The client, of course, would likely balk at that number and might say no more than 1 hour of downtime per year is acceptable. That's 12 times more stable. Do you tell the customer, sorry, we can't do that for $100,000, or do you make your best attempt, hoping your analysis was conservative?
Flat out tell the customer that what they want isn't reasonable. Gaining that kind of uptime would require a massive amount of money, and realistically, consistently hitting that percentage of uptime just isn't possible.
I personally would go back to the customer and tell them that you'll provide the best setup possible for $100,000, and set up an outage report guideline: for every outage, we will complete an investigation into why it happened and what we will do to make the chances of it happening again almost nonexistent.
I think offering SLAs is just a mistake.
I think the answer to this question depends entirely on the individual application.
Software that has an impact on human safety has much different requirements than, say, an RSS feed reader.
The project triangle is a gross simplification. In many cases you can actually save time by improving quality, for example by reducing rework and avoiding maintenance costs. This is not only true in software development; Toyota's lean production proved that this works in manufacturing too.
The whole process of software development is far too complex to make generalizations on cost vs quality. Quality is a fuzzy concept that consists of multiple factors. Is testable code of higher quality than performant code? Is maintainable code of higher quality than testable code? Do you need testable code for an RSS reader or performant code? And for a fly-by-wire F16?
It's more productive to make informed decisions on a case-by-case basis. And don't be afraid to over-invest in quality. It's usually much cheaper and safer than under-investing.
To answer in an equally simplistic way...
...when you stop hearing from the customers (and not because they stopped using your product), except for enhancement requests and bouquets :)
And it's not a triangle; it has four corners: cost, time, quality and scope.
To expand on what "17 of 26" said, the answer depends on value to the customer. In the case of critical software, like aircraft controller applications, the value to the customer of a high quality rating, by whatever measure they use, is quite high. To the user of an RSS feed reader, the value of high quality is considerably lower.
It's all about the customer (notice I didn't say user - sometimes they're the same, and sometimes they're not).
Chasing the word "Quality" is like chasing the horizon. I have never seen anything (in the IT world or outside) that is 100% quality. There's always room for improvement.
Secondly, "quality" is an overly broad term. It means something different to everyone and is subjective in its degree of implementation.
That being said, every effort boils down to what "engineering" means: making the right choices to balance cost, time, and key characteristics (i.e. speed, size, shape, weight, etc.). These are constraints.
