When true positives are rare - machine-learning

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.

One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data for the under-represented class (e.g. find more pictures of zebras). You could also generate more data by duplicating existing examples with transformations (e.g. flipping them horizontally).
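As a rough illustration of the duplicate-with-transformations idea, here is a minimal sketch in plain Python. The images are toy nested lists and the names flip_horizontal and oversample_minority are made up for this example; a real pipeline would use an image library and more varied transformations.

```python
import random

def flip_horizontal(image):
    """Mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def oversample_minority(samples, minority_label, target_count, seed=0):
    """Add flipped copies of minority-class images until that class has target_count samples.

    samples is a list of (image, label) pairs; a new list is returned.
    """
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    if not minority:
        raise ValueError("no minority samples to augment")
    augmented = list(samples)
    count = len(minority)
    while count < target_count:
        image, label = rng.choice(minority)
        augmented.append((flip_horizontal(image), label))
        count += 1
    return augmented

# Toy data (hypothetical): 2x3 "images" of integers, 6 horses and 1 zebra.
horse = [[1, 2, 3], [4, 5, 6]]
zebra = [[7, 8, 9], [0, 1, 2]]
data = [(horse, "horse")] * 6 + [(zebra, "zebra")]

balanced = oversample_minority(data, "zebra", target_count=6)
zebra_count = sum(1 for _, label in balanced if label == "zebra")
print(len(balanced), zebra_count)
```

Copies like these should only go into the training set. If they leak into the test set, the evaluation will look better than it really is.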
You could also evaluate the classifier with a metric other than the one usually used, accuracy. Look at precision, recall and the F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
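To make the point about metrics concrete, here is a small sketch that computes accuracy, precision, recall and F1 by hand. The two sets of predictions are invented for illustration (95 horses, 5 zebras); they are not results from any real model.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical test set: 95 horses followed by 5 zebras.
y_true = ["horse"] * 95 + ["zebra"] * 5

# Classifier A always answers "horse".
always_horse = ["horse"] * 100

# Classifier B raises 10 false alarms on horses but finds 4 of the 5 zebras.
some_zebras = ["zebra"] * 10 + ["horse"] * 85 + ["zebra"] * 4 + ["horse"]

for name, y_pred in [("always horse", always_horse), ("some zebras", some_zebras)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred, positive="zebra")
    print(name, accuracy(y_true, y_pred), round(p, 3), round(r, 3), round(f1, 3))
```

Classifier A wins on accuracy (0.95 against 0.89) but has zero recall for zebras, while classifier B finds most of them. Selecting models on recall or F1 for the rare class is what steers you toward B.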

With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class as opposed to simply comparing classes - this can be done via unsupervised learning first (such as with autoencoders). A good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (i.e. where there are many zebra errors).

Related

How to give a logical reason for choosing a model

I used machine learning to classify depression-related sentences, and LinearSVC performed best. In addition to LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. What I want, though, is to be able to reason in advance about which model will fit, like the ml_map provided by scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than the statement that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try working with different example datasets of different data types, using different algorithms. There are hundreds to explore. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
And here are my thoughts (I used to ask such questions myself, so I hope this helps if you are struggling): the more you work on different machine learning models for a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of what group of algorithms to use to deal with certain types of data and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum up, in order to achieve good out-of-sample performance (i.e. good generalization) while solving a problem, it is essential to run the training/testing process with different combinations of settings and to be mindful of your data (for example, answer this question: does it cover most samples in terms of their distribution in the wild, or just a portion of it?).

What is the difference between machine learning and deep learning in building a chatbot?

To be more specific, the traditional chatbot framework consists of 3 components:
NLU (1. intent classification 2. entity recognition)
Dialogue Management (1. DST 2. Dialogue Policy)
NLG.
I am confused about this: if I use a deep learning model (seq2seq, LSTM, transformer, attention, BERT…) to train a chatbot, does it cover all 3 of those components? If so, could you explain more specifically how it relates to those 3 parts? If not, how can I combine them?
For example, I have built a closed-domain chatbot, but it is only task-oriented and cannot handle other parts such as greetings… It also can't handle coreference resolution (it seems not to have Dialogue Management).
It seems like your question can be split into two smaller questions:
What is the difference between machine learning and deep learning?
How does deep learning factor into each of the three components of chatbot frameworks?
For #1, deep learning is a subset of machine learning. Think of your task as a graphing problem. You transform your data so it has an n-dimensional representation on a plot. The goal of the algorithm is to create a function that represents a line drawn on the plot that (ideally) cleanly separates the points from one another. Each sector of the graph represents whatever output you want (be it a class/label, related words, etc). The simplest machine learning models draw a straight line, which works on a 'linearly separable' problem (i.e. it's easy to draw a line that cleanly separates the categories). Deep learning enables you to tackle problems where the line might not be so clean by creating a really, really, really complex function. To do this, it stacks multiple layers of transformations in the mapping function, each one re-representing the data produced by the layer before it. This is a very surface-level look at what deep learning does, but that should be enough to handle the first part of your question.
For #2, a good quick answer for you is that deep learning can be a part of each component of the chatbot framework depending on how complex your task is. If it's easy, then classical machine learning might be good enough to solve your problem. If it's hard, then you can begin to look into deep learning solutions.
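To show how the three components fit together as separate stages, here is a deliberately tiny rule-based sketch. Everything in it (the intents, the banking templates, the function names) is hypothetical; the point is only that each stage has its own input and output, so any one of them could be swapped for a classical or deep learning model without touching the others.

```python
import re

def nlu(utterance):
    """NLU: map text to (intent, entities). A learned classifier could replace these rules."""
    text = utterance.lower()
    if re.search(r"\b(hi|hello|hey)\b", text):
        intent = "greet"
    elif "balance" in text:
        intent = "check_balance"
    else:
        intent = "unknown"
    match = re.search(r"\b(savings|checking)\b", text)
    entities = {"account": match.group(1)} if match else {}
    return intent, entities

def update_state(state, intent, entities):
    """Dialogue state tracking: merge new information into what is already known."""
    new_state = dict(state)
    new_state.update(entities)
    # A bare entity ("savings") is treated as an answer to the previous question,
    # so the earlier intent is kept in that case.
    if intent != "unknown" or not entities:
        new_state["last_intent"] = intent
    return new_state

def policy(state):
    """Dialogue policy: choose the next system action from the tracked state."""
    intent = state.get("last_intent")
    if intent == "greet":
        return "greet_back"
    if intent == "check_balance":
        return "give_balance" if "account" in state else "ask_account"
    return "fallback"

TEMPLATES = {
    "greet_back": "Hello! How can I help?",
    "ask_account": "Which account: savings or checking?",
    "give_balance": "Here is the balance of your {account} account.",
    "fallback": "Sorry, I did not understand that.",
}

def nlg(action, state):
    """NLG: turn the chosen action into text with a template."""
    return TEMPLATES[action].format(**state)

def respond(state, utterance):
    intent, entities = nlu(utterance)
    state = update_state(state, intent, entities)
    return state, nlg(policy(state), state)

state = {}
for turn in ["Hi there", "What is my balance?", "savings"]:
    state, reply = respond(state, turn)
    print(turn, "->", reply)
```

Note that the third turn only works because the state carries the earlier intent forward; that carried-over context is the job of dialogue management.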
Since it sounds like you want the chatbot to go a bit beyond simple input-output matching and handle complicated semantics like coreference resolution, your task seems sufficiently difficult and a good candidate for a deep learning solution. I wouldn't worry so much about identifying a specific solution for each of the chatbot framework steps because the tasks involved in each of those steps blend into one another with deep learning (e.g. a deep learning solution wouldn't need to classify intent and then manage dialogue, it would simply learn from hundreds of thousands of similar situations and apply a variation of the most similar response).
I would recommend handling the problem as a translation problem - but instead of translating from one language to another, you're translating from the input query to the output response. Translation frequently needs to resolve coreference and solutions people have used to solve that might be an ideal course of action for you.
Here are some excellent resources to read up on in order to frame your problem and how to solve it:
Google's Neural Machine Translation
Fine Tuning Tasks with BERT
There is always a trade-off between using traditional machine learning models and using deep learning models.
Deep learning models require large amounts of data to train, and training and inference take longer, but with enough data they often give better results.
Traditional ML models work well with less data, usually with more moderate performance by comparison. Their inference time is also shorter.
For Chatbots, latency matters a lot. And the latency depends on the application/domain.
If the domain is banking or finance, people are okay with waiting for a few seconds but they are not okay with wrong results. On the other hand in the entertainment domain, you need to deliver the results at the earliest.
The decision depends on the application domain, the amount of data you have, and the expected precision.
RASA is something worth looking into.

Transfer Learning for small datasets of structured data

I am looking to apply machine learning to problems built on small data sets related to approvals of expenses in a specific supply chain domain. Typically, labelled data is unavailable.
I was looking to build a model on one data set for which I have labelled data, and then use that model in similar contexts where the feature set is very similar but not identical. The expectation is that this provides a starting point for recommendations while labelled data is gathered in the new context.
I understand this is the essence of transfer learning. Most of the examples I have read in this domain deal with image data sets. Is there any guidance on how this can be leveraged for small data sets using standard tree-based classification algorithms?
I can't really speak to tree-based algos; I don't know how to do transfer learning with them. But for deep learning models, the customary method for transfer learning is to load up a pretrained model, retrain the last layer of the network using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
raghu, I believe you are looking for a kernel method when you are saying abstraction layer in deep learning. There are several ML algorithms that support kernel functions. With kernel functions, you might be able to do it; but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using Decision Tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
Ok with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
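Here is a small sketch of that term swap, assuming the tree is stored as nested dicts. The structure, the 500 threshold and the names classify and rename_term are invented for illustration; trees exported from a real library look different, but the idea of rewriting a split value carries over.

```python
def classify(tree, expense):
    """Walk the tree until a leaf (a plain string) is reached."""
    node = tree
    while isinstance(node, dict):
        if "equals" in node:
            passed = expense[node["feature"]] == node["equals"]
        else:
            passed = expense[node["feature"]] <= node["at_most"]
        node = node["yes"] if passed else node["no"]
    return node

def rename_term(tree, old, new):
    """Return a copy of the tree with every categorical test value old replaced by new."""
    if not isinstance(tree, dict):
        return tree
    copy = dict(tree)
    if copy.get("equals") == old:
        copy["equals"] = new
    copy["yes"] = rename_term(tree["yes"], old, new)
    copy["no"] = rename_term(tree["no"], old, new)
    return copy

# Hypothetical tree learned at the construction company.
construction_tree = {
    "feature": "category", "equals": "workboots",
    "yes": "deny",
    "no": {"feature": "amount", "at_most": 500, "yes": "approve", "no": "review"},
}

# Same structure, reused at the accounting firm with one term swapped.
accounting_tree = rename_term(construction_tree, "workboots", "laptops")

print(classify(accounting_tree, {"category": "laptops", "amount": 1200}))
print(classify(accounting_tree, {"category": "travel", "amount": 80}))
```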
Does that make sense, and is that helpful to you?
After some research, we have decided to proceed with random forest models, with the intuition that trees in the original model that use only features common to both contexts will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that consist of (a) only new features and (b) combinations of old and new features.
This has provided reasonable results in initial trials.
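For anyone wanting to try the same idea, the selection and replacement steps might be sketched as below. This is only one reading of the approach, with toy hand-written "trees" (a feature set plus a predict function) and hypothetical feature names; it is not tied to any particular random forest library.

```python
from collections import Counter

def usable_trees(forest, available_features):
    """Keep the trees whose split features all exist in the new context."""
    available = set(available_features)
    return [tree for tree in forest if tree["features"] <= available]

def swap_in(forest, new_tree):
    """Replace the first tree inherited from the source context, or append if none is left."""
    for i, tree in enumerate(forest):
        if tree["origin"] == "source":
            return forest[:i] + [new_tree] + forest[i + 1:]
    return forest + [new_tree]

def majority_vote(forest, row):
    votes = Counter(tree["predict"](row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical forest trained in the original context.
source_forest = [
    {"origin": "source", "features": {"amount", "category"},
     "predict": lambda r: "deny" if r["amount"] > 1000 else "approve"},
    {"origin": "source", "features": {"amount", "site_id"},
     "predict": lambda r: "deny" if r["site_id"] == "remote" else "approve"},
    {"origin": "source", "features": {"category"},
     "predict": lambda r: "deny" if r["category"] == "gifts" else "approve"},
    {"origin": "source", "features": {"amount"},
     "predict": lambda r: "deny" if r["amount"] > 5000 else "approve"},
]

# The new context lacks site_id but adds client_code.
new_features = ["amount", "category", "client_code"]
start = usable_trees(source_forest, new_features)

# Later, a tree trained on newly labelled data replaces an inherited one.
new_tree = {"origin": "target", "features": {"amount", "client_code"},
            "predict": lambda r: "deny" if r["client_code"] == "X" else "approve"}
updated = swap_in(start, new_tree)

row = {"amount": 2000, "category": "gifts", "client_code": "X"}
print(len(start), majority_vote(start, row), majority_vote(updated, row))
```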

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
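The arithmetic is easy to check. In the sketch below the figure of 380 rows is an assumption (one row per match in a 20-team league season); substitute the actual row count of your file.

```python
n_variables = 10
values_per_variable = 10

# Every variable can take any of its values independently of the others.
configurations = values_per_variable ** n_variables

n_rows = 380  # assumed size of one season of match data
max_coverage = n_rows / configurations  # best case: every row is a distinct configuration

def configurations_for(n_vars, n_values=10):
    """Size of the search space for a given number of variables."""
    return n_values ** n_vars

print(configurations)
print(max_coverage)
print(configurations_for(3))
```

Cutting ten variables down to three shrinks the space from ten billion configurations to a thousand, which is the scale at which a few hundred rows start to say something.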
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to find the most relevant variable using entropy at each level in the tree. Each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one, and it will be the one whose first selected variable has the most overall significance when considered alone.
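The entropy-based choice of variable can be sketched in a few lines. This shows plain information gain on made-up data; C4.5, which J48 implements, normalizes this quantity into a gain ratio, so treat the code as the core idea and not as Weka's exact criterion.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """The attribute a greedy tree builder would split on first."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical match data: venue fully determines the result, form tells nothing.
rows = [
    {"venue": "home", "form": "good"},
    {"venue": "home", "form": "bad"},
    {"venue": "away", "form": "good"},
    {"venue": "away", "form": "bad"},
]
labels = ["win", "win", "loss", "loss"]

for attribute in ["venue", "form"]:
    print(attribute, information_gain(rows, labels, attribute))
print(best_attribute(rows, labels, ["venue", "form"]))
```

Because each split is chosen greedily on a single variable, a variable that only matters in combination with others can score close to zero here and never be picked, which is the limitation described above.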
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model's performance on the training set to exceed its performance on unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best-case scenario would be to build the model on all cases except one (the leave-one-out approach), test it on the case that was held out, repeat this for every case, and then see how the model performs on average.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model, for example by trying different model parameters until you get an acceptable test response. With J48 it's not easy for this bias to creep in, but if it does then you have just used your test set as an auxiliary training set.
Continue collecting more data and keep testing for as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
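The difference between checking against the training set and the leave-one-out approach can be seen on a toy example. The sketch below uses a 1-nearest-neighbour rule as a stand-in model (not J48) and made-up one-dimensional data, purely to show the procedure.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in model: label of the closest training point (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]

def training_set_accuracy(xs, ys, predict):
    """Optimistic check: every case is scored by a model that has already seen it."""
    return sum(predict(xs, ys, x) == y for x, y in zip(xs, ys)) / len(xs)

def leave_one_out_accuracy(xs, ys, predict):
    """Hold out each case in turn, train on the rest, and average the results."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)

# Hypothetical data: one numeric feature, two classes that overlap near 4 and 5.
xs = [(0,), (2,), (4,), (10,), (12,), (5,)]
ys = ["a", "a", "a", "b", "b", "b"]

print(training_set_accuracy(xs, ys, nearest_neighbour_predict))
print(leave_one_out_accuracy(xs, ys, nearest_neighbour_predict))
```

On this data the training-set check reports a perfect score while leave-one-out gets only 4 of 6 cases right, which is exactly the overconfidence the list warns about.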
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
(1) The training data set should cover the 'significant' variable configurations
(2) You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements and which form the metadata (neither the content nor the ads, e.g. TOC, author bio etc.)
Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand)
... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and open source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations on learning specific portions of statistics and probability theory would not be unexpected, so do say so if that is required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods works best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML pages, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge and you'll have to extract useful information, or otherwise your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with one row per page and one column per feature) from which you'll be able to create your model.
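To give an idea of what feature extraction on HTML could look like, here is a minimal sketch using only Python's standard html.parser module. The five features chosen (tag counts, text length, share of text inside links) are assumptions for illustration; which features actually separate content from ads is the domain-knowledge part.

```python
from html.parser import HTMLParser

class PageFeatures(HTMLParser):
    """Collects simple counts while the parser walks through a page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_chars = 0
        self.link_text_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._link_depth:
            self.link_text_chars += n

def extract_features(html):
    """Turn one page into one row of the feature matrix."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    text = parser.text_chars
    return [
        parser.tag_counts.get("p", 0),    # number of paragraphs
        parser.tag_counts.get("a", 0),    # number of links
        parser.tag_counts.get("img", 0),  # number of images
        text,                             # visible characters (roughly)
        parser.link_text_chars / text if text else 0.0,  # share of text inside links
    ]

page = '<div><p>Hello world</p><a href="/x">ad</a><img src="b.png"></div>'
print(extract_features(page))
```

Calling extract_features on every page (or on every candidate section of a page) and stacking the rows gives the matrix that a classifier is then trained on.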
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you will use at the end to decide on what is the best method. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint on the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models seem to appear good, but this is just memorization.
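The effect of memorization is easy to demonstrate with an extreme case. The sketch below splits off a test set and then "trains" a model that is nothing more than a lookup table; the page names and labels are invented, and the helper names are not from any library.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_memorizer(train):
    """An extreme overfitter: remembers every training example and nothing else."""
    table = dict(train)
    return lambda x: table.get(x)  # returns None for inputs it has never seen

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# Hypothetical labelled pages, identified only by a unique name.
rows = [("page-%d" % i, "ad" if i % 3 == 0 else "content") for i in range(50)]

train, test = train_test_split(rows)
model = fit_memorizer(train)

print(accuracy(model, train))  # 1.0: perfect on what it memorized
print(accuracy(model, test))   # 0.0: useless on pages it has not seen
```

Real models sit somewhere between these two numbers, and only the hidden test set tells you where.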
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned, Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik, one of the inventors of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are just entering the machine learning field, so I'd really suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), but it also provides a very good set of exercises and links to scientific papers. All of this is written in insightful language and accompanied by a minimal yet useful compendium on statistics and probability.
