Transfer Learning for small datasets of structured data - machine-learning

I am looking to implement machine learning for a problems that are built on small data sets related to approvals of expenses in a specific supply chain domain. Typically labelled data is unavailable
I was looking to build models in one data set that I have labelled data and then use that model developed in similar contexts- where the feature set is very similar, but not identical. The expectation is that this allows the starting point for recommendations and gather labelled data in the new context.
I understand this is the essence of Transfer Learning. Most of the examples I read in this domain speak of image data sets- any guidance how this can be leveraged in small data sets using standard tree-based classification algorithms

I can’t really speak to tree-based algos, I don’t know how to do transfer learning with them. But, for deep learning models, the customary method for transfer learning is to load up a pretrained model, then retrain the last layer of the dataset using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.

raghu, I believe you are looking for a kernel method when you are saying abstraction layer in deep learning. There are several ML algorithms that support kernel functions. With kernel functions, you might be able to do it; but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using Decision Tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.

Ok with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?

After some research, we have decided to proceed with random forest models with the intuition that trees in the original model that have common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that comprise of (a)only new features and (b) combination of old and new features
This has worked to provide reasonable results in initial trials

Related

Do unsupervised machine learning model features need to be independent?

I'm training an unsupervised machine learning model and want to make sure my features are as useful as possible!
Do unsupervised machine learning model featured need to be independent? For example, I have a feature (subscriptionId) that is the subscription Id of different cloud accounts within a Tenant. I also have a feature that is the resourceId of a resource within the subscription.
However, this resourceId contains the subscriptionId. Is it best practice to combine these features or remove one feature (e.g. subscriptionId) to avoid dependence and duplication among dataset features?
For unsupervised learning, commonly used for clustering, association, or dimensionality reduction, features don't need to be fully independent, but if you have many unique values it's likely that your models can learn to differentiate on these high entropy values instead of learning interesting or significant things as you might hope.
If you're working on generative unsupervised models, for customers, I cannot express how much risk this may create, for security and secret disclosure, for Oracle Cloud Infrastructure (OCI) customers. Generative models are premised on regurgitating their inputs, and thousands of papers have been written on getting private information back out of trained models.
It's not clear what problem you're working on, and the question seems early in its formulation.
I recommend you spend time delving into the limits of statistics and data science, which are the foundation of modern popular machine learning methods.
Once you have an idea of what questions can be answered well by ML, and what can't, then you might consider something like fastAI's course.
https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3
https://www.nature.com/articles/nmeth.4642
Again, depending on how the outputs will be used or who can view or (even indirectly) query the model, it seems unwise to train on private values, especially if you want to generate outputs. ML methods are only useful if you have access to a lot of data, and if you have access to the data of many users, you need to be good steward of Oracle Cloud customer data.

What is the difference between machine learning and deep learning in building a chatbot?

To be more specific, The traditional chatbot framework consists of 3 components:
NLU (1.intent classification 2. entity recognition)
Dialogue Management (1. DST 2. Dialogue Policy)
NLG.
I am just confused that If I use a deep learning model(seq2seq, lstm, transformer, attention, bert…) to train a chatbot, Is it cover all those 3 components? If so, could you explain more specifically how it related to those 3 parts? If not, how can I combine them?
For example, I have built a closed-domain chatbot, but it is only task-oriented which cannot handle the other part like greeting… And it can’t handle the problem of Coreference Resolution (it seems doesn't have Dialogue Management).
It seems like your question can be split into two smaller questions:
What is the difference between machine learning and deep learning?
How does deep learning factor into each of the three components of chatbot frameworks?
For #1, deep learning is an example of machine learning. Think of your task as a graphing problem. You transform your data so it has an n-dimensional representation on a plot. The goal of the algorithm is to create a function that represents a line drawn on the plot that (ideally) cleanly separates the points from one another. Each sector of the graph represents whatever output you want (be it a class/label, related words, etc). Basic machine learning creates a line on a 'linearly separable' problem (i.e. it's easy to draw a line that cleanly separates the categories). Deep learning enables you to tackle problems where the line might not be so clean by creating a really, really, really complex function. To do this, you need to be able to introduce multiple dimensions to the mapping function (which is what deep learning does). This is a very surface-level look at what deep learning does, but that should be enough to handle the first part of your question.
For #2, a good quick answer for you is that deep learning can be a part of each component of the chatbot framework depending on how complex your task is. If it's easy, then classical machine learning might be good enough to solve your problem. If it's hard, then you can begin to look into deep learning solutions.
Since it sounds like you want the chatbot to go a bit beyond simple input-output matching and handle complicated semantics like coreference resolution, your task seems sufficiently difficult and a good candidate for a deep learning solution. I wouldn't worry so much about identifying a specific solution for each of the chatbot framework steps because the tasks involved in each of those steps blend into one another with deep learning (e.g. a deep learning solution wouldn't need to classify intent and then manage dialogue, it would simply learn from hundreds of thousands of similar situations and apply a variation of the most similar response).
I would recommend handling the problem as a translation problem - but instead of translating from one language to another, you're translating from the input query to the output response. Translation frequently needs to resolve coreference and solutions people have used to solve that might be an ideal course of action for you.
Here are some excellent resources to read up on in order to frame your problem and how to solve it:
Google's Neural Machine Translation
Fine Tuning Tasks with BERT
There is always a trade-off between using traditional machine learning models and using deep learning models.
Deep learning models require large data to train and there will be an increase in training time & testing time. But it will give better results.
Traditional ML models work well with fewer data with moderate performance comparatively. The inference time is also less.
For Chatbots, latency matters a lot. And the latency depends on the application/domain.
If the domain is banking or finance, people are okay with waiting for a few seconds but they are not okay with wrong results. On the other hand in the entertainment domain, you need to deliver the results at the earliest.
The decision depends on the application domain + the data size you are having + the expected precision.
RASA is something worth looking into.

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used - accuracy. Look at precision/recall/F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class as opposed to simply comparing classes - this can be done via unsupervised learning learning first (such as with autoencoders). A good article with this available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion - after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (I.e. many zebra errors)

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
Go through the HTML source of pages
from a certain site and "understand"
which sections form the content,
which the advertisements and which
form the metadata ( neither the
content, nor the ads - for eg. -
TOC, author bio etc )
Go through the HTML source of pages
from disparate sites and "classify"
whether the site belongs to a
predefined category or not ( list of
categories will be supplied
beforhand )1.
... similar classification tasks on
text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot of training and maintainance than putting SVMs to use?
I understand that SVMs are well suited to ( binary ) classification tasks like mine, and open source framworks like libSVM are fairly mature?
In that case, what subjects and topics
does a computer science graduate need
to learn right now, so that the above
requirements can be solved, putting
these frameworks to use?
I would like to stay away from Java, is possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations from you on learning specific portions of statistics and probability theory is nothing unexpected from my side, so say that if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learn is the equivalent of having a model. The model can be for example a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods work best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will like to preprocess the data first, this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge and you'll have to extract useful information, or otherwise your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with features for each row) from which you'll be able to create your model.
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you will use at the end to decide on what is the best method. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint on the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models seem to appear good, but this is just memorization.
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik wrote "The Nature of Statistical Learning Theory", he is the inventor of SVMs. You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are now entering machine learning field, so I'd really like to suggest to have a look at this book: not only it provides a deep and vast overview on the most common machine learning approaches and algorithms (and their variations) but it also provides a very good set of exercises and scientific paper links. All of this is wrapped in an insightful language starred with a minimal and yet useful compendium about statistics and probability

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
Of the terms you mentioned, "Machine Learning" is the one most used by Academic Departments to describe their Curricula, their academic departments, and their research programs, as well as the term most used in academic journals and conferences proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications, often are, but that's not a formal requirement. In addition, the term Data Mining seems usually to refer to application of some process flow on big data (i.e, > 2BG) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval. But it's only one source of tools. But IR doesn't depend on ML--for instance, a particular IR project might be storage and rapid retrieval of the fully-indexed data responsive to a user's search query IR, the crux of which is optimizing performance of the data flow, i.e., the round-trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps) which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc. on the variables (columns).
Lastly consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the 'Netflix Prize' were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately access the algorithm's performance in the actual commercial setting--so for instance overall execution speed (how quickly are the recommendations delivered to the user) would probably be considered along with accuracy.
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning are techniques to generalize existing knowledge to new data, as accurate as possible.
Data mining is primarly about discovering something hidden in your data, that you did not know before, as "new" as possible.
They intersect and often use techniques of one another. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques, for example a pattern in the data set that is useful for generalization might be a new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used
for decision making by people.
Machine learning is about learning a model to classify new objects.

Resources