transfer knowledge learned from distributed source domains - tensorflow-federated

To resolve the problem of non-iid data in federated learning, I read a paper which add a new node with a different data domain and transfer knowledge from decentralized nodes. My question is what is the information transfered, is that updates or data ?

In layman terms, non-idd means that not all class labels are distributed evenly between clients for training. For obvious reasons, in federated environment it is not feasible for every client to hold and train on idd data. With regards your specific query of how it works in the paper mentioned in your question, you may please share the link of the paper.

Related

Do unsupervised machine learning model features need to be independent?

I'm training an unsupervised machine learning model and want to make sure my features are as useful as possible!
Do unsupervised machine learning model featured need to be independent? For example, I have a feature (subscriptionId) that is the subscription Id of different cloud accounts within a Tenant. I also have a feature that is the resourceId of a resource within the subscription.
However, this resourceId contains the subscriptionId. Is it best practice to combine these features or remove one feature (e.g. subscriptionId) to avoid dependence and duplication among dataset features?
For unsupervised learning, commonly used for clustering, association, or dimensionality reduction, features don't need to be fully independent, but if you have many unique values it's likely that your models can learn to differentiate on these high entropy values instead of learning interesting or significant things as you might hope.
If you're working on generative unsupervised models, for customers, I cannot express how much risk this may create, for security and secret disclosure, for Oracle Cloud Infrastructure (OCI) customers. Generative models are premised on regurgitating their inputs, and thousands of papers have been written on getting private information back out of trained models.
It's not clear what problem you're working on, and the question seems early in its formulation.
I recommend you spend time delving into the limits of statistics and data science, which are the foundation of modern popular machine learning methods.
Once you have an idea of what questions can be answered well by ML, and what can't, then you might consider something like fastAI's course.
https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3
https://www.nature.com/articles/nmeth.4642
Again, depending on how the outputs will be used or who can view or (even indirectly) query the model, it seems unwise to train on private values, especially if you want to generate outputs. ML methods are only useful if you have access to a lot of data, and if you have access to the data of many users, you need to be good steward of Oracle Cloud customer data.

Transfer Learning for small datasets of structured data

I am looking to implement machine learning for a problems that are built on small data sets related to approvals of expenses in a specific supply chain domain. Typically labelled data is unavailable
I was looking to build models in one data set that I have labelled data and then use that model developed in similar contexts- where the feature set is very similar, but not identical. The expectation is that this allows the starting point for recommendations and gather labelled data in the new context.
I understand this is the essence of Transfer Learning. Most of the examples I read in this domain speak of image data sets- any guidance how this can be leveraged in small data sets using standard tree-based classification algorithms
I can’t really speak to tree-based algos, I don’t know how to do transfer learning with them. But, for deep learning models, the customary method for transfer learning is to load up a pretrained model, then retrain the last layer of the dataset using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
raghu, I believe you are looking for a kernel method when you are saying abstraction layer in deep learning. There are several ML algorithms that support kernel functions. With kernel functions, you might be able to do it; but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using Decision Tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
Ok with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?
After some research, we have decided to proceed with random forest models with the intuition that trees in the original model that have common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that comprise of (a)only new features and (b) combination of old and new features
This has worked to provide reasonable results in initial trials

Data prediction from previous data history using AI/ML

I am looking for solutions where I can automatically approve or disapprove different supplier invoices based on historical data.
Let's say, I got an invoice from an HP laptop supplier and based on the previous data, I have to approve or reject that invoice.
Basically, I want to make a decision or prediction based on the data already available based on the history with artificial intelligence, machine learning or any other cloud service
This isn't a direct question though but you can start by looking into various methods of classifications. There is a huge amount of material available online. Try reading about K-Nearest Neighbors, Naive Bayes, K-means, etc. to get an idea about how algorithms in Machine Learning domain work. Once you start understanding what is written in the documentation then start implementing them. You will face a lot of problems which you can search online and I'm sure you will find most of them answered here in this portal.

Incorporating user feedback in a ML model

I have developed a ML model for a classification (0/1) NLP task and deployed it in production environment. The prediction of the model is displayed to users, and the users have the option to give a feedback (if the prediction was right/wrong).
How can I continuously incorporate this feedback in my model ? From a UX stand point you dont want a user to correct/teach the system more than twice/thrice for a specific input, system shld learn fast i.e. so the feedback shld be incorporated "fast". (Google priority inbox does this in a seamless way)
How does one build this "feedback loop" using which my system can improve ? I have searched a lot on net but could not find relevant material. any pointers will be of great help.
Pls dont say retrain the model from scratch by including new data points. Thats surely not how google and facebook build their smart systems
To further explain my question - think of google's spam detector or their priority inbox or their recent feature of "smart replies". Its a well known fact that they have the ability to learn / incorporate (fast) user feed.
All the while when it incorporates the user feedback fast (i.e. user has to teach the system correct output atmost 2-3 times per data point and the system start to give correct output for that data point) AND it also ensure it maintains old learnings and does not start to give wrong outputs on older data points (where it was giving right output earlier) while incorporating the learning from new data point.
I have not found any blog/literature/discussion w.r.t how to build such systems - An intelligent system that explains in detaieedback loop" in ML systems
Hope my question is little more clear now.
Update: Some related questions I found are:
Does the SVM in sklearn support incremental (online) learning?
https://datascience.stackexchange.com/questions/1073/libraries-for-online-machine-learning
http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/
https://en.wikipedia.org/wiki/Concept_drift
Update: I still dont have a concrete answer but such a recipe does exists. Read the section "Learning from the feedback" in the following blog Machine Learning != Learning Machine. In this Jean talks about "adding a feedback ingestion loop to machine". Same in here, here, here4.
There could be couple of ways to do this:
1) You can incorporate the feedback that you get from the user to only train the last layer of your model, keeping the weights of all other layers intact. Intuitively, for example, in case of CNN this means you are extracting the features using your model but slightly adjusting the classifier to account for the peculiarities of your specific user.
2) Another way could be to have a global model ( which was trained on your large training set) and a simple logistic regression which is user specific. For final predictions, you can combine the results of the two predictions. See this paper by google on how they do it for their priority inbox.
Build a simple, light model(s) that can be updated per feedback. Online Machine learning gives a number of candidates for this
Most good online classifiers are linear. In which case we can have a couple of them and achieve non-linearity by combining them via a small shallow neural net
https://stats.stackexchange.com/questions/126546/nonlinear-dynamic-online-classification-looking-for-an-algorithm

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
Of the terms you mentioned, "Machine Learning" is the one most used by Academic Departments to describe their Curricula, their academic departments, and their research programs, as well as the term most used in academic journals and conferences proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications, often are, but that's not a formal requirement. In addition, the term Data Mining seems usually to refer to application of some process flow on big data (i.e, > 2BG) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval. But it's only one source of tools. But IR doesn't depend on ML--for instance, a particular IR project might be storage and rapid retrieval of the fully-indexed data responsive to a user's search query IR, the crux of which is optimizing performance of the data flow, i.e., the round-trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps) which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc. on the variables (columns).
Lastly consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the 'Netflix Prize' were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately access the algorithm's performance in the actual commercial setting--so for instance overall execution speed (how quickly are the recommendations delivered to the user) would probably be considered along with accuracy.
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning are techniques to generalize existing knowledge to new data, as accurate as possible.
Data mining is primarly about discovering something hidden in your data, that you did not know before, as "new" as possible.
They intersect and often use techniques of one another. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques, for example a pattern in the data set that is useful for generalization might be a new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used
for decision making by people.
Machine learning is about learning a model to classify new objects.

Resources