Feature weightage from Azure Machine Learning Deployed Web Service

I am trying to make predictions from my past data, which has around 20 attribute columns and a label. Out of those 20, only 4 are significant for prediction. But I also want to know, when a row falls into one of the classified categories, which other important correlated columns (apart from those 4) contributed and what their weights are. I want to get that result from my deployed web service on Azure.

You can use the Permutation Feature Importance module, but that will give the importance of the features across the whole sample set. Retrieving the weights on a per-call basis is not available in Azure ML.
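For reference, a minimal sketch of computing permutation feature importance offline with scikit-learn (outside Azure ML, so this is not something the deployed web service itself returns); the classifier and the synthetic data below are placeholders:

```python
# Sketch: dataset-level permutation feature importance with scikit-learn.
# This mirrors what the Permutation Feature Importance module reports,
# i.e. importance over a sample set, not per prediction call.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the ~20 attribute columns + label.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f}")
```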

Related

Cascading two Machine Learning models

I built a machine learning (ML) model to classify real-time network traffic as an attack or normal traffic using a dataset consisting of approximately 3 million records.
Then, I built a second ML model to classify the real-time network traffic according to their application, i.e., Google, Facebook, YouTube, etc. using another dataset consisting of approximately 1.5 million records.
Now I want to cascade these two models so that if the traffic is normal, then the traffic should be classified by the second ML model. Otherwise, it should be discarded since there is no need to pass through the second model.
Can I cascade these two models even though they are built using different datasets? And if so, how can I do that?
I would do the cascading logic simply in code, in C++ or Python, rather than relying on ML-tool features. If the data from the second model doesn't contribute to the decision of the first model, just keep the models separate.
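As an illustration, a minimal sketch of that cascade in plain Python; attack_model, app_model, and extract_features are hypothetical stand-ins for the two trained models and their feature pipelines:

```python
# Sketch of cascading two independently trained classifiers:
# model 1 filters out attack traffic, model 2 labels the application
# of the traffic that passes the first check.
def classify_traffic(raw_packet, attack_model, app_model, extract_features):
    features_1 = extract_features(raw_packet, schema="attack")  # features used by model 1
    if attack_model.predict([features_1])[0] == "attack":
        return "discarded"  # no need to run model 2

    features_2 = extract_features(raw_packet, schema="application")  # features used by model 2
    return app_model.predict([features_2])[0]  # e.g. "Google", "YouTube", ...
```

Because each model only ever sees features from its own dataset, the fact that the two models were trained on different datasets is not a problem for this kind of cascade.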

How to build a federated learning model with an unbalanced and small dataset

I am working to build a federated learning model using TFF and I have some questions:
I am preparing the dataset. I have separate files of data with the same features but different samples, and I would consider each of these files a single client. How can I set this up in TFF?
The data is not balanced, meaning the size of the data varies between files. Does this affect the modeling process?
The dataset is also fairly small: one file (client) has around 300 records and another 1,500. Is that enough to build a federated learning model?
Thanks in advance
You can create a ClientData for your dataset; see Working with tff's ClientData.
The dataset doesn't have to be balanced to build a federated learning model. In https://arxiv.org/abs/1602.05629, the server takes a weighted federated average of the clients' model updates, where the weights are the number of samples each client has.
A few hundred records per client is no less than the EMNIST dataset, so that should be fine. As for the total number of clients: this tutorial shows FL with 10 clients; you can run the colab with a smaller NUM_CLIENTS to see how it works on the example dataset.
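For illustration, a minimal sketch of treating each file as one client in a TFF simulation, assuming one CSV file per client and a label column named "label" (both assumptions); the ClientData wrapper from the linked guide can be built around the same per-client function:

```python
# Sketch: one CSV file per client -> one tf.data.Dataset per client.
import tensorflow as tf

# Assumed layout: client_files maps a client id to its CSV path.
client_files = {"client_0": "data/client_0.csv", "client_1": "data/client_1.csv"}

def dataset_for_client(client_id):
    # Each file becomes that client's local dataset.
    return tf.data.experimental.make_csv_dataset(
        client_files[client_id],
        batch_size=20,
        label_name="label",  # assumed label column name
        num_epochs=1,
        shuffle=True)

# In TFF simulations, a round of federated training typically consumes a
# list of client datasets like this one (or a ClientData wrapping the
# same per-client function, as described in the linked guide).
federated_train_data = [dataset_for_client(c) for c in client_files]
```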

Classification of industry based on tags

I have a dataset (1M entries) on companies where all companies are tagged based on what they do.
For example, Amazon might be tagged with "Retail;E-Commerce;SaaS;Cloud Computing" and Google would have tags like "Search Engine;Advertising;Cloud Computing".
So now I want to analyze a cluster of companies, e.g. all online marketplaces like Amazon, eBay, Etsy, and the like. But there is no single tag that I can look for; instead, I have to use a set of tags to quantify the likelihood that a company is a marketplace.
For example tags like "Retail", "Shopping", "E-Commerce" are good tags, but then there might be some small consulting agencies or software development firms that consult / build software for online marketplaces and have tags like "consulting;retail;e-commerce" or "software development;e-commerce;e-commerce tools", which I want to exclude as they are not online marketplaces.
I'm wondering what the best way is to identify all online marketplaces in my dataset. Which machine learning algorithm is suited to selecting the maximum number of companies that are in the industry I'm looking for while excluding the ones that are obviously not part of it?
I thought about supervised learning, but I'm not sure because of a few issues:
Labelling is needed, which means I would have to go through thousands of companies and flag them for multiple industries (marketplace, finance, fashion, ...), as I'm interested in 20-30 industries overall.
There are more than 1,000 tags associated with the companies. How would I define my features? One feature per tag would lead to massive dimensionality.
Are there any best practices for such cases?
UPDATE:
It should be possible to assign companies to multiple clusters, e.g. Amazon should be identified as "Marketplace", but also as "Cloud Computing" or "Online Streaming".
I used tf-idf and kmeans to identify tags that form clusters, but I don't know how to assign likelihoods / scores to companies that indicate how well a company fits into a cluster based on its tags.
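(For context, a minimal sketch of the tf-idf + k-means step described above, with distance to the assigned centroid as one rough fit score; the tag strings are made-up examples:)

```python
# Sketch of the tf-idf + k-means step: the tag list is treated as a short
# "document" per company, and distance to the assigned centroid is
# used as a rough (inverse) fit score.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example tag strings, one per company.
companies = [
    "Retail;E-Commerce;SaaS;Cloud Computing",
    "Search Engine;Advertising;Cloud Computing",
    "consulting;retail;e-commerce",
]

# Split on ';' so each whole tag becomes one token.
vectorizer = TfidfVectorizer(tokenizer=lambda s: s.split(";"), token_pattern=None)
X = vectorizer.fit_transform(companies)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives the distance of each company to every centroid;
# a smaller distance to a cluster means a better fit.
distances = kmeans.transform(X)
print(kmeans.labels_, distances.min(axis=1))
```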
UPDATE:
While tf-idf in combination with kmeans delivered pretty neat clusters (meaning the companies within a cluster were actually similar), I also tried to calculate probabilities of cluster membership with Gaussian Mixture Models (GMMs), which led to completely messed-up results where the companies within a cluster were more or less random or came from a handful of different industries.
No idea why this happened though...
UPDATE:
Found the error. I applied PCA before the GMM to reduce dimensionality; however, this apparently led to the random results. Removing the PCA improved the results significantly.
However, the resulting posterior probabilities of my GMM are exactly 0. or 1. about 99.9% of the time. Is there a parameter (I'm using sklearn's BayesianGaussianMixture) that needs to be adjusted to get more useful probabilities that are a bit more spread out? Right now everything < 1.0 is effectively not part of a cluster anymore, but there are also a few outliers that get a posterior of 1.0 and are thus assigned to an industry. For example, a company tagged "Baby;Consumer" gets assigned to the "Consumer Electronics" cluster, even though only 1 out of 2 tags may suggest this. I'd like such a company to get a probability < 1.0 so that I can define a threshold based on some cross-validation.
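(For reference, a minimal sketch of the sklearn BayesianGaussianMixture setup being discussed, showing the parameters that most directly affect how saturated the posteriors are; the data and the parameter values are placeholders, not recommendations:)

```python
# Sketch: soft cluster assignments with a Bayesian Gaussian mixture.
# reg_covar and weight_concentration_prior are the knobs that most
# directly influence how sharply the posteriors saturate at 0/1.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.RandomState(0).rand(500, 20)  # placeholder feature matrix (dense)

gmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="diag",          # cheaper and less flexible than "full"
    reg_covar=1e-3,                  # larger values broaden components, softening posteriors
    weight_concentration_prior=0.1,  # smaller values favor fewer active components
    max_iter=500,
    random_state=0,
).fit(X)

posteriors = gmm.predict_proba(X)    # shape (n_samples, n_components)
print(posteriors.max(axis=1)[:10])   # per-company confidence for its best cluster
```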

Supervised Machine Learning for .Net

I have a problem whereby our users receive the balance of an account each day, and based on the balance, perform an action.
Given the list of historical balances and resulting actions, is it possible to use machine learning to predict the future actions? Preferably in the .net platform.
Thanks.
Ark
I've never used .NET for any data analytics, but I'm sure it won't be too difficult to transpose what I say here into .NET logic.
One of the things people don't like about data science is that in order to see whether something IS actually possible (predicting future actions in this case), you need to do a lot of exploring of the data and see whether it has enough of a pattern to be learned (by either a human or an ML algorithm).
The way to do this would be to shuffle and split the data in some way, let's say into one group with 70 percent of the data and a second with 30 percent.
Once you do this, you want to train some algorithm with the first group (training set) and use the second group (test set) to verify the accuracy of your algorithm.
So how do you choose an algorithm? That's the trickiest part. Only you can say which is best for your particular scenario, given full access to the data. However, given that your output seems to be very discrete (let's say a maximum of 5 actions), this is a supervised learning classification problem. I'd do some analysis using one of these algorithms (SVM, kNN, and decision trees are a few popular ones) and use an evaluation metric like the F1 score to determine how well your fitted algorithm performs on the test set.
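To make that workflow concrete, here is a minimal sketch of the split/train/evaluate loop using scikit-learn as a stand-in, since the logic itself is language-agnostic and can be transposed to .NET; the balance/action data is a made-up placeholder:

```python
# Sketch: 70/30 split, train a classifier, score with F1 on the held-out set.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
balances = rng.uniform(0, 10_000, size=(1000, 1))   # placeholder: one balance per day
actions = (balances[:, 0] > 5_000).astype(int)      # placeholder: 2 possible actions

X_train, X_test, y_train, y_test = train_test_split(
    balances, actions, test_size=0.3, shuffle=True, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# 'macro' averaging also works when there are more than two actions.
print("F1:", f1_score(y_test, predictions, average="macro"))
```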
To perform supervised Machine Learning in .NET, the ML.NET Framework has been announced, and a preview is now available (as of 7th May 2018).
A good starting place for ML.NET is here.

Publish Azure machine learning service with feature hashing

I have created an experiment in Azure Machine Learning Studio. This experiment is a multi-class classification problem using the multiclass neural network algorithm, and I have also added a Feature Hashing module to transform a stream of English text into a set of features represented as integers. I have successfully run the experiment, but when I publish it as a web service endpoint I get the message "Reduce the total number of input and output columns to less than 1000 and try publishing again."
After some research I understood that feature hashing converts text into thousands of features, but the problem is: how do I publish it as a web service? I don't want to remove the Feature Hashing module.
It sounds like you are trying to output all of those thousands of columns. All you really need is the scored probability or the scored label. To solve this, just drop all the feature-hashed columns after the Score Model module: add a Project Columns module, tell it to start with "no columns", then "include" by "column names", and add just the prediction columns (scored probability / scored label).
Then hook the output of that Project Columns module up to your web service output module. Your web service should now return only 1-3 columns rather than thousands.
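As an aside, a minimal sketch (using scikit-learn's HashingVectorizer rather than the Studio module) of why feature hashing inflates the column count past the 1000-column limit, and why callers of the service only need the scored label back:

```python
# Sketch: feature hashing expands one text column into many numeric
# columns (here 2**10 of them), but callers of the service only need
# the final prediction, not those intermediate columns.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product fast delivery", "terrible support", "okay overall"]  # made-up examples
labels = ["positive", "negative", "neutral"]

hasher = HashingVectorizer(n_features=2**10)  # 1024 hashed columns per text
print(hasher.transform(texts).shape)          # (3, 1024) -> far above the 1000-column limit

model = make_pipeline(HashingVectorizer(n_features=2**10), LogisticRegression()).fit(texts, labels)
print(model.predict(["fast delivery, great support"]))  # only the scored label is returned
```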
