How to deploy machine learning algorithm in production environment? - machine-learning

I'm new to machine learning. I'm learning basic algorithms such as regression, classification, clustering, sequence modelling and online algorithms. All the articles available on the internet show how to use these algorithms with specific data. I can't find any articles about deploying those algorithms in a production environment. So my questions are:
1) How do I deploy a machine learning algorithm in a production environment?
2) The typical approach in machine learning tutorials is to build the model using some training data and then use it on test data. But is it advisable to use that kind of model in a production environment? Incoming data may keep changing, so the model will become ineffective. How long should the model refresh cycle be to accommodate such changes?

I am not sure if this is a good question (it is too general and not well formulated), but I suggest you read about the bias-variance tradeoff. Long story short, you could have a low bias/high variance model that gets 100% accurate results on your training data (the data you used to build the model), but that usually means the model has overfit the training data. As a result, when you try to use it on data it has not seen during training, it will perform poorly. On the other hand, you may have a high bias/low variance model, which fits your training data poorly and performs just as badly on new production data. Keeping this in mind, a general guideline would be:
1) Obtain a good amount of data that you can use to build a prototype of the machine learning system.
2) Split your data into a training set, a cross-validation (CV) set and a test set.
3) Create a model that has relatively low bias (good accuracy or, better, a good F1 score) on your training data. Then try this model on the cross-validation set. If the results are bad, you have a high variance problem: the model overfits the training data and can't generalize well. Rewrite your model, adjust its parameters or use a different algorithm. Repeat until you get a good result on the CV set.
4) Since you tuned the model to get a good result on the CV set, test your final model on the test set. If the result is good, that's it: you have a final version of the model and can use it in the prod environment.
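The four steps above can be sketched in a few lines. This is only an illustration under assumptions: the data is synthetic, the 60/20/20 split is one arbitrary choice, and a hand-written k-nearest-neighbours classifier stands in for whatever model you actually use.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Synthetic 2-D points: class 1 is centred at (2, 2), class 0 at (0, 0)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 * label
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data

def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, rows, k):
    """Fraction of `rows` classified correctly by k-NN fitted on `train`."""
    hits = sum(knn_predict(train, x, k) == y for x, y in rows)
    return hits / len(rows)

rng = random.Random(0)
data = make_data(300, rng)
rng.shuffle(data)

# Step 2: 60% train, 20% cross-validation, 20% test
train, cv, test = data[:180], data[180:240], data[240:]

# Step 3: tune the hyperparameter (k) using the CV set only
candidates = [1, 3, 5, 9, 15]
cv_scores = {k: accuracy(train, cv, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Step 4: touch the test set once, with the chosen model
test_score = accuracy(train, test, best_k)
print(f"best k = {best_k}, CV accuracy = {cv_scores[best_k]:.2f}, "
      f"test accuracy = {test_score:.2f}")
```

The point of keeping the test set out of the tuning loop is that its score is then an honest estimate of how the chosen model behaves on unseen data.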
The second question has no general answer; it depends on your data and your application. But two general approaches can be used:
1) Do everything I mentioned earlier to build a model with good performance on the test set. Retrain your model on new data periodically (try different periods, or retrain once you see that the model's performance has dropped).
2) Use an online learning approach. This is not applicable to many algorithms, but in some cases it can be used. Generally, if your model can be trained with stochastic gradient descent, you can use online learning and keep the model up to date with the newest production data.
Keep in mind that even if you use #2 (the online learning approach), you can't be sure that your model will stay good forever. Sooner or later the data may change significantly and you may want a completely different model (for example, switching to an ANN instead of an SVM or logistic regression).
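To make approach #2 concrete, here is a minimal sketch of online learning with stochastic gradient descent, assuming a logistic regression model and a stream where the true label arrives right after each prediction. The class name, the learning rate and the synthetic stream are all made up for illustration.

```python
import math
import random

class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on the log-loss for a single labelled example."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error

rng = random.Random(0)
model = OnlineLogisticRegression(n_features=2)
recent_hits = []

for step in range(2000):
    x = (rng.gauss(0, 1), rng.gauss(0, 1))
    y = 1 if x[0] + x[1] > 0 else 0  # hypothetical ground truth for the stream
    # Predict first (as production would), then learn from the revealed label
    recent_hits.append(int((model.predict_proba(x) >= 0.5) == (y == 1)))
    model.update(x, y)

recent_accuracy = sum(recent_hits[-500:]) / 500
print(f"accuracy over the last 500 examples: {recent_accuracy:.2f}")
```

Tracking accuracy over a recent window, as done here, is also one simple way to notice the performance drop mentioned in approach #1.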

1) In order to deploy, you should first split your code into preprocessing, training and test (inference). This way you can easily encapsulate the components required for deployment. Usually, you will then want to take your preprocessing and test code, as well as your weights file (the output of your training process), and put them in one folder. Next, you will want to host this on a server and wrap an API server around it. I would suggest a Flask RESTful API so that you can use query parameters as your inputs and return your response as standard JSON blobs.
To host it on a server, you can use this article, which talks about how to deploy a Flask API on EC2.
You can load a model and serve it as an API as given in this code.
2) Hard for me to answer without more details. It's highly dependent on the type of data and the type of model. For example, for deep learning, online learning is rarely used in practice; models are usually retrained in batches.
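For point 1, a rough sketch of the "preprocessing + weights + API wrapper" layout might look like the following. Note the assumptions: it uses Python's built-in http.server instead of Flask so that it runs without extra packages, and the weights, the feature names x1 and x2, and the linear model are all hypothetical placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical contents of a "weights file"; in practice loaded from disk.
WEIGHTS = {"bias": -1.0, "x1": 0.5, "x2": 0.25}

def preprocess(query_string):
    """Turn query parameters such as 'x1=2&x2=4' into a dict of floats."""
    params = parse_qs(query_string)
    return {name: float(params[name][0]) for name in ("x1", "x2")}

def predict(features):
    """Linear score computed from the stored weights."""
    return WEIGHTS["bias"] + sum(
        WEIGHTS[name] * value for name, value in features.items()
    )

def handle_query(query_string):
    """Return (HTTP status, JSON-serialisable body) for one request."""
    try:
        features = preprocess(query_string)
    except (KeyError, ValueError):
        return 400, {"error": "expected numeric query parameters x1 and x2"}
    return 200, {"features": features, "prediction": predict(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_query(urlparse(self.path).query)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def run_server(port=8000):
    """Blocks and serves requests such as GET /predict?x1=2&x2=4."""
    HTTPServer(("", port), PredictHandler).serve_forever()
```

Calling run_server() starts the service. The same three pieces (preprocess, predict, a thin request handler) carry over directly if you use Flask as suggested above.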

For "deployment", you should:
Have ML algorithms: you may use free tools, develop your own tool using libraries in Python, R, Java, .NET, etc., or use a system in the cloud.
Train those ML models using training datasets.
Save those trained models. (Search for this topic based on your development environment. TensorFlow/Keras provide some file formats, and there are formats like pickle, ONNX, etc. I would like to write a whole list here, with their supported languages and environments, advantages and disadvantages, and loadability, but as a newbie I am still investigating this topic myself.)
And THEN you can deploy these saved models to production. In production you should either have your own application to run the saved model (for example, an application you developed in Python that takes the trained and saved .pickle file and the test data as input, and simply gives the prediction for the test data as output), or an environment/framework that runs saved models (search for ML environments/frameworks in the cloud). First, you should clarify your need: do you need a stand-alone program in production, will you serve an internal web service, will you serve via the cloud, etc.?
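As a small illustration of the train, save, then load-and-predict flow, here is a sketch using pickle. The "model" is a deliberately trivial threshold stored in a dict, and the file name model.pickle is just an example; a real model object from a library would be saved with whatever format that library recommends.

```python
import pickle
import tempfile
from pathlib import Path

def train(values):
    """Toy 'training': the model is just the mean of the training values."""
    return {"threshold": sum(values) / len(values)}

def predict(model, values):
    """Predict 1 for inputs above the learned threshold, else 0."""
    return [int(v > model["threshold"]) for v in values]

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pickle"  # hypothetical file name

    # Training side: fit the model and save it to disk
    with open(model_path, "wb") as f:
        pickle.dump(train([1.0, 2.0, 3.0, 6.0]), f)

    # Production side: load the saved model and predict, with no retraining
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

predictions = predict(loaded_model, [0.5, 4.0])
print(predictions)
```

One caveat worth knowing: unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.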
For the second question: as the answers above indicate, the issue is the "online training ability" of the models. Please also note that for "online learning", your production environment has to feed your production tool/system with the true label of the data it predicts on as well. Will you have that capability?

Thoughts about train_test_split for machine learning

I just noticed that many people use train_test_split even before handling missing data; they seem to split the data at the very beginning.
There are also plenty of people who split the data right before the model building step, after they have done all the data cleaning, feature engineering and feature selection.
The people who split the data at the very start say that it is to prevent data leakage.
Right now I am just confused about the pipeline for building a model.
Why do we need to split the data at the very beginning, and clean the train set and test set separately, when for convenience we could do all the data cleaning and feature engineering (or things like transforming categorical variables into dummy variables) on the whole data set together?
You should split the data as early as possible.
To put it simply, your data engineering pipeline builds models too.
Consider the simple idea of filling in missing values. To do this you need to "train" a mini-model to generate the mean or mode or some other average to use. Then you use this model to "predict" missing values.
If you include the test data in the training process for these mini-models, then you are letting the training process peek at that data and cheat a little bit because of that. When it fills in the missing data with values built using the test data, it is leaving little hints about what the test set is like. This is what "data leakage" means in practice. In an ideal world you could ignore it, and instead just use all the data for training and use the training score to decide which model is best.
But that won't work, because in practice a model is only useful once it is able to predict any new data, and not just the data available at training time. Google Translate needs to work on whatever you and I type in today, not just what it was trained with earlier.
So, in order to ensure that the model will continue to work well when that happens, you should test it on some new data in a more controlled way. Using a test set, which has been split out as early as possible and then hidden away, is the standard way to do that.
Yes, it means some inconvenience, because the data engineering has to be split up for training vs testing. But many tools like scikit-learn, which separates the fit and transform stages, make it convenient to build an end-to-end data engineering and modeling pipeline with the right train/test separation.
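To show the fit/transform idea without any library, here is a hand-rolled mean imputer. It only mimics the fit-then-transform pattern; the class name and the tiny columns are made up for the example.

```python
class MeanImputer:
    """A 'mini-model' for missing values: learns a mean, then fills gaps with it."""

    def fit(self, column):
        observed = [v for v in column if v is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, column):
        return [self.mean_ if v is None else v for v in column]

train_col = [1.0, None, 3.0, 5.0]
test_col = [None, 100.0]

# Correct: the imputer only ever sees the training data when it is fitted
imputer = MeanImputer().fit(train_col)
train_filled = imputer.transform(train_col)
test_filled = imputer.transform(test_col)

# Leaky: fitting on train + test lets the test value 100.0 shift the mean
leaky_mean = MeanImputer().fit(train_col + test_col).mean_

print(train_filled, test_filled, leaky_mean)
```

The leaky version fills training gaps with a value that already "knows" about the large test observation, which is exactly the kind of hint the test set is not supposed to give.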

Transfer Learning for small datasets of structured data

I am looking to implement machine learning for problems built on small data sets related to approvals of expenses in a specific supply chain domain. Typically, labelled data is unavailable.
I was planning to build models on one data set for which I have labelled data and then use those models in similar contexts, where the feature set is very similar but not identical. The expectation is that this provides a starting point for recommendations and lets us gather labelled data in the new context.
I understand this is the essence of transfer learning. Most of the examples I have read in this domain deal with image data sets. Is there any guidance on how this can be applied to small data sets using standard tree-based classification algorithms?
I can’t really speak to tree-based algos; I don’t know how to do transfer learning with them. But for deep learning models, the customary method for transfer learning is to load a pretrained model, retrain the last layer of the network using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
raghu, I believe you are looking for a kernel method when you say abstraction layer in deep learning. There are several ML algorithms that support kernel functions. With kernel functions you might be able to do it, but using them might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using a decision tree.
Ok with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?
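Here is a toy version of the workboots/laptops idea, assuming the tree is stored as nested dicts. The tree structure, the "amount" rule and its 500 threshold are invented purely for illustration; a tree exported from a real library would look different.

```python
# Hypothetical tree "trained" on the construction company's expenses
construction_tree = {
    "feature": "item", "equals": "workboots",
    "yes": {"decision": "deny"},
    "no": {
        "feature": "amount", "greater_than": 500,
        "yes": {"decision": "review"},
        "no": {"decision": "approve"},
    },
}

def decide(node, expense):
    """Walk the tree from the root until a leaf with a decision is reached."""
    while "decision" not in node:
        if "equals" in node:
            branch = expense[node["feature"]] == node["equals"]
        else:
            branch = expense[node["feature"]] > node["greater_than"]
        node = node["yes"] if branch else node["no"]
    return node["decision"]

def replace_term(node, old, new):
    """Return a copy of the tree with the category value `old` swapped for `new`."""
    if "decision" in node:
        return dict(node)
    copy = dict(node)
    if copy.get("equals") == old:
        copy["equals"] = new
    copy["yes"] = replace_term(node["yes"], old, new)
    copy["no"] = replace_term(node["no"], old, new)
    return copy

# Reuse the same structure for the accounting firm
accounting_tree = replace_term(construction_tree, "workboots", "laptops")

print(decide(accounting_tree, {"item": "laptops", "amount": 1200}))
print(decide(accounting_tree, {"item": "workboots", "amount": 50}))
```

The structure of the decisions is kept and only the terms change, which gives you a starting point until enough labelled data exists to retrain on the new context.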
After some research, we have decided to proceed with random forest models, with the intuition that trees in the original model that use the common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that consist of (a) only new features and (b) combinations of old and new features.
This has provided reasonable results in initial trials.

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
You could also evaluate the classifier with a metric other than the one usually used, accuracy. Look at precision, recall and the F1 score.
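A quick sketch of why accuracy misleads here, using made-up numbers (5 zebras among 100 photos, and hypothetical classifier scores). It also shows one option not mentioned above, lowering the decision threshold, which is a common way to accept more false positives in exchange for finding more zebras.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (1 = zebra)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 5 zebras (label 1) followed by 95 horses (label 0)
y_true = [1] * 5 + [0] * 95

# "Always a horse": high accuracy, but it never finds a zebra
always_horse = [0] * 100
print(accuracy(y_true, always_horse), precision_recall_f1(y_true, always_horse))

# Hypothetical "probability of zebra" scores from some classifier
scores = [0.9, 0.7, 0.45, 0.4, 0.2] + [0.1] * 85 + [0.35] * 6 + [0.6] * 4

def predict_with_threshold(scores, threshold):
    return [int(s >= threshold) for s in scores]

for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(scores, threshold)
    print(threshold, precision_recall_f1(y_true, y_pred))
```

With the lower threshold, recall goes up and precision goes down; which trade-off is acceptable depends on how costly a missed zebra is compared to a false alarm.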
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn the patterns associated with each class as opposed to simply comparing classes. This can be done via unsupervised learning first (such as with autoencoders). A good article on this is available at: link. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (i.e. where there are many zebra errors).

Supervised Machine Learning for .Net

I have a problem whereby our users receive the balance of an account each day, and based on the balance, perform an action.
Given the list of historical balances and resulting actions, is it possible to use machine learning to predict the future actions? Preferably in the .net platform.
I've never used .NET for any data analytics, but I'm sure it won't be too difficult to translate what I say here into logic in .NET.
One of the things people don't like about data science is that in order to see if something IS actually possible (predicting future outcomes in this case), you need to do a lot of exploring with the data and see if the data has enough of a pattern to be learned (either by a human or by an ML algorithm).
The way to do this would be to shuffle and split the data in some way, let's say into one group with 70 percent of the data and a second with 30 percent of the data.
Once you do this, you want to train some algorithm with the first group (training set) and use the second group (test set) to verify the accuracy of your algorithm.
So how do you choose an algorithm? That's the trickiest part. Only you can say which is best for your particular scenario, given full access to the data. However, given that your output seems to be very discrete (let's say max 5 actions), that makes this a supervised learning classification problem. I'd do some analysis using one of these algorithms (SVM, kNN and decision trees are a few popular ones), and use a metric like the F1 score to determine how well your fitted algorithm performs on your test set (R^2 is the analogous measure if you end up predicting a continuous value instead).
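The shuffle, 70/30 split and evaluate loop described above could look roughly like this. It is written in Python rather than .NET, and everything about the data is an assumption: the action names, the rule that generates them and the 10% noise are invented so the example can run.

```python
import random
from collections import Counter

ACTIONS = ["deposit", "hold", "invest"]  # hypothetical action names

def rule(balance):
    """Hypothetical user behaviour, used only to generate example data."""
    if balance < 0:
        return "deposit"
    if balance <= 1000:
        return "hold"
    return "invest"

rng = random.Random(42)
records = []
for _ in range(200):
    balance = rng.uniform(-1000, 3000)
    # 10% of the time the user does something unpredictable
    action = rng.choice(ACTIONS) if rng.random() < 0.1 else rule(balance)
    records.append((balance, action))

# Shuffle, then 70% training set / 30% test set
rng.shuffle(records)
cut = len(records) * 7 // 10
train, test = records[:cut], records[cut:]

def nearest_neighbour(balance):
    """1-NN: copy the action of the training record with the closest balance."""
    return min(train, key=lambda record: abs(record[0] - balance))[1]

# Baseline to beat: always predict the most common action in the training set
majority = Counter(action for _, action in train).most_common(1)[0][0]

nn_accuracy = sum(nearest_neighbour(b) == a for b, a in test) / len(test)
baseline_accuracy = sum(majority == a for _, a in test) / len(test)
print(f"1-NN: {nn_accuracy:.2f}, majority baseline: {baseline_accuracy:.2f}")
```

If the learned model cannot beat the majority baseline on the test set, that is a sign the balance alone does not carry enough of a pattern to predict the action.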
To perform supervised Machine Learning in .NET, the ML.NET Framework has been announced, and a preview is now available (as of 7th May 2018).
A good starting place for ML.NET is here.

machine learning in GATE tool

After running the machine learning algorithm (SVM) on training data using the GATE tool, I would like to test it on testing data. My questions are: should I test on the same data I trained on? And how can the model extract the entities from the test data when the test data is not annotated with the annotations that were learnt from the training data?
I followed the tutorial at this link, but at the end it was a bit confusing where it talks about splitting the dataset into training and testing.
In GATE you have 3 modes of the machine learning PR - for training, evaluation and application.
What happens when you train is that the ML PR checks the selected annotation (let's say Token), collects its features and learns the target class (e.g. Person, Mention or whatever). Using the example docs, the ML PR creates a model which holds values for features and basically "learns" how to classify new Tokens (or sentences, or other annotations).
When testing, you provide the ML PR only the Tokens with all their features. Then the ML PR uses them as input for its model and decides whether to create a Mention and which one. The ML PR actually needs everything that was there in the training corpus, except the label / target class / mention, which is the decision that should be made.
I think the GATE ML PR ignores the labels when in test mode, so it's not crucial to remove them.
Evaluation is a helpful option, where training and testing are done automatically, the corpus is split and results are presented. What it does is split the corpus in 2, train on one part, apply the model on the other, compare the gold standard to what it labeled. Repeat with different splits.
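The "split, train on one part, apply to the other, repeat with different splits" idea can be sketched outside GATE as a k-fold split. This is not GATE's own code or API, just a generic illustration with made-up document names; how GATE chooses its splits internally is not something this sketch claims to reproduce.

```python
import random

def k_fold_splits(documents, k, seed=0):
    """Yield (training, held_out) pairs so that each document is held out once."""
    shuffled = documents[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        training = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield training, held_out

docs = [f"doc{i}" for i in range(10)]  # hypothetical corpus
for training, held_out in k_fold_splits(docs, k=5):
    # Here you would train on `training`, apply the model to `held_out`,
    # and compare its output with the gold standard annotations.
    print(len(training), "training docs; held out:", sorted(held_out))
```

Averaging the scores over all the splits gives a more stable estimate than a single train/test split, which is useful when the corpus is small.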
The usual sequence is to train and evaluate, check results, fix, add features, etc. and when you're happy with the evaluation results, switch to application and run on data that doesn't have labels.
It is crucial that you run the same pre-processing when you're training and testing. For instance if in training you've run a POS tagger and you skip this when testing, the ML PR won't have the "Token.category" feature and will calculate very different results.
Now to your questions :)
NO! Don't use the same data for testing. That is a very common mistake; if you get suspiciously good results, first check whether you're doing that.
In the tutorial, when you split the corpus, both parts will have all the annotations as before, so the ML PR will have all the features it needs. In real life, you'll have to run some pre-processing first, as docs will come without tokens or anything.
Splitting in their case is done very simply: just save all docs to files, split the files into two folders, and load them as two corpora.
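If you want to script the "split the files into two folders" step rather than do it by hand, something like the following would work. The folder names, the 30% test fraction and the dummy doc files are assumptions for the example; loading the two folders as corpora is still done in GATE itself.

```python
import random
import shutil
import tempfile
from pathlib import Path

def split_corpus(source_dir, train_dir, test_dir, test_fraction=0.3, seed=0):
    """Copy files from source_dir into train_dir and test_dir at random."""
    files = sorted(p for p in Path(source_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)
    n_test = int(len(files) * test_fraction)
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    train_dir.mkdir(parents=True, exist_ok=True)
    test_dir.mkdir(parents=True, exist_ok=True)
    for i, path in enumerate(files):
        target = test_dir if i < n_test else train_dir
        shutil.copy2(path, target / path.name)
    return len(files) - n_test, n_test

# Demo on a throwaway folder with 10 dummy documents
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    for i in range(10):
        (corpus / f"doc{i}.xml").write_text("<doc/>")
    n_train, n_test = split_corpus(corpus, Path(tmp) / "train", Path(tmp) / "test")
    train_names = sorted(p.name for p in (Path(tmp) / "train").iterdir())
    test_names = sorted(p.name for p in (Path(tmp) / "test").iterdir())

print(n_train, "training files,", n_test, "test files")
```

The files are copied rather than moved, so the original corpus folder stays intact.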
Hope this helps :)
