Experiment design for machine learning algorithm in production

Experiment design for machine learning algorithm in production - machine-learning

I have a machine learning algorithm ready. I would like to put it into production in a country of 70 cities. But before rolling it out to 70 cities, I would like to do experimentation in 1 city to evaluate it's performance in production.
However, I'm now facing a question that what criteria should I set in terms if:
1. Time ( how many months I can keep it in production )
2. Data ( how much data I would need in live environment order to evaluate the algorithms performance)
Can anyone guide with this machine learning experimentation in production environment ?
Edit:
I'm applying machine learning for price optimization in US.

It depends on what is your output and how is affected by the time. In which area are you using ML? It is hard to directly say something.

Related

Is reinforcement learning applicable to a RANDOM environment?

I have a fundamental question on the applicability of reinforcement learning (RL) on a problem we are trying to solve.
We are trying to use RL for inventory management - where the demand is entirely random (it probably has a pattern in real life but for now let us assume that we have been forced to treat as purely random).
As I understand, RL can help learn how to play a game (say chess) or help a robot learn to walk. But all games have rules and so does the ‘cart-pole’ (of OpenAI Gym) – there are rules of ‘physics’ that govern when the cart-pole will tip and fall over.
For our problem there are no rules – the environment changes randomly (demand made for the product).
Is RL really applicable to such situations?
If it does - then what will improve the performance?
Further details:
- The only two stimuli available from the ‘environment’ are the currently available level of product 'X' and the current demand 'Y'
- And the ‘action’ is binary - do I order a quantity 'Q' to refill or do I not (discrete action space).
- We are using DQN and an Adam optimizer.
Our results are poor - I admit I have trained only for about 5,000 or 10,000 - should I let it train on for days because it is a random environment?
thank you
Rajesh

You are saying random in the sense of non-stationary, so, no, RL is not the best here.
Reinforcement learning assumes your environment is stationary. The underlying probability distribution of your environment (both transition and reward function) must be held constant throughout the learning process.
Sure, RL and DRL can deal with some slightly non-stationary problems, but it struggles at that. Markov Decision Processes (MDPs) and Partially-Observable MDPs assume stationarity. So value-based algorithms, which are specialized in exploiting MDP-like environments, such as SARSA, Q-learning, DQN, DDQN, Dueling DQN, etc., will have a hard time learning anything in non-stationary environments. The more you go towards policy-based algorithms, such as PPO, TRPO, or even better gradient-free, such as GA, CEM, etc., the better chance you have as these algorithms don't try to exploit this assumption. Also, playing with the learning rate would be essential to make sure the agent never stops learning.
Your best bet is to go towards black-box optimization methods such as Genetic Algorithms, etc.

Randomness can be handled by replacing single average reward output with a distribution with possible values. By introducing a new learning rule, reflecting the transition from Bellman’s (average) equation to its distributional counterpart, the Value distribution approach has been able to surpass the performance of all other comparable approaches.
https://www.deepmind.com/blog/going-beyond-average-for-reinforcement-learning

Need answer about some Machine Learning related questions?

Recently, we planned to build a system for image processing to extract info from images. At present we are using AWS Rekognition to do that. But, in some cases, we are not getting accurate information from AWS. So, we've planned to build our own custom one.
We've 4/5 months to do that. At least a POC version. Also, we've planned to use Tensorflow for that. We all have no prior experience about Machine Learning & Deep Learning but already have 5/6yrs of experience on Computer Programming by using different languages.
Currently, I'm studying ML from a course of Udemy & my approach to solve this problem is...
Learn Machine Learning(ML)
Learn Deep Learning(DL)
Above ML & DL maybe I'll be ready to understand the whole thing & can able to build a system for Image Processing.
In abstract what I've understood is, I've to write one Deep Learning program in Python by using Tensorflow. By using that Program I've to build a Model. Then I've to train that Model by using some training data. Then, when my Model achieves a certain level of accuracy I'll use some test data.
Now, there some places at where I've bit confused & here are my questions regarding that confusion...
I know tensorflow is a library but at some places, it's also mentioned as a system. So, is it really a library(piece of code) only & something more than that?
I got some Image Processing Python code in Tensorflow tutorial section (https://www.tensorflow.org/tutorials/image_recognition). We've tested that code & it's working exactly the way AWS Recognition service work. So, here my doubt is... can I use this Python code as it is in our production work?
After train a model with some training data does those training data get part of the whole system or Machine Learning Model extract some META info from those training data & keep with itself rather whole raw training data(in my case it'll be raw images).
Can I do all these ML+DL programmings over my Linux System? It has Pentium 4 with 8GB RAM.
Also, want to know... the approach which I've mentioned to build a solution for my problem is sufficient or I need to do something else also.
Need some guidance to clear out all these confusion.
Thanks

1 : tensor-flow is like anything else we have been worked with (like Numpy ) but only difference is we have to first defined what we want to use the use it , every thing in tensor-flow are running into a computational graph and evaluating every thing in that graph require a Session , we could call it library because it just piece of code and have interface in python , and system because of all those mechanism it uses
2 :
can I use this Python code as it is in our production work? Why not !
3:
yes you could do that with your system , but the main advantage of tensor-flow and theano , .. the tool like those is that you could run your code on GPU it a more faster way than on CPU because the GPU could handle a lot more matrix multiplication and stuff like that
4:
you know you don't have to learn all the machine learning stuff to built a image recognition system , it may be take years for you to understand whats going on there , Udemy course is very good source but you I highly recommend you to see the machine learning courses of coursera , there is to courses there about machine learning : the great Andrew NG course and Emily fox course , the first one is more theoretical than practical , but second on is more practical ,
and about the Deep learning , there is nothing fancy about Deep learning and it's just a method in machine learning , after you gain some experience in machine learning and understood some basic or you could do it right know , go to fast.ai , it has a really good course about deep learning for coder and it's also free
I hope this will help you

Why machine learning algorithms like xgboost cannot be used in the production environment?

I am a data scientist and I have seen at my workplace that all the major production solution at maximum involves random forest.
Why machine learning algorithms like xgboost cannot be used in the production environment? Why is there a need of reproducibility?

I can't speak for everyone but in most cases you want to have a reason for a decision. You need to be able to convince your clients/your boss that this is the right decision/prediction. If you use neural networks or other black box models you only have the resulting prediction and if you are lucky also a confidence estimate.
"White box" models or models which can be interpreted are better, because you can point to specific features of a sample and say that these are the reasons for the resulting prediction. Decision trees (but not too deep) or simple thresholding belong to this category.
If I understand the concept of xgboost correctly, you train your new trees to correct the mistakes of the previous ones. This means that the trees are not independent and therefore difficult to interpret.

I've seen xgboost being used in production lots of times, I've used it myself (in python and java workers) and I would recommend it if it gives better results over random forest for example (which usually happens).

How to deploy machine learning algorithm in production environment?

I'm new to machine learning algorithm. I'm learning basic algorithms like regression, classification, clustering, sequence modelling, on-line algorithms. All the article that are available on internet shows how to use these algorithm with specific data. There is no article regarding deployment of those algorithm in production environment. So my questions are
1) How to deploy machine learning algorithm in production environment?
2) The typical approach follows in machine learning tutorial is to build the model using some training data, use it for testing data. But is it advisable to use that kind of model in production environment? Incoming data may keep changing so the model will be ineffective. What should be duration for the model refresh cycle to accommodate such changes?

I am not sure if this is a good question (since it is too general and not formulated good), but I suggest you to read about bias - variance tradeoff. Long story short, you could have low bias\high variance machine-learning model and get 100% accurate results on your test data (the data you used to implement a model), but you could cause your model to overfit the training data. As result, when you will try to use it on data which you haven't used during training it will lead to poor performance. On the other hand, you may have high bias\low variance model, which will be poorly fit to your training data and will also perform just as bad on new production data. Keeping this in mind general guideline will be:
1) Obtain some good amount of data which you could use to build a prototype of machine-learning system
2) Split your data into train set, cross-validation set and test set
3) Create a model which will have relatively low bias (good accuracy, actually - good F1 score) on your test data. Then try this model on cross-validation set to see the results. If the results are bad - you have a high variance problem, you used a model which overfit the data and can't generalize well. Re-write your model, play with model parameters or use different algorithm. Repeat until you get a good result on CV set
4) Since we played with the model in order to get a good result on CV set, you want to test your final model on test set. If it is good - that's it, you have a final version of model and could use it on prod environment.
Second question has no answer, it is based on your data and your application. But 2 general approaches might be used:
1) Do everything I mentioned earlier to build a model with a good performance on test set. Re-train your model on new data once in some period (try different periods, but you could try to re-train your model once you see that performance of model dropped down).
2) Use online-learning approach. This is not applicable for many algorithms, but for some cases it could be used. Generally, if you see that you could use stochastic gradient descent learning method - you could use online-learning and just keep your model up-to-date with the newest production data.
Keep in mind that even if you use #2 (online-learning approach) you can't be sure that your model will be good forever. Sooner or later the data you get may change significantly and you may want to use whole different model (for example switch to ANN instead of SWM or logistic regression).

DISCLAIMER: I work for this company, Datmo building a better workflow for ML. We’re always looking to help fellow developers working on ML so feel free to reach out to me at anand#datmo.com if you have any questions.
1) In order to deploy, you should first split up your code into preprocessing, training and test. This way you can easily encapsulate the required components for deployment. Usually, you will then want to take your preprocessing, test, as well as your weights file (the output of your training process) and put them in one folder. Next, you will want to host this on a server and wrap an API server around this. I would suggest a Flask Restful API so that you can use query parameters as your inputs and output your response in standard JSON blobs.
To host it on a server, you can use this article which talks about how you can deploy a Flask API on EC2.
You can load and model and serve it as API as given in this code.
2) Hard for me to answer without more details. It's highly dependent on the type of data and the type of model. For example, for deep learning, there is no such thing as online learning.

I am sorry that my comments does not include too much detail* since I am also a newbie in "deployment" of ML. But since the author is also new in ML, I hope these basic guidance could be helpful as well.
For "deployment", you should
Have ML algorithms: You may use free-tools, or develop your own tool using libraries in Python, R, Java, .Net, .. or use a system on cloud..)
Train those ML models using training datasets
Save those trained models (You should search this topic based on your development environment. There are some file formats that Tensorflow/Keras provide, or formats like pickle, ONNX,.. I would like to write a whole list here, with their supporting language & environment, advantage&disadvantage and loadability but I am also trying to investigate this topic, as a newbie)
And THEN, you can deploy these saved-models on production. On production you should either have your own-developed application to run the saved model (For example: an application that you developed with Python that takes trained&saved .pickle file and TestData as input; and simply gives "prediction for the test data" as output) or you should have an environment/framework that runs the saved models (search for ML environments/frameworks on cloud). At first, you should clarify your need: Do you need a stand-alone program on production, or will you serve a internal web-service, or via-cloud, etc.
For the second question; as above answers indicate the issue is "online training ability" of the models. Please additionally note that; for "online learning", your production environment has to feed your production tool/system with the real-correct label of the test data as well. Will you have that capability?
Note: All above are just small "comments" instead of a clear answer, but technically I am not able to write comments yet. Thanks for not de-voting :)

Regarding the first question, my service mlrequest makes deploying models to production simple. You can get started with a free API key that provides 50k model transactions a month.
This code will train and deploy, or update your model across 5 global data centers.
from mlrequest import Classifier
classifier = Classifier('my-api-key')
features = {'feature1': 'val1', 'feature2': 45}
training_data = {'features': features, 'label': 2}
r = classifier.learn(training_data=training_data, model_name='my-model', class_count=2)
This is how you make predictions, latency-routed to the nearest data center to get the quickest response.
features = {'feature1': 'val1', 'feature2': 77}
r = classifier.predict(features=features, model_name='my-model', class_count=2)
r.predict_result
Regarding your second question, it completely depends on the problem you are solving. Some models need to be frequently updated, while others almost never need to be updated.

Map Reduce Algorithms on Terabytes of Data?

This question does not have a single "right" answer.
I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data.
I want to learn more about the running time of said algorithms.
What books should I read?
I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theoretical treatments or running time.
EDIT: The issue is not that map reduce changes running time. The issue is -- most algorithms do not distribute well to map reduce frameworks. I'm interested in algorithms that run on the map reduce framework.

Technically, there's no real different in the runtime analysis of MapReduce in comparison to "standard" algorithms - MapReduce is still an algorithm just like any other (or specifically, a class of algorithms that occur in multiple steps, with a certain interaction between those steps).
The runtime of a MapReduce job is still going to scale how normal algorithmic analysis would predict, when you factor in division of tasks across multiple machines and then find the maximum individual machine time required for each step.
That is, if you have a task which requires M map operations, and R reduce operations, running on N machines, and you expect that the average map operation will take m time and the average reduce operation r time, then you'll have an expected runtime of ceil(M/N)*m + ceil(R/N)*r time to complete all of the tasks in question.
Prediction of the values for M,R,m, and r are all something that can be accomplished with normal analysis of whatever algorithm you're plugging into MapReduce.

There are only two books that i know of that are published, but there are more in the works:
Pro hadoop and Hadoop: The Definitive Guide
Of these, Pro Hadoop is more of a beginners book, whilst The Definitive Guide is for those that know what Hadoop actually is.
I own The Definitive Guide and think its an excellent book. It provides good technical details on how the HDFS works, as well as covering a range of related topics such as MapReduce, Pig, Hive, HBase etc. It should also be noted that this book was written by Tom White who has been involved with the development of Hadoop for a good while, and now works at cloudera.
As far as the analysis of algorithms goes on Hadoop you could take a look at the TeraByte sort benchmarks. Yahoo have done a write up of how Hadoop performs for this particular benchmark: TeraByte Sort on Apache Hadoop. This paper was written in 2008.
More details about the 2009 results can be found here.

There is a great book about Data Mining algorithms applied to the MapReduce model.
It was written by two Stanford Professors and it if available for free:
http://infolab.stanford.edu/~ullman/mmds.html

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart