I am using Azure Machine Learning to build a model which will predict if a project will be approved (1) or not (0).
My dataset is composed of a list of projects. Each line represents a project and its details - starting day, theme, author, place, people involved, stage, date of last stage and approved.
There are 15 possible sequential (increasing) stages a project can pass through before being approved. However, in some special cases a project can be approved mid-way, that is, before reaching the last stage, which is the most common case.
I will be receiving daily updates on some projects, as well as new projects that are coming in. I am trying to build a model which will predict the probability of a project being approved based on my inputs (which will include stage).
I want to use stage as an input, but if I use it with a two-class boosted decision tree it will indirectly give the answer to my model.
I've read a little bit about HMMs and tried to learn how to apply one to my ML problem, but did not understand how. Could anyone guide me down the right path, please? Should I really use an HMM?
Rather than stage, I would recommend using duration features: the time spent in the last stage, in the stage before that (last stage - 1), and in the one before that (last stage - 2); see the sketch below.
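For illustration, here is a minimal pandas sketch of how such duration features could be derived from a stage-history table; the table layout and column names are assumptions, not from the question:

```python
import pandas as pd

# Hypothetical stage-history table: one row per (project, stage) entry.
history = pd.DataFrame({
    "project_id": [1, 1, 1, 2, 2],
    "stage":      [1, 2, 3, 1, 2],
    "entered_on": pd.to_datetime(["2023-01-01", "2023-01-10", "2023-02-01",
                                  "2023-01-05", "2023-03-01"]),
}).sort_values(["project_id", "stage"])

# Duration in each stage = time until the next stage was entered;
# the current (still open) stage is measured up to today.
next_entry = history.groupby("project_id")["entered_on"].shift(-1)
history["days_in_stage"] = (
    next_entry.fillna(pd.Timestamp.today()) - history["entered_on"]
).dt.days

# For each project, keep the durations of the last 3 stages it reached:
# days_in_stage_minus_0 is the current stage, _1 the one before, and so on.
def last_durations(s, n=3):
    return s.iloc[::-1].head(n).reset_index(drop=True)

features = (history.groupby("project_id")["days_in_stage"]
            .apply(last_durations)
            .unstack()
            .add_prefix("days_in_stage_minus_"))
print(features)
```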
I started working with Azure Machine Learning Service. It has a feature called Pipeline, which I'm currently trying to use. There are, however, a bunch of things that are completely unclear from the documentation and the examples, and I'm struggling to fully grasp the concept.
When I look at 'batch scoring' examples, they are implemented as a Pipeline Step. This raises the question: does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this? Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something in the training part (and vice versa).
What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
What isn't shown anywhere is how to deal with the model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a PipelineData object? Should train.py itself be responsible for registering the trained model?
Anders has a great answer, but I'll expand on #1 a bit. In the batch scoring examples you've seen, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use-case. Use the allow_reuse param and set it to True, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step for example, and consider the following input to that step:
training script
input data
additional step params
If you set allow_reuse=True, and your training script, input data, and other step params are the same as the last time the pipeline ran, the step will not rerun; it will use the cached output from the previous run. But if, say, your input data changed, then the step would rerun.
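For example, a training step with caching enabled might be defined like this (dataset, compute and script names are placeholders, assuming the Azure ML SDK v1 pipeline classes):

```python
from azureml.core import Workspace, Dataset
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
input_data = Dataset.get_by_name(ws, "projects")          # assumed registered dataset
model_output = PipelineData("model_output", datastore=ws.get_default_datastore())

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    arguments=["--n-estimators", "100"],                   # additional step params
    inputs=[input_data.as_named_input("training_data")],   # input data
    outputs=[model_output],
    compute_target="cpu-cluster",                          # assumed compute name
    source_directory="./src",                              # contains the training script
    allow_reuse=True,  # reuse cached output if script, inputs and params are unchanged
)

pipeline = Pipeline(workspace=ws, steps=[train_step])
```

Changing train.py, the registered dataset, or the arguments invalidates the cache, and the step runs again on the next submission.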
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
Azure ML pipelines best practices are emergent, so I can give you some recommendations, but I wouldn't be surprised if others respond with divergent, deeply-held opinions. The Azure ML product group is also improving and expanding the product at a phenomenal pace, so I fully expect things to change (for the better) over time. This article does a good job of explaining ML pipelines.
3 Passing a model to a downstream step
How do I get the model in the next step?
During development, I recommend that you don't register your model and that the scoring step receives your model via a PipelineData as a pickled file.
In production, the scoring step should use a previously registered model.
Our team uses a PythonScriptStep that has a script argument that allows a model to be passed from an upstream step or fetched from the registry. The screenshot below shows our batch score step using a PipelineData named best_run_data, which contains the best model (saved as model.pkl) from a HyperDriveStep.
The definition of our batch_score_step has a boolean argument, '--use_model_registry', that determines whether to use the recently trained model or the model registry. We use a function, get_model_path(), to pivot on the script arg. Here are some code snippets of the above.
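The snippets themselves are not reproduced here, but a minimal sketch of that pivot might look like the following; get_model_path(), '--use_model_registry' and best_run_data come from the description above, while the body and the remaining argument names are assumptions rather than the author's actual code:

```python
import argparse
import os
import joblib
from azureml.core import Run
from azureml.core.model import Model

def get_model_path(args, run):
    """Pivot on --use_model_registry: registry model vs. model from an upstream step."""
    if args.use_model_registry:
        # Production: fetch a previously registered model from the workspace registry.
        return Model.get_model_path(args.model_name, _workspace=run.experiment.workspace)
    # Development: use the model handed over via PipelineData (e.g. best_run_data).
    return os.path.join(args.best_run_data, "model.pkl")

parser = argparse.ArgumentParser()
parser.add_argument("--use_model_registry", type=lambda s: s.lower() == "true", default=False)
parser.add_argument("--model_name", default="my_model")   # assumed registry name
parser.add_argument("--best_run_data", default=None)      # PipelineData mount path
args = parser.parse_args()

run = Run.get_context()
model = joblib.load(get_model_path(args, run))
```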
2 Control Plane vs Data Plane
What parts should be implemented as a Pipeline Step and what parts shouldn't?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside PipelineSteps, whose inputs and outputs should be PipelineData objects.
Azure ML artifacts should be:
- created in the pipeline control plane using PipelineData, and
- registered either:
- ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way, PipelineData is the glue that connects pipeline steps directly, rather than connecting them indirectly with .register() and .download().
PipelineData objects are ultimately just ephemeral directories that can also be used as placeholders, before steps are run, to create and register artifacts.
Datasets are abstractions of PipelineData in that they make things easier to pass to AutoMLStep, HyperDriveStep, and DataDrift.
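As a small illustration of PipelineData acting as that glue between steps (script names, compute target and paths are placeholders):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Ephemeral directory written by the prep step and read by the training step.
prepped = PipelineData("prepped_data", datastore=datastore)

prep_step = PythonScriptStep(
    name="prep", script_name="prep.py",
    arguments=["--out", prepped], outputs=[prepped],
    compute_target="cpu-cluster", source_directory="./src",
)

train_step = PythonScriptStep(
    name="train", script_name="train.py",
    arguments=["--in", prepped], inputs=[prepped],
    compute_target="cpu-cluster", source_directory="./src",
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```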
1 Pipeline encapsulation
does this mean that the 'predicting part' is part of the same pipeline as the 'training part', or should there be 2 separate pipelines for this?
Your pipeline architecture depends on whether:
you need to predict live (else batch prediction is sufficient), and
your data is already transformed and ready for scoring.
If you need live scoring, you should deploy your model. If batch scoring is fine, you could either:
have a training pipeline at the end of which you register a model that is then used in a scoring pipeline, or
do as we do and have one pipeline that can be configured to do either using script arguments.
First of all, I had difficulties formulating my question; feedback is welcome.
I have to make a machine learning agent to play dots and boxes.
I'm just in the early stages, but came up with this question: if I let my machine learning agent (with a specific implementation) play against a copy of itself to learn and improve its gameplay, wouldn't it just develop a strategy against that specific kind of gameplay?
Would it be more interesting if I let my agent play and learn against different forms of other agents in an arbitrary fashion?
The idea of having an agent learn by playing against a copy of itself is referred to as self-play. Yes, in self-play, you can sometimes see that agents will "overfit" against their "training partner", resulting in an unstable learning process. See this blogpost by OpenAI (in particular, the "Multiplayer" section), where exactly this issue is described.
The easiest way to address this that I've seen appearing in research so far is indeed to generate a more diverse set of training partners. This can, for example, be done by storing checkpoints of multiple past versions of your agent in memory / in files, and randomly picking one of them as training partner at the start of every episode. This is roughly what was done during the self-training process of the original AlphaGo Go program by DeepMind (the 2016 version), and is also described in another blogpost by OpenAI.
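For illustration, here is a rough sketch of that checkpoint-pool idea in Python; the class and the play_episode / update_agent callbacks are placeholders rather than part of any specific RL library:

```python
import copy
import random

class SelfPlayTrainer:
    """Train an agent against randomly sampled past versions of itself."""

    def __init__(self, agent, checkpoint_every=100, pool_size=20):
        self.agent = agent
        self.checkpoint_every = checkpoint_every
        self.pool_size = pool_size
        self.opponent_pool = [copy.deepcopy(agent)]  # start with the initial agent

    def train(self, n_episodes, play_episode, update_agent):
        for episode in range(1, n_episodes + 1):
            # Pick a random past version as the training partner for this episode.
            opponent = random.choice(self.opponent_pool)
            trajectory = play_episode(self.agent, opponent)  # e.g. one dots-and-boxes game
            update_agent(self.agent, trajectory)             # the learning update

            # Periodically freeze a copy of the current agent into the pool.
            if episode % self.checkpoint_every == 0:
                self.opponent_pool.append(copy.deepcopy(self.agent))
                if len(self.opponent_pool) > self.pool_size:
                    self.opponent_pool.pop(0)  # drop the oldest checkpoint
```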
I am quite new to machine learning, with only a little experience, and I have done some projects.
Now I have a project related to insurance. I have databases about clients, which I will merge to get all possible information about the clients, and I have one database for the claims. I need to build a model that identifies how risky a client is, based on ranks.
My question: I need to build my target variable, which ranks the clients based on how risky they are, relying on the claims. I could use different strategies to do that, but I am confused about how to deal with the following:
- Shall I do a specific type of analysis before building the ranks, such as clustering, or do I need a strong theoretical assumption that matches the project provider's vision?
- If I use some variables from the claims database to build up the ranks, how shall I deal with them later? In other words, shall I remove them from the final training data set to avoid correlation with the target variable, or can I treat them differently and keep them?
- If I keep them, is there a special treatment for them depending on whether they are categorical or continuous variables?
Every machine learning project's starting point is EDA. First create some features, such as how often clients get bad claims or how many they get. Then do some EDA to find which features are most useful. Secondly, the problem looks like classification; clustering is usually harder to evaluate.
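For instance, a minimal pandas sketch of building such claim features (the table and column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical claims table: one row per claim.
claims = pd.DataFrame({
    "client_id":  [1, 1, 2, 3, 3, 3],
    "claim_cost": [500, 1200, 300, 700, 900, 150],
    "claim_date": pd.to_datetime(
        ["2022-01-05", "2022-06-10", "2022-03-01",
         "2021-11-20", "2022-02-14", "2022-08-30"]),
})

# Per-client claim features to explore in EDA and later feed into a classifier.
features = claims.groupby("client_id").agg(
    n_claims=("claim_cost", "size"),
    total_cost=("claim_cost", "sum"),
    avg_cost=("claim_cost", "mean"),
    last_claim=("claim_date", "max"),
)
print(features)
```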
In data science, when you build a business model, EDA (exploratory data analysis) plays a major role; it includes data cleaning, feature engineering, and filtering the data. As for how to build the target variable, it all depends on the attributes you have and what model you want to apply, say linear regression, logistic regression, or a decision tree; you need to use those algorithms. Most importantly, though, you need to find the impacting variables, that is, the core relation between the output and the given inputs, and priority must be given accordingly. Also, attributes which add no value should be removed, as they would contribute to overfitting.
You can do clustering too, and interestingly, any unsupervised learning problem can be converted into a form of supervised learning. You could try logistic regression or linear regression, etc., and find out which model fits your project best.
I have two kinds of profiles in my database: one is the candidate profile, the other is the job profile posted by a recruiter.
In both profiles I have 3 common fields, say location, skill and experience.
I know the algorithm, but I am having a problem creating the training data set, where my input features will be location, skill and salary chosen from the candidate profile; I am not getting how to choose the output (the relevant job profile).
As far as I know, the output can only be a single variable, so how do I choose the relevant job profile as an output in my training set? Or should I choose some other method? Another thought is clustering.
As I understand it, you want to predict a job profile given a candidate profile, using some prediction algorithm.
Well, if you want to use regression you need some historical data, i.e. which candidates were given which jobs; then you can build a model based on that historical data. If you don't have such training data, you need some other algorithm. Say, you could set location, skill and experience as features in 3D and use clustering / nearest neighbours to find the candidate profile closest to a job profile.
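For illustration, a minimal nearest-neighbour sketch with scikit-learn, assuming you have already encoded location, skill and experience as numbers (the encoding here is deliberately simplistic):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Assumed numeric encoding: [location_code, skill_code, years_experience].
job_profiles = np.array([
    [0, 2, 3],   # job 0
    [1, 1, 5],   # job 1
    [2, 3, 1],   # job 2
])
candidate = np.array([[1, 1, 4]])  # a candidate profile encoded the same way

nn = NearestNeighbors(n_neighbors=1).fit(job_profiles)
distance, index = nn.kneighbors(candidate)
print("closest job profile:", index[0][0], "at distance", distance[0][0])
```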
You could look at "recommender systems", they can be an answer to your problem.
Starting with a content-based algorithm (you will have to find a way to automate the labelling of the jobs, or do it manually), you can improve it by gathering which jobs your users were actually interested in (and thus become a hybrid recommender).
After running the machine learning algorithm (SVM) on training data using the GATE tool, I would like to test it on test data. My questions are: should I use the same training data for testing? Also, how can the model extract entities from the test data when the test data is not annotated with the annotations that were learnt from the training data?
I followed the tutorial on this link http://gate.ac.uk/sale/talks/gate-course-may11/track-3/module-11-machine-learning/module-11.pdf but at the end it was a bit confusing when it talks about splitting the dataset into training and testing.
In GATE you have 3 modes of the machine learning PR - for training, evaluation and application.
What happens when you train is that the ML PR checks the selected annotation (let's say Token), collects its features and learns the target class (i.e. Person, Mention or whatever). Using the example docs, the ML PR creates a model which holds values for features and basically "learns" how to classify new Tokens (or sentences, or other units).
When testing, you provide the ML PR only the Tokens with all their features. Then the ML PR uses them as input for its model and decides if or what Mention to create. The ML PR actually needs everything that was there in the training corpus, except the label / target class / mention - the decision that should be made.
I think the GATE ML PR ignores the labels when in test mode, so it's not crucial to remove it.
Evaluation is a helpful option, where training and testing are done automatically: the corpus is split and results are presented. What it does is split the corpus in two, train on one part, apply the model on the other, and compare the gold standard to what it labelled. Then repeat with different splits.
The usual sequence is to train and evaluate, check results, fix, add features, etc. and when you're happy with the evaluation results, switch to application and run on data that doesn't have labels.
It is crucial that you run the same pre-processing when training and when testing. For instance, if in training you've run a POS tagger and you skip this when testing, the ML PR won't have the "Token.category" feature and will produce very different results.
Now to your questions :)
No! Don't use the same data for testing; that is a very common mistake. If you get suspiciously good results, first check whether you're doing that.
In the tutorial, when you split the corpus, both parts will have all the annotations as before, so the ML PR will have all the features it needs. In real life, you'll have to run some pre-processing first, as documents will come without tokens or anything.
Splitting in their case is done very simply: just save all docs to files, split the files into two folders, and load them as two corpora.
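For illustration, a small Python sketch of that file-level split (the paths and the 80/20 ratio are assumptions); the two resulting folders can then be loaded into GATE as separate corpora:

```python
import random
import shutil
from pathlib import Path

docs = sorted(Path("corpus").glob("*.xml"))   # assumed GATE document files
random.seed(42)
random.shuffle(docs)

split = int(0.8 * len(docs))                  # 80% training, 20% testing
for folder, subset in (("train", docs[:split]), ("test", docs[split:])):
    out = Path(folder)
    out.mkdir(exist_ok=True)
    for doc in subset:
        shutil.copy(doc, out / doc.name)
```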
Hope this helps :)