Distributed training of XGBoost with AI Platform (Serverless) Google Cloud [closed] - dask

I want to train an XGBoost model on a very large dataset that would cause out-of-memory errors if trained on a single machine. I'm also interested in seeing how fast the training can be made. For the sake of argument, let's assume the training dataset is fixed, so please no answers about feature reduction.
To get this running as a Custom Job on Google Cloud, I figured I have to parse the TF_CONFIG environment variable, which is set behind the scenes on every node specified by an input argument when creating the job.
There is an example Docker image used on this page; it looks like it parses the TF_CONFIG variable and manually sets up distributed training using xgboost's rabit module.
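Roughly the kind of TF_CONFIG parsing I mean (a sketch based on the documented TF_CONFIG layout; the exact cluster keys can differ by platform, and the actual rabit/tracker setup is omitted):

    import json
    import os

    # TF_CONFIG is a JSON string describing the cluster and this node's role, e.g.
    # {"cluster": {"worker": ["host0:port", "host1:port"]}, "task": {"type": "worker", "index": 1}}
    # (the exact cluster keys depend on the platform / worker pool layout)
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})

    workers = cluster.get("worker", [])
    rank = task.get("index", 0)
    is_chief = task.get("type") in ("chief", "master") or rank == 0

    print(f"{len(workers)} workers, this node is rank {rank}, chief={is_chief}")
    # ...from here the rabit tracker would be initialised with these addresses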
The other way I'm thinking of doing this is to use a library like Dask or Ray to run the distributed training, but I don't have any experience with those.
Can someone provide examples of one or the other?
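For the Dask route, something like the sketch below is what I have in mind (untested; the scheduler address, bucket path and label column are placeholders):

    import dask.dataframe as dd
    import xgboost as xgb
    from dask.distributed import Client

    # Connect to an already-running Dask cluster (address is a placeholder)
    client = Client("tcp://dask-scheduler:8786")

    # Lazily load the training data from GCS (path and column are placeholders; needs gcsfs)
    df = dd.read_parquet("gs://my-bucket/training-data/*.parquet")
    X = df.drop(columns=["label"])
    y = df["label"]

    # DaskDMatrix keeps the data partitioned across the workers
    dtrain = xgb.dask.DaskDMatrix(client, X, y)

    output = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=100,
    )
    booster = output["booster"]  # trained model; output["history"] holds eval results if evals are passed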

Ray provides a distributed xgboost trainer that can run on common cloud providers (including Google Cloud). See example here: https://docs.ray.io/en/latest/ray-air/examples/xgboost_example.html
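A minimal sketch of what that looks like with Ray's XGBoostTrainer (it follows the linked example; exact import paths vary a bit across Ray versions, and the CSV is just the public demo dataset):

    import ray
    from ray.train import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Public demo dataset from the Ray docs; swap in your own ray.data source
    train_dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=4, use_gpu=False),  # 4 workers, CPU only
        label_column="target",
        num_boost_round=100,
        params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        datasets={"train": train_dataset},
    )
    result = trainer.fit()
    print(result.metrics)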

Related

ML or rule-based [closed]

I already have 85% accuracy with my sklearn text classifier. What are the advantages and disadvantages of building a rule-based system instead? Could it save me from doing double the work? Maybe you can provide me with sources and evidence for each side, so that I can make the decision based on my circumstances. In short, when is a rule-based approach favorable versus an ML-based approach? Thanks!
Here is an idea:
Instead of going one way or the other, you can set up a hybrid model. Look at the typical errors your machine learning classifier makes and see if you can come up with a set of rules that capture those errors. Run these rules on your input first; if one applies, stop there, otherwise pass the input on to the classifier.
In the past I did this with a probabilistic part-of-speech tagger. It's difficult to tune a probabilistic model, but it's easy to add a few pre- or post-processing rules to capture some consistent errors.
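A bare-bones sketch of that hand-off (the rule patterns and labels are made up, and `model` stands for any fitted sklearn text pipeline):

    import re

    # Hand-written rules that capture known classifier errors: (pattern, label)
    RULES = [
        (re.compile(r"\bunsubscribe\b", re.I), "spam"),
        (re.compile(r"\binvoice attached\b", re.I), "spam"),
    ]

    def classify(text, model):
        """Apply the rules first; fall back to the ML classifier otherwise."""
        for pattern, label in RULES:
            if pattern.search(text):
                return label
        return model.predict([text])[0]  # `model` is e.g. a fitted sklearn Pipeline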
https://www.linkedin.com/feed/update/urn:li:activity:6674229787218776064?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6674229787218776064%2C6674239716663156736%29
Yoel Krupnik (CTO & co-founder | smrt - AI For Accounting) writes:
I think it really depends on the specific problem. Some problems can be completely solved with rule based logic, some require machine learning (often in combination with rule based logic before or after).
Advantages of the rule-based approach are that it doesn't require labeled training data, it can quickly provide decent results to use as a benchmark, and it helps you better understand the problem for the future labeling / text manipulation the ML algorithm will require.

What are good practices for building your own custom facial recognition? [closed]

I am working on building a custom facial recognition for our office.
I am planning to use Google FaceNet.
Now, my question is not about the model itself; you can find or build your own version of the FaceNet model in Keras or PyTorch without any issue. It is about creating the dataset: what are the best practices for capturing photos of a person when I don't have any prior photo of them? All I have is a camera and the person. Should I create variance by changing the lighting conditions, orientation, or face size?
A properly trained FaceNet model should already be somewhat invariant to lighting conditions, pose, and other features that should not be part of identifying a face. At least that is what is claimed in a draft of the FaceNet paper. If you only intend to compare feature vectors generated by the network, and only need to recognize a small group of people, your own dataset likely does not have to be particularly large.
Personally, I have done something quite similar to what you are trying to achieve for a group of around 100 people. The dataset consisted of one image per person, and I used a 1-NN (nearest-neighbour) classifier on the generated feature vectors. While I do not remember the exact results, it worked quite well. The pretrained network's architecture was different from FaceNet's, but the overall idea was the same.
The only way to truly answer your question though would be to experiment and see how well things work out in practice.
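For reference, that comparison step can be as small as a cosine nearest-neighbour lookup over the stored embeddings (a sketch; the 0.5 threshold is made up and would need tuning on your own data):

    import numpy as np

    def identify(query_emb, gallery_embs, labels, threshold=0.5):
        """Match one FaceNet-style embedding against one stored embedding per person."""
        gallery = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
        query = query_emb / np.linalg.norm(query_emb)
        sims = gallery @ query                  # cosine similarity to every enrolled person
        best = int(np.argmax(sims))
        if sims[best] < threshold:
            return "unknown", sims[best]
        return labels[best], sims[best]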

How could neural networks/ML help with microservices? [closed]

I am wondering whether neural networks could help in monitoring the user requests of microservices, and also of a monolithic service, in a way that improves performance. I would appreciate detailed advice on this.
I got the idea while reading this article. I am also interested in any other ways ML could help with microservices or with monitoring servers.
It depends... on what you want to achieve. ML/"AI" is typically used to predict a specific outcome based on existing data. So, if there is historical data indicating whether the state of the system is {relaxed|critical}, you might get an idea of when to act before "critical" is reached. But then again, it seems like overkill if you can simply monitor your resources and define a threshold at which more resources need to be added (cloud service providers scale on demand).
If you are thinking about anomaly detection, that is where ML/"AI" might help. But you need relevant data to actually train a useful network.
My tip: check out service providers like Datadog and see what they have in store for you. Training, evaluating and putting a neural network into production is not a trivial task.
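If you do want to try the anomaly-detection route yourself, a small scikit-learn sketch could look like this (the metric columns, file name and contamination rate are all hypothetical):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Historical per-minute service metrics: [request_rate, p95_latency_ms, error_rate]
    history = np.loadtxt("metrics_history.csv", delimiter=",")  # placeholder file

    detector = IsolationForest(contamination=0.01, random_state=0).fit(history)

    latest = np.array([[1200.0, 340.0, 0.02]])  # made-up current reading
    if detector.predict(latest)[0] == -1:       # -1 means "anomalous" in scikit-learn
        print("Metrics look anomalous -- investigate or scale out")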

How to implement feature extraction in Julia [closed]

I am trying to build a binary classifier using machine learning, and I want to derive additional features for my data from the correlated (numerical) attributes I already have. I have searched a lot but could not find a block of code that works for me.
What should I do?
I've looked into dimensionality reduction and found the MultivariateStats library, but I didn't really understand it and felt lost. :D
No one will choose the exact method for you. There are many, many different ways of doing binary classification and feature extraction. If you feel overwhelmed by all the names that libraries such as MultivariateStats offer, take a look at a textbook on statistics and machine learning; understanding the methods is independent of the programming language.
Start with a simple method such as principal component analysis (PCA), which MultivariateStats.jl provides, then test others as you gain more knowledge of your data and the methods.
Some Julia libraries to take a look at: JuliaStats (https://github.com/JuliaStats) with its parts
StatsBase for the most basic stuff
MultivariateStats for methods like PCA
StatsModels (and DataFrames) for statistical models
many more ....
For neural networks there are Flux.jl and Knet.jl
For Clustering there is Clustering.jl
Then, there are also bindings to the Python libraries TensorFlow (neural networks & more) and scikit-learn (all kinds of ML algorithms)
There are many more projects, but these are some that I think are important.

Google Vision API vs. building your own [closed]

I have quite a challenging use case for image recognition. I want to detect the composition of mixed recycling, e.g. crushed cans, paper and bottles, and detect any anomalies such as glass, bags, shoes, etc.
Trying images with the Google Vision API, the results are mainly "trash", "recycling", "plastic", etc., likely because the API hasn't been trained on mixed and broken material like this.
For something like this, would I have to go with something like TensorFlow and build a neural network from my own images? I guess I wouldn't need to use Google for this, since TensorFlow is open source?
Thanks.
So generally, whenever you apply machine learning to a new, real-world use case, it is a good idea to get your hands on a representative dataset; in your case that would be images of these trash materials.
Then you can pick an appropriate pretrained model (VGG, Inception, ResNet) and modify the final classification layer to output as many category labels as you require (maybe 'normal' or 'anomaly' in your case, so 2 classes).
Then you load the pre-trained weights for this network, because the learned features generalize (google 'Transfer Learning'), initialize your modified classification layer randomly, and train only that last layer, or maybe the last two or three layers, depending on what works best given how much data you have and how well it generalizes.
So, in short:
1. Pick a pretrained model.
2. Modify it for your problem.
3. Finetune the weights on your own dataset.
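A compact sketch of those three steps with a torchvision backbone (illustrative only: the two-class head, hyperparameters and the DataLoader over your own images are assumptions, not part of the answer above):

    import torch
    import torch.nn as nn
    from torchvision import models

    # 1. Pick a pretrained model (ImageNet weights; torchvision >= 0.13 API)
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # 2. Modify it: freeze the backbone, replace the classification head
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. "normal" vs "anomaly"

    # 3. Finetune the new head (unfreeze more layers later if needed)
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    # for images, labels in train_loader:          # train_loader: your own labelled images
    #     optimizer.zero_grad()
    #     loss = loss_fn(model(images), labels)
    #     loss.backward()
    #     optimizer.step()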
