Dataflow itself covers ETL, computation, and stream processing, so why would we need Google's Dataproc?
Google Dataflow is a fully managed and self-optimizing cloud service that lets you use the Apache Beam programming model to write your batch and streaming data processing pipelines. It's integrated with many open source and Google Cloud data sources and sinks.
Google Dataproc is a fully managed cloud service for running Apache Hadoop and Apache Spark clusters in a simple, cost-effective way. If you have existing data processing pipelines that use Spark, Hive, or Pig, this is a quick and easy way to move your pipelines. You can install custom packages, and start, stop, and scale these clusters at any time. On top of that, Google Dataproc is integrated with many of Google Cloud's data services.
I just started looking at Temporal and it looks like a great way to orchestrate microservices. I have Knative and Cloud Run based microservices in my project and I would like to adopt Temporal to orchestrate the workflow between my services.
From a quick look through the docs I couldn't figure out whether Temporal can manage serverless microservices (Knative/Cloud Run). Have you used Temporal, and do you have serverless workloads in your project? If so, can you share your experience?
Thanks
It looks like all Temporal code runs inside a (persistent) Temporal server. That probably makes it a poor fit for an environment like Cloud Run or Knative (or AWS Lambda containers).
Looking further through the docs, it also appears that multiple Temporal servers end up individually addressing each other through their own clustering protocol.
From the video at the start, it does seem like you could use an Activity to encapsulate a call to a service running on Knative or Cloud Run.
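To make that last point a bit more concrete, here is a minimal sketch of what the body of such an Activity might look like: a plain Python function that POSTs JSON to an HTTP service and returns the decoded reply. The URL, the payload shape, and the names `build_request` and `call_service` are all hypothetical. In a real Temporal Python worker I believe the function would also need to be registered as an activity (the Python SDK has an `@activity.defn` decorator for this); that part is left out so the sketch runs on the standard library alone.

```python
import json
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for a Cloud Run / Knative HTTP service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_service(url, payload, timeout=30.0, opener=urllib.request.urlopen):
    """Candidate Activity body: call the service once and return its JSON reply.

    Retries are deliberately not implemented here. The assumption is that the
    orchestrator's retry policy would re-run the activity if the call fails.
    The `opener` parameter exists only so the HTTP call can be swapped out
    in tests.
    """
    request = build_request(url, payload)
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The service itself stays an ordinary scale-to-zero HTTP endpoint; only the caller needs to be long-lived.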
I'm tasked with defining AWS tools for ML development at a medium-sized company. Assume about a dozen ML engineers plus other DevOps staff familiar with serverless (Lambdas and the framework). The main questions are: a) what is an architecture that allows for the main tasks related to ML development (creating, training, and fitting models, data pre-processing, hyperparameter optimization, job management, wrapping serverless services, gathering model metrics, etc.), b) what are the main tools that can be used for packaging and deploying things, and c) what are the development tools (IDEs, SDKs, 'frameworks') used for it?
I just want to set Jupyter notebooks aside for a second. Jupyter notebooks are great for proof-of-concepts and the closest thing to PowerPoint for management... But I have a problem with notebooks when thinking about deployable units of code.
My intuition points to a preliminary target architecture with 5 parts:
1 - A 'core' with ML models supporting basic model operations (create blank, create pre-trained, train, test/fit, etc.). I foresee core Python scripts here - no problem.
2 - (optional) A 'containerized-set-of-things' that performs hyperparameter optimization and/or model versioning.
3 - A 'contained-unit-of-Python-scripts-around-models' that exposes an API, does job management, and incorporates data pre-processing. This also reads from and writes to S3 buckets.
4 - A 'serverless layer' with a high-level API (in Python). It talks to #3 and/or #1 above.
5 - Some container or bundling thing that will unpack files from Git and deploy them onto various AWS services, creating things from the previous points.
As you can see, my terms are rather fuzzy :) If someone can be specific with terms, that would be helpful.
My intuition and my preliminary readings say that the answer will likely include a local IDE like PyCharm or Anaconda or a cloud-based IDE (what can these be? - don't mention notebooks please).
The point that I'm not really clear about is #5. Candidates include Amazon SageMaker Components for Kubeflow Pipelines and/or AWS Step Functions DS SDK For SageMaker. It's unclear to me how they can perform #5, however. Kubeflow looks very interesting, but does it have enough adoption or will it die in 2 years? Are Amazon SageMaker Components for Kubeflow Pipelines and AWS Step Functions DS SDK For SageMaker mutually exclusive? How can each of them help with 'containerizing things' and with basic provisioning and deployment tasks?
It's a long question, but these things totally make sense when you are designing ML infrastructure for production. There are three levels that define the maturity of your machine learning process.
1- CI/CD (continuous integration and deployment): Here the Docker image goes through stages such as build, test, and push, which publishes the versioned training image to the registry. You can also run training in these stages and store versioned models using Git references.
2- Continuous training: Here we deal with the ML pipeline, automating the process of retraining models on new data. It becomes very useful when you have to run the whole ML pipeline with new data or a new implementation.
Tools for implementation: Kubeflow Pipelines, SageMaker, Nuclio
3- Continuous delivery: Where will the model run, in the cloud or at the edge? In the cloud, you can use KFServing, or use SageMaker with Kubeflow Pipelines and deploy the model with SageMaker through Kubeflow.
SageMaker and Kubeflow offer broadly similar functionality, but each has its own strengths. Kubeflow brings the power of Kubernetes, pipelines, portability, caching, and artifacts, while SageMaker brings managed infrastructure, scale-from-zero capability, and integration with AWS services such as Athena and Ground Truth.
Solution:
Kubeflow Pipelines standalone + AWS SageMaker (training + model serving) + Lambda to trigger pipelines from S3 or Kinesis.
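As a rough sketch of the Lambda piece, assuming the trigger is an S3 event notification: the handler below pulls the bucket and key out of each record and hands them to a placeholder. `start_pipeline_run` is hypothetical and only returns a description of the run; in a real setup it would call the Kubeflow Pipelines API, which is not shown here.

```python
import urllib.parse


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # skip records that did not come from S3
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = urllib.parse.unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def start_pipeline_run(bucket: str, key: str) -> dict:
    """Hypothetical placeholder for the call that starts a pipeline run."""
    return {"pipeline_input": f"s3://{bucket}/{key}"}


def lambda_handler(event, context):
    """Lambda entry point: start one pipeline run per uploaded object."""
    runs = [start_pipeline_run(b, k) for b, k in extract_s3_objects(event)]
    return {"triggered": len(runs), "runs": runs}
```

Keeping the event parsing in its own function makes it easy to test locally with a hand-written event before wiring up the bucket notification.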
Infra required:
-Kubernetes cluster (at least 1 m5 instance)
-MinIO or S3
-Container registry
-SageMaker credentials
-MySQL or RDS
-Load balancer
-Ingress for using the Kubeflow SDK
Again, you asked about a year of my journey in one question. If you are interested, let's connect :)
Permissions:
Kube --> registry (Read)
Kube --> S3 (Read, Write)
Kube --> RDS (Read, Write)
Lambda --> S3 (Read)
Lambda --> Kube (API Access)
SageMaker --> S3, registry
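For the S3 rows in the list above, a policy document could look roughly like the sketch below. The bucket name `ml-artifacts` and the helper `s3_policy` are made up for illustration, and a real setup would likely narrow the resources further (for example to a key prefix).

```python
import json


def s3_policy(bucket: str, write: bool = False) -> dict:
    """Build an IAM policy document granting read (and optionally write) on one bucket."""
    object_actions = ["s3:GetObject"]
    if write:
        object_actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object reads and writes are granted on the keys inside it.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


lambda_policy = s3_policy("ml-artifacts")            # Lambda --> S3 (Read)
kube_policy = s3_policy("ml-artifacts", write=True)  # Kube --> S3 (Read, Write)
print(json.dumps(lambda_policy, indent=2))
```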
Some good starting guides:
https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/aws
https://aws.amazon.com/blogs/machine-learning/introducing-amazon-sagemaker-components-for-kubeflow-pipelines/
https://github.com/shashankprasanna/kubeflow-pipelines-sagemaker-examples/blob/master/kfp-sagemaker-custom-container.ipynb
I've recently started learning Spring Cloud Data Flow, also known as SCDF. I've just started looking at https://codenotfound.com/spring-batch-admin-example.html, which seems like a very nice example, but I would need more examples to really understand how to use Spring Cloud Data Flow with Spring Batch, as I have good experience with Spring Batch.
What's the difference between spring-cloud-starter-dataflow-server (Data Flow Server Starter) and spring-cloud-starter-dataflow-server-local (Local Data Flow Server Starter)?
We used to ship spring-cloud-starter-dataflow-server-local as a standalone uber-jar for local deployments a few years ago. Similarly, we used to have spring-cloud-starter-dataflow-server-kubernetes, spring-cloud-starter-dataflow-server-cloudfoundry, and others.
However, we have consolidated all the supported platform implementations of SCDF into a single uber-jar, and that is spring-cloud-starter-dataflow-server. Please only use this artifact for any development/deployment, even if it is only used locally.
As for feature capabilities, we have a dedicated page that lists them. Once you dig into the relevant sections ranging from developer guides [example: batch developer guide] to recipes, hopefully, you will have an idea.
And, likewise, you might find the architecture and concepts useful for your research, which will cover the broad set of capabilities that SCDF supports including first-class orchestration experience for Spring Batch workloads.
Is Serverless a subset or attribute of Cloud Native? Or is it the other way round: is Cloud Native a subset or attribute of Serverless?
Nathan Aw (Singapore)
Cloud native is a more general approach to building and running applications that take advantage of cloud computing. Serverless is more of an execution model in the cloud.
A Cloud native stack will usually aim to make use of containers and microservices:
Each part of the stack is packaged in its own container. This promotes reproducibility, transparency, and resource isolation. Dynamically orchestrated containers are then actively scheduled and managed to optimize resource utilization.
Applications are also segmented or broken down into microservices, which are more easily testable and maintainable, loosely coupled, and independently deployable.
Serverless describes a model of providing backend services on an as-used basis. The cloud provider's service (for example AWS Lambda, Google Cloud Functions, or Azure Functions) is responsible for executing a piece of code by dynamically allocating the resources.
Many of today's apps apply elements of both.
I built a pipeline which takes an image and returns the number of people in it. I want to make an API, using Kubeflow, which takes an image and returns a JSON file with the count.
There are a few ways that you can deploy a model for inference from a pipeline:
You can use Kubeflow components like KFServing and the KFServing Deployer component for Kubeflow Pipelines
If you are using a cloud provider, they may have services you can use for inference. For example, there is a component that deploys trained models to Google Cloud AI Platform
Or, you could build a custom solution
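If you go the custom route, a bare-bones version could look something like the sketch below, using only the Python standard library. It assumes the image is sent as the raw body of a POST request, and that you pass in your own `counter` callable wrapping the trained model; `build_response`, `make_handler`, and `serve` are illustrative names, not part of Kubeflow.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_response(image_bytes: bytes, counter) -> bytes:
    """Run the counter on the image and encode the result as a JSON body."""
    return json.dumps({"count": int(counter(image_bytes))}).encode("utf-8")


def make_handler(counter):
    """Create a request handler class bound to the given counter callable."""

    class CountHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read exactly the number of bytes the client declared.
            length = int(self.headers.get("Content-Length", 0))
            image_bytes = self.rfile.read(length)
            body = build_response(image_bytes, counter)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return CountHandler


def serve(counter, port: int = 8080):
    """Serve the API until interrupted."""
    HTTPServer(("0.0.0.0", port), make_handler(counter)).serve_forever()
```

Calling `serve(my_model_counter)` would start the API, where `my_model_counter` is whatever function loads the image bytes and returns the person count. For anything beyond a quick experiment, a proper serving component handles scaling, batching, and model versioning better than this.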