Presto with Kubernetes - docker

We are trying to run Presto on Kubernetes. We have a managed Kubernetes cluster running in the cloud. I searched for best practices on deploying Presto with Kubernetes but could not find a conclusive answer, and the official Presto GitHub repository does not help either. Below are the two questions I am trying to answer:
What is the best approach to configuring Presto on Kubernetes, for example choosing the ideal number of worker replicas?
How can we performance-test this deployment?

You could install Presto with the official Helm chart from https://github.com/helm/charts/tree/master/stable/presto. It provides an option to set the number of workers. With the official chart you can ask questions in the Kubernetes charts Slack channel (join through http://slack.k8s.io) and raise issues on GitHub if you hit any. There are also non-Helm examples such as https://github.com/dharmeshkakadia/presto-kubernetes
The question of how many workers isn't specific to Kubernetes. It's a question of how much and what kind of load the deployment will need to handle, and it will also depend on what hardware your Kubernetes cluster is using. If you're not sure, deploy with the defaults and adjust as needed, as suggested by https://prestodb.io/presto-admin/docs/current/installation/presto-configuration.html You'll find settings such as memory per node in the Deployment parts of the Kubernetes YAML descriptors, or in values.yaml in the case of the Helm chart.
To performance-test your deployment you will need test data, and you can then run queries against the cluster: the same process you would follow outside of Kubernetes. There are tools to help, such as https://www.lewuathe.com/use-benchto-for-evaluation-of-presto.html or https://github.com/prestodb/tempto You may also want to look at https://kognitio.com/blog/presto-performance-powerful-or-problematic/
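If you just want a quick smoke test before setting up benchto or tempto, a minimal sketch along these lines can time individual queries from Python using the presto-python-client package (the coordinator host, catalog, and table names below are placeholders for your own setup):

import time
import prestodb  # pip install presto-python-client

# Hypothetical coordinator address: inside the cluster this would be the
# coordinator Service DNS name, from outside an Ingress or a port-forward.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="benchmark",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
start = time.time()
cur.execute("SELECT count(*) FROM lineitem")  # replace with a query over your own test data
rows = cur.fetchall()
print("rows:", rows[0][0], "elapsed: %.2fs" % (time.time() - start))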

There are a couple of examples of how this can be achieved, for example dharmeshkakadia/presto-kubernetes, but you might want to use a StatefulSet here instead. I'm not sure about performance tests, because much will depend on the kind of persistent volume you choose, or rather what it is backed by: NFS, Ceph, or perhaps native storage if you are in a cloud environment?

Related

Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management, or anything like that is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure, or something similar. That would not be a hard requirement, though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many, many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a “simple” one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container for spinning up in a cloud environment.
It's possible you're looking for something like AWS Batch: https://aws.amazon.com/batch/ or Google Dataflow: https://cloud.google.com/dataflow. Out of the box they handle scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
ActiveMQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more instances you have consuming off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, so your main application would know the status of each worker. Alternatively, workers could drop their status in a database. It probably depends on your preference for collecting results and how large the result sets are. If they are large enough, you might even drop results in an S3 bucket or some kind of filesystem.
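To make the pattern concrete, here is a rough sketch (not a drop-in solution) of a worker consuming jobs from a NATS queue group with the nats-py client; the subject names, server address, and the processing step are made up for the example:

import asyncio
import json
import nats  # pip install nats-py

async def main():
    nc = await nats.connect("nats://nats.example.com:4222")  # placeholder server address

    async def handle_job(msg):
        job = json.loads(msg.data)
        # ... run the containerised work here: download data, process, upload ...
        result = {"job_id": job.get("id"), "status": "success"}
        # Post the result on a dedicated status subject so the main app can track progress.
        await nc.publish("jobs.status", json.dumps(result).encode())

    # Every worker subscribes with the same queue group name, so each job
    # is delivered to exactly one worker.
    await nc.subscribe("jobs", queue="workers", cb=handle_job)
    await asyncio.Event().wait()  # keep the worker running

asyncio.run(main())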
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, since you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it's running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it certainly could be used to spin up N nodes and increase or decrease the number of workers. To auto-scale you could publish a single metric, the queue depth, and trigger an increase in the number of workers based on how deep the queue is and a threshold you work out from the cost of spinning up new nodes vs. the rate at which jobs are processed.
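The scaling calculation itself can stay very simple. A hedged sketch of the kind of rule meant here, with made-up numbers that you would replace with your own cost and throughput measurements:

def desired_workers(queue_depth: int,
                    jobs_per_worker_per_min: float = 2.0,
                    drain_minutes: float = 10.0,
                    min_workers: int = 1,
                    max_workers: int = 20) -> int:
    # Aim to drain the current backlog within roughly drain_minutes.
    target = queue_depth / (jobs_per_worker_per_min * drain_minutes)
    return max(min_workers, min(max_workers, round(target)))

print(desired_workers(100))  # with these assumptions, a backlog of 100 jobs -> 5 workers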
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so, you could set them up so that scaling is fully automated. You would hit the cloud function endpoint to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response as a JSON blob.
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, this could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
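If the cloud-function route still appeals, here is a minimal sketch of such a worker in the Google Cloud Functions style using functions-framework (the request fields and the processing step are placeholders, not a prescribed interface):

import functions_framework  # pip install functions-framework
from flask import jsonify

@functions_framework.http
def run_job(request):
    job = request.get_json(silent=True) or {}
    # ... fetch the input named in the job, process it, upload the result ...
    result = {"job_id": job.get("id"), "status": "success"}
    # The result comes back in the body of the HTTP response as a JSON blob.
    return jsonify(result)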
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue-based distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses Docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
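For what it's worth, here is a sketch of what enqueuing work might look like with the google-cloud-tasks Python client, pointing at a hypothetical Cloud Run worker URL (I haven't verified this end to end, so treat the project, queue, and URL as placeholders):

import json
from google.cloud import tasks_v2  # pip install google-cloud-tasks

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "jobs-queue")  # placeholder names

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-worker-abc123.a.run.app/process",  # hypothetical Cloud Run endpoint
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"job_id": 42, "image": "img_001.png"}).encode(),
    }
}

response = client.create_task(request={"parent": parent, "task": task})
print("created task:", response.name)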
When I look at a problem like this, I think through the entirety of the data path: the mapping between source image and target image, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer Python and PySpark with pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write cloud storage without two extra file-copy steps.
PySpark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in a batch or an incremental/streaming configuration. This design optimizes for end-to-end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf

# List the input files without reading their contents (the content column is dropped).
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
     )

def do_image_xform(path: str) -> str:
    # Do image transformation: read from the dbfs path, write to a dbfs path
    ...
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Each element of the iterator is a pandas Series (a batch) of file paths.
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status

Continuous delivery of a docker container to Google Cloud

Goal:
I have an application that runs inside a Docker container. I want to use continuous integration for it, with push-to-deploy from Bitbucket Pipelines to Google Cloud. I need access to an SQL database (preferably MariaDB) and some kind of caching system (be it Memcached, Redis, or something else).
Problem:
I'm entirely unsure of what services I need from said cloud provider to facilitate this as simply as possible while still being cost effective. I looked into using Google's App Engine, but (I don't know if I'm doing something weird or odd) for 1 vCPU with 1 GB of RAM and 10 GB of storage it was $55/month USD, which is far more than I really want to pay, and I'm not even sure this is what I need. I don't need much power (these are very small apps, used by a very small number of people). Also, again, I'm not sure if I'm doing something wrong, but I was unable to find a caching solution with Google that wasn't insanely expensive (Memorystore). Basically, I'm completely overwhelmed by the number of options and am just looking for a cost-effective, simple solution for continuous delivery of a Docker application to Google Cloud.
Using the tools discussed here, you can set up an end-to-end continuous delivery pipeline covering the code, build, test, deploy, and monitor phases of software development across multi-cloud, hybrid, or on-premises environments. This is likely a way to start.

What to use to orchestrate a few long running web services on few machines?

Investigating the possibilities, I am quite confused about what the best tool is for us.
We want to deploy a few web services, to start with a GitLab and a wiki.
The plan is to use Docker images for these services and to store the data externally.
These services need to be accessible from outside.
I looked into Marathon and Kubernetes, and both seemed like overkill.
A problem we face as academics is that most people only stay for about three years, and it's not our main job to administer things. So we would like an easy-to-use, easy-to-maintain solution.
We have 3-4 nodes we want to use for this, we'd like it to be fault tolerant (restarting the service on another node if one dies for example).
So to sum up:
3-4 nodes
gitlab with CI and runners
a wiki
possibly one or two services more
auto deployment, load balancing
as failsafe as possible
What would you recommend?
I would recommend a managed container service like https://aws.amazon.com/ecs/
Running your own container manager (Swarm/Kubernetes) comes with a whole host of issues that it sounds like you should avoid.

Heroku to AWS Migration Advice

From what I've gathered, there are many solutions to my problem, but I'd appreciate some suggestions on where to start. Here's the stack we're running on Heroku currently:
Rails on puma
mongoDB
elasticsearch
redis
mini_magick
What goes into the decision of using Elastic Beanstalk vs. OpsWorks vs. CloudFormation vs. just setting up everything manually myself? Also, I'd really prefer, for financial reasons, not to use some third-party service like Docker if possible. The plethora of options leaves me a little confused as to where to begin or how to even choose. Background: right now I really like Heroku because I don't have to think too much about sysadmin (on my team I'm the only developer), but we were recently given a lot of annual AWS credits, so it seems to make financial sense for us to shift over to AWS.
I want to expand on Mark’s great answer.
Available alternatives
Since you’re the sole developer, CloudFormation and OpsWorks aren’t good options for you.
With OpsWorks you’ll need to write, or at least be aware of, the Chef automation code that configures your instances. On the other hand, CloudFormation by itself isn’t enough. It will help you with creating AWS cloud resources, but you will still need to figure out how to orchestrate your application deployments, just for starters.
Neither of these options can give you everything you need to run and deploy your code like Heroku does right out of the box. You’ll need to implement parts of it by yourself.
Since rolling your own automation on top of EC2 takes even more effort than the options above, I think you have two alternatives within AWS that will fit your needs:
1) Elastic Beanstalk
It’s the closest you can get to Heroku within AWS. You might have to spend some time getting to know the platform at first since it’s not as intuitive as Heroku, but eventually Elastic Beanstalk will provide you with all the tools you need to continue running your applications without spending time on sysadmin tasks.
2) ECS + Empire
Although you mentioned that using Docker is out of the question for you, I still would like to highlight the option of using ECS, Amazon’s Docker orchestration service, as an alternative to Heroku.
By itself, ECS doesn’t provide enough automation to do everything you would expect from a PaaS. The service was intended to be used as a building block which you should extend to fit your needs.
Luckily, the guys from Remind have already done this for you. They have released an open source project called Empire, which according to its own description is “a control layer on top of ECS that provides a Heroku-like workflow”.
Empire is compatible with Heroku’s API, and its command line implements the most important features of Heroku.
Empire is an open source project, so if you choose to use it, you should be prepared to dig into its code from time to time. The documentation isn’t perfect, and although there is some traction around the project, the community isn’t very big.
Overall, it’s a good alternative to Heroku if you’re willing to run your applications using Docker -- and why shouldn’t you?
Addons
The main benefit I see to switching from Redis Labs to Amazon’s Redis service (ElastiCache), other than the fact that you have free AWS credit, is that it’s going to be easier (and cheaper) to secure access to your Redis instances when you also run your applications on AWS.
Overall, it’s relatively easy to replicate the addons you’re using with Heroku when you migrate to AWS. For the third-party addons like Elasticsearch you just continue pointing your application to the relevant endpoint. It’s a bit more complicated to replicate Heroku’s native addons like deploy hooks since you can’t continue to use them when you migrate to AWS. In these cases it’s usually possible to find alternative ways of replicating their functionality within AWS.
If you want to learn how to migrate the most common addons, I’ve written an article that details how to do that; you can find it here: how to replicate Heroku’s addons on AWS.
Hope this helps.
For your Rails app, Elastic Beanstalk is going to be very similar to Heroku. I would suggest using Elastic Beanstalk if you are already familiar with a PaaS like Heroku. It's probably going to be a bit more difficult to configure at first (there are just a lot more options you can configure), but then it will be a very similar deployment process to what you are used to.
Of course Heroku and most (probably all) of those other services you are using run on top of AWS already, so you would really just be switching from one set of services built on AWS to Amazon's own version of those services. You could possibly continue using some of the same services you are using on Heroku. For example I believe MongoLab is the recommended service for MongoDB on Heroku, and it is my preferred MongoDB-as-a-Service on AWS as well. If you want to use those AWS credits for MongoDB you will have to setup the EC2 servers and install and manage MongoDB yourself.
For Redis you could use Amazon's ElastiCache service or RedisLabs. I've found the features and price to be better with RedisLabs than ElastiCache, but you can use your AWS credits with ElastiCache.
For Elasticsearch you would probably want to use Amazon's new managed Elasticsearch service.

setting up mongodb via AWS opsworks

I am trying to set up a Rails stack on AWS OpsWorks and I want to use MongoDB as the database.
I think you set this up by creating a new custom layer and adding your Chef recipes to the relevant lifecycle hooks, but I am unsure as to which recipes to put where.
Can anyone help with how to add MongoDB via Chef to AWS OpsWorks?
I have seen there is a community MongoDB cookbook, but from what I can see it's not compatible with OpsWorks.
Does anyone have any experience of setting this up ?
Please can anyone help with this.
thanks a lot
Rick
I tried setting up a MongoDB 3-node replica set in OpsWorks a few months back. I will share a bit of my experience:
1) How to install a single MongoDB:
It is possible and easy to install a single MongoDB using the EDelight Chef MongoDB Cookbook. Just add it as a submodule in your custom OpsWorks Chef repository.
To get it to work, create a custom layer, call it MongoDB, and execute the following recipes:
SETUP: mongodb:10gen_repo
CONFIGURE: mongodb:default
This will install the latest version of MongoDB.
NOTE: I used Ubuntu instances.
2) MongoDB Best Practices
If you talk to MongoDB engineers or customer service reps, they will all tell you that the recommended setup for MongoDB is a 3-node replica set. This means one master and two read replicas, ideally in different availability zones. An ideal setup will also have lots of RAM; to give you an example, the smallest instance provided by MongoDB that you can find in the AWS Marketplace is a standard large.
You also have to consider using EBS on RAID10, maybe reserved IOPS...
See the white paper MongoDB on AWS for more info.
3) Security Considerations
Ideally you want only application instances to access the DB instances. In AWS you could create a security group with custom rules and assign EC2 instances to the group you just created... It's not quite like this when it comes to OpsWorks, as it forces you to use default security groups that have very lax restrictions. AWS will always adopt lax permissions over stricter ones.
4) Time and Money Considerations
If the recommended setup is a 3-node replica set using large instances, you are looking at at least $600 (on demand) for the DB, and this doesn't include reserved IOPS, EBS, and so on. Automating this setup is possible yet not simple. It will take time, or an expert in the subject, to get you going. If you have the resources and personnel to deal with this, go for it. If you are part of a small development team that wants to code more and do fewer operations, read on.
5) Find a reliable Managed Solution
At first I was reluctant about the idea of using a third-party company that offered MongoDB as a service. After much evaluation of the different options (managed, AWS Marketplace, OpsWorks, direct EC2 installation), I concluded that for our small team the best thing to do was to use either MongoLab or MongoHQ. They host and manage MongoDB instances of all sizes and prices. They even let you choose the hosting (AWS, Rackspace, etc.), region, and AZ. Price-wise it will be more expensive if you look at the hardware alone, but as I mentioned before, you have to consider not only the price but also the operational time MongoDB will require.
I have been there, done that, and ended up not using OpsWorks to host MongoDB. Hopefully this will save you some time and headaches.
I just tested this repo on GitHub; it works for MongoDB on OpsWorks. Go to OpsWorks > Layers, and in the "Custom Chef Recipes" section reference this GitHub link: https://github.com/Cyclic/cookbooks. Then add "yum::default" to the Setup lifecycle event, followed by "mongodb::10gen_repo", "mongodb::default", and "mongodb::10gen_remrepo", also in the Setup lifecycle event.
