How to use Dask on Databricks - dask

I want to use Dask on Databricks. It should be possible (I cannot see why not). When I import it, one of two things happens: either I get an ImportError, or, when I install distributed to fix that, Databricks just says Cancelled without throwing any errors.

For anyone looking for an answer, check this Medium blog post. I'm posting this as an answer so that people don't miss it in the comments.

I don't think we have heard of anyone using Dask under Databricks, but so long as it's just Python, it may well be possible.
The default scheduler for most Dask collections (arrays, dataframes, delayed) is the threaded one, and this is the most likely thing to work. In this case you don't even need to install distributed.
For the Cancelled error, it sounds like you are using distributed and, at a guess, the system is not allowing you to start extra processes (you could test this with the subprocess module). To work around this, you could do:
from dask.distributed import Client
client = Client(processes=False)
Of course, if it is indeed the processes that you need, this would not be great. Also, I have no idea how you might expose the dashboard's port.
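To check the guess above that the environment refuses to start extra processes, here is a minimal sketch using only the standard library. The helper name can_spawn_process is made up for illustration; it just tries to launch a trivial child Python interpreter and reports whether that worked.

```python
import subprocess
import sys


def can_spawn_process(timeout=30):
    """Return True if a trivial child Python process starts and exits cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", "print('ok')"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # OSError: the child could not be started at all.
        # SubprocessError: includes TimeoutExpired if the child hangs.
        print(f"could not run a subprocess: {exc!r}")
        return False
    return result.returncode == 0 and result.stdout.strip() == "ok"


if __name__ == "__main__":
    print("subprocesses allowed:", can_spawn_process())
```

If this reports False in a notebook cell, the processes=False workaround is the one to try first.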

Related

Is there a way to restart an Exasol database instance automatically

For reasons outlined here: https://community.exasol.com/t5/discussion-forum/performance-on-premise-dropping/td-p/9029 we need to restart a database regularly (at least until all issues are resolved, and this can take some time). So the question arises: can this be done on a regular basis without human interaction?
Lua is not a solution. A cron job might be possible, but we would need OS access for that, which we do not have.
Try the XML-RPC API: https://github.com/exasol/exaoperation-xmlrpc/blob/master/EXAoperation_XMLRPC.md#method-restartdatabase
Here is a nice example with explanations: https://community.exasol.com/t5/environment-management/starting-and-stopping-clusters-using-xml-rpc/ta-p/1579
Yes, this should be possible using the shutdownDatabase() and startDatabase() methods from this GitHub repository. You might need to use stateDatabase() in between to determine when the database has actually stopped before you try to start it again.
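The stop, wait, start sequence described above could be sketched roughly like this in Python. The method names shutdownDatabase() and startDatabase() are taken from the answer and are not verified here. Because the exact call for reading the database state and its return values are not confirmed either, the check is passed in as a hypothetical is_stopped callable that you would write against the linked API documentation.

```python
import time


def restart_database(db, is_stopped, poll_seconds=5.0, timeout_seconds=600.0,
                     sleep=time.sleep):
    """Shut the database down, wait until it reports stopped, then start it.

    db         -- proxy object exposing shutdownDatabase() and startDatabase()
                  (names as given in the answer, treated as assumptions)
    is_stopped -- hypothetical callable(db) -> bool supplied by the caller
    """
    db.shutdownDatabase()
    waited = 0.0
    while not is_stopped(db):
        if waited >= timeout_seconds:
            raise TimeoutError("database did not stop in time")
        sleep(poll_seconds)
        waited += poll_seconds
    db.startDatabase()


# Possible wiring (the URL is a placeholder, not a real endpoint layout):
#   import xmlrpc.client
#   db = xmlrpc.client.ServerProxy("https://<user>:<password>@<host>/<database-path>")
#   restart_database(db, is_stopped=my_state_check)
```

Since XML-RPC is a remote interface, a script like this can be scheduled from any machine that can reach it, so OS access to the database hosts should not be needed.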

How to set up logging on dask distributed workers?

After upgrading dask distributed to version 1.15.0, my logging stopped working.
I used logging.config.dictConfig to initialize Python's logging facilities, and previously these settings propagated to all workers. After the upgrade they no longer do.
If I call dictConfig right before every log call on every worker, it works, but that's not a proper solution.
So the question is: how do I initialize logging on every worker before my computation graph starts executing, and do it only once per worker?
UPDATE:
This hack worked on a dummy example but didn't make a difference on my system:
def init_logging():
    # logging initializing happens here
    ...

client = distributed.Client()
client.map(lambda _: init_logging, client.ncores())
UPDATE 2:
After digging through documentation this fixed the problem:
client.run(init_logging)
So the question now is: Is this a proper way to solve this problem?
As of version 1.15.0 we now fork workers from a clean process, so changes that you make to your process prior to calling Client() won't affect forked workers. For more information search for forkserver here: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
Your solution of using Client.run looks good to me. Client.run is currently (as of version 1.15.0) the best way to call a function on all currently active workers.
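For the "only once per worker" part of the question, one option is to make init_logging safe to call repeatedly. This is a minimal sketch, not the asker's actual setup: the LOGGING_CONFIG contents and the _app_logging_configured marker attribute are invented for illustration. The marker is stored on the root logger because that object lives for the whole worker process.

```python
import logging
import logging.config

# Illustrative configuration; replace with your own dictConfig settings.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}


def init_logging():
    """Configure logging in the current process; repeated calls do nothing."""
    root = logging.getLogger()
    if getattr(root, "_app_logging_configured", False):
        return False  # this process was already configured
    logging.config.dictConfig(LOGGING_CONFIG)
    root._app_logging_configured = True  # hypothetical marker attribute
    return True


# Sent to every currently running worker, as in UPDATE 2 of the question:
#   client.run(init_logging)
```

Because Client.run only reaches workers that are active at the time of the call, workers that join later would need the call repeated; the guard makes that harmless for workers that were already configured.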
Distributed Systems
It is worth noting that here you're setting up clients forked from the same process on a single computer. The trick you use above will not work in a distributed setting. In case people come to this question asking about how to handle logging with Dask in a cluster context I'm adding this note.
Generally Dask does not move logs around. Instead, it is common that whatever mechanism you used to launch Dask handles this. Job schedulers like SGE/SLURM/Torque/PBS all do this. Cloud systems like YARN/Mesos/Marathon/Kubernetes all do this. The dask-ssh tool does this.

How critical is dumb-init for Docker?

I hope that this question will not be marked as primarily opinion-based, but that there is an objective answer to it.
I have read Introducing dumb-init, an init system for Docker containers, which extensively describes why and how to use dumb-init. To be honest, for someone not too experienced with how the Linux process structure works, this sounds pretty dramatic - and it feels as if you are doing things entirely wrong if you don't use dumb-init.
This is why I'm thinking about using it within my very own Docker images… what keeps me from doing this is the fact that I have not yet found an official Docker image that uses it.
Take mongo as an example: They call mongod directly.
Take postgres as an example: They call postgres directly.
Take node as an example: They call node directly.
…
If dumb-init is so important - why is apparently nobody using it? What am I missing here?
Something like dumb-init or tini helps if your process spawns child processes and does not implement good signal handlers, for example to catch signals and stop its children when it is itself asked to stop.
If your process doesn't spawn new processes (e.g. Node.js), then this may not be necessary.
I guess that MongoDB, PostgreSQL and others that may run child processes have good signal handlers implemented. Otherwise there would have been zombie processes, and someone would have filed an issue to fix this.
The only problem may be the official language images, like node, ruby and golang. They don't include dumb-init or tini because you normally don't need them. It is up to the developer who writes bad child-process handling code either to fix the signal handlers or to use a helper as PID 1.
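To make "good signal handlers" a bit more concrete, here is a minimal Python sketch of a parent that relays SIGTERM and SIGINT to its child and then waits for it. The function name is made up for illustration, and this is not how dumb-init or tini are implemented. In particular it does not reap orphaned grandchildren that get re-parented to PID 1, which those tools also take care of.

```python
import signal
import subprocess
import sys


def run_with_signal_forwarding(cmd):
    """Run cmd as a child, relay SIGTERM/SIGINT to it, return its exit code."""
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Pass the signal on instead of exiting and leaving the child behind.
        if child.poll() is None:
            child.send_signal(signum)

    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGTERM, signal.SIGINT)
    }
    try:
        # wait() collects the exit status, so the child is not left as a zombie.
        return child.wait()
    finally:
        # Put the earlier handlers back (None means it was not set from Python).
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

A container entrypoint could then call something like run_with_signal_forwarding(["node", "server.js"]), where the command is only an example.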

What makes erlang scalable?

I am working on an article describing the fundamentals of technologies used by scalable systems. I have worked with Erlang before in a self-learning exercise. I have gone through several articles but have not been able to answer the following questions:
What is in the implementation of Erlang that makes it scalable? What makes it able to run concurrent processes more efficiently than technologies like Java?
What is the relation between functional programming and parallelization? With the declarative syntax of Erlang, do we achieve run-time efficiency?
Does process state not make it heavy? If we have thousands of concurrent users and spawn an equal number of processes as gen_server or any other equivalent pattern, each process would maintain a state. With so many processes, will it not be a drain on the RAM?
If a process has to make DB operations and we spawn multiple instances of that process, eventually the DB will become a bottleneck. This happens even if we use traditional models like Apache-PHP. Almost every business application needs DB access. What then do we gain from using Erlang?
How does process restart help? A process crashes when something is wrong in its logic or in the data. OTP allows you to restart a process. If the logic or data does not change, why would the process not crash again and keep crashing always?
Most articles sing praises about Erlang citing its use in Facebook and Whatsapp. I salute Erlang for being scalable, but also want to technically justify its scalability.
Even if I find answers to these queries on an existing link, that will help.
Regards,
Yash
In short:
It's immutable. You have no mutable variables, only terms, tuples and atoms. Program execution can be interrupted by a breakpoint at any place. Fully transactional model.
Processes are even more lightweight than .NET threads, and they are isolated.
It's made for communications. Millions of connections? Fully asynchronous? Maximum thread safety? A big cross-platform environment built for only one purpose, to scale and communicate? That is all Erlang, Ericsson's language and the first in this sphere.
You can choose some imitators like F#, Scala/Akka or Haskell. They try to copy features from Erlang, but only Erlang was born from, and built for, a single purpose: telecom.
You can find answers to the other questions on erlang.com, and I suggest you visit the handbook. Erlang was built for other aims, so it's not for every task, and if you are asking about awful things like PHP, Erlang will not be your language.
I'm no Erlang developer (yet), but from what I have read, one of the features that make it very scalable is that Erlang has its own lightweight processes, which communicate with each other by message passing. Because of this there is no shared state and no locking, unlike in, for example, a multi-threaded Java application.
Another difference from Java is that the Erlang VM garbage-collects each small process separately, so each collection is short, whereas Java garbage-collects per VM.
If database connections become a bottleneck, you could start by using a connection pooling app in front of, say, a replicated PostgreSQL cluster. If you still have bottlenecks, use a replicated NoSQL setup with Mnesia, Riak or CouchDB.
I think process restarts can be very useful when you are experiencing rare bugs that only appear randomly and only when specific criteria are fulfilled. Bugs that crash the application again as soon as you restart it should ideally be fixed, or contained with a circuit breaker so that they do not spread further.
Here is one way process restart helps: by not having to deal with all possible error cases. Say you have a program that divides numbers. Some guy enters a zero to divide by. Instead of checking for that possible error (and tons more), just code the "happy case" and let the process crash when he enters 3/0. It just restarts, and he can figure out what he did wrong.
You can extend this to an endless number of situations (attempting to read from a non-existent file because the user misspelled its name, etc.).
The big reason for process restart being valuable is that not every error happens every time, and checking that it worked is verbose.
Error handling is verbose typically, so writing it interspersed with the logic handling doing a task can make it harder to understand the code. Moving that logic outside of the task allows you to more clearly distinguish between "doing things" code, and "it broke" code. You just let the thing that had a problem fail, and handle it as needed by a supervising party.
Since most errors don't mean that the entire program must stop, only that that particular thing isn't working right, by just restarting the part that broke, you can keep operating in a state of degraded functionality, instead of being down, while you repair the problem.
It should also be noted that the failure recovery is bounded. You have to lay out the limits for how much failure in a certain period of time is too much. If you exceed that limit, the failure propagates to another level of supervision. Each restart includes doing any needed process initialization, which is sometimes enough to fix the problem. For example, in dev, I've accidentally deleted a database file associated with a process. The crashes cascaded up to the level where the file was first created, at which point the problem rectified itself, and everything carried on.
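The bounded restart idea can be illustrated outside Erlang as well. The following is only a Python analogy, not how OTP supervisors are implemented: supervise and its parameters are invented names, and re-raising the exception stands in for passing the failure up to the next level of supervision.

```python
import time


def supervise(task, max_restarts=3, window_seconds=5.0, clock=time.monotonic):
    """Run task(); restart it on failure, but give up if it fails too often.

    More than max_restarts failures within window_seconds re-raises the last
    error, which plays the role of escalating to a higher-level supervisor.
    """
    failures = []  # timestamps of recent failures
    while True:
        try:
            return task()  # the "happy case": no error handling inside the task
        except Exception:
            now = clock()
            # Forget failures that fall outside the sliding window.
            failures = [t for t in failures if now - t < window_seconds]
            failures.append(now)
            if len(failures) > max_restarts:
                raise  # too many failures in the window: escalate
            # Otherwise loop around, i.e. restart the task from scratch.
```

A task that fails only occasionally gets restarted and eventually succeeds, while a task that fails on every attempt, such as one that always computes 3 / 0, exhausts the limit and the error surfaces to whoever called supervise.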

Are you using AWSDBProxy? Is there a performance hit when scaling out?

It seems that the only tutorials out there about using Amazon's SimpleDB in a Rails site use AWSDBProxy... Personally, I find this counter-intuitive for scaling out, considering the server layout of a typical Rails site below (using AWSDBProxy):
Plugin here: http://agilewebdevelopment.com/plugins/aws_sdb_proxy
Image here: http://www.freeimagehosting.net/uploads/91be4e0617.png
As you can see, even if we add more mongrels, we have two problems.
We have a single point of failure far less stable than our load balancer
We have to force all our information through this one WEBrick server
The solution is, of course, to add more AWSDBProxies... but why not then just use the following code in, say, a class, skipping the proxy altogether?
service = AwsSdb::Service.new(Logger.new(nil),
                              CONFIG['aws_access_key_id'],
                              CONFIG['aws_secret_access_key'])
service.query(domain, query)
So what I'm getting at is: if you are using AWSDBProxy, what are your justifications for it? And if you are indeed using it, what is your performance like? If you have hard numbers, this would be even more appreciated!
I'm not using it, nor have I ever heard of it, but this is what I would think are reasonable reasons.
You're running your main app server on EC2, so the chance of Internet FAIL doesn't really affect you more than once.
You run one proxy on each of your app servers. So its connection going down is no worse than its connection(s) to the database going down.
Because it can be done. This is as good a reason as any in an open source project. Sometimes it takes building a thing before you know whether said thing is a good/bad idea.
You don't have the traffic levels to need a load balancer. Then your diagram squashes down to a line, if not a single machine.
