After upgrading of dask distributed to version 1.15.0 my logging stopped working.
I've used logging.config.dictConfig to initialize python logging facilities, and previously these settings propagated to all workers. But after upgrade it doesn't work anymore.
If I do dictConfig right before every log call on every worker it works but it's not a proper solution.
So the question is how it initialize logging on every worker before my computation graph starts executing and do it only once per worker?
UPDATE:
This hack worked on a dummy example but didn't make a difference on my system:
def init_logging():
# logging initializing happens here
...
client = distributed.Client()
client.map(lambda _: init_logging, client.ncores())
UPDATE 2:
After digging through documentation this fixed the problem:
client.run(init_logging)
So the question now is: Is this a proper way to solve this problem?
As of version 1.15.0 we now fork workers from a clean process, so changes that you make to your process prior to calling Client() won't affect forked workers. For more information search for forkserver here: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
Your solution of using Client.run looks good to me. Client.run is currently (as of version 1.15.0) the best way to call a function on all currently active workers.
Distributed Systems
It is worth noting that here you're setting up clients forked from the same process on a single computer. The trick you use above will not work in a distributed setting. In case people come to this question asking about how to handle logging with Dask in a cluster context I'm adding this note.
Generally Dask does not move logs around. Instead, it is common that whatever mechanism you used to launch Dask handles this. Job schedulers like SGE/SLURM/Torque/PBS all do this. Cloud systems like YARN/Mesos/Marathon/Kubernetes all do this. The dask-ssh tool does this.
Related
I want to use Dask on Databricks. It should be possible (I cannot see why not). If I import it, one of two things happens, either I get an ImportError but when I install distributed to solve this DataBricks just says Cancelled without throwing any errors.
Anyone looking for an answer, check this medium blogpost. To prevent people from missing this in comments, I'm posting this as an answer.
I don't think we have heard of anyone using Dask under databricks, but so long as it's just python, it may well be possible.
The default scheduler for Dask is threads, and this is the most likely thing to work. In this case you don't even need to install distributed.
For the Cancelled error, it sounds like you are using distributed, and, at a guess, the system is not allowing you to start extra processes (you could test this with the subprocess module). To work around, you could do
client = dask.distributed.Client(processes=False)
Of course, if it is indeed the processes that you need, this would not be great. Also, I have no idea how you might expose the dashboard's port.
When using Dask normally things work fine. However, when I use Dask with an adaptive cluster I find that sometimes all the tasks get assigned to a single worker. Why is this?
This should be considered a usability bug, and it would be reasonable to file an issue about it.
However, to explain what is going on (at least today 2018-08-09) probably what happens is that
Your scheduler first has no tasks and so has no workers assigned to it
You submit a lot of work from a client, the scheduler responds and asks for many workers
The first worker arrives and the scheduler hands it all of the work
Milliseconds later, several other workers arrive. The scheduler then proceeds to load balance between the available workers
Ideally, the load balancing heuristics should handle the situation. There were older versions of Dask where this performed less well, but usually this is fine. I recommend first updating your version of the dask and distributed packages to the newest possible releases and if that doesn't work, report an issue with a minimal example if possible.
I want to send email batch at specific time like CRON.
I think whenever gem (https://github.com/javan/whenever) is not to fit in Cloud Foundry Environment. Because Cloud Foundry can't use crontab.
Please inform me what options are available to me.
There's a node.js app here that you could use to schedule a specific rake task.
I haven't worked with cloudfare so I'm not sure if it'll serve your needs, but you can also try some of the batch job processing tools rails has available: Delayed job and sidekiq. Those store data for recurring jobs either on your database (DJ) or in a separate redis database (Sidekiq) and both need keeping extra processes up and running, so review them deeply and the changes you'd need for your deployment process before using each one. There's also resque, and here's a tutorial to use it with rails for scheduling tasks.
There are multiple solutions here, but the short answer is that whatever you end up doing needs to implement its own scheduler. This is because there is no cron service available to your application when it runs on CF. This means there is nothing to trigger or schedule your actions. Any project or solution that depends on cron will not work when deploying to CF. Any project that implements it's own scheduler should work fine.
Some specific things I've seen people do successfully:
Use a web service that sends HTTP requests to your app on predefined intervals. The requests trigger your action. It's the services responsibility to let you define when to trigger and to send the HTTP requests. I'm intentionally avoiding mentioning any specific services, but you can find them by searching for "cron http service" or something like that.
Importing a library that has cron like functionality. I'm not familiar with Ruby, so I don't know the landscape there. #mlabarca has mentioned a couple that you might try out. Again, look to see that they implement the scheduling functionality and do not depend on cron. I'm more familiar with Java where you have Quartz and Spring, which has some scheduling functionality too.
Implement a "clock" process or scheduler. This would generally be a second app that you deploy on CF. It would be lightweight and probably not have a web interface. It could be as simple as do something, sleep, loop for ever repeating those two steps. It really depends on your needs. You could even get fancy and implement something like the first option above where you're sending some sort of request to your other apps to trigger the actual events.
There are probably other solutions as well, those are just some examples to get you started.
Probably also worth mentioning that the Cloud Controller v3 API will have first class features to run tasks. In this case, the "task" is some job that runs in a finite amount of time and exits (like a batch job). This is opposed to the standard "app" that when run on CF should continue executing forever (i.e. if it exits, it's cause of a crash). That said, I do not believe it will include a scheduler so you'd still need something to trigger the task.
I am trying to create automated edits to the database in firebase. Is there a way to do that on the server-side? I am new to iOS development and swift so any help would be greatly appreciated.
Also, I've tried Zapier but the service is not specific enough for my needs.
Yes - Firebase has quite a flexible set of options for server-side updates and it is simple enough to schedule a cronjob to connect to firebase and perform some scheduled update or edits.
The most generic approach is to use the REST API to perform your updates although there are specific libraries to support Node and other platforms.
It is worth being aware of the recent major upgrade to version 3 of Firebase which introduced quite a few significant changes - it can be easy to confuse the older examples floating around with the new API so be aware of the differences as you put together your first proof of concept examples.
I assume that you are looking to run on your own server although another alternative is to use a container hosting environment ( Google Apps etc ).
If you have your own server and are looking to integrate I would suggest starting with:
https://firebase.google.com/docs/server/setup#prerequisites
Then perhaps a quick look at:
https://firebase.googleblog.com/docs/web/quickstart.html
and
https://www.firebase.com/docs/rest/
If you are just getting started I would suggest a first task being to authenticate, retrieve and update a Firebase record.
You can configure server auth keys through the FB console and use these as part of you authentication process.
If you are unfamiliar with JWT then it is worth spending a little time getting up to speed on this and working through the examples at https://www.firebase.com/docs/rest/guide/user-auth.html
Further to your comment:
So the first approach that comes to mind is to run some kind of scheduled job in your Cron which would connect using the REST API, perform some kind of query on the existing data to identify those records that require an update and remove or modify them.
Giving a little more though you could extend this approach without having to run at a recurring period less than the minimal anticipated deletion time you could run the scheduler just to clean up at some longer period but filter your results to the client so that you are not including stale data. This approach is discussed a little at Firebase chat - removing old messages
Getting the right solution to your particular scenario will depend a lot on how well you structure your data which can be counter-intuitive; particularly for users who have come from an RDBMS background.
There may be an inclination to keep the data slim and unpolluted with old irrelevant data however Firebase is quite good at managing large minimally structured data and the overhead of this bloat may not be as bad a thing as you may think.
If the filtering itself isn't sufficient and you don't have a server that you can CRON a cleanup process then you can implement a firebase worker process in Node or similar and have this running on a container service such as Heroku or Google Apps. See Firebase push notifications - node worker for some ideas on how to approach this.
When asked Google advised that they didn't advise on where best to host worker services but they did mention both Google App Engine and Heroku.
Another approach if you don't want to implement and host a watcher/worker process is to simply include some code in the client that checks for and removes stale data periodically.
The firebase Queue is very cool but may be a bit of an overkill for simply expiring stale data.
I'm coming from a PHP environment (at least in terms of web dev) and into the beautiful world of Ruby, so I may have some dumb questions. I imagine there are some fundamentally different options available when not using PHP.
In PHP, we use memcache to store alerts we want to display in a bar along the top of the page. When something happens that generates an alert (such as a new blog post being made), a cron script that runs once every 5 minutes or so puts that information into memcache.
Now when a user visits the site, we look in memcache to find any alerts that they haven't already dismissed and we display them.
What I'm guessing I can do differently in Rails, is to by-pass the need for a cron script, and also the need to look in memcache on every request, by using a Singleton and a polling process running in a separate thread to copy from memcache to this singleton. This would, in theory, be more optimized than checking memcache once-per-request and also encapsulate the polling logic into one place, rather than being split between a cron task and the lookup logic.
My question is: are there any caveats to having some sort of runloop in the background while a Rails app is running? I understand the implications of multithreading, from Objective-C/Java, but I'm asking specifically about the Rails (3) environment.
Basically something like:
class SiteAlertsMap < Hash
include Singleton
def initialize
super
begin_polling
end
# ... SNIP, any specific methods etc ...
private
def begin_polling
# Create some other Thread here, which polls at set intervals
end
end
This leads me into a similar question. We push (encrypted) tasks onto an SQS queue, for things related to e-commerce and for long-running background tasks. We don't use cron for this, but rather we have a worker daemon written in PHP, which runs in the background. Right now when we deploy, we have to shut down this worker and start it again from the new code-base. In Rails, could I somehow have this process start and stop with the rails server (unicorn) itself? I don't think that's something I'd running on the main process in a separate thread, since we often want to control it as a process by itself, but it would be nice if it just conveniently ran when the web application was running.
Threading for background processes in ruby would be a terrible mistake, especially since you're using a multi-process server. Using unicorn with say 4 worker processes would mean that you'd be polling from each of them, which is not what you want. Ruby doesn't really have real threads, it has green threads in 1.8 and a global interpreter lock in 1.9 IIRC. Many gems and libraries are also obnoxiously unthreadsafe.
Using memcache is still your best option and, if you have it set up correctly, you should only see it adding a millisecond or two to the request time. Another option which would give you the benefit of persisting these alerts while incurring minimal additional overhead would be to store these alerts in redis. This would better protect you against things like memcache crashing or server reboots.
For the background jobs you should use a similar approach to what you have now, but there are several off the shelf handlers for this like resque, delayed_job, and a few others. If you absolutely have to use SQS as the backend queue, you might be able to find some code to help you, but otherwise you could write it yourself. This still requires the other daemon to be rebooted whenever there is a code change. In practice this isn't a huge concern as best practices dictate using a deployment system like capistrano where a rule can easily be added to bounce the daemon on deploy. I use monit to watch the daemon process, so restarting it is as easy as telling monit to restart it.
In general, Ruby is not like Java/Objective-C when it comes to threads. It follows the more Unix-like model of process based isolation, but the community has come up with best practices and ways to make this less painful than in other languages. Ruby does require a bit more attention to setting up its stack as it is not as simple as enabling mod_php and copying some files around, but once the choices and architecture is understood, it is easier to reason about how your application works. The process model, in my opinion, is much better for web apps as it isolates code and state from the effects of other running operations. The isolation also makes the app easier to work with in a distributed system.