I have an issue with a rails worker that is consuming extreme amounts of processor time. Oddly I have not been able to trace it out so far. I've tried to use New Relic, however I can't seem to trace it down within the worker itself.
How can I profile and really explore performance issues in detail so as to find precise locations of performance problems such as this?
Have you tried Ruby-Prof? It's easy to use and seems to return accurate numbers. (Of course it's hard to argue with it when running an app that's busy.)
Related
I am working on an article describing fundamentals of technologies used by scalable systems. I have worked on Erlang before in a self-learning excercise. I have gone through several articles but have not been able to answer the following questions:
What is in the implementation of Erlang that makes it scalable? What makes it able to run concurrent processes more efficiently than technologies like Java?
What is the relation between functional programming and parallelization? With the declarative syntax of Erlang, do we achieve run-time efficiency?
Does process state not make it heavy? If we have thousands of concurrent users and spawn and equal number of processes as gen_server or any other equivalent pattern, each process would maintain a state. With so many processes, will it not be a drain on the RAM?
If a process has to make DB operations and we spawn multiple instances of that process, eventually the DB will become a bottleneck. This happens even if we use traditional models like Apache-PHP. Almost every business application needs DB access. What then do we gain from using Erlang?
How does process restart help? A process crashes when something is wrong in its logic or in the data. OTP allows you to restart a process. If the logic or data does not change, why would the process not crash again and keep crashing always?
Most articles sing praises about Erlang citing its use in Facebook and Whatsapp. I salute Erlang for being scalable, but also want to technically justify its scalability.
Even if I find answers to these queries on an existing link, that will help.
Regards,
Yash
Shortly:
It's unmutable. You have no variables, only terms, tuples and atoms. Program execution can be divided by breakpoint at any place. Fully transactional model.
Processes are even lightweight than .NET threads and isolated.
It's made for communications. Millions of connections? Fully asynchronous? Maximum thread safety? Big cross-platform environment, which built only for one purpose — scale&communicate? It's all Ericsson language — first in this sphere.
You can choose some impersonators like F#, Scala/Akka, Haskell — they are trying to copy features from Erlang, but only Erlang born from and born for only one purpose — telecom.
Answers to other questions you can find on erlang.com and I'm suggesting you to visit handbook. Erlang built for other aims, so it's not for every task, and if you asking about awful things like php, Erlang will not be your language.
I'm no Erlang developer (yet) but from what I have read about it some of the features that makes it very scalable is that Erlang has its own lightweight processes that are using message passing to communicate with each other. Because of this there is no such thing as shared state and locking which is the case when using for example a multi threaded Java application.
Another difference compared to Java is that the Erlang VM does garbage collection on every little process that is running which does not take any time at all compared to Java which does garbage collection only per VM.
If you get problem with bottlenecks from database connection you could start by using a database pooling app running against maybe a replicated PostgreSQL cluster or if you still have bottlenecks use a multi replicated NoSQL setup with Mnesia, Riak or CouchDB.
I think process restarts can be very useful when you are experiencing rare bugs that only appear randomly and only when specific criteria is fulfilled. Bugs that cause the application to crash as soon as you restart the app should optimally be fixed or taken care of with a circuit breaker so that it does not spread further.
Here is one way process restart helps. By not having to deal with all possible error cases. Say you have a program that divides numbers. Some guy enters a zero to divide by. Instead of checking for that possible error (and tons more), just code the "happy case" and let process crash when he enters 3/0. It just restarts, and he can figure out what he did wrong.
You an extend this into an infinite number of situations (attempting to read from a non-existent file because the user misspelled it, etc).
The big reason for process restart being valuable is that not every error happens every time, and checking that it worked is verbose.
Error handling is verbose typically, so writing it interspersed with the logic handling doing a task can make it harder to understand the code. Moving that logic outside of the task allows you to more clearly distinguish between "doing things" code, and "it broke" code. You just let the thing that had a problem fail, and handle it as needed by a supervising party.
Since most errors don't mean that the entire program must stop, only that that particular thing isn't working right, by just restarting the part that broke, you can keep operating in a state of degraded functionality, instead of being down, while you repair the problem.
It should also be noted that the failure recovery is bounded. You have to lay out the limits for how much failure in a certain period of time is too much. If you exceed that limit, the failure propagates to another level of supervision. Each restart includes doing any needed process initialization, which is sometimes enough to fix the problem. For example, in dev, I've accidentally deleted a database file associated with a process. The crashes cascaded up to the level where the file was first created, at which point the problem rectified itself, and everything carried on.
Or rather, why aren't there better tools for profiling memory in ruby, specifically rails apps?
Recently our rails app (hosted on heroku) has started seeing lots of R14 errors in the worker dynos. This means we're running out of memory. Bumping the dynos to 2x (512mb -> 1GB) only alleviated the problem temporarily, leading me to believe there is a memory leak somewhere. Naturally, my next step was to find a good profiling gem that can help me discover the source of the leak.
Maybe I'm just ignorant of the tools available, or maybe I just don't know how to use the ones I have. My wish is that I could install a gem and then run reports on the memory usage statistics. Hitting an endpoint to get a report is not really viable as my memory issues are isolated to worker dynos running delayed jobs.
I've looked at memprof, but it's 1.8 only.
I've looked at ruby-prof (awesome), but the memory profiling requires a patched ruby intepreter.
I've looked at GC::Profiler, but I don't understand how to find memory leaks with it.
So, is it just plain difficult to find memory leaks in ruby? Or am I missing the point somehow?
Depending on your "type" of leak, You can run valgrind against ruby. Might require recompilation again though. In general it's hard because ruby does method allocation without firing any events, by default, so it's tricky to track. See also perftools.rb project, which somewhat works around this limitation.
Recently I've been having some success with Skylight to profile web and background worker methods and then hunt down opportunities for optimisation. It probably wasn't around when you posted this question. The downside is that it only really helps you debug in staging or production, not development environments, so the dev loop can be very slow.
Make sure you install both skylight-ruby and sidekiq-skylight to get profiling on both your web server and background workers if you're using Sidekiq.
Good luck!
There is a nice way if your application is running on OS with dtrace or similar technology like SystemTap. In my case, we use RHEL/CentOS which has the latter:
https://lukas.zapletalovi.com/2016/08/probing-ruby-20-apps-with-systemtap-in-rhel7.html
You can easily connect to production application and "inject" profiling code for a moment and track calls, memory, CPU time or I/O and then "disconnect" at any time. It's very efficient so you will not likely notice any drastice slowdown (unless you screw your script up).
I disagree that memory profiling in Ruby is hard. The JVM has some of the best memory profiling tools on the planet, and you can run your Ruby programs on the JVM. Don't reinvent the wheel.
Browsing Memory the JRuby Way
Finding Leaks in Ruby Apps with Eclipse Memory Analyzer
Monitoring Memory with JRuby, Part 1: jhat and VisualVM
Monitoring Memory with JRuby, Part 2: Eclipse Memory Analyzer
My rails app, according to my heroku logs, is serving requests on average of about 1700 to 2500 milliseconds (this is the entire roundtrip). I used new relic to profile my app, and it seems that the majority of the request is not spent in my database but rather in the "Web Transaction" section of New Relic. It seems like the "Controller" category tends to be the slowest among requests, followed by the "SQL - SELECT" segment in the "Database" category.
I'm not quite sure what could be causing my performance bottleneck in my controllers, nor do I think I can dive deeper into new relic without paying for the premium version. I recently added indexes to the foreign keys of my application, although I do not think this made much of a difference in terms of database response times.
I know this is not enough information to figure out what is causing these bottlenecks, but I do not even know where to start or what info to give. If people could tell me what info is needed to diagnose these issues, then that would be helpful to me.
New Relic for Ruby includes a free, standalone developer mode. When running in RAILS_ENV=development, the New Relic gem adds a route that will show you a detailed profile for each request. Go to http://localhost:3000/newrelic after you hit your app a few times.
The profile includes time for each SQL query, as well as for components of your code. You can use custom instrumentation to break down big chunks of code into smaller segments (or individual methods) that get timed separately. This feature is a lot like the transaction traces you get in the paid Pro version, one major difference being that you wouldn't want to run the free dev mode in production.
(Full disclosure: I work for NR. Not many people know about the free dev mode, though, so I thought it was worth mentioning.)
You could potentially make Javascript loading appear even faster with something like head.js, which will load your JS files asynchronously and in parallel.
Take a look at this slide show:
http://www.slideshare.net/drhenner/optimize-the-obvious-7636674
Might not be enough but it goes through some common faults.
Digging a little deaper take a look at this video: http://windycityrails.org/videos2011/#2
It is longer but gives a lot of places to look.
On a different note. Do you use a CDN?
How is Erlang fault tolerant, or help in that regard?
I think I covered part of the answer in this reply to another thread.
Erlang is fault tolerant with the following things in mind:
Erlang knows that errors WILL happen, and things will break, so instead of guarding against errors, Erlang lets you have strong tools to minimize impact of errors and recover from them as they happen.
Erlang encourages you to program for success case, and crash if anything goes wrong without trying to recover partially broken data. The idea behind this is that partially incorrect data may propagate further in your system and may get written to database, and thus presents risk to your system. Better to get rid of it early and only keep fully correct data.
Process isolation in Erlang helps with minimizing impact of partially wrong data when it appears and then leads to process crash. System cleans up the crashed code and its memory but keeps working as a whole.
Supervision and restart strategies help keep your system fully functional if parts of it crashed by restarting vital parts of your system and bringing them back into service. If something goes very wrong such that restarts happen too much, the system is considered broken beyond repair and thus is shut down.
Caveat: I am an Erlang noob.
#Daniel's answer is essentially correct. I strongly suggest that you take the time to read Erlang creator Joe Armstrong's thesis (Making reliable distributed systems in the presence of software errors). The thesis provides a good explanation of the need for, and the solution to, developing robust distributed systems. I believe the paper will answer your question satisfactorily.
Erlang makes it easy to create many, small processes, and to monitor those processes. When one of those processes crashes, it may be possible to restart that part of the system without needing to bring the whole thing down.
You may have seen something like this in modern versions of Windows: the system can restart the graphics driver if it crashes; it doesn't kill the whole system.
To make it easier to write fault-tolerant applications, Erlang provides the concept of supervisor processes. These processes monitor a number of child processes, and know how to respond if a child dies. You might create a whole supervision tree, so that you have fine control about how different parts of the application behave. You can read more in the Erlang documentation.
I'm running a Rails app through Phusion Passenger (mod_rails) which will run smoothly for a while, then suddenly slow to a crawl (one or two requests per hour) and become unresponsive. CPU usage is low throughout the whole ordeal, although I'm not sure about memory.
Does anyone know where I should start to diagnose/fix the problem?
Update: restarting the app every now and then does fix the problem, although I'm looking for a more long-term solution. Memory usage gradually increases (initially ~30mb per instance, becomes 40mb after an hour, gets to 60 or 70mb by the time it crashes).
New Relic can show you combined memory usage. Engine Yard recommends tools like Rack::Bug, MemoryLogic or Oink. Here's a nice article on something similar that you might find useful.
If restarting the app cures the problem, looking at its resource usage would be a good place to start.
Sounds like you have a memory leak of some sort. If you'd like to bandaid the issue you can try setting the PassengerMaxRequests to something a bit lower until you figure out what's going on.
http://www.modrails.com/documentation/Users%20guide%20Apache.html#PassengerMaxRequests
This will restart your instances, individually, after they've served a set number of requests. You may have to fiddle with it to find the sweet spot where they are restarting automatically before they lock up.
Other tips are:
-Go through your plugins/gems and make sure they are up to date
-Check for heavy actions and requests where there is a lot of memory consumption (NewRelic is great for this)
-You may want to consider switching to REE as it has better garbage collection
Finally, you may want to set a cron job that looks at your currently running passenger instances and kills them if they are over a certain memory threshold. Passenger will handle restarting them.