Memory leak in stateless session - memory

I am working with Business Central version 7.61.0.Final and KIE Server version 7.61.0.Final. I created my project using Decision Tables and deployed it to KIE Server. I execute my rules through the KIE Server REST API. The KIE session is stateless, so the facts must be automatically removed at the end of the request.
kie base
kie session
My problem is that the memory is increasing and never eliminated. So, when I run stress tests (for example, 1000 requests) the server collapse.
The memory usage before tests is 31% and after tests (1000 requests) is 74%
memory before tests
memory after tests
I tried to change the deployment settings (Runtime Strategy equal to Per request, Persistence Mode equal to None, and Audit Mode equal to None) to these values, but the memory continues to increase
deployment settings 1
deployment settings
I hope you can help me, thanks!

Related

Heroku configuration for Ruby on Rails application

I’ve set a client up with Heroku for their Ruby on Rails application and have had a great deal of trouble over the years with their application not running well regardless of how much money we spend on additional resources, find their documentation highly confusing. I’ve never been able to understand their specific terminology and documentation. We are constantly getting "H12" errors and "R14" errors etc. The memory usage and dyno loads are constantly spiking. And yet this is a small to medium-sized business without a massive amount of traffic. Wondering if anybody out there who does understand the ins and outs of Heroku can look this configuration over and tell me if it makes sense:
DB_POOL: 10
MALLOC_ARENA_MAX: 2
RAILS_MAX_THREADS: 5
WEB_CONCURRENCY: 4
Ruby 2.7
Rails 6.0
Puma
8 2x web dynos
5 1x worker dynos
$50 Postgres standard 0 database
$15 Memcachier
$10 Rediscloud
...etc addons
Your WEB_CONCURRENCY is too high for your Standard-2x dynos. The recommended default is 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
This is likely contributing to your R14 errors as higher web concurrency means more memory usage. So you need to either lower your web concurrency (which may mean you also need to increase the # of dynos to compensate) or you need to use bigger dynos.
You already have MALLOC_ARENA_MAX=2 but not sure if you are using jemalloc. You might want to try that too.
Of course, you may also have other memory issues in your app - check out some tips here. I also recommend adding a monitoring tool like AppSignal as it's capable of tracking memory allocations per transaction.
For mitigating H12s:
Ensure you have installed something like the rack-timeout gem, which ensures that a long-running request is dropped at the dyno-level and thus avoids the H12 error (you get a Rack::TimeoutError exception instead). Set the timeout to 15s so that it is well under the 30s for H12 timeout.
Investigate your slow transactions. A monitoring tool is key here, i.e. New Relic (start with lowest-priced paid plan - free plan does not allow transaction tracing). Here is their blog post on how to trace transactions
When you've identified the problem - fix it!
if the bottleneck is external:
check for external API limits and throttling
add timeouts and make app resilient to slow external responses
if the bottleneck is due to the database:
optimize slow queries
check cache hit rates
check for the # of waiting connections and db locks -> if the number of waiting connections is consistently above 0 for X minutes, that indicates you have some long locks that you'll need to investigate. Waiting connections is easiest to track over time with Librato (free plan should do fine)
if the bottleneck is other app code:
add more custom instrumentation to get more insights, i.e. New Relic instructions
address app code issues
I want to stress the importance of monitoring tools to help diagnose issues and help determine optimal resource usage. Doing things like figuring out the correct concurrency configs, the correct size and # of dynos to run are virtually impossible without proper monitoring tools. Hopefully you have some already that are covered by your etc add-ons that are not listed, but if you do not, I'll summarize my recommendations and mention a couple other tips:
To get more metrics info, ensure you have enabled log-runtime-metrics
Also enable Ruby language metrics
Add a monitoring tool that can track Ruby memory allocations like AppSignal. Scout APM can do this too but I think their plans capable of this are more expensive (requires Scout Insights feature)
Add the lowest-paid version of New Relic. This is my go-to tool for transaction tracing. AppSignal can do this too if you don't want to pay for another tool, but I find it easier with New Relic.
Add Librato. It offers some great charts out of the box, including a set of Postgres charts in its own dashboard.
Set alerts in your monitoring apps to warn you about things like response times so you can look into them!
And of course, make all your changes in staging first AND load test them to see the impacts of your changes before attempting in production!
Update: I also just noticed that you said you are using Standard-0 Postgres, which means it has a 120 connection limit. So if you end up lowering your WEB_CONCURRENCY and increasing the # of dynos, watch out for your total connections to that database. Beyond just the fact that there is a limit, more connections also mean more overhead for your db anyway so if you are close to your connection limit, you are more likely to see db performance suffer. You may want to upgrade to another plan that has a higher connection limit or use pgbouncer as your connection pooler to avoid connection limits.

Application Pool Occasionally Spiking Memory Consumption

We have just launched a new MVC5 web site. The site uses Entity Framework for its data and also implements a couple of WebApi services for some simple AngularJS pages used on the web site.
The site has gone through development and testing without a problem, but now it is installed on an IIS 8.5 production server we are seeing the following entries in the IIS (WAS) event logs:
Here is first error:
A worker process serving application pool 'xxx' has requested a recycle
because it reached its private bytes memory limit.
Around 90 seconds later we see this error:
A worker process '4880' serving application pool 'xxx' failed to stop
a listener channel for protocol 'http' in the allotted time. The data
field contains the error number.
Which is immediately (the same time to the second) followed by a third error:
A process serving application pool 'xxx' exceeded time limits during
shut down. The process id was '4880'.
Finally, we see another Application Pool reccycle event:
A worker process serving application pool 'xxx' has requested a recycle
because it reached its private bytes memory limit.
We are currently seeing this problem approximately once per day and it does not seem to be related to site traffic/loading.
The reason we set the Application Pool to recycle on a Private Bytes consumption exceeded 4,194,304 KB (4 GB) - it normally (for perhaps 36 hours) sits at less than 1 GB, was because we had noticed that occasionally the Application Pools Private Memory consumption would increase linearly. Again we did not see this during development or local testing.
We have tried running load tests of several hundred concurrent users across the application, but have been unable to replicate this error sequence.
We have also run the application locally for extended periods of time with ReSharper's dotMemory profiler and memory snapshots do not reveal any problems.
Are there any tools/techniques available that we can run on the production server that would give us more information on what is happening?

ActiveRecord::QueryCache#call taking over 70% of execution time

NewRelic is showing me that over 80% of execution time in the app server is taking place in "Middleware ActiveRecord::QueryCache#call"
Here is a gist of the relevant code tested (although I see similar results on other API endpoints).
Gist
I'm running the app server on AWS Elastic Beanstalk on a t2.medium instance and a t2.small Postgres RDS DB with max_connections set to 100. I'm testing this via loader.io, doing a test of 100 users with the maintain client load setting (this means about 6000 requests a minute).
Does anyone have an idea why the QueryCache is taking so much time?
Unfortunately, this issue with QueryCache is quite common and seems to have multiple causes, but the most common is that the connection between your EC2 app server and DB was temporarily severed, and QueryCache doesn't handle this particularly well.
Remedies include increasing your default connection pool size substantially (e.g. an order of magnitude higher), disabling QueryCache entirely, or increasing read_timeout in database.yml to 15 seconds or more depending on your environment.
If the read_timeout setting resolves the problem, you may want to investigate why there are so many disconnects between your app server and db.
Another path which might not be an option for you would be to run the app server on the same machine as the db, but that doesn't work for everyone due to their architecture. It certainly can be an effective test to see if eliminating the network variable helps. Good luck.

How can I get Jelastic to sleep?

Yesterday I got a trial account on webhosting.net's Jelastic v2.2.2 and configured an environment with a minimum of 0 cloudlets (max 8, i.e., all dynamic, no reserved). Then I deployed a Grails war which was using 3 cloudlets after it started up (around 350 MB). It worked great, and I was very impressed.
However, I did not access my app overnight, and the billing history shows it kept using 3 dynamic cloudlets every hour, even with 0 requests (i.e., 0 MB paid traffic) for 14 hours. Is there some way I can get my Jelastic environment to sleep (i.e., hibernation) after some period with no requests (e.g., after an hour or two)? Then, when it gets a request, I'd like it to automatically wake up (i.e., allocate some cloudlets and restore memory from disk). I see how to stop and restart it manually, but I would like it to work automatically, for any requester.
edit: I found the following documentation, but does it not work for Tomcat/Grails?
Hibernation
Jelastic’s hibernation feature delivers even better utilization of cluster resources. Optimal use of resources is achieved by suspending non-active containers and returning released resources back to the cluster.
Because they are in sleep mode, hibernated containers do not consume resources (only disk space). As a result you save money while your containers are in hibernate mode. If applications are needed again the platform returns them to a running state again in just a few seconds.
It takes a little time to awaken your environment from sleep, so it's not suitable to work how you describe for production use - you would effectively lose visitors because it would seem like your service is offline due to the delays for that first access.
For that reason the 'sleep' function is only active for trial accounts, and the inactivity time before sleep is set by the hosting provider (so you should contact them directly for help on that point).
Of course you should also remember that accesses from search engine spiders etc. may keep your environment awake.

Ruby on Rails: How would I handle 10 concurrent users? Do I need more CPU?

Sorry if this might seem obvious. I've monitored that a web request on my Rails app uses 30-33% of CPU every time. For example, if I load a web page, then 30% of CPU is used. Does that mean that my box can only handle 3 concurrent web requests, and will stall if there are more than 3 web requests (i.e. I'll get a 100% CPU)?
If so, does that also mean that if I want to handle more than 3 concurrent web requests, then I'll have to get more servers to handle the load using a load balancer? (e.g. to handle 6 concurrent web requests, I'll need 2 servers; for 9 concurrent requests, I'll need 3 servers; for 12, I'll need 4 servers -- and so on?)
I think you should start with load tests. I wouldn't trust manual testing that much.
Load tests tell you how long the response takes for each client, and how many clients
simply time-out.
Also you will be able to measure the improvements objectively for any changes that you make.
Look at ab, or httperf; there are many other tools available.
Stephan
Your Apache or Nginx in front of the Passenger will queue requests until a Passenger worker becomes available. You can limit the number of concurrent workers so your server never stalls (but new visitors will have to wait longer until it's their turn).
It's difficult to tell based on this information. It depends very much on the web server stack you're using and which environment you're running. Different servers (Mongrel, Webrick, Apache using various mechanisms, Unicorn) all have different memory characteristics. Different environments (development vs. test vs. production) all exhibit radically different memory usage characteristics.

Resources