Bulk publishing Issue in Fatwire - publishing

Currently performing a Bulk publish using Mirror Queue for around 11000 assets.
Have tried the same in lower environments and was successful.
No error logged # source and destination futuretense logs. The asset update got logged at destination for long time and no updates later. Publishing process still keeps rolling in source.
No stuck threads and memory consumption looks good on both environments.
Any inputs for the same, will be much appreciated.

What is the Fatwire version? Please note that mirror to server publishing is no longer supported in Oracle Webcenter Sites 11.1.1.8. Real time publishing is more robust and less error prone.

Related

Heroku configuration for Ruby on Rails application

I’ve set a client up with Heroku for their Ruby on Rails application and have had a great deal of trouble over the years with their application not running well regardless of how much money we spend on additional resources, find their documentation highly confusing. I’ve never been able to understand their specific terminology and documentation. We are constantly getting "H12" errors and "R14" errors etc. The memory usage and dyno loads are constantly spiking. And yet this is a small to medium-sized business without a massive amount of traffic. Wondering if anybody out there who does understand the ins and outs of Heroku can look this configuration over and tell me if it makes sense:
DB_POOL: 10
MALLOC_ARENA_MAX: 2
RAILS_MAX_THREADS: 5
WEB_CONCURRENCY: 4
Ruby 2.7
Rails 6.0
Puma
8 2x web dynos
5 1x worker dynos
$50 Postgres standard 0 database
$15 Memcachier
$10 Rediscloud
...etc addons
Your WEB_CONCURRENCY is too high for your Standard-2x dynos. The recommended default is 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
This is likely contributing to your R14 errors as higher web concurrency means more memory usage. So you need to either lower your web concurrency (which may mean you also need to increase the # of dynos to compensate) or you need to use bigger dynos.
You already have MALLOC_ARENA_MAX=2 but not sure if you are using jemalloc. You might want to try that too.
Of course, you may also have other memory issues in your app - check out some tips here. I also recommend adding a monitoring tool like AppSignal as it's capable of tracking memory allocations per transaction.
For mitigating H12s:
Ensure you have installed something like the rack-timeout gem, which ensures that a long-running request is dropped at the dyno-level and thus avoids the H12 error (you get a Rack::TimeoutError exception instead). Set the timeout to 15s so that it is well under the 30s for H12 timeout.
Investigate your slow transactions. A monitoring tool is key here, i.e. New Relic (start with lowest-priced paid plan - free plan does not allow transaction tracing). Here is their blog post on how to trace transactions
When you've identified the problem - fix it!
if the bottleneck is external:
check for external API limits and throttling
add timeouts and make app resilient to slow external responses
if the bottleneck is due to the database:
optimize slow queries
check cache hit rates
check for the # of waiting connections and db locks -> if the number of waiting connections is consistently above 0 for X minutes, that indicates you have some long locks that you'll need to investigate. Waiting connections is easiest to track over time with Librato (free plan should do fine)
if the bottleneck is other app code:
add more custom instrumentation to get more insights, i.e. New Relic instructions
address app code issues
I want to stress the importance of monitoring tools to help diagnose issues and help determine optimal resource usage. Doing things like figuring out the correct concurrency configs, the correct size and # of dynos to run are virtually impossible without proper monitoring tools. Hopefully you have some already that are covered by your etc add-ons that are not listed, but if you do not, I'll summarize my recommendations and mention a couple other tips:
To get more metrics info, ensure you have enabled log-runtime-metrics
Also enable Ruby language metrics
Add a monitoring tool that can track Ruby memory allocations like AppSignal. Scout APM can do this too but I think their plans capable of this are more expensive (requires Scout Insights feature)
Add the lowest-paid version of New Relic. This is my go-to tool for transaction tracing. AppSignal can do this too if you don't want to pay for another tool, but I find it easier with New Relic.
Add Librato. It offers some great charts out of the box, including a set of Postgres charts in its own dashboard.
Set alerts in your monitoring apps to warn you about things like response times so you can look into them!
And of course, make all your changes in staging first AND load test them to see the impacts of your changes before attempting in production!
Update: I also just noticed that you said you are using Standard-0 Postgres, which means it has a 120 connection limit. So if you end up lowering your WEB_CONCURRENCY and increasing the # of dynos, watch out for your total connections to that database. Beyond just the fact that there is a limit, more connections also mean more overhead for your db anyway so if you are close to your connection limit, you are more likely to see db performance suffer. You may want to upgrade to another plan that has a higher connection limit or use pgbouncer as your connection pooler to avoid connection limits.

Icinga - check_yum - Socket Timeout?

I'm using the check_yum - Plugin in my Icinga-Monitoring-Environment to check if there are security critical updates available. This works quite fine but sometimes I get a " CHECK_NRPE: Socket timeout after xx seconds." while executing the check. Currently my NRPE-Timeout is 30 seconds.
If I re-schedule the check a few times or executing the check directly from my Icinga-Server with a higher nrpe-timeout-value everything works fine, at least after a few executions of the check. All other checks via NRPE are not throwing any errors. So I think there is no general error with my NRPE-config or the plugins I'm using. Is there some explanation for this strange behaviour of the check_yum - plugin? Maybe some caching issues on the monitored servers?
First, be sure you are using the 1.0 version of this check from: https://code.google.com/p/check-yum/downloads/detail?name=check_yum_1.0.0&can=2&q=
The changes I've seen in that version could fix this issue, depending on it's root cause.
Second, if your server(s) are not configured to use all 'local' cache repos, then this check will likely time out before the 30 second deadline. Because: 1> the amount of data from the refresh/update is pretty large and may be taking a long time to download from remote (include RH proper) servers and 2> most of the 'official' update servers tend to go off-line A LOT.
Best solution I've found is to have a cronjob to perform your update check at a set interval (I use weekly) and create a log file containing those security patches the system(s) require. Then use a Nagios check, via a simple shell script, to see if said file has any new items in it.

Rails Server Memory Leak/Bloating Issue

We are running 2 rails application on server with 4GB of ram. Both servers use rails 3.2.1 and when run in either development or production mode, the servers eat away ram at incredible speed consuming up-to 1.07GB ram each day. Keeping the server running for just 4 days triggered all memory alarms in monitoring and we had just 98MB ram free.
We tried active-record optimization related to bloating but still no effect. Please help us figure out how can we trace the issue that which of the controller is at fault.
Using mysql database and webrick server.
Thanks!
This is incredibly hard to answer, without looking into the project details itself. Though I am quite sure you won't be using Webrick in your target production build(right?), so check if it behaves the same under Passenger or whatever is your choice.
Also without knowing the details of the project I would suggest looking at features like generating pdfs, csv parsing, etc. Seen a case, where generating pdf files have been eating resources in a similar fashion, leaving like 5mb of not garbage collected memory for each run.
Good luck.

How are people solving app pool recycle issues on deployment with large apps?

Currently after a build/deployment of our app (58 projects, large asp.net MVC 3 front end) takes ~15-20secs to load as it goes through the whole 'recycling the app pool' (release configuration).
We do have a web farm if that alters people's answers, but the question really is:
What are people doing in large scale applications where a maintenance window isn't viable (we're a 24/7 very active website) to minimize that initial 'first hit' on the app pool recycle after a deploy?
We've used a number of tools to analyze that startup time and there doesn't really seem to be any way to bring it down so what I'm looking for are what techniques do people employ in order to minimize the impact of a large application deploy affecting users.
By default - if you change 15 files in an ASP.NET application at once (even via FTP) then the app pool is automatically recycled. You can change the number of files but as soon as web.config and bin files are changed then it needs to recycle. So in my opinion the ideal solution for an environment like yours would be as follows:
4 web servers (this is an arbitrary number)
each server has a status.aspx that the load balancer looks at - use TeamCity to take 2 of these servers "off line" (off the load balancer) and wait 20 seconds for the traffic to filter across. A distributed cache will help keep user experience problems
Use TeamCity to deploy to those 2 servers - run your automated tests etc. and once you are happy put those back into the farm and take the other 2 offline and deploy to those
This can all be scripted / automated. The only issue with this is any schema changes that are not backwards compatible may not allow running the new version site in parallel with old version of the site for the 20 seconds for the load balancer to kick back in
This is good old fashioned Canary Releasing - there are some patterns here http://continuousdelivery.com/patterns/ to help take into consideration. Id also suggest a copy of that continuous delivery book - its like a continuous delivery bible and has got me out of a few situations :)
At the very base you could run a tinyget script against the application after completion of deployment which will "warm up" the application however if a customer hits your site before the script can run, they will still face a delay. What do you currently have in place, what post deployment steps do you have in place?
In a farm environment you could stage deployments too, so take one server out of load balance, update it and then bring that online after deployment and take the other out, complete the deployment and then reintroduce into the farm. How is your SQL Server setup - clustered?
copy and paste from my post here
We operate a Blue/Green deployment strategy on a 4 tier architecture which has a web site over 4 servers at the top tier. Due to the complexity the architecture introduced for deployments, we needed a way to deploy without disturbing any traffic to the "live" site. Following Fowler's advice, but not quite in the same way, we came up with a solution that means we have 2 sites on each server (a blue and a green, or in our case site A and site B). The live site has the appropriate host header, and once we have deployed and tested to the non-live site, we then flip the headers of the 2 sites so that what was once live is now the non-live site, and vice-versa. The effect is, a robust deployment that can be done in business hours and with the highest level of confidence.
This of course complicates your configuration and deployment slightly, but it's worth the effort. I guess it kind of goes without saying that you want to script both the deployment, and the host header swapping.
Firstly, unless you're running Google or something bigger, does a 15-20s load time at 3am for a handful of users really impact that much? I'd say the effort invested in eliminating the occasional lag would far outweigh the 15-20s inconvenience of a couple of users.
I consider it a necessary evil of using ASP.NET unfortunately. Using a pre-compiled site (.DLLs instead of the code-behind files) will lessen the time but not necessarily eliminate it.
The best thing you can do is use something like a status notification bar to warn users they may experience some "issues" during "essential maintenance".
But even then, I'd say in terms of user experience it'd be better to keep quiet and have a handful of people blame their "slow internet" when your site takes 20s to load on one occasion, than announce to all and sundry that it will be slow.
You can also try this approach : http://weblogs.asp.net/scottgu/archive/2009/09/15/auto-start-asp-net-applications-vs-2010-and-net-4-0-series.aspx
without knowing anything about your site, my first thought is that you might be able to break it down into smaller sites so that they start faster individually.
second, with your web farm, i assume you have some sort of load balancing device in front of that from which you can pull machines out of the pool when they are being deployed. don't put them back in the pool until after you have sent a request against the site to get it started up. you should be able to script this such that you are pretty much clicking a button that takes a machine out, deploys to it, and sends a request after it's back up and happy.
You can consider using aspnet_compiler.exe to precompile your application, because I think the delay after deployment is caused by the compilation phase rather than "whole recycling the app pool".

Increase reliability of capistrano deploys

I've used capistrano for a long time, but always for sites that weren't critical. If something went wrong, a few minutes of downtime weren't a big problem.
Now I'm working on a more critical service and need to cover my edge cases. One of which is if my local connection to a server becomes interrupted in the middle of a deployment.
One solution I can think of is to do deployments directly from the server, inside of a screen session. This seems like a reasonable and obvious solution, but I'm surprised I've never read about it elsewhere or even seen it recommended in the capistrano documentation.
Any pointers / tips are welcome. Thanks!
There is only a very small time window during a typical Capistrano deploy where a dropped connection would cause trouble. This window is when the current release is linked to the new version and the server is told to restart. If your connection drops before or after that, you're fine.
If you positively need to be safe from disconnects 100%, you can log onto the server, open a screen session and do a cap deploy from the latest release folder.

Resources