This was originally an internal message and may refer to some of our projects, but the background information should be useful, so I have left those references in.
We are having an issue with Google App Engine preventing us from making new deployments.
The error message is:
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
This is a surprising error, especially as we hadn't had issues with deployments until recently. It appears our changes earlier this year to prepare for the new Google App Engine split health checks didn't actually work, so when the legacy system was deprecated on September 15th (mentioned here: https://cloud.google.com/appengine/docs/flexible/custom-runtimes/migrating-to-split-health-checks), no deployments worked from that point on. The health check specification is listed here: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#liveness_path.
The error message references the app_start_timeout_sec setting; more details about it are here: https://cloud.google.com/endpoints/docs/openapi/troubleshoot-aeflex-deployment. I didn't think it was a timeout issue, since our system boots fairly quickly (well under the 5 minute default), so I investigated the logs of a version of the app (from now on I'm talking about the codeWOF production system unless specified otherwise). The Versions page only listed the 'working' versions, but when I looked in the Logs Viewer, all the different versions were listed, including those that had failed.
With the following app.yaml the logs were showing this error:
liveness_check:
  path: "/gae/liveness_check"
readiness_check:
  path: "/gae/readiness_check"
Ready for new connections
Compiling message files
Starting gunicorn 19.9.0
Listening at: http://0.0.0.0:8080 (13)
Using worker: gevent
Booting worker with pid: 16
Booting worker with pid: 17
Booting worker with pid: 18
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
This confirmed that the system had booted successfully and that the checks were reaching the app, but they were returning the wrong status code: a 301 redirect instead of a 200. It also showed the checks were going to the wrong URL, with no /gae/ prefix.
I believed the redirect was caused by either the APPEND_SLASH setting or the HTTP to HTTPS redirect. I tried the following configuration and got the following result:
liveness_check:
  path: "/liveness_check/"
readiness_check:
  path: "/readiness_check/"
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
Same error as above, so it appears that setting the custom path does not affect where the health check is sent. Searching for the custom path in all logging messages returns exactly one message (summary below):
2019-11-06 16:24:14.288 NZDT App Engine Create Version default:20191106t032141
livenessCheck: { path: "/liveness_check/" }
readinessCheck: { path: "/readiness_check/" }
Resources: { cpu: 1 memoryGb: 3.75 }
So this is the first thing to look into: whether the custom path is being set correctly. I couldn't get it to change.
I read all the StackOverflow posts about App Engine and split health checks (there were fewer than 10) and tried all the suggested fixes. These included:
Checking the split health check was set correctly using gcloud app describe --project codewof.
Setting the split health checks (again) with gcloud app update --split-health-checks --project codewof.
The last thing I tried produced something quite interesting: I deleted all the health check settings from the app.yaml files.
The documentation (https://cloud.google.com/appengine/docs/flexible/custom-runtimes/configuring-your-app-with-app-yaml#updated_health_checks) states the following:
By default, HTTP requests from health checks are not forwarded to your application container. If you want to extend health checks to your application, then specify a path for liveness checks or readiness checks. A customized health check to your application is considered successful if it returns a 200 OK response code.
This sounded like the overall VM would be checked rather than the Docker container running inside it, and indeed the deployment worked!
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
But if the Docker container fails for some reason, Google App Engine wouldn't know there is an issue. We need to look into this scenario and see what it actually means; I couldn't find anything specifying it exactly. However, this does allow us to do urgent deployments.
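If we do later want application-level checks (so that a dead container is actually detected), the view only needs to confirm the app can serve requests. A minimal sketch, assuming a Django view and the default database connection (names are hypothetical, not our current code):

# views.py (hypothetical): a readiness check that fails if the app cannot
# reach its database, so App Engine would stop routing traffic to this instance.
from django.db import connection
from django.http import HttpResponse, HttpResponseServerError

def readiness_check(request):
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        return HttpResponseServerError("database unavailable")
    return HttpResponse("ok")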
I also tested the following to skip HTTPS redirects.
settings/production.py
SECURE_REDIRECT_EXEMPT = [
    r'^/?cron/.*',
    r'^/?liveness_check/?$',
    r'^/?readiness_check/?$',
]
liveness_check:
  path: "/liveness_check/"
readiness_check:
  path: "/readiness_check/"
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
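A possible next step to avoid both redirect causes (a minimal sketch with hypothetical names; I haven't checked it against our actual URL config): register the exact paths the checker requests, with no trailing slash, so APPEND_SLASH never needs to issue a 301, and keep those paths in SECURE_REDIRECT_EXEMPT as above.

# urls.py (hypothetical): match the exact paths GoogleHC requests.
from django.http import HttpResponse
from django.urls import path

def health_check(request):
    # A plain 200 is all the split health checks need to see.
    return HttpResponse("ok")

urlpatterns = [
    path("liveness_check", health_check),
    path("readiness_check", health_check),
    # ... existing codeWOF URL patterns ...
]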
The last confusing thing I discovered was that the codewof-dev website's behaviour conflicts with documentation I had read. I can't find the documentation again, but I'm pretty sure it said that an App Engine instance will run either the old legacy health checks or the new split health checks. But the codewof-dev website is running both!
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
Last discovery: I tested this morning by deleting all the health check configurations in the app.yaml files (as I had done previously), but this time I also deleted all the custom health check URLs from our URL routing config. The system deployed successfully with the following health checks:
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
This seems to show that the App Engine VM instance has its own check, and it isn't entering our Docker container. That would be fine for most GAE flexible instances, but not for the custom runtime option we are using.
Related
Cephadm Pacific v16.2.7
Our Ceph cluster is stuck: PGs are degraded and OSDs are down.
Reason: the OSDs got filled up.
Things we tried
Changed the values to the maximum possible combination (not sure if done right?)
backfillfull < nearfull, nearfull < full, and full < failsafe_full
ceph-objectstore-tool - tried to delete some PGs to recover space
Tried to mount an OSD and delete PGs to recover some space, but not sure how to do it in BlueStore.
Global Recovery Event - stuck forever
ceph -s
cluster:
id: a089a4b8-2691-11ec-849f-07cde9cd0b53
health: HEALTH_WARN
6 failed cephadm daemon(s)
1 hosts fail cephadm check
Reduced data availability: 362 pgs inactive, 6 pgs down, 287 pgs peering, 48 pgs stale
Degraded data redundancy: 5756984/22174447 objects degraded (25.962%), 91 pgs degraded, 84 pgs undersized
13 daemons have recently crashed
3 slow ops, oldest one blocked for 31 sec, daemons [mon.raspi4-8g-18,mon.raspi4-8g-20] have slow ops.
services:
mon: 5 daemons, quorum raspi4-8g-20,raspi4-8g-25,raspi4-8g-18,raspi4-8g-10,raspi4-4g-23 (age 2s)
mgr: raspi4-8g-18.slyftn(active, since 3h), standbys: raspi4-8g-12.xuuxmp, raspi4-8g-10.udbcyy
osd: 19 osds: 15 up (since 2h), 15 in (since 2h); 6 remapped pgs
data:
pools: 40 pools, 636 pgs
objects: 4.28M objects, 4.9 TiB
usage: 6.1 TiB used, 45 TiB / 51 TiB avail
pgs: 56.918% pgs not active
5756984/22174447 objects degraded (25.962%)
2914/22174447 objects misplaced (0.013%)
253 peering
218 active+clean
57 undersized+degraded+peered
25 stale+peering
20 stale+active+clean
19 active+recovery_wait+undersized+degraded+remapped
10 active+recovery_wait+degraded
7 remapped+peering
7 activating
6 down
2 active+undersized+remapped
2 stale+remapped+peering
2 undersized+remapped+peered
2 activating+degraded
1 active+remapped+backfill_wait
1 active+recovering+undersized+degraded+remapped
1 undersized+peered
1 active+clean+scrubbing+deep
1 active+undersized+degraded+remapped+backfill_wait
1 stale+active+recovery_wait+undersized+degraded+remapped
progress:
Global Recovery Event (2h)
[==========..................] (remaining: 4h)
Some versions of BlueStore were susceptible to the BlueFS log growing extremely large, to the point of making booting the OSD impossible. This state is indicated by booting that takes very long and fails in the _replay function.
This can be fixed by:
ceph-bluestore-tool fsck --path <osd path> --bluefs_replay_recovery=true
It is advised to first check whether the rescue process would be successful:
ceph-bluestore-tool fsck --path <osd path> --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true
If the above fsck is successful, the fix procedure can be applied.
Special thank you: this was solved with the help of a dewDrive Cloud backup faculty member.
I've written a simulator, which is distributed over two hosts. When I launch a few thousand processes, after about 10 minutes and half a million events written, my main Erlang (OTP v22) virtual machine crashes with this message:
no next heap size found: 18446744071789822643, offset 0.
It's always that same number - 18446744071789822643.
Because my server is very capable, the crash dump is also huge and I can't view it on my headless server (no WX installed).
Are there any tips on what I can look at?
What would be the first things I can try out to debug this issue?
First, see what memory() says:
> memory().
[{total,18480016},
{processes,4615512},
{processes_used,4614480},
{system,13864504},
{atom,331273},
{atom_used,306525},
{binary,47632},
{code,5625561},
{ets,438056}]
Check which one is growing - processes, binary, ets?
If it's processes, try typing i(). in the Erlang shell while the processes are running. You'll see something like:
Pid Initial Call Heap Reds Msgs
Registered Current Function Stack
<0.0.0> otp_ring0:start/2 233 1263 0
init init:loop/1 2
<0.1.0> erts_code_purger:start/0 233 44 0
erts_code_purger erts_code_purger:wait_for_request 0
<0.2.0> erts_literal_area_collector:start 233 9 0
erts_literal_area_collector:msg_l 5
<0.3.0> erts_dirty_process_signal_handler 233 128 0
erts_dirty_process_signal_handler 2
<0.4.0> erts_dirty_process_signal_handler 233 9 0
erts_dirty_process_signal_handler 2
<0.5.0> erts_dirty_process_signal_handler 233 9 0
erts_dirty_process_signal_handler 2
<0.8.0> erlang:apply/2 6772 238183 0
erl_prim_loader erl_prim_loader:loop/3 5
Look for a process with a very big heap, and that's where you'd start looking for a memory leak.
(If you weren't running headless, I'd suggest starting Observer with observer:start(), and look at what's happening in the Erlang node.)
We have a Java application running on Mule. We have the -Xmx value configured at 6144M, but we routinely see the overall memory usage climb and climb. It was getting close to 20 GB the other day before we proactively restarted it.
Thu Jun 30 03:05:57 CDT 2016
top - 03:05:58 up 149 days, 6:19, 0 users, load average: 0.04, 0.04, 0.00
Tasks: 164 total, 1 running, 163 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.2%us, 1.7%sy, 0.0%ni, 93.9%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24600552k total, 21654876k used, 2945676k free, 440828k buffers
Swap: 2097144k total, 84256k used, 2012888k free, 1047316k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3840 myuser 20 0 23.9g 18g 53m S 0.0 79.9 375:30.02 java
The jps command shows:
10671 Jps
3840 MuleContainerBootstrap
The jstat command shows:
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
37376.0 36864.0 16160.0 0.0 2022912.0 1941418.4 4194304.0 445432.2 78336.0 66776.7 232 7.044 17 17.403 24.447
The startup arguments are (sensitive bits have been changed):
3840 MuleContainerBootstrap -Dmule.home=/mule -Dmule.base=/mule -Djava.net.preferIPv4Stack=TRUE -XX:MaxPermSize=256m -Djava.endorsed.dirs=/mule/lib/endorsed -XX:+HeapDumpOnOutOfMemoryError -Dmyapp.lib.path=/datalake/app/ext_lib/ -DTARGET_ENV=prod -Djava.library.path=/opt/mapr/lib -DksPass=mypass -DsecretKey=aeskey -DencryptMode=AES -Dkeystore=/mule/myStore -DkeystoreInstance=JCEKS -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dmule.mmc.bind.port=1521 -Xms6144m -Xmx6144m -Djava.library.path=%LD_LIBRARY_PATH%:/mule/lib/boot -Dwrapper.key=a_guid -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.disable_console_input=TRUE -Dwrapper.pid=10744 -Dwrapper.version=3.5.19-st -Dwrapper.native_library=wrapper -Dwrapper.arch=x86 -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 -Dwrapper.lang.domain=wrapper -Dwrapper.lang.folder=../lang
Adding up the "capacity" items from jstat shows that only my 6144m is being used for the Java heap. Where the heck is the rest of the memory being used? Stack memory? Native heap? I'm not even sure how to proceed.
If left to continue growing, it will consume all memory on the system and we will eventually see the system freeze up throwing swap space errors.
I have another process that is starting to grow. Currently at about 11g resident memory.
pmap 10746 > pmap_10746.txt
cat pmap_10746.txt | grep anon | cut -c18-25 | sort -h | uniq -c | sort -rn | less
Top 10 entries by count:
119 12K
112 1016K
56 4K
38 131072K
20 65532K
15 131068K
14 65536K
10 132K
8 65404K
7 128K
Top 10 entries by allocation size:
1 6291456K
1 205816K
1 155648K
38 131072K
15 131068K
1 108772K
1 71680K
14 65536K
20 65532K
1 65512K
And top 10 by total size:
Count Size Aggregate
1 6291456K 6291456K
38 131072K 4980736K
15 131068K 1966020K
20 65532K 1310640K
14 65536K 917504K
8 65404K 523232K
1 205816K 205816K
1 155648K 155648K
112 1016K 113792K
This seems to be telling me that because the Xmx and Xms are set to the same value, there is a single allocation of 6291456K for the java heap. Other allocations are NOT java heap memory. What are they? They are getting allocated in rather large chunks.
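(For reference, a small Python sketch, not part of the original workflow, that derives the same count/size/aggregate summary from the captured pmap output; the file name and column layout are assumptions based on the commands above.)

# summarize_pmap.py (sketch): group anonymous mappings by size and print
# count, size and aggregate, largest aggregate first.
import re
import sys
from collections import Counter

sizes = Counter()
with open(sys.argv[1]) as f:  # e.g. pmap_10746.txt
    for line in f:
        if "anon" not in line:
            continue
        match = re.search(r"\b(\d+)K\b", line)
        if match:
            sizes[int(match.group(1))] += 1

print("%7s %10s %12s" % ("Count", "Size", "Aggregate"))
for size, count in sorted(sizes.items(), key=lambda kv: kv[0] * kv[1], reverse=True)[:10]:
    print("%7d %9dK %11dK" % (count, size, size * count))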
Expanding a bit more on Peter's answer.
You can take a binary heap dump from within VisualVM (right click on the process in the left-hand side list, and then on heap dump - it'll appear right below shortly after). If you can't attach VisualVM to your JVM, you can also generate the dump with this:
jmap -dump:format=b,file=heap.hprof $PID
Then copy the file and open it with Visual VM (File, Load, select type heap dump, find the file.)
As Peter notes, a likely cause for the leak may be non-collected DirectByteBuffers (e.g. some instance of another class is not properly de-referencing buffers, so they are never GC'd).
To identify where these references are coming from, you can use VisualVM to examine the heap and find all instances of DirectByteBuffer in the "Classes" tab. Find the DBB class, right click, and go to the instances view.
This will give you a list of instances. You can click on one and see who's keeping a reference to each one:
Note the bottom pane: we have a "referent" of type Cleaner and two "mybuffer" references. These would be properties in other classes that are referencing the instance of DirectByteBuffer we drilled into (it should be fine to ignore the Cleaner and focus on the others).
From this point on you need to proceed based on your application.
Another equivalent way to get the list of DBB instances is from the OQL tab. This query:
select x from java.nio.DirectByteBuffer x
Gives us the same list as before. The benefit of using OQL is that you can execute more complex queries. For example, this gets all the instances that are keeping a reference to a DirectByteBuffer:
select referrers(x) from java.nio.DirectByteBuffer x
What you can do is take a heap dump and look for objects which are storing data off heap, such as ByteBuffers. Those objects will appear small but are a proxy for larger off-heap memory areas. See if you can determine why lots of those might be retained.
I'm having difficulty getting ipcluster to start all of the ipengines that I ask for. It appears to be some sort of timeout issue. I'm using IPython 2.0 on a linux cluster with 192 processors. I run a local ipcontroller, and start ipengines on my 12 nodes using SSH. It's not a configuration problem (at least I don't think it is) because I'm having no problems running about 110 ipengines. When I try for a larger amount, some of them seem to die during start up, and I do mean some of them - the final number I have varies a little. ipcluster reports that all engines have started. The only sign of trouble that I can find (other than not having use of all of the requested engines) is the following in some of the ipengine logs:
2014-06-20 16:42:13.302 [IPEngineApp] Loading url_file u'.ipython/profile_ssh/security/ipcontroller-engine.json'
2014-06-20 16:42:13.335 [IPEngineApp] Registering with controller at tcp://10.1.0.253:55576
2014-06-20 16:42:13.429 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2014-06-20 16:42:13.434 [IPEngineApp] Using existing profile dir: u'.ipython/profile_ssh'
2014-06-20 16:42:13.436 [IPEngineApp] Completed registration with id 49
2014-06-20 16:42:25.472 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 18:09:12.782 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 19:14:22.760 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 20:00:34.969 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
I did some googling to see if I could find some wisdom, and the only thing I've come across is http://permalink.gmane.org/gmane.comp.python.ipython.devel/12228. The author seems to think it's a timeout of sorts.
I also tried tripling (90 seconds as opposed to the default 30) the IPClusterStart.early_shutdown and IPClusterEngines.early_shutdown times without any luck.
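(For reference, those two settings live in the profile's ipcluster_config.py; this is roughly the change, assuming the profile_ssh profile from the log paths above.)

# ~/.ipython/profile_ssh/ipcluster_config.py
c = get_config()

# Tripled from the default of 30 seconds, as described above.
c.IPClusterStart.early_shutdown = 90
c.IPClusterEngines.early_shutdown = 90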
Thanks - in advance - for any pointers on getting the full use of my cluster.
When I try to execute ipcluster start --n=200 I get: OSError: [Errno 24] Too many open files
This could be what is happening to you too. Try raising the open file limit of the OS.
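If you want to confirm or bump the limit from Python before launching ipcluster, the standard library exposes it; a minimal sketch (raising the hard limit itself still needs root, e.g. via limits.conf):

# check_nofile.py (sketch): inspect and raise the per-process open file limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file limit: soft=%d hard=%d" % (soft, hard))

# The soft limit can be raised up to the current hard limit without root.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))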
UPDATE
This was not supposed to be a benchmark, or a Node vs Ruby thing (I should have made that clearer in the question, sorry). The point was to compare and demonstrate the difference between blocking and non-blocking IO, and how easy it is to write non-blocking code. I could have compared using EventMachine, for example, but Node has this built in, so it was the obvious choice.
I'm trying to demonstrate to some friends the advantage of Node.js (and its frameworks) over other technologies, in a way that is very simple to understand, mainly the non-blocking IO thing.
So I tried creating a (very small) Express.js app and a Rails one that would make an HTTP request to Google and count the length of the resulting HTML.
As expected (on my computer), Express.js was 10 times faster than Rails through ab (see below). My question is whether that is a "valid" way to demonstrate the main advantage that Node.js provides over other technologies. (Or is there some kind of caching going on in Express.js/Connect?)
Here is the code I used.
Expressjs
exports.index = function(req, res) {
  var http = require('http')
  var options = { host: 'www.google.com', port: 80, method: 'GET' }
  var html = ''

  var googleReq = http.request(options, function(googleRes) {
    googleRes.on('data', function(chunk) {
      html += chunk
    })

    googleRes.on('end', function() {
      res.render('index', { title: 'Express', html: html })
    })
  });

  googleReq.end();
};
Rails
require 'net/http'

class WelcomeController < ApplicationController
  def index
    @html = Net::HTTP.get(URI("http://www.google.com"))
    render layout: false
  end
end
These are the ab benchmark results:
Expressjs
Server Software:
Server Hostname: localhost
Server Port: 3000
Document Path: /
Document Length: 244 bytes
Concurrency Level: 20
Time taken for tests: 1.718 seconds
Complete requests: 50
Failed requests: 0
Write errors: 0
Total transferred: 25992 bytes
HTML transferred: 12200 bytes
Requests per second: 29.10 [#/sec] (mean)
Time per request: 687.315 [ms] (mean)
Time per request: 34.366 [ms] (mean, across all concurrent requests)
Transfer rate: 14.77 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 0
Processing: 319 581 110.6 598 799
Waiting: 319 581 110.6 598 799
Total: 319 581 110.6 598 799
Percentage of the requests served within a certain time (ms)
50% 598
66% 608
75% 622
80% 625
90% 762
95% 778
98% 799
99% 799
100% 799 (longest request)
Rails
Server Software: WEBrick/1.3.1
Server Hostname: localhost
Server Port: 3001
Document Path: /
Document Length: 65 bytes
Concurrency Level: 20
Time taken for tests: 17.615 seconds
Complete requests: 50
Failed requests: 0
Write errors: 0
Total transferred: 21850 bytes
HTML transferred: 3250 bytes
Requests per second: 2.84 [#/sec] (mean)
Time per request: 7046.166 [ms] (mean)
Time per request: 352.308 [ms] (mean, across all concurrent requests)
Transfer rate: 1.21 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 180 387.8 0 999
Processing: 344 5161 2055.9 6380 7983
Waiting: 344 5160 2056.0 6380 7982
Total: 345 5341 2069.2 6386 7983
Percentage of the requests served within a certain time (ms)
50% 6386
66% 6399
75% 6402
80% 6408
90% 7710
95% 7766
98% 7983
99% 7983
100% 7983 (longest request)
To complement Sean's answer:
Benchmarks are useless. They show what you want to see. They don't show the real picture. If all your app does is proxy requests to google, then an evented server is a good choice indeed (node.js or EventMachine-based server). But often you want to do something more than that. And this is where Rails is better. Gems for every possible need, familiar sequential code (as opposed to callback spaghetti), rich tooling, I can go on.
When choosing one technology over another, assess all aspects, not just how fast it can proxy requests (unless, again, you're building a proxy server).
You're using WEBrick to do the test. Right off the bat the results are invalid, because WEBrick can only process one request at a time. You should use something like Thin, which is built on top of EventMachine and can process multiple requests at a time. Your time per request across all concurrent requests, transfer rate and connection times will improve dramatically with that change.
You should also keep in mind that request time is going to be different between each run because of network latency to Google. You should look at the numbers several times to get an average that you can compare.
In the end, you're probably not going to see a huge difference between Node and Rails in the benchmarks.