ipcluster - can't start more than about 110 ipengines - or maybe some of them die - ipython-parallel

I'm having difficulty getting ipcluster to start all of the ipengines that I ask for. It appears to be some sort of timeout issue. I'm using IPython 2.0 on a Linux cluster with 192 processors. I run a local ipcontroller and start ipengines on my 12 nodes using SSH. It's not a configuration problem (at least I don't think it is), because I have no problems running about 110 ipengines. When I try for a larger number, some of them seem to die during startup - and I do mean some of them; the final count varies a little from run to run. ipcluster reports that all engines have started. The only sign of trouble that I can find (other than not having use of all of the requested engines) is the following in some of the ipengine logs:
2014-06-20 16:42:13.302 [IPEngineApp] Loading url_file u'.ipython/profile_ssh/security/ipcontroller-engine.json'
2014-06-20 16:42:13.335 [IPEngineApp] Registering with controller at tcp://10.1.0.253:55576
2014-06-20 16:42:13.429 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2014-06-20 16:42:13.434 [IPEngineApp] Using existing profile dir: u'.ipython/profile_ssh'
2014-06-20 16:42:13.436 [IPEngineApp] Completed registration with id 49
2014-06-20 16:42:25.472 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 18:09:12.782 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 19:14:22.760 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 20:00:34.969 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
I did some googling to see if I could find some wisdom, and the only thing I've come across is http://permalink.gmane.org/gmane.comp.python.ipython.devel/12228. The author seems to think it's a timeout of sorts.
I also tried tripling (90 seconds as opposed to the default 30) the IPClusterStart.early_shutdown and IPClusterEngines.early_shutdown times without any luck.
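For reference, roughly how I bumped those timeouts, in ipcluster_config.py under the profile directory (profile_ssh in my case):
c = get_config()
# tripled from the default of 30 seconds
c.IPClusterStart.early_shutdown = 90
c.IPClusterEngines.early_shutdown = 90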
Thanks - in advance - for any pointers on getting the full use of my cluster.

When I try to execute ipcluster start --n=200 I get: OSError: [Errno 24] Too many open files
This could be what is happening to you too. Try raising the OS's open file limit.
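For example, on Linux you can check and raise the limit in the shell that launches the controller and engines (a sketch; pick values that suit your engine count, and note that a permanent change normally goes through /etc/security/limits.conf):
ulimit -n          # show the current soft limit (often 1024)
ulimit -n 4096     # raise it for this shell before running ipcluster
For a permanent change, add something like the following to /etc/security/limits.conf (replace youruser with the account that runs the controller and engines):
youruser  soft  nofile  4096
youruser  hard  nofile  8192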

Related

Ceph cluster down, Reason OSD Full - not starting up

Cephadm Pacific v16.2.7
Our Ceph cluster is stuck with PGs degraded and OSDs down.
Reason: OSDs got filled up.
Things we tried
Changed the ratio values to the maximum possible combination (not sure if done right?); see the sketch after this list:
backfillfull < nearfull, nearfull < full, and full < failsafe_full
ceph-objectstore-tool - tried to delete some PGs to recover space
Tried to mount an OSD and delete PGs to recover some space, but not sure how to do it in BlueStore.
Global Recovery Event - stuck forever
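The sketch referenced above - roughly how those ratios are adjusted (the values shown are Ceph's defaults of 0.85 / 0.90 / 0.95, not necessarily what we applied):
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95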
ceph -s
cluster:
id: a089a4b8-2691-11ec-849f-07cde9cd0b53
health: HEALTH_WARN
6 failed cephadm daemon(s)
1 hosts fail cephadm check
Reduced data availability: 362 pgs inactive, 6 pgs down, 287 pgs peering, 48 pgs stale
Degraded data redundancy: 5756984/22174447 objects degraded (25.962%), 91 pgs degraded, 84 pgs undersized
13 daemons have recently crashed
3 slow ops, oldest one blocked for 31 sec, daemons [mon.raspi4-8g-18,mon.raspi4-8g-20] have slow ops.
services:
mon: 5 daemons, quorum raspi4-8g-20,raspi4-8g-25,raspi4-8g-18,raspi4-8g-10,raspi4-4g-23 (age 2s)
mgr: raspi4-8g-18.slyftn(active, since 3h), standbys: raspi4-8g-12.xuuxmp, raspi4-8g-10.udbcyy
osd: 19 osds: 15 up (since 2h), 15 in (since 2h); 6 remapped pgs
data:
pools: 40 pools, 636 pgs
objects: 4.28M objects, 4.9 TiB
usage: 6.1 TiB used, 45 TiB / 51 TiB avail
pgs: 56.918% pgs not active
5756984/22174447 objects degraded (25.962%)
2914/22174447 objects misplaced (0.013%)
253 peering
218 active+clean
57 undersized+degraded+peered
25 stale+peering
20 stale+active+clean
19 active+recovery_wait+undersized+degraded+remapped
10 active+recovery_wait+degraded
7 remapped+peering
7 activating
6 down
2 active+undersized+remapped
2 stale+remapped+peering
2 undersized+remapped+peered
2 activating+degraded
1 active+remapped+backfill_wait
1 active+recovering+undersized+degraded+remapped
1 undersized+peered
1 active+clean+scrubbing+deep
1 active+undersized+degraded+remapped+backfill_wait
1 stale+active+recovery_wait+undersized+degraded+remapped
progress:
Global Recovery Event (2h)
[==========..................] (remaining: 4h)
Some versions of BlueStore were susceptible to the BlueFS log growing extremely large - to the point of making it impossible to boot the OSD. This state is indicated by booting that takes very long and fails in the _replay function.
This can be fixed by:
ceph-bluestore-tool fsck --path <osd path> --bluefs_replay_recovery=true
It is advised to first check if the rescue process would be successful:
ceph-bluestore-tool fsck --path <osd path> --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true
If the above fsck is successful, the fix procedure can be applied.
Special thank you: this was solved with the help of a dewDrive Cloud backup faculty member.

Split health checks cannot set custom path

This was originally an internal message and may refer to some of our projects, but the background information will be useful, so I have left references to these in.
We are having an issue with Google App Engine preventing us from making new deployments.
The error message is:
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
This is a surprising error, especially as we haven't had issues with this until recently. It appears our changes earlier this year to prepare for the new Google App Engine split health checks didn't actually work, so when the legacy system was deprecated on September 15th (mentioned here: https://cloud.google.com/appengine/docs/flexible/custom-runtimes/migrating-to-split-health-checks), no deployments worked from that point on. The health checks specification is listed here: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#liveness_path.
The error message references the app_start_timeout_sec setting; more details about this are found here: https://cloud.google.com/endpoints/docs/openapi/troubleshoot-aeflex-deployment. I didn't think it was a timeout issue, since our system boots fairly quickly (less than the 5 minutes it defaults to), so I investigated the logs of a version of the app (from now on I'm talking about the codeWOF production system unless specified). The versions page only listed the 'working' versions, but when I looked in the Logs Viewer, all the different versions were listed, including those that had failed.
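For reference, that setting lives under readiness_check in app.yaml and can be raised like this (a sketch only; 300 seconds is the documented default, and I did not end up changing it):
readiness_check:
  path: "/readiness_check"
  app_start_timeout_sec: 600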
With the following app.yaml the logs were showing this error:
liveness_check:
  path: "/gae/liveness_check"
readiness_check:
  path: "/gae/readiness_check"
Ready for new connections
Compiling message files
Starting gunicorn 19.9.0
Listening at: http://0.0.0.0:8080 (13)
Using worker: gevent
Booting worker with pid: 16
Booting worker with pid: 17
Booting worker with pid: 18
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
This confirmed that the system had booted successfully and that the checks were getting through, but they were returning the wrong code: a 301 redirect instead of a 200. It also showed that the checks were going to the wrong URL; no prefix was shown.
I believed the redirect was caused by either the APPEND_SLASH setting or the HTTP to HTTPS redirect. I tried the following configuration and got this:
liveness_check:
  path: "/liveness_check/"
readiness_check:
  path: "/readiness_check/"
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
Same error as above, so it appears that setting the custom path does not affect where the health check is sent. Searching for the custom path in all logging messages returns exactly one message (summary below):
2019-11-06 16:24:14.288 NZDT App Engine Create Version default:20191106t032141
livenessCheck: { path: "/liveness_check/" }
readinessCheck: { path: "/readiness_check/" }
Resources: { cpu: 1 memoryGb: 3.75 }
So the first thing to look into is setting the custom path correctly; I couldn't get this to change.
I read all the StackOverflow posts about App Engine and split health checks (there were fewer than 10) and tried all the suggested fixes. These included:
Checking the split health check was set correctly using gcloud app describe --project codewof.
Setting the split health checks (again) with gcloud app update --split-health-checks --project codewof.
The last thing I had tried resulted in something quite interesting. I deleted all health check settings in the app.yaml files.
The documentation (https://cloud.google.com/appengine/docs/flexible/custom-runtimes/configuring-your-app-with-app-yaml#updated_health_checks) states the following:
By default, HTTP requests from health checks are not forwarded to your application container. If you want to extend health checks to your application, then specify a path for liveness checks or readiness checks. A customized health check to your application is considered successful if it returns a 200 OK response code.
This sounded like the overall VM was being checked rather than the Docker image running inside it, and the deployment worked!
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
But if the Docker container fails for some reason, Google App Engine won't know there is an issue. We need to look into this scenario and see what it actually means; I couldn't find anything specifying it exactly. However, this allows us to do urgent deployments.
I also tested the following to skip HTTPS redirects.
settings/production.py
SECURE_REDIRECT_EXEMPT = [
    r'^/?cron/.*',
    r'^/?liveness_check/?$',
    r'^/?readiness_check/?$',
]
liveness_check:
  path: "/liveness_check/"
readiness_check:
  path: "/readiness_check/"
GET 301 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 301 0 B 3 ms GoogleHC/1.0 /liveness_check
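One option still on the list (a sketch only; health_check is a hypothetical view name, and this assumes Django 2+'s re_path) is to answer the exact paths GoogleHC requests, with or without a trailing slash, so that APPEND_SLASH never needs to issue a 301 (the HTTPS redirect should already be covered by the SECURE_REDIRECT_EXEMPT entries above):
from django.http import HttpResponse
from django.urls import re_path

def health_check(request):
    # A plain 200 with an empty body is all GoogleHC needs to see.
    return HttpResponse(status=200)

urlpatterns = [
    # Match with and without the trailing slash so no redirect is triggered.
    re_path(r'^liveness_check/?$', health_check),
    re_path(r'^readiness_check/?$', health_check),
]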
The last confusing thing I discovered had to do with the codewof-dev website's behaviour conflicting with documentation I had read. I can't find the documentation again, but I'm pretty sure it said that an App Engine instance will run either the old legacy health checks or the new split health checks. But the codewof-dev website is running both!
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 2 B 2 ms GoogleHC/1.0 /_ah/health
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
Last discovery: I tested this morning by deleting all the health check configurations in the app.yaml files (as I had done previously), but also deleting all the custom health check URLs in our URL routing configuration. The system deployed successfully with the following health checks:
GET 200 0 B 2 ms GoogleHC/1.0 /readiness_check
GET 200 0 B 3 ms GoogleHC/1.0 /liveness_check
This seems to show that the App Engine VM instance has its own check and that it's not entering our Docker container. This would be fine for most GAE flexible instances, but not for the custom runtime option we are using.

"no next heap size found: 18446744071789822643, offset 0"

I've written a simulator, which is distributed over two hosts. When I launch a few thousand processes, after about 10 minutes and half a million events written, my main Erlang (OTP v22) virtual machine crashes with this message:
no next heap size found: 18446744071789822643, offset 0.
It's always that same number - 18446744071789822643.
Because my server is very capable, the crash dump is also huge and I can't view it on my headless server (no WX installed).
Are there any tips on what I can look at?
What would be the first things I can try out to debug this issue?
First, see what memory() says:
> memory().
[{total,18480016},
{processes,4615512},
{processes_used,4614480},
{system,13864504},
{atom,331273},
{atom_used,306525},
{binary,47632},
{code,5625561},
{ets,438056}]
Check which one is growing - processes, binary, ets?
If it's processes, try typing i(). in the Erlang shell while the processes are running. You'll see something like:
Pid Initial Call Heap Reds Msgs
Registered Current Function Stack
<0.0.0> otp_ring0:start/2 233 1263 0
init init:loop/1 2
<0.1.0> erts_code_purger:start/0 233 44 0
erts_code_purger erts_code_purger:wait_for_request 0
<0.2.0> erts_literal_area_collector:start 233 9 0
erts_literal_area_collector:msg_l 5
<0.3.0> erts_dirty_process_signal_handler 233 128 0
erts_dirty_process_signal_handler 2
<0.4.0> erts_dirty_process_signal_handler 233 9 0
erts_dirty_process_signal_handler 2
<0.5.0> erts_dirty_process_signal_handler 233 9 0
erts_dirty_process_signal_handler 2
<0.8.0> erlang:apply/2 6772 238183 0
erl_prim_loader erl_prim_loader:loop/3 5
Look for a process with a very big heap, and that's where you'd start looking for a memory leak.
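If the i(). listing is too long to scan by hand, something along these lines (a sketch using only standard erlang and lists functions) prints the five processes with the largest heaps:
%% Top 5 processes by total heap size (in words); processes that have already
%% exited return 'undefined' from process_info/2 and are skipped by the generator.
lists:sublist(
  lists:reverse(lists:keysort(2,
    [{P, S} || P <- erlang:processes(),
               {total_heap_size, S} <- [erlang:process_info(P, total_heap_size)]])),
  5).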
(If you weren't running headless, I'd suggest starting Observer with observer:start() and looking at what's happening in the Erlang node.)

Sphinx returns old data after indexer --rotate

I have Sphinx version 2.0.4 fully working.
Whenever I want to reindex data, I use indexer:
/usr/bin/indexer --config /etc/sphinxsearch/sphinx.conf XXX --rotate
It gives output:
root#dsphinx:~# /usr/bin/indexer --config /etc/sphinxsearch/sphinx.conf XXX --rotate
using config file '/etc/sphinxsearch/sphinx.conf'...
indexing index 'XXX'...
collected 9536 docs, 55.8 MB
sorted 4.7 Mhits, 100.0% done
WARNING: 2 duplicate document id pairs found
total 9536 docs, 55758410 bytes
total 3.930 sec, 14187197 bytes/sec, 2426.34 docs/sec
total 4 reads, 0.005 sec, 2926.5 kb/call avg, 1.3 msec/call avg
total 262 writes, 0.062 sec, 311.5 kb/call avg, 0.2 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=14068).
The problem is that process 14068 serves the old indexed data.
If I reload the service (/etc/init.d/sphinxsearch reload), this process ID changes and Sphinx returns the new indexed data.
Is this a bug, or am I not doing something right?
How are you running queries?
Are you using any sort of persistent connection manager in your client? If so, it might be holding connections open, which doesn't give searchd a chance to actually restart.
(i.e. the restart will be delayed until all connections are closed)
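One way to confirm whether the rotation actually happened (a sketch; the paths are the Debian/Ubuntu package defaults, and your index location comes from sphinx.conf) is to check the searchd log right after indexer --rotate and to look for leftover .new.* index files, which indicate a rotation that never completed:
tail -n 20 /var/log/sphinxsearch/searchd.log
ls -l /var/lib/sphinxsearch/data/XXX*.new.*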

reducing jitter of serial ntp refclock

I am currently trying to connect my DIY DCF77 clock to ntpd (using Ubuntu). I followed the instructions here: http://wiki.ubuntuusers.de/Systemzeit.
With ntpq I can see the DCF77 clock:
~$ ntpq -c peers
remote refid st t when poll reach delay offset jitter
==============================================================================
+dispatch.mxjs.d 192.53.103.104 2 u 6 64 377 13.380 12.608 4.663
+main.macht.org 192.53.103.108 2 u 12 64 377 33.167 5.008 4.769
+alvo.fungus.at 91.195.238.4 3 u 15 64 377 16.949 7.454 28.075
-ns1.blazing.de 213.172.96.14 2 u - 64 377 10.072 14.170 2.335
*GENERIC(0) .DCFa. 0 l 31 64 377 0.000 5.362 4.621
LOCAL(0) .LOCL. 12 l 927 64 0 0.000 0.000 0.000
So far this looks OK. However I have two questions.
What exactly is the sign of the offset? Is .DCFa. ahead of the system clock or behind the system clock?
.DCFa. points to refclock-0 which is a DIY DCF77 clock emulating a Meinberg clock. It is connected to my Ubuntu Linux box with an FTDI usb-serial adapter running at 9600 7e2. I verified with a DSO that it emits the time with jitter significantly below 1ms. So I assume the jitter is introduced by either the FTDI adapter or the kernel. How would I find out and how can I reduce it?
Part One:
Positive offsets indicate time in the client is behind time on the server.
Negative offsets indicate that time in the client is ahead of time on the server.
I always remember this as "what needs to happen to my clock?"
+0.123 = Add 0.123 to me
-0.123 = Subtract 0.123 from me
Part Two:
Yes, USB serial converters add jitter. Get a real serial port. :) You can also use setserial and tell it that the serial port needs to be low_latency. Just apt-get setserial.
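A sketch of the setserial route (assuming the FTDI adapter shows up as /dev/ttyUSB0; adjust the device name to match yours):
sudo apt-get install setserial
sudo setserial /dev/ttyUSB0 low_latency
setserial -a /dev/ttyUSB0    # the Flags line should now include low_latency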
Bonus Points:
Lose the unreferenced local clock entry. NO LOCL!!!!
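Concretely, if the local clock was set up the usual way, that means deleting (or commenting out) lines like these from /etc/ntp.conf and restarting ntpd (a sketch; your file may differ slightly):
# server 127.127.1.0             # undisciplined local clock (LOCL)
# fudge  127.127.1.0 stratum 12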
