Ceph cluster down, Reason OSD Full - not starting up - storage

Cephadm Pacific v16.2.7
Our Ceph cluster is stuck with PGs degraded and OSDs down.
Reason: the OSDs got filled up.
Things we tried (a rough sketch of the commands follows this list):
Changed the ratio values to the maximum possible combination (not sure if done right?)
backfillfull < nearfull, nearfull < full, and full < failsafe_full
ceph-objectstore-tool - tried to delete some PGs to recover space
Tried to mount the OSD and delete PGs to recover some space, but not sure how to do it with BlueStore.
Global Recovery Event - stuck forever
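For context, a rough sketch of how those two attempts are typically done (the ratio values, OSD id, and PG id below are illustrative placeholders, not taken from our cluster):
# Raise the thresholds temporarily so the full OSDs can boot and recovery can proceed,
# then set them back to the defaults (0.85 / 0.90 / 0.95) once space has been freed.
ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.97
ceph osd dump | grep ratio
# Freeing space from a stopped OSD with ceph-objectstore-tool (it works directly on BlueStore,
# no mounting involved). Under cephadm, run the tool inside: cephadm shell --name osd.2
# Only remove a PG copy that is healthy elsewhere, and keep the export as a backup.
ceph orch daemon stop osd.2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 17.2a --op export --file /mnt/spare/17.2a.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 17.2a --op remove --force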
ceph -s
cluster:
id: a089a4b8-2691-11ec-849f-07cde9cd0b53
health: HEALTH_WARN
6 failed cephadm daemon(s)
1 hosts fail cephadm check
Reduced data availability: 362 pgs inactive, 6 pgs down, 287 pgs peering, 48 pgs stale
Degraded data redundancy: 5756984/22174447 objects degraded (25.962%), 91 pgs degraded, 84 pgs undersized
13 daemons have recently crashed
3 slow ops, oldest one blocked for 31 sec, daemons [mon.raspi4-8g-18,mon.raspi4-8g-20] have slow ops.
services:
mon: 5 daemons, quorum raspi4-8g-20,raspi4-8g-25,raspi4-8g-18,raspi4-8g-10,raspi4-4g-23 (age 2s)
mgr: raspi4-8g-18.slyftn(active, since 3h), standbys: raspi4-8g-12.xuuxmp, raspi4-8g-10.udbcyy
osd: 19 osds: 15 up (since 2h), 15 in (since 2h); 6 remapped pgs
data:
pools: 40 pools, 636 pgs
objects: 4.28M objects, 4.9 TiB
usage: 6.1 TiB used, 45 TiB / 51 TiB avail
pgs: 56.918% pgs not active
5756984/22174447 objects degraded (25.962%)
2914/22174447 objects misplaced (0.013%)
253 peering
218 active+clean
57 undersized+degraded+peered
25 stale+peering
20 stale+active+clean
19 active+recovery_wait+undersized+degraded+remapped
10 active+recovery_wait+degraded
7 remapped+peering
7 activating
6 down
2 active+undersized+remapped
2 stale+remapped+peering
2 undersized+remapped+peered
2 activating+degraded
1 active+remapped+backfill_wait
1 active+recovering+undersized+degraded+remapped
1 undersized+peered
1 active+clean+scrubbing+deep
1 active+undersized+degraded+remapped+backfill_wait
1 stale+active+recovery_wait+undersized+degraded+remapped
progress:
Global Recovery Event (2h)
[==========..................] (remaining: 4h)

Some versions of BlueStore were susceptible to the BlueFS log growing extremely large - beyond the point of making booting the OSD impossible. This state is indicated by booting that takes very long and fails in the _replay function.
This can be fixed by:
ceph-bluestore-tool fsck --path <osd path> --bluefs_replay_recovery=true
It is advised to first check if the rescue process would be successful:
ceph-bluestore-tool fsck --path <osd path> --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true
If the above fsck is successful, the fix procedure can be applied.
A special thank you: this was solved with the help of a dewDrive Cloud backup faculty member.

Related

"no next heap size found: 18446744071789822643, offset 0"

I've written a simulator, which is distributed over two hosts. When I launch a few thousand processes, after about 10 minutes and half a million events written, my main Erlang (OTP v22) virtual machine crashes with this message:
no next heap size found: 18446744071789822643, offset 0.
It's always that same number - 18446744071789822643.
Because my server is very capable, the crash dump is also huge and I can't view it on my headless server (no WX installed).
Are there any tips on what I can look at?
What would be the first things I can try out to debug this issue?
First, see what memory() says:
> memory().
[{total,18480016},
{processes,4615512},
{processes_used,4614480},
{system,13864504},
{atom,331273},
{atom_used,306525},
{binary,47632},
{code,5625561},
{ets,438056}]
Check which one is growing - processes, binary, ets?
If it's processes, try typing i(). in the Erlang shell while the processes are running. You'll see something like:
Pid Initial Call Heap Reds Msgs
Registered Current Function Stack
<0.0.0> otp_ring0:start/2 233 1263 0
init init:loop/1 2
<0.1.0> erts_code_purger:start/0 233 44 0
erts_code_purger erts_code_purger:wait_for_request 0
<0.2.0> erts_literal_area_collector:start 233 9 0
erts_literal_area_collector:msg_l 5
<0.3.0> erts_dirty_process_signal_handler 233 128 0
erts_dirty_process_signal_handler 2
<0.4.0> erts_dirty_process_signal_handler 233 9 0
erts_dirty_process_signal_handler 2
<0.5.0> erts_dirty_process_signal_handler 233 9 0
erts_dirty_process_signal_handler 2
<0.8.0> erlang:apply/2 6772 238183 0
erl_prim_loader erl_prim_loader:loop/3 5
Look for a process with a very big heap, and that's where you'd start looking for a memory leak.
(If you weren't running headless, I'd suggest starting Observer with observer:start(), and look at what's happening in the Erlang node.)
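If the crash dump itself has to be inspected on the headless box, it is plain text, so standard tools work. A rough sketch (assuming the usual erl_crash.dump layout, where each process section starts with =proc: and carries a Stack+heap: field, sizes in words):
# print the 20 largest process heaps recorded in the crash dump
awk -F': ' '
  /^=proc:/       { pid = substr($0, 7) }
  /^Stack\+heap:/ { print $2, pid }
' erl_crash.dump | sort -rn | head -20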

Ceph too many pgs per osd: all you need to know

I am getting both of these errors at the same time. I can't decrease the pg count and I can't add more storage.
This is a new cluster, and I got these warnings when I uploaded about 40 GB to it. I guess it's because radosgw created a bunch of pools.
How can Ceph have too many PGs per OSD, yet have more objects per PG than average with a "too few PGs" suggestion?
HEALTH_WARN too many PGs per OSD (352 > max 300);
pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
osds: 4 (2 per site 500GB per osd)
size: 2 (cross site replication)
pg: 64
pgp: 64
pools: 11
Using rbd and radosgw, nothing fancy.
I'm going to answer my own question in hopes that it sheds some light on the issue or similar misconceptions of ceph internals.
Fixing HEALTH_WARN too many PGs per OSD (352 > max 300) once and for all
When balancing placement groups you must take into account:
Data we need
pgs per osd
pgs per pool
pools per osd
the crush map
reasonable default pg and pgp num
replica count
I will use my set up as an example and you should be able to use it as a template for your own.
Data we have
num osds : 4
num sites: 2
pgs per osd: ???
pgs per pool: ???
pools per osd: 10
reasonable default pg and pgp num: 64 (... or is it?)
replica count: 2 (cross site replication)
the crush map
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
root ourcompnay
    site a
        rack a-esx.0
            host prdceph-strg01
                osd.0   up   1.00000   1.00000
                osd.1   up   1.00000   1.00000
    site b
        rack a-esx.0
            host prdceph-strg02
                osd.2   up   1.00000   1.00000
                osd.3   up   1.00000   1.00000
Our goal is to fill in the '???' above with what we need to serve a HEALTH_OK cluster. Our pools are created by the rados gateway when it initialises.
We have a single pool, default.rgw.buckets.data, where all data is stored; the rest of the pools are administrative and internal to Ceph's metadata and bookkeeping.
PGs per osd (what is a reasonable default anyway???)
The documentation would have us use this calculation to determine our pg count per osd:
(osd * 100)
----------- = pgs UP to nearest power of 2
replica count
It is stated that rounding up is optimal. So with our current setup it would be:
(4 * 100)
----------- = (200 to the nearest power of 2) 256
2
osd.0 ~= 256
osd.1 ~= 256
osd.2 ~= 256
osd.3 ~= 256
This is the recommended max number of pgs per osd. So... what do you actually have currently? And why isn't it working? And if you set a
'reasonable default' and understand the above WHY ISN'T IT WORKING!!! >=[
Likely, a few reasons. We have to understand what those 'reasonable defaults' above actually mean, how ceph applies them and to where. One might misunderstand from the above that I could create a new pool like so:
ceph osd pool create <pool> 256 256
or I might even think I could play it safe and follow the documentation, which recommends 128 pgs for < 5 osds:
ceph osd pool create <pool> 128 128
This is wrong, flat out, because it in no way explains the relationship or balance between what Ceph is actually doing with these numbers.
Technically the correct answer is:
ceph osd pool create <pool> 32 32
And let me explain why:
If, like me, you provisioned your cluster with those 'reasonable defaults' (128 pgs for < 5 osds), then as soon as you tried to do anything with rados it created a whole bunch of pools and your cluster spazzed out.
The reason is that I misunderstood the relationship between everything mentioned above.
pools: 10 (created by rados)
pgs per pool: 128 (recommended in docs)
osds: 4 (2 per site)
10 * 128 / 4 = 320 pgs per osd
This ~320 could be the number of pgs per osd on my cluster, but Ceph might distribute these differently - which is exactly what's happening, and it is way over the 256 max per osd stated above. My cluster's health warning is HEALTH_WARN too many PGs per OSD (368 > max 300).
Using a command that tallies how many PGs from each pool land on each OSD, we're able to see better the relationship between the numbers:
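A rough sketch of such a tally (not the original command; it assumes ceph pg dump pgs_brief prints the PG id in column 1 and the UP set in column 3, which can differ between releases):
# hypothetical tally of PGs per OSD and per pool, derived from the UP set of every PG
ceph pg dump pgs_brief 2>/dev/null | awk '
  /^[0-9]+\.[0-9a-f]+/ {
    split($1, id, "\\.")            # "17.3a" -> pool id 17
    gsub(/\[|\]/, "", $3)           # UP set, e.g. "[0,2]"
    n = split($3, up, ",")
    for (i = 1; i <= n; i++) { per_pool[up[i] "," id[1]]++; per_osd[up[i]]++ }
  }
  END {
    for (k in per_pool) { split(k, x, ","); printf "osd.%s  pool %s  %d pgs\n", x[1], x[2], per_pool[k] }
    for (o in per_osd)  printf "osd.%s  total  %d pgs\n", o, per_osd[o]
  }' | sort -V
Cross-tabulated, those counts give the table below: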
pool :17 18 19 20 21 22 14 23 15 24 16 | SUM
------------------------------------------------< - *total pgs per osd*
osd.0 35 36 35 29 31 27 30 36 32 27 28 | 361
osd.1 29 28 29 35 33 37 34 28 32 37 36 | 375
osd.2 27 33 31 27 33 35 35 34 36 32 36 | 376
osd.3 37 31 33 37 31 29 29 30 28 32 28 | 360
-------------------------------------------------< - *total pgs per pool*
SUM :128 128 128 128 128 128 128 128 128 128 128
There's a direct correlation between the number of pools you have and the number of placement groups that are assigned to them.
I have 11 pools in the snippet above and they each have 128 pgs and that's too many!! My reasonable defaults are 64! So what happened??
I was misunderstanding how the 'reasonable defaults' were being used. When I set my default to 64, you can see Ceph has taken my crush map into account, where
I have a failure domain between site a and site b. Ceph has to ensure that everything that's on site a is at least accessible on site b.
WRONG
site a
    osd.0
    osd.1      TOTAL of ~ 64 pgs
site b
    osd.2
    osd.3      TOTAL of ~ 64 pgs
We needed a grand total of 64 pgs per pool so our reasonable defaults should've actually been set to 32 from the start!
If we use ceph osd pool create <pool> 32 32, what this amounts to is that the relationship between our pgs per pool and pgs per osd with those 'reasonable defaults' and our recommended max pgs per osd starts to make sense.
So you broke your cluster ^_^
Don't worry, we're going to fix it. The procedure, I'm afraid, might vary in risk and time depending on how big your cluster is. But the only way
to get around altering this is either to add more storage, so that the placement groups can redistribute over a larger surface area, OR to move everything over to
newly created pools.
I'll show an example of moving the default.rgw.buckets.data pool:
old_pool=default.rgw.buckets.data
new_pool=new.default.rgw.buckets.data
create a new pool, with the correct pg count:
ceph osd pool create $new_pool 32
copy the contents of the old pool to the new pool:
rados cppool $old_pool $new_pool
remove the old pool:
ceph osd pool delete $old_pool $old_pool --yes-i-really-really-mean-it
rename the new pool to 'default.rgw.buckets.data'
ceph osd pool rename $new_pool $old_pool
Now it might be a safe bet to restart your radosgws.
FINALLY CORRECT
site a
    osd.0
    osd.1      TOTAL of ~ 32 pgs
site b
    osd.2
    osd.3      TOTAL of ~ 32 pgs
As you can see, the pool numbers have incremented, since the new pools get new pool ids. And our total pgs per osd is way under the ~256, which gives us room to add custom pools if required.
pool : 26 35 27 36 28 29 30 31 32 33 34 | SUM
-----------------------------------------------
osd.0 15 18 16 17 17 15 15 15 16 13 16 | 173
osd.1 17 14 16 15 15 17 17 17 16 19 16 | 179
osd.2 17 14 16 18 12 17 18 14 16 14 13 | 169
osd.3 15 18 16 14 20 15 14 18 16 18 19 | 183
-----------------------------------------------
SUM : 64 64 64 64 64 64 64 64 64 64 64
Now you should test your ceph cluster with whatever is at your disposal. Personally, I've written a bunch of Python over boto that tests the infrastructure and returns bucket stats and metadata rather quickly. These tests have assured me that the cluster is back in working order without any of the issues it suffered from previously. Good luck!
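If you don't have a test suite like that handy, a quick smoke test can also be done with plain rados bench against a throwaway pool (this is not the python/boto harness mentioned above, and the pool name is arbitrary):
ceph osd pool create scratchbench 32
rados bench -p scratchbench 10 write --no-cleanup    # write for 10 seconds, keep the objects
rados bench -p scratchbench 10 seq                   # read them back sequentially
ceph osd pool delete scratchbench scratchbench --yes-i-really-really-mean-it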
Fixing pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?) once and for all
This quite literally means you need to increase the pg and pgp num of your pool. So... do it, with everything mentioned above in mind. When you do this, however, note that the cluster will start backfilling; you can watch the progress with watch ceph -s in another terminal window or screen.
ceph osd pool set default.rgw.buckets.data pg_num 128
ceph osd pool set default.rgw.buckets.data pgp_num 128
Armed with the knowledge and confidence in the system provided in the above segment we can clearly understand the relationship and the influence of such a change on the cluster.
pool : 35 26 27 36 28 29 30 31 32 33 34 | SUM
----------------------------------------------
osd.0 18 64 16 17 17 15 15 15 16 13 16 | 222
osd.1 14 64 16 15 15 17 17 17 16 19 16 | 226
osd.2 14 66 16 18 12 17 18 14 16 14 13 | 218
osd.3 18 62 16 14 20 15 14 18 16 18 19 | 230
-----------------------------------------------
SUM : 64 256 64 64 64 64 64 64 64 64 64
Can you guess which pool id is default.rgw.buckets.data? haha ^_^
In Ceph Nautilus (v14 or later), you can turn on "PG Autotuning". See this documentation and this blog entry for more information.
I accidentally created pools with live data that I could not migrate to repair the PGs. It took some days to recover, but the PGs were optimally adjusted with zero problems.
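A minimal sketch of turning the autoscaler on (the module is enabled by default from Octopus onward; pool name as in this thread):
ceph mgr module enable pg_autoscaler
ceph osd pool set default.rgw.buckets.data pg_autoscale_mode on
ceph osd pool autoscale-status      # shows current vs. suggested pg_num per pool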

How to debug leak in native memory on JVM?

We have a java application running on Mule. We have the XMX value configured for 6144M, but are routinely seeing the overall memory usage climb and climb. It was getting close to 20 GB the other day before we proactively restarted it.
Thu Jun 30 03:05:57 CDT 2016
top - 03:05:58 up 149 days, 6:19, 0 users, load average: 0.04, 0.04, 0.00
Tasks: 164 total, 1 running, 163 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.2%us, 1.7%sy, 0.0%ni, 93.9%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24600552k total, 21654876k used, 2945676k free, 440828k buffers
Swap: 2097144k total, 84256k used, 2012888k free, 1047316k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3840 myuser 20 0 23.9g 18g 53m S 0.0 79.9 375:30.02 java
The jps command shows:
10671 Jps
3840 MuleContainerBootstrap
The jstat command shows:
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
37376.0 36864.0 16160.0 0.0 2022912.0 1941418.4 4194304.0 445432.2 78336.0 66776.7 232 7.044 17 17.403 24.447
The startup arguments are (sensitive bits have been changed):
3840 MuleContainerBootstrap -Dmule.home=/mule -Dmule.base=/mule -Djava.net.preferIPv4Stack=TRUE -XX:MaxPermSize=256m -Djava.endorsed.dirs=/mule/lib/endorsed -XX:+HeapDumpOnOutOfMemoryError -Dmyapp.lib.path=/datalake/app/ext_lib/ -DTARGET_ENV=prod -Djava.library.path=/opt/mapr/lib -DksPass=mypass -DsecretKey=aeskey -DencryptMode=AES -Dkeystore=/mule/myStore -DkeystoreInstance=JCEKS -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dmule.mmc.bind.port=1521 -Xms6144m -Xmx6144m -Djava.library.path=%LD_LIBRARY_PATH%:/mule/lib/boot -Dwrapper.key=a_guid -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.disable_console_input=TRUE -Dwrapper.pid=10744 -Dwrapper.version=3.5.19-st -Dwrapper.native_library=wrapper -Dwrapper.arch=x86 -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 -Dwrapper.lang.domain=wrapper -Dwrapper.lang.folder=../lang
Adding up the "capacity" items from jps shows that only my 6144m is being used for java heap. Where the heck is the rest of the memory being used? Stack memory? Native heap? I'm not even sure how to proceed.
If left to continue growing, it will consume all memory on the system and we will eventually see the system freeze up throwing swap space errors.
I have another process that is starting to grow. Currently at about 11g resident memory.
pmap 10746 > pmap_10746.txt
cat pmap_10746.txt | grep anon | cut -c18-25 | sort -h | uniq -c | sort -rn | less
Top 10 entries by count:
119 12K
112 1016K
56 4K
38 131072K
20 65532K
15 131068K
14 65536K
10 132K
8 65404K
7 128K
Top 10 entries by allocation size:
1 6291456K
1 205816K
1 155648K
38 131072K
15 131068K
1 108772K
1 71680K
14 65536K
20 65532K
1 65512K
And top 10 by total size:
Count Size Aggregate
1 6291456K 6291456K
38 131072K 4980736K
15 131068K 1966020K
20 65532K 1310640K
14 65536K 917504K
8 65404K 523232K
1 205816K 205816K
1 155648K 155648K
112 1016K 113792K
This seems to be telling me that because the Xmx and Xms are set to the same value, there is a single allocation of 6291456K for the java heap. Other allocations are NOT java heap memory. What are they? They are getting allocated in rather large chunks.
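(One way to account for memory the JVM itself allocates outside the heap - thread stacks, metaspace/permgen, code cache, GC structures - is Native Memory Tracking. A rough sketch, assuming a JDK 8 or newer JVM - the MaxPermSize flag above suggests an older one, where this isn't available - and that the process can be restarted with an extra flag; NMT does not see memory malloc'd directly by native libraries:)
# add to the JVM arguments (via the wrapper config in this setup), then restart:
#   -XX:NativeMemoryTracking=summary
# while it runs, get a breakdown and a diff against a baseline
jcmd 3840 VM.native_memory summary
jcmd 3840 VM.native_memory baseline
jcmd 3840 VM.native_memory summary.diff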
Expanding a bit more on Peter's answer.
You can take a binary heap dump from within VisualVM (right click on the process in the left-hand side list, and then on heap dump - it'll appear right below shortly after). If you can't attach VisualVM to your JVM, you can also generate the dump with this:
jmap -dump:format=b,file=heap.hprof $PID
Then copy the file and open it with Visual VM (File, Load, select type heap dump, find the file.)
As Peter notes, a likely cause for the leak may be non collected DirectByteBuffers (e.g.: some instance of another class is not properly de-referencing buffers, so they are never GC'd).
To identify where these references are coming from, you can use VisualVM to examine the heap and find all instances of DirectByteBuffer in the "Classes" tab. Find the DBB class, right click, go to the instances view.
This will give you a list of instances. You can click on one and see who's keeping a reference to each one:
Note the bottom pane, we have "referent" of type Cleaner and 2 "mybuffer". These would be properties in other classes that are referencing the instance of DirectByteBuffer we drilled into (it should be ok if you ignore the Cleaner and focus on the others).
From this point on you need to proceed based on your application.
Another equivalent way to get the list of DBB instances is from the OQL tab. This query:
select x from java.nio.DirectByteBuffer x
Gives us the same list as before. The benefit of using OQL is that you can execute more complex queries. For example, this gets all the instances that are keeping a reference to a DirectByteBuffer:
select referrers(x) from java.nio.DirectByteBuffer x
What you can do is take a heap dump and look for objects which are storing data off heap, such as ByteBuffers. Those objects will appear small but are a proxy for larger off-heap memory areas. See if you can determine why lots of those might be retained.
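A quick way to check whether DirectByteBuffer instances are piling up over time, without opening a full dump, is a class histogram (pid as shown by jps above; note that -histo:live forces a full GC):
jmap -histo:live 3840 | grep -i directbytebuffer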

reducing jitter of serial ntp refclock

I am currently trying to connect my DIY DCF77 clock to ntpd (using Ubuntu). I followed the instructions here: http://wiki.ubuntuusers.de/Systemzeit.
With ntpq I can see the DCF77 clock
~$ ntpq -c peers
remote refid st t when poll reach delay offset jitter
==============================================================================
+dispatch.mxjs.d 192.53.103.104 2 u 6 64 377 13.380 12.608 4.663
+main.macht.org 192.53.103.108 2 u 12 64 377 33.167 5.008 4.769
+alvo.fungus.at 91.195.238.4 3 u 15 64 377 16.949 7.454 28.075
-ns1.blazing.de 213.172.96.14 2 u - 64 377 10.072 14.170 2.335
*GENERIC(0) .DCFa. 0 l 31 64 377 0.000 5.362 4.621
LOCAL(0) .LOCL. 12 l 927 64 0 0.000 0.000 0.000
So far this looks OK. However I have two questions.
What exactly is the sign of the offset? Is .DCFa. ahead of the system clock or behind the system clock?
.DCFa. points to refclock-0 which is a DIY DCF77 clock emulating a Meinberg clock. It is connected to my Ubuntu Linux box with an FTDI usb-serial adapter running at 9600 7e2. I verified with a DSO that it emits the time with jitter significantly below 1ms. So I assume the jitter is introduced by either the FTDI adapter or the kernel. How would I find out and how can I reduce it?
Part One:
Positive offsets indicate time in the client is behind time on the server.
Negative offsets indicate that time in the client is ahead of time on the server.
I always remember this as "what needs to happen to my clock?"
+0.123 = Add 0.123 to me
-0.123 = Subtract 0.123 from me
Part Two:
Yes, USB serial converters add jitter. Get a real serial port. :) You can also use setserial and tell it that the serial port needs to be low_latency; just apt-get setserial.
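A rough sketch of that tweak (the device name is assumed to be /dev/ttyUSB0 for the FTDI adapter; adjust to whatever yours enumerates as):
sudo apt-get install setserial
sudo setserial /dev/ttyUSB0 low_latency      # ask the driver to push received characters up with minimal buffering
sudo setserial -a /dev/ttyUSB0               # verify the flag is set
# FTDI-specific alternative: shrink the chip's latency timer to 1 ms (sysfs path may vary)
echo 1 | sudo tee /sys/bus/usb-serial/devices/ttyUSB0/latency_timer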
Bonus Points:
Lose the unreferenced local clock entry. NO LOCL!!!!
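That entry corresponds to the undisciplined local clock driver; removing (or commenting out) lines like these in /etc/ntp.conf and restarting ntpd gets rid of it (stratum value as shown in the peers output above):
# /etc/ntp.conf - delete these two lines
server 127.127.1.0
fudge 127.127.1.0 stratum 12
sudo service ntp restart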

How to interpret the memory usage figures?

Can someone explain this in a practical way? Sample represents usage for one, low-traffic Rails site using Nginx and 3 Mongrel clusters. I ask because I am aiming to learn about page caching, wondering if these figures have significant meaning to that process. Thank you. Great site!
me#vps:~$ free -m
total used free shared buffers cached
Mem: 512 506 6 0 15 103
-/+ buffers/cache: 387 124
Swap: 1023 113 910
Physical memory is all used up. Why? Because it's there, the system should be using it.
You'll note also that the system is using 113M of swap space. Bad? Good? It depends.
See also that there's 103M of cached disk; this means that the system has decided it's better to cache 103M of disk data and swap out those 113M; maybe some processes are holding memory they aren't actively using, so it has been paged out to disk.
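The -/+ buffers/cache line is just the first line with the reclaimable buffer and cache figures shifted to the other side; the arithmetic checks out, give or take rounding:
echo $((506 - 15 - 103))   # 388 ~= 387 MB genuinely used by applications
echo $((6 + 15 + 103))     # 124 MB effectively available once caches are dropped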
As the other poster said, you should be using other tools to see what's happening:
Your perception: is the site running appropriately when you use it?
Benchmarking: what response times are your clients seeing?
More fine-grained diagnostics:
top: you can see live which processes are using memory and CPU
vmstat: it produces this kind of output:
alex#armitage:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 71184 156520 92524 316488 1 5 12 23 362 250 13 6 80 1
0 0 71184 156340 92528 316508 0 0 0 1 291 608 10 1 89 0
0 0 71184 156364 92528 316508 0 0 0 0 308 674 9 2 89 0
0 0 71184 156364 92532 316504 0 0 0 72 295 723 9 0 91 0
1 0 71184 150892 92532 316508 0 0 0 0 370 722 38 0 62 0
0 0 71184 163060 92532 316508 0 0 0 0 303 611 17 2 81 0
which will show you whether swap is hurting you (high numbers under si and so) and gives an easier-to-read performance-over-time view.
By my reading of this, you have used almost all your memory, have 6M free, and are going into about 10% of your swap. A more useful tool is top, or perhaps ps, to see how much RAM each of your individual Mongrels is using. Because you're going into swap, you're probably seeing slowdowns. You might find that having only 2 Mongrels rather than 3 actually responds faster, because it likely wouldn't go into swap.
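For example, sorted by resident memory (the process name is a guess; adjust the grep to match your Mongrel processes):
ps -eo rss,pid,command --sort=-rss | grep -i mongrel    # RSS is in KiB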
Page caching will for sure help a tonne on response time, so if your pages are cacheable (e.g., they don't have content that is unique to the individual user), I would say for sure check it out.
