Token balancing in a brand-new Cassandra cluster - Docker

My setup consists of 3 Cassandra nodes. Every node runs in its own Docker container.
One seed node and two normal nodes.
I use cassandra:latest, which at the moment means version 3.11.4.
All nodes run in one cluster.
All nodes run in one datacenter.
I use the following settings in my docker-compose.yml:
- "CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch"
- "CASSANDRA_NUM_TOKENS=8"
- "MAX_HEAP_SIZE=512M"
- "HEAP_NEWSIZE=128M"
The heap sizes are so small because I am only testing cluster startup and my notebook does not have enough RAM.
The partitioner is Cassandra's default Murmur3Partitioner.
I only start the cluster; there is no keyspace creation or anything else going on.
Every piece of documentation I found states that token ranges should be balanced and that an unbalanced token distribution is bad, etc.
But what is a balanced token range?
When I start the cluster, the seed container comes up first; then, at an interval of 1 minute, each of the other nodes comes up and becomes ready.
The cluster is healthy and there are no errors in the logs, as the output of docker-compose ps shows:
Name Command State Ports
----------------------------------------------------------------------------------------------------------------------------------
docker_cassandra-seed_1 docker-entrypoint.sh bash ... Up 7000/tcp, 7001/tcp, 7199/tcp, 0.0.0.0:23232->9042/tcp, 9160/tcp
docker_cassandra1_1 docker-entrypoint.sh bash ... Up 7000/tcp, 7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp
docker_cassandra2_1 docker-entrypoint.sh bash ... Up 7000/tcp, 7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp
Once the cluster is up, there are 3 nodes with 8 vnodes each.
That gives 24 tokens in the cluster and therefore 24 token ranges.
The token space in Cassandra goes from -2^63 to +2^63 - 1 (a Java long).
If I run
docker exec -ti docker_cassandra-seed_1 nodetool ring
I receive the following result:
Datacenter: tc1
==========
Address Rack Status State Load Owns Token
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% -8870864291163548206
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% -8804151848356105327
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% -8578084366820530367
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% -7746741366682664202
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% -7013522326538302096
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% -6994428155886831685
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% -6650863707982675450
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% -5995004048488281144
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% -5683587085031530885
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% -5274940575732780430
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% -5184169415607375486
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% -2082614198258325552
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% -1084866128895283137
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% 2495470503021543046
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% 3043280549254813456
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% 3058642754102082410
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% 3117172086630093502
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% 3405798334726690865
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% 3829479365384141235
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% 4124513942316551627
172.27.0.2 rack1 Up Normal 220.44 KiB 55.24% 4807293191442647176
172.27.0.4 rack1 Up Normal 231.07 KiB 55.89% 4911525338969505185
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% 8068956543491535994
172.27.0.3 rack1 Up Normal 254.57 KiB 88.87% 8197176123795617738
Which means the size of every token range in the ring differs wildly.
Or in other words: in an ideal token distribution each of the 3 * 8 = 24 token ranges would cover roughly (2^64 - 1) / (3 * 8) ≈ 768,614,336,404,564,650 tokens.
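A quick way to sanity-check that number (a sketch, using any Python 3 on the host):
python3 -c "print((2**64 - 1) // (3 * 8))"
# prints 768614336404564650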
Sorry, I only had Excel at hand for the calculation (values rounded to 10,000s):
-9.223.372.036.854.770.000 Long Min
-8.870.864.291.163.540.000 352.507.745.691.229.000
-8.804.151.848.356.100.000 66.712.442.807.440.400
-8.578.084.366.820.530.000 226.067.481.535.570.000
-7.746.741.366.682.660.000 831.343.000.137.870.000
-7.013.522.326.538.300.000 733.219.040.144.359.000
-6.994.428.155.886.830.000 19.094.170.651.470.800
-6.650.863.707.982.670.000 343.564.447.904.160.000
-5.995.004.048.488.280.000 655.859.659.494.390.000
-5.683.587.085.031.530.000 311.416.963.456.750.000
-5.274.940.575.732.780.000 408.646.509.298.750.000
-5.184.169.415.607.370.000 90.771.160.125.410.300
-2.082.614.198.258.320.000 3.101.555.217.349.050.000
-1.084.866.128.895.280.000 997.748.069.363.040.000
2.495.470.503.021.540.000 3.580.336.631.916.820.000
3.043.280.549.254.810.000 547.810.046.233.270.000
3.058.642.754.102.080.000 15.362.204.847.269.900
3.117.172.086.630.090.000 58.529.332.528.010.200
3.405.798.334.726.690.000 288.626.248.096.600.000
3.829.479.365.384.140.000 423.681.030.657.450.000
4.124.513.942.316.550.000 295.034.576.932.410.000
4.807.293.191.442.640.000 682.779.249.126.090.000
4.911.525.338.969.500.000 104.232.147.526.860.000
8.068.956.543.491.530.000 3.157.431.204.522.030.000
8.197.176.123.795.610.000 128.219.580.304.080.000
9.223.372.036.854.770.000 Long Max
The right column shows the size of each token range, and there is a big gap between the biggest and the smallest range.
Or, picking a few consecutive tokens from the middle of the result, it is clearly uneven/unbalanced:
-5184169415607375486
-2082614198258325552
-1084866128895283137
After some testing, I set up something super simple.
One PC (Ubuntu 18.04, Java 1.8.0_201, Cassandra 3.6).
Install it, leave all parameters at their defaults, bring the Cassandra service up, and look at the token distribution.
Here is the result:
(Screenshot: token distribution on a new cluster)
So my question is: what does a balanced token range mean in a Cassandra cluster?

The approach described in this link
https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
seems to be the solution, at least for the distribution of tokens and data for a keyspace.
I took the following steps to get a balanced system:
1. Set up cassandra.yaml on the seed node with num_tokens=8 (for my test case) and leave the other parameters at their defaults.
2. Start the seed node and wait until it is ready.
3. Connect via cqlsh (or programmatically) and create the keyspace (for my test case with replication factor 1); see the sketch after this list.
4. Shut down the seed node.
5. Edit the cassandra.yaml of the seed node and uncomment/add the parameter allocate_tokens_for_keyspace: [your_keyspace_name_from_step_3].
6. Start the seed node again and wait until it is ready.
7. Edit the cassandra.yaml of the second node in the cluster: apply step 5 to this file as well and set num_tokens to the same value as on the seed node.
8. Start the second node and wait until it is ready.
9. Repeat steps 7-8 for every other node in your cluster.
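As a sketch for steps 3 and 5 (the keyspace name my_keyspace is an assumption; the container name and the tc1 datacenter are the ones from my setup, adjust them to yours):
# Step 3: create the keyspace on the running seed node (replication factor 1, datacenter tc1)
docker exec -ti docker_cassandra-seed_1 cqlsh -e "CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'tc1': 1};"
# Step 5: after shutting the seed node down, the relevant lines in its cassandra.yaml read:
#   num_tokens: 8
#   allocate_tokens_for_keyspace: my_keyspace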
With that, and e.g. a test run adding 2,000,000 rows to a test table in the keyspace, I see the following result:
docker exec -ti docker_cassandra-seed_1 nodetool status
Datacenter: tc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.30.10.4 36.03 MiB 8 33.3% 1e0d781f-d71f-4704-bcd1-efb5d4caff0e rack1
UN 172.30.10.2 36.75 MiB 8 33.3% 56287b3c-b0f1-489f-930e-c7b00df896f3 rack1
UN 172.30.10.3 36.03 MiB 8 33.3% 943acc5f-7257-414a-b36c-c06dcb53e67d rack1
Even the token distribution is better than before:
172.30.10.2 6.148.914.691.236.510.000
172.30.10.3 6.148.914.691.236.520.000
172.30.10.4 5.981.980.531.853.070.000
For now, that clears up the problem with the uneven distribution, so thank you again, Chris Lohfink, for the link with the solution.

I have been testing a bit more around the above scenario.
My test cluster consists of 5 nodes (1 seed, 4 normal nodes).
The first 5 steps from above remain valid:
1. Set up cassandra.yaml on the seed node with num_tokens=8 (for my test case) and leave the other parameters at their defaults.
2. Start the seed node and wait until it is ready.
3. Connect via cqlsh (or programmatically) and create the keyspace (for my test case with replication factor 1).
4. Shut down the seed node, edit its cassandra.yaml and uncomment/add the parameter allocate_tokens_for_keyspace: [your_keyspace_name_from_step_3].
5. Start the seed node again and wait until it is ready.
Then you can start all the other nodes (in my case 4) at the same time (or with a 1-minute delay between each node), but automated; see the sketch below. What matters is that every node has allocate_tokens_for_keyspace: [your_keyspace....] set.
After all nodes are up and the keyspace is filled with 1,000,000 rows, there is an even balance of 20% per node.
That scenario makes life easier if you start a cluster with a lot of nodes.
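A minimal sketch of that automated startup (the compose service names cassandra1..cassandra4 and the 60-second delay are assumptions based on my setup):
for node in cassandra1 cassandra2 cassandra3 cassandra4; do
  docker-compose up -d "$node"
  sleep 60
done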

Related

CassandraDB nodes crash when 4 nodes are present

I am trying to set up an Apache CassandraDB cluster via docker.
I've managed to create up to and including 3 nodes using docker run --network cassandra -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_SEEDS=node0 --name "node0" -d cassandra.
Whenever I add a 4th node using the same command (just changing the container name), another random node in the cluster crashes and the node that was created also exits shortly after being on the UJ and DJ stages.
Here's what I've tried.
docker network create cassandra
docker run --network cassandra -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_SEEDS=node0 --name "node0" -d cassandra
I then waited until the node was up and nodetool was showing this:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.19.0.2 88.49 KiB 16 100.0% 15573541-fc19-4569-9a43-cb04e49e134f rack1
After that, I added two additional nodes to the cluster.
docker run --network cassandra -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_SEEDS=node0 --name "node1" -d cassandra
docker run --network cassandra -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_SEEDS=node0 --name "node2" -d cassandra
I waited until those two joined the cluster and nodetool showed this:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.19.0.2 74.11 KiB 16 64.7% 15573541-fc19-4569-9a43-cb04e49e134f rack1
UN 172.19.0.4 98.4 KiB 16 76.0% 30afdc85-e863-452c-9031-59803e4b1f11 rack1
UN 172.19.0.3 74.04 KiB 16 59.3% 6d92cf62-65b4-4365-ab28-2d53872605e3 rack1
That seems good! After that, I wanted to add another node to test whether my replication factor was working properly. So, I added another node to the cluster using the same command:
docker run --network cassandra -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_SEEDS=node0 --name "node3" -d cassandra
When I added this node, node1 crashed immediately. node3 (that's the new one) was briefly in the UJ (Up-Joining) stage, then switched to DJ (Down-Joining) and was then removed from the node list.
Here are the results from nodetool status, in order:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.19.0.2 74.11 KiB 16 64.7% 15573541-fc19-4569-9a43-cb04e49e134f rack1
UN 172.19.0.4 74.03 KiB 16 76.0% 30afdc85-e863-452c-9031-59803e4b1f11 rack1
DN 172.19.0.3 74.04 KiB 16 59.3% 6d92cf62-65b4-4365-ab28-2d53872605e3 rack1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UJ 172.19.0.5 20.75 KiB 16 ? 2e4a25e4-3c81-4383-9c9f-6326e4043910 rack1
UN 172.19.0.2 74.11 KiB 16 64.7% 15573541-fc19-4569-9a43-cb04e49e134f rack1
UN 172.19.0.4 74.03 KiB 16 76.0% 30afdc85-e863-452c-9031-59803e4b1f11 rack1
DN 172.19.0.3 74.04 KiB 16 59.3% 6d92cf62-65b4-4365-ab28-2d53872605e3 rack1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DJ 172.19.0.5 20.75 KiB 16 ? 2e4a25e4-3c81-4383-9c9f-6326e4043910 rack1
UN 172.19.0.2 74.11 KiB 16 64.7% 15573541-fc19-4569-9a43-cb04e49e134f rack1
UN 172.19.0.4 74.03 KiB 16 76.0% 30afdc85-e863-452c-9031-59803e4b1f11 rack1
DN 172.19.0.3 74.04 KiB 16 59.3% 6d92cf62-65b4-4365-ab28-2d53872605e3 rack1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.19.0.2 74.11 KiB 16 64.7% 15573541-fc19-4569-9a43-cb04e49e134f rack1
UN 172.19.0.4 74.03 KiB 16 76.0% 30afdc85-e863-452c-9031-59803e4b1f11 rack1
DN 172.19.0.3 74.04 KiB 16 59.3% 6d92cf62-65b4-4365-ab28-2d53872605e3 rack1
Here are the logs for node1:
As you can see, the first item in the log was the confirmation that node2 had connected to the cluster.
https://gist.github.com/janic0/7e464e5c819c37e6ed38819fb3c19eff
Here are the logs for node3 (again, that's the new node):
https://gist.github.com/janic0/0968b7136c3beb3ef76a2379f3cd9be5
I've investigated it and found that Docker kills these containers with exit code 137 - or "out of memory".
Yep, I thought something like that was happening.
Each of the nodes used up about 4 GB of RAM, and the fourth node was just enough to force Docker to kill some of the containers. If you do want to host that many nodes on one machine for some reason, you can increase the memory limit available to Docker.
So I've done something like this before. If you're just going to be doing some local testing and you want a multi-node cluster, I've used Minikube for that before. In fact, I put together a repo which has some resources for doing that: https://github.com/aploetz/cassandra_minikube
But another approach which might be a "quick fix" for you, would be to explicitly adjust the Java heap sizing to something much smaller for each of your nodes. In my Minikube example above, I'd set:
-Xms512M
-Xmx512M
-Xmn256M
This should create a 1/2 GB heap, which is plenty for local dev or some simple testing. You can set these values in your cassandra-env.sh or jvm-server.options file (depending on your Cassandra version).
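Alternatively, with the official cassandra image you can pass the heap settings as environment variables instead of editing files inside the container (a sketch reusing the docker run command from the question; MAX_HEAP_SIZE and HEAP_NEWSIZE are the same variables the compose file in the first question uses, and cassandra-env.sh picks them up when no explicit heap is configured):
docker run --network cassandra -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_SEEDS=node0 -e MAX_HEAP_SIZE=512M -e HEAP_NEWSIZE=256M --name "node3" -d cassandra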

GKE private cluster Django server timeout 502 after 64 API requests

So I have a GKE private cluster production ready environment, where I host my Django Rest Framework microservices.
It all works fine, but after 64 API requests the server times out and the pod becomes unreachable.
I am not sure why this is happening.
I use the following stack:
Django 3.2.13
GKE v1.22.8-hke.201
Postgresql Cloud SQL
Docker
My Django application is a simple one. No authentication on POST. Just a small json-body is sent, and the server saves it to the PostgreSQL database.
The server connects via cloud_sql_proxy to the database, but I also tried to use the IP and the PySQL library. It works, but same error/timeout.
The workloads having the issue are any that make a DB call; it does not matter whether it is a SELECT * or an INSERT.
However, when I do a load-balancing test (Locust, Python) against the home page of any microservice within the cluster (readiness), I do not experience any API call timeouts or server restarts.
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /api/logserver/logs/ 64 0(0.00%) | 96 29 140 110 | 10.00 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 64 0(0.00%) | 96 29 140 110 | 10.00 0.00
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /api/logserver/logs/ 77 13(19.48%) | 92 17 140 100 | 0.90 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 77 13(19.48%) | 92 17 140 100 | 0.90 0.00
So it looks like it has something to do with the way the DB is connected to the pods?
I use cloud_sql_proxy to connect to the DB, and this also results in a timeout and a restart of the pod.
I have tried updating gunicorn in the docker environment for Django to:
CMD gunicorn -b :8080 --log-level debug --timeout 90 --workers 6 --access-logfile '-' --error-logfile '-' --graceful-timeout 30 helloadapta.wsgi
And I have tried replacing gunicorn with uwsgi.
I also tried using plain python manage.py runserver 0.0.0.0:8080
They all serve the backend, and I can connect to it. But the issue on timeout persists.
This is the infrastructure:
Private GKE cluster which uses subnetwork in GCP.
Cloud NAT on network for outbound external static IP (needed to whitelist microservers in third party servers)
The Cluster has more than enough memory and cpu:
nodes: 3
total vCPUs: 24
total memory: 96GB
Each node has:
CPU allocatable: 7.91 CPU
Memory allocatable: 29.79 GB
The config in the yaml file states that the pod gets:
resources:
  limits:
    cpu: "1"
    memory: "2Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
Only when I make a readiness call to the server is there no timeout.
So it really points in the direction that Cloud SQL breaks after 64 API calls.
The Cloud SQL Database stack is:
1 sql instance
1 database within the instance
4 vCPUs
15 GB memory
max_connections 85000
The CPU utilisation never goes above 5%

K8s cannot schedule new pods to worker nodes even though there are enough resources

Currently, I am facing an issue where K8s scales up new pods on an old deployment and Rancher shows them stuck on being scheduled onto the worker nodes. They do eventually get scheduled, but it takes some time; as I understand it, this is the scheduler waiting to find a node that fits the resource request.
In the Event section of that deployment, it shows:
Warning FailedScheduling 0/8 nodes are available: 5 Insufficient memory, 3 node(s) didn't match node selector.
Then I go to the Nodes tab to check if there is any lack of memory on the worker nodes, and it shows my worker nodes like this:
STATE NAME ROLES VERSION CPU RAM POD
Active worker-01 Worker v1.19.5 14/19 Cores 84.8/86.2 GiB 76/110
Active worker-02 Worker v1.19.5 10.3/19 Cores 83.2/86.2 GiB 51/110
Active worker-03 Worker v1.19.5 14/19 Cores 85.8/86.2 GiB 63/110
Active worker-04 Worker v1.19.5 13/19 Cores 84.4/86.2 GiB 53/110
Active worker-05 Worker v1.19.5 12/19 Cores 85.2/86.2 GiB 68/110
But when I go into each server and check memory with the top and free commands, they output similar results, like this one on the worker-01 node:
top:
Tasks: 827 total, 2 running, 825 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34.9 us, 11.9 sy, 0.0 ni, 51.5 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 98833488 total, 2198412 free, 81151568 used, 15483504 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 17101808 avail Mem
free -g:
total used free shared buff/cache available
Mem: 94 77 1 0 14 16
Swap: 0 0 0
So the memory available on the nodes is about 16-17 GB, but new pods still cannot be scheduled onto them. What I wonder is what causes this conflict in the memory numbers: is the difference between 86.2 GiB (in the Rancher GUI) and 94 GB (on the server) reserved for the OS and other processes? And why does Rancher show the K8s workload currently taking about 83-85 GB, while on the server the available memory is about 16-17 GB? Is there any way to look deeper into this?
I'm still learning K8s, so please explain in detail if you can, or point me to topics that cover this.
Thanks in advance!
It doesn't matter what the actual resource consumption on the worker nodes is.
What really matters is the resource requests.
Requests are what the container is guaranteed to get. If a container requests a resource, Kubernetes will only schedule it on a node that can give it that resource.
Read more about Resource Management for Pods and Containers
but what I wonder is why it shows almost full at 86.2 GB when the actual memory is 94 GB
Use kubectl describe node <node name> to see how much memory has been made available to the kubelet on a particular node.
You will see something like:
Capacity:
cpu: 8
ephemeral-storage: 457871560Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32626320Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 457871560Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32626320Ki
pods: 110
......
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 100m (1%) 100m (1%)
memory 50Mi (0%) 50Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
K8s workload currently takes about 83-85 GB but in server the memory available is about 16-17GB.
From the output of free in the question, this is not really true:
KiB Mem : 98833488 total, 2198412 free, 81151568 used, 15483504 buff/cache
2198412 free, which is ~2 GB, and you have ~15 GB in buff/cache.
You can use cat /proc/meminfo to get more details about OS-level memory info.
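For example, a few of the fields worth comparing (a sketch):
grep -E 'MemTotal|MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo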

Cassandra: How to increase the number of node instances in localhost

I am able to get two Cassandra node instances up and running through Docker.
docker run --name n1 -d tobert/cassandra -dc DC1 -rack RAC1
docker run --name n2 -d tobert/cassandra -seeds 172.17.0.2 -dc DC2 -rack RAC1
When I try to start the new node instance n3, it doesn't throw any error, but I do not see the n3 instance come up; I am seeing only 2 nodes.
$ docker run --name n3 -d tobert/cassandra -seeds 172.17.0.2 -dc DC1 -rack RAC2
XXX
$ docker ps (doesn't show the third cassandra node)
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8ab64fa86819 tobert/cassandra "/bin/cassandra-docke" 41 minutes ago Up 41 minutes 7000/tcp, 7199/tcp, 9042/tcp, 9160/tcp, 61621/tcp n2
125fc4ffba4d tobert/cassandra "/bin/cassandra-docke" 42 minutes ago Up 42 minutes 7000/tcp, 7199/tcp, 9042/tcp, 9160/tcp, 61621/tcp n1
$ docker exec -it n1 nodetool status
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.3 82.43 KB 256 100.0% XXX RAC1
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.2 51.48 KB 256 100.0% XXX RAC1
Will anyone please let me know why this is happening? What configuration needs to be done to start more node instances? It is clear that running more than 2 node instances on my localhost is an issue here. Why?
It looks like sometimes we need to run the docker run command more than once in order to start up the new node. No idea why this happens.
I assigned 4 GB of memory to Docker using the command boot2docker --memory 4096 init, which gave enough room to add the new node I was expecting.
Finally, here are the nodes that are up and running:
$ docker exec -it n1 nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.3 98.91 KB 256 64.5% 30156883-aafe-43b8-b8ee-fec2c9225778 RAC2
UN 172.17.0.2 51.51 KB 256 68.3% 486f457c-8be2-4844-9cd0-d5ef37b46cea RAC1
UN 172.17.0.4 98.97 KB 256 67.3% d19ad6a1-8138-4283-815c-3b223a33c987 RAC1

Rethinkdb container: rethinkdb process takes less RAM than the whole container

I'm running my rethinkdb container in a Kubernetes cluster. Below is what I notice:
Running top on the host (which is CoreOS), the rethinkdb process takes about 3 GB:
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
981 root 20 0 53.9m 34.5m 20.9m S 15.6 0.4 1153:34 hyperkube
51139 root 20 0 4109.3m 3.179g 22.5m S 15.0 41.8 217:43.56 rethinkdb
579 root 20 0 707.5m 76.1m 19.3m S 2.3 1.0 268:33.55 kubelet
But running docker stats to check the rethinkdb container, it takes about 7 GB!
$ docker ps | grep rethinkdb
eb9e6b83d6b8 rethinkdb:2.1.5 "rethinkdb --bind al 3 days ago Up 3 days k8s_rethinkdb-3.746aa_rethinkdb-rc-3-eiyt7_default_560121bb-82af-11e5-9c05-00155d070266_661dfae4
$ docker stats eb9e6b83d6b8
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O
eb9e6b83d6b8 4.96% 6.992 GB/8.169 GB 85.59% 0 B/0 B
$ free -m
total used free shared buffers cached
Mem: 7790 7709 81 0 71 3505
-/+ buffers/cache: 4132 3657
Swap: 0 0 0
Can someone explain why the container is taking a lot more memory than the rethinkdb process itself?
I'm running docker v1.7.1, CoreOS v773.1.0, kernel 4.1.5
In the top command, you are looking at the amount of physical memory used. The stats command also includes the disk-cached RAM, so it is always bigger than the purely physical amount. When you really need more RAM, the disk cache will be released for the application to use.
Indeed, the memory usage is pulled via the cgroup memory.usage_in_bytes; you can access it at /sys/fs/cgroup/memory/docker/long_container_id/memory.usage_in_bytes. And according to the Linux doc https://www.kernel.org/doc/Documentation/cgroups/memory.txt, section 5.5:
5.5 usage_in_bytes
For efficiency, as other kernel components, memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing.
usage_in_bytes is affected by the method and doesn't show 'exact'
value of memory (and swap) usage, it's a fuzz value for efficient
access. (Of course, when necessary, it's synchronized.) If you want to
know more exact memory usage, you should use RSS+CACHE(+SWAP) value in
memory.stat(see 5.2).
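A sketch of how to compare the two numbers for this container (assuming the cgroup v1 layout quoted above; the short container ID is the one from docker ps):
CID=$(docker inspect --format '{{.Id}}' eb9e6b83d6b8)
cat /sys/fs/cgroup/memory/docker/$CID/memory.usage_in_bytes
grep -E '^(cache|rss|swap) ' /sys/fs/cgroup/memory/docker/$CID/memory.stat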
