Hadoop replication not working properly - memory

I have a cluster of two machines: one acts as the master and both act as slaves (the master doubles as a slave). I have set the replication factor to 1 on both machines. Hive is also configured on the master. After a few days my hard disk became full (no space left), so I ran the following command:
hadoop dfs -setrep -w 1 -R /
and after executing this command, considerable storage became available.
Why is this? I know the setrep command sets the replication factor of each block to 1. But if I had already set the replication factor to 1 in the configuration, why did the disk fill up in the first place, and how do I get rid of the problem?

You need to set the replication factor in the config file (hdfs-site.xml) and restart the cluster:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
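Note that dfs.replication only acts as the default for newly created files; files that already exist keep the replication factor they were written with, which is why the explicit setrep run freed space. A quick sketch for checking and fixing existing files (the path / simply means the whole filesystem):
# List files recursively; for each file, the second column of the listing
# shows its current replication factor:
hdfs dfs -ls -R /
# Force everything that already exists down to a replication factor of 1
# and wait until the extra replicas have been removed:
hdfs dfs -setrep -w 1 -R /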

Related

Slow install / upgrade through Helm (for Kubernetes)

Our application consists of circa 20 modules. Each module contains a (Helm) chart with several deployments, services and jobs. Some of those jobs are defined as Helm pre-install and pre-upgrade hooks. Altogether there are probably about 120 YAML files, which eventually result in about 50 running pods.
During development we are running Docker for Windows version 2.0.0.0-beta-1-win75 with Docker 18.09.0-ce-beta1 and Kubernetes 1.10.3. To simplify management of our Kubernetes yaml files we use Helm 2.11.0. Docker for Windows is configured to use 2 CPU cores (of 4) and 8GB RAM (of 24GB).
When creating the application environment for the first time, it takes more than 20 minutes to become available. This seems far too slow; we are probably making an important mistake somewhere. We have tried to improve the (re)start time, but to no avail. Any help or insights to improve the situation would be greatly appreciated.
A simplified version of our startup script:
#!/bin/bash
# Start some infrastructure (release names below are placeholders)
helm upgrade --force --install infrastructure modules/infrastructure/chart
# Start ~20 modules in parallel
helm upgrade --force --install module01 modules/module01/chart &
[...]
helm upgrade --force --install module20 modules/module20/chart &
await_modules   # wait for all background helm releases to finish
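For completeness, a minimal sketch of what an await_modules helper could look like, assuming the helm upgrades above were started as background jobs in the same shell:
await_modules() {
  # Wait for every background job started by this script and report
  # failure if any of them exited non-zero.
  local failed=0
  for pid in $(jobs -p); do
    wait "$pid" || failed=1
  done
  return "$failed"
}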
Executing the same startup script again later to 'restart' the application still takes about 5 minutes. As far as I know, unchanged objects are not modified at all by Kubernetes. Only the circa 40 hooks are run by Helm.
Running a single hook manually with docker run is fast (~3 seconds). Running that same hook through Helm and Kubernetes regularly takes 15 seconds or more.
Some things we have discovered and tried are listed below.
Linux staging environment
Our staging environment consists of Ubuntu with native Docker. Kubernetes is installed through minikube with --vm-driver none.
Contrary to our local development environment, the staging environment retrieves the application code through a (deprecated) gitRepo volume for almost every deployment and job. Understandably, this only seems to worsen the problem. Starting the environment for the first time takes over 25 minutes; restarting it takes about 20 minutes.
We tried replacing the gitRepo volume with a sidecar container that retrieves the application code as a TAR. Although we have not yet converted the whole application, initial tests indicate this is not particularly faster than the gitRepo volume.
This situation can probably be improved with an alternative type of volume that enables sharing of code between deployments and jobs. We would rather not introduce more complexity, though, so we have not explored this avenue any further.
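For reference, the fetch step in such a sidecar amounts to something like the following (the URL and target path are placeholders):
# Download the application code as a tarball and unpack it into a volume
# shared with the main container:
wget -qO- https://example.com/app-code.tar.gz | tar -xzf - -C /shared/code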
Docker run time
Executing a single empty alpine container through docker run alpine echo "test" takes roughly 2 seconds. This seems to be overhead of the setup on Windows; that same command takes less than 0.5 seconds on our Linux staging environment.
Docker volume sharing
Most of the containers - including the hooks - share code with the host through a hostPath. The command docker run -v <host path>:<container path> alpine echo "test" takes 3 seconds to run. Using volumes seems to increase runtime by approximately 1 second.
Parallel or sequential
Sequential execution of the commands in the startup script does not improve startup time, nor does it drastically worsen it.
IO bound?
The Windows Task Manager indicates that IO is at 100% when executing the startup script. Our hooks and application code are not IO intensive at all, so the IO load seems to originate from Docker, Kubernetes or Helm. We have tried to find the bottleneck, but were unable to pinpoint the cause.
Reducing IO through ramdisk
To test the IO-bound premise further, we replaced /var/lib/docker with a ramdisk in our Linux staging environment. Starting the application with this configuration was not significantly faster.
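For reference, that swap amounts to something like the following (the tmpfs size is illustrative, and the ramdisk is of course wiped on unmount):
# Stop Docker, mount a tmpfs over its data directory, and start it again.
sudo systemctl stop docker
sudo mount -t tmpfs -o size=8g tmpfs /var/lib/docker
sudo systemctl start docker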
To compare Kubernetes with Docker, you need to consider that Kubernetes runs more or less the same Docker command in its final step. Before that happens, many other things take place:
authentication and authorization, creating objects in etcd, scheduling the pod onto a suitable node, provisioning storage, and more.
Helm itself also adds overhead to the process, depending on the size of the chart.
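If you want to see how much of that overhead is chart rendering versus the round trip through Tiller and the API server, a rough check (release name and chart path are placeholders) is:
# Render the chart locally only:
time helm template modules/module01/chart > /dev/null
# Full path through Tiller and the Kubernetes API server:
time helm upgrade --install --force module01 modules/module01/chart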
I recommend reading One year using Kubernetes in production: Lessons learned. The author explains what they achieved by switching to Kubernetes, as well as the differences in overhead:
Cost calculation
Looking at costs, there are two sides to the story. To run Kubernetes, an etcd cluster is required, as well as a master node. While these are not necessarily expensive components to run, this overhead can be relatively expensive when it comes to very small deployments. For these types of deployments, it’s probably best to use a hosted solution such as Google's Container Service.
For larger deployments, it’s easy to save a lot on server costs. The overhead of running etcd and a master node aren’t significant in these deployments. Kubernetes makes it very easy to run many containers on the same hosts, making maximum use of the available resources. This reduces the number of required servers, which directly saves you money. When running Kubernetes sounds great, but the ops side of running such a cluster seems less attractive, there are a number of hosted services to look at, including Cloud RTI, which is what my team is working on.

Does Kubernetes support log retention?

How can one define log retention for Kubernetes pods?
For now it seems that the log file size is not limited, and the logs eat up the host machine's resources.
According to the Logging Architecture page on kubernetes.io there are some options:
First option
Kubernetes currently is not responsible for rotating logs, but rather
a deployment tool should set up a solution to address that. For
example, in Kubernetes clusters, deployed by the kube-up.sh script,
there is a logrotate tool configured to run each hour. You can also
set up a container runtime to rotate application’s logs automatically,
e.g. by using Docker’s log-opt. In the kube-up.sh script, the latter
approach is used for COS image on GCP, and the former approach is used
in any other environment. In both cases, by default rotation is
configured to take place when log file exceeds 10MB.
Also
Second option
Sidecar containers can also be used to rotate log files that cannot be rotated by the application itself. An example of this approach is a small container running logrotate periodically. However, it’s recommended to use stdout and stderr directly and leave rotation and retention policies to the kubelet.
You can always set the log retention policy on your Docker nodes.
See: https://docs.docker.com/config/containers/logging/json-file/#examples
I've just got this working by changing the ExecStart line in /etc/default/docker and adding the option --log-opt max-size=10m.
Please note that this will affect all containers running on the node, which makes it ideal for a Kubernetes setup (in my case the real-time logs are shipped to an external ELK stack anyway).
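As an alternative to editing the service definition, the same limits can be set node-wide through Docker's daemon.json (the values below are examples):
# Write a daemon-wide default for the json-file logging driver and
# restart Docker so it takes effect for newly created containers:
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo systemctl restart docker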

How to use Cloudify to auto-heal/scale Docker containers

In my project, I'm using cloudify to start and configure the docker containers.
Now I'm wondering how to write YAML files to auto-heal/scale those containers.
My topology is like this: a Compute node contains a Docker-Container node, and the latter runs several containers.
I've noticed that Cloudify performs auto-healing at the level of the Compute node. Can't I trigger an auto-heal workflow based on the containers' statuses instead?
As for auto-scaling, I installed the monitoring agent and configured the basic collectors, but the CPU usage percentage does not seem to trigger the workflow. The Cloudify docs about the Diamond plugin mention some built-in collectors; unfortunately, I failed to figure out how to configure them.
I'm hoping for some inspiration. Any opinions are appreciated. Thanks!
The Docker nodes should be in the right groups for the scale and heal workflows to apply.
You can look at this scale-heal example; it does exactly what you are looking for.

Unwanted revert of dbms.memory.heap.max_size back to its original value

I'm using Neo4j in Docker (v. 3.1.0). I tried to update the whole database with a single query and ran into this error:
There is not enough memory to perform the current task. Please try
increasing 'dbms.memory.heap.max_size' in the neo4j configuration
(normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop,
found through the user interface) or if you are running an embedded
installation increase the heap by using '-Xmx' command line flag, and
then restart the database.
So I went to set the config file entries:
dbms.memory.heap.initial_size=512M
dbms.memory.heap.max_size=512M
I set them both to 2048M (as I've read here that these two should match). But after saving and restarting the Docker container, the entries revert back to their original 512M values. To make sure that it's not a Docker issue, I added a comment line to the config, and that sticks, which means the values are reverted by Neo4j intentionally. But why? Is it a limitation imposed by Docker? My hardware has enough memory!
If you are using the standard Docker image, the /docker_entrypoint.sh script will set the memory based on environment variables, or default it to 512M:
setting "dbms.memory.heap.initial_size" "${NEO4J_dbms_memory_heap_maxSize:-512M}"
setting "dbms.memory.heap.max_size" "${NEO4J_dbms_memory_heap_maxSize:-512M}"
When you instantiate your Docker container, add --env NEO4J_dbms_memory_heap_maxSize=2048 to the command.
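A hedged example of such an invocation, following the answer above (the image tag and port mappings are illustrative):
# Run Neo4j with a 2048M heap; 7474 is the HTTP port, 7687 is Bolt.
docker run -d \
  --env NEO4J_dbms_memory_heap_maxSize=2048 \
  -p 7474:7474 -p 7687:7687 \
  neo4j:3.1.0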

How to prevent the Jenkins Swarm Plugin from setting up swarm as a label

I set up a Jenkins server with the Swarm plugin and wrote a batch file to autostart slaves. My batch file looks like this:
java -jar swarm-client-2.2-jar-with-dependencies.jar -mode exclusive -master http://localhost:8080 -disableClientsUniqueId -username MyUser -password ***** -executors 1 -labels MySlave
My problem is that the slave always adds the label swarm.
My question is: how can I prevent the plugin from setting up swarm as a label?
I sympathize with the desire to perfectly control the labels attached to a slave, whether it's connected through the Swarm Plugin or not. But the source code makes it look like the "swarm" label is a mandatory prefix to the list of labels: https://github.com/jenkinsci/swarm-plugin/blob/ef02020595d0546b527b84a2e47bc75cde1e6e1a/plugin/src/main/java/hudson/plugins/swarm/PluginImpl.java#L199
The answer may be that you cannot avoid that label without forking the Swarm Plugin and updating that line.
