HBase + TestContainers - Port Remapping - docker

I am trying to use Test Containers to run an integration test against HBase launched in a Docker container. The problem I am running into may be a bit unique to how a client interacts with HBase.
When the HBase Master starts in the container, it stores its hostname:port in Zookeeper so that clients can find it. In this case, it stores "localhost:16000".
In my test case running outside the container, the client retrieves "localhost:16000" from Zookeeper and cannot connect. The connection fails because TestContainers has remapped port 16000 to some other random port.
Any ideas how to overcome this?
(1) One idea is to find a way to tell the HBase Client to use the remapped port, ignoring the value it retrieved from Zookeeper, but I have yet to find a way to do this.
(2) If I could get the HBase Master to write the externally accessible host:port in Zookeeper that would also fix the problem. But I do not believe the container itself has any knowledge about how Test Containers is doing the port remapping.
(3) Perhaps there is a different solution that Test Containers provides for this sort of situation?

You can take a look at KafkaContainer's implementation, where we start a Socat (fast TCP proxy) container first to acquire a semi-random port and use it later to configure the target container.
The algorithm is:
1. In doStart, first start Socat targeting the original container's network alias and port (e.g. 12345).
2. Get the mapped port (it will be something like 32109, pointing to 12345).
3. Make the original container (e.g. with environment variables) use the mapped port in addition to the original one, or, if only one port can be configured, see CouchbaseContainer for the more advanced option.
4. Return Socat's host & port to the client.
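For illustration, here is a minimal sketch of that sequence in Scala using Testcontainers' SocatContainer. The image name my/service:latest, the "service" network alias, and the ADVERTISED_PORT variable are hypothetical placeholders rather than the configuration of any real image; HBase itself falls under the "only one port can be configured" case above, so it needs the more advanced CouchbaseContainer-style variant, but the overall flow is the same.

import org.testcontainers.containers.{GenericContainer, Network, SocatContainer}
import org.testcontainers.utility.DockerImageName

// Sketch only: the image name, alias and environment variable are placeholders.
class ServiceContainer
    extends GenericContainer[ServiceContainer](DockerImageName.parse("my/service:latest"))

object SocatFlowSketch {
  def main(args: Array[String]): Unit = {
    val network = Network.newNetwork()

    // 1. Start Socat first, targeting the alias and port the service will listen on.
    val socat = new SocatContainer().withNetwork(network).withTarget(12345, "service")
    socat.start()

    // 2. The host-visible port Testcontainers mapped for Socat (e.g. 32109 -> 12345).
    val advertisedPort = socat.getMappedPort(12345)

    // 3. Tell the target container to advertise/use that port in addition to 12345.
    val service = new ServiceContainer()
    service.withNetwork(network)
    service.withNetworkAliases("service")
    service.withEnv("ADVERTISED_PORT", advertisedPort.toString)
    service.start()

    // 4. Hand Socat's host and mapped port to the client.
    println(s"Connect to ${socat.getHost}:$advertisedPort")
  }
}

Note that Socat and the target container must share the same Docker network so the alias resolves.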

We built a new HBase image to make it compliant with Testcontainers.
Use this image:
docker run --env HBASE_MASTER_PORT=16000 --env HBASE_REGION_PORT=16020 jcjabouille/hbase-standalone:2.4.9
Then create this container (in Scala here):
import java.net.InetAddress
import java.time.Duration

import scala.collection.JavaConverters._

import com.github.dockerjava.api.command.CreateContainerCmd
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.testcontainers.containers.GenericContainer
import org.testcontainers.containers.wait.strategy.Wait
import org.testcontainers.utility.DockerImageName

private[test] class GenericHbase2Container
    extends GenericContainer[GenericHbase2Container](
      DockerImageName.parse("jcjabouille/hbase-standalone:2.4.9")
    ) {

  // FreePortFinder is whatever free-port utility you prefer; its import is omitted here.
  private val randomMasterPort: Int = FreePortFinder.findFreeLocalPort(18000)
  private val randomRegionPort: Int = FreePortFinder.findFreeLocalPort(20000)
  private val hostName: String = InetAddress.getLocalHost.getHostName

  val hbase2Configuration: Configuration = HBaseConfiguration.create

  addExposedPort(randomMasterPort)
  addExposedPort(randomRegionPort)
  addExposedPort(2181)

  // Give the container the host's hostname so the address the master registers
  // in ZooKeeper resolves from the test JVM.
  withCreateContainerCmdModifier { cmd: CreateContainerCmd =>
    cmd.withHostName(hostName)
    ()
  }

  waitingFor(Wait.forLogMessage(".*0 row.*", 1))
  withStartupTimeout(Duration.ofMinutes(10))

  withEnv("HBASE_MASTER_PORT", randomMasterPort.toString)
  withEnv("HBASE_REGION_PORT", randomRegionPort.toString)

  // Bind the master and region ports 1:1 (host port == container port).
  setPortBindings(Seq(s"$randomMasterPort:$randomMasterPort", s"$randomRegionPort:$randomRegionPort").asJava)

  override protected def doStart(): Unit = {
    super.doStart()
    hbase2Configuration.set("hbase.client.pause", "200")
    hbase2Configuration.set("hbase.client.retries.number", "10")
    hbase2Configuration.set("hbase.rpc.timeout", "3000")
    hbase2Configuration.set("hbase.client.operation.timeout", "3000")
    hbase2Configuration.set("hbase.client.scanner.timeout.period", "10000")
    hbase2Configuration.set("zookeeper.session.timeout", "10000")
    hbase2Configuration.set("hbase.zookeeper.quorum", "localhost")
    hbase2Configuration.set("hbase.zookeeper.property.clientPort", getMappedPort(2181).toString)
  }
}
More details here: https://hub.docker.com/r/jcjabouille/hbase-standalone
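For reference, here is a rough usage sketch with the standard HBase client, assuming the GenericHbase2Container class above is available to the test:

import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

object GenericHbase2ContainerExample {
  def main(args: Array[String]): Unit = {
    val container = new GenericHbase2Container
    container.start()
    try {
      // doStart() filled hbase2Configuration with the quorum and the mapped ZooKeeper port,
      // so the client can reach the master through the 1:1 bound master/region ports.
      val connection: Connection = ConnectionFactory.createConnection(container.hbase2Configuration)
      val admin = connection.getAdmin
      println(s"Tables: ${admin.listTableNames().map(_.getNameAsString).mkString(", ")}")
      connection.close()
    } finally {
      container.stop()
    }
  }
}

The trick in this image is that the container is given the host's hostname (withHostName) and the master/region ports are bound 1:1 on the host (setPortBindings), so the host:port pair the master registers in ZooKeeper is reachable from the test JVM; only the ZooKeeper client port goes through Testcontainers' usual random mapping.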

Related

Spark app socket communication between containers on a Docker Spark cluster

So I have a Spark cluster running in Docker using Docker Compose. I'm using docker-spark images.
Then I add two more containers: one behaves as a server (plain Python) and one as a client (a Spark Streaming app). They both run on the same network.
For the server (plain Python) I have something like:
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('', 9009))
s.listen(1)
print("Waiting for TCP connection...")
while True:
    conn, addr = s.accept()
    # Do and send stuff over conn
And for my client (Spark app) I have something like:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf()
conf.setAppName("MyApp")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("my_checkpoint")
# read data from port 9009
dataStream = ssc.socketTextStream(PORT, 9009)
# What's PORT's value?
So what is PORT's value? Is it the IP address value from docker inspect of the container?
Okay, so I found that I can use the IP of the container, as long as all my containers are on the same network.
So I check the IP by running
docker inspect <container_id>
and use that IP as the host for my socket.
Edit:
I know it's kind of late, but I just found out that I can actually use the container's name, as long as the containers are on the same network.
More edit:
I made changes in docker-compose like:
container-1:
  image: image-1
  container_name: container-1
  networks:
    - network-1
container-2:
  image: image-2
  container_name: container-2
  ports:
    - "8000:8000"
  networks:
    - network-1
and then in my script (container 2):
conf = SparkConf()
conf.setAppName("MyApp")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("my_checkpoint")
# read data from port 9009
dataStream = ssc.socketTextStream("container-1", 9009) #Put container's name here
I also exposed the socket port in the Dockerfile; I don't know whether that has an effect or not.

Expose port using DockerOperator

I am using DockerOperator to run a container, but I do not see any related option to publish the required port. I need to publish a webserver port when the task is triggered. Any help or guidance will be appreciated. Thank you!
First, don't forget docker_operator is deprecated, replaced (now) with providers.docker.operators.docker.
Second, I don't know of a command to expose a port in a live (running) Docker container.
As described in this article from Sidhartha Mani:
Specifically, I needed access to the filled MySQL database.
I could think of a few ways to do this:
Stop the container and start a new one with the added port exposure: docker run -p 3306:3306 -p 8080:8080 -d java/server.
Start another container that links to this one and knows how to port forward.
Set up iptables rules to forward a host port into the container.
So:
Following existing rules, I created my own rule to forward to the container
iptables -t nat -D DOCKER ! -i docker0 -p tcp --dport 3306 -j DNAT \
  --to-destination 172.17.0.2:3306
This just says that whenever a packet is destined to port 3306 on the host, forward it to the container with ip 172.17.0.2, and its port 3306.
Once I did this, I could connect to the container using host port 3306.
I wanted to make it easier for others to expose ports on live containers.
So, I created a small repository and a corresponding docker image (named wlan0/redirect).
The same effect as exposing host port 3306 to container 172.17.0.2:3306 can be achieved using this command.
This command saves the trouble of learning how to use iptables.
docker run --privileged -v /proc:/host/proc \
-e HOST_PORT=3306 -e DEST_IP=172.17.0.2 -e DEST_PORT=3306 \
wlan0/redirect:latest
In other words, this kind of solution is not something you would implement from a command run inside the container through an Airflow operator.
As per my understanding, DockerOperator creates a new container, so why is there no way of exposing ports while creating that new container?
First, the EXPOSE part is, as I mentioned here, just metadata added to the image. It is not mandatory.
The runtime (docker run) -p option is about publishing, not exposing: publishing a port and mapping it to a host port (see above) or another container port.
That might not be needed in an Airflow environment, where there is a default network, and even the possibility to set up a custom network or subnetwork.
This means other (Airflow) containers attached to the same network should be able to access the ports of any container on that network, without needing any -p (publish) option or EXPOSE directive.
In order to accomplish this, you will need to subclass the DockerOperator and override the initializer and _run_image_with_mounts method, which uses the API client to create a container with the specified host configuration.
from typing import List, Optional, Union

from airflow.exceptions import AirflowException
# DockerOperator (and its stringify helper) come from the Docker provider package;
# this import path assumes a recent apache-airflow-providers-docker release.
from airflow.providers.docker.operators.docker import DockerOperator, stringify


class DockerOperatorWithExposedPorts(DockerOperator):
    def __init__(self, *args, **kwargs):
        self.port_bindings = kwargs.pop("port_bindings", {})
        if self.port_bindings and kwargs.get("network_mode") == "host":
            self.log.warning("`port_bindings` is not supported in `host` network mode.")
            self.port_bindings = {}
        super().__init__(*args, **kwargs)

    def _run_image_with_mounts(
        self, target_mounts, add_tmp_variable: bool
    ) -> Optional[Union[List[str], str]]:
        """
        NOTE: This method was copied entirely from the base class `DockerOperator`, for the capability
        of performing port publishing.
        """
        if add_tmp_variable:
            self.environment['AIRFLOW_TMP_DIR'] = self.tmp_dir
        else:
            self.environment.pop('AIRFLOW_TMP_DIR', None)
        if not self.cli:
            raise Exception("The 'cli' should be initialized before!")
        self.container = self.cli.create_container(
            command=self.format_command(self.command),
            name=self.container_name,
            environment={**self.environment, **self._private_environment},
            # Expose the requested container ports...
            ports=list(self.port_bindings.keys()) if self.port_bindings else None,
            host_config=self.cli.create_host_config(
                auto_remove=False,
                mounts=target_mounts,
                network_mode=self.network_mode,
                shm_size=self.shm_size,
                dns=self.dns,
                dns_search=self.dns_search,
                cpu_shares=int(round(self.cpus * 1024)),
                # ...and publish them to the requested host ports.
                port_bindings=self.port_bindings if self.port_bindings else None,
                mem_limit=self.mem_limit,
                cap_add=self.cap_add,
                extra_hosts=self.extra_hosts,
                privileged=self.privileged,
                device_requests=self.device_requests,
            ),
            image=self.image,
            user=self.user,
            entrypoint=self.format_command(self.entrypoint),
            working_dir=self.working_dir,
            tty=self.tty,
        )
        logstream = self.cli.attach(container=self.container['Id'], stdout=True, stderr=True, stream=True)
        try:
            self.cli.start(self.container['Id'])
            log_lines = []
            for log_chunk in logstream:
                log_chunk = stringify(log_chunk).strip()
                log_lines.append(log_chunk)
                self.log.info("%s", log_chunk)
            result = self.cli.wait(self.container['Id'])
            if result['StatusCode'] != 0:
                joined_log_lines = "\n".join(log_lines)
                raise AirflowException(f'Docker container failed: {repr(result)} lines {joined_log_lines}')
            if self.retrieve_output:
                return self._attempt_to_retrieve_result()
            elif self.do_xcom_push:
                if len(log_lines) == 0:
                    return None
                try:
                    if self.xcom_all:
                        return log_lines
                    else:
                        return log_lines[-1]
                except StopIteration:
                    # handle the case when there is not a single line to iterate on
                    return None
            return None
        finally:
            if self.auto_remove == "success":
                self.cli.remove_container(self.container['Id'])
            elif self.auto_remove == "force":
                self.cli.remove_container(self.container['Id'], force=True)
Explanation: The create_host_config method of the APIClient has an optional port_bindings keyword argument, and the create_container method has an optional ports argument. These aren't exposed by DockerOperator, so you have to override _run_image_with_mounts with a copy of the base implementation and supply those arguments, using the port_bindings field set in the initializer. You can then pass the ports to publish as a keyword argument. Note that in this implementation, the expected argument is a dictionary:
t1 = DockerOperatorWithExposedPorts(image=..., task_id=..., port_bindings={5000: 5000, 8080:8080, ...})

Connecting to HDFS namenode running in docker container from outside host VM

I have an HBase + HDFS setup in which the HBase master, regionservers, HDFS namenode, and datanodes are each containerized.
When running all of these containers on a single host VM, things work fine as I can use the docker container names directly, and set configuration variables as:
CORE_CONF_fs_defaultFS: hdfs://namenode:9000
for both the regionserver and datanode. The system works as expected in this configuration.
When attempting to distribute these across multiple host VMs, however, I run into issues.
I updated the config variables above to look like:
CORE_CONF_fs_defaultFS: hdfs://hostname:9000
and make sure the namenode container is exposing port 9000 and mapping it to the host machine's port 9000.
It looks like the names are not resolving correctly when I use the hostname, and the error I see in the datanode logs looks like:
2019-08-24 05:46:08,630 INFO impl.FsDatasetAsyncDiskService: Deleted BP-1682518946-<ip1>-1566622307307 blk_1073743161_2337 URI file:/hadoop/dfs/data/current/BP-1682518946-<ip1>-1566622307307/current/rbw/blk_1073743161
2019-08-24 05:47:36,895 INFO datanode.DataNode: Receiving BP-1682518946-<ip1>-1566622307307:blk_1073743166_2342 src: /<ip3>:48396 dest: /<ip2>:9866
2019-08-24 05:47:36,897 ERROR datanode.DataNode: <hostname>-datanode:9866:DataXceiver error processing WRITE_BLOCK operation src: /<ip3>:48396 dst: /<ip2>:9866
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:786)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
at java.lang.Thread.run(Thread.java:748)
Where <hostname>-datanode is the name of the datanode container, and the IPs are various container IPs.
I'm wondering if I'm missing some configuration variable that would let containers from other VMs connect to the namenode, or some other change that would allow this system to be distributed correctly. I'm also wondering whether the system expects the containers to be named in a certain way, for example.

How do I get my IP address from inside an ECS container running with the awsvpc network mode?

From a regular ECS container running with the bridge mode, or from a standard EC2 instance, I usually run
curl http://169.254.169.254/latest/meta-data/local-ipv4
to retrieve my IP.
In an ECS container running with the awsvpc network mode, I get the IP of the underlying EC2 instance which is not what I want. I want the address of the ENI attached to my container. How do I do that?
A new convenience environment variable is injected by the AWS container agent into every container in AWS ECS: ${ECS_CONTAINER_METADATA_URI}
This contains the URL to the metadata endpoint, so now you can do
curl ${ECS_CONTAINER_METADATA_URI}
The output looks something like
{
  "DockerId": "redact",
  "Name": "redact",
  "DockerName": "ecs-redact",
  "Image": "redact",
  "ImageID": "redact",
  "Labels": { },
  "DesiredStatus": "RUNNING",
  "KnownStatus": "RUNNING",
  "Limits": { },
  "CreatedAt": "2019-04-16T22:39:57.040286277Z",
  "StartedAt": "2019-04-16T22:39:57.29386087Z",
  "Type": "NORMAL",
  "Networks": [
    {
      "NetworkMode": "awsvpc",
      "IPv4Addresses": [
        "172.30.1.115"
      ]
    }
  ]
}
Under the Networks key you'll find IPv4Addresses.
Your application code can then look something like this (Python):
import os
import requests

METADATA_URI = os.environ['ECS_CONTAINER_METADATA_URI']
container_metadata = requests.get(METADATA_URI).json()
ALLOWED_HOSTS.append(container_metadata['Networks'][0]['IPv4Addresses'][0])
If you need your container's public IP instead, there is also the public-ip package (Node.js):
import * as publicIp from 'public-ip';
const publicIpAddress = await publicIp.v4(); // your container's public IP

Docker apps logging with Filebeat and Logstash

I have a set of dockerized applications scattered across multiple servers, and I'm trying to set up production-level centralized logging with ELK. I'm OK with the ELK part itself, but I'm a little confused about how to forward the logs to my Logstash instances.
I'm trying to use Filebeat because of its load-balancing feature.
I'd also like to avoid packing Filebeat (or anything else) into all my Docker images, and to keep it separate, dockerized or not.
How can I proceed?
I've been trying the following. My containers log to stdout, so with a non-dockerized Filebeat configured to read from stdin I do:
docker logs -f mycontainer | ./filebeat -e -c filebeat.yml
That appears to work at the beginning: the first logs are forwarded to my Logstash (the cached ones, I guess), but at some point it gets stuck and keeps sending the same event.
Is that just a bug, or am I headed in the wrong direction? What solution have you set up?
Here's one way to forward docker logs to the ELK stack (requires docker >= 1.8 for the gelf log driver):
Start a Logstash container with the gelf input plugin that reads from gelf and outputs to an Elasticsearch host (ES_HOST:PORT):
docker run --rm -p 12201:12201/udp logstash \
logstash -e 'input { gelf { } } output { elasticsearch { hosts => ["ES_HOST:PORT"] } }'
Now start a Docker container and use the gelf Docker logging driver. Here's a dumb example:
docker run --log-driver=gelf --log-opt gelf-address=udp://localhost:12201 busybox \
/bin/sh -c 'while true; do echo "Hello $(date)"; sleep 1; done'
Load up Kibana and things that would've landed in docker logs are now visible. The gelf source code shows that some handy fields are generated for you (hat-tip: Christophe Labouisse): _container_id, _container_name, _image_id, _image_name, _command, _tag, _created.
If you use docker-compose (make sure to use docker-compose >= 1.5), add the appropriate settings to docker-compose.yml after starting the Logstash container:
log_driver: "gelf"
log_opt:
  gelf-address: "udp://localhost:12201"
Docker allows you to specify the logDriver in use. This answer does not care about Filebeat or load balancing.
In a presentation I used syslog to forward the logs to a Logstash (ELK) instance listening on port 5000.
The following command constantly sends messages through syslog to Logstash:
docker run -t -d --log-driver=syslog --log-opt syslog-address=tcp://127.0.0.1:5000 ubuntu /bin/bash -c 'while true; do echo "Hello $(date)"; sleep 1; done'
Using Filebeat, you can just pipe the docker logs output as you've described. The behavior you are seeing definitely sounds like a bug, but it can also be the partial-line-read configuration hitting you (partial lines are resent until a newline symbol is found).
A problem I see with piping is possible back pressure in case no Logstash is available. If Filebeat cannot send any events, it will buffer events internally and at some point stop reading from stdin. I have no idea how/if Docker protects against stdout becoming unresponsive. Another problem with piping might be the restart behavior of Filebeat + Docker if you are using docker-compose. By default docker-compose reuses images + image state, so when you restart, you will ship all the old logs again (given the underlying log file has not been rotated yet).
Instead of piping, you can try to read the log files written by Docker to the host system. The default Docker log driver is the json log driver. You can and should configure the json log driver to do log rotation and keep some old files (for buffering up on disk); see the max-size and max-file options. The json driver puts one line of 'json' data for every line to be logged. On the Docker host system the log files are written to /var/lib/docker/containers/container_id/container_id-json.log. These files will be forwarded by Filebeat to Logstash. If Logstash or the network becomes unavailable, or Filebeat is restarted, it continues forwarding log lines where it left off (given the files have not been deleted due to log rotation). No events will be lost. In Logstash you can use the json_lines codec or filter to parse the json lines and a grok filter to gain some more information from your logs.
There has been some discussion about using libbeat (used by filebeat for shipping log files) to add a new log driver to docker. Maybe it is possible to collect logs via dockerbeat in the future by using the docker logs api (I'm not aware of any plans about utilising the logs api, though).
Using syslog is also an option. Maybe you can get some syslog relay on your docker host load balancing log events. Or have syslog write log files and use filebeat to forward them. I think rsyslog has at least some failover mode. You can use logstash syslog input plugin and rsyslog to forward logs to logstash with failover support in case the active logstash instance becomes unavailable.
I created my own docker image using the Docker API to collect the logs of the containers running on the machine and ship them to Logstash thanks to Filebeat. No need to install or configure anything on the host.
Check it out and tell me if it suits your needs: https://hub.docker.com/r/bargenson/filebeat/.
The code is available here: https://github.com/bargenson/docker-filebeat
Just to help others who need to do this: you can simply use Filebeat to ship the logs. I would use the container by @brice-argenson, but I needed SSL support, so I went with a locally installed Filebeat instance.
The prospector from filebeat is (repeat for more containers):
- input_type: log
  paths:
    - /var/lib/docker/containers/<guid>/*.log
  document_type: docker_log
  fields:
    dockercontainer: container_name
It sucks a bit that you need to know the GUIDs as they could change on updates.
On the logstash server, setup the usual filebeat input source for logstash, and use a filter like this:
filter {
  if [type] == "docker_log" {
    json {
      source => "message"
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
    }
    mutate {
      rename => { "log" => "message" }
    }
    date {
      match => [ "time", "ISO8601" ]
    }
  }
}
This will parse the JSON from the Docker logs, and set the timestamp to the one reported by Docker.
If you are reading logs from the nginx Docker image, you can add this filter as well:
filter {
  if [fields][dockercontainer] == "nginx" {
    grok {
      match => { "message" => "(?m)%{IPORHOST:targethost} %{COMBINEDAPACHELOG}" }
    }
    mutate {
      convert => { "[bytes]" => "integer" }
      convert => { "[response]" => "integer" }
    }
    mutate {
      rename => { "bytes" => "http_streamlen" }
      rename => { "response" => "http_statuscode" }
    }
  }
}
The converts/renames are optional, but they fix an oversight in the COMBINEDAPACHELOG expression, where these values are not cast to integers, making them unavailable for aggregation in Kibana.
I verified what erewok wrote above in a comment:
According to the docs, you should be able to use a pattern like this in your prospectors.paths: /var/lib/docker/containers/*/*.log – erewok
The Docker container GUIDs, represented by the first '*', are correctly resolved when Filebeat starts up. I do not know what happens as containers are added.
