Temp files in Google Cloud Dataflow

I'm trying to write temporary files on the workers executing Dataflow jobs, but it seems like the files are getting deleted while the job is still running. If I SSH into the running VM, I'm able to execute the exact same file-generating command and the files are not destroyed -- perhaps this is a cleanup that happens for the dataflow runner user only. Is it possible to use temp files or is this a platform limitation?
Specifically, I'm attempting to write to the location returned by Files.createTempDir(), which is /tmp/someidentifier.
Edit: Not sure what was happening when I posted, but Files.createTempDirectory() works...

We make no explicit guarantee about the lifetime of files you write to the local disk.
That said, writing to a temporary file inside ProcessElement will work. You can write and read from it within the same ProcessElement. Similarly, any files created in DoFn.startBundle will be visible in processElement and finishBundle.
You should avoid writing to /dataflow/logs/taskrunner/harness. Writing files there might conflict with Dataflow's logging. We encourage you to use the standard Java APIs File.createTempFile() and Files.createTempDirectory() instead.
If you want to preserve data beyond finishBundle you should write data to durable storage such as GCS. You can do this by emitting data as a sideOutput and then using TextIO or one of the other writers. Alternatively, you could just write to GCS directly from inside your DoFn.
Since Dataflow runs inside containers, you won't be able to see the files by SSHing into the VM. The container has some of the host VM's directories mounted, but /tmp is not one of them. You would need to attach to the appropriate container, e.g. by running
docker exec -t -i <CONTAINER ID> /bin/bash
That command would start a shell inside a running container.
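For example, from the worker VM, something along these lines can be used to find the container and look at its /tmp (the "harness" name filter here is an assumption about how the Dataflow container is named, not something the platform guarantees):
# Find the Dataflow harness container and list /tmp inside it (sketch only)
CONTAINER_ID=$(docker ps --filter "name=harness" --format '{{.ID}}' | head -n 1)
docker exec "$CONTAINER_ID" ls -l /tmp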

Dataflow workers run in a Docker container on the VM, which has some of the directories of the host VM mounted, but apparently /tmp is not one of them.
Try writing your temp files, e.g., to /dataflow/logs/taskrunner/harness, which will be mapped to /var/log/dataflow/taskrunner/harness on the host VM.

Related

KStreams application - state.dir - No .checkpoint file

I have a KStreams application running inside a Docker container which uses a persistent key-value store. My runtime environment is Docker 1.13.1 on RHEL 7.
I have configured state.dir with a value of /tmp/kafka-streams (which is the default).
When I start this container using "docker run", I mount /tmp/kafka-streams to a directory on my host machine which is, say for example, /mnt/storage/kafka-streams.
My application.id is "myapp". I have 288 partitions in my input topic, which means my state store / changelog topic will also have that many partitions. Accordingly, when I start my Docker container, I see folders named after the partitions, such as 0_1, 0_2, ..., 0_288, under /mnt/storage/kafka-streams/myapp/
When I shutdown my application, I do not see any .checkpoint file in any of the partition directories.
And when I restart my application, it starts fetching the records from the changelog topic rather than reading from local disk. I suspect this is because there is no .checkpoint file in any of the partition directories. (Note: I can see the .lock file and the rocksdb sub-directory inside the partition directories.)
This is what I see in the startup log. It seems to be bootstrapping the entire state store from the changelog topic, i.e., performing network I/O rather than reading from what is on disk:
2022-05-31T12:08:02.791 [mtx-caf-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] WARN o.a.k.s.p.i.ProcessorStateManager - MSG=stream-thread [myapp-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] task [0_170] State store MyAppRecordStore did not find checkpoint offsets while stores are not empty, since under EOS it has the risk of getting uncommitted data in stores we have to treat it as a task corruption error and wipe out the local state of task 0_170 before re-bootstrapping
2022-05-31T12:08:02.791 [myapp-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] WARN o.a.k.s.p.internals.StreamThread - MSG=stream-thread [mtx-caf-f6900c0a-50ca-43a0-8a4b-95eaad9e5093-StreamThread-122] Detected the states of tasks [0_170] are corrupted. Will close the task as dirty and re-create and bootstrap from scratch.
org.apache.kafka.streams.errors.TaskCorruptedException: Tasks [0_170] are corrupted and hence needs to be re-initialized
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.initializeStoreOffsetsFromCheckpoint(ProcessorStateManager.java:254)
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:109)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:216)
at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:433)
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:849)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:731)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:583)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556)
Should I expect to see a .checkpoint file in each of the partition directories under /mnt/storage/kafka-streams/myapp/ when I shut down my application?
Is this an issue because I am running my KStreams app inside a Docker container? If there were permission issues, I would have expected to see problems creating the other files as well, such as .lock or the rocksdb folder (and its contents).
If I run this application as a standalone/runnable Spring Boot JAR on my Windows laptop, i.e. not in a Docker container, I can see that it creates the .checkpoint file as expected.
My Java application inside the Docker container is run via an entrypoint script. It seems that if I stop the container, the TERM signal is not sent to my Java process, and hence the Java KStreams application does not shut down cleanly.
So, all I needed to do was find a way to send a TERM signal to my Java application inside the container.
For the moment, I just ssh'ed into the container and did a kill -s TERM <pid> for my java process.
Once I did that, it resulted in a clean shutdown and thus created the .checkpoint file as well.
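A cleaner way to achieve the same thing, instead of killing the process by hand, is to make the Java process PID 1 by exec'ing it from the entrypoint script, so that "docker stop" delivers the TERM signal directly to the JVM. A minimal sketch, where the paths and jar name are placeholders rather than details from the original setup:
#!/bin/sh
# entrypoint.sh (sketch): 'exec' replaces the shell with the JVM, so the Java process
# becomes PID 1 and receives the TERM signal from 'docker stop', letting Kafka Streams
# shut down cleanly and write its .checkpoint files.
exec java -jar /opt/myapp/app.jar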

Separate shell scripts from application in docker container?

I have an ftp.sh script that downloads files from an external FTP server to my host, and another (Java) application that imports the downloaded content into a database. Currently, both run on the host, triggered by a cron job, as follows:
importer.sh:
#!/bin/bash
source ftp.sh
java -jar app.jar
Now I'd like to move my project to Docker. From a design point of view: should the .sh script and the application each reside in a separate container, or should both be bundled into one container?
I can think of the following approaches:
Run the ftp script on the host, but the Java app in a Docker container.
Run the ftp script in its own Docker container, and the Java app in another Docker container.
Bundle both the script and the Java app in a single Docker container, then call a wrapper script with: ENTRYPOINT ["wrapper.sh"]
So the underlying question is: should each Docker container serve only one purpose (either download files, or import them)?
Sharing files between containers is tricky and I would try to avoid doing that. It sounds like you are trying to set up a one-off container that does the download, then does the import, then exits, so "orchestrating" this with a shell script in a single container will be much easier than trying to set up multiple containers with shared storage. If the ftp.sh script sets some environment variables then it will be basically impossible to export them to a second container.
From your workflow it doesn't sound like building an image that contains the file and the import tool is the right approach. If it were, I could envision a workflow where you ran ftp.sh on the host, or as the first part of a multi-stage build, and then COPY the file into the image. But for a workflow that's "download a file, then import it into a database, where the file changes routinely", I don't think that's what you're after.
The setup you have now should work fine if you just package it into a container. I could give some generic advice in a code-review context, but your last option of bundling it all into one image and running the wrapper script as the main container process makes sense.
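A minimal sketch of that last option, reusing the file names from the question; the base image tag and the RUN/WORKDIR details are assumptions, not part of the original setup:
# Dockerfile (sketch)
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY ftp.sh wrapper.sh app.jar ./
RUN chmod +x ftp.sh wrapper.sh
ENTRYPOINT ["./wrapper.sh"]
# wrapper.sh (sketch)
#!/bin/bash
source ./ftp.sh          # download the files from the FTP server
exec java -jar app.jar   # then import them; exec lets the JVM receive stop signals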

Copy many large files from the host to a docker container and back

I am a beginner with Docker. I have been searching for 2 days now and I do not understand which would be the better solution.
I have a Docker container on an Ubuntu server. I need to copy many large video files to the Ubuntu host via FTP. The container, triggered via cron, will process the videos using ffmpeg and save the result back to the Ubuntu host so the files are accessible via FTP.
What is the best solution:
create a bind mount - I understand the host may change files in the bind mount
create a volume - but I do not understand how I may add files to the volume
create a folder on the Ubuntu host and have a cron job that copies files in using the "docker cp" command, and copies each video back to the host after it has been processed?
Thank you in advance.
Bind-mounting a host directory for this is probably the best approach, for exactly the reasons you lay out: both the host and the container can directly read and write to it, but the host can't easily write to a named volume. docker cp is tricky: you note the problem of knowing when processing is complete, and anyone who can run any docker command at all can pretty trivially root the host, so you don't want to give that permission to something network-facing.
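For example, something along these lines, where the host path, container path, and image name are all assumptions:
# Both the FTP server (on the host) and the container read and write /srv/videos,
# which appears inside the container as /videos.
docker run -d \
  --name video-processor \
  -v /srv/videos:/videos \
  my-ffmpeg-image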
If you're designing a larger-scale system, you might also consider an approach where no files are shared at all. The upload server sends the files (maybe via HTTP POST) to an internal storage service, then posts a message to a message queue (maybe RabbitMQ). A worker consuming from the queue then retrieves the files from the storage service, does its work, uploads the result, and posts a response message. The big advantages of this approach are being able to run it on multiple systems, being able to scale the individual components easily, and not needing to worry about filesystem permissions. But it's a much more involved design.

Intro to Docker for FreeBSD Jail User - How and should I start the container with systemd?

We're currently migrating our room server to the cloud for reliability, but our provider doesn't have a FreeBSD option. Although I'm prepared to pay for and upload a custom system image for deployment, I nonetheless want to learn how to start an application system instance using Docker.
In a FreeBSD Jail, what I did was extract an entire base.txz directory hierarchy as system content into /usr/jail/app, then pkg -r /usr/jail/app install apache24 php perl; then I configured /etc/jail.conf to start the /etc/rc script in the jail.
I followed the official FreeBSD Handbook, and this is generally what I've worked out so far.
But Docker is another world entirely.
To build a Docker image, there are two options: a) import from a tarball, or b) use a Dockerfile. The latter lets you specify a "CMD", which is the default command to run, but
Q1. Why isn't that available with option a)?
Q2. Where is information like CMD and ENV stored? In the image? In the container?
Q3. How do I start a GNU/Linux system in a container? Do I just run systemd and let it figure out the rest from its configuration? Do I need to pass it some special arguments or environment variables?
You should think of a Docker container as packaging around a single running daemon. The ideal Docker container runs one process and one process only. Systemd in particular is so heavyweight and invasive that it's actively difficult to run inside a Docker container; if you need multiple processes in a container, a lighter-weight init system like supervisord can work for you, but that's the exception rather than standard packaging.
Docker has an official tutorial on building and running custom images which is worth a read through; this is a pretty typical use case for Docker. In particular, best practice is to write a Dockerfile that describes how to build an image and check it into source control. Containers should avoid having persistent data if they can (storing everything in an external database is ideal); if you change an image, you need to delete and recreate any containers based on it. If local data is unavoidable then either Docker volumes or bind mounts will let you keep data "outside" the container.
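As a rough parallel to the jail setup, a single-service image for just the Apache/PHP part might look like the following sketch; the base image tag and the site/ path are assumptions, and MySQL, Perl scripts, etc. would each get their own container:
# Dockerfile (sketch): one image, one service (Apache httpd with mod_php)
FROM php:8.2-apache
COPY site/ /var/www/html/
EXPOSE 80
# No CMD needed: the base image already runs Apache in the foreground as PID 1.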
While Docker has several other ways to create containers and images, none of them are as reproducible. You should avoid the import, export, and commit commands; and you should only use save and load if you can't use or set up a Docker registry and are forced to move images between systems via a tar file.
On your specific questions:
Q1. I suspect the reason the non-Dockerfile paths for creating images don't easily let you specify things like CMD is just an implementation detail: if you look at the docker history of an image, you'll see the CMD winds up being its own layer. Don't worry about it and use a Dockerfile.
Q2. The default CMD, any set ENV variables, and other related metadata are stored in the image alongside the filesystem tree. (Once you launch a container, it has a normal Unix process tree, with the initial process being pid 1.)
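You can read that metadata back from the image itself; for example (my-image:latest is a placeholder):
# Show the default command and environment variables recorded in the image metadata
docker image inspect --format '{{.Config.Cmd}} {{.Config.Env}}' my-image:latest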
Q3. You don't "start a system" in a container. Generally you run one process or service per container, and manage their lifecycles independently.

Docker separation of concerns / services

I have a Laravel project which I am using with Docker. Currently I am using a single container to host all the services (Apache, MySQL, etc.) as well as the dependencies (project files, git, composer, etc.) I need for my project.
From what I am reading, the current best practice is to put each service into a separate container. So far this seems simple enough, since these services are designed to run at length (Apache server, MySQL server). When I spin up these "service" containers using -d, they remain running (docker ps) since their main process runs continuously.
However, when I remove all the services from my project container, then there is no main process left to continuously run. This means my container immediately exits once spun up.
I have read the 'hacks' of running other processes like tail -f /dev/null, sleep infinity, using interactive mode, installing supervisord (which I assume would end up watching no processes in such containers?) and even leaving the container to run in the foreground (taking up a terminal console...).
How do I network such a container to keep it running like the abstracted services, but detached, without these hacks? I cannot seem to find much information on this in the official Docker docs, nor can I find any examples from other projects (please link any).
EDIT: I am not talking about volumes / storage containers to store the data my project processes, but rather how I can use a container to hold the project itself and its dependencies that aren't services (project files, git, composer).
When you run the container, try running with the flags ...
docker run -dt ..... etc
You might even try .....
docker run -dti ..... etc
Let me know if this brings any joy. It has certainly worked for me on occasion.
I know you wanted to avoid hacks, but if the above fails then also add ...
CMD cat
to the end of your Dockerfile - it is a hack but is the cleanest hack :)
So after reading this a few times, along with Joachim Isaksson's comment, I finally get it. Tools don't need their containers to run continuously in order to be used. Proper separation of the project files, the services (MySQL, Apache) and the tools (git, composer) is done differently.
The project files are persisted within a data volume container. The services are networked since they expose ports. The tools live in their own containers, which share the project-files data volume - they are not networked. Logs, databases and other output can be persisted in different volumes.
When you wish to run one of these tools, you spin up the tool container, passing the relevant command to docker run. The tool then manipulates the data within the shared volume. The container only persists for as long as that command takes to run, and then it stops.
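For example, running Composer as a throwaway tool container against a shared project volume might look like this sketch, where the volume name is an assumption and the official composer image is assumed to be available:
# One-off tool container: it attaches the shared project volume, runs the command,
# and exits as soon as the command finishes.
docker run --rm -v myproject-code:/app -w /app composer:2 install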
I don't know why this took me so long to grasp, but this is the aha moment for me.
