Moving a large file across mounts - docker

I have a very large file (>100GB) on one volume mount in a Docker container. I would like to move it to another volume mount in the same container. However, when I do so using rename, I get the following error: invalid cross-device link. The code I'm using to do this is the following:
fmt.Printf("Moving %s to %s\n", oo, outputPath)
err = fs.Rename(oo, strings.Replace(oo, inputPath, outputPath, 1))
if err != nil {
fmt.Printf("Error moving %s to %s: %v", oo, outputPath, err)
return
}
I understand why, but what's my alternative? Copying and deleting is really not an option with a file this big. Is there a more space-efficient way to move a file across two volumes?
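For context, here is a minimal sketch of how that error surfaces and can be detected explicitly; it assumes the standard library's os and syscall packages, that fs.Rename above behaves like os.Rename, and the /mnt paths are placeholders:

package main

import (
    "errors"
    "fmt"
    "os"
    "syscall"
)

// move renames src to dst and reports the cross-device case explicitly.
func move(src, dst string) error {
    if err := os.Rename(src, dst); err != nil {
        // os.Rename returns a *os.LinkError; EXDEV means src and dst
        // live on different filesystems (e.g. two separate volume mounts),
        // where the kernel cannot perform an in-place rename.
        if errors.Is(err, syscall.EXDEV) {
            return fmt.Errorf("cross-device move from %s to %s: %w", src, dst, err)
        }
        return err
    }
    return nil
}

func main() {
    if err := move("/mnt/input/bigfile", "/mnt/output/bigfile"); err != nil {
        fmt.Println(err)
    }
}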
The good news is that they're on the same disk on the underlying host. But the container is extremely locked down, so issuing commands on the underlying server would not be possible, or at least really hard. That said, we COULD potentially mount them as a unified volume - interesting idea. Is it possible to mount two separate root directories as the same volume?

Related

Blocked when read from os.Stdin when PIPE the output in docker

I'm trying to pipe the output (logs) of a program to a Go program which aggregates/compresses the output and uploads it to S3. The command to run the program is "/program1 | /logShipper". The logShipper is written in Go and it simply reads from os.Stdin and writes to a local file. The local file is processed by another goroutine and uploaded to S3 periodically. There are some existing Docker log drivers, but we are running the container on a fully managed provider and the log processing charge is pretty expensive, so we want to bypass the existing solution and just upload to S3.
The main logic of the logShipper is simply to read from os.Stdin and write to a file. It works correctly when running on the local machine, but when running in Docker the goroutine blocks at reader.ReadString('\n') and never returns.
go func() {
    reader := bufio.NewReader(os.Stdin)
    mu.Lock()
    output = openOrCreateOutputFile(&uploadQueue, workPath)
    mu.Unlock()
    for {
        text, _ := reader.ReadString('\n')
        now := time.Now().Format("2006-01-02T15:04:05.000000000Z")
        mu.Lock()
        output.file.Write([]byte(fmt.Sprintf("%s %s", now, text)))
        mu.Unlock()
    }
}()
I did some research online but could not find out why it's not working. One possibility I'm considering is that Docker might redirect stdout somewhere, so the pipe doesn't work the same way as it does when running on a plain Linux box (it looks like nothing can be read from program1). Any help or suggestion as to why it's not working is welcome. Thanks.
Edit:
After doing more research I realized it's bad practice to handle the logs this way. I should rely more on Docker's log drivers to handle log aggregation and shipping. However, I'm still interested in finding out why nothing is read from the piped source program.
I'm not sure about the way Docker handles output, but I suggest that you extract the file descriptor with os.Stdin.Fd() and then resort to using the golang.org/x/sys/unix package as follows:
// Long way; for the short one, jump
// straight down to it.
//
// Retrieve the file descriptor and
// cast it to int, because the Fd method
// returns uintptr.
fd := int(os.Stdin.Fd())
// Extract the file descriptor flags.
// It's safe to drop the error: if it's there
// and it's not nil, you won't be able to read from
// Stdin anyway, unless it's a notice
// to try again, which mostly should not be
// the case.
flags, _ := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
// Check whether nonblocking reading is enabled.
nb := flags&unix.O_NONBLOCK != 0
// If it is not, just enable it with
// unix.SetNonblock, which is also the
// -- SHORT WAY HERE --
err = unix.SetNonblock(fd, true)
The difference between the long and the short way is that the long way will tell you definitively whether the problem is the absence of the nonblocking state or not.
If this is not the case, then I have no other ideas personally.
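For reference, here is a self-contained sketch of the same check (a sketch only, assuming the golang.org/x/sys/unix package; the file-writing logic from the question is stubbed out as a plain echo to stdout):

package main

import (
    "bufio"
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    fd := int(os.Stdin.Fd())

    // The "long way": inspect the current descriptor flags first.
    flags, err := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
    if err != nil {
        fmt.Fprintln(os.Stderr, "fcntl:", err)
        os.Exit(1)
    }
    if flags&unix.O_NONBLOCK == 0 {
        // The "short way": enable nonblocking mode on stdin.
        if err := unix.SetNonblock(fd, true); err != nil {
            fmt.Fprintln(os.Stderr, "set nonblock:", err)
            os.Exit(1)
        }
    }

    // Read loop; any error (including io.EOF when the pipe closes) ends it.
    reader := bufio.NewReader(os.Stdin)
    for {
        text, err := reader.ReadString('\n')
        if len(text) > 0 {
            fmt.Print(text)
        }
        if err != nil {
            return
        }
    }
}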

GitLab: How to list registry containers with size

I have a self-hosted GitLab CE Omnibus installation (version 11.5.2) running, including the container registry.
Now, the disk space needed to host all those containers increases quite fast.
As an admin, I want to list all Docker images in this registry including their size, so I can maybe have some of them deleted.
Maybe I haven't looked hard enough, but so far I couldn't find anything in the Admin Panel of GitLab. Before I go to the trouble of writing a script that untangles the weird linking between the repositories and blobs directories in /var/opt/gitlab/gitlab-rails/shared/registry/docker/registry/v2 and then aggregates the sizes per repository, I wanted to ask:
Is there some CLI command or even a curl call to the registry to get the information I want?
Update: This answer is deprecated by now. Please see the accepted answer for a solution built into GitLab's Rails console directly.
Original Post:
Thanks to the great comment from Rekovni, my problem is somewhat solved.
First: The huge amount of disk space used by Docker images was due to a bug in the GitLab/Docker registry. Follow the link in Rekovni's comment below my question.
Second: In his link, there's also an experimental tool which is being developed by GitLab. It lists and optionally deletes those old unused Docker layers (related to the bug).
Third: If anyone wants to do their own thing, I hacked together a pretty ugly script which lists the image size for every repo:
#!/usr/bin/env python3
# coding: utf-8
import os
from os.path import join, getsize

def get_human_readable_size(size, precision=2):
    suffixes = ['B', 'KB', 'MB', 'GB', 'TB']
    suffixIndex = 0
    while size > 1024 and suffixIndex < 4:
        suffixIndex += 1
        size = size / 1024.0
    return "%.*f%s" % (precision, size, suffixes[suffixIndex])

registry_path = '/var/opt/gitlab/gitlab-rails/shared/registry/docker/registry/v2/'

repos = []
for repo in os.listdir(registry_path + 'repositories'):
    images = os.listdir(registry_path + 'repositories/' + repo)
    for image in images:
        try:
            layers = os.listdir(registry_path + 'repositories/{}/{}/_layers/sha256'.format(repo, image))
            imagesize = 0
            # sum up the size of every layer the image references
            for layer in layers:
                # each layer blob lives under blobs/sha256/<first two chars>/<digest>
                for root, dirs, files in os.walk("{}/blobs/sha256/{}/{}".format(registry_path, layer[:2], layer)):
                    imagesize += sum(getsize(join(root, name)) for name in files)
            repos.append({'group': repo, 'image': image, 'size': imagesize})
        # if the folder doesn't exist, just skip it
        except FileNotFoundError:
            pass

repos.sort(key=lambda k: k['size'], reverse=True)
for repo in repos:
    print("{}/{}: {}".format(repo['group'], repo['image'], get_human_readable_size(repo['size'])))
But please note that it's really static: it doesn't list specific tags for an image and doesn't take into account that some layers might be shared by other images. Still, it will give you a rough estimate in case you don't want to use GitLab's tool mentioned above. You may use the ugly script as you like, but I take no liability whatsoever.
The current answer should now be marked as deprecated.
As posted in the comments, if your repositories are nested, you will miss projects. Additionally, from experience, it seems to under-count the disk space used by the repositories it does find. It will also skip repositories that are created with GitLab 14 and up.
I was made aware of this by using the GitLab Rails console that is now available: https://docs.gitlab.com/ee/administration/troubleshooting/gitlab_rails_cheat_sheet.html#registry-disk-space-usage-by-project
You can adapt that command to increase the number of projects it will find, as it only looks at the last 100 projects.
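For the "curl call to the registry" part of the question, a rough sketch against the Docker Registry HTTP API v2 is shown below. The registry URL and the bearer token are placeholders, and how you obtain a token for GitLab's registry depends on your setup, so treat this purely as an illustration of the /v2/_catalog and /v2/<name>/tags/list endpoints; computing sizes would additionally require fetching each manifest and summing its layer sizes.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

// Placeholder values; replace with your registry URL and a valid token.
const (
    registryURL = "https://registry.example.com"
    bearerToken = "REPLACE_ME"
)

func getJSON(path string, out interface{}) error {
    req, err := http.NewRequest("GET", registryURL+path, nil)
    if err != nil {
        return err
    }
    req.Header.Set("Authorization", "Bearer "+bearerToken)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("GET %s: %s", path, resp.Status)
    }
    return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
    // List repositories known to the registry.
    var catalog struct {
        Repositories []string `json:"repositories"`
    }
    if err := getJSON("/v2/_catalog?n=1000", &catalog); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, repo := range catalog.Repositories {
        // List tags per repository; sizes still require the manifests.
        var tags struct {
            Name string   `json:"name"`
            Tags []string `json:"tags"`
        }
        if err := getJSON("/v2/"+repo+"/tags/list", &tags); err != nil {
            fmt.Fprintln(os.Stderr, repo, err)
            continue
        }
        fmt.Println(repo, tags.Tags)
    }
}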

Uptodate list of running docker containers stated in an exported golang variable

I am trying to use the Golang SDK of Docker in order to maintain a slice variable with currently running containers on the local Docker instance. This slice is exported from a package and I want to use it to feed a web page.
I am not really used to goroutines and channels, and that's why I am wondering whether I have found a good solution to my problem.
I have a docker package as follows.
https://play.golang.org/p/eMmqkMezXZn
It has a Running variable containing the current state of running containers.
var Running []types.Container
I use a reload function to load the running containers in the Running variable.
// Reload the list of running containers
func reload() error {
    ...
    Running, err = cli.ContainerList(context.Background(), types.ContainerListOptions{
        All: false,
    })
    ...
}
And then I start a goroutine from the init function to listen to Docker events and trigger the reload function accordingly.
func init() {
    ...
    // Listen for docker events
    go listen()
    ...
}
// Listen for docker events
func listen() {
    filter := filters.NewArgs()
    filter.Add("type", "container")
    filter.Add("event", "start")
    filter.Add("event", "die")
    msg, errChan := cli.Events(context.Background(), types.EventsOptions{
        Filters: filter,
    })
    for {
        select {
        case err := <-errChan:
            panic(err)
        case <-msg:
            fmt.Println("reloading")
            reload()
        }
    }
}
My question is, is it proper to update a variable from inside a goroutine (in terms of sync)? Maybe there is a cleaner way to achieve what I am trying to build?
Update
My concern here is not really about caching. It is more about hiding the "complexity" of listening for events and updating the list from the Docker SDK. I wanted to provide something like an index that lets the end user easily loop over and display the currently running containers.
I was aware of data-race problems in threaded programs, but I did not realize I was actually in a concurrent context here (I have never written concurrent programs in Go before).
I effectively need to rethink the solution to be more idiomatic. As far as I can see, I have two options here: either protect the variable with a mutex or rethink the design around channels.
What matters most to me is to hide or encapsulate the synchronization method used, so that package users need not be concerned with how the shared state is protected.
Would you have any recommendations?
Thanks a lot for your help,
Loric
No, it is not idiomatic Go to share the Running variable between two goroutines. You do this by sharing it between the goroutine that runs your main function and the listen function, which is started with go and therefore runs in another goroutine.
The reason is that it breaks with:
Do not communicate by sharing memory; instead, share memory by communicating. ¹
So the design of the API needs to change in order to be idiomatic; you need to remove the Running variable, but replace it with what? That depends on what you are trying to achieve. If you are trying to cache the result of cli.ContainerList because you need to call it often and it might be expensive, you should implement a cache which is invalidated on each cli.Events message.
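As a rough sketch of that direction (the Manager type and its method names are made up for illustration, building on the cli and types identifiers already used in the question), the shared slice can be hidden behind a small type whose accessor copies the data under a mutex:

package docker

import (
    "context"
    "sync"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

// Manager hides the synchronization around the cached container list.
type Manager struct {
    cli     *client.Client
    mu      sync.RWMutex
    running []types.Container
}

// reload refreshes the cache; it is called from the event-listening goroutine.
func (m *Manager) reload(ctx context.Context) error {
    containers, err := m.cli.ContainerList(ctx, types.ContainerListOptions{All: false})
    if err != nil {
        return err
    }
    m.mu.Lock()
    m.running = containers
    m.mu.Unlock()
    return nil
}

// Running returns a copy of the cached list, so callers can iterate freely.
func (m *Manager) Running() []types.Container {
    m.mu.RLock()
    defer m.mu.RUnlock()
    out := make([]types.Container, len(m.running))
    copy(out, m.running)
    return out
}

Callers then range over the result of Running() instead of touching a package-level variable, and the synchronization stays an implementation detail of the package.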
What is your motivation?

Apache Drill in Docker container: java.net.BindException: Address already in use

I'm using Apache Drill to convert CSV data to Parquet.
I want to do this in a distributed manner, so I spin up a Docker container and run code similar to the example below to do the conversion.
When I run one instance at a time, this works well. But when I spin up several containers simultaneously, the operation often fails with this stack trace:
Error: Failure in starting embedded Drillbit: java.net.BindException: Address already in use (state=,code=0)
java.sql.SQLException: Failure in starting embedded Drillbit: java.net.BindException: Address already in use
at org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:131)
at org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:72)...
I don't know much about Drill - I haven't used it for anything before this.
I understand the general idea that multiple instances of Drill cannot run simultaneously, but these Docker containers shouldn't know about each other.
The one thing they have in common is that they write to a common (shared) output folder. But each file name is unique.
Can anyone shed some light on this?
Are there configuration settings I should look at?
The code I'm running is similar to this:
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/fp9gr34f/parquet_tmp_output` AS
SELECT
CASE when columns[0]='source_file' or columns[0]='' then CAST(NULL AS VARCHAR(100)) else CAST(columns[0] as VARCHAR(100)) end as `source_file`,
CASE when columns[1]='column1' or columns[1]='' then CAST(NULL AS INT) else CAST(columns[1] as INT) end as `msg_command`,
CASE when columns[2]='column2' or columns[2]='' then CAST(NULL AS INT) else CAST(columns[2] as INT) end as `msg_length`
FROM dfs.`/path/to/my/file.csv`
OFFSET 1

SSIS foreach loop takes wrong file

I'm developing an SSIS package that copies the contents of specific files to a database. In this package I make heavy use of the foreach container. Today I came across a strange behavior and have no clue what's wrong. In one of the containers I filter for "VBFA*.txt". But for some reason the container also gets triggered for a file called "VBAP.D2014211.T204008397.R000564.txt". When I change any part of that filename, it doesn't trigger the container anymore. Additionally, there are plenty of other files that start with "VBAP" and don't trigger the container. What could be the reason for this behavior?
Here is the enumerator's implementation:
<DTS:ForEachEnumerator>
  <DTS:Property DTS:Name="ObjectName">{6E07E755-700D-4D7D-9550-E08DA5B81264}</DTS:Property>
  <DTS:Property DTS:Name="DTSID">{f0ceed84-f95c-404c-8794-2eec0155d1a6}</DTS:Property>
  <DTS:Property DTS:Name="Description"></DTS:Property>
  <DTS:Property DTS:Name="CreationName">DTS.ForEachFileEnumerator.2</DTS:Property>
  <DTS:ObjectData>
    <ForEachFileEnumeratorProperties>
      <FEFEProperty Folder="\\desoswi0204vs\etldata\transfers\out\DP"/>
      <FEFEProperty FileSpec="VBFA*.txt"/>
      <FEFEProperty FileNameRetrievalType="0"/>
      <FEFEProperty Recurse="0"/>
    </ForEachFileEnumeratorProperties>
  </DTS:ObjectData>
</DTS:ForEachEnumerator>
I've checked the path's contents with dir /x, and the short (8.3) name of my file is the culprit. For the file "VBAP.D2014211.T204008397.R000564.txt" the short name is "VBFA08~1.TXT". The full result is:
01.08.2014 11:02 1.067.169 VBFA08~1.TXT VBAP.D2014211.T204008397.R000564.txt
I have absolutely no clue what is happening here or how to stop it. This violates every rule I've found regarding short filename generation. I leave this as the answer for everybody else who comes across this behavior, which also applies to C#'s Directory.GetFiles.
